Grouping with Counts
The dataset contains several department categories, but it is not yet clear how many departments fall into each one. You'll use pandas to find out.
In particular, you'll use the .groupby() method of pandas. It was not introduced in the course, but you'll see it very frequently in your data science journey, and it's an important method to understand. This set of exercises will extend your pandas knowledge by teaching you how to use the .groupby() method.
Calls to .groupby() have three components: the column you want to group by, the column you want to aggregate, and the statistic you want to aggregate with. For example, to see the average share of women ('sharewomen') per 'major_category' in our dataset, we could use .groupby() like so: recent_grads.groupby('major_category')['sharewomen'].mean(). Here, we are grouping by 'major_category' and aggregating 'sharewomen' by the mean. Give it a try in the IPython Shell if you're curious to see the result!
In [1]: recent_grads.groupby('major_category')['sharewomen'].mean()
Groups the rows by major_category, takes the mean of sharewomen within each group, and prints the result.
Out[1]:
major_category
Agriculture & Natural Resources 0.617938
Arts 0.561851
Biology & Life Science 0.584518
Business 0.405063
Communications & Journalism 0.643835
Computers & Mathematics 0.512752
Education 0.674986
Engineering 0.257158
Health 0.616857
Humanities & Liberal Arts 0.676193
Industrial Arts & Consumer Services 0.449351
Interdisciplinary 0.495397
Law & Public Policy 0.335990
Physical Sciences 0.508749
Psychology & Social Work 0.777763
Social Science 0.539067
Name: sharewomen, dtype: float64
For the detailed dataset, see the link below (the output of print(recent_grads)).
http://blog.daum.net/_blog/BlogTypeView.do?blogid=0ejaF&articleno=47&categoryId=17&regdt=20190101181259
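Because recent_grads itself is not reproduced in these notes, the three components of a .groupby() call can be tried on a small made-up DataFrame. The sketch below assumes only that pandas is installed; the name toy and all of its values are invented for illustration and are not taken from the course dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for recent_grads (values are invented)
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "sharewomen": [0.2, 0.4, 0.6, 0.7],
})

# 1) the column to group by, 2) the column to aggregate, 3) the statistic
mean_share = toy.groupby("major_category")["sharewomen"].mean()
print(mean_share)
```

The result is a Series indexed by the group keys, which pandas sorts by default. That is why the categories in the real output appear in alphabetical order.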
Introduction
Using the .groupby() method, group the recent_grads DataFrame by 'major_category' and then count the number of departments per category using .count().
# Group by major category and count
print(recent_grads.groupby(['major_category']).major_category.count())
<script.py> output:
major_category
Agriculture & Natural Resources 10
Arts 8
Biology & Life Science 14
Business 13
Communications & Journalism 4
Computers & Mathematics 11
Education 16
Engineering 29
Health 12
Humanities & Liberal Arts 15
Industrial Arts & Consumer Services 7
Interdisciplinary 1
Law & Public Policy 5
Physical Sciences 10
Psychology & Social Work 9
Social Science 9
Name: major_category, dtype: int64
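When counting per group, it helps to know that .count() counts only non-missing values in the selected column, while .size() counts rows. The sketch below illustrates the difference on invented data; the DataFrame toy and the column median_salary are hypothetical and do not come from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values (all values are invented)
toy = pd.DataFrame({
    "major_category": ["Arts", "Arts", "Health", "Health", "Health"],
    "median_salary": [30000, np.nan, 40000, 45000, np.nan],
})

grouped = toy.groupby("major_category")

rows_per_group = grouped.size()                   # counts rows, NaN included
non_null_salary = grouped["median_salary"].count()  # counts non-missing values only

print(rows_per_group)
print(non_null_salary)
```

Counting the grouping column itself, as the exercise solution does, gives the same numbers as .size() because every row in a group has a non-missing group key.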
Grouping with Counts, Part 2
You want to get a sense of how many majors have far fewer women, so you'll perform an operation similar to the one in the last exercise.
Use the fewer_women DataFrame from a previous exercise.
Create a DataFrame that groups the departments by major category and shows the number of departments in each category that are skewed toward fewer women.
# Earlier attempts (kept for reference, commented out)
# print(recent_grads.groupby(['major_category']).fewer_women.count())
# print(fewer_women.groupby(['major_category']).count())
# print(fewer_women.groupby(['major_category'])['women'].count())
# Group departments that have fewer women by category and count
# print(fewer_women.groupby(['major_category']).size())
# print(fewer_women.head(5))
print(fewer_women.groupby(['major_category']).major_category.count())
<script.py> output:
major_category
Agriculture & Natural Resources 1
Biology & Life Science 1
Business 6
Communications & Journalism 1
Computers & Mathematics 5
Engineering 24
Health 1
Humanities & Liberal Arts 1
Industrial Arts & Consumer Services 3
Law & Public Policy 3
Physical Sciences 1
Social Science 1
Name: major_category, dtype: int64
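These notes do not show how fewer_women was built. As a rough sketch, the example below assumes it is a filtered copy of the data in which the share of women is below 0.5; that threshold, the name fewer_women_toy, and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts", "Business"],
    "sharewomen": [0.1, 0.3, 0.7, 0.4],
})

# Assumption: "fewer women" means a share of women below 0.5
fewer_women_toy = toy[toy["sharewomen"] < 0.5]

# Count the remaining departments per category
counts = fewer_women_toy.groupby("major_category")["major_category"].count()
print(counts)
```

Categories with no rows left after filtering simply drop out of the result, which would explain why the output above lists only 12 of the 16 categories.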
Grouping one Column with Means
Similar to the exercise you just completed, you can group rows to output the means of different groups within a column.
The column gender_diff is still available in the recent_grads DataFrame.
Write code that outputs the average gender percentage difference by major category.
# Report average gender difference by major category
print(recent_grads.groupby(['major_category']).gender_diff.mean())
major_category
Agriculture & Natural Resources 0.346799
Arts 0.259144
Biology & Life Science 0.296905
Business 0.350740
Communications & Journalism 0.482560
Computers & Mathematics 0.468733
Education 0.415442
Engineering 0.528308
Health 0.378798
Humanities & Liberal Arts 0.417325
Industrial Arts & Consumer Services 0.350490
Interdisciplinary 0.009206
Law & Public Policy 0.400821
Physical Sciences 0.280674
Psychology & Social Work 0.583837
Social Science 0.167048
Name: gender_diff, dtype: float64
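The solution above selects the column with attribute access (.gender_diff), whereas the earlier example used brackets (['sharewomen']). The sketch below, on invented data, suggests that both forms give the same result and shows one way to rank the categories afterwards; the DataFrame toy and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "major_category": ["Arts", "Arts", "Health", "Health"],
    "gender_diff": [0.2, 0.4, 0.5, 0.7],
})

# Attribute access and bracket access select the same column
by_attr = toy.groupby("major_category").gender_diff.mean()
by_key = toy.groupby("major_category")["gender_diff"].mean()
print(by_attr.equals(by_key))

# Sort the group means from largest to smallest
ranked = by_key.sort_values(ascending=False)
print(ranked)
```

Bracket access is the safer habit, since attribute access fails for column names that contain spaces or collide with existing method names.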
Grouping Two Columns with Means
You can expand the previous operation to include two columns and output the means for each. To accomplish this, modify the code you just wrote.
Write a query that outputs the mean number of 'low_wage_jobs' and average 'unemployment_rate' grouped by 'major_category'.
# Find average number of low wage jobs and unemployment rate of each major category
# dept_stats = ____.____(['____'])['____', '____'].____
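# Select both columns with a list (double brackets); pandas 2.0 and later reject the tuple form ['a', 'b']
dept_stats = recent_grads.groupby(['major_category'])[['low_wage_jobs', 'unemployment_rate']].mean()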
print(dept_stats)
                                     low_wage_jobs  unemployment_rate
major_category
Agriculture & Natural Resources         789.900000           0.056328
Arts                                   7514.500000           0.090173
Biology & Life Science                 3053.000000           0.060918
Business                               9752.923077           0.071064
Communications & Journalism           12398.750000           0.075538
Computers & Mathematics                1466.909091           0.084256
Education                              2554.375000           0.051702
Engineering                             864.793103           0.063334
Health                                 2605.833333           0.065920
Humanities & Liberal Arts              6282.666667           0.081008
Industrial Arts & Consumer Services    3798.571429           0.056083
Interdisciplinary                      1061.000000           0.070861
Law & Public Policy                    4144.000000           0.090805
Physical Sciences                      1407.800000           0.046511
Psychology & Social Work               6249.555556           0.072065
Social Science                         6020.000000           0.095729
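The same pattern can be tried on a toy DataFrame. The sketch below uses invented values and a hypothetical DataFrame named toy; it selects two columns with a list and then, as an optional variation not used in the exercise, applies a different statistic to each column with named aggregation (available in pandas 0.25 and later).

```python
import pandas as pd

# Hypothetical toy data (values are invented)
toy = pd.DataFrame({
    "major_category": ["Arts", "Arts", "Health", "Health"],
    "low_wage_jobs": [100, 300, 50, 150],
    "unemployment_rate": [0.08, 0.10, 0.04, 0.06],
})

# A list of column names (note the double brackets) selects several columns
stats = toy.groupby("major_category")[["low_wage_jobs", "unemployment_rate"]].mean()
print(stats)

# Named aggregation: a different statistic for each column
mixed = toy.groupby("major_category").agg(
    total_low_wage=("low_wage_jobs", "sum"),
    mean_unemployment=("unemployment_rate", "mean"),
)
print(mixed)
```

Selecting several columns returns a DataFrame with one column per aggregated variable, whereas selecting a single column returns a Series.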