[DAT208x] final lab 2-6, 2-7, 2-8 : Manipulating Data (Grouping)

Grouping with Counts

There are various department categories but no sense of how many departments there are in each category. You'll use pandas to gain insight into this information.

In particular, you'll use the .groupby() method of pandas. This was not introduced to you in the course, but you'll see it very frequently in your data science journey and it's an important method to understand. This set of exercises will extend your pandas knowledge by teaching you how to use the .groupby() method.

Calls to .groupby() have the following three components: the column you want to group by, the column you want to aggregate, and the statistic you want to aggregate with. For example, in our dataset, if we wanted to see the average share of women ('sharewomen') per 'major_category', we could use .groupby() like so: recent_grads.groupby('major_category')['sharewomen'].mean(). Here, we are grouping by 'major_category' and aggregating 'sharewomen' by the mean. Give it a try in the IPython Shell if you're curious to see the result!

 

In [1]: recent_grads.groupby('major_category')['sharewomen'].mean()

Groups the rows by major_category, averages the sharewomen values in each group, and prints the result.
Out[1]:
major_category
Agriculture & Natural Resources        0.617938
Arts                                   0.561851
Biology & Life Science                 0.584518
Business                               0.405063
Communications & Journalism            0.643835
Computers & Mathematics                0.512752
Education                              0.674986
Engineering                            0.257158
Health                                 0.616857
Humanities & Liberal Arts              0.676193
Industrial Arts & Consumer Services    0.449351
Interdisciplinary                      0.495397
Law & Public Policy                    0.335990
Physical Sciences                      0.508749
Psychology & Social Work               0.777763
Social Science                         0.539067
Name: sharewomen, dtype: float64
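
The three components described above can be seen in isolation on a small table. This is a minimal sketch, assuming a hand-made DataFrame called toy (hypothetical data, not the lab's recent_grads); only the column names are borrowed from the lab.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts", "Arts"],
    "sharewomen": [0.2, 0.4, 0.5, 0.7],
})

# 1) column to group by, 2) column to aggregate, 3) statistic to apply
mean_share = toy.groupby("major_category")["sharewomen"].mean()
print(mean_share)
```

The result is a Series indexed by the group labels, which by default are sorted, so 'Arts' comes before 'Engineering' here.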

 

For the full dataset, see the link below (output of print(recent_grads)).

http://blog.daum.net/_blog/BlogTypeView.do?blogid=0ejaF&articleno=47&categoryId=17&regdt=20190101181259

 

Instructions

Using the .groupby() method, group the recent_grads DataFrame by 'major_category' and then count the number of departments per category using .count().

 

# Group by major category and count
print(recent_grads.groupby(['major_category']).major_category.count())

 

<script.py> output:
    major_category
    Agriculture & Natural Resources        10
    Arts                                    8
    Biology & Life Science                 14
    Business                               13
    Communications & Journalism             4
    Computers & Mathematics                11
    Education                              16
    Engineering                            29
    Health                                 12
    Humanities & Liberal Arts              15
    Industrial Arts & Consumer Services     7
    Interdisciplinary                       1
    Law & Public Policy                     5
    Physical Sciences                      10
    Psychology & Social Work                9
    Social Science                          9
    Name: major_category, dtype: int64
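
When counting rows per group, it can help to know that .count() and .size() are not quite the same thing: .count() counts non-missing values in the selected column, while .size() counts all rows in each group. The sketch below illustrates the difference on a hypothetical table; the salary column and its values are assumptions for illustration and do not come from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in the "salary" column.
toy = pd.DataFrame({
    "major_category": ["Arts", "Arts", "Health"],
    "salary": [30000.0, np.nan, 40000.0],
})

grouped = toy.groupby("major_category")

counts = grouped["salary"].count()  # non-missing values per group
sizes = grouped.size()              # all rows per group, missing or not

print(counts)
print(sizes)
```

In the exercise above the counted column is the grouping column itself, which has no missing values, so both approaches would give the same numbers there.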

 

 

Grouping with Counts, Part 2

You want to get a sense of the number of majors with far fewer women, so you'll perform an operation similar to the one from the last exercise.

Use the fewer_women DataFrame from a previous exercise.

 

Create a DataFrame that groups the departments by major category and shows the number of departments in each category that are skewed toward fewer women.

 

# First attempt (fails: fewer_women is a separate DataFrame, not a column of recent_grads)
# print(recent_grads.groupby(['major_category']).fewer_women.count())

# Second attempt: count every column per category, then only the 'women' column
print(fewer_women.groupby(['major_category']).count())
print(fewer_women.groupby(['major_category'])['women'].count())

 

# Group departments that have fewer women by category and count
# print(fewer_women.groupby(['major_category']).size())
# print(fewer_women.head(5))
print(fewer_women.groupby(['major_category']).major_category.count())

<script.py> output:
    major_category
    Agriculture & Natural Resources         1
    Biology & Life Science                  1
    Business                                6
    Communications & Journalism             1
    Computers & Mathematics                 5
    Engineering                            24
    Health                                  1
    Humanities & Liberal Arts               1
    Industrial Arts & Consumer Services     3
    Law & Public Policy                     3
    Physical Sciences                       1
    Social Science                          1
    Name: major_category, dtype: int64
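
Note that the output above lists only 12 of the 16 categories: a category with no rows left after filtering simply does not appear in the grouped result. The sketch below reproduces this on toy data. How fewer_women was built is not shown in this post, so the filter used here (sharewomen below 0.5) is an assumption, and the names toy and toy_fewer_women are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "major_category": ["Engineering", "Engineering", "Arts", "Health"],
    "sharewomen": [0.2, 0.3, 0.7, 0.4],
})

# Assumed filter: keep only rows where women are the minority.
toy_fewer_women = toy[toy["sharewomen"] < 0.5]

# Count the remaining rows per category.
per_category = toy_fewer_women.groupby("major_category").size()
print(per_category)
```

'Arts' has no rows below the threshold, so it is missing from per_category rather than shown with a count of 0.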

 

Grouping one Column with Means

Similar to the exercise you just completed, you can group rows to output the means of different groups within a column.

The column gender_diff is still available in the recent_grads DataFrame.

 

Write code that outputs the average gender percentage difference by major category.

 

# Report average gender difference by major category
print(recent_grads.groupby(['major_category']).gender_diff.mean())

 

major_category
Agriculture & Natural Resources        0.346799
Arts                                   0.259144
Biology & Life Science                 0.296905
Business                               0.350740
Communications & Journalism            0.482560
Computers & Mathematics                0.468733
Education                              0.415442
Engineering                            0.528308
Health                                 0.378798
Humanities & Liberal Arts              0.417325
Industrial Arts & Consumer Services    0.350490
Interdisciplinary                      0.009206
Law & Public Policy                    0.400821
Physical Sciences                      0.280674
Psychology & Social Work               0.583837
Social Science                         0.167048
Name: gender_diff, dtype: float64

 

 

Grouping Two Columns with Means

You can expand the previous operation to include two columns and output the means for each. To accomplish this, modify the code you just wrote.

 

Write a query that outputs the mean number of 'low_wage_jobs' and average 'unemployment_rate' grouped by 'major_category'.

 

# Find average number of low wage jobs and unemployment rate of each major category
# dept_stats = ____.____(['____'])['____', '____'].____
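# Select several columns with a list (double brackets); the bare-tuple form
# ['low_wage_jobs', 'unemployment_rate'] is rejected by pandas 2.0 and later
dept_stats = recent_grads.groupby(['major_category'])[['low_wage_jobs', 'unemployment_rate']].mean()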

print(dept_stats)

 

                                     low_wage_jobs  unemployment_rate
major_category                                                      
Agriculture & Natural Resources         789.900000           0.056328
Arts                                   7514.500000           0.090173
Biology & Life Science                 3053.000000           0.060918
Business                               9752.923077           0.071064
Communications & Journalism           12398.750000           0.075538
Computers & Mathematics                1466.909091           0.084256
Education                              2554.375000           0.051702
Engineering                             864.793103           0.063334
Health                                 2605.833333           0.065920
Humanities & Liberal Arts              6282.666667           0.081008
Industrial Arts & Consumer Services    3798.571429           0.056083
Interdisciplinary                      1061.000000           0.070861
Law & Public Policy                    4144.000000           0.090805
Physical Sciences                      1407.800000           0.046511
Psychology & Social Work               6249.555556           0.072065
Social Science                         6020.000000           0.095729
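
Selecting several columns from a grouped DataFrame takes a list of column names, and the result is a DataFrame with one column per selected name rather than a Series. The following is a minimal sketch on hypothetical data (the toy table and its values are assumptions for illustration; only the column names follow the lab).

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "major_category": ["Arts", "Arts", "Health"],
    "low_wage_jobs": [100, 300, 50],
    "unemployment_rate": [0.10, 0.06, 0.04],
})

# A list of column names (double brackets) selects both columns at once.
toy_stats = toy.groupby("major_category")[["low_wage_jobs", "unemployment_rate"]].mean()

print(toy_stats)
print(type(toy_stats).__name__)
```

Each row of toy_stats is one category, and a single value can be read with toy_stats.loc['Arts', 'low_wage_jobs'].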