Python 응용/DataScience교육_MS_DAT208x

[DAT208x] final lab 2-1,2-2, 2-3, 2-4, 2-5 : Section 2: Manipulating Data

rararara 2021. 10. 23. 13:12

Creating Columns I

If you look at the dataset, you'll notice that while there's a column which shows the percentage of women in each department, there is no column which shows the percentage of men.

 

Create a new column named sharemen, that contains the percentage of men for a given department by dividing the number of men by the total number of students for each department.

 

# Add sharemen column
recent_grads['sharemen'] = recent_grads['men'] / recent_grads['total']

 

Select Row with Highest Value

Remember how you found the row of data with the highest percentage of women? Now you'll find the row that corresponds to the department with the highest rate of men.

The module numpy has been imported under the alias np for you.

 

  • Using numpy, find the maximum value for the percentage of men and call this variable max_men.
  • Select the row that has the percentage of men which corresponds to max_men.

 

# Find the maximum percentage value of men
max_men = np.max(recent_grads['sharemen'])
 
# Output the row with the highest percentage of men
print(recent_grads.loc[recent_grads['sharemen'] == max_men])

 

In [1]: # Find the maximum percentage value of men
        max_men = np.max(recent_grads['sharemen'])
        
        # Output the row with the highest percentage of men
        print(recent_grads.loc[recent_grads['sharemen'] == max_men])
    rank  major_code                                   major major_category  \
65    66        2599  MISCELLANEOUS ENGINEERING TECHNOLOGIES    Engineering  

    total  sample_size  men  women  sharewomen  employed    ...     \
65   8804          125  124      0         0.0      7502    ...     

    full_time_year_round  unemployed  unemployment_rate  median  p25th  p75th  \
65                  5825         416           0.052539   40000  30400  56000  

   college_jobs non_college_jobs  low_wage_jobs  sharemen 
65         2446             3896            386       1.0 

[1 rows x 22 columns]

<script.py> output:
        rank  major_code                                   major major_category  \
    65    66        2599  MISCELLANEOUS ENGINEERING TECHNOLOGIES    Engineering  
   
        total  sample_size  men  women  sharewomen  employed    ...     \
    65   8804          125  124      0         0.0      7502    ...     
   
        full_time_year_round  unemployed  unemployment_rate  median  p25th  p75th  \
    65                  5825         416           0.052539   40000  30400  56000  
   
       college_jobs non_college_jobs  low_wage_jobs  sharemen 
    65         2446             3896            386       1.0 
   
    [1 rows x 22 columns]

 

Creating columns II

Eventually you want to figure out which departments are most balanced between men and women. To accomplish this, you'll add a new column that reports the difference in percentages between men and women.

 

Add a column named gender_diff that reports how much higher the rate of women is than the rate of men.

# Add gender_diff column
recent_grads['gender_diff'] = recent_grads['sharewomen'] - recent_grads['sharemen']

 

Updating columns

Your data for the gender_diff column currently consists of negative and positive values, which depend on which group of people (women or men) have a higher percentage. You want to find the five departments with the most balanced gender ratios, but first you decide to make your life easier by replacing the values in the gender_diff column with their respective absolute values.

 

  • Use numpy and pandas to convert each value in the gender_diff column to its absolute value.
  • Output the five departments with the most balanced gender ratios.

 

# Make all gender difference values positive
recent_grads['gender_diff'] = recent_grads['gender_diff'].abs()

# Find the 5 rows with lowest gender rate difference
print(recent_grads.nsmallest(5,'gender_diff'))

 

<script.py> output:
         rank  major_code                                    major  \
    142   143        5500                  GENERAL SOCIAL SCIENCES  
    109   110        4000          MULTI/INTERDISCIPLINARY STUDIES  
    74     75        5003                                CHEMISTRY  
    147   148        2306   PHYSICAL AND HEALTH EDUCATION TEACHING  
    116   117        4001  INTERCULTURAL AND INTERNATIONAL STUDIES  
   
                    major_category  total  sample_size    men  women  sharewomen  \
    142             Social Science  12920          113  62181  62893    0.502846  
    109          Interdisciplinary  12296          128  10031   9848    0.495397  
    74           Physical Sciences  66530          353  32923  33607    0.505141  
    147                  Education  28213          259  29909  30724    0.506721  
    116  Humanities & Liberal Arts  24650          184   9950  10248    0.507377  
   
         employed     ...       unemployed  unemployment_rate  median  p25th  \
    142      9602     ...             1108           0.103455   32000  27000  
    109      9821     ...              749           0.070861   35000  25000  
    74      48535     ...             2769           0.053972   39000  30000  
    147     23794     ...             1920           0.074667   31000  24000  
    116     18824     ...             1718           0.083634   34000  24000  
   
         p75th college_jobs non_college_jobs low_wage_jobs  sharemen  gender_diff 
    142  50000         3602             4778          1634  0.497154     0.005693 
    109  44000         5176             3903          1061  0.504603     0.009206 
    74   49900        30382            14718          4288  0.494859     0.010281 
    147  40000        12777             9328          2042  0.493279     0.013442 
    116  45000         4956            10343          3168  0.492623     0.014754 
   
    [5 rows x 23 columns]

In [4]: 

 

Filtering rows

Finally you can filter out for departments which fail the benchmark of a difference of more than 0.30. Since all the values are now positive, you can do this with a simple boolean operator.

You want to find the rows containing departments that are skewed heavily towards men. Using work you've already done, you'll create a new DataFrame that contains this information.

The DataFrame recent_grads still has the columns sharemen and gender_diff that you created in previous exercises.

 

  • Create diff_30, a boolean Series that is True when the corresponding value of gender_diff is greater than 0.30 and False otherwise.
  • Make another boolean Series called more_men that's true when the corresponding row in recent_grads has more men than women.
  • Combine your two Series to make one that you can use to select rows that have both more men than women and a value of gender_diff greater than 0.30. Save the result as more_men_and_diff_30.
  • Use this new boolean Series to create the DataFrame fewer_women that contains only the rows you're interested in.

 

# Rows where gender rate difference is greater than .30
diff_30 = recent_grads['gender_diff'] > .30

# Rows with more men
more_men = recent_grads['men'] > recent_grads['women']

# Combine more_men and diff_30
more_men_and_diff_30 = np.logical_and(more_men, diff_30)

# Find rows with more men and and gender rate difference greater than .30
fewer_women = recent_grads.loc[more_men_and_diff_30]

 

wrong answer 

# Find rows with more men and and gender rate difference greater than .30
        fewer_women = recent_grads.loc[recent_grads['gender_diff'] == more_men_and_diff_30]