[DAT208x] final lab 2-1,2-2, 2-3, 2-4, 2-5 : Section 2: Manipulating Data
Creating Columns I
If you look at the dataset, you'll notice that while there's a column which shows the percentage of women in each department, there is no column which shows the percentage of men.
Create a new column named sharemen, that contains the percentage of men for a given department by dividing the number of men by the total number of students for each department.
# Add sharemen column
recent_grads['sharemen'] = recent_grads['men'] / recent_grads['total']
Select Row with Highest Value
Remember how you found the row of data with the highest percentage of women? Now you'll find the row that corresponds to the department with the highest rate of men.
The module numpy has been imported under the alias np for you.
- Using numpy, find the maximum value for the percentage of men and call this variable max_men.
- Select the row that has the percentage of men which corresponds to max_men.
# Find the maximum percentage value of men
max_men = np.max(recent_grads['sharemen'])
# Output the row with the highest percentage of men
print(recent_grads.loc[recent_grads['sharemen'] == max_men])
In [1]: # Find the maximum percentage value of men
max_men = np.max(recent_grads['sharemen'])
# Output the row with the highest percentage of men
print(recent_grads.loc[recent_grads['sharemen'] == max_men])
rank major_code major major_category \
65 66 2599 MISCELLANEOUS ENGINEERING TECHNOLOGIES Engineering
total sample_size men women sharewomen employed ... \
65 8804 125 124 0 0.0 7502 ...
full_time_year_round unemployed unemployment_rate median p25th p75th \
65 5825 416 0.052539 40000 30400 56000
college_jobs non_college_jobs low_wage_jobs sharemen
65 2446 3896 386 1.0
[1 rows x 22 columns]
<script.py> output:
rank major_code major major_category \
65 66 2599 MISCELLANEOUS ENGINEERING TECHNOLOGIES Engineering
total sample_size men women sharewomen employed ... \
65 8804 125 124 0 0.0 7502 ...
full_time_year_round unemployed unemployment_rate median p25th p75th \
65 5825 416 0.052539 40000 30400 56000
college_jobs non_college_jobs low_wage_jobs sharemen
65 2446 3896 386 1.0
[1 rows x 22 columns]
Creating columns II
Eventually you want to figure out which departments are most balanced between men and women. To accomplish this, you'll add a new column that reports the difference in percentages between men and women.
Add a column named gender_diff that reports how much higher the rate of women is than the rate of men.
# Add gender_diff column
recent_grads['gender_diff'] = recent_grads['sharewomen'] - recent_grads['sharemen']
Updating columns
Your data for the gender_diff column currently consists of negative and positive values, which depend on which group of people (women or men) have a higher percentage. You want to find the five departments with the most balanced gender ratios, but first you decide to make your life easier by replacing the values in the gender_diff column with their respective absolute values.
- Use numpy and pandas to convert each value in the gender_diff column to its absolute value.
- Output the five departments with the most balanced gender ratios.
# Make all gender difference values positive
recent_grads['gender_diff'] = recent_grads['gender_diff'].abs()
# Find the 5 rows with lowest gender rate difference
print(recent_grads.nsmallest(5,'gender_diff'))
<script.py> output:
rank major_code major \
142 143 5500 GENERAL SOCIAL SCIENCES
109 110 4000 MULTI/INTERDISCIPLINARY STUDIES
74 75 5003 CHEMISTRY
147 148 2306 PHYSICAL AND HEALTH EDUCATION TEACHING
116 117 4001 INTERCULTURAL AND INTERNATIONAL STUDIES
major_category total sample_size men women sharewomen \
142 Social Science 12920 113 62181 62893 0.502846
109 Interdisciplinary 12296 128 10031 9848 0.495397
74 Physical Sciences 66530 353 32923 33607 0.505141
147 Education 28213 259 29909 30724 0.506721
116 Humanities & Liberal Arts 24650 184 9950 10248 0.507377
employed ... unemployed unemployment_rate median p25th \
142 9602 ... 1108 0.103455 32000 27000
109 9821 ... 749 0.070861 35000 25000
74 48535 ... 2769 0.053972 39000 30000
147 23794 ... 1920 0.074667 31000 24000
116 18824 ... 1718 0.083634 34000 24000
p75th college_jobs non_college_jobs low_wage_jobs sharemen gender_diff
142 50000 3602 4778 1634 0.497154 0.005693
109 44000 5176 3903 1061 0.504603 0.009206
74 49900 30382 14718 4288 0.494859 0.010281
147 40000 12777 9328 2042 0.493279 0.013442
116 45000 4956 10343 3168 0.492623 0.014754
[5 rows x 23 columns]
In [4]:
Filtering rows
Finally you can filter out for departments which fail the benchmark of a difference of more than 0.30. Since all the values are now positive, you can do this with a simple boolean operator.
You want to find the rows containing departments that are skewed heavily towards men. Using work you've already done, you'll create a new DataFrame that contains this information.
The DataFrame recent_grads still has the columns sharemen and gender_diff that you created in previous exercises.
- Create diff_30, a boolean Series that is True when the corresponding value of gender_diff is greater than 0.30 and False otherwise.
- Make another boolean Series called more_men that's true when the corresponding row in recent_grads has more men than women.
- Combine your two Series to make one that you can use to select rows that have both more men than women and a value of gender_diff greater than 0.30. Save the result as more_men_and_diff_30.
- Use this new boolean Series to create the DataFrame fewer_women that contains only the rows you're interested in.
# Rows where gender rate difference is greater than .30
diff_30 = recent_grads['gender_diff'] > .30
# Rows with more men
more_men = recent_grads['men'] > recent_grads['women']
# Combine more_men and diff_30
more_men_and_diff_30 = np.logical_and(more_men, diff_30)
# Find rows with more men and and gender rate difference greater than .30
fewer_women = recent_grads.loc[more_men_and_diff_30]
wrong answer
# Find rows with more men and and gender rate difference greater than .30 fewer_women = recent_grads.loc[recent_grads['gender_diff'] == more_men_and_diff_30]