Creating Columns I

If you look at the dataset, you'll notice that while there's a column which shows the percentage of women in each department, there is no column which shows the percentage of men.

Create a new column named sharemen, that contains the percentage of men for a given department by dividing the number of men by the total number of students for each department.

# Add sharemen column
recent_grads['sharemen'] = recent_grads['men'] / recent_grads['total']

Select Row with Highest Value

Remember how you found the row of data with the highest percentage of women? Now you'll find the row that corresponds to the department with the highest rate of men.

The module numpy has been imported under the alias np for you.

Using numpy, find the maximum value for the percentage of men and call this variable max_men.
Select the row that has the percentage of men which corresponds to max_men.

# Find the maximum percentage value of men
max_men = np.max(recent_grads['sharemen'])

# Output the row with the highest percentage of men
print(recent_grads.loc[recent_grads['sharemen'] == max_men])

In [1]: # Find the maximum percentage value of men
        max_men = np.max(recent_grads['sharemen'])

        # Output the row with the highest percentage of men
        print(recent_grads.loc[recent_grads['sharemen'] == max_men])
    rank major_code                                   major major_category \
65    66        2599 MISCELLANEOUS ENGINEERING TECHNOLOGIES    Engineering

total sample_size men women sharewomen employed ... \
65 8804 125 124 0 0.0 7502 ...

full_time_year_round unemployed unemployment_rate median p25th p75th \
65 5825 416 0.052539 40000 30400 56000

college_jobs non_college_jobs low_wage_jobs sharemen
65 2446 3896 386 1.0

[1 rows x 22 columns]

Creating columns II

Eventually you want to figure out which departments are most balanced between men and women. To accomplish this, you'll add a new column that reports the difference in percentages between men and women.

Add a column named gender_diff that reports how much higher the rate of women is than the rate of men.

# Add gender_diff column
recent_grads['gender_diff'] = recent_grads['sharewomen'] - recent_grads['sharemen']

Updating columns

Your data for the gender_diff column currently consists of negative and positive values, which depend on which group of people (women or men) have a higher percentage. You want to find the five departments with the most balanced gender ratios, but first you decide to make your life easier by replacing the values in the gender_diff column with their respective absolute values.

Use numpy and pandas to convert each value in the gender_diff column to its absolute value.
Output the five departments with the most balanced gender ratios.

# Make all gender difference values positive
recent_grads['gender_diff'] = recent_grads['gender_diff'].abs()

# Find the 5 rows with lowest gender rate difference
print(recent_grads.nsmallest(5,'gender_diff'))

<script.py> output:
         rank major_code                                    major \
    142   143        5500                  GENERAL SOCIAL SCIENCES
    109   110        4000          MULTI/INTERDISCIPLINARY STUDIES
    74     75        5003                                CHEMISTRY
    147   148        2306   PHYSICAL AND HEALTH EDUCATION TEACHING
    116   117        4001 INTERCULTURAL AND INTERNATIONAL STUDIES

                    major_category total sample_size    men women sharewomen \
    142             Social Science 12920          113 62181 62893    0.502846
    109          Interdisciplinary 12296          128 10031   9848    0.495397
    74           Physical Sciences 66530          353 32923 33607    0.505141
    147                  Education 28213          259 29909 30724    0.506721
    116 Humanities & Liberal Arts 24650          184   9950 10248    0.507377

         employed     ...       unemployed unemployment_rate median p25th \
    142      9602     ...             1108           0.103455   32000 27000
    109      9821     ...              749           0.070861   35000 25000
    74      48535     ...             2769           0.053972   39000 30000
    147     23794     ...             1920           0.074667   31000 24000
    116     18824     ...             1718           0.083634   34000 24000

         p75th college_jobs non_college_jobs low_wage_jobs sharemen gender_diff
    142 50000         3602             4778          1634 0.497154     0.005693
    109 44000         5176             3903          1061 0.504603     0.009206
    74   49900        30382            14718          4288 0.494859     0.010281
    147 40000        12777             9328          2042 0.493279     0.013442
    116 45000         4956            10343          3168 0.492623     0.014754

    [5 rows x 23 columns]

In [4]:

Filtering rows

Finally you can filter out for departments which fail the benchmark of a difference of more than 0.30. Since all the values are now positive, you can do this with a simple boolean operator.

You want to find the rows containing departments that are skewed heavily towards men. Using work you've already done, you'll create a new DataFrame that contains this information.

The DataFrame recent_grads still has the columns sharemen and gender_diff that you created in previous exercises.

Create diff_30, a boolean Series that is True when the corresponding value of gender_diff is greater than 0.30 and False otherwise.
Make another boolean Series called more_men that's true when the corresponding row in recent_grads has more men than women.
Combine your two Series to make one that you can use to select rows that have both more men than women and a value of gender_diff greater than 0.30. Save the result as more_men_and_diff_30.
Use this new boolean Series to create the DataFrame fewer_women that contains only the rows you're interested in.

# Rows where gender rate difference is greater than .30
diff_30 = recent_grads['gender_diff'] > .30

# Rows with more men
more_men = recent_grads['men'] > recent_grads['women']

# Combine more_men and diff_30
more_men_and_diff_30 = np.logical_and(more_men, diff_30)

# Find rows with more men and and gender rate difference greater than .30
fewer_women = recent_grads.loc[more_men_and_diff_30]

wrong answer

# Find rows with more men and and gender rate difference greater than .30
~~fewer_women = recent_grads.loc[recent_grads['gender_diff'] == more_men_and_diff_30]~~