[DAT208x] final lab 1-3

Replacing Missing Values

 

There are some missing values in the dataset that are coded as a string. You'll update these to a value that Python understands as "missing."

The list columns contains the names of the columns you'll be working with in this exercise.

 

Instructions

  • Look at the dtypes of the columns in columns to make sure that the data is numeric.
  • It looks like a string is being used to encode missing values. Use the .unique() method to figure out what the string is.
  • Search for missing values in the median, p25th, and p75th columns.
  • Replace the found missing values with a NaN value, using numpy's np.nan.

# recent_grads is the DataFrame provided by the lab environment
import numpy as np

# Names of the columns we're searching for missing values
columns = ['median', 'p25th', 'p75th']

# Take a look at the dtypes
print(recent_grads.dtypes)

print(recent_grads[columns].dtypes)

# Find how missing values are represented
print(recent_grads["median"].unique())

# Replace missing values with NaN
for column in columns:
    recent_grads.loc[recent_grads[column] == 'UN', column] = np.nan
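
To check that the replacement does what the instructions ask, the same loop can be tried on a small stand-in table. This is only a sketch: the DataFrame toy and its values are made up for illustration and are not the real recent_grads data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for recent_grads (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "median": ["110000", "UN", "62000"],
    "p25th": ["95000", "UN", "31500"],
    "p75th": ["125000", "UN", "UN"],
})
columns = ["median", "p25th", "p75th"]

# Same replacement as in the exercise: swap the 'UN' marker for NaN
for column in columns:
    toy.loc[toy[column] == "UN", column] = np.nan

# Count the missing values per column
missing_counts = toy[columns].isna().sum()
print(missing_counts)
```

After the loop, isna() recognizes the replaced cells as missing, which a comparison against the string 'UN' alone would not give you.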

 

result of print(recent_grads[columns].dtypes)

median    object
p25th     object
p75th     object
dtype: object
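
The object dtype above means the values are stored as strings, and replacing 'UN' with NaN does not by itself make the columns numeric. One possible follow-up, which is not part of the lab instructions, is pd.to_numeric with errors="coerce". The sketch below uses a hypothetical one-column DataFrame named toy, built with dtype=object to mirror the output above.

```python
import pandas as pd

# Hypothetical stand-in column stored as strings, like the dtypes shown above
toy = pd.DataFrame({"median": ["110000", "UN", "62000"]}, dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["median"] = pd.to_numeric(toy["median"], errors="coerce")

print(toy["median"].dtype)
print(toy["median"].isna().sum())
```

Note that coercion also turns any other unparseable string into NaN, so it is worth inspecting the unique values first to confirm that 'UN' is the only marker.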

 

result of print(recent_grads.dtypes) 
rank                      int64
major_code                int64
major                    object
major_category           object
total                     int64
sample_size               int64
men                       int64
women                     int64
sharewomen              float64
employed                  int64
full_time                 int64
part_time                 int64
full_time_year_round      int64
unemployed                int64
unemployment_rate       float64
median                   object
p25th                    object
p75th                    object
college_jobs              int64
non_college_jobs          int64
low_wage_jobs             int64
dtype: object


result of print(recent_grads["median"].unique())
['110000' '75000' '73000' '70000' '65000' 'UN' '62000' '60000' '58000'
 '57100' '57000' '56000' '54000' '53000' '52000' '51000' '50000' '48000'
 '47000' '46000' '45000' '44700' '44000' '42000' '41300' '41000' '40100'
 '40000' '39000' '38400' '38000' '37500' '37400' '37000' '36400' '36200'
 '36000' '35600' '35000' '34000' '33500' '33400' '33000' '32500' '32400'
 '32200' '32100' '32000' '31500' '31000' '30500' '30000' '29000' '28000'
 '27500' '27000' '26000' '25000' '23400' '22000']
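
In a long list of unique values the odd entry is easy to miss by eye. A small sketch, assuming every valid value is a string of digits as in the output above, filters for the entries that are not. The Series medians is a hypothetical sample built from a few of the values shown.

```python
import pandas as pd

# A few values copied from the output above, stored as strings
medians = pd.Series(["110000", "75000", "UN", "62000"], dtype=object)

# Keep only the unique entries that are not made up entirely of digits
unique_values = medians.unique()
non_numeric = [v for v in unique_values if not str(v).isdigit()]
print(non_numeric)
```

For this sample the only entry left is 'UN', the marker the exercise replaces with np.nan.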


hint

https://stackoverflow.com/questions/50876108/names-of-the-columns-were-searching-for-missing-values