•   over 4 years ago

Missing values on graduation cohort data


I have a few questions on missing values in your dataset. I am especially interested in the missing values in "MAM_COHORT_1112", "MAS_COHORT_1112", "MBL_COHORT_1112", "MHI_COHORT_1112", "MTR_COHORT_1112", "MWH_COHORT_1112".

In the data set, features such as "MAM_COHORT_1011" ("Number of Native American students in the graduation cohort") have over 60% NA values. Is that because the values here are actually 0?

If the values are not 0, I could impute them. However, with such a high share of missing values, imputation is difficult.

I could estimate the missing values from the census data, but I noticed some discrepancies. For instance, observation 89 (ALABAMA - Mobile County) records MAM_COHORT_1011 = 45, whereas the census data for the same observation shows NH_AIAN_alone_CEN_2010 = 2. NH_AIAN_alone_CEN_2010 is defined as "Number of people who indicate no Hispanic origin and their only race as 'American Indian or Alaska Native' or report entries such as Navajo, Blackfeet, Inupiat, Yup'ik, or Central/South American Indian groups in the 2010 Census population."

  • 1 comment

  • Manager   •   over 4 years ago

    The NA values should be treated as missing, not as 0. Certain school districts might have chosen not to report these values, or the values might not have applied to them.
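
To make the distinction concrete, here is a minimal sketch of counting missing entries per column while keeping a reported 0 separate from NA. The sample rows, the `district` column, and the literal `NA` token are assumptions for illustration; the real file may encode missing values differently.

```python
import csv
import io

# Hypothetical sample rows; the actual file layout is not shown in this thread.
SAMPLE = """district,MAM_COHORT_1112,MBL_COHORT_1112
A,NA,120
B,0,NA
C,45,300
D,NA,15
"""


def missing_share(rows, column, na_token="NA"):
    """Fraction of rows whose value in `column` equals the NA token.

    A reported 0 is a real value and is not counted as missing.
    """
    values = [row[column] for row in rows]
    return sum(v == na_token for v in values) / len(values)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
for col in ("MAM_COHORT_1112", "MBL_COHORT_1112"):
    print(col, missing_share(rows, col))
```

In this toy sample, district B's 0 is kept as a reported value, so only the two NA entries count toward the missing share of MAM_COHORT_1112.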

    As for the discrepancy between the school data and the census data, we do expect this to happen. It arises because we took a single TRACT (the one with the highest proportionate geographic overlap) and assigned it to each school district. The census tract therefore acts more as a proxy for the area in or near the school district, which could have overlapped with multiple TRACTs.
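
The single-tract assignment described above can be sketched roughly as follows. This is an illustration only, not the actual matching code: the tract IDs and overlap shares are made up, and it assumes "proportionate overlap" means the share of the district's area that falls inside each tract.

```python
def assign_tract(overlaps):
    """Return the tract ID with the largest overlap share for one district.

    `overlaps` maps a tract ID to the assumed share of the district's
    area that lies inside that tract.
    """
    return max(overlaps, key=overlaps.get)


# Hypothetical tract IDs and overlap shares for a single school district.
overlaps = {"tract_a": 0.55, "tract_b": 0.30, "tract_c": 0.15}
chosen = assign_tract(overlaps)
print(chosen, overlaps[chosen])
```

In this made-up case the chosen tract covers only a little over half of the district, so census counts taken from that one tract would not be expected to match district-wide cohort counts.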

