The Gender Income Gap

How do factors (besides gender) affect the gender income gap?

Ben Hayes

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?

The issue of gender-income disparity is not new, and it is nuanced. To answer these questions, we rely on the National Longitudinal Survey of Youth, 1979 cohort, data set (abbreviated ‘NLSY79’). First, to draw strong conclusions, we must evaluate the data set provided - is it accurate, relevant, and useful for drawing statistical conclusions? Second, once we have summarized the data, we discuss methodology - how should we approach the data, what variables should we consider, and what techniques are appropriate? Third, we discuss findings about the sampled individuals and attempt to infer how the income difference (if any) between men and women in the larger population varies with other factors. Lastly, we end with a discussion of the relevance and significance of our findings given the context of the survey data available and the methodology used.

Table of Contents:

  1. Data Summary
  2. Methodology
  3. Findings
  4. Closing Discussion

1. Data Summary

The NLSY79 dataset contains records for a national sample of over 12,000 men and women who were aged 14-22 when first surveyed in 1979. The dataset contains 67 attributes about these individuals, ranging from basic demographic information such as race and gender to more descriptive information such as family size, region, and drug use.

In the boxplot below, you can see the distribution of incomes for men and women. Notice that the median income for men is higher than the median income for women and that, in general, the variability in the income of men is higher than the variability in the income of women. Also, notice the outliers for both men and women at the top of the income scale.

The income data in the NLSY79 data set is top-coded: the income values for the top 2% of earners are set to the average income of that group (to obscure the actual values). While this change seems harmless, and would be when calculating simple averages, it has the potential to affect our analysis. By setting the values of the top earners to the group average, all variation within the top 2% has been eliminated, which understates the overall standard deviation of income. This subtle change is examined further when we discuss the significance of our results later.
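
To make the top-coding rule concrete, here is a minimal R sketch on simulated incomes. The `income` vector and its log-normal shape are assumptions for illustration only, not NLSY79 values. It replaces the top 2% of values with their group average and compares the mean and standard deviation before and after.

```r
# Simulated incomes; NOT NLSY79 data (log-normal shape is an assumption).
set.seed(1)
income <- rlnorm(10000, meanlog = 10.5, sdlog = 0.8)

# Top-code: replace every value above the 98th percentile with the
# average of those values.
cutoff <- quantile(income, 0.98)
is_top <- income > cutoff
top_coded <- income
top_coded[is_top] <- mean(income[is_top])

# The overall mean is preserved, but the spread shrinks because all
# variation inside the top group has been removed.
c(mean_raw = mean(income), mean_top_coded = mean(top_coded))
c(sd_raw = sd(income), sd_top_coded = sd(top_coded))
sd(top_coded[is_top])  # effectively zero: the top earners now share one value
```

The unchanged mean is why simple averages are safe, while anything that depends on the spread of income (standard errors, residuals) is affected.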

But first, let’s examine the data available and begin to home in on a few key variables.

Gender:

The initial instinct to answer the question is to simply compare the average income of men with the average income of women.

As can be seen in the bar chart:

  • The average income of men in our dataset is approximately $24,961 more than that of women.
  • The 95% confidence intervals do not overlap, which indicates that the difference is statistically significant at the 5% level (the regression in the Findings section reports a t-statistic of -17.669 for this difference).
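
As a sketch of how the bars and their error bars could be computed, the R code below calculates a mean and a t-based 95% confidence interval per gender and then runs a two-sample t-test. The data frame `d`, its column names, and the standard deviations are assumptions made up for illustration; only the rough group means echo the figures quoted in this article.

```r
# Simulated data; column names and values are assumptions, not NLSY79.
set.seed(2)
d <- data.frame(
  gender = rep(c("Male", "Female"), each = 500),
  income = c(rnorm(500, mean = 56000, sd = 40000),
             rnorm(500, mean = 31000, sd = 30000))
)

# Mean and t-based confidence interval for one group.
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per gender: mean, lower bound, upper bound.
ci_by_gender <- t(sapply(split(d$income, d$gender), mean_ci))
ci_by_gender

# A Welch two-sample t-test examines the difference in means directly,
# which is more reliable than judging interval overlap by eye.
test <- t.test(income ~ gender, data = d)
test$p.value
```

Non-overlapping intervals imply a significant difference, but the reverse is not true, so the direct test is the safer check.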

This depiction of the gap suggests that there may well be a problem, but how can we confidently conclude that there is an income gap? Do men work more? Do they have more education? What happens when we include other variables?

Individual’s Race:

When we include race, we can see that the disparity between men and women changes:

  • The race with the largest difference in average income between men and women is the non-hispanic/non-black group.
  • The income difference by gender, when holding race constant by using sub-groups, is significant for all 3 races.
  • Note that women in the non-hispanic/non-black group, on average, earn more than men who are black (although this difference is not statistically significant).

Geographic Region:

What if you replace race with region of the country?

  • You can see the average income in the northeast is higher than in the other 3 regions but the gap between men and women remains relatively similar.
  • Most of the respondents are from the south or north central regions.
  • All differences in income by gender and region appear to be significant at the 95% confidence level.

Marital Status:

  • The majority of respondents are either married, never married, or divorced.
  • The most striking feature of this chart is that men who are married earn more than men who are never married, separated, divorced, or widowed, while there is no corresponding spike (at least none proportional to the spike for men) in income for women who are married.
  • The differences in income by gender only appear to be significant for married and divorced individuals.

Country of Origin:

A few noteworthy items about the Country of Origin variable:

  • The majority of individuals who are included in the NLSY79 survey are originally from the United States.
  • Regardless of the country, it appears that the income difference between men and women is significant.
  • Based on the data, men from outside of the U.S. see a small bump in pay relative to their counterparts from the United States; women, however, do not appear to experience the same bump.

Education Level:

  • The education variable, coupled with the gender variable, shows an interesting relationship.
  • Once the 10th grade education level is attained, we begin to see a significant difference in the incomes between men and women.
  • As education level increases, the gap in income between the genders also increases.
  • Note that the sample sizes for individuals with an 11th grade education or lower are small.

Industry/Business:

How does the gender income gap change across businesses or industries? This question could be useful for analyzing the gender gap and there are a few obvious relationships:

  • Notice that the 95% confidence interval is large for several subgroups. For example, only 3 of the remaining respondents are women in the Armed Forces. The standard deviation for this group is very high, causing the confidence interval to dip noticeably below $0.
  • The income gap between men and women is statistically significant at the 95% confidence level in several industries: Manufacturing, Wholesale/Resale Trade, Information, Finance/Insurance, Professional Services, Educational Services, and Healthcare/Social Assistance.

Due to the granularity of this variable and the lack of surveyed individuals for some of the industry~gender sub-groupings, this variable may not be helpful later when we perform regression analysis.



2. Methodology

In this section, we evaluate how to handle issues that arise in most data analyses - missing values, inappropriate values, and top-coded values - and then make our final variable selection.

Missing Values

As is standard in data analysis, missing values occur. Handling missing values is a delicate process that requires case-by-case examination. For the variables used in this analysis, the general approach was to remove records with missing values. Unfortunately, this means that an individual who has not reported marital status but has reported all other variables (including 2012 income) was removed. This process may not be acceptable for more robust analyses, but for establishing cursory findings about the relationship between income and gender and the impact of other factors, omission is not ideal but acceptable.

R provides ways to account for missing values (encode them as NA and pass na.rm = TRUE to functions such as mean()) but for the purpose of this analysis the records with missing values were removed.
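
The two options can be sketched in a few lines of R. The toy data frame below is hypothetical; its column names and values are not taken from NLSY79.

```r
# Toy data frame; names and values are hypothetical.
d <- data.frame(
  income  = c(52000, NA, 31000, 47000),
  marital = c("Married", "Divorced", NA, "Never Married")
)

# Option 1: keep every row and skip NAs within a single calculation.
mean_income <- mean(d$income, na.rm = TRUE)
mean_income

# Option 2 (the approach described above): drop any row that has a
# missing value in one of the analysis variables.
complete <- d[complete.cases(d[, c("income", "marital")]), ]
nrow(complete)
```

Option 2 discards the third row even though its income is known, which is the trade-off described above.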

Inappropriate Values

As a consequence of the survey process, which assesses different aspects of human life, participants are not always able to provide accurate information. Unfortunately, this lack of information presents itself as uninformative categories within the categorical variables.

The following categories of values were removed from the corresponding variable (these data cleanup changes are reflected in the plots shown in the Data Summary section above):

  • ‘Unknown’ from Country of Origin
  • ‘Non-interview’, ‘Invalid Skip’ from Marital Status (2000)
  • ‘Refusal’, ‘Do not know’ from Region of Current Residence
  • ‘Ungraded’ from Education Level
  • ‘Error’, ‘Not in Labor Force’ from Industry/Business
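
A minimal sketch of this kind of cleanup in R, assuming the variable is stored as a factor. The data frame and its values are made up; only the level names mirror the list above.

```r
# Toy factor; the level names mirror the list above but the data are made up.
d <- data.frame(
  region = factor(c("South", "Refusal", "West", "Do not know", "Northeast")),
  income = c(40000, 35000, 52000, 28000, 61000)
)

bad_levels <- c("Refusal", "Do not know")

# Drop the rows, then drop the now-empty factor levels so they do not
# show up as empty groups in tables or plots.
cleaned <- droplevels(d[!(d$region %in% bad_levels), ])
levels(cleaned$region)
```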

Top-Coded Income Variable

The income variable that was used (from the 2012 survey) is top-coded: the top 2% of incomes were replaced with the average of that group. While comparing simple averages is okay, this top-coding presents a problem for deeper analysis because the variation within the top 2% of incomes is eliminated. When performing regression, the residuals do not reflect the natural variance in the top-coded values. This may reduce the residual sum of squares and artificially increase the significance of a test (e.g., inflate a t-statistic).

Selecting Variables

Initially, before summarizing the data, variables were chosen to identify if common demographic details for an individual might affect their income relative to individuals from the opposite sex. To move forward with the analysis, we must decide on a set of variables that could affect the difference in income for men and women. The variables chosen for additional analysis were:

  • Income
  • Gender
  • Race
  • Education Level
  • Geographic Region
  • Marital Status

Variables that were previously evaluated but since removed from the analysis include:

  • Country of Origin - This value is suspected to not affect the income gap based on the bar chart above.
  • Industry/Business - This value subdivides the data to levels that are too small to draw statistically significant conclusions from when performing regression analysis.



3. Findings

The initial question asks us to evaluate whether there is a difference in income between men and women. This question seems simple. Yet, when we attempt to answer it in simple terms (see the comparison of average income of men versus the average income of women) we are left feeling unsatisfied. This lack of satisfaction is owed to omitted variable bias - leaving out variables that could increase or decrease the difference in income by gender. What other variables impact the difference in income? By how much? Which have more influence? We attempt to answer these questions in this section in order to more satisfyingly answer the initial question.

We can perform regression analysis (including a range of variables) to help isolate the effect of each variable on income. If we attempt to regress income on gender, you will see familiar results:

To interpret the regression output, pay close attention to the GenderFemale estimated value. This value represents how much less, on average, a woman earns than a man (while considering no other variables) and is statistically significant. You will notice that its magnitude is the same as the difference we found when we compared the simple averages!

Estimate Std. Error t value Pr(>|t|)
(Intercept) 56132 1014.78 55.314 0
GenderFemale -24961 1412.64 -17.669 0
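
The match with the simple averages is not a coincidence: with a single two-level factor, the intercept is the baseline group's mean and the coefficient is the difference in group means. In the table above, that implies an average of about 56,132 - 24,961 = 31,171 for women. The R sketch below demonstrates the identity on simulated data (the data frame `d` and its values are assumptions, not NLSY79).

```r
# Simulated data; not NLSY79.
set.seed(3)
d <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 300),
                  levels = c("Male", "Female")),  # Male is the baseline level
  income = c(rnorm(300, 56000, 40000), rnorm(300, 31000, 30000))
)

fit <- lm(income ~ Gender, data = d)
coef(fit)

# The intercept is the male mean; GenderFemale is (female mean - male mean).
group_means <- tapply(d$income, d$Gender, mean)
mean_diff <- unname(group_means["Female"] - group_means["Male"])
mean_diff
```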

As mentioned in this section’s preface, we want to develop a more satisfying model for evaluating the effect of other variables on income. Let’s see what happens when we account for other variables:

Main Effects

Income ~ Gender + Race + Education Level + Geographic Region + Marital Status
This regression model provides a coefficient estimate for each level of the categorical Gender, Race, Education Level, Geographic Region, and Marital Status variables (relative to an omitted baseline level). For example, a woman is expected to earn approximately $27,442 less than a man with the same race, education level, region, and marital status.

Additionally, living in the Northeast is associated with an income $6,749 higher than that of someone similar living in the North Central US, $5,968 higher than someone similar living in the South, and $5,976 higher than someone similar living in the West.

But we want to compare how these variables affect the gap in income between men and women. The interaction models that follow the main-effects table below do exactly that.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 25581 4951.04 5.167 0.000
GenderFemale -27442 1288.03 -21.305 0.000
RaceBlack -9767 2052.25 -4.759 0.000
RaceNon-Black/Hispanic 3066 1864.02 1.645 0.100
GradeCompleted_20129th Grade 3803 6297.86 0.604 0.546
GradeCompleted_201210th Grade 2557 6410.46 0.399 0.690
GradeCompleted_201211th Grade 5478 6061.20 0.904 0.366
GradeCompleted_201212th Grade 16970 4504.37 3.767 0.000
GradeCompleted_20121st Year College 28402 4876.79 5.824 0.000
GradeCompleted_20122nd Year College 27868 4806.69 5.798 0.000
GradeCompleted_20123rd Year College 31167 5166.32 6.033 0.000
GradeCompleted_20124th Year College 57692 4779.08 12.072 0.000
GradeCompleted_20125th Year/More College 74627 4815.29 15.498 0.000
RegionOfCurrentResidence_2012North Central -6749 2133.76 -3.163 0.002
RegionOfCurrentResidence_2012South -5968 1958.07 -3.048 0.002
RegionOfCurrentResidence_2012West -5976 2266.60 -2.637 0.008
MaritalStatus_2000Married 11664 1772.77 6.579 0.000
MaritalStatus_2000Separated 3249 3044.14 1.067 0.286
MaritalStatus_2000Divorced 5491 2236.76 2.455 0.014
MaritalStatus_2000Widowed 11400 6925.50 1.646 0.100

Interaction Effects

Income ~ Gender + … + Gender * Race
This regression allows us to evaluate how the gender income gap varies by race. Notice that the variable GenderFemale:RaceNon-Black/Hispanic has a coefficient of -16267, which is statistically significant and indicates that, on average, a woman who is not black or hispanic experiences a larger pay gap (a gap $16,267 larger than that of a hispanic woman). Conversely, the positive GenderFemale:RaceBlack coefficient (7540, p = 0.044) indicates a smaller gap for black women. The larger gap for the non-black/non-hispanic group is likely due to non-black/non-hispanic individuals earning more in general, as shown in the graph above.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 21729 5152.83 4.217 0.000
GenderFemale -21368 2967.80 -7.200 0.000
RaceBlack -13622 2823.48 -4.825 0.000
RaceNon-Black/Hispanic 11440 2576.92 4.439 0.000
GradeCompleted_20129th Grade 4811 6266.17 0.768 0.443
GradeCompleted_201210th Grade 4102 6382.31 0.643 0.520
GradeCompleted_201211th Grade 6873 6030.92 1.140 0.255
GradeCompleted_201212th Grade 17917 4481.71 3.998 0.000
GradeCompleted_20121st Year College 28884 4851.89 5.953 0.000
GradeCompleted_20122nd Year College 28180 4782.15 5.893 0.000
GradeCompleted_20123rd Year College 31579 5139.00 6.145 0.000
GradeCompleted_20124th Year College 57961 4753.39 12.194 0.000
GradeCompleted_20125th Year/More College 75273 4790.83 15.712 0.000
RegionOfCurrentResidence_2012North Central -7160 2122.26 -3.374 0.001
RegionOfCurrentResidence_2012South -6340 1947.47 -3.256 0.001
RegionOfCurrentResidence_2012West -6047 2253.87 -2.683 0.007
MaritalStatus_2000Married 12031 1763.45 6.822 0.000
MaritalStatus_2000Separated 2662 3027.92 0.879 0.379
MaritalStatus_2000Divorced 5588 2224.14 2.512 0.012
MaritalStatus_2000Widowed 10473 6889.31 1.520 0.129
GenderFemale:RaceBlack 7540 3741.25 2.015 0.044
GenderFemale:RaceNon-Black/Hispanic -16267 3452.93 -4.711 0.000
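
To read the interaction terms, add each one to the GenderFemale coefficient, which on its own is the gap for the baseline race group (Hispanic). A short R sketch of that arithmetic, using coefficients copied from the table above (the variable names are made up for illustration):

```r
# Coefficients copied from the Gender * Race table above.
gender_female      <- -21368  # GenderFemale: gap for the baseline group (Hispanic)
female_black       <-   7540  # GenderFemale:RaceBlack
female_nonblk_nonh <- -16267  # GenderFemale:RaceNon-Black/Hispanic

# Estimated female - male income difference within each race group,
# holding education, region, and marital status fixed.
gap_by_race <- c(
  "Hispanic"               = gender_female,
  "Black"                  = gender_female + female_black,
  "Non-Black/Non-Hispanic" = gender_female + female_nonblk_nonh
)
gap_by_race
```

These are point estimates only; each sum carries its own standard error, which this sketch does not compute.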

Regression Comparison

The regression model that includes an interaction term between race and gender indicates that race does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * Race
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6150 1.5173e+13  2 1.7834e+11 36.142 2.484e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
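
The F statistic in the table above can be reproduced by hand from the printed values, which helps show what anova() is comparing: the drop in residual sum of squares per added coefficient, relative to the residual variance of the larger model. The sketch below uses the rounded figures as printed, so the result agrees with the table only up to rounding.

```r
# Values copied from the ANOVA table above (rounded as printed).
rss_full <- 1.5173e13  # RSS of Model 2 (with the Gender * Race interaction)
df_full  <- 6150       # residual degrees of freedom of Model 2
sum_sq   <- 1.7834e11  # reduction in RSS from adding the interaction
df_added <- 2          # number of coefficients added

# F = (reduction in RSS per added coefficient) / (residual variance of Model 2)
f_stat  <- (sum_sq / df_added) / (rss_full / df_full)
p_value <- pf(f_stat, df_added, df_full, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```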

Interaction Effects

Income ~ Gender + … + Gender * Education Level
This regression allows us to evaluate how the gender income gap varies with education level. Notice that the interaction coefficients for women generally increase in magnitude (becoming more negative) as education level increases, although only the terms for 4 or more years of college are significant at the 5% level. This effect arises because their male counterparts (who are in the same ‘bucket’ on the other regression criteria but are male) earn increasingly more than women. For example, for a woman with 4 years of college completed, the income gap relative to a man with 4 years of college completed and the same other attributes (marital status, etc.) is $39,398 larger than the gap at the baseline education level. Adding the GenderFemale coefficient (-11414) gives an estimated total gap of about $50,812 at that level.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 19122 6360.45 3.006 0.003
GenderFemale -11414 8589.75 -1.329 0.184
RaceBlack -9689 2022.11 -4.792 0.000
RaceNon-Black/Hispanic 2854 1838.12 1.553 0.121
GradeCompleted_20129th Grade 6063 8084.54 0.750 0.453
GradeCompleted_201210th Grade 2422 8576.39 0.282 0.778
GradeCompleted_201211th Grade 6057 8054.58 0.752 0.452
GradeCompleted_201212th Grade 18970 6088.08 3.116 0.002
GradeCompleted_20121st Year College 33303 6731.88 4.947 0.000
GradeCompleted_20122nd Year College 30750 6618.76 4.646 0.000
GradeCompleted_20123rd Year College 40414 7271.14 5.558 0.000
GradeCompleted_20124th Year College 77505 6480.13 11.960 0.000
GradeCompleted_20125th Year/More College 104554 6594.11 15.856 0.000
RegionOfCurrentResidence_2012North Central -7266 2101.33 -3.458 0.001
RegionOfCurrentResidence_2012South -6206 1928.34 -3.218 0.001
RegionOfCurrentResidence_2012West -6271 2232.91 -2.809 0.005
MaritalStatus_2000Married 10602 1747.97 6.065 0.000
MaritalStatus_2000Separated 1819 3001.55 0.606 0.545
MaritalStatus_2000Divorced 5251 2206.10 2.380 0.017
MaritalStatus_2000Widowed 8327 6828.86 1.219 0.223
GenderFemale:GradeCompleted_20129th Grade -633 12763.07 -0.050 0.960
GenderFemale:GradeCompleted_201210th Grade 870 12640.48 0.069 0.945
GenderFemale:GradeCompleted_201211th Grade 124 11965.78 0.010 0.992
GenderFemale:GradeCompleted_201212th Grade -4450 8800.54 -0.506 0.613
GenderFemale:GradeCompleted_20121st Year College -11291 9560.32 -1.181 0.238
GenderFemale:GradeCompleted_20122nd Year College -7813 9410.87 -0.830 0.406
GenderFemale:GradeCompleted_20123rd Year College -18892 10148.59 -1.862 0.063
GenderFemale:GradeCompleted_20124th Year College -39398 9285.43 -4.243 0.000
GenderFemale:GradeCompleted_20125th Year/More College -55082 9373.84 -5.876 0.000
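
The estimated gap at each education level is the GenderFemale coefficient plus the matching interaction term. The R sketch below does that addition for every level, using coefficients copied from the table above (the vector names are made up for illustration).

```r
# Coefficients copied from the Gender * Education Level table above.
gender_female <- -11414  # GenderFemale: gap at the baseline (omitted) education level
interaction <- c(
  "9th Grade"             =   -633,
  "10th Grade"            =    870,
  "11th Grade"            =    124,
  "12th Grade"            =  -4450,
  "1st Year College"      = -11291,
  "2nd Year College"      =  -7813,
  "3rd Year College"      = -18892,
  "4th Year College"      = -39398,
  "5th Year/More College" = -55082
)

# Estimated female - male income difference at each education level,
# holding race, region, and marital status fixed.
total_gap <- gender_female + interaction
total_gap
```

Most of the individual interaction terms below 4 years of college are not statistically significant, so the sums for those levels should be read as rough point estimates.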

Regression Comparison

The regression model that includes an interaction term between education level and gender indicates that education level does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * GradeCompleted_2012
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6143 1.4851e+13  9 5.0049e+11 23.003 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction Effects

Income ~ Gender + … + Gender * Geographic Region
When including an interaction term of gender with geographic region, notice that all of the GenderFemale:Region coefficients are positive. Because the Northeast is the baseline region, this indicates that women in the Northeast experience the largest income gap. For example, a woman in the North Central region experiences an income gap $3,933 smaller than a woman with the same qualifications in the Northeast.

Note that the t-statistics vary in their significance: the South (p = 0.013) and West (p = 0.046) interaction terms are significant at the 5% level, while the North Central term (p = 0.353) is not.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 29015 5179.78 5.602 0.000
GenderFemale -34149 3332.04 -10.249 0.000
RaceBlack -9777 2051.47 -4.766 0.000
RaceNon-Black/Hispanic 3080 1863.58 1.653 0.098
GradeCompleted_20129th Grade 3927 6295.72 0.624 0.533
GradeCompleted_201210th Grade 2519 6409.32 0.393 0.694
GradeCompleted_201211th Grade 5264 6060.06 0.869 0.385
GradeCompleted_201212th Grade 16873 4503.76 3.746 0.000
GradeCompleted_20121st Year College 28318 4878.06 5.805 0.000
GradeCompleted_20122nd Year College 27784 4806.87 5.780 0.000
GradeCompleted_20123rd Year College 31182 5165.05 6.037 0.000
GradeCompleted_20124th Year College 57639 4778.75 12.062 0.000
GradeCompleted_20125th Year/More College 74491 4814.94 15.471 0.000
RegionOfCurrentResidence_2012North Central -8736 3037.97 -2.876 0.004
RegionOfCurrentResidence_2012South -10912 2781.96 -3.922 0.000
RegionOfCurrentResidence_2012West -10428 3173.16 -3.286 0.001
MaritalStatus_2000Married 11734 1772.58 6.620 0.000
MaritalStatus_2000Separated 3151 3043.21 1.035 0.301
MaritalStatus_2000Divorced 5594 2236.23 2.501 0.012
MaritalStatus_2000Widowed 11439 6925.01 1.652 0.099
GenderFemale:RegionOfCurrentResidence_2012North Central 3933 4236.04 0.928 0.353
GenderFemale:RegionOfCurrentResidence_2012South 9618 3858.84 2.492 0.013
GenderFemale:RegionOfCurrentResidence_2012West 8796 4414.11 1.993 0.046

Regression Comparison

The regression model that includes an interaction term between geographic region and gender indicates that geographic region has a marginally significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value of 0.048, just under the 0.05 cut-off.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * RegionOfCurrentResidence_2012
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   6152 1.5352e+13                               
## 2   6149 1.5332e+13  3 1.9761e+10 2.6418 0.04767 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction Effects

Income ~ Gender + … + Gender * Marital Status
When we add an interaction term for gender and marital status, we see an interesting effect. Women who were married as of the year 2000 experience an income gap that is $30,114 larger than that of women with the same qualifications who never married. A similar but smaller effect is seen for women who are divorced ($9,112, p = 0.039); the estimate for widowed women points the same way but is not statistically significant. This could be because the women in these categories are educated but end their careers early or take less demanding jobs to raise a family, etc.

Estimate Std. Error t value Pr(>|t|)
(Intercept) 17024 5051.46 3.370 0.001
GenderFemale -7884 2941.84 -2.680 0.007
RaceBlack -10335 2036.55 -5.075 0.000
RaceNon-Black/Hispanic 3419 1848.80 1.849 0.064
GradeCompleted_20129th Grade 4373 6247.53 0.700 0.484
GradeCompleted_201210th Grade 3282 6356.89 0.516 0.606
GradeCompleted_201211th Grade 7119 6016.28 1.183 0.237
GradeCompleted_201212th Grade 17402 4467.01 3.896 0.000
GradeCompleted_20121st Year College 28670 4836.33 5.928 0.000
GradeCompleted_20122nd Year College 27885 4766.46 5.850 0.000
GradeCompleted_20123rd Year College 31456 5125.57 6.137 0.000
GradeCompleted_20124th Year College 57505 4739.13 12.134 0.000
GradeCompleted_20125th Year/More College 73989 4775.98 15.492 0.000
RegionOfCurrentResidence_2012North Central -7452 2116.96 -3.520 0.000
RegionOfCurrentResidence_2012South -6303 1942.26 -3.245 0.001
RegionOfCurrentResidence_2012West -6270 2248.08 -2.789 0.005
MaritalStatus_2000Married 25682 2341.13 10.970 0.000
MaritalStatus_2000Separated -21 4662.42 -0.004 0.996
MaritalStatus_2000Divorced 8081 3210.68 2.517 0.012
MaritalStatus_2000Widowed 8888 14456.33 0.615 0.539
GenderFemale:MaritalStatus_2000Married -30114 3366.00 -8.947 0.000
GenderFemale:MaritalStatus_2000Separated 68 6135.03 0.011 0.991
GenderFemale:MaritalStatus_2000Divorced -9112 4404.25 -2.069 0.039
GenderFemale:MaritalStatus_2000Widowed -5134 16460.84 -0.312 0.755
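
The same coefficients can be read from the other direction: how much more do married respondents earn than never-married respondents of the same gender? The R sketch below works this out from values copied from the table above, assuming (from the omitted level) that never married is the baseline. The variable names are made up for illustration.

```r
# Coefficients copied from the Gender * Marital Status table above.
married_main     <-  25682  # MaritalStatus_2000Married
female_main      <-  -7884  # GenderFemale: gap among the never married
female_x_married <- -30114  # GenderFemale:MaritalStatus_2000Married

# Estimated income difference, married vs. never married, within each gender.
married_effect <- c(Men = married_main, Women = married_main + female_x_married)
married_effect

# Estimated female - male gap among married respondents.
gap_married <- female_main + female_x_married
gap_married
```

The estimate for men is large and positive while the estimate for women is slightly negative, which matches the missing ‘spike’ for married women noted in the Data Summary.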

Regression Comparison

The regression model that includes an interaction term between marital status and gender indicates that marital status does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * MaritalStatus_2000
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6148 1.5083e+13  4 2.6906e+11 27.419 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How ‘good’ is our initial (interaction-less) regression model?

Residuals vs. Fitted

The residual versus fitted plot for the linear regression indicates two problems with our data. First, if you track the red line, it is clear that the average residual (the part of income that our regression did not account for) decreases as the fitted value increases. This indicates a non-linear relationship.

Second, above the primary band of plotted points, there is a second band. This separation indicates there is a second influence on the residuals produced by our regression model; one plausible source is the top-coded incomes, which all share a single value.

Normal Q-Q

The Normal Q-Q plot is useful for evaluating the normality of our residuals. It plots the quantiles of the standardized residuals against the quantiles expected from a normal distribution. If the residuals are normal, the points on the Normal Q-Q plot will fall along a straight line. In the case of our initial regression, you can see that the points do not follow the line - the residuals from our income regression are not normally distributed. The upper quantiles of the residuals lie far above the values expected under normality, and there is a subtle dip below the expected normal values at the lower end.

Scale-Location

The scale-location plot is useful for evaluating whether our data is ‘homoscedastic’ or ‘heteroscedastic’. In this case, our data appears to be heteroscedastic - the spread of the residuals appears to increase as the fitted value increases. In other words, our ability to predict income from the variables in our regression decreases as the predicted value increases.
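
The visual impression can be backed by a formal check. The R sketch below computes a studentized Breusch-Pagan statistic by hand in base R. This is a swapped-in technique for illustration, not something the analysis above ran, and the data are simulated so that the residual spread grows with the predictor.

```r
# Simulated data whose noise grows with x; not NLSY79.
set.seed(5)
n <- 500
x <- runif(n, 8, 20)
y <- 5000 * x + rnorm(n, sd = 1500 * x)
fit <- lm(y ~ x)

# Studentized Breusch-Pagan statistic computed by hand: regress the
# squared residuals on the predictors. Under homoscedasticity, n * R^2
# is approximately chi-squared with df = number of predictors (here 1).
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value is evidence against constant residual variance, in line with what a fanning scale-location plot suggests.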

Residuals vs Leverage

The residuals versus leverage plot indicates that none of our observations was unduly influential in the regression analysis. If there were influential cases, they would appear outside of the Cook’s distance contours (which do not appear within the plotted range). If we were to exclude any single observation, there would not be much change in the regression model.
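
For reference, the four plots discussed above are the default diagnostics that R draws for a fitted lm object. The sketch below produces them for a model fitted to simulated data; the variables `educ_years` and `income` are hypothetical stand-ins, not NLSY79 columns.

```r
# Simulated data with spread that grows with the mean; not NLSY79.
set.seed(4)
n <- 500
educ_years <- sample(8:20, n, replace = TRUE)
income <- 5000 * educ_years + rnorm(n, sd = 1500 * educ_years)
fit <- lm(income ~ educ_years)

# Draw the four default diagnostic plots in a 2 x 2 grid:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
pdf(NULL)  # null device so the sketch runs non-interactively; omit to view the plots
par(mfrow = c(2, 2))
plot(fit)
invisible(dev.off())

# Cook's distance underlies the contours on the Residuals vs Leverage
# plot; values near or above 1 are a common rule of thumb for
# influential observations.
max_cook <- max(cooks.distance(fit))
max_cook
```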

How ‘good’ is one of our interaction regression models?

Below are the diagnostic plots for the interaction regression where we consider how the gender gap varies with education level. Notice how the plots remain largely unchanged. The residuals are still not normal in shape - so inference from a linear regression should be treated with caution. Also notice how the data still demonstrates heteroscedasticity.


4. Closing Discussion

The goal of this analysis is to identify how other variables relate to the income gap between men and women. To begin, we had to clean the NLSY79 data set and determine a strategy for dealing with non-normal data, top-coded values, and missing values. Following that, we discussed the impact of variables on the income gap by using regression analysis and the R anova() function to compare pairs of nested regression models. We noticed significant effects on the size of the income gap between men and women associated with the race, geographic region, education level, and marital status variables. These effects are described in more detail above.

How confident are we in the results?

Unfortunately, a rigorous analysis of all 67+ variables in the NLSY79 data set could not be completed; such an analysis would help explain other relationships in the dataset. For example: how much collinearity is present in our regression models? Additionally, the relationship between the variables in some cases, as described above, is not significant or is close to the 95% confidence cut-off. These could be re-evaluated without removing all records with missing values from the set of variables that we chose; the significance of the relationships may change. Also, the data demonstrates heteroscedasticity, which may point to omitted variable bias (another variable that would explain how the spread of our residuals varies with our fitted values). Lastly, the regression diagnostic plots indicate that the residuals are not normally distributed, so the significance tests from our linear regression models should be treated with caution. These considerations reduce our confidence in the analysis and in the regression models produced.


