Ben Hayes - The Gender Income Gap

The Gender Income Gap

How do factors (besides gender) affect the gender income gap?

Feb 11, 2018 20 min read Ben Hayes

Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?

The issue of gender-income disparity is not new, but it is nuanced. To answer these questions, we rely on the National Longitudinal Survey of Youth, 1979 cohort, data set (abbreviated ‘NLSY79’). First, to draw strong conclusions, we must evaluate the data set provided - is it accurate, relevant, and useful for drawing statistical conclusions? Second, once we summarize the data, we can discuss methodology - how should we approach the data, what variables should we consider, what techniques are appropriate? Third, we discuss findings about the sampled individuals and attempt to infer how the income difference (if any) between men and women in the larger population depends on other factors. Lastly, we end with a discussion of the relevance and significance of our findings given the context of the survey data available and the methodology used.

1. Data Summary

The NLSY79 dataset contains records for a national sample of over 12,000 men and women who were aged 14-22 when first surveyed in 1979. The dataset contains 67 attributes about these individuals, ranging from basic demographic information such as race and gender to more descriptive information such as family size, region, and drug use.

In the boxplot below, you can see the distribution of incomes for men and women. Notice that the median income for men is higher than the median income for women and, in general, the variability in the income of men is higher than the variability in the income of women. Also, notice the outliers for both men and women at the top of the income scale.

The income data from the National Longitudinal Survey of Youth, 1979 cohort, data set is top-coded. The income values for the top 2% of earners are set to the average income of the top 2% of earners (to obscure the actual values). While this change seems harmless, and would be when calculating simple averages, it does have potential to affect our analysis. By setting values of the top earners to the average of that group, the standard deviation for the top 2% has been eliminated. This subtle change will be further examined when we determine and discuss significance of our results later.

But first, let’s examine the data available and begin to home in on a few key variables.

Gender:

The initial instinct to answer the question is to simply compare the average income of men with the average income of women.

As can be seen in the bar chart:

The average income of men in our dataset is approximately $24961 higher than that of women.
The 95% confidence intervals do not overlap, so the difference is statistically significant at the 5% level (the p-value is below 0.0001).
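The comparison described above can be sketched in R. This is a minimal illustration on simulated incomes, not the NLSY79 data; the data frame `sample_df`, its column names, and the group means and standard deviations are all assumptions made up for the example.

```r
# Simulated incomes (hypothetical values, NOT the NLSY79 data)
set.seed(1)
sample_df <- data.frame(
  gender = rep(c("Male", "Female"), each = 200),
  income = c(rnorm(200, mean = 56000, sd = 30000),
             rnorm(200, mean = 31000, sd = 20000))
)

# Mean and 95% confidence interval for each group
ci_by_group <- sapply(split(sample_df$income, sample_df$gender), function(x) {
  half_width <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
})
print(round(ci_by_group))

# Welch two-sample t-test of the difference in group means
welch <- t.test(income ~ gender, data = sample_df)
print(welch$conf.int)
print(welch$p.value)
```

Non-overlapping 95% intervals imply a significant difference, but the reverse does not hold: overlapping intervals can still hide a significant difference, which is why the direct test on the difference is the safer check.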

This depiction of the gap indicates that there may well be a problem, but how can we confidently conclude that there is an income gap? Do men work more? Have higher education? What happens when we include other variables?

Individual’s Race:

When we include race, we can see that the disparity between men and women changes:

The race with the highest difference in average income is the non-hispanic/non-black group.
The income difference by gender, when holding race constant by using sub-groups, is significant for all 3 race groups.
Note that women in the non-hispanic/non-black group, on average, earn more income than men who are black (although we can show that any difference is not significant).

Geographic Region:

What if you replace race with region of the country?

You can see the average income in the northeast is higher than in the other 3 regions but the gap between men and women remains relatively similar.
Most of the respondents are from the south or north central regions.
All differences in income by gender and region appear to be significant at the 95% confidence level.

Marital Status:

The majority of respondents are either married, never married, or divorced:
The first observation for this chart is that men who are married earn more than men who are never married, separated, divorced, or widowed, and that there is no corresponding spike (at least none proportional to the spike for men) in income for women who are married.
The differences in income by gender only appear to be significant for married and divorced individuals.

Country of Origin:

A few noteworthy items about the Country of Origin variable:

The majority of individuals who are included in the NLSY79 survey are originally from the United States.
Regardless of the country, it appears that the income difference between men and women is significant.
Based on the data, men from outside of the U.S. see a small bump in pay relative to their counterparts from the United States. Women, however, do not appear to experience the same bump.

Education Level:

The education variable, coupled with the gender variable, shows an interesting relationship:
Once the 10th grade education level is attained, we begin to see a significant difference in the incomes between men and women.
As education level increases, the gap in income between the genders also increases.
Note that the sample sizes for individuals with an 11th grade education or lower are smaller.

Industry/Business:

How does the gender income gap change across businesses or industries? This question could be useful for analyzing the gender gap and there are a few obvious relationships:

Notice the 95% confidence interval is large for several subgroups. For example, only 3 of the remaining respondents are women in the Armed Forces. The standard deviation for this group is very high, causing the confidence interval to dip noticeably below $0.
The income gap between men and women is statistically significant at the 95% confidence level in several industries: Manufacturing, Wholesale/Resale Trade, Information, Finance/Insurance, Professional Services, Educational Services, and Healthcare/Social Assistance.

Due to the granularity of this variable and the lack of surveyed individuals for some of the industry~gender sub-groupings, this variable may not be helpful later when we perform regression analysis.


2. Methodology

In this section, we evaluate how to handle anomalies that arise during most data analyses: missing values, inappropriate values, top-coded values, and final variable selection.

Missing Values

As is standard in data analysis, missing values will occur. Handling missing values is a delicate process which requires case-by-case examination. For the variables used in this analysis, the general approach was to remove records with missing values. Unfortunately, this means that an individual who did not report marital status but reported all other variables (including 2012 income) was removed. This process may not be acceptable for more robust analyses, but for establishing cursory findings about the relationship between income and gender and the impact of other factors, omission is not ideal but acceptable.

R provides ways to account for missing values (encode them as NA and pass na.rm = TRUE to summary functions such as mean()), but for the purpose of this analysis the records with missing values were removed.
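The two options can be contrasted with a short sketch. The data frame `survey` and its values are hypothetical and only illustrate the mechanics; they are not taken from the NLSY79 data.

```r
# Hypothetical records; NA marks a missing response
survey <- data.frame(
  income  = c(52000, NA, 31000, 47000),
  marital = c("Married", "Divorced", NA, "Never Married")
)

# Option 1: keep every row and skip NAs inside a single calculation
mean_income <- mean(survey$income, na.rm = TRUE)
print(mean_income)

# Option 2: drop every row that has any missing value
# (the approach taken in this analysis)
complete <- na.omit(survey)
print(nrow(complete))
```

Note that the second option discards the third record even though its income is known, which is the trade-off described above.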

Inappropriate Values

As a consequence of the survey process which assesses different aspects of human life, the participants are not always able to provide accurate information. Unfortunately, this lack of information presents itself as unhelpful categorical variables.

The following categories of values were removed from the corresponding variable (these data cleanup changes are reflected in the plots shown in the Data Summary section above):

‘Unknown’ from Country of Origin
‘Non-interview’, ‘Invalid Skip’ from Marital Status (2000)
‘Refusal’, ‘Do not know’ from Region of Current Residence
‘Ungraded’ from Education Level
‘Error’, ‘Not in Labor Force’ from Industry/Business
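A cleanup step of this kind might look like the following sketch. The data frame `survey`, its column names, and its values are assumptions for illustration; only the category labels come from the list above.

```r
# Hypothetical survey extract with two uninformative region categories
survey <- data.frame(
  region = factor(c("Northeast", "South", "Refusal", "West", "Do not know")),
  income = c(61000, 42000, 38000, 45000, 29000)
)

# Categories that carry no usable information for the analysis
uninformative <- c("Refusal", "Do not know")

# Remove the rows, then drop the now-empty factor levels so they do not
# show up as empty groups in plots or as extra terms in a regression
cleaned <- subset(survey, !(region %in% uninformative))
cleaned$region <- droplevels(cleaned$region)
print(levels(cleaned$region))
```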

Top-Coded Income Variable

The income variable that was used (from the 2012 survey) is top-coded. The top 2% of incomes were replaced with the average of that group. While comparing simple averages is okay, this top-coding presents a problem for deeper analysis because the variation within the top 2% of incomes is eliminated. When performing regression, the residuals do not reflect the natural variance in the top-coded values. This may reduce the residual sum of squares and artificially inflate the significance of a test (e.g., the t-statistic).
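The effect of top-coding can be illustrated on simulated data. This is a sketch under assumptions: the log-normal incomes and the way the cutoff is computed are made up for the example and are not the survey's actual procedure.

```r
# Hypothetical right-skewed incomes (NOT the NLSY79 data)
set.seed(42)
income <- rlnorm(5000, meanlog = 10.5, sdlog = 0.8)

# Replace the top 2% of values with the average of that group
cutoff <- quantile(income, 0.98)
top <- income > cutoff
top_coded <- income
top_coded[top] <- mean(income[top])

# The overall mean is preserved...
print(c(mean_raw = mean(income), mean_top_coded = mean(top_coded)))

# ...but the overall spread shrinks, and the spread within the
# top group collapses to (effectively) zero
print(c(sd_raw = sd(income), sd_top_coded = sd(top_coded)))
print(sd(top_coded[top]))
```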

Selecting Variables

Initially, before summarizing the data, variables were chosen to identify if common demographic details for an individual might affect their income relative to individuals from the opposite sex. To move forward with the analysis, we must decide on a set of variables that could affect the difference in income for men and women. The variables chosen for additional analysis were:

Income
Gender
Race
Education Level
Geographic Region
Marital Status

Variables that were previously evaluated but since removed from the analysis include:

Country of Origin - This value is suspected to not affect the income gap based on the bar chart above.
Industry/Business - This value subdivides the data to levels that are too small to draw statistically significant conclusions from when performing regression analysis.


3. Findings

The initial question asks us to evaluate whether there is a difference in income between men and women. This question seems simple. Yet, when we attempt to answer it in simple terms (see the comparison of average income of men versus the average income of women) we are left feeling unsatisfied. This lack of satisfaction is owed to omitted variable bias - leaving out variables that could increase or decrease the difference in income by gender. What other variables impact the difference in income? By how much? Which have more influence? We attempt to answer these questions in this section in order to more satisfyingly answer the initial question.

We can perform regression analysis (including a range of variables) to help isolate the effect of each variable on income. If we attempt to regress income on gender, you will see familiar results:

To interpret the regression output, pay close attention to the GenderFemale estimate. This value represents how much less a woman can expect to earn than a man (while considering no other variables), and it is statistically significant. You will notice that the value is the same as when we compared the simple averages!

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	56132	1014.78	55.314	0
GenderFemale	-24961	1412.64	-17.669	0
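The match between the regression coefficient and the difference in simple averages is not a coincidence. A minimal sketch on simulated data shows it; the data frame `d` and its values are hypothetical, and only the variable naming mirrors the output above.

```r
# Simulated data (hypothetical values, NOT the NLSY79 data)
set.seed(7)
d <- data.frame(
  Gender = factor(rep(c("Male", "Female"), each = 100),
                  levels = c("Male", "Female")),  # Male is the reference level
  Income = c(rnorm(100, 56000, 25000), rnorm(100, 31000, 18000))
)

# Regress income on gender alone
fit <- lm(Income ~ Gender, data = d)
print(coef(fit))

# Compare with the group means: the intercept is the male mean and
# GenderFemale is the female mean minus the male mean
group_means <- tapply(d$Income, d$Gender, mean)
mean_difference <- unname(group_means["Female"] - group_means["Male"])
print(mean_difference)
```

With a single two-level categorical predictor, least squares fits each group at its own mean, so the coefficient can only reproduce the gap in averages.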

As mentioned in this section’s preface, we want to develop a more satisfying model for evaluating the effect of other variables on income. Let’s see what happens when we account for other variables:

Main Effects

Income ~ Gender + Race + Education Level + Geographic Region + Marital Status

This regression model provides a list of coefficient estimates for the categorical values of the Gender, Race, Education Level, Geographic Region, and Marital Status variables. For example, a woman is expected to earn approximately $27442 less than a man with the same race, education level, region, and marital status.

Additionally, living in the Northeast is associated with an income $6749 higher than someone similar living in the North Central US, $5968 higher than someone similar living in the South, and $5976 higher than someone similar living in the West.

But we want to compare how these variables affect the gap in income between men and women… Then let’s do it!

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	25581	4951.04	5.167	0.000
GenderFemale	-27442	1288.03	-21.305	0.000
RaceBlack	-9767	2052.25	-4.759	0.000
RaceNon-Black/Hispanic	3066	1864.02	1.645	0.100
GradeCompleted_20129th Grade	3803	6297.86	0.604	0.546
GradeCompleted_201210th Grade	2557	6410.46	0.399	0.690
GradeCompleted_201211th Grade	5478	6061.20	0.904	0.366
GradeCompleted_201212th Grade	16970	4504.37	3.767	0.000
GradeCompleted_20121st Year College	28402	4876.79	5.824	0.000
GradeCompleted_20122nd Year College	27868	4806.69	5.798	0.000
GradeCompleted_20123rd Year College	31167	5166.32	6.033	0.000
GradeCompleted_20124th Year College	57692	4779.08	12.072	0.000
GradeCompleted_20125th Year/More College	74627	4815.29	15.498	0.000
RegionOfCurrentResidence_2012North Central	-6749	2133.76	-3.163	0.002
RegionOfCurrentResidence_2012South	-5968	1958.07	-3.048	0.002
RegionOfCurrentResidence_2012West	-5976	2266.60	-2.637	0.008
MaritalStatus_2000Married	11664	1772.77	6.579	0.000
MaritalStatus_2000Separated	3249	3044.14	1.067	0.286
MaritalStatus_2000Divorced	5491	2236.76	2.455	0.014
MaritalStatus_2000Widowed	11400	6925.50	1.646	0.100

Interaction Effects

Income ~ Gender + … + **Gender * Race**

This regression allows us to evaluate how the income gap between men and women varies with race. Notice that the variable GenderFemale:RaceNon-Black/Hispanic has a coefficient of -16267, which is statistically significant and indicates that, on average, a woman who is not black or hispanic experiences a larger pay gap (a gap $16267 larger than that of a hispanic woman). This effect is likely due to non-black/non-hispanic individuals earning more in general, as shown in the graph above.

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	21729	5152.83	4.217	0.000
GenderFemale	-21368	2967.80	-7.200	0.000
RaceBlack	-13622	2823.48	-4.825	0.000
RaceNon-Black/Hispanic	11440	2576.92	4.439	0.000
GradeCompleted_20129th Grade	4811	6266.17	0.768	0.443
GradeCompleted_201210th Grade	4102	6382.31	0.643	0.520
GradeCompleted_201211th Grade	6873	6030.92	1.140	0.255
GradeCompleted_201212th Grade	17917	4481.71	3.998	0.000
GradeCompleted_20121st Year College	28884	4851.89	5.953	0.000
GradeCompleted_20122nd Year College	28180	4782.15	5.893	0.000
GradeCompleted_20123rd Year College	31579	5139.00	6.145	0.000
GradeCompleted_20124th Year College	57961	4753.39	12.194	0.000
GradeCompleted_20125th Year/More College	75273	4790.83	15.712	0.000
RegionOfCurrentResidence_2012North Central	-7160	2122.26	-3.374	0.001
RegionOfCurrentResidence_2012South	-6340	1947.47	-3.256	0.001
RegionOfCurrentResidence_2012West	-6047	2253.87	-2.683	0.007
MaritalStatus_2000Married	12031	1763.45	6.822	0.000
MaritalStatus_2000Separated	2662	3027.92	0.879	0.379
MaritalStatus_2000Divorced	5588	2224.14	2.512	0.012
MaritalStatus_2000Widowed	10473	6889.31	1.520	0.129
GenderFemale:RaceBlack	7540	3741.25	2.015	0.044
GenderFemale:RaceNon-Black/Hispanic	-16267	3452.93	-4.711	0.000

Regression Comparison

The regression model that includes an interaction term between race and gender indicates that race does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * Race
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6150 1.5173e+13  2 1.7834e+11 36.142 2.484e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
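A comparison of this kind can be reproduced in miniature. The sketch below uses simulated data with a gender gap that is deliberately larger for one group; the data frame `d`, the group labels "A", "B", "C", and all effect sizes are assumptions for illustration, not the models fitted above.

```r
# Simulated data (hypothetical, NOT the NLSY79 data)
set.seed(123)
n <- 600
d <- data.frame(
  Gender = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  Race   = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)

# Build in a gender gap that is larger for group C
gap <- ifelse(d$Gender == "Female", -20000, 0) +
  ifelse(d$Gender == "Female" & d$Race == "C", -15000, 0)
d$Income <- 50000 + gap + rnorm(n, sd = 20000)

# Nested models: main effects only vs. main effects plus interaction
main_effects     <- lm(Income ~ Gender + Race, data = d)
with_interaction <- lm(Income ~ Gender * Race, data = d)

# F-test of whether the interaction terms reduce the residual
# sum of squares by more than chance would explain
comparison <- anova(main_effects, with_interaction)
print(comparison)
```

The `Df` column counts the added interaction terms (two here, one per non-reference race level), matching the layout of the table above.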

Interaction Effects

Income ~ Gender + … + **Gender * Education Level**

This regression allows us to evaluate how the income gap between men and women varies with education level. Notice that the GenderFemale interaction coefficients generally increase in magnitude (becoming more negative) as education level increases, although only the terms for 4 or more years of college are significant at the 95% confidence level. This effect is a consequence of their male counterparts (who are in the same ‘bucket’ for the other regression variables but are male) earning increasingly more than women. For example, the gap for a woman with 4 years of college completed is $39398 larger than the gap at the reference education level; adding the GenderFemale main effect of -11414, she is estimated to earn about $50812 less than a man with 4 years of college completed and the same other attributes (marital status, etc.).

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	19122	6360.45	3.006	0.003
GenderFemale	-11414	8589.75	-1.329	0.184
RaceBlack	-9689	2022.11	-4.792	0.000
RaceNon-Black/Hispanic	2854	1838.12	1.553	0.121
GradeCompleted_20129th Grade	6063	8084.54	0.750	0.453
GradeCompleted_201210th Grade	2422	8576.39	0.282	0.778
GradeCompleted_201211th Grade	6057	8054.58	0.752	0.452
GradeCompleted_201212th Grade	18970	6088.08	3.116	0.002
GradeCompleted_20121st Year College	33303	6731.88	4.947	0.000
GradeCompleted_20122nd Year College	30750	6618.76	4.646	0.000
GradeCompleted_20123rd Year College	40414	7271.14	5.558	0.000
GradeCompleted_20124th Year College	77505	6480.13	11.960	0.000
GradeCompleted_20125th Year/More College	104554	6594.11	15.856	0.000
RegionOfCurrentResidence_2012North Central	-7266	2101.33	-3.458	0.001
RegionOfCurrentResidence_2012South	-6206	1928.34	-3.218	0.001
RegionOfCurrentResidence_2012West	-6271	2232.91	-2.809	0.005
MaritalStatus_2000Married	10602	1747.97	6.065	0.000
MaritalStatus_2000Separated	1819	3001.55	0.606	0.545
MaritalStatus_2000Divorced	5251	2206.10	2.380	0.017
MaritalStatus_2000Widowed	8327	6828.86	1.219	0.223
GenderFemale:GradeCompleted_20129th Grade	-633	12763.07	-0.050	0.960
GenderFemale:GradeCompleted_201210th Grade	870	12640.48	0.069	0.945
GenderFemale:GradeCompleted_201211th Grade	124	11965.78	0.010	0.992
GenderFemale:GradeCompleted_201212th Grade	-4450	8800.54	-0.506	0.613
GenderFemale:GradeCompleted_20121st Year College	-11291	9560.32	-1.181	0.238
GenderFemale:GradeCompleted_20122nd Year College	-7813	9410.87	-0.830	0.406
GenderFemale:GradeCompleted_20123rd Year College	-18892	10148.59	-1.862	0.063
GenderFemale:GradeCompleted_20124th Year College	-39398	9285.43	-4.243	0.000
GenderFemale:GradeCompleted_20125th Year/More College	-55082	9373.84	-5.876	0.000
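To read the estimated gap at a given education level from this table, add the GenderFemale main effect to the matching interaction coefficient. The sketch below does that arithmetic with point estimates copied from the table; the variable names are illustrative, and the results ignore the standard errors (several of the individual interaction terms are not significant).

```r
# Point estimates copied from the regression table above
female_main <- -11414
female_by_education <- c(
  "12th Grade"            = -4450,
  "1st Year College"      = -11291,
  "2nd Year College"      = -7813,
  "3rd Year College"      = -18892,
  "4th Year College"      = -39398,
  "5th Year/More College" = -55082
)

# Estimated female-minus-male income difference at each education
# level, holding race, region, and marital status fixed
estimated_gap <- female_main + female_by_education
print(estimated_gap)
```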

Regression Comparison

The regression model that includes an interaction term between education level and gender indicates that education level does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * GradeCompleted_2012
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6143 1.4851e+13  9 5.0049e+11 23.003 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction Effects

Income ~ Gender + … + **Gender * Geographic Region**

When including an interaction term of gender with geographic region, notice that all of the GenderFemale:Region coefficients are positive. Because the Northeast is the reference region, this indicates that women in the Northeast experience a larger income gap. For example, a woman in the North Central region experiences an income gap $3933 smaller than a woman with the same qualifications in the Northeast.

Note that the t-statistics vary in significance: the South and West interaction terms are significant at the 95% confidence level, while the North Central term is not (p = 0.353).

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	29015	5179.78	5.602	0.000
GenderFemale	-34149	3332.04	-10.249	0.000
RaceBlack	-9777	2051.47	-4.766	0.000
RaceNon-Black/Hispanic	3080	1863.58	1.653	0.098
GradeCompleted_20129th Grade	3927	6295.72	0.624	0.533
GradeCompleted_201210th Grade	2519	6409.32	0.393	0.694
GradeCompleted_201211th Grade	5264	6060.06	0.869	0.385
GradeCompleted_201212th Grade	16873	4503.76	3.746	0.000
GradeCompleted_20121st Year College	28318	4878.06	5.805	0.000
GradeCompleted_20122nd Year College	27784	4806.87	5.780	0.000
GradeCompleted_20123rd Year College	31182	5165.05	6.037	0.000
GradeCompleted_20124th Year College	57639	4778.75	12.062	0.000
GradeCompleted_20125th Year/More College	74491	4814.94	15.471	0.000
RegionOfCurrentResidence_2012North Central	-8736	3037.97	-2.876	0.004
RegionOfCurrentResidence_2012South	-10912	2781.96	-3.922	0.000
RegionOfCurrentResidence_2012West	-10428	3173.16	-3.286	0.001
MaritalStatus_2000Married	11734	1772.58	6.620	0.000
MaritalStatus_2000Separated	3151	3043.21	1.035	0.301
MaritalStatus_2000Divorced	5594	2236.23	2.501	0.012
MaritalStatus_2000Widowed	11439	6925.01	1.652	0.099
GenderFemale:RegionOfCurrentResidence_2012North Central	3933	4236.04	0.928	0.353
GenderFemale:RegionOfCurrentResidence_2012South	9618	3858.84	2.492	0.013
GenderFemale:RegionOfCurrentResidence_2012West	8796	4414.11	1.993	0.046

Regression Comparison

The regression model that includes an interaction term between geographic region and gender indicates that geographic region has a marginally significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and a P-value just below 0.05 (0.048).

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * RegionOfCurrentResidence_2012
##   Res.Df        RSS Df  Sum of Sq      F  Pr(>F)  
## 1   6152 1.5352e+13                               
## 2   6149 1.5332e+13  3 1.9761e+10 2.6418 0.04767 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interaction Effects

Income ~ Gender + … + **Gender * Marital Status**

When we add an interaction term for gender and marital status, we see an interesting effect. Women who were married as of year 2000 experience an income gap that is $30114 larger than women with the same qualifications who never married. A similar but smaller effect is seen for women who are divorced ($9112); the estimate for widowed women points the same way but is not statistically significant. This could be because the women in this category are educated but end their career early or take a less demanding job to raise a family, etc.

Term	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	17024	5051.46	3.370	0.001
GenderFemale	-7884	2941.84	-2.680	0.007
RaceBlack	-10335	2036.55	-5.075	0.000
RaceNon-Black/Hispanic	3419	1848.80	1.849	0.064
GradeCompleted_20129th Grade	4373	6247.53	0.700	0.484
GradeCompleted_201210th Grade	3282	6356.89	0.516	0.606
GradeCompleted_201211th Grade	7119	6016.28	1.183	0.237
GradeCompleted_201212th Grade	17402	4467.01	3.896	0.000
GradeCompleted_20121st Year College	28670	4836.33	5.928	0.000
GradeCompleted_20122nd Year College	27885	4766.46	5.850	0.000
GradeCompleted_20123rd Year College	31456	5125.57	6.137	0.000
GradeCompleted_20124th Year College	57505	4739.13	12.134	0.000
GradeCompleted_20125th Year/More College	73989	4775.98	15.492	0.000
RegionOfCurrentResidence_2012North Central	-7452	2116.96	-3.520	0.000
RegionOfCurrentResidence_2012South	-6303	1942.26	-3.245	0.001
RegionOfCurrentResidence_2012West	-6270	2248.08	-2.789	0.005
MaritalStatus_2000Married	25682	2341.13	10.970	0.000
MaritalStatus_2000Separated	-21	4662.42	-0.004	0.996
MaritalStatus_2000Divorced	8081	3210.68	2.517	0.012
MaritalStatus_2000Widowed	8888	14456.33	0.615	0.539
GenderFemale:MaritalStatus_2000Married	-30114	3366.00	-8.947	0.000
GenderFemale:MaritalStatus_2000Separated	68	6135.03	0.011	0.991
GenderFemale:MaritalStatus_2000Divorced	-9112	4404.25	-2.069	0.039
GenderFemale:MaritalStatus_2000Widowed	-5134	16460.84	-0.312	0.755

Regression Comparison

The regression model that includes an interaction term between marital status and gender indicates that marital status does have a significant impact on the gender income gap. This is indicated by the reduced residual sum of squares and the P-value < 0.0001.

## Analysis of Variance Table
## 
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 + 
##     MaritalStatus_2000 + Gender * MaritalStatus_2000
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   6152 1.5352e+13                                   
## 2   6148 1.5083e+13  4 2.6906e+11 27.419 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How ‘good’ is our initial (interaction-less) regression model?

Residuals vs. Fitted

The residuals versus fitted plot for the linear regression indicates two problems with our model. First, if you track the red line, the average residual (the part of income that our regression did not account for) decreases as the fitted value increases rather than staying flat around zero. This indicates a non-linear relationship.

Second, above the primary band of plotted points, there is a second band that exists. This separation indicates there is a second influence on the residuals produced by our regression model.

Normal Q-Q

The Normal Q-Q plot is useful for evaluating the normality of our residuals. It plots the quantiles of the standardized residuals against the quantiles expected from a normal distribution. If the residuals are normal, then the points on the Normal Q-Q plot will fall along a straight line. In the case of our initial regression, you can see that the points do not follow a line, so the residuals of our income model are not normally distributed. The highest incomes produce residuals much larger than a normal distribution would predict, and there is a subtle dip below the expected normal value on the lower range.

Scale-Location

The scale-location plot is useful for evaluating whether our data is ‘homoscedastic’ or ‘heteroscedastic’. In this case, our data appears to be heteroscedastic - the spread of the residuals appears to increase as the fitted value increases. Our ability to predict the value of income given the variables we used as inputs in our regression decreases as the predicted value increases.

Residuals vs Leverage

The residuals versus leverage plot indicates that none of our data values were overly influential in the regression analysis. If there were influential cases, they would appear outside of the Cook’s distance contours (which do not appear within the range of this plot). If we were to exclude any single value, the regression model would not change much.
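The four diagnostic plots discussed above are the defaults produced by calling plot() on a fitted lm object. The sketch below generates them for a simulated model whose noise grows with the predictor; the variables `x` and `y` and the 0.5 Cook's distance threshold are assumptions for illustration, not the income model itself.

```r
# Simulated data with noise that grows with x (heteroscedastic by
# construction); hypothetical values, NOT the NLSY79 data
set.seed(99)
n <- 300
x <- runif(n, 0, 10)
y <- 20000 + 5000 * x + rnorm(n, sd = 2000 * (1 + x))
fit <- lm(y ~ x)

# Draw the four standard diagnostics: 1 = residuals vs fitted,
# 2 = normal Q-Q, 3 = scale-location, 5 = residuals vs leverage
pdf(NULL)  # discard the graphics output; remove this line to view the plots
par(mfrow = c(2, 2))
plot(fit, which = c(1, 2, 3, 5))
invisible(dev.off())

# Numeric counterpart to the leverage plot: Cook's distance per point
influence <- cooks.distance(fit)
print(sum(influence > 0.5))  # count of points past a common rule-of-thumb contour
```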

How ‘good’ is one of our interactive regression models?

Below are the diagnostic plots for the interactive regression where we consider the effect of education level on the gender gap. Notice how the plots largely remain unchanged. The residuals are still not normal in shape, so using a linear regression may not be advisable. Also notice how the data still demonstrates heteroscedasticity.


4. Closing Discussion

The goal of this analysis was to identify how other variables relate to the income gap between men and women. To begin, we had to clean the NLSY79 data set and determine a strategy for dealing with non-normal data, top-coded values, and missing values. Following that, we discussed the impact of variables on the income gap by using regression analysis and the R anova() function to compare two regression models. We noticed significant effects on the size of the income gap between men and women associated with the race, geographic region, education level, and marital status variables. These effects are described in more detail above.

How confident are we in the results?

Unfortunately, a rigorous analysis of all 67+ variables in the NLSY79 data set could not be completed; such an analysis would help explain other relationships in the dataset. For example: how much collinearity is present in our regression models? Additionally, the relationship between the variables in some cases, as described above, is not significant or is close to the 95% confidence cut-off. These could be re-evaluated without removing all records with missing values from the set of variables that we chose; the significance of the relationship may change. Also, the data demonstrates heteroscedasticity, which may point to omitted variable bias (another variable that would explain how our residuals vary with our fitted values). Lastly, the regression diagnostic plots indicate that the residuals are not normally distributed, so a linear regression model will not fully describe the effects observed and its significance tests should be read with caution. These considerations reduce the confidence in the analysis and in the regression models produced.

