The Gender Income Gap
How do factors (besides gender) affect the gender income gap?
Is there a significant difference in income between men and women? Does the difference vary depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.)?
The issue of gender-income disparity is not new, but it is nuanced. To answer these questions, we rely on the National Longitudinal Survey of Youth, 1979 cohort, data set (abbreviated ‘NLSY79’). First, to draw strong conclusions, we must evaluate the data set provided: is it accurate, relevant, and useful for drawing statistical conclusions? Second, once we have summarized the data, we discuss methodology: how should we approach the data, what variables should we consider, and what techniques are appropriate? Third, we discuss findings about the sampled individuals and attempt to infer how the income difference (if any) between men and women in the larger population depends on other factors. Lastly, we end with a discussion of the relevance and significance of our findings given the context of the survey data available and the methodology used.
1. Data Summary
The NLSY79 dataset contains records for a national sample of over 12,000 men and women who were aged 14-22 when first surveyed in 1979. The dataset contains 67 attributes about these individuals, ranging from basic demographic information such as race and gender to more descriptive information such as family size, region, and drug use.
In the boxplot below, you can see the distribution of incomes for men and women. Notice that the median income for men is higher than the median income for women and that, in general, the income of men varies more than the income of women. Also notice the outliers for both men and women at the top of the income scale.
The income data from the National Longitudinal Survey of Youth, 1979 cohort, data set is top-coded. The income values for the top 2% of earners are set to the average income of the top 2% of earners (to obscure the actual values). While this change seems harmless, and would be when calculating simple averages, it does have potential to affect our analysis. By setting values of the top earners to the average of that group, the standard deviation for the top 2% has been eliminated. This subtle change will be further examined when we determine and discuss significance of our results later.
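The effect of top-coding described above can be illustrated with a small simulation. This is a minimal sketch in Python (the report's own output comes from R), using made-up log-normal incomes rather than the NLSY79 data; the helper name `top_code` is hypothetical. It shows that replacing the top 2% of values with their group mean leaves the overall mean unchanged while shrinking the standard deviation.

```python
import random
import statistics


def top_code(incomes, top_share=0.02):
    """Replace the top `top_share` of values with the mean of that group."""
    ordered = sorted(incomes)
    n_top = max(1, round(len(ordered) * top_share))
    cutoff_index = len(ordered) - n_top
    top_mean = statistics.fmean(ordered[cutoff_index:])
    return ordered[:cutoff_index] + [top_mean] * n_top


random.seed(0)
# Hypothetical right-skewed incomes; NOT the NLSY79 data.
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(5000)]
coded = top_code(incomes)

# The mean survives top-coding, the spread does not.
print("mean:", round(statistics.fmean(incomes)), round(statistics.fmean(coded)))
print("sd:  ", round(statistics.pstdev(incomes)), round(statistics.pstdev(coded)))
```

Because every top earner now carries the same value, any statistic that depends on the spread of incomes (standard errors, confidence intervals, residuals) is computed from less variation than actually exists.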
But first, let’s examine the data available and begin to home in on a few key variables.
Gender:
The initial instinct to answer the question is to simply compare the average income of men with the average income of women. As can be seen in the bar chart:
- The average income of men in our dataset is approximately $24961 higher than that of women.
- The 95% confidence intervals do not overlap, so the difference is statistically significant (p < 0.0001).
This depiction of the gap indicates that there may well be a problem, but how can we confidently conclude that there is an income gap? Do men work more? Do they have more education? What happens when we include other variables?
Individual’s Race:
When we include race, we can see that the disparity between men and women changes:
- The race with the largest difference in average income is the non-hispanic/non-black group.
- The income difference by gender, when holding race constant by using sub-groups, is significant for all 3 races.
- Note that women in the non-hispanic/non-black group earn more on average than black men (although we can show that this difference is not significant).
Geographic Region:
What if you replace race with region of the country?
- You can see the average income in the northeast is higher than in the other 3 regions, but the gap between men and women remains relatively similar.
- Most of the respondents are from the south or north central regions.
- All differences in income by gender and region appear to be significant at the 95% confidence level.
Marital Status:
- The majority of respondents are either married, never married, or divorced.
- The most striking observation from this chart is that married men earn more than men who are never married, separated, divorced, or widowed, and that there is no corresponding spike (at least none proportional to the spike for men) in income for married women.
- The differences in income by gender only appear to be significant for married and divorced individuals.
Country of Origin:
A few noteworthy items about the Country of Origin variable:
- The majority of individuals who are included in the NLSY79 survey are originally from the United States.
- Regardless of the country, it appears that the income difference between men and women is significant.
- Based on the data, men from outside of the U.S. see a small bump in pay relative to their counterparts from the United States. Women, however, do not appear to experience the same bump.
Education Level:
- The education variable, coupled with the gender variable, shows an interesting relationship:
- Once the 10th grade education level is attained, we begin to see a significant difference in the incomes between men and women.
- As education level increases, the gap in income between the genders also increases.
- Note that the sample sizes for individuals with an 11th grade education or lower are smaller.
Industry/Business:
How does the gender income gap change across businesses or industries? This question could be useful for analyzing the gender gap and there are a few obvious relationships:
- Notice the 95% confidence interval is large for several subgroups. For example, only 3 of the remaining respondents are women in the Armed Forces. The standard deviation for this group is very high, causing the confidence interval to dip noticeably below $0.
- The income gap between men and women is statistically significant at the 95% confidence level in several industries: Manufacturing, Wholesale/Resale Trade, Information, Finance/Insurance, Professional Services, Educational Services, and Healthcare/Social Assistance.
Due to the granularity of this variable and the lack of surveyed individuals for some of the industry~gender sub-groupings, this variable may not be helpful later when we perform regression analysis.
2. Methodology
In this section, we evaluate how to handle anomalies that arise during most data analyses: missing values, inappropriate values, top-coded values, and final variable selection.
Missing Values
As is standard in data analysis, missing values will occur. Handling missing values is a delicate process which requires case-by-case examination. For the variables used in this analysis, the general approach was to remove records with missing values. Unfortunately, this means that an individual who has not appropriately reported marital status but has reported all other variables (including 2012 income) was removed. This process may not be acceptable for more robust analyses, but for establishing cursory findings about the relationship of income and gender and the impact of other factors, omission is not ideal but acceptable.
R provides ways to account for missing values (encode them as NA and pass na.rm = TRUE to summary functions), but for the purpose of this analysis the records with missing values were removed.
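The removal rule described above is often called listwise deletion. A minimal sketch in Python follows, with made-up records and illustrative field names (these are not the NLSY79 column names, and `drop_incomplete` is a hypothetical helper). Note how the second record is dropped even though its income is present, which is exactly the cost discussed above.

```python
# Hypothetical records; field names are illustrative, not the NLSY79 columns.
records = [
    {"gender": "Female", "marital_status": "Married", "income": 42000},
    {"gender": "Male", "marital_status": None, "income": 61000},
    {"gender": "Male", "marital_status": "Divorced", "income": None},
    {"gender": "Female", "marital_status": "Never Married", "income": 35000},
]


def drop_incomplete(rows, required):
    """Keep only the rows in which every required field has a value."""
    return [row for row in rows if all(row[field] is not None for field in required)]


complete = drop_incomplete(records, ["gender", "marital_status", "income"])
print(len(records), "records before,", len(complete), "after")
```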
Inappropriate Values
As a consequence of the survey process, which assesses different aspects of human life, the participants are not always able to provide accurate information. Unfortunately, this lack of information presents itself as unhelpful categorical values.
The following categories of values were removed from the corresponding variable (these data cleanup changes are reflected in the plots shown in the Data Summary section above):
- ‘Unknown’ from Country of Origin
- ‘Non-interview’, ‘Invalid Skip’ from Marital Status (2000)
- ‘Refusal’, ‘Do not know’ from Region of Current Residence
- ‘Ungraded’ from Education Level
- ‘Error’, ‘Not in Labor Force’ from Industry/Business
Top-Coded Income Variable
The income variable that was used (from the 2012 survey) is top-coded. The top 2% of incomes were replaced with the average of the group. While comparing simple averages is okay, this top-coding presents a problem for deeper analysis as the variation within the top 2% of incomes is eliminated. When performing regression, the residuals produced do not account for the natural variance in the top-coded values. This may reduce the residual sum of squares and artificially increase the significance of a test (e.g., the t-statistic).
Selecting Variables
Initially, before summarizing the data, variables were chosen to identify whether common demographic details for an individual might affect their income relative to individuals of the opposite sex. To move forward with the analysis, we must decide on a set of variables that could affect the difference in income for men and women. The variables chosen for additional analysis were:
- Income
- Gender
- Race
- Education Level
- Geographic Region
- Marital Status
Variables that were previously evaluated but since removed from the analysis include:
- Country of Origin - Based on the bar chart above, this variable is not expected to affect the income gap.
- Industry/Business - This variable subdivides the data into groups that are too small to draw statistically significant conclusions from when performing regression analysis.
3. Findings
The initial question asks us to evaluate whether there is a difference in income between men and women. This question seems simple. Yet, when we attempt to answer it in simple terms (see the comparison of the average income of men versus the average income of women) we are left feeling unsatisfied. This lack of satisfaction is owed to omitted variable bias: leaving out variables that could increase or decrease the difference in income by gender. What other variables impact the difference in income? By how much? Which have more influence? We attempt to answer these questions in this section in order to more satisfyingly answer the initial question.
We can perform regression analysis (including a range of variables) to help isolate the effect of each variable on income. If we regress income on gender alone, we see familiar results:
To interpret the regression output, pay close attention to the GenderFemale estimate. This value represents how much less a woman can expect to earn than a man (while considering no other variables), and it is statistically significant. Notice that the value is the same as when we compared the simple averages!
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 56132 | 1014.78 | 55.314 | 0 |
GenderFemale | -24961 | 1412.64 | -17.669 | 0 |
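The table above can be read back into group averages with a little arithmetic. The sketch below, written in Python for illustration (the report's own output comes from R), copies the four numbers from the table and assumes only that Male is the baseline gender, which the GenderFemale label implies. The t value recomputed here differs from the table in the third decimal place because the estimate shown in the table is rounded.

```python
# Values copied from the Income ~ Gender regression table.
intercept = 56132        # average income of the baseline group (men)
female_estimate = -24961
female_std_error = 1412.64

mean_income_men = intercept
mean_income_women = intercept + female_estimate

# A t value is the estimate divided by its standard error.
t_female = female_estimate / female_std_error

print("men:", mean_income_men)
print("women:", mean_income_women)
print("t value:", round(t_female, 2))
```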
As mentioned in this section’s preface, we want to develop a more satisfying model for evaluating the effect of other variables on income. Let’s see what happens when we account for other variables:
Main Effects
Income ~ Gender + Race + Education Level + Geographic Region + Marital Status
This regression model provides a list of coefficient estimates based on the categorical values provided in the Gender, Race, Education Level, Geographic Region, and Marital Status variables. For example, a woman is expected to earn approximately $27442 less than a man who matches her on the other criteria (race, marital status, education level, etc.).
Additionally, living in the Northeast is associated with an income $6749 higher than that of someone similar living in the North Central US, $5968 higher than someone similar living in the South, and $5976 higher than someone similar living in the West.
But we want to compare how these variables affect the gap in income between men and women. Then let’s do it!
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 25581 | 4951.04 | 5.167 | 0.000 |
GenderFemale | -27442 | 1288.03 | -21.305 | 0.000 |
RaceBlack | -9767 | 2052.25 | -4.759 | 0.000 |
RaceNon-Black/Hispanic | 3066 | 1864.02 | 1.645 | 0.100 |
GradeCompleted_20129th Grade | 3803 | 6297.86 | 0.604 | 0.546 |
GradeCompleted_201210th Grade | 2557 | 6410.46 | 0.399 | 0.690 |
GradeCompleted_201211th Grade | 5478 | 6061.20 | 0.904 | 0.366 |
GradeCompleted_201212th Grade | 16970 | 4504.37 | 3.767 | 0.000 |
GradeCompleted_20121st Year College | 28402 | 4876.79 | 5.824 | 0.000 |
GradeCompleted_20122nd Year College | 27868 | 4806.69 | 5.798 | 0.000 |
GradeCompleted_20123rd Year College | 31167 | 5166.32 | 6.033 | 0.000 |
GradeCompleted_20124th Year College | 57692 | 4779.08 | 12.072 | 0.000 |
GradeCompleted_20125th Year/More College | 74627 | 4815.29 | 15.498 | 0.000 |
RegionOfCurrentResidence_2012North Central | -6749 | 2133.76 | -3.163 | 0.002 |
RegionOfCurrentResidence_2012South | -5968 | 1958.07 | -3.048 | 0.002 |
RegionOfCurrentResidence_2012West | -5976 | 2266.60 | -2.637 | 0.008 |
MaritalStatus_2000Married | 11664 | 1772.77 | 6.579 | 0.000 |
MaritalStatus_2000Separated | 3249 | 3044.14 | 1.067 | 0.286 |
MaritalStatus_2000Divorced | 5491 | 2236.76 | 2.455 | 0.014 |
MaritalStatus_2000Widowed | 11400 | 6925.50 | 1.646 | 0.100 |
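To see how the main-effects coefficients combine, the following Python sketch predicts income for one illustrative profile: a married, non-black/non-hispanic respondent with 4 years of college living in the Northeast. The coefficients are copied from the table above; the profile and the `predict` helper are hypothetical, and the sketch assumes the Northeast is the baseline region (it has no row in the table), so no region term is added.

```python
# Coefficients copied from the main-effects regression table.
main_effects = {
    "(Intercept)": 25581,
    "GenderFemale": -27442,
    "RaceNon-Black/Hispanic": 3066,
    "GradeCompleted_20124th Year College": 57692,
    "MaritalStatus_2000Married": 11664,
}


def predict(active_terms):
    """Add the intercept to the coefficient of every term that applies."""
    return main_effects["(Intercept)"] + sum(main_effects[term] for term in active_terms)


# Terms shared by the man and the woman in this illustrative profile.
shared_terms = [
    "RaceNon-Black/Hispanic",
    "GradeCompleted_20124th Year College",
    "MaritalStatus_2000Married",
]

predicted_man = predict(shared_terms)
predicted_woman = predict(shared_terms + ["GenderFemale"])

print(predicted_man, predicted_woman, predicted_woman - predicted_man)
```

In a model without interaction terms the difference between the two predictions is always the GenderFemale coefficient, whatever profile is chosen. That is why interaction terms are needed to study how the gap itself varies.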
Interaction Effects
Income ~ Gender + … + Gender * Race
This regression allows us to evaluate how race interacts with gender and how that affects the income gap. Notice that the variable GenderFemale:RaceNon-Black/Hispanic has a coefficient of -16267, which is statistically significant and indicates that, on average, a woman who is not black or hispanic experiences a larger pay gap (the gap is $16267 larger than for a hispanic woman). This effect is likely due to non-black/non-hispanic individuals earning more in general, as shown in the graph above.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 21729 | 5152.83 | 4.217 | 0.000 |
GenderFemale | -21368 | 2967.80 | -7.200 | 0.000 |
RaceBlack | -13622 | 2823.48 | -4.825 | 0.000 |
RaceNon-Black/Hispanic | 11440 | 2576.92 | 4.439 | 0.000 |
GradeCompleted_20129th Grade | 4811 | 6266.17 | 0.768 | 0.443 |
GradeCompleted_201210th Grade | 4102 | 6382.31 | 0.643 | 0.520 |
GradeCompleted_201211th Grade | 6873 | 6030.92 | 1.140 | 0.255 |
GradeCompleted_201212th Grade | 17917 | 4481.71 | 3.998 | 0.000 |
GradeCompleted_20121st Year College | 28884 | 4851.89 | 5.953 | 0.000 |
GradeCompleted_20122nd Year College | 28180 | 4782.15 | 5.893 | 0.000 |
GradeCompleted_20123rd Year College | 31579 | 5139.00 | 6.145 | 0.000 |
GradeCompleted_20124th Year College | 57961 | 4753.39 | 12.194 | 0.000 |
GradeCompleted_20125th Year/More College | 75273 | 4790.83 | 15.712 | 0.000 |
RegionOfCurrentResidence_2012North Central | -7160 | 2122.26 | -3.374 | 0.001 |
RegionOfCurrentResidence_2012South | -6340 | 1947.47 | -3.256 | 0.001 |
RegionOfCurrentResidence_2012West | -6047 | 2253.87 | -2.683 | 0.007 |
MaritalStatus_2000Married | 12031 | 1763.45 | 6.822 | 0.000 |
MaritalStatus_2000Separated | 2662 | 3027.92 | 0.879 | 0.379 |
MaritalStatus_2000Divorced | 5588 | 2224.14 | 2.512 | 0.012 |
MaritalStatus_2000Widowed | 10473 | 6889.31 | 1.520 | 0.129 |
GenderFemale:RaceBlack | 7540 | 3741.25 | 2.015 | 0.044 |
GenderFemale:RaceNon-Black/Hispanic | -16267 | 3452.93 | -4.711 | 0.000 |
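With an interaction term, the estimated gap for a given race is the GenderFemale coefficient plus the matching GenderFemale:Race coefficient. The Python sketch below works this out from the table above. It assumes Hispanic is the baseline race, since it is the one category without its own row.

```python
# Coefficients copied from the Gender * Race regression table.
gender_female = -21368
race_interactions = {
    "Hispanic": 0,  # assumed baseline race, so no interaction term is added
    "Black": 7540,
    "Non-Black/Non-Hispanic": -16267,
}

# Estimated gap = GenderFemale + the matching GenderFemale:Race coefficient.
gap_by_race = {race: gender_female + extra for race, extra in race_interactions.items()}

for race, gap in gap_by_race.items():
    print(f"{race}: {gap}")
```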
Regression Comparison
The regression model that includes an interaction term between race and gender indicates that race does have a significant impact on the gender income gap. This is indicated by the F-test comparing the two models: the residual sum of squares is reduced and the P-value is < 0.0001.
## Analysis of Variance Table
##
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000 + Gender * Race
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6152 1.5352e+13
## 2 6150 1.5173e+13 2 1.7834e+11 36.142 2.484e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
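The F statistic in the table above can be reproduced by hand from the other columns. This Python sketch uses the standard formula for comparing nested linear models; the function name `nested_f_statistic` is a hypothetical helper, and the inputs are the rounded values printed in the table, so the result only matches the reported F to about two decimal places.

```python
def nested_f_statistic(sum_sq_explained, extra_terms, rss_full, df_full):
    """F statistic for comparing a full model against a nested, smaller one.

    sum_sq_explained: drop in residual sum of squares from adding the terms
    extra_terms:      number of coefficients added (the Df column)
    rss_full:         residual sum of squares of the larger model
    df_full:          residual degrees of freedom of the larger model
    """
    explained_per_term = sum_sq_explained / extra_terms
    residual_variance = rss_full / df_full
    return explained_per_term / residual_variance


# Values copied from the Gender * Race analysis of variance table.
f_race = nested_f_statistic(
    sum_sq_explained=1.7834e11, extra_terms=2, rss_full=1.5173e13, df_full=6150
)
print(round(f_race, 2))
```

A large F means the added interaction terms explain much more variation per coefficient than would be expected from noise alone.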
Interaction Effects
Income ~ Gender + … + Gender * Education Level
This regression allows us to evaluate how education level interacts with gender and how that affects the income gap. Notice that the interaction coefficients for women grow in magnitude (becoming more negative) as education level increases. This is a consequence of their male counterparts (who meet the other regression criteria, or are in the same ‘bucket’, but are male) earning increasingly more than women. For example, the income gap for a woman with 4 years of college completed is $39398 larger than the gap at the baseline education level. Adding the GenderFemale coefficient of -11414 (which is not itself statistically significant), she is estimated to earn about $50812 less than a man with 4 years of college completed and the same other attributes (marital status, etc.).
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 19122 | 6360.45 | 3.006 | 0.003 |
GenderFemale | -11414 | 8589.75 | -1.329 | 0.184 |
RaceBlack | -9689 | 2022.11 | -4.792 | 0.000 |
RaceNon-Black/Hispanic | 2854 | 1838.12 | 1.553 | 0.121 |
GradeCompleted_20129th Grade | 6063 | 8084.54 | 0.750 | 0.453 |
GradeCompleted_201210th Grade | 2422 | 8576.39 | 0.282 | 0.778 |
GradeCompleted_201211th Grade | 6057 | 8054.58 | 0.752 | 0.452 |
GradeCompleted_201212th Grade | 18970 | 6088.08 | 3.116 | 0.002 |
GradeCompleted_20121st Year College | 33303 | 6731.88 | 4.947 | 0.000 |
GradeCompleted_20122nd Year College | 30750 | 6618.76 | 4.646 | 0.000 |
GradeCompleted_20123rd Year College | 40414 | 7271.14 | 5.558 | 0.000 |
GradeCompleted_20124th Year College | 77505 | 6480.13 | 11.960 | 0.000 |
GradeCompleted_20125th Year/More College | 104554 | 6594.11 | 15.856 | 0.000 |
RegionOfCurrentResidence_2012North Central | -7266 | 2101.33 | -3.458 | 0.001 |
RegionOfCurrentResidence_2012South | -6206 | 1928.34 | -3.218 | 0.001 |
RegionOfCurrentResidence_2012West | -6271 | 2232.91 | -2.809 | 0.005 |
MaritalStatus_2000Married | 10602 | 1747.97 | 6.065 | 0.000 |
MaritalStatus_2000Separated | 1819 | 3001.55 | 0.606 | 0.545 |
MaritalStatus_2000Divorced | 5251 | 2206.10 | 2.380 | 0.017 |
MaritalStatus_2000Widowed | 8327 | 6828.86 | 1.219 | 0.223 |
GenderFemale:GradeCompleted_20129th Grade | -633 | 12763.07 | -0.050 | 0.960 |
GenderFemale:GradeCompleted_201210th Grade | 870 | 12640.48 | 0.069 | 0.945 |
GenderFemale:GradeCompleted_201211th Grade | 124 | 11965.78 | 0.010 | 0.992 |
GenderFemale:GradeCompleted_201212th Grade | -4450 | 8800.54 | -0.506 | 0.613 |
GenderFemale:GradeCompleted_20121st Year College | -11291 | 9560.32 | -1.181 | 0.238 |
GenderFemale:GradeCompleted_20122nd Year College | -7813 | 9410.87 | -0.830 | 0.406 |
GenderFemale:GradeCompleted_20123rd Year College | -18892 | 10148.59 | -1.862 | 0.063 |
GenderFemale:GradeCompleted_20124th Year College | -39398 | 9285.43 | -4.243 | 0.000 |
GenderFemale:GradeCompleted_20125th Year/More College | -55082 | 9373.84 | -5.876 | 0.000 |
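Each interaction coefficient above is an adjustment to the baseline gap, not the gap itself. The Python sketch below adds the GenderFemale coefficient to each GenderFemale:GradeCompleted_2012 coefficient to give the estimated total gap at every education level listed in the table. Several of these coefficients have large standard errors, so the totals for the lower education levels should be treated as rough.

```python
# Coefficients copied from the Gender * Education Level regression table.
gender_female = -11414
education_interactions = {
    "9th Grade": -633,
    "10th Grade": 870,
    "11th Grade": 124,
    "12th Grade": -4450,
    "1st Year College": -11291,
    "2nd Year College": -7813,
    "3rd Year College": -18892,
    "4th Year College": -39398,
    "5th Year/More College": -55082,
}

# Total estimated gap = baseline gap + adjustment for that education level.
total_gap = {
    level: gender_female + adjustment
    for level, adjustment in education_interactions.items()
}

for level, gap in total_gap.items():
    print(f"{level}: {gap}")
```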
Regression Comparison
The regression model that includes an interaction term between education level and gender indicates that education level does have a significant impact on the gender income gap. This is indicated by the F-test comparing the two models: the residual sum of squares is reduced and the P-value is < 0.0001.
## Analysis of Variance Table
##
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000 + Gender * GradeCompleted_2012
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6152 1.5352e+13
## 2 6143 1.4851e+13 9 5.0049e+11 23.003 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interaction Effects
Income ~ Gender + … + Gender * Geographic Region
When including an interaction term of gender with geographic region, notice that all of the GenderFemale:Region coefficients are positive. Because the Northeast is the baseline region, this indicates that women in the northeast region experience the largest income gap. For example, a woman in the North Central region experiences an income gap $3933 smaller than a woman with the same qualifications in the Northeast.
Note that the t-statistics vary in their significance: the South and West interaction terms are significant at the 5% level, while the North Central term is not (p = 0.353).
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 29015 | 5179.78 | 5.602 | 0.000 |
GenderFemale | -34149 | 3332.04 | -10.249 | 0.000 |
RaceBlack | -9777 | 2051.47 | -4.766 | 0.000 |
RaceNon-Black/Hispanic | 3080 | 1863.58 | 1.653 | 0.098 |
GradeCompleted_20129th Grade | 3927 | 6295.72 | 0.624 | 0.533 |
GradeCompleted_201210th Grade | 2519 | 6409.32 | 0.393 | 0.694 |
GradeCompleted_201211th Grade | 5264 | 6060.06 | 0.869 | 0.385 |
GradeCompleted_201212th Grade | 16873 | 4503.76 | 3.746 | 0.000 |
GradeCompleted_20121st Year College | 28318 | 4878.06 | 5.805 | 0.000 |
GradeCompleted_20122nd Year College | 27784 | 4806.87 | 5.780 | 0.000 |
GradeCompleted_20123rd Year College | 31182 | 5165.05 | 6.037 | 0.000 |
GradeCompleted_20124th Year College | 57639 | 4778.75 | 12.062 | 0.000 |
GradeCompleted_20125th Year/More College | 74491 | 4814.94 | 15.471 | 0.000 |
RegionOfCurrentResidence_2012North Central | -8736 | 3037.97 | -2.876 | 0.004 |
RegionOfCurrentResidence_2012South | -10912 | 2781.96 | -3.922 | 0.000 |
RegionOfCurrentResidence_2012West | -10428 | 3173.16 | -3.286 | 0.001 |
MaritalStatus_2000Married | 11734 | 1772.58 | 6.620 | 0.000 |
MaritalStatus_2000Separated | 3151 | 3043.21 | 1.035 | 0.301 |
MaritalStatus_2000Divorced | 5594 | 2236.23 | 2.501 | 0.012 |
MaritalStatus_2000Widowed | 11439 | 6925.01 | 1.652 | 0.099 |
GenderFemale:RegionOfCurrentResidence_2012North Central | 3933 | 4236.04 | 0.928 | 0.353 |
GenderFemale:RegionOfCurrentResidence_2012South | 9618 | 3858.84 | 2.492 | 0.013 |
GenderFemale:RegionOfCurrentResidence_2012West | 8796 | 4414.11 | 1.993 | 0.046 |
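One way to see why the three region interaction terms differ in significance is to turn each estimate and standard error into an approximate 95% confidence interval. The Python sketch below uses the normal approximation (a multiplier of 1.96), which is an assumption that is reasonable here because the model has several thousand residual degrees of freedom. An interval that contains zero corresponds to a term that is not significant at the 5% level.

```python
# (estimate, standard error) copied from the Gender * Region regression table.
region_interactions = {
    "North Central": (3933, 4236.04),
    "South": (9618, 3858.84),
    "West": (8796, 4414.11),
}

Z_95 = 1.96  # normal approximation to the t critical value


def confidence_interval(estimate, std_error, z=Z_95):
    """Approximate 95% confidence interval for a regression coefficient."""
    margin = z * std_error
    return estimate - margin, estimate + margin


intervals = {
    region: confidence_interval(estimate, std_error)
    for region, (estimate, std_error) in region_interactions.items()
}

for region, (lower, upper) in intervals.items():
    contains_zero = lower <= 0 <= upper
    print(f"{region}: ({lower:.0f}, {upper:.0f}) contains zero: {contains_zero}")
```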
Regression Comparison
The regression model that includes an interaction term between geographic region and gender indicates that geographic region has a marginally significant impact on the gender income gap. This is indicated by the F-test comparing the two models: the residual sum of squares is reduced and the P-value (0.048) is just below 0.05.
## Analysis of Variance Table
##
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000 + Gender * RegionOfCurrentResidence_2012
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6152 1.5352e+13
## 2 6149 1.5332e+13 3 1.9761e+10 2.6418 0.04767 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interaction Effects
Income ~ Gender + … + Gender * Marital Status
When we add an interaction term for gender and marital status, we see an interesting effect. Women who were married as of the year 2000 experience an income gap that is $30114 larger than that of women with the same qualifications who never married (the baseline category). A similar but smaller effect is seen for women who are divorced or widowed, although the widowed term is not statistically significant (p = 0.755). This could be because the women in this category are educated but end their career early or take a less demanding job to raise a family, etc.
Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
---|---|---|---|---|
(Intercept) | 17024 | 5051.46 | 3.370 | 0.001 |
GenderFemale | -7884 | 2941.84 | -2.680 | 0.007 |
RaceBlack | -10335 | 2036.55 | -5.075 | 0.000 |
RaceNon-Black/Hispanic | 3419 | 1848.80 | 1.849 | 0.064 |
GradeCompleted_20129th Grade | 4373 | 6247.53 | 0.700 | 0.484 |
GradeCompleted_201210th Grade | 3282 | 6356.89 | 0.516 | 0.606 |
GradeCompleted_201211th Grade | 7119 | 6016.28 | 1.183 | 0.237 |
GradeCompleted_201212th Grade | 17402 | 4467.01 | 3.896 | 0.000 |
GradeCompleted_20121st Year College | 28670 | 4836.33 | 5.928 | 0.000 |
GradeCompleted_20122nd Year College | 27885 | 4766.46 | 5.850 | 0.000 |
GradeCompleted_20123rd Year College | 31456 | 5125.57 | 6.137 | 0.000 |
GradeCompleted_20124th Year College | 57505 | 4739.13 | 12.134 | 0.000 |
GradeCompleted_20125th Year/More College | 73989 | 4775.98 | 15.492 | 0.000 |
RegionOfCurrentResidence_2012North Central | -7452 | 2116.96 | -3.520 | 0.000 |
RegionOfCurrentResidence_2012South | -6303 | 1942.26 | -3.245 | 0.001 |
RegionOfCurrentResidence_2012West | -6270 | 2248.08 | -2.789 | 0.005 |
MaritalStatus_2000Married | 25682 | 2341.13 | 10.970 | 0.000 |
MaritalStatus_2000Separated | -21 | 4662.42 | -0.004 | 0.996 |
MaritalStatus_2000Divorced | 8081 | 3210.68 | 2.517 | 0.012 |
MaritalStatus_2000Widowed | 8888 | 14456.33 | 0.615 | 0.539 |
GenderFemale:MaritalStatus_2000Married | -30114 | 3366.00 | -8.947 | 0.000 |
GenderFemale:MaritalStatus_2000Separated | 68 | 6135.03 | 0.011 | 0.991 |
GenderFemale:MaritalStatus_2000Divorced | -9112 | 4404.25 | -2.069 | 0.039 |
GenderFemale:MaritalStatus_2000Widowed | -5134 | 16460.84 | -0.312 | 0.755 |
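The same table can be read from a different angle: how much more (or less) does each marital status pay relative to the baseline status, separately for men and women? The Python sketch below computes this from the coefficients above. It assumes the baseline status is never married, since that is the category without its own row. For men the premium is the marital status coefficient alone; for women it is that coefficient plus the matching interaction term.

```python
# Coefficients copied from the Gender * Marital Status regression table.
marital_main = {"Married": 25682, "Separated": -21, "Divorced": 8081, "Widowed": 8888}
marital_interaction = {"Married": -30114, "Separated": 68, "Divorced": -9112, "Widowed": -5134}

# Income difference relative to the assumed baseline status (never married).
premium_men = dict(marital_main)
premium_women = {
    status: marital_main[status] + marital_interaction[status]
    for status in marital_main
}

for status in marital_main:
    print(f"{status}: men {premium_men[status]}, women {premium_women[status]}")
```

Under these estimates, marriage is associated with a large income premium for men and a small deficit for women, which matches the bar chart observation in the Data Summary that married women show no spike comparable to that of married men.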
Regression Comparison
The regression model that includes an interaction term between marital status and gender indicates that marital status does have a significant impact on the gender income gap. This is indicated by the F-test comparing the two models: the residual sum of squares is reduced and the P-value is < 0.0001.
## Analysis of Variance Table
##
## Model 1: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000
## Model 2: TotalIncome_2012 ~ Gender + Race + GradeCompleted_2012 + RegionOfCurrentResidence_2012 +
## MaritalStatus_2000 + Gender * MaritalStatus_2000
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6152 1.5352e+13
## 2 6148 1.5083e+13 4 2.6906e+11 27.419 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
How ‘good’ is our initial (interaction-less) regression model?
Residuals vs. Fitted
The residuals versus fitted plot for the linear regression indicates two problems with our data. First, if you track the red line, it is clear that the average residual (the amount that our regression did not account for in the fitted value) decreases as the fitted value increases. This indicates a non-linear relationship.
Second, above the primary band of plotted points, there is a second band. This separation indicates there is a second influence on the residuals produced by our regression model.
Normal Q-Q
The Normal Q-Q plot is useful for evaluating the normality of our residuals. It plots the quantiles of the standardized residuals against the quantiles expected from a normal distribution. If the residuals are normal, the points on the Normal Q-Q plot will fall along a straight line. In the case of our initial regression, you can see that the points do not follow a straight line, so the residuals of our income regression are not normally distributed. The upper range of incomes produces residuals much higher than the normal distribution predicts, and there is a subtle dip below the expected normal value on the lower range.
Scale-Location
The scale-location plot is useful for evaluating whether our data is ‘homoscedastic’ or ‘heteroscedastic’. In this case, our data appears to be heteroscedastic: the spread of the residuals appears to increase as the fitted value increases. Our ability to predict the value of income given the variables we used as inputs in our regression decreases as the predicted value increases.
Residuals vs Leverage
The residuals versus leverage plot indicates that none of our data values were unduly influential on the regression analysis. If there were influential cases, they would appear outside of the Cook’s distance range (which is not represented on the plot). If we were to exclude any single value, the regression model would not change much.
How ‘good’ is one of our interaction regression models?
Below are the diagnostic plots for the interaction regression where we consider the impact of education level on the gender gap. Notice how the plots largely remain unchanged. The residuals are still not normal in shape, so using a linear regression may not be advisable. Also notice how the data still demonstrates heteroscedasticity.
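For readers who want to see what a Normal Q-Q plot actually compares, the Python sketch below builds the plotted points by hand. The residuals are made up and deliberately right-skewed; they are not taken from the fitted models. The helper name `qq_points` is hypothetical, and the plotting positions (i + 0.5) / n are one common convention that may differ slightly from what a given plotting routine uses.

```python
from statistics import NormalDist, fmean, pstdev


def qq_points(residuals):
    """Pair each sorted standardized residual with its theoretical normal quantile."""
    mean, sd = fmean(residuals), pstdev(residuals)
    standardized = sorted((value - mean) / sd for value in residuals)
    n = len(standardized)
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, standardized))


# Hypothetical right-skewed residuals (in thousands of dollars).
residuals = [-30, -25, -20, -18, -15, -10, -5, 0, 8, 20, 45, 120]
points = qq_points(residuals)

largest_theoretical, largest_observed = points[-1]
print(round(largest_theoretical, 2), round(largest_observed, 2))
```

When the largest observed value sits well above its theoretical quantile, as it does for skewed data like this, the upper end of the Q-Q plot bends upward away from the straight line, which is the pattern described for the income regressions above.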