🎓 DegreeDelta: Quantifying the Income Advantage of College Degrees Across U.S. States
Group Members: Dylan Dsouza, Jaden Goelkel, Sam Genous, Anna Rosenbaum and Wanchen Yang
This notebook investigates how the financial return to a college education varies across U.S. states. Using household-level data from the 2024 American Community Survey (ACS), the education premium is quantified as the difference in household income between households headed by someone with a bachelor's degree or higher versus a high school diploma or less.
1. Statement of the Problem
College is usually linked with higher income, but that advantage might not be the same everywhere in the U.S. The goal of this project is to measure how the education premium changes by state. In this notebook, we define the education premium as the difference in household income between households where the head has a bachelor’s degree or higher and households where the head has a high school diploma or less.
We are mainly trying to answer:
- Which states have the largest income gap between BA+ and HS-or-less households?
- Which states have the smallest gap?
- Does the pattern stay similar when we use the same definition and dataset across all states?
Hypotheses
Formally, we want to test whether the education premium is basically the same everywhere, or if it changes depending on the state.
H₀ (Null): The education premium is the same across U.S. states, i.e. any differences we see are just due to sampling noise.
H₁ (Alternative): The education premium differs across U.S. states, i.e. at least some states have significantly different premiums.
Specifically, we hypothesize a directional pattern: high cost-of-living states (California, Washington, Hawaii) will show the largest education premiums, while low cost-of-living states (Arkansas, Mississippi, Iowa) will show the smallest. This prediction is examined in Section 6.5.
2. Relevance and Motivation
A lot of people talk about college as an investment, but the return probably depends on where you live. States have different job markets, industries, wage levels, and costs of living, so the income advantage of a degree might be bigger in some places and smaller in others.
This question matters because it connects to regional inequality and economic opportunity. If the premium is much higher in some states, that could mean the benefits of education are tied to local labor markets. On the other hand, if some states show a smaller premium, it might reflect fewer high-paying jobs or different demand for degrees. Quantifying these differences with real data helps move beyond assumptions like college always pays off the same.
3. Data Source
We use household-level data from the American Community Survey (ACS) 2024 Public Use Microdata Sample (PUMS). We accessed the data through IPUMS USA, which provides a cleaned and well-documented version of the ACS microdata.
The ACS is run by the U.S. Census Bureau and is commonly used for studying income, education, and demographics because it covers a large sample and includes consistent variables across states.
4. Data Description
The raw extract is person-level, with one row per household member; after the preprocessing in Section 5.4, the working table holds one row per household. The main variables we use are:
- HHINCOME: total household income (we use raw dollars because all records come from the same survey year)
- STATEFIP: state of residence (FIPS code), used to compare across states
- HHWT: household survey weight (used so results are more representative of the population)
- EDUC: education level (we use this to build education groups)
The EDUC variable uses numeric codes from 0 (no schooling) to 11 (graduate degree). For this analysis, households are recoded into two groups:
| Group | EDUC Codes | Description |
|---|---|---|
| HS or Lower | 0–6 | HS diploma/GED or lower |
| BA or Higher | 10–11 | Bachelor's degree or higher |
Households in the middle tier (codes 7–9) are retained in the dataset but excluded from premium calculations and regression models, which focus on comparing the two poles of the education distribution.
5. Data Wrangling
This section covers the data cleaning and preprocessing pipeline that turns the raw PUMS data into a processed table used for downstream analysis. It involves data loading, cleaning, and filtering. Finally, geographic and education variables are recoded into interpretable categories.
5.1 Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import scipy.stats as stats
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm
import warnings
warnings.filterwarnings("ignore")
5.2 Data Loading
Using pandas to load the raw data:
data = pd.read_csv('data.csv.gz')
data.head()
| | YEAR | SAMPLE | SERIAL | CBSERIAL | HHWT | CLUSTER | ADJUST | STATEFIP | STRATA | GQ | HHINCOME | PERNUM | PERWT | EDUC | EDUCD | INCTOT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 202401 | 1 | 2024010000060 | 41.0 | 2024000000011 | 1.01525 | 1 | 250001 | 3 | 9999999 | 1 | 41.0 | 2 | 25 | 18500 |
| 1 | 2024 | 202401 | 2 | 2024010000094 | 52.0 | 2024000000021 | 1.01525 | 1 | 260001 | 3 | 9999999 | 1 | 52.0 | 6 | 64 | 0 |
| 2 | 2024 | 202401 | 3 | 2024010000146 | 31.0 | 2024000000031 | 1.01525 | 1 | 140101 | 3 | 9999999 | 1 | 31.0 | 6 | 63 | 27100 |
| 3 | 2024 | 202401 | 4 | 2024010000156 | 4.0 | 2024000000041 | 1.01525 | 1 | 280201 | 4 | 9999999 | 1 | 4.0 | 7 | 71 | 1000 |
| 4 | 2024 | 202401 | 5 | 2024010000182 | 19.0 | 2024000000051 | 1.01525 | 1 | 80001 | 3 | 9999999 | 1 | 19.0 | 6 | 63 | 0 |
This is the raw PUMS dataset which contains household and person-level records across all US states, spanning the 2024 survey year. Each row represents an individual respondent, with variables covering income, education, geography, and survey metadata.
5.3 Data Cleaning
Dropping columns that are irrelevant to the analysis:
filtered_data = data.drop(columns=['SAMPLE', 'SERIAL', 'CBSERIAL', 'CLUSTER', 'STRATA', 'GQ', 'PERWT', 'EDUCD', 'ADJUST', 'INCTOT'])
filtered_data.head()
| | YEAR | HHWT | STATEFIP | HHINCOME | PERNUM | EDUC |
|---|---|---|---|---|---|---|
| 0 | 2024 | 41.0 | 1 | 9999999 | 1 | 2 |
| 1 | 2024 | 52.0 | 1 | 9999999 | 1 | 6 |
| 2 | 2024 | 31.0 | 1 | 9999999 | 1 | 6 |
| 3 | 2024 | 4.0 | 1 | 9999999 | 1 | 7 |
| 4 | 2024 | 19.0 | 1 | 9999999 | 1 | 6 |
Dropped columns include survey identifiers (SAMPLE, SERIAL, CBSERIAL), sampling design variables (CLUSTER, STRATA), the group quarters status flag (GQ), the person-level weight (PERWT, since we use the household weight HHWT instead), the income adjustment factor (ADJUST), and person-level total income (INCTOT, redundant given the household-level HHINCOME). The detailed education code EDUCD was also removed in favor of the simpler EDUC variable, which we recode in the preprocessing step below.
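Since most columns are discarded right after loading, the extract could also be trimmed at read time. The following is a minimal sketch rather than the notebook's actual loading step: the small in-memory CSV is a hypothetical stand-in for data.csv.gz, and KEEP_COLS is an illustrative name.

```python
import io
import pandas as pd

# Columns the analysis actually uses (names taken from the extract above).
KEEP_COLS = ["YEAR", "HHWT", "STATEFIP", "HHINCOME", "PERNUM", "EDUC"]

# Hypothetical two-row stand-in for data.csv.gz, only to make the sketch runnable.
sample_csv = io.StringIO(
    "YEAR,SAMPLE,SERIAL,HHWT,STATEFIP,GQ,HHINCOME,PERNUM,PERWT,EDUC,EDUCD,INCTOT\n"
    "2024,202401,1,41.0,1,3,9999999,1,41.0,2,25,18500\n"
    "2024,202401,2,52.0,1,3,9999999,1,52.0,6,64,0\n"
)

# usecols tells pandas to parse only the listed columns, which reduces memory use.
trimmed = pd.read_csv(sample_csv, usecols=KEEP_COLS)
print(trimmed.columns.tolist())
```

With the real file, the same `usecols` argument would be passed alongside the file path; whether this is worthwhile depends on how much memory the full extract consumes.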
Understanding the shape of our resulting table:
filtered_data.shape
(3422888, 6)
Examining the data types of each column:
filtered_data.dtypes
| | 0 |
|---|---|
| YEAR | int64 |
| HHWT | float64 |
| STATEFIP | int64 |
| HHINCOME | int64 |
| PERNUM | int64 |
| EDUC | int64 |
Confirming that no column contains NaN values (missing household income is instead coded as 9999999, which is handled during preprocessing):
filtered_data.isna().sum()
| | 0 |
|---|---|
| YEAR | 0 |
| HHWT | 0 |
| STATEFIP | 0 |
| HHINCOME | 0 |
| PERNUM | 0 |
| EDUC | 0 |
5.4 Data Preprocessing
Since PUMS records are person-level, each household appears once per member. To avoid double-counting household income, only the first person in each household (PERNUM == 1), who serves as the household reference person, is kept.
filtered_data = filtered_data[filtered_data['PERNUM'] == 1].reset_index(drop=True)
filtered_data = filtered_data.drop(columns=['PERNUM'])
filtered_data.head()
| | YEAR | HHWT | STATEFIP | HHINCOME | EDUC |
|---|---|---|---|---|---|
| 0 | 2024 | 41.0 | 1 | 9999999 | 2 |
| 1 | 2024 | 52.0 | 1 | 9999999 | 6 |
| 2 | 2024 | 31.0 | 1 | 9999999 | 6 |
| 3 | 2024 | 4.0 | 1 | 9999999 | 7 |
| 4 | 2024 | 19.0 | 1 | 9999999 | 6 |
Next, records carrying the N/A code (HHINCOME == 9999999) were removed along with any zero or negative income records, since the analysis requires strictly positive household incomes (the log transformation applied later is undefined otherwise).
filtered_data = filtered_data[(filtered_data['HHINCOME'] > 0) & (filtered_data['HHINCOME'] != 9999999)].reset_index(drop=True)
filtered_data.head()
| | YEAR | HHWT | STATEFIP | HHINCOME | EDUC |
|---|---|---|---|---|---|
| 0 | 2024 | 11.0 | 1 | 91200 | 6 |
| 1 | 2024 | 61.0 | 1 | 134000 | 11 |
| 2 | 2024 | 67.0 | 1 | 33300 | 6 |
| 3 | 2024 | 199.0 | 1 | 207100 | 10 |
| 4 | 2024 | 68.0 | 1 | 195600 | 6 |
Then, FIPS codes were mapped to state names for interpretability in plots and tables.
FIPS_TO_STATE = {
1:'Alabama', 2:'Alaska', 4:'Arizona', 5:'Arkansas', 6:'California',
8:'Colorado', 9:'Connecticut', 10:'Delaware', 11:'District of Columbia',
12:'Florida', 13:'Georgia', 15:'Hawaii', 16:'Idaho', 17:'Illinois',
18:'Indiana', 19:'Iowa', 20:'Kansas', 21:'Kentucky', 22:'Louisiana',
23:'Maine', 24:'Maryland', 25:'Massachusetts', 26:'Michigan',
27:'Minnesota', 28:'Mississippi', 29:'Missouri', 30:'Montana',
31:'Nebraska', 32:'Nevada', 33:'New Hampshire', 34:'New Jersey',
35:'New Mexico', 36:'New York', 37:'North Carolina', 38:'North Dakota',
39:'Ohio', 40:'Oklahoma', 41:'Oregon', 42:'Pennsylvania', 44:'Rhode Island',
45:'South Carolina', 46:'South Dakota', 47:'Tennessee', 48:'Texas',
49:'Utah', 50:'Vermont', 51:'Virginia', 53:'Washington', 54:'West Virginia',
55:'Wisconsin', 56:'Wyoming'
}
filtered_data['STATE'] = filtered_data['STATEFIP'].map(FIPS_TO_STATE)
filtered_data = filtered_data.drop(columns='STATEFIP')
filtered_data.head()
| | YEAR | HHWT | HHINCOME | EDUC | STATE |
|---|---|---|---|---|---|
| 0 | 2024 | 11.0 | 91200 | 6 | Alabama |
| 1 | 2024 | 61.0 | 134000 | 11 | Alabama |
| 2 | 2024 | 67.0 | 33300 | 6 | Alabama |
| 3 | 2024 | 199.0 | 207100 | 10 | Alabama |
| 4 | 2024 | 68.0 | 195600 | 6 | Alabama |
The raw education categories were then bucketed into groups based on their codes. This grouping aligns with the research question, which compares households at both ends of the education spectrum.
def recode_educ(code):
    # Codes 0-6: HS or less; 10-11: bachelor's or higher; 7-9: some college
    if code <= 6: return 'HS or Lower'
    elif code >= 10: return 'BA or Higher'
    else: return 'Some College'
filtered_data['EDUCGROUP'] = filtered_data['EDUC'].apply(recode_educ)
filtered_data = filtered_data.drop(columns='EDUC')
filtered_data.head()
| | YEAR | HHWT | HHINCOME | STATE | EDUCGROUP |
|---|---|---|---|---|---|
| 0 | 2024 | 11.0 | 91200 | Alabama | HS or Lower |
| 1 | 2024 | 61.0 | 134000 | Alabama | BA or Higher |
| 2 | 2024 | 67.0 | 33300 | Alabama | HS or Lower |
| 3 | 2024 | 199.0 | 207100 | Alabama | BA or Higher |
| 4 | 2024 | 68.0 | 195600 | Alabama | HS or Lower |
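On a table with more than a million rows, a row-by-row `apply` can be slow. As a possible alternative (a sketch on toy codes, not the notebook's method), the same three-way bucketing can be expressed with `pd.cut`, assuming EDUC only takes integer values from 0 to 11 as described in Section 4.

```python
import pandas as pd

# Toy EDUC codes spanning the 0-11 range described in the data description.
educ = pd.Series([0, 2, 6, 7, 9, 10, 11])

# Right-closed bins: (-1, 6] -> HS or Lower, (6, 9] -> Some College, (9, 11] -> BA or Higher.
educ_group = pd.cut(
    educ,
    bins=[-1, 6, 9, 11],
    labels=["HS or Lower", "Some College", "BA or Higher"],
).astype(str)

print(educ_group.tolist())
```

The cast to `str` keeps the result comparable to the string labels returned by `recode_educ`; leaving it as a categorical column would also work and uses less memory.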
Finally, the YEAR column was excluded as all data used in this analysis is from the 2024 survey year.
filtered_data = filtered_data.drop(columns='YEAR')
filtered_data.head()
| | HHWT | HHINCOME | STATE | EDUCGROUP |
|---|---|---|---|---|
| 0 | 11.0 | 91200 | Alabama | HS or Lower |
| 1 | 61.0 | 134000 | Alabama | BA or Higher |
| 2 | 67.0 | 33300 | Alabama | HS or Lower |
| 3 | 199.0 | 207100 | Alabama | BA or Higher |
| 4 | 68.0 | 195600 | Alabama | HS or Lower |
This is the final table produced by the cleaning and preprocessing steps; it is used in all further analysis.
6. Exploratory Data Analysis
This section explores the structure and distribution of the cleaned dataset to reveal insights about the data that might be helpful in addressing the research question and narrowing analytical choices for further analysis.
6.1 Education Group Composition
First, the number of households in each education group is expressed as a percentage of the total to reveal any imbalance in the data:
educ_counts = filtered_data['EDUCGROUP'].value_counts()
educ_pct = filtered_data['EDUCGROUP'].value_counts(normalize=True)*100
educ_pct_df = pd.DataFrame({'Count': educ_counts, 'Percent (%)': educ_pct})
educ_pct_df
| EDUCGROUP | Count | Percent (%) |
|---|---|---|
| BA or Higher | 544891 | 40.901254 |
| HS or Lower | 495770 | 37.214075 |
| Some College | 291550 | 21.884671 |
Next, household income is summarized within each education group:
filtered_data.groupby('EDUCGROUP')['HHINCOME'].describe().round()
| EDUCGROUP | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BA or Higher | 544891.0 | 167532.0 | 157446.0 | 1.0 | 72000.0 | 125000.0 | 205700.0 | 2811900.0 |
| HS or Lower | 495770.0 | 75415.0 | 77044.0 | 1.0 | 29100.0 | 56200.0 | 98000.0 | 1914000.0 |
| Some College | 291550.0 | 97332.0 | 90272.0 | 1.0 | 42000.0 | 76600.0 | 125000.0 | 2523200.0 |
6.2 Income Distribution by Education Group
Next, the household weight (HHWT) is used because each sampled household in PUMS represents a different number of real households in the population. Weighting by HHWT gives a better estimate of mean income.
weighted_means = filtered_data.groupby('EDUCGROUP').apply(lambda x: np.average(x['HHINCOME'], weights=x['HHWT'])).rename('Weighted Mean Income ($)').reset_index()
weighted_means
| | EDUCGROUP | Weighted Mean Income ($) |
|---|---|---|
| 0 | BA or Higher | 162724.626449 |
| 1 | HS or Lower | 74593.605160 |
| 2 | Some College | 94876.639716 |
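To illustrate why weighting changes the estimate, here is a small sketch on hypothetical toy values (not ACS data). It shows that `np.average` with `weights` computes the sum of weight times income divided by the sum of weights, and that the result can differ sharply from the simple mean when weights are uneven.

```python
import numpy as np

# Hypothetical toy households: incomes and the number of real households each one represents.
incomes = np.array([40_000, 100_000, 250_000])
weights = np.array([200.0, 50.0, 10.0])

unweighted = incomes.mean()
# np.average computes sum(w * x) / sum(w).
weighted = np.average(incomes, weights=weights)
manual = (incomes * weights).sum() / weights.sum()

print(f"Unweighted mean: {unweighted:,.0f}")
print(f"Weighted mean:   {weighted:,.0f}")
print(f"Manual formula:  {manual:,.0f}")
```

In this toy case the low-income household carries most of the weight, so the weighted mean sits well below the unweighted one.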
To better understand the distribution, weighted income histograms for the three education groups were overlaid. Incomes were capped at $500,000 for readability.
fig, ax = plt.subplots(figsize=(7, 5))
for grp, color in [('HS or Lower', '#E74C3C'), ('Some College', '#F39C12'), ('BA or Higher', '#2ECC71')]:
    subset = filtered_data[filtered_data['EDUCGROUP'] == grp]
    # Named `incomes` so the raw `data` DataFrame loaded earlier is not overwritten
    incomes = subset['HHINCOME'].clip(upper=500000)
    weights = subset['HHWT']
    ax.hist(incomes, bins=60, alpha=0.5, color=color, label=grp, density=True, weights=weights)
ax.set_title('Weighted Income Distribution by Education Group', fontweight='bold')
ax.set_xlabel('Household Income (capped at $500k)')
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
ax.legend()
plt.tight_layout()
plt.show()
The same data can also be conveyed through a boxplot to understand the gradual increase in income by education group. For visualization purposes, the boxplot uses unweighted values instead.
fig, ax = plt.subplots(figsize=(7, 5))
groups = [filtered_data[filtered_data['EDUCGROUP'] == g]['HHINCOME'].clip(upper=500000)
for g in ['HS or Lower', 'Some College', 'BA or Higher']]
ax.boxplot(groups, labels=['HS or Lower', 'Some College', 'BA or Higher'],
patch_artist=True, notch=True,
boxprops=dict(facecolor='#AED6F1'),
medianprops=dict(color='navy', linewidth=2))
ax.set_title('Income by Education Group', fontweight='bold')
ax.set_ylabel('Household Income (capped at $500k)')
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'${x/1000:.0f}k'))
plt.tight_layout()
plt.show()
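Because the boxplot is unweighted, its medians may differ from population-representative ones. The sketch below shows one simple way to compute a weighted quantile. The function name `weighted_quantile` and the toy values are hypothetical, and this version picks an observed value without interpolation, which is only one of several possible definitions.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Return the smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum_share = np.cumsum(weights) / weights.sum()
    # First position where the cumulative share is at least q.
    idx = np.searchsorted(cum_share, q, side="left")
    return values[min(idx, len(values) - 1)]

# Hypothetical toy data: three households with very different weights.
incomes = [30_000, 60_000, 200_000]
hhwt = [10, 10, 80]
print(weighted_quantile(incomes, hhwt, 0.5))
```

Applied per education group with HHWT as the weights, a function like this could provide weighted medians to compare against the unweighted boxplot.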
6.3 Education Group Composition by State
To contextualize the differences in income, education composition of each state was examined, particularly what share of households fall into the HS or Lower vs BA or Higher categories.
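state_educ = filtered_data.groupby(['STATE', 'EDUCGROUP']).size().unstack(fill_value=0)
# Compute the household total once, before adding percentage columns, so the denominator contains only counts
total_households = state_educ.sum(axis=1)
state_educ['pct_BA'] = (state_educ['BA or Higher']/total_households*100).round(1)
state_educ['pct_HS'] = (state_educ['HS or Lower']/total_households*100).round(1)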
state_educ.sort_values('pct_BA', ascending=False)
state_educ[['pct_BA','pct_HS']].sort_values('pct_BA').plot(
kind='barh',
stacked=True,
figsize=(13,12)
)
plt.xlabel('Percent of Households')
plt.ylabel('State')
plt.title('Education Composition by State')
plt.legend(title='Education Level')
plt.show()
The chart shows considerable variation in educational attainment across states. States like Massachusetts, Maryland, and Colorado skew heavily toward BA or Higher households, while Oklahoma, West Virginia, and Arkansas have a much larger share of HS or Lower households.
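The shares above are based on unweighted household counts. If population-representative shares were wanted instead, one option is to sum HHWT rather than count rows. This is a sketch on a hypothetical toy table with the same column names as filtered_data, not a result from the notebook.

```python
import pandas as pd

# Hypothetical toy table with the same columns as filtered_data.
toy = pd.DataFrame({
    "STATE": ["A", "A", "A", "B", "B"],
    "EDUCGROUP": ["BA or Higher", "HS or Lower", "Some College", "BA or Higher", "HS or Lower"],
    "HHWT": [30.0, 50.0, 20.0, 10.0, 30.0],
})

# Sum household weights per state and education group, then divide by each state's total.
weight_sums = toy.groupby(["STATE", "EDUCGROUP"])["HHWT"].sum().unstack(fill_value=0)
weighted_share = weight_sums.div(weight_sums.sum(axis=1), axis=0) * 100

print(weighted_share.round(1))
```

Each row of `weighted_share` sums to 100, so the columns can be read directly as percentages of a state's estimated households.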
Then, the weighted mean household income for every state × education group combination is computed. This produces a summary table where each row represents one group within one state, and the income value reflects the true population-weighted average rather than a simple mean across survey respondents.
summary = (
filtered_data.groupby(['STATE', 'EDUCGROUP'])
.apply(lambda g: np.average(g['HHINCOME'], weights=g['HHWT']))
.round(0)
.rename('weighted_mean_income')
.reset_index()
)
summary.head()
| | STATE | EDUCGROUP | weighted_mean_income |
|---|---|---|---|
| 0 | Alabama | BA or Higher | 134943.0 |
| 1 | Alabama | HS or Lower | 62850.0 |
| 2 | Alabama | Some College | 81694.0 |
| 3 | Alaska | BA or Higher | 159980.0 |
| 4 | Alaska | HS or Lower | 85137.0 |
6.4 Education Premium by State
From this summary, the two focal groups (HS or Lower and BA or Higher) were isolated and pivoted so that each state occupies a single row with a separate column for each group's weighted mean income. The education premium is then calculated as the difference between the two, and states are ranked from highest to lowest premium.
pivot = (
summary[summary['EDUCGROUP'].isin(['HS or Lower', 'BA or Higher'])]
.pivot(index='STATE', columns='EDUCGROUP', values='weighted_mean_income')
.reset_index()
)
pivot.columns.name = None
pivot['EDUCATION_PREMIUM'] = pivot['BA or Higher'] - pivot['HS or Lower']
pivot_sorted = pivot.sort_values('EDUCATION_PREMIUM', ascending=False).reset_index(drop=True)
pivot_sorted['RANK'] = range(1, len(pivot_sorted) + 1)
pivot_sorted.head()
| | STATE | BA or Higher | HS or Lower | EDUCATION_PREMIUM | RANK |
|---|---|---|---|---|---|
| 0 | District of Columbia | 197105.0 | 76676.0 | 120429.0 | 1 |
| 1 | New York | 185397.0 | 74426.0 | 110971.0 | 2 |
| 2 | California | 198066.0 | 88157.0 | 109909.0 | 3 |
| 3 | Massachusetts | 193158.0 | 84899.0 | 108259.0 | 4 |
| 4 | Connecticut | 193260.0 | 85438.0 | 107822.0 | 5 |
From these aggregates, summary statistics of the state-level premium were computed:
print(f"Mean premium: ${pivot['EDUCATION_PREMIUM'].mean():,.0f}")
print(f"Median premium: ${pivot['EDUCATION_PREMIUM'].median():,.0f}")
print(f"Std deviation: ${pivot['EDUCATION_PREMIUM'].std():,.0f}")
print(f"Min premium: ${pivot['EDUCATION_PREMIUM'].min():,.0f} ({pivot.loc[pivot['EDUCATION_PREMIUM'].idxmin(), 'STATE']})")
print(f"Max premium: ${pivot['EDUCATION_PREMIUM'].max():,.0f} ({pivot.loc[pivot['EDUCATION_PREMIUM'].idxmax(), 'STATE']})")
Mean premium: $76,841
Median premium: $73,508
Std deviation: $16,325
Min premium: $51,552 (Wyoming)
Max premium: $120,429 (District of Columbia)
To visualize how the two groups relate to each other at the state level, each state's weighted mean HS income is plotted against its weighted mean BA income.
fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(pivot['HS or Lower'] / 1000, pivot['BA or Higher'] / 1000,
s=80, color='#8E44AD', alpha=0.75, edgecolors='white')
for _, row in pivot.iterrows():
    # Labels are the first two letters of the state name, not postal abbreviations
    ax.annotate(row['STATE'][:2], (row['HS or Lower'] / 1000, row['BA or Higher'] / 1000),
                fontsize=7, ha='center', va='bottom')
lims = [min(ax.get_xlim()[0], ax.get_ylim()[0]), max(ax.get_xlim()[1], ax.get_ylim()[1])]
ax.plot(lims, lims, 'k--', alpha=0.5, label='Equal income line')
ax.set_xlabel('Weighted Mean Income — HS or Lower ($k)', fontsize=12)
ax.set_ylabel('Weighted Mean Income — BA or Higher ($k)', fontsize=12)
ax.set_title('State-Level Income: HS vs BA Households', fontsize=13, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()
Every state sits above the diagonal, showing that BA households out-earn HS households on average in every state. The spread along the diagonal reflects overall income levels across states, with states in the upper right being high-income across both groups and states in the lower left being low-income across both groups. The vertical distance from the diagonal is the education premium.
The variation in distance across states can be better understood through a ranked bar chart:
fig, ax = plt.subplots(figsize=(13, 10))
sorted_p = pivot.sort_values('EDUCATION_PREMIUM')
cmap = plt.cm.RdYlGn
norm = plt.Normalize(sorted_p['EDUCATION_PREMIUM'].min(), sorted_p['EDUCATION_PREMIUM'].max())
bar_colors = [cmap(norm(v)) for v in sorted_p['EDUCATION_PREMIUM']]
ax.barh(sorted_p['STATE'], sorted_p['EDUCATION_PREMIUM'] / 1000, color=bar_colors)
median_val = sorted_p['EDUCATION_PREMIUM'].median()
ax.axvline(median_val / 1000, color='black', linestyle='--', alpha=0.5,
label=f'Median: ${median_val/1000:.1f}k')
ax.set_xlabel('Education Premium ($k)', fontsize=12)
ax.set_title('Education Premium by State (BA or Higher vs HS or Lower)',
fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()
The premium varies substantially across states, from $51,552 in Wyoming to $120,429 in the District of Columbia, a spread of roughly $69,000. States above the median line (toward the green end of the color scale) offer a larger-than-median premium for a bachelor's degree, while those below it (toward the red end) offer a comparatively smaller one. Notably, no state has a negative premium, reinforcing the finding from the scatter plot that BA households out-earn HS households on average in every state.
6.5 Cost-of-Living Adjustment
However, this nominal education premium does not account for differences in the cost of living across states. To adjust for this, a 2024 Cost of Living Index (COLI), scaled so that 100 is the national average, was used to deflate each state's premium: the nominal premium is divided by the state's COLI and multiplied by 100.
COLI = {
'Alabama': 88.0, 'Alaska': 123.8, 'Arizona': 111.5, 'Arkansas': 88.7,
'California': 144.8, 'Colorado': 102.0, 'Connecticut': 112.3,
'Delaware': 100.8, 'District of Columbia': 141.9, 'Florida': 102.8,
'Georgia': 91.3, 'Hawaii': 186.9, 'Idaho': 102.0, 'Illinois': 94.4,
'Indiana': 90.5, 'Iowa': 89.7, 'Kansas': 87.0, 'Kentucky': 93.0,
'Louisiana': 92.2, 'Maine': 112.1, 'Maryland': 115.3,
'Massachusetts': 145.9, 'Michigan': 90.4, 'Minnesota': 95.1,
'Mississippi': 87.9, 'Missouri': 88.7, 'Montana': 94.9,
'Nebraska': 93.1, 'Nevada': 101.3, 'New Hampshire': 112.6,
'New Jersey': 114.6, 'New Mexico': 93.3, 'New York': 123.3,
'North Carolina': 97.8, 'North Dakota': 91.9, 'Ohio': 94.2,
'Oklahoma': 85.7, 'Oregon': 112.0, 'Pennsylvania': 95.1,
'Rhode Island': 112.2, 'South Carolina': 95.9, 'South Dakota': 92.2,
'Tennessee': 90.5, 'Texas': 92.7, 'Utah': 104.9, 'Vermont': 114.4,
'Virginia': 100.7, 'Washington': 114.2, 'West Virginia': 84.1,
'Wisconsin': 97.0, 'Wyoming': 95.5
}
pivot['COLI'] = pivot['STATE'].map(COLI)
pivot['REAL_PREMIUM'] = (pivot['EDUCATION_PREMIUM'] / pivot['COLI']) * 100
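# method='min' gives tied states the same whole-number rank (the default averages ties before the int cast)
pivot['NOMINAL_RANK'] = pivot['EDUCATION_PREMIUM'].rank(ascending=False, method='min').astype(int)
pivot['REAL_RANK'] = pivot['REAL_PREMIUM'].rank(ascending=False, method='min').astype(int)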
pivot['RANK_SHIFT'] = pivot['NOMINAL_RANK'] - pivot['REAL_RANK']
pivot_coli = pivot.sort_values('REAL_PREMIUM', ascending=False).reset_index(drop=True)
pivot_coli[['STATE', 'EDUCATION_PREMIUM', 'COLI', 'REAL_PREMIUM', 'NOMINAL_RANK', 'REAL_RANK', 'RANK_SHIFT']].head()
| | STATE | EDUCATION_PREMIUM | COLI | REAL_PREMIUM | NOMINAL_RANK | REAL_RANK | RANK_SHIFT |
|---|---|---|---|---|---|---|---|
| 0 | Virginia | 100801.0 | 100.7 | 100100.297915 | 7 | 1 | 6 |
| 1 | Connecticut | 107822.0 | 112.3 | 96012.466607 | 5 | 2 | 3 |
| 2 | Illinois | 88373.0 | 94.4 | 93615.466102 | 10 | 3 | 7 |
| 3 | Texas | 86230.0 | 92.7 | 93020.496224 | 11 | 4 | 7 |
| 4 | Georgia | 84497.0 | 91.3 | 92548.740416 | 13 | 5 | 8 |
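As a quick arithmetic check of the adjustment, the sketch below applies the same formula (nominal premium divided by COLI, times 100) to two states using the values reported above. The helper name `real_premium` is hypothetical and introduced only for illustration.

```python
def real_premium(nominal_premium: float, coli: float) -> float:
    """Deflate a nominal dollar premium by a cost-of-living index where 100 is the national average."""
    return nominal_premium / coli * 100

# Nominal premiums and COLI values taken from the tables and dictionary above.
virginia = real_premium(100_801, 100.7)
california = real_premium(109_909, 144.8)

print(f"Virginia:   {virginia:,.0f}")
print(f"California: {california:,.0f}")
```

Virginia's index is close to 100, so its premium barely changes, whereas California's index of 144.8 shrinks its premium by roughly a third.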
There are considerable shifts in ranking due to this COLI adjustment, which can be understood through a similar ranked bar chart:
fig, ax = plt.subplots(figsize=(13, 10))
sorted_real = pivot.sort_values('REAL_PREMIUM')
cmap = plt.cm.RdYlGn
norm = plt.Normalize(sorted_real['REAL_PREMIUM'].min(), sorted_real['REAL_PREMIUM'].max())
bar_colors = [cmap(norm(v)) for v in sorted_real['REAL_PREMIUM']]
ax.barh(sorted_real['STATE'], sorted_real['REAL_PREMIUM'] / 1000, color=bar_colors)
median_val = sorted_real['REAL_PREMIUM'].median()
ax.axvline(median_val / 1000, color='black', linestyle='--', alpha=0.5,
label=f'Median: ${median_val/1000:.1f}k')
ax.set_xlabel('Education Premium ($k, COLI-adjusted)', fontsize=12)
ax.set_title('COLI-Adjusted Education Premium by State (BA or Higher vs HS or Lower)',
fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.show()
The adjustment meaningfully reshuffles the rankings. Hawaii, whose nominal premium of ~USD 56k is already below the national mean, drops further after adjustment because of its COLI of 186.9, nearly double the national average, so any income advantage is largely offset by extreme living costs. Similarly, California and Massachusetts fall from their top-five nominal positions once their above-average costs are factored in.
Comparing the nominal and COLI-adjusted ranked bar charts side-by-side:
fig, axes = plt.subplots(1, 2, figsize=(18, 10))
sorted_nominal = pivot.sort_values('EDUCATION_PREMIUM')
cmap = plt.cm.RdYlGn
norm_n = plt.Normalize(sorted_nominal['EDUCATION_PREMIUM'].min(), sorted_nominal['EDUCATION_PREMIUM'].max())
axes[0].barh(sorted_nominal['STATE'], sorted_nominal['EDUCATION_PREMIUM'] / 1000,
color=[cmap(norm_n(v)) for v in sorted_nominal['EDUCATION_PREMIUM']])
axes[0].axvline(sorted_nominal['EDUCATION_PREMIUM'].median() / 1000,
color='black', linestyle='--', alpha=0.5,
label=f"Median: ${sorted_nominal['EDUCATION_PREMIUM'].median()/1000:.1f}k")
axes[0].set_title('Nominal Education Premium', fontweight='bold', fontsize=12)
axes[0].set_xlabel('Education Premium ($k)')
axes[0].legend()
sorted_real = pivot.sort_values('REAL_PREMIUM')
norm_r = plt.Normalize(sorted_real['REAL_PREMIUM'].min(), sorted_real['REAL_PREMIUM'].max())
axes[1].barh(sorted_real['STATE'], sorted_real['REAL_PREMIUM'] / 1000,
color=[cmap(norm_r(v)) for v in sorted_real['REAL_PREMIUM']])
axes[1].axvline(sorted_real['REAL_PREMIUM'].median() / 1000,
color='black', linestyle='--', alpha=0.5,
label=f"Median: ${sorted_real['REAL_PREMIUM'].median()/1000:.1f}k")
axes[1].set_title('COLI-Adjusted Education Premium', fontweight='bold', fontsize=12)
axes[1].set_xlabel('Education Premium ($k, COLI-adjusted)')
axes[1].legend()
plt.suptitle('Education Premium: Nominal vs Cost-of-Living Adjusted',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
The hypothesis predicted that high cost-of-living states (California, Washington, and Hawaii) would offer the highest education premiums, while low cost-of-living states (Arkansas, Mississippi, and Iowa) would offer the lowest. Having computed both nominal and COLI-adjusted premiums, we can now directly test this prediction for each of the six states.
hypothesis_states = ['California', 'Washington', 'Hawaii', 'Arkansas', 'Mississippi', 'Iowa']
highlight = pivot[pivot['STATE'].isin(hypothesis_states)].copy()
highlight = highlight.set_index('STATE').loc[hypothesis_states].reset_index()
fig, ax = plt.subplots(figsize=(7, 5))
x = np.arange(len(highlight))
width = 0.35
ax.bar(x - width/2, highlight['EDUCATION_PREMIUM'] / 1000,
width, label='Nominal Premium', color='#3498DB', alpha=0.8)
ax.bar(x + width/2, highlight['REAL_PREMIUM'] / 1000,
width, label='COLI-Adjusted Premium', color='#E67E22', alpha=0.8)
ax.set_xticks(x)
ax.set_xticklabels(highlight['STATE'], rotation=15, ha='right')
ax.set_ylabel('Education Premium ($k)')
ax.set_title('Nominal vs COLI-Adjusted Premium\nfor Hypothesis States',
fontsize=13, fontweight='bold')
ax.legend()
plt.tight_layout()
plt.show()
These results partially support the hypothesis. California and Washington do show high nominal premiums as predicted, but after COLI adjustment both drop noticeably, with California falling from ~USD 110k to ~USD 76k. Hawaii is the starkest example: its nominal premium of ~USD 56k is already modest, and after adjustment it collapses to just ~USD 30k, the lowest of the six states, directly contradicting our prediction that it would be among the highest.
On the lower end, Arkansas, Iowa, and Mississippi tell the opposite story: all three see their adjusted premium exceed their nominal premium, meaning a degree in these states actually delivers more purchasing power than the raw numbers suggest. Mississippi in particular rises from ~USD 62k nominal to ~USD 70k adjusted.
As such, the hypothesis holds for California and Washington nominally, but the COLI adjustment reveals that low cost-of-living states offer a more competitive real return to education than they appear on the surface.
7. Previous Research
Extensive research shows that education is strongly linked with higher income in the United States. In labor economics, this idea is usually described as the return to education, meaning that additional schooling is associated with higher earnings on average. A common finding is that finishing a bachelor’s degree increases earnings compared to stopping at high school, even after accounting for other differences between people. According to the U.S. Census Bureau, households headed by someone with a bachelor's degree or higher had a median income of USD 132,700 in 2024, more than double the USD 58,410 median for households headed by someone with a high school degree but no college.
Over time, this gap has widened. Between 2004 and 2024, earnings of those with a high school degree but no college rose 3.2%, while earnings of those with a bachelor's degree or more increased 6.3%. The theoretical foundation for this relationship is the human capital framework, most notably formalized by Mincer (1974), which treats education as an investment that raises individual productivity and therefore earnings. Under this model, the return to a bachelor's degree represents the wage premium an employer is willing to pay for the skills and signaling value a degree confers.
However, the size of this premium is not uniform across geographies. Local labor market conditions, including industry composition, demand for high-skill workers, and prevailing wage levels, shape how much a degree is worth in a given state. States with large concentrations of high-paying professional industries, such as finance, technology, and consulting, are expected to offer larger premiums than states with economies centered on lower-wage sectors.
Another complication that arises is that income alone does not fully capture real living standards, because cost of living differs across states. Cost of living varies substantially across the U.S., meaning a given dollar amount represents very different standards of living depending on where a household resides. Adjusting for cost of living is therefore necessary to make meaningful cross-state comparisons of the education premium, and motivates the COLI adjustment applied in this analysis.
Prior studies estimating the education premium have typically employed Mincer-style OLS earnings regressions, often using individual-level survey data with controls for experience and demographics (Card, 1999). This analysis adopts a similar regression framework but focuses on household income at the state level, complemented by ANOVA to test whether geographic variation in the premium is statistically significant.
8. Analyses Performed
We perform two primary statistical analyses: ANOVA and weighted linear regression.
ANOVA was chosen because our key predictor (education group) is categorical and our outcome (household income) is continuous, which is the standard setup for comparing group means. We run a per-state one-way ANOVA to test whether the income gap is statistically significant within each individual state, and a two-way ANOVA (log income ~ education group + state) to test the combined effect of both variables simultaneously.
Weighted linear regression complements ANOVA by quantifying the size of the education premium and allowing us to control for confounding. We use log-transformed income as the outcome so coefficients represent approximate percentage income differences, which is standard in labor economics. The regression models use Weighted Least Squares (WLS) with HHWT as the weight, while the interaction ANOVA in Section 8.2 is fit by unweighted OLS. We estimate two regression models: one with education alone, and one adding state fixed effects to isolate the within-state premium.
8.1 Log Transformation
Before estimating any model, household income is log-transformed to serve as the outcome variable in all subsequent models. Raw household income is heavily right-skewed, i.e. a small number of very high-earning households pull the mean well above the median. The log transformation compresses this skew, brings the residual distribution closer to normality, and makes regression coefficients directly interpretable as approximate percentage differences rather than dollar amounts.
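# Zero or negative incomes have no real logarithm, so mask them to NaN before taking logs
filtered_data['log_income'] = np.log(filtered_data['HHINCOME'].where(filtered_data['HHINCOME'] > 0))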
filtered_data.head()
| | HHWT | HHINCOME | STATE | EDUCGROUP | log_income |
|---|---|---|---|---|---|
| 0 | 11.0 | 91200 | Alabama | HS or Lower | 11.420810 |
| 1 | 61.0 | 134000 | Alabama | BA or Higher | 11.805595 |
| 2 | 67.0 | 33300 | Alabama | HS or Lower | 10.413313 |
| 3 | 199.0 | 207100 | Alabama | BA or Higher | 12.240957 |
| 4 | 68.0 | 195600 | Alabama | HS or Lower | 12.183827 |
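The effect of the log transformation on skew can be seen in a small example. The incomes below are made up for illustration and are not drawn from the ACS sample; the sketch simply compares the mean and median before and after taking logs.

```python
import math
from statistics import mean, median

# Hypothetical incomes: five typical households plus one very high earner.
incomes = [20_000, 35_000, 50_000, 65_000, 80_000, 1_000_000]
log_incomes = [math.log(x) for x in incomes]

# On the raw scale the single outlier drags the mean far above the median.
print(f"raw: mean = {mean(incomes):,.0f}, median = {median(incomes):,.0f}")

# On the log scale the mean and median sit much closer together.
print(f"log: mean = {mean(log_incomes):.3f}, median = {median(log_incomes):.3f}")
```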
8.2 Full Interaction ANOVA¶
To assess how income varies by education group across states, and whether the education premium itself differs by state, we first estimate a fully interacted OLS model. This model includes state fixed effects, education group fixed effects, and their interaction, allowing the education premium to vary freely across all 50 states and the District of Columbia.
interaction_model = smf.ols('log_income ~ C(STATE) + C(EDUCGROUP) + C(STATE):C(EDUCGROUP)',
data=filtered_data).fit()
anova_table = anova_lm(interaction_model, typ=2)
anova_table
| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(STATE) | 1.597602e+04 | 50.0 | 329.934088 | 0.0 |
| C(EDUCGROUP) | 1.772140e+05 | 2.0 | 91494.828804 | 0.0 |
| C(STATE):C(EDUCGROUP) | 2.068829e+03 | 100.0 | 21.362559 | 0.0 |
| Residual | 1.290014e+06 | 1332058.0 | NaN | NaN |
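Because education group sizes differ across states, the design is unbalanced, so Type II sums of squares are used: each main effect is tested after accounting for the other main effect (but not the interaction), and the interaction is tested after accounting for both main effects.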
According to the results, education group is by far the dominant predictor (F = 91,494.83), reflecting the large and consistent income gap between attainment levels. State is also highly significant (F = 329.93), confirming that geography accounts for meaningful independent variance in household income. Most importantly for the research question, the interaction term is significant (F = 21.36, p < 0.001), indicating that the education premium is not uniform across states.
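Each F statistic in the table is the term's mean square (sum of squares divided by its degrees of freedom) divided by the residual mean square. The sketch below recomputes the three F values from the displayed sums of squares as a check on the arithmetic; small differences from the table are expected because the sums of squares are shown to only seven significant figures. The function name `f_statistic` is a hypothetical helper.

```python
def f_statistic(ss_term, df_term, ss_resid, df_resid):
    """F = (term mean square) / (residual mean square)."""
    return (ss_term / df_term) / (ss_resid / df_resid)

# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_resid, df_resid = 1.290014e06, 1332058
terms = {
    "C(STATE)": (1.597602e04, 50),
    "C(EDUCGROUP)": (1.772140e05, 2),
    "C(STATE):C(EDUCGROUP)": (2.068829e03, 100),
}

f_values = {
    name: f_statistic(ss, df, ss_resid, df_resid) for name, (ss, df) in terms.items()
}
for name, f in f_values.items():
    print(f"{name}: F = {f:.2f}")
```

Although the interaction explains far less variance than either main effect, its mean square is still roughly 21 times the residual mean square, which is why it is so clearly significant at this sample size.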
8.3 QQ Plot of Residuals¶
Next, we check whether the normality assumption underlying OLS inference is at least approximately satisfied. To do this, a QQ plot is constructed on a random sample of 10,000 residuals from the full interaction model.
residuals = interaction_model.resid.sample(10000, random_state=0)  # fixed seed (arbitrary choice) so the plot is reproducible
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('QQ Plot of Residuals (Log Income)')
plt.show()
Residuals track the normal line closely through the center of the distribution but show a heavy left tail, with extreme negative values around −8 to −9 corresponding to very low-income households. This degree of tail deviation is expected in income survey data and, with more than 1.3 million observations, is unlikely to invalidate inference.
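For readers unfamiliar with how a QQ plot is built, the sketch below pairs each sorted residual with the standard-normal quantile at the same rank. It is a simplified illustration using only the standard library: the residuals are invented, the helper `qq_pairs` is hypothetical, and the plotting position `(i + 0.5) / n` is one common convention that may differ from the one `stats.probplot` uses internally.

```python
from statistics import NormalDist

def qq_pairs(residuals):
    """Pair each sorted residual with the standard-normal quantile at the same rank."""
    ordered = sorted(residuals)
    n = len(ordered)
    std_normal = NormalDist()
    # (i + 0.5) / n is a simple plotting position, assumed here for illustration.
    return [
        (std_normal.inv_cdf((i + 0.5) / n), value) for i, value in enumerate(ordered)
    ]

# Hypothetical residuals with one extreme negative value, mimicking a heavy left tail.
toy_residuals = [-8.5, -0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.6, 0.9]
pairs = qq_pairs(toy_residuals)
for theoretical, observed in pairs:
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the residuals were normal, the observed values would grow roughly in proportion to the theoretical quantiles. In the toy data the smallest observed value sits far below what its theoretical quantile would suggest, which is the same pattern the left tail of the plot shows.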
8.4 Two-Way ANOVA¶
Having confirmed the model is reasonably well-specified, we next estimate a cleaner two-way ANOVA by dropping the interaction term to test the independent main effects of education group and state on log household income.
two_way_anova_model = smf.ols('log_income ~ C(EDUCGROUP) + C(STATE)', data=filtered_data).fit()
anova_table = anova_lm(two_way_anova_model, typ=2)
anova_table
| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(EDUCGROUP) | 1.772140e+05 | 2.0 | 91355.188807 | 0.0 |
| C(STATE) | 1.597602e+04 | 50.0 | 329.430541 | 0.0 |
| Residual | 1.292083e+06 | 1332158.0 | NaN | NaN |
Both predictors are highly significant. Education group (F = 91,355.19) remains the dominant driver of income variation, with state (F = 329.43) contributing meaningfully as an independent effect. As such, both are included in the regression models that follow.
8.5 Weighted Linear Regression¶
To quantify the size of the education premium, two WLS models were estimated using HHWT as the regression weight.
Model 1 estimates the raw national education premium with no geographic controls.
model1 = smf.wls(
"log_income ~ C(EDUCGROUP)",
data=filtered_data, weights=filtered_data['HHWT']).fit()
model1.summary()
| Dep. Variable: | log_income | R-squared: | 0.122 |
|---|---|---|---|
| Model: | WLS | Adj. R-squared: | 0.122 |
| Method: | Least Squares | F-statistic: | 9.229e+04 |
| Date: | Mon, 09 Mar 2026 | Prob (F-statistic): | 0.00 |
| Time: | 21:47:44 | Log-Likelihood: | -2.0912e+06 |
| No. Observations: | 1332211 | AIC: | 4.182e+06 |
| Df Residuals: | 1332208 | BIC: | 4.182e+06 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 11.6281 | 0.001 | 8455.792 | 0.000 | 11.625 | 11.631 |
| C(EDUCGROUP)[T.HS or Lower] | -0.8378 | 0.002 | -424.771 | 0.000 | -0.842 | -0.834 |
| C(EDUCGROUP)[T.Some College] | -0.5413 | 0.002 | -236.034 | 0.000 | -0.546 | -0.537 |
| Omnibus: | 613334.827 | Durbin-Watson: | 1.981 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10204424.115 |
| Skew: | -1.799 | Prob(JB): | 0.00 |
| Kurtosis: | 16.073 | Cond. No. | 3.55 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The reference category is BA or Higher. Both remaining coefficients are negative and highly significant (p < 0.001), meaning HS or Lower and Some College households earn substantially less than BA or Higher households. Specifically, HS or Lower households earn approximately 57% less than BA or Higher households, and Some College households earn approximately 42% less. The model explains 12.2% of variance in log income (R² = 0.122), which is meaningful for a single-predictor model on survey data.
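The percentages quoted above come from exponentiating the log-scale coefficients. A minimal sketch of that conversion, using the Model 1 coefficients from the summary table (the function name `log_coef_to_pct` is a hypothetical helper):

```python
import math

def log_coef_to_pct(beta):
    """Exact percentage difference implied by a coefficient on a log outcome."""
    return (math.exp(beta) - 1) * 100

# Model 1 coefficients relative to the BA or Higher reference group.
for label, beta in [("HS or Lower", -0.8378), ("Some College", -0.5413)]:
    print(f"{label}: {log_coef_to_pct(beta):.1f}%")

# Flipping the sign expresses the same gap as a premium for BA or Higher
# households over HS or Lower households.
ba_over_hs = log_coef_to_pct(0.8378)
print(f"BA or Higher over HS or Lower: +{ba_over_hs:.1f}%")
```

Because the coefficients are large, reading them directly as percentages (84% and 54%) would overstate the gap, so the exponential conversion is the safer reading. The same gap viewed from the other direction is a premium of roughly 131% for BA or Higher households over HS or Lower households.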
Model 2 adds state fixed effects to isolate the education premium within states.
model2 = smf.wls(
"log_income ~ C(EDUCGROUP) + C(STATE)",
data=filtered_data, weights=filtered_data['HHWT']).fit()
model2.summary()
| Dep. Variable: | log_income | R-squared: | 0.131 |
|---|---|---|---|
| Model: | WLS | Adj. R-squared: | 0.131 |
| Method: | Least Squares | F-statistic: | 3863. |
| Date: | Mon, 09 Mar 2026 | Prob (F-statistic): | 0.00 |
| Time: | 21:48:11 | Log-Likelihood: | -2.0840e+06 |
| No. Observations: | 1332211 | AIC: | 4.168e+06 |
| Df Residuals: | 1332158 | BIC: | 4.169e+06 |
| Df Model: | 52 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 11.4500 | 0.007 | 1622.391 | 0.000 | 11.436 | 11.464 |
| C(EDUCGROUP)[T.HS or Lower] | -0.8190 | 0.002 | -415.206 | 0.000 | -0.823 | -0.815 |
| C(EDUCGROUP)[T.Some College] | -0.5258 | 0.002 | -229.598 | 0.000 | -0.530 | -0.521 |
| C(STATE)[T.Alaska] | 0.3096 | 0.020 | 15.368 | 0.000 | 0.270 | 0.349 |
| C(STATE)[T.Arizona] | 0.1892 | 0.009 | 20.927 | 0.000 | 0.171 | 0.207 |
| C(STATE)[T.Arkansas] | -0.0115 | 0.011 | -1.011 | 0.312 | -0.034 | 0.011 |
| C(STATE)[T.California] | 0.3321 | 0.007 | 44.547 | 0.000 | 0.317 | 0.347 |
| C(STATE)[T.Colorado] | 0.2485 | 0.009 | 26.438 | 0.000 | 0.230 | 0.267 |
| C(STATE)[T.Connecticut] | 0.2879 | 0.011 | 26.685 | 0.000 | 0.267 | 0.309 |
| C(STATE)[T.Delaware] | 0.2101 | 0.017 | 12.377 | 0.000 | 0.177 | 0.243 |
| C(STATE)[T.District of Columbia] | 0.1492 | 0.019 | 7.957 | 0.000 | 0.112 | 0.186 |
| C(STATE)[T.Florida] | 0.1292 | 0.008 | 16.782 | 0.000 | 0.114 | 0.144 |
| C(STATE)[T.Georgia] | 0.1353 | 0.008 | 15.959 | 0.000 | 0.119 | 0.152 |
| C(STATE)[T.Hawaii] | 0.3276 | 0.016 | 20.756 | 0.000 | 0.297 | 0.359 |
| C(STATE)[T.Idaho] | 0.1722 | 0.013 | 12.855 | 0.000 | 0.146 | 0.198 |
| C(STATE)[T.Illinois] | 0.1615 | 0.008 | 19.615 | 0.000 | 0.145 | 0.178 |
| C(STATE)[T.Indiana] | 0.1111 | 0.009 | 12.114 | 0.000 | 0.093 | 0.129 |
| C(STATE)[T.Iowa] | 0.1146 | 0.011 | 10.379 | 0.000 | 0.093 | 0.136 |
| C(STATE)[T.Kansas] | 0.1148 | 0.011 | 10.056 | 0.000 | 0.092 | 0.137 |
| C(STATE)[T.Kentucky] | 0.0044 | 0.010 | 0.433 | 0.665 | -0.015 | 0.024 |
| C(STATE)[T.Louisiana] | -0.0544 | 0.010 | -5.377 | 0.000 | -0.074 | -0.035 |
| C(STATE)[T.Maine] | 0.1008 | 0.014 | 6.974 | 0.000 | 0.072 | 0.129 |
| C(STATE)[T.Maryland] | 0.3145 | 0.009 | 33.211 | 0.000 | 0.296 | 0.333 |
| C(STATE)[T.Massachusetts] | 0.3017 | 0.009 | 33.033 | 0.000 | 0.284 | 0.320 |
| C(STATE)[T.Michigan] | 0.0744 | 0.009 | 8.745 | 0.000 | 0.058 | 0.091 |
| C(STATE)[T.Minnesota] | 0.2056 | 0.009 | 21.657 | 0.000 | 0.187 | 0.224 |
| C(STATE)[T.Mississippi] | -0.0860 | 0.012 | -7.449 | 0.000 | -0.109 | -0.063 |
| C(STATE)[T.Missouri] | 0.0736 | 0.009 | 7.894 | 0.000 | 0.055 | 0.092 |
| C(STATE)[T.Montana] | 0.0928 | 0.016 | 5.769 | 0.000 | 0.061 | 0.124 |
| C(STATE)[T.Nebraska] | 0.0952 | 0.013 | 7.347 | 0.000 | 0.070 | 0.121 |
| C(STATE)[T.Nevada] | 0.2030 | 0.011 | 17.921 | 0.000 | 0.181 | 0.225 |
| C(STATE)[T.New Hampshire] | 0.3022 | 0.015 | 20.302 | 0.000 | 0.273 | 0.331 |
| C(STATE)[T.New Jersey] | 0.3490 | 0.009 | 39.952 | 0.000 | 0.332 | 0.366 |
| C(STATE)[T.New Mexico] | -0.0206 | 0.013 | -1.604 | 0.109 | -0.046 | 0.005 |
| C(STATE)[T.New York] | 0.1711 | 0.008 | 21.894 | 0.000 | 0.156 | 0.186 |
| C(STATE)[T.North Carolina] | 0.0749 | 0.008 | 8.922 | 0.000 | 0.058 | 0.091 |
| C(STATE)[T.North Dakota] | 0.1198 | 0.018 | 6.583 | 0.000 | 0.084 | 0.155 |
| C(STATE)[T.Ohio] | 0.0843 | 0.008 | 10.186 | 0.000 | 0.068 | 0.100 |
| C(STATE)[T.Oklahoma] | 0.0022 | 0.010 | 0.212 | 0.832 | -0.018 | 0.023 |
| C(STATE)[T.Oregon] | 0.1569 | 0.010 | 15.300 | 0.000 | 0.137 | 0.177 |
| C(STATE)[T.Pennsylvania] | 0.1367 | 0.008 | 16.719 | 0.000 | 0.121 | 0.153 |
| C(STATE)[T.Rhode Island] | 0.1667 | 0.016 | 10.170 | 0.000 | 0.135 | 0.199 |
| C(STATE)[T.South Carolina] | 0.0716 | 0.010 | 7.426 | 0.000 | 0.053 | 0.091 |
| C(STATE)[T.South Dakota] | 0.1230 | 0.018 | 7.014 | 0.000 | 0.089 | 0.157 |
| C(STATE)[T.Tennessee] | 0.0838 | 0.009 | 9.242 | 0.000 | 0.066 | 0.102 |
| C(STATE)[T.Texas] | 0.1618 | 0.008 | 21.426 | 0.000 | 0.147 | 0.177 |
| C(STATE)[T.Utah] | 0.3267 | 0.011 | 28.476 | 0.000 | 0.304 | 0.349 |
| C(STATE)[T.Vermont] | 0.1345 | 0.020 | 6.792 | 0.000 | 0.096 | 0.173 |
| C(STATE)[T.Virginia] | 0.2472 | 0.009 | 28.153 | 0.000 | 0.230 | 0.264 |
| C(STATE)[T.Washington] | 0.3137 | 0.009 | 35.163 | 0.000 | 0.296 | 0.331 |
| C(STATE)[T.West Virginia] | -0.0236 | 0.014 | -1.747 | 0.081 | -0.050 | 0.003 |
| C(STATE)[T.Wisconsin] | 0.1290 | 0.009 | 13.806 | 0.000 | 0.111 | 0.147 |
| C(STATE)[T.Wyoming] | 0.1154 | 0.021 | 5.544 | 0.000 | 0.075 | 0.156 |
| Omnibus: | 622883.291 | Durbin-Watson: | 2.000 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10602156.692 |
| Skew: | -1.829 | Prob(JB): | 0.00 |
| Kurtosis: | 16.327 | Cond. No. | 65.9 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
After controlling for state, the education coefficients remain large and highly significant. HS or Lower households earn approximately 56% less than BA or Higher households, and Some College households earn approximately 41% less. These figures are nearly identical to Model 1, indicating that the education premium is not simply a product of geographic sorting but persists strongly within states. R² increases modestly from 0.122 to 0.131, confirming that state explains some additional income variance but education remains the dominant factor.
8.6 Model Fit Comparison¶
Finally, fit statistics for both models are summarized to assess the incremental value of adding state fixed effects.
results_df = pd.DataFrame({'Model': ['Model 1: Education Only', 'Model 2: Education + State'],
'R-squared': [round(model1.rsquared, 4), round(model2.rsquared, 4)],
'Adj. R-squared': [round(model1.rsquared_adj, 4), round(model2.rsquared_adj, 4)],
'AIC': [round(model1.aic, 1), round(model2.aic, 1)],
'N': [int(model1.nobs), int(model2.nobs)]})
results_df
| | Model | R-squared | Adj. R-squared | AIC | N |
|---|---|---|---|---|---|
| 0 | Model 1: Education Only | 0.1217 | 0.1217 | 4182333.4 | 1332211 |
| 1 | Model 2: Education + State | 0.1310 | 0.1310 | 4168190.3 | 1332211 |
Adding state fixed effects improves R² from 0.1217 to 0.1310 and reduces AIC by approximately 14,000 points, confirming that geography explains meaningful additional variance beyond education alone. The modest absolute R² values are expected as household income is shaped by many factors beyond education and state. As such, both models should be interpreted as partial-association estimates rather than comprehensive explanations of income variation.
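The AIC comparison can be unpacked with a little arithmetic. The sketch below assumes the usual definition AIC = 2k − 2 ln L, with k counted as the model degrees of freedom plus one for the intercept (an assumption about how the reported AIC was computed), and uses the values from the table and the two regression summaries.

```python
# Values copied from the model fit comparison table and the two regression summaries.
aic_model1, aic_model2 = 4182333.4, 4168190.3
k_model1, k_model2 = 3, 53  # assumed: Df Model + 1 for the intercept

# AIC = 2k - 2*lnL, so a lower AIC means a better fit after the parameter penalty.
delta_aic = aic_model1 - aic_model2

# Undo the penalty to recover the raw log-likelihood improvement.
extra_params = k_model2 - k_model1
delta_loglik = (delta_aic + 2 * extra_params) / 2

print(f"AIC reduction: {delta_aic:,.1f}")
print(f"Log-likelihood gain: {delta_loglik:,.1f} for {extra_params} extra parameters")
```

The penalty for the 50 additional state parameters is only 100 AIC points, which is tiny next to the overall reduction, so the improvement is not an artifact of simply adding more terms.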
9.1 Statistical Significance of the Education Premium¶
The full interaction ANOVA confirms that the education premium is statistically significant and varies across states. Education group is the dominant predictor of log household income (F = 91,494.83, p < 0.001), far exceeding the state effect (F = 329.93, p < 0.001). Critically, the interaction term between education group and state is also significant (F = 21.36, p < 0.001), providing strong evidence to reject the null hypothesis that the education premium is the same across states.
The two-way ANOVA without the interaction term confirms these main effects hold independently: education group (F = 91,355.19, p < 0.001) and state (F = 329.43, p < 0.001) both contribute meaningfully to income variation.
9.2 State-Level Magnitude of the Education Premium¶
The weighted regression models quantify the magnitude of the premium. With BA or Higher as the reference category, HS or Lower households earn approximately 57% less nationally (p < 0.001). After adding state fixed effects, this figure remains nearly unchanged at 56% less (p < 0.001), confirming that the premium is not driven by geographic sorting but persists within states.
pivot.sort_values("EDUCATION_PREMIUM", ascending=False)[["STATE","EDUCATION_PREMIUM"]].head(5).reset_index(drop=True)
| | STATE | EDUCATION_PREMIUM |
|---|---|---|
| 0 | District of Columbia | 120429.0 |
| 1 | New York | 110971.0 |
| 2 | California | 109909.0 |
| 3 | Massachusetts | 108259.0 |
| 4 | Connecticut | 107822.0 |
The nominal top five states by education premium are the District of Columbia (USD 120,429), New York (USD 110,971), California (USD 109,909), Massachusetts (USD 108,259), and Connecticut (USD 107,822).
pivot.sort_values("EDUCATION_PREMIUM", ascending=True)[["STATE","EDUCATION_PREMIUM"]].head(5).reset_index(drop=True)
| | STATE | EDUCATION_PREMIUM |
|---|---|---|
| 0 | Wyoming | 51552.0 |
| 1 | Montana | 55647.0 |
| 2 | Hawaii | 56057.0 |
| 3 | Iowa | 56411.0 |
| 4 | South Dakota | 58429.0 |
The bottom five are Wyoming (USD 51,552), Montana (USD 55,647), Hawaii (USD 56,057), Iowa (USD 56,411), and South Dakota (USD 58,429).
This spread of nearly USD 70,000 between the lowest and highest premium states indicates substantial geographic variation in the financial return to a bachelor's degree.
9.3 Cost-of-Living Adjusted Education Premium¶
After applying the 2024 Cost of Living Index (COLI), the rankings shift considerably:
pivot.sort_values("REAL_PREMIUM", ascending=False)[["STATE","REAL_PREMIUM"]].head(5).reset_index(drop=True)
| | STATE | REAL_PREMIUM |
|---|---|---|
| 0 | Virginia | 100100.297915 |
| 1 | Connecticut | 96012.466607 |
| 2 | Illinois | 93615.466102 |
| 3 | Texas | 93020.496224 |
| 4 | Georgia | 92548.740416 |
The top five states by real premium are Virginia (USD 100,100), Connecticut (USD 96,012), Illinois (USD 93,615), Texas (USD 93,020), and Georgia (USD 92,549).
pivot.sort_values("REAL_PREMIUM", ascending=True)[["STATE","REAL_PREMIUM"]].head(5).reset_index(drop=True)
| | STATE | REAL_PREMIUM |
|---|---|---|
| 0 | Hawaii | 29993.044409 |
| 1 | Wyoming | 53981.151832 |
| 2 | Oregon | 57599.107143 |
| 3 | Montana | 58637.513172 |
| 4 | Alaska | 60454.765751 |
The bottom five are Hawaii (USD 29,993), Wyoming (USD 53,981), Oregon (USD 57,599), Montana (USD 58,638), and Alaska (USD 60,455).
The most striking shift is Hawaii, which already has one of the lowest nominal premiums (USD 56,057) and falls to the lowest real premium in the country (USD 29,993) after adjusting for its COLI of 186.9, nearly double the national average. California and Massachusetts similarly fall out of the top five once their above-average costs are factored in. Conversely, low cost-of-living states such as Arkansas, Mississippi, and Iowa see their adjusted premiums exceed their nominal values, meaning a degree in these states delivers more purchasing power than the raw figures suggest.
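The adjustment itself is a simple deflation. The sketch below assumes the real premium is the nominal premium divided by COLI/100, which is consistent with the Hawaii figures reported in this section; the function name `real_premium` and the second, low-cost state are hypothetical.

```python
def real_premium(nominal_premium, coli):
    """Deflate a nominal dollar premium by a cost-of-living index (national average = 100)."""
    return nominal_premium / (coli / 100)

# Hawaii's nominal premium and COLI as reported in this section.
hawaii_real = real_premium(56057.0, 186.9)
print(f"Hawaii real premium: USD {hawaii_real:,.0f}")

# A hypothetical state with the same nominal premium but a below-average COLI of 90
# would see its real premium rise instead.
low_cost_real = real_premium(56057.0, 90.0)
print(f"Hypothetical low-cost state: USD {low_cost_real:,.0f}")
```

Any state with a COLI above 100 has its premium scaled down and any state below 100 has it scaled up, which is why the rankings reshuffle so sharply.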
9.4 Hypothesis Evaluation¶
The directional hypothesis that high cost-of-living states (California, Washington, Hawaii) would show the largest education premiums and low cost-of-living states (Arkansas, Mississippi, Iowa) the smallest is only partially supported. California and Washington do show high nominal premiums, but after COLI adjustment both fall noticeably. Hawaii directly contradicts the prediction: it has one of the smallest nominal premiums and ranks last after adjustment, despite being one of the highest cost-of-living states. On the lower end, Arkansas, Iowa, and Mississippi all rank higher after adjustment than before, the opposite of what was predicted. As such, the results suggest that nominal income levels in high cost-of-living states do not translate into proportionally larger real returns to education once the cost of living is factored in.
10. Limitations¶
There are considerable limitations to our analysis:
Observational data: This analysis relies on cross-sectional survey data and cannot support causal inference. Households with different educational attainment differ systematically in many ways, meaning the measured income gap cannot be attributed to education alone.
Household vs. individual income: The outcome variable is total household income rather than individual earnings. This introduces noise from multi-earner households and household size, both of which affect the measured premium independently of the reference person's education level.
Limited covariate control: The regression models control only for education group and state. Key confounders such as age, occupation, industry, hours worked, and household size were not included, so the estimated premium may partly reflect these omitted factors.
Broad education groupings: Collapsing education into three categories loses detail. A professional graduate degree and a bachelor's degree are treated identically in the BA or Higher group, as are a GED and no formal schooling in the HS or Lower group.
COLI adjustment: The Cost of Living Index operates at the state level and does not capture geographic variation in living costs within states. Adjusting nominal income by a single value is a useful approximation but cannot be interpreted as a precise measure of real purchasing power.
11. References¶
Ruggles, S., Flood, S., Sobek, M., et al. (2024). IPUMS USA: Version 15.0 Dataset. Minneapolis, MN: IPUMS. https://doi.org/10.18128/D010.V15.0
Mincer, J. (1974). Schooling, Experience, and Earnings. National Bureau of Economic Research. https://www.nber.org/books-and-chapters/schooling-experience-and-earnings
Card, D. (1999). The causal effect of education on earnings. Handbook of Labor Economics, 3, 1801–1863. https://davidcard.berkeley.edu/papers/causal_educ_earnings.pdf
Scherer, Z. & King, M.D. (2025). How Education Impacted Income and Earnings From 2004 to 2024. U.S. Census Bureau. https://www.census.gov/library/stories/2025/09/education-and-income.html
Council for Community and Economic Research. (2025). Cost of Living Index: 2025 Q1. https://www.coli.org/press-release-for-immediate-release-2025-q1/