Final Project: The United States Census

Introduction

The stereotypical “American Dream” has always included owning property and your own house, but this ideal is becoming increasingly out of reach for the average American. The next generation of home buyers are having a difficult time purchasing property due to the high home prices, high rates of rent, and commercialized real estate. As a prospective home buyer within the next 5 to 10 years, I wanted to look into the trends of these soaring home prices and what demographics of people are being impacted by this hike in prices the most.

Data Dictionary

Loading Libraries

# Loading all of the libraries while hiding any extra output
library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)
library(scales)
library(ggplot2)
library(gapminder)
library(arrow)
library(ggthemes)
library(rmdformats)
library(stringr)
library(flextable)
library(tidytext)
library(wordcloud2)
library(gtrendsR)
library(patchwork)

Importing the Data Set

filePath <- "https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_state_2008-2021.csv" # Storing the filepath

census_data <- read_csv(filePath) # Reading in the file and storing it in census_data

## Rows: 676 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): geoid, county_state
## dbl (24): year, population, median_income, median_monthly_rent_cost, median_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Skimming the Data

glimpse(census_data) # Using the glimpse function to quickly view the data

## Rows: 676
## Columns: 26
## $ geoid                    <chr> "01", "02", "04", "05", "06", "08", "09", "10…
## $ county_state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "…
## $ year                     <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population               <dbl> 4661900, 686293, 6500180, 2855390, 36756666, …
## $ median_income            <dbl> 42666, 68460, 50958, 38815, 61021, 56993, 685…
## $ median_monthly_rent_cost <dbl> 631, 949, 866, 606, 1135, 848, 970, 917, 1011…
## $ median_monthly_home_cost <dbl> 742, 1393, 1177, 660, 1919, 1364, 1722, 1209,…
## $ prop_female              <dbl> 0.5154499, 0.4789995, 0.4986763, 0.5102035, 0…
## $ prop_male                <dbl> 0.4845501, 0.5210005, 0.5013237, 0.4897965, 0…
## $ prop_white               <dbl> 0.7028360, 0.6911290, 0.8005912, 0.7871510, 0…
## $ prop_black               <dbl> 0.261833802, 0.036257109, 0.036269457, 0.1550…
## $ prop_native              <dbl> 0.005140822, 0.127259057, 0.044114009, 0.0052…
## $ prop_asian               <dbl> 0.009768335, 0.046373779, 0.023755650, 0.0103…
## $ prop_hawaiin_islander    <dbl> 2.417469e-04, 5.484538e-03, 1.436883e-03, 6.7…
## $ prop_other_race          <dbl> 0.007177117, 0.012462607, 0.068086884, 0.0218…
## $ prop_multi_racial        <dbl> 0.013002209, 0.081033902, 0.025745902, 0.0197…
## $ prop_highschool          <dbl> 0.2604227, 0.2183097, 0.2106347, 0.2958009, 0…
## $ prop_GED                 <dbl> 0.05599385, 0.04658968, 0.03935913, 0.0582695…
## $ prop_some_college        <dbl> 0.05957806, 0.08116479, 0.07719126, 0.0687879…
## $ prop_college_no_degree   <dbl> 0.1546491, 0.2168623, 0.1816943, 0.1534883, 0…
## $ prop_associates          <dbl> 0.06831511, 0.08030577, 0.07837909, 0.0558112…
## $ prop_bachelors           <dbl> 0.1427926, 0.1758716, 0.1590928, 0.1251352, 0…
## $ prop_masters             <dbl> 0.05542474, 0.06465035, 0.06474227, 0.0433444…
## $ prop_professional        <dbl> 0.01338380, 0.01743700, 0.01623786, 0.0121880…
## $ prop_doctoral            <dbl> 0.008333932, 0.015125876, 0.010927963, 0.0072…
## $ prop_poverty             <dbl> 0.15695141, 0.08420041, 0.14718370, 0.1732244…

skim(census_data) # Using the skim function to skim through the data and look for missing values and outliers in the histograms

Data summary
Name	census_data
Number of rows	676
Number of columns	26
_______________________
Column type frequency:
character	2
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
geoid	0	1	2	2	0	52	0
county_state	0	1	4	20	0	52	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
year	1	2014.08	3.88	2008.00	2011.00	2014.00	2017.00	2021.00	▇▇▅▇▅
population	1	6190132.93	6978987.03	532668.00	1786156.75	4257700.00	6972359.00	39557045.00	▇▂▁▁▁
median_income	1	55793.43	12152.82	18314.00	47512.50	54740.00	63255.75	92266.00	▁▅▇▃▁
median_monthly_rent_cost	1	900.19	230.04	419.00	735.75	847.50	1024.50	1774.00	▂▇▃▁▁
median_monthly_home_cost	1	1084.63	360.03	241.00	833.50	987.50	1314.25	2215.00	▁▇▅▂▁
prop_female	1	0.51	0.01	0.47	0.50	0.51	0.51	0.53	▁▁▇▇▁
prop_male	1	0.49	0.01	0.47	0.49	0.49	0.50	0.53	▁▇▇▁▁
prop_white	1	0.76	0.14	0.22	0.68	0.78	0.86	0.96	▁▁▃▇▇
prop_black	1	0.11	0.11	0.00	0.03	0.07	0.16	0.53	▇▃▂▁▁
prop_native	1	0.02	0.03	0.00	0.00	0.00	0.01	0.16	▇▁▁▁▁
prop_asian	1	0.04	0.05	0.00	0.01	0.02	0.04	0.39	▇▁▁▁▁
prop_hawaiin_islander	1	0.00	0.01	0.00	0.00	0.00	0.00	0.11	▇▁▁▁▁
prop_other_race	1	0.03	0.03	0.00	0.01	0.02	0.05	0.30	▇▁▁▁▁
prop_multi_racial	1	0.04	0.04	0.01	0.02	0.03	0.03	0.36	▇▁▁▁▁
prop_highschool	1	0.24	0.04	0.12	0.22	0.24	0.27	0.35	▁▃▇▅▁
prop_GED	1	0.04	0.01	0.02	0.04	0.04	0.05	0.07	▃▇▆▃▁
prop_some_college	1	0.06	0.01	0.01	0.06	0.06	0.07	0.09	▁▁▆▇▂
prop_college_no_degree	1	0.15	0.02	0.09	0.13	0.15	0.16	0.22	▂▅▇▃▁
prop_associates	1	0.08	0.02	0.03	0.07	0.08	0.09	0.15	▁▃▇▂▁
prop_bachelors	1	0.19	0.03	0.10	0.17	0.19	0.21	0.27	▁▅▇▅▁
prop_masters	1	0.08	0.03	0.04	0.06	0.07	0.09	0.23	▇▆▁▁▁
prop_professional	1	0.02	0.01	0.01	0.02	0.02	0.02	0.10	▇▁▁▁▁
prop_doctoral	1	0.01	0.01	0.01	0.01	0.01	0.02	0.05	▇▂▁▁▁
prop_poverty	1	0.14	0.05	0.07	0.11	0.14	0.16	0.46	▇▅▁▁▁

Data Set Variable Types

The variables within the data set are of two variable types, character and numeric of type double.

The two character variables are:

geoid: Geographic region ID
county_state: Geographic region

The twenty-four double variables in this data set are:

year: The year
Population: The population
median_income: Median income in dollars
median_monthly_rent_cost: Median monthly rent costs for renters in dollars
median_monthly_home_cost: Median monthly housing costs for homeowners in dollars
prop_female: Proportion of people who are female
prop_male: Proportion of people who are male
prop_white: Proportion of people who are white alone
prop_black: Proportion of people who are black or African American alone
prop_native: Proportion of people who are American Indian and Alaska Native alone
prop_asian: Proportion of people who are Asian alone
prop_hawaiin_islander: Proportion of people who are Natice Hawaiian and Other Pacific Islander alone
prop_other_race: Proportion of people who are some other race alone
prop_multi_racial: Proportion of people are two or more races
prop_highschool: Proportion of people 25 and older whose highest education-level is high school
prop_GED: Proportion of people 25 and older whose highest education-level is a GED
prop_some_college: Proportion of people 25 and older whose highest education-level is some, but less than 1 year of college
prop_college_no_degree: Proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree
prop_associates: Proportion of people 25 and older whose highest education-level is an Associates degree
prop_bachelors: Proportion of people 25 and older whose highest education-level is a Bachelors degree
prop_masters: Proportion of people 25 and older whose highest education-level is a Masters degree
prop_professional: Proportion of people 25 and older whose highest education-level is a Professional degree
prop_doctoral: Proportion of people 25 and older whose highest education-level is a Doctoral degree
prop_poverty: Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size

These 26 variables can also be broken up by categories and subcategories. The two categories are non-proportional and proportional.

Non-proportional:
- Demographics/Characteristics: geoid, county_state, year, population, and median_income
Proportional:
- Gender: prop_female and prop_male
- Race: prop_white, prop_black, prop_native, prop_asian, prop_hawaiin_islander, prop_other_race, and prop_multi_racial
- Education: prop_highschool, prop_GED, prop_some_college, prop_college_no_degree, prop_associates, prop_bachelors, prop_masters, prop_professional, and prop_doctoral
- Poverty: prop_poverty

Missing Values

After analyzing the raw data using the skim() and glimpse() functions, there are no missing data values for any of the variables.

This is very important to check, as it can heavily skew or misrepresent the sample or population, if extrapolated.

Creating Tables Summarizing the Data

Comparing median income from the years 2008 through 2021.

income_table <- census_data %>% 
  group_by(year) %>% 
  skim() %>% 
  dplyr::filter(skim_variable == "median_income") # Grouping by year, filtering by "median_income" and storing it in the variable income_table

print(income_table) # Printing the table

## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             676       
## Number of columns          26        
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            year      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable year n_missing complete_rate   mean     sd    p0    p25    p50
##  1 median_income 2008         0             1 51454.  9667. 18401 45945. 50173 
##  2 median_income 2009         0             1 49833.  9489. 18314 44470. 48152.
##  3 median_income 2010         0             1 49375.  9134. 18862 43449  48332.
##  4 median_income 2011         0             1 50279.  9479. 18660 44146. 49178 
##  5 median_income 2012         0             1 51349.  9694. 19429 45132. 50482 
##  6 median_income 2013         0             1 52518.  9938. 19183 46596. 51335 
##  7 median_income 2014         0             1 53770. 10380. 18928 47512. 52563 
##  8 median_income 2015         0             1 55680. 10783. 18626 49212. 54442 
##  9 median_income 2016         0             1 57411. 11064. 20078 50791  56406 
## 10 median_income 2017         0             1 59449. 11630. 19343 52712. 58286.
## 11 median_income 2018         0             1 61211. 11995. 20296 55116. 59760.
## 12 median_income 2019         0             1 64645. 12703. 20474 57392  63115 
## 13 median_income 2021         0             1 68340. 12961. 22237 62190. 66539 
##       p75  p100 hist 
##  1 57067. 70545 ▁▁▇▇▃
##  2 55476. 69272 ▁▁▇▅▃
##  3 54923. 68854 ▁▁▇▅▃
##  4 56450. 70004 ▁▁▇▅▃
##  5 57180  71122 ▁▁▇▆▃
##  6 58770. 72483 ▁▁▇▆▃
##  7 61017. 73971 ▁▁▇▆▃
##  8 63056  75847 ▁▁▇▇▃
##  9 65620. 78945 ▁▁▇▆▃
## 10 68366. 82372 ▁▁▇▆▅
## 11 70590. 85203 ▁▁▇▆▅
## 12 74810. 92266 ▁▁▇▅▃
## 13 77751. 90203 ▁▁▅▇▅

The table shows that the median income dipped slightly in 2009 and 2010, likely from the housing market crash, and then grew steadily through 2021.

Creating a Data Dictionary

# Data Dictionary
dataDictionary <- tibble(Variable = colnames(census_data), # Creating a tibble of the variable descriptions and storing it in dataDictionary
                         Description = c(
                           "Geographic Region ID", 
                           "Geographic Region", 
                           "Year (2008-2021)", 
                           "Population", 
                           "Median Income (Dollars)", 
                           "Median Monthly Rent Costs for Renters (Dollars)", 
                           "Median Monthly Housing Costs for Homeowners (Dollars)", 
                           "Proportion of People who are Female", 
                           "Proportion of People who are Male", 
                           "Proportion of People who are White Alone", 
                           "Proportion of People who are Black or African American Alone", 
                           "Proportion of People who are American Indian and Alaska Native Alone", 
                           "Proportion of People who are Asian Alone", 
                           "Proportion of People who are Native Hawaiian and Other Pacific Islander Alone", 
                           "Proportion of People who are Some Other Race Alone", 
                           "Proportion of People who are Two or More Races", 
                           "Proportion of People 25 and Older Whose Highest Education Level is High School", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a GED", 
                           "Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College", 
                           "Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree", 
                           "Proportion of People 25 and Older Living in Poverty"
                         ),
                         Type = map_chr(census_data, .f = function(x){typeof(x)[1]}), # Creating a character vector of the types of each variable
                         Class = map_chr(census_data, .f = function(x){class(x)[1]})) # Creating a character vector of the classes of each variable

Displaying the Data Dictionary

# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary, cwidth = 2) # Printing the dataDictionary table using the flextable R package

Variable	Description	Type	Class
geoid	Geographic Region ID	character	character
county_state	Geographic Region	character	character
year	Year (2008-2021)	double	numeric
population	Population	double	numeric
median_income	Median Income (Dollars)	double	numeric
median_monthly_rent_cost	Median Monthly Rent Costs for Renters (Dollars)	double	numeric
median_monthly_home_cost	Median Monthly Housing Costs for Homeowners (Dollars)	double	numeric
prop_female	Proportion of People who are Female	double	numeric
prop_male	Proportion of People who are Male	double	numeric
prop_white	Proportion of People who are White Alone	double	numeric
prop_black	Proportion of People who are Black or African American Alone	double	numeric
prop_native	Proportion of People who are American Indian and Alaska Native Alone	double	numeric
prop_asian	Proportion of People who are Asian Alone	double	numeric
prop_hawaiin_islander	Proportion of People who are Native Hawaiian and Other Pacific Islander Alone	double	numeric
prop_other_race	Proportion of People who are Some Other Race Alone	double	numeric
prop_multi_racial	Proportion of People who are Two or More Races	double	numeric
prop_highschool	Proportion of People 25 and Older Whose Highest Education Level is High School	double	numeric
prop_GED	Proportion of People 25 and Older Whose Highest Education Level is a GED	double	numeric
prop_some_college	Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College	double	numeric
prop_college_no_degree	Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree	double	numeric
prop_associates	Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree	double	numeric
prop_bachelors	Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree	double	numeric
prop_masters	Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree	double	numeric
prop_professional	Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree	double	numeric
prop_doctoral	Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree	double	numeric
prop_poverty	Proportion of People 25 and Older Living in Poverty	double	numeric

Data Cleaning

To make the data ready for transformations and visualizations, the data must be first cleaned and organized. This will be achieved by merging data sets and string manipulation.

Merging Data Sets

Sorting States by Region

This function takes a list of every state, categorized by region, and matches each state in the data set with its respective region. This was made to be used later to be able to analyze data based on geographical region.

# Creating the region variable
# Used chatGPT to help list all states with their respective regions
census_data <- census_data %>%
  mutate(
    region = case_when(
      str_detect(county_state, "California|Oregon|Washington|Alaska|Hawaii|Arizona|Colorado|Idaho|Montana|Nevada|New Mexico|Utah|Wyoming") ~ "West", # Detecting West regional states
      str_detect(county_state, "North Dakota|South Dakota|Nebraska|Kansas|Minnesota|Iowa|Missouri|Wisconsin|Illinois|Michigan|Indiana|Ohio") ~ "Midwest", # Detecting Midwest regional states
      str_detect(county_state, "Texas|Oklahoma|Arkansas|Louisiana|Mississippi|Alabama|Florida|Georgia|South Carolina|North Carolina|Tennessee|Kentucky|Virginia|West Virginia|Maryland|Delaware|District of Columbia|Puerto Rico") ~ "South", # Detecting Southern regional states
      str_detect(county_state, "Maine|New Hampshire|Vermont|Massachusetts|Rhode Island|Connecticut|New York|New Jersey|Pennsylvania") ~ "Northeast", # Detecting Northeast regional states
      TRUE ~ NA_character_
    )
  )

# Creating summary table
regions <- census_data %>%
  group_by(region) %>%
  summarize(Unique_States = n_distinct(county_state), .groups = "drop")

# Display the summary table
flextable::flextable(regions)

region	Unique_States
Midwest	12
Northeast	9
South	18
West	13

This table shows the number of states that were identified in each region. Only the unique states are included so that each state listed for each year is not included.

String Manipulation

# State Abbreviations
census_data <- census_data %>%
  mutate(
    # Extract State Abbreviation
    state_abbreviation = str_extract(county_state, "[A-Za-z]{2}"), # Used chatGPT for code to find state abbreviations
    # Capitalize State Name
    capitalized_state_name = str_to_title(county_state),
    # Combine Abbreviation and Capitalized Name
    abbreviations = str_c(capitalized_state_name, "(", state_abbreviation, ")")
  )

# Creating a data table
table_data <- slice_head(select(census_data, county_state, abbreviations), n = 8) # Showing the first 8 states in the table

# Printing the table
flextable(table_data)

county_state	abbreviations
Alabama	Alabama(Al)
Alaska	Alaska(Al)
Arizona	Arizona(Ar)
Arkansas	Arkansas(Ar)
California	California(Ca)
Colorado	Colorado(Co)
Connecticut	Connecticut(Co)
Delaware	Delaware(De)

Exploratory Data Analysis

This section consists of data transformations to display the data.

Tables of Summary Statistics

This sections displays tables with statistics that summarize the data.

Percentage of Each Race by Region

This table displays the average percentage of each race within each region. The low percentage races were lumped together into the Average_Other variable.

# Creating the combined_other_race variable
census_data <- census_data %>% 
  mutate(
    combined_other_race = prop_hawaiin_islander + prop_other_race + prop_multi_racial
  )

# Calculating the average total percentage of race by region
raceByRegion <- census_data %>%
  group_by(region) %>%
  summarize(
    Average_White = mean(prop_white),
    Average_Black = mean(prop_black),
    Average_Native = mean(prop_native),
    Average_Asian = mean(prop_asian),
    Average_Other = mean(combined_other_race),
    .groups = "drop"
  )

# Display the table
flextable::flextable(raceByRegion)

region	Average_White	Average_Black	Average_Native	Average_Asian	Average_Other
Midwest	0.8358631	0.07499745	0.016292110	0.02477064	0.04807675
Northeast	0.8114821	0.07479451	0.003074737	0.04279815	0.06785052
South	0.6883685	0.21240546	0.007785882	0.02477454	0.06666563
West	0.7439349	0.02991676	0.034617635	0.06905562	0.12247509

Finding the median income after home or rent expenses.

This table finds how much money the average American had in income each year after home expenses, broken up by mortgage expenses and rent expenses.

# Creating the expenses variable
expenses <- census_data %>% 
  group_by(year) %>%
  summarize(
    income_after_home_expenses = label_currency(mean((median_income - (12 * (median_monthly_home_cost))))), # Calculates the average median anual income after home costs
    income_after_rent_expenses = label_currency(mean((median_income - (12 * (median_monthly_rent_cost))))) # Calculates the average median anual income after rent costs
    )
# Having the year displayed without commas
expenses$year <- as.character(expenses$year)

# Displaying the table
flextable::flextable(expenses, cwidth = 2.5)

year	income_after_home_expenses	income_after_rent_expenses
2008	$38,149.85	$42,160.85
2009	$36,626.94	$40,314.63
2010	$36,280.58	$39,718.35
2011	$37,274.35	$40,386.96
2012	$38,607.98	$41,286.29
2013	$40,146.69	$42,186.00
2014	$41,282.81	$43,112.81
2015	$42,990.21	$44,745.67
2016	$44,750.90	$46,264.98
2017	$46,529.56	$47,965.63
2018	$47,997.21	$49,302.90
2019	$51,144.75	$52,364.60
2021	$54,331.17	$55,075.87

Obataining a Frequency Table

The overall average percentage of people in poverty was 14.5%. This table compares male vs. female dominated states and having a poverty percentage of greater than 14.5%

# Used chatgpt to help with ifelse statements
# Creating the freq_table variable
freq_table <- census_data %>%
  mutate(Gender_Group = ifelse(prop_female > 0.5, "Female", "Male"), # Finds female or male dominated states
         Poverty_Group = ifelse(prop_poverty > 0.145, "High Poverty", "Low Poverty")) %>% # Finds if they are above or below the national average poverty percentage
  group_by(Gender_Group, Poverty_Group) %>%
  summarize(Frequency = n(), 
            .groups = "drop")

# Displaying the frequency table using flextable
flextable::flextable(freq_table, cwidth = 1.5)

Gender_Group	Poverty_Group	Frequency
Female	High Poverty	257
Female	Low Poverty	286
Male	High Poverty	17
Male	Low Poverty	116

This table shows that most states are predominantly female, but about half of them are below the national poverty percent average. Male dominated states seem to have a much lower poverty rate.

Data Visualizations

This sections displays graphical figures that summarize the data.

Plotting Median Monthly Housing Costs Over the Years in the US

Plotting monthly home and rent expenses over from the years 2008 through 2021.

# Creating the variable avg_costs
avg_costs <- census_data %>%
  pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost), # Used chatgpt to help combine these two plot together
               names_to = "cost_type", 
               values_to = "value") %>%
  group_by(year, cost_type) %>%
  summarize(avg_value = mean(value), 
            .groups = "drop")

# Creating the variable housingLineChart to plot home and rent expenses
housingLineChart <- avg_costs %>%
  ggplot(aes(x = year, 
             y = avg_value, 
             color = cost_type, 
             group = cost_type)) +
  geom_line(linewidth = 1) +
  labs(title = "Median Monthly Housing Costs Over the Years in the US",
       x = "Year",
       y = "Average Median Monthly Cost",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() + 
  theme_bw() + 
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) # Formats the y-axis to use a currency format

# Displays the graph
housingLineChart

This graph shows that it used to be much cheaper in the US to rent property rather than own it. In 2021, the prices became increasingly comparable.

Percentage of Degree Holders Over the Years in the US

This graph displays the percentage of the population that had some kind of college degree vs. the percentage of the population with no degree over the years.

# Creating the combined_degree_holders variable combining the percentages of all degree holding Americans
census_data <- census_data %>%
  mutate(
    combined_degree_holders = prop_associates + prop_bachelors + prop_masters + prop_professional + prop_doctoral
  )

# Creating the combined_non_degree_holders variable combining the percentages of all non-degree holding Americans
census_data <- census_data %>%
  mutate(
    combined_non_degree_holders = prop_highschool + prop_GED + prop_some_college + prop_college_no_degree
  )

# Combining degree holders and non-degree holders
long_data <- census_data %>%
  select(year, combined_degree_holders, combined_non_degree_holders) %>%
  pivot_longer(cols = c(combined_degree_holders, combined_non_degree_holders),
               names_to = "Education_Level",
               values_to = "Percentage") %>%
  group_by(year, Education_Level) %>%
  summarize(avg_perc = mean(Percentage), 
            .groups = "drop")

# Plotting the data
degreeLineChart <- ggplot(long_data, aes(x = year, 
                                           y = avg_perc, 
                                           color = `Education_Level`, 
                                           group = `Education_Level`)) +
  geom_line(linewidth = 1) +
  labs(title = "Education Level Over the Years in the US",
       x = "Year",
       y = "Combined Percentage",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")

# Plotting the graph
degreeLineChart

This graph shows that the percentage of degree holders in the US has almost surpassed the percentage of non-degree holders since 2008, starting with nearly a 20% difference in percentage.

Median Income by Region Over the Years in the US

This plot displays the median income by region in the US between the years 2008 and 2021.

# Calculate average median income per year and region
average_income_per_year <- census_data %>%
  group_by(year, region) %>%
  summarize(avg_median_income = mean(median_income), .groups = "drop")

# Plotting the data
incomeLineChart <- average_income_per_year %>% 
  ggplot(aes(x = year, 
             y = avg_median_income, 
             color = region, 
             group = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Median Income by Region",
       x = "Year",
       y = "Average Median Income",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() +
  theme_classic() +
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
  guides(color = guide_legend(title = "Region"))

# Displaying the graph
incomeLineChart

This shows that growth and decline rates of average median income among each region over the years has stayed consistent with no overlap between any region.

Average Median Monthly Housing Costs by Region

This plot displays the average median monthly housing costs by region in the US between the years 2008 and 2021.

# Calculate average median home and rental costs per year and region
avg_costs_by_region <- census_data %>%
  pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost), 
               names_to = "cost_type", 
               values_to = "value") %>%
  group_by(year, region) %>%
  summarize(avg_value = mean(value), .groups = "drop")

# Plotting the data
livingCostsLineChart <- avg_costs_by_region %>%
  ggplot(aes(x = year, 
             y = avg_value, 
             color = region, 
             group = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Median Monthly Housing Costs by Region",
       x = "Year",
       y = "Average Median Monthly Cost",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() + 
  theme_classic() + 
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
  guides(color = guide_legend(title = "Region"))

# Displaying the graph
livingCostsLineChart

This shows that growth and decline rates of average median monthly housing costs among each region over the years has stayed consistent with no overlap between any region.

It is worth noting that the Midwest average median income was consistently higher than that of the south, whereas when it comes to housing costs, it’s the opposite.

Median Income vs. Percentage of Degree Holders by Region, Sized by Population

This graph compares the average median income to the percentage of degree holders, organized by region, and the points are sized by the state populations.

# Creating the incomeDegree variable
incomeDegree <- census_data %>%
  ggplot(aes(
    x = combined_degree_holders,
    y = median_income,
    color = region, # Coloring by region
    size = population # Setting the size to population size
  )) +
  geom_point(alpha = 0.2) +
  scale_color_colorblind() +
  labs(
    title = "Median Income vs. Percentage of Degree Holders by Region",
    x = "Combined Degree Holders (%)",
    y = "Median Income",
    caption = "Data source: census_data_state_2008-2021.csv", 
    color = NULL
  ) +
  facet_grid(. ~ region) + # Creating the four individual graphs based on region
  theme(legend.position = "bottom") +  
  geom_smooth(method = "lm", se = FALSE, size = 1) + 
  guides(
    size = guide_legend(title = "Population Size (Millions of People)"), 
    color = FALSE
  ) + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + 
  scale_size_continuous(labels = scales::comma_format(scale = 1e-6, suffix = "M")) # Changing the legend to show the number in millions of people rather than in scientific notation

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Displaying the graph
incomeDegree

## `geom_smooth()` using formula = 'y ~ x'

This table shows that the south had the greatest growth in percentage of people that obtained degrees, which increased the median income most of any other region. The other three regions had consistent growth with one another.

Comparing Percent Poverty, Monthly Housing Costs, and Median Income Over the Years

This plot shows a grouped bar chart displaying the year over year percent change in percent poverty, monthly housing expenses, and median income between the years 2008 and 2021.

# Calculating average values per year
avg_data <- census_data %>%
  group_by(year) %>%
  summarize(
    avg_percent_poverty = mean(prop_poverty),
    avg_monthly_housing_costs = mean((median_monthly_home_cost + median_monthly_rent_cost) / 2),
    avg_median_income = mean(median_income)
  )

# Calculating percentage changes
avg_data_percent_change <- avg_data %>% # Used chatGPT to help with this
  mutate(
    median_income_change = (avg_median_income / lag(avg_median_income) - 1) * 100,
    housing_costs_change = ((avg_monthly_housing_costs + lag(avg_monthly_housing_costs)) / (2 * lag(avg_monthly_housing_costs)) - 1) * 100,
    percent_poverty_change = (avg_percent_poverty / lag(avg_percent_poverty) - 1) * 100
  )

# Reshaping data for plotting
avg_data_change_long <- avg_data_percent_change %>% 
  pivot_longer(cols = c(median_income_change, housing_costs_change, percent_poverty_change),
               names_to = "variable",
               values_to = "percent_change")

# Plotting the percent change
percentChange <- avg_data_change_long %>%
  ggplot(aes(x = year, y = percent_change, fill = variable)) +
  geom_col(position = "dodge", width = 0.7, color = "white") +
  labs(title = "Percentage Change in Median Income, Housing Costs, and Poverty Over the Years",
       x = "Year",
       y = "Percentage Change",
       fill = "Variable",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_fill_manual(values = c("skyblue", "orange", "lightgreen")) + # Changing the color of each bar
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

percentChange

Housing cost changes closely mimic median income changes. Percent poverty mirrored housing cost changes and median income changes. This makes sense that as median income increases, poverty rates decrease.It should also be noted that the year 2020 is missing from the plot. This is likely due to the COVID-19 pandemic.

Monte Carlo Methods of Inference

For this Monte Carlo Simulation, I will be testing whether or not there has been a statistically significant difference in average monthly home costs between male and female dominated states.

# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
sim_data <- census_data %>% select(year, prop_male, prop_female, median_monthly_home_cost)

# Specify your observed test statistic
observed_statistic <- mean(sim_data$median_monthly_home_cost[sim_data$prop_male > sim_data$prop_female]) -
                      mean(sim_data$median_monthly_home_cost[sim_data$prop_male < sim_data$prop_female])

# Number of simulations
num_simulations <- 10000

# Initialize an empty vector to store simulated test statistics
simulated_stats <- numeric(num_simulations)

# Run the Monte Carlo simulation
for (i in 1:num_simulations) {
# Shuffle the data between the two groups based on proportion of males
  shuffled_data <- sim_data %>%
    mutate(shuffled_group = ifelse(runif(n()) < 0.5, "A", "B"))

# Calculate the test statistic on the shuffled data
  shuffled_statistic <- mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "A"]) -
                        mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "B"])

# Stores the simulated test statistic
  simulated_stats[i] <- shuffled_statistic
}

# Plot the null distribution
null_dist_plot <- ggplot() +
  geom_density(aes(x = simulated_stats), fill = "skyblue", alpha = 0.6) +
  geom_vline(xintercept = observed_statistic, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = quantile(simulated_stats, 0.975), linetype = "dotted", size = 1, color = "blue") +
  geom_vline(xintercept = quantile(simulated_stats, 0.025), linetype = "dotted", size = 1, color = "blue") +
  labs(title = "Monte Carlo Simulation of Home Costs Test",
       x = "Difference in Average Home Costs (Proportion of Males > Females)",
       y = "Density") +
  theme_minimal()

# Displaying the plot
print(null_dist_plot)

# Calculating the p-value
p_value <- mean(simulated_stats >= observed_statistic) + mean(simulated_stats <= -observed_statistic)

# Displaying the p-value
cat("Observed Test Statistic:", observed_statistic, "\n")

## Observed Test Statistic: 74.92805

cat("p-value:", sprintf("%.4f", p_value), "\n") # Rounding to 4 decimal places

## p-value: 0.0058

# Interpretation
if (p_value < 0.05) {
  cat("At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.\n")
} else {
  cat("At the 95% significance level, there is not statistically significant evidence that there is a difference in average home costs between states with higher proportions of males.\n")
}

## At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.

Bootstrap Methods of Inference

Bootstrapping is used to test for the confidence interval of median monthly home costs.

# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
bootstrap_data <- census_data %>% select(median_monthly_home_cost)

# Function to calculate the median
calculate_median <- function(data) {
  return(median(data))
}

# Number of bootstrap samples
num_bootstrap_samples <- 1000

# Initializing variable to store bootstrapped samples
bootstrap_medians <- numeric(num_bootstrap_samples)

# Running the bootstrap procedure
for (i in 1:num_bootstrap_samples) {
# Sampling with replacement
  bootstrap_sample <- sample(bootstrap_data$median_monthly_home_cost, replace = TRUE)
  
# Calculating the median for the bootstrap sample
  bootstrap_medians[i] <- calculate_median(bootstrap_sample)
}

# Calculating the 95% bootstrap confidence interval
ci_lower <- quantile(bootstrap_medians, 0.025) %>% round(., 2) # Rounding upper and lower to two decimal places to reflect currency
ci_upper <- quantile(bootstrap_medians, 0.975) %>% round(., 2)

# Displaying the bootstrap distribution
bootstrap_dist_plot <- ggplot() +
  geom_density(aes(x = bootstrap_medians), fill = "skyblue", alpha = 0.6) +
  geom_vline(xintercept = ci_lower, linetype = "dotted", size = 1, color = "blue") +
  geom_vline(xintercept = ci_upper, linetype = "dotted", size = 1, color = "blue") +
  labs(title = "Bootstrap Distribution of Median Home Costs",
       x = "Median Home Costs",
       y = "Density") +
  theme_minimal() + 
    scale_x_continuous(labels = scales::dollar_format(prefix = "$")) # Formatting to show currency in the x-axis

# Displaying the plot
print(bootstrap_dist_plot)

# Interpretation
cat("95% Bootstrap Confidence Interval for Median Home Costs:", 
    sprintf("[%.2f, %.2f]\n", ci_lower, ci_upper))

## 95% Bootstrap Confidence Interval for Median Home Costs: [962.00, 1025.54]

Concluding from the bootstrapping inference, we are 95% confident that the true median home costs is between $962 and $1025.54.

Conclusion

This analysis has shown some of the recent trends between the years 2008 and 2021 regarding income, demographics, education, and housing costs. The overall trends suggests that the cost of living, especially when renting, is on the rise, as well as education levels. It seems that in the near future, the number of degree holders will outweigh the number of non-degree holders. This as well has caused the median income to rise steadily with the growing education rates. The southern states, in particular, had the greatest growth in both education levels and median income. The median income in the southern states has nearly tripled between the years 2008 and 2021. The steady rise in median income has caused the percentage of people in poverty to steadily decrease as well. However, in 2021 there was a sharp rise in the percent change of people in poverty as well as the change in housing costs. This likely could have been due to the COVID-19 pandemic. At the beginning of this project, I was very concerned about the rising cost of living and housing, but after thoroughly analyzing the national data, the trajectory of costs and income are not as bad as I originally perceived them to be.