Final Project: The United States Census

Introduction

The stereotypical “American Dream” has always included owning property and your own house, but this ideal is becoming increasingly out of reach for the average American. The next generation of home buyers are having a difficult time purchasing property due to the high home prices, high rates of rent, and commercialized real estate. As a prospective home buyer within the next 5 to 10 years, I wanted to look into the trends of these soaring home prices and what demographics of people are being impacted by this hike in prices the most.

Data Dictionary

Loading Libraries

# Loading all of the libraries while hiding any extra output
library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)
library(scales)
library(ggplot2)
library(gapminder)
library(arrow)
library(ggthemes)
library(rmdformats)
library(stringr)
library(flextable)
library(tidytext)
library(wordcloud2)
library(gtrendsR)
library(patchwork)

Importing the Data Set

filePath <- "https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_state_2008-2021.csv" # Storing the filepath

census_data <- read_csv(filePath) # Reading in the file and storing it in census_data
## Rows: 676 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): geoid, county_state
## dbl (24): year, population, median_income, median_monthly_rent_cost, median_...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Skimming the Data

glimpse(census_data) # Using the glimpse function to quickly view the data
## Rows: 676
## Columns: 26
## $ geoid                    <chr> "01", "02", "04", "05", "06", "08", "09", "10…
## $ county_state             <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "…
## $ year                     <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population               <dbl> 4661900, 686293, 6500180, 2855390, 36756666, …
## $ median_income            <dbl> 42666, 68460, 50958, 38815, 61021, 56993, 685…
## $ median_monthly_rent_cost <dbl> 631, 949, 866, 606, 1135, 848, 970, 917, 1011…
## $ median_monthly_home_cost <dbl> 742, 1393, 1177, 660, 1919, 1364, 1722, 1209,…
## $ prop_female              <dbl> 0.5154499, 0.4789995, 0.4986763, 0.5102035, 0…
## $ prop_male                <dbl> 0.4845501, 0.5210005, 0.5013237, 0.4897965, 0…
## $ prop_white               <dbl> 0.7028360, 0.6911290, 0.8005912, 0.7871510, 0…
## $ prop_black               <dbl> 0.261833802, 0.036257109, 0.036269457, 0.1550…
## $ prop_native              <dbl> 0.005140822, 0.127259057, 0.044114009, 0.0052…
## $ prop_asian               <dbl> 0.009768335, 0.046373779, 0.023755650, 0.0103…
## $ prop_hawaiin_islander    <dbl> 2.417469e-04, 5.484538e-03, 1.436883e-03, 6.7…
## $ prop_other_race          <dbl> 0.007177117, 0.012462607, 0.068086884, 0.0218…
## $ prop_multi_racial        <dbl> 0.013002209, 0.081033902, 0.025745902, 0.0197…
## $ prop_highschool          <dbl> 0.2604227, 0.2183097, 0.2106347, 0.2958009, 0…
## $ prop_GED                 <dbl> 0.05599385, 0.04658968, 0.03935913, 0.0582695…
## $ prop_some_college        <dbl> 0.05957806, 0.08116479, 0.07719126, 0.0687879…
## $ prop_college_no_degree   <dbl> 0.1546491, 0.2168623, 0.1816943, 0.1534883, 0…
## $ prop_associates          <dbl> 0.06831511, 0.08030577, 0.07837909, 0.0558112…
## $ prop_bachelors           <dbl> 0.1427926, 0.1758716, 0.1590928, 0.1251352, 0…
## $ prop_masters             <dbl> 0.05542474, 0.06465035, 0.06474227, 0.0433444…
## $ prop_professional        <dbl> 0.01338380, 0.01743700, 0.01623786, 0.0121880…
## $ prop_doctoral            <dbl> 0.008333932, 0.015125876, 0.010927963, 0.0072…
## $ prop_poverty             <dbl> 0.15695141, 0.08420041, 0.14718370, 0.1732244…
skim(census_data) # Using the skim function to skim through the data and look for missing values and outliers in the histograms
Data summary
Name census_data
Number of rows 676
Number of columns 26
_______________________
Column type frequency:
character 2
numeric 24
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
geoid 0 1 2 2 0 52 0
county_state 0 1 4 20 0 52 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2014.08 3.88 2008.00 2011.00 2014.00 2017.00 2021.00 ▇▇▅▇▅
population 0 1 6190132.93 6978987.03 532668.00 1786156.75 4257700.00 6972359.00 39557045.00 ▇▂▁▁▁
median_income 0 1 55793.43 12152.82 18314.00 47512.50 54740.00 63255.75 92266.00 ▁▅▇▃▁
median_monthly_rent_cost 0 1 900.19 230.04 419.00 735.75 847.50 1024.50 1774.00 ▂▇▃▁▁
median_monthly_home_cost 0 1 1084.63 360.03 241.00 833.50 987.50 1314.25 2215.00 ▁▇▅▂▁
prop_female 0 1 0.51 0.01 0.47 0.50 0.51 0.51 0.53 ▁▁▇▇▁
prop_male 0 1 0.49 0.01 0.47 0.49 0.49 0.50 0.53 ▁▇▇▁▁
prop_white 0 1 0.76 0.14 0.22 0.68 0.78 0.86 0.96 ▁▁▃▇▇
prop_black 0 1 0.11 0.11 0.00 0.03 0.07 0.16 0.53 ▇▃▂▁▁
prop_native 0 1 0.02 0.03 0.00 0.00 0.00 0.01 0.16 ▇▁▁▁▁
prop_asian 0 1 0.04 0.05 0.00 0.01 0.02 0.04 0.39 ▇▁▁▁▁
prop_hawaiin_islander 0 1 0.00 0.01 0.00 0.00 0.00 0.00 0.11 ▇▁▁▁▁
prop_other_race 0 1 0.03 0.03 0.00 0.01 0.02 0.05 0.30 ▇▁▁▁▁
prop_multi_racial 0 1 0.04 0.04 0.01 0.02 0.03 0.03 0.36 ▇▁▁▁▁
prop_highschool 0 1 0.24 0.04 0.12 0.22 0.24 0.27 0.35 ▁▃▇▅▁
prop_GED 0 1 0.04 0.01 0.02 0.04 0.04 0.05 0.07 ▃▇▆▃▁
prop_some_college 0 1 0.06 0.01 0.01 0.06 0.06 0.07 0.09 ▁▁▆▇▂
prop_college_no_degree 0 1 0.15 0.02 0.09 0.13 0.15 0.16 0.22 ▂▅▇▃▁
prop_associates 0 1 0.08 0.02 0.03 0.07 0.08 0.09 0.15 ▁▃▇▂▁
prop_bachelors 0 1 0.19 0.03 0.10 0.17 0.19 0.21 0.27 ▁▅▇▅▁
prop_masters 0 1 0.08 0.03 0.04 0.06 0.07 0.09 0.23 ▇▆▁▁▁
prop_professional 0 1 0.02 0.01 0.01 0.02 0.02 0.02 0.10 ▇▁▁▁▁
prop_doctoral 0 1 0.01 0.01 0.01 0.01 0.01 0.02 0.05 ▇▂▁▁▁
prop_poverty 0 1 0.14 0.05 0.07 0.11 0.14 0.16 0.46 ▇▅▁▁▁

Data Set Variable Types

The variables within the data set are of two variable types, character and numeric of type double.

The two character variables are:

  • geoid: Geographic region ID

  • county_state: Geographic region

The twenty-four double variables in this data set are:

  • year: The year

  • Population: The population

  • median_income: Median income in dollars

  • median_monthly_rent_cost: Median monthly rent costs for renters in dollars

  • median_monthly_home_cost: Median monthly housing costs for homeowners in dollars

  • prop_female: Proportion of people who are female

  • prop_male: Proportion of people who are male

  • prop_white: Proportion of people who are white alone

  • prop_black: Proportion of people who are black or African American alone

  • prop_native: Proportion of people who are American Indian and Alaska Native alone

  • prop_asian: Proportion of people who are Asian alone

  • prop_hawaiin_islander: Proportion of people who are Natice Hawaiian and Other Pacific Islander alone

  • prop_other_race: Proportion of people who are some other race alone

  • prop_multi_racial: Proportion of people are two or more races

  • prop_highschool: Proportion of people 25 and older whose highest education-level is high school

  • prop_GED: Proportion of people 25 and older whose highest education-level is a GED

  • prop_some_college: Proportion of people 25 and older whose highest education-level is some, but less than 1 year of college

  • prop_college_no_degree: Proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree

  • prop_associates: Proportion of people 25 and older whose highest education-level is an Associates degree

  • prop_bachelors: Proportion of people 25 and older whose highest education-level is a Bachelors degree

  • prop_masters: Proportion of people 25 and older whose highest education-level is a Masters degree

  • prop_professional: Proportion of people 25 and older whose highest education-level is a Professional degree

  • prop_doctoral: Proportion of people 25 and older whose highest education-level is a Doctoral degree

  • prop_poverty: Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size

These 26 variables can also be broken up by categories and subcategories. The two categories are non-proportional and proportional.

  1. Non-proportional:

    • Demographics/Characteristics: geoid, county_state, year, population, and median_income
  2. Proportional:

    • Gender: prop_female and prop_male

    • Race: prop_white, prop_black, prop_native, prop_asian, prop_hawaiin_islander, prop_other_race, and prop_multi_racial

    • Education: prop_highschool, prop_GED, prop_some_college, prop_college_no_degree, prop_associates, prop_bachelors, prop_masters, prop_professional, and prop_doctoral

    • Poverty: prop_poverty

Missing Values

After analyzing the raw data using the skim() and glimpse() functions, there are no missing data values for any of the variables.

This is very important to check, as it can heavily skew or misrepresent the sample or population, if extrapolated.

Creating Tables Summarizing the Data

Comparing median income from the years 2008 through 2021.

income_table <- census_data %>% 
  group_by(year) %>% 
  skim() %>% 
  dplyr::filter(skim_variable == "median_income") # Grouping by year, filtering by "median_income" and storing it in the variable income_table

print(income_table) # Printing the table
## ── Data Summary ────────────────────────
##                            Values    
## Name                       Piped data
## Number of rows             676       
## Number of columns          26        
## _______________________              
## Column type frequency:               
##   numeric                  1         
## ________________________             
## Group variables            year      
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────
##    skim_variable year n_missing complete_rate   mean     sd    p0    p25    p50
##  1 median_income 2008         0             1 51454.  9667. 18401 45945. 50173 
##  2 median_income 2009         0             1 49833.  9489. 18314 44470. 48152.
##  3 median_income 2010         0             1 49375.  9134. 18862 43449  48332.
##  4 median_income 2011         0             1 50279.  9479. 18660 44146. 49178 
##  5 median_income 2012         0             1 51349.  9694. 19429 45132. 50482 
##  6 median_income 2013         0             1 52518.  9938. 19183 46596. 51335 
##  7 median_income 2014         0             1 53770. 10380. 18928 47512. 52563 
##  8 median_income 2015         0             1 55680. 10783. 18626 49212. 54442 
##  9 median_income 2016         0             1 57411. 11064. 20078 50791  56406 
## 10 median_income 2017         0             1 59449. 11630. 19343 52712. 58286.
## 11 median_income 2018         0             1 61211. 11995. 20296 55116. 59760.
## 12 median_income 2019         0             1 64645. 12703. 20474 57392  63115 
## 13 median_income 2021         0             1 68340. 12961. 22237 62190. 66539 
##       p75  p100 hist 
##  1 57067. 70545 ▁▁▇▇▃
##  2 55476. 69272 ▁▁▇▅▃
##  3 54923. 68854 ▁▁▇▅▃
##  4 56450. 70004 ▁▁▇▅▃
##  5 57180  71122 ▁▁▇▆▃
##  6 58770. 72483 ▁▁▇▆▃
##  7 61017. 73971 ▁▁▇▆▃
##  8 63056  75847 ▁▁▇▇▃
##  9 65620. 78945 ▁▁▇▆▃
## 10 68366. 82372 ▁▁▇▆▅
## 11 70590. 85203 ▁▁▇▆▅
## 12 74810. 92266 ▁▁▇▅▃
## 13 77751. 90203 ▁▁▅▇▅

The table shows that the median income dipped slightly in 2009 and 2010, likely from the housing market crash, and then grew steadily through 2021.

Creating a Data Dictionary

# Data Dictionary
dataDictionary <- tibble(Variable = colnames(census_data), # Creating a tibble of the variable descriptions and storing it in dataDictionary
                         Description = c(
                           "Geographic Region ID", 
                           "Geographic Region", 
                           "Year (2008-2021)", 
                           "Population", 
                           "Median Income (Dollars)", 
                           "Median Monthly Rent Costs for Renters (Dollars)", 
                           "Median Monthly Housing Costs for Homeowners (Dollars)", 
                           "Proportion of People who are Female", 
                           "Proportion of People who are Male", 
                           "Proportion of People who are White Alone", 
                           "Proportion of People who are Black or African American Alone", 
                           "Proportion of People who are American Indian and Alaska Native Alone", 
                           "Proportion of People who are Asian Alone", 
                           "Proportion of People who are Native Hawaiian and Other Pacific Islander Alone", 
                           "Proportion of People who are Some Other Race Alone", 
                           "Proportion of People who are Two or More Races", 
                           "Proportion of People 25 and Older Whose Highest Education Level is High School", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a GED", 
                           "Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College", 
                           "Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree", 
                           "Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree", 
                           "Proportion of People 25 and Older Living in Poverty"
                         ),
                         Type = map_chr(census_data, .f = function(x){typeof(x)[1]}), # Creating a character vector of the types of each variable
                         Class = map_chr(census_data, .f = function(x){class(x)[1]})) # Creating a character vector of the classes of each variable

Displaying the Data Dictionary

# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary, cwidth = 2) # Printing the dataDictionary table using the flextable R package

Variable

Description

Type

Class

geoid

Geographic Region ID

character

character

county_state

Geographic Region

character

character

year

Year (2008-2021)

double

numeric

population

Population

double

numeric

median_income

Median Income (Dollars)

double

numeric

median_monthly_rent_cost

Median Monthly Rent Costs for Renters (Dollars)

double

numeric

median_monthly_home_cost

Median Monthly Housing Costs for Homeowners (Dollars)

double

numeric

prop_female

Proportion of People who are Female

double

numeric

prop_male

Proportion of People who are Male

double

numeric

prop_white

Proportion of People who are White Alone

double

numeric

prop_black

Proportion of People who are Black or African American Alone

double

numeric

prop_native

Proportion of People who are American Indian and Alaska Native Alone

double

numeric

prop_asian

Proportion of People who are Asian Alone

double

numeric

prop_hawaiin_islander

Proportion of People who are Native Hawaiian and Other Pacific Islander Alone

double

numeric

prop_other_race

Proportion of People who are Some Other Race Alone

double

numeric

prop_multi_racial

Proportion of People who are Two or More Races

double

numeric

prop_highschool

Proportion of People 25 and Older Whose Highest Education Level is High School

double

numeric

prop_GED

Proportion of People 25 and Older Whose Highest Education Level is a GED

double

numeric

prop_some_college

Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College

double

numeric

prop_college_no_degree

Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree

double

numeric

prop_associates

Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree

double

numeric

prop_bachelors

Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree

double

numeric

prop_masters

Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree

double

numeric

prop_professional

Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree

double

numeric

prop_doctoral

Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree

double

numeric

prop_poverty

Proportion of People 25 and Older Living in Poverty

double

numeric

Data Cleaning

To make the data ready for transformations and visualizations, the data must be first cleaned and organized. This will be achieved by merging data sets and string manipulation.

Merging Data Sets

Sorting States by Region

This function takes a list of every state, categorized by region, and matches each state in the data set with its respective region. This was made to be used later to be able to analyze data based on geographical region.

# Creating the region variable
# Used chatGPT to help list all states with their respective regions
census_data <- census_data %>%
  mutate(
    region = case_when(
      str_detect(county_state, "California|Oregon|Washington|Alaska|Hawaii|Arizona|Colorado|Idaho|Montana|Nevada|New Mexico|Utah|Wyoming") ~ "West", # Detecting West regional states
      str_detect(county_state, "North Dakota|South Dakota|Nebraska|Kansas|Minnesota|Iowa|Missouri|Wisconsin|Illinois|Michigan|Indiana|Ohio") ~ "Midwest", # Detecting Midwest regional states
      str_detect(county_state, "Texas|Oklahoma|Arkansas|Louisiana|Mississippi|Alabama|Florida|Georgia|South Carolina|North Carolina|Tennessee|Kentucky|Virginia|West Virginia|Maryland|Delaware|District of Columbia|Puerto Rico") ~ "South", # Detecting Southern regional states
      str_detect(county_state, "Maine|New Hampshire|Vermont|Massachusetts|Rhode Island|Connecticut|New York|New Jersey|Pennsylvania") ~ "Northeast", # Detecting Northeast regional states
      TRUE ~ NA_character_
    )
  )

# Creating summary table
regions <- census_data %>%
  group_by(region) %>%
  summarize(Unique_States = n_distinct(county_state), .groups = "drop")

# Display the summary table
flextable::flextable(regions)

region

Unique_States

Midwest

12

Northeast

9

South

18

West

13

This table shows the number of states that were identified in each region. Only the unique states are included so that each state listed for each year is not included.

String Manipulation

# State Abbreviations
census_data <- census_data %>%
  mutate(
    # Extract State Abbreviation
    state_abbreviation = str_extract(county_state, "[A-Za-z]{2}"), # Used chatGPT for code to find state abbreviations
    # Capitalize State Name
    capitalized_state_name = str_to_title(county_state),
    # Combine Abbreviation and Capitalized Name
    abbreviations = str_c(capitalized_state_name, "(", state_abbreviation, ")")
  )

# Creating a data table
table_data <- slice_head(select(census_data, county_state, abbreviations), n = 8) # Showing the first 8 states in the table

# Printing the table
flextable(table_data)

county_state

abbreviations

Alabama

Alabama(Al)

Alaska

Alaska(Al)

Arizona

Arizona(Ar)

Arkansas

Arkansas(Ar)

California

California(Ca)

Colorado

Colorado(Co)

Connecticut

Connecticut(Co)

Delaware

Delaware(De)

Exploratory Data Analysis

This section consists of data transformations to display the data.

Tables of Summary Statistics

This sections displays tables with statistics that summarize the data.

Percentage of Each Race by Region

This table displays the average percentage of each race within each region. The low percentage races were lumped together into the Average_Other variable.

# Creating the combined_other_race variable
census_data <- census_data %>% 
  mutate(
    combined_other_race = prop_hawaiin_islander + prop_other_race + prop_multi_racial
  )

# Calculating the average total percentage of race by region
raceByRegion <- census_data %>%
  group_by(region) %>%
  summarize(
    Average_White = mean(prop_white),
    Average_Black = mean(prop_black),
    Average_Native = mean(prop_native),
    Average_Asian = mean(prop_asian),
    Average_Other = mean(combined_other_race),
    .groups = "drop"
  )

# Display the table
flextable::flextable(raceByRegion)

region

Average_White

Average_Black

Average_Native

Average_Asian

Average_Other

Midwest

0.8358631

0.07499745

0.016292110

0.02477064

0.04807675

Northeast

0.8114821

0.07479451

0.003074737

0.04279815

0.06785052

South

0.6883685

0.21240546

0.007785882

0.02477454

0.06666563

West

0.7439349

0.02991676

0.034617635

0.06905562

0.12247509

Finding the median income after home or rent expenses.

This table finds how much money the average American had in income each year after home expenses, broken up by mortgage expenses and rent expenses.

# Creating the expenses variable
expenses <- census_data %>% 
  group_by(year) %>%
  summarize(
    income_after_home_expenses = label_currency(mean((median_income - (12 * (median_monthly_home_cost))))), # Calculates the average median anual income after home costs
    income_after_rent_expenses = label_currency(mean((median_income - (12 * (median_monthly_rent_cost))))) # Calculates the average median anual income after rent costs
    )
# Having the year displayed without commas
expenses$year <- as.character(expenses$year)

# Displaying the table
flextable::flextable(expenses, cwidth = 2.5)

year

income_after_home_expenses

income_after_rent_expenses

2008

$38,149.85

$42,160.85

2009

$36,626.94

$40,314.63

2010

$36,280.58

$39,718.35

2011

$37,274.35

$40,386.96

2012

$38,607.98

$41,286.29

2013

$40,146.69

$42,186.00

2014

$41,282.81

$43,112.81

2015

$42,990.21

$44,745.67

2016

$44,750.90

$46,264.98

2017

$46,529.56

$47,965.63

2018

$47,997.21

$49,302.90

2019

$51,144.75

$52,364.60

2021

$54,331.17

$55,075.87

Obataining a Frequency Table

The overall average percentage of people in poverty was 14.5%. This table compares male vs. female dominated states and having a poverty percentage of greater than 14.5%

# Used chatgpt to help with ifelse statements
# Creating the freq_table variable
freq_table <- census_data %>%
  mutate(Gender_Group = ifelse(prop_female > 0.5, "Female", "Male"), # Finds female or male dominated states
         Poverty_Group = ifelse(prop_poverty > 0.145, "High Poverty", "Low Poverty")) %>% # Finds if they are above or below the national average poverty percentage
  group_by(Gender_Group, Poverty_Group) %>%
  summarize(Frequency = n(), 
            .groups = "drop")

# Displaying the frequency table using flextable
flextable::flextable(freq_table, cwidth = 1.5)

Gender_Group

Poverty_Group

Frequency

Female

High Poverty

257

Female

Low Poverty

286

Male

High Poverty

17

Male

Low Poverty

116

This table shows that most states are predominantly female, but about half of them are below the national poverty percent average. Male dominated states seem to have a much lower poverty rate.

Data Visualizations

This sections displays graphical figures that summarize the data.

Plotting Median Monthly Housing Costs Over the Years in the US

Plotting monthly home and rent expenses over from the years 2008 through 2021.

# Creating the variable avg_costs
avg_costs <- census_data %>%
  pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost), # Used chatgpt to help combine these two plot together
               names_to = "cost_type", 
               values_to = "value") %>%
  group_by(year, cost_type) %>%
  summarize(avg_value = mean(value), 
            .groups = "drop")

# Creating the variable housingLineChart to plot home and rent expenses
housingLineChart <- avg_costs %>%
  ggplot(aes(x = year, 
             y = avg_value, 
             color = cost_type, 
             group = cost_type)) +
  geom_line(linewidth = 1) +
  labs(title = "Median Monthly Housing Costs Over the Years in the US",
       x = "Year",
       y = "Average Median Monthly Cost",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() + 
  theme_bw() + 
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) # Formats the y-axis to use a currency format

# Displays the graph
housingLineChart

This graph shows that it used to be much cheaper in the US to rent property rather than own it. In 2021, the prices became increasingly comparable.

Percentage of Degree Holders Over the Years in the US

This graph displays the percentage of the population that had some kind of college degree vs. the percentage of the population with no degree over the years.

# Creating the combined_degree_holders variable combining the percentages of all degree holding Americans
census_data <- census_data %>%
  mutate(
    combined_degree_holders = prop_associates + prop_bachelors + prop_masters + prop_professional + prop_doctoral
  )

# Creating the combined_non_degree_holders variable combining the percentages of all non-degree holding Americans
census_data <- census_data %>%
  mutate(
    combined_non_degree_holders = prop_highschool + prop_GED + prop_some_college + prop_college_no_degree
  )

# Combining degree holders and non-degree holders
long_data <- census_data %>%
  select(year, combined_degree_holders, combined_non_degree_holders) %>%
  pivot_longer(cols = c(combined_degree_holders, combined_non_degree_holders),
               names_to = "Education_Level",
               values_to = "Percentage") %>%
  group_by(year, Education_Level) %>%
  summarize(avg_perc = mean(Percentage), 
            .groups = "drop")

# Plotting the data
degreeLineChart <- ggplot(long_data, aes(x = year, 
                                           y = avg_perc, 
                                           color = `Education_Level`, 
                                           group = `Education_Level`)) +
  geom_line(linewidth = 1) +
  labs(title = "Education Level Over the Years in the US",
       x = "Year",
       y = "Combined Percentage",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() +
  theme_bw() +
  theme(legend.position = "bottom")

# Plotting the graph
degreeLineChart

This graph shows that the percentage of degree holders in the US has almost surpassed the percentage of non-degree holders since 2008, starting with nearly a 20% difference in percentage.

Median Income by Region Over the Years in the US

This plot displays the median income by region in the US between the years 2008 and 2021.

# Calculate average median income per year and region
average_income_per_year <- census_data %>%
  group_by(year, region) %>%
  summarize(avg_median_income = mean(median_income), .groups = "drop")

# Plotting the data
incomeLineChart <- average_income_per_year %>% 
  ggplot(aes(x = year, 
             y = avg_median_income, 
             color = region, 
             group = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Median Income by Region",
       x = "Year",
       y = "Average Median Income",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() +
  theme_classic() +
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
  guides(color = guide_legend(title = "Region"))

# Displaying the graph
incomeLineChart

This shows that growth and decline rates of average median income among each region over the years has stayed consistent with no overlap between any region.

Average Median Monthly Housing Costs by Region

This plot displays the average median monthly housing costs by region in the US between the years 2008 and 2021.

# Calculate average median home and rental costs per year and region
avg_costs_by_region <- census_data %>%
  pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost), 
               names_to = "cost_type", 
               values_to = "value") %>%
  group_by(year, region) %>%
  summarize(avg_value = mean(value), .groups = "drop")

# Plotting the data
livingCostsLineChart <- avg_costs_by_region %>%
  ggplot(aes(x = year, 
             y = avg_value, 
             color = region, 
             group = region)) +
  geom_line(linewidth = 1) +
  labs(title = "Average Median Monthly Housing Costs by Region",
       x = "Year",
       y = "Average Median Monthly Cost",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_color_colorblind() + 
  theme_classic() + 
  theme(legend.position = "bottom") + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
  guides(color = guide_legend(title = "Region"))

# Displaying the graph
livingCostsLineChart

This shows that growth and decline rates of average median monthly housing costs among each region over the years has stayed consistent with no overlap between any region.

It is worth noting that the Midwest average median income was consistently higher than that of the south, whereas when it comes to housing costs, it’s the opposite.

Median Income vs. Percentage of Degree Holders by Region, Sized by Population

This graph compares the average median income to the percentage of degree holders, organized by region, and the points are sized by the state populations.

# Creating the incomeDegree variable
incomeDegree <- census_data %>%
  ggplot(aes(
    x = combined_degree_holders,
    y = median_income,
    color = region, # Coloring by region
    size = population # Setting the size to population size
  )) +
  geom_point(alpha = 0.2) +
  scale_color_colorblind() +
  labs(
    title = "Median Income vs. Percentage of Degree Holders by Region",
    x = "Combined Degree Holders (%)",
    y = "Median Income",
    caption = "Data source: census_data_state_2008-2021.csv", 
    color = NULL
  ) +
  facet_grid(. ~ region) + # Creating the four individual graphs based on region
  theme(legend.position = "bottom") +  
  geom_smooth(method = "lm", se = FALSE, size = 1) + 
  guides(
    size = guide_legend(title = "Population Size (Millions of People)"), 
    color = FALSE
  ) + 
  scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + 
  scale_size_continuous(labels = scales::comma_format(scale = 1e-6, suffix = "M")) # Changing the legend to show the number in millions of people rather than in scientific notation
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Displaying the graph
incomeDegree
## `geom_smooth()` using formula = 'y ~ x'

This table shows that the south had the greatest growth in percentage of people that obtained degrees, which increased the median income most of any other region. The other three regions had consistent growth with one another.

Comparing Percent Poverty, Monthly Housing Costs, and Median Income Over the Years

This plot shows a grouped bar chart displaying the year over year percent change in percent poverty, monthly housing expenses, and median income between the years 2008 and 2021.

# Calculating average values per year
avg_data <- census_data %>%
  group_by(year) %>%
  summarize(
    avg_percent_poverty = mean(prop_poverty),
    avg_monthly_housing_costs = mean((median_monthly_home_cost + median_monthly_rent_cost) / 2),
    avg_median_income = mean(median_income)
  )

# Calculating percentage changes
avg_data_percent_change <- avg_data %>% # Used chatGPT to help with this
  mutate(
    median_income_change = (avg_median_income / lag(avg_median_income) - 1) * 100,
    housing_costs_change = ((avg_monthly_housing_costs + lag(avg_monthly_housing_costs)) / (2 * lag(avg_monthly_housing_costs)) - 1) * 100,
    percent_poverty_change = (avg_percent_poverty / lag(avg_percent_poverty) - 1) * 100
  )

# Reshaping data for plotting
avg_data_change_long <- avg_data_percent_change %>% 
  pivot_longer(cols = c(median_income_change, housing_costs_change, percent_poverty_change),
               names_to = "variable",
               values_to = "percent_change")

# Plotting the percent change
percentChange <- avg_data_change_long %>%
  ggplot(aes(x = year, y = percent_change, fill = variable)) +
  geom_col(position = "dodge", width = 0.7, color = "white") +
  labs(title = "Percentage Change in Median Income, Housing Costs, and Poverty Over the Years",
       x = "Year",
       y = "Percentage Change",
       fill = "Variable",
       caption = "Data source: census_data_state_2008-2021.csv") +
  scale_fill_manual(values = c("skyblue", "orange", "lightgreen")) + # Changing the color of each bar
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

percentChange

Housing cost changes closely mimic median income changes. Percent poverty mirrored housing cost changes and median income changes. This makes sense that as median income increases, poverty rates decrease.It should also be noted that the year 2020 is missing from the plot. This is likely due to the COVID-19 pandemic.

Monte Carlo Methods of Inference

For this Monte Carlo Simulation, I will be testing whether or not there has been a statistically significant difference in average monthly home costs between male and female dominated states.

# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
sim_data <- census_data %>% select(year, prop_male, prop_female, median_monthly_home_cost)

# Specify your observed test statistic
observed_statistic <- mean(sim_data$median_monthly_home_cost[sim_data$prop_male > sim_data$prop_female]) -
                      mean(sim_data$median_monthly_home_cost[sim_data$prop_male < sim_data$prop_female])

# Number of simulations
num_simulations <- 10000

# Initialize an empty vector to store simulated test statistics
simulated_stats <- numeric(num_simulations)

# Run the Monte Carlo simulation
for (i in 1:num_simulations) {
# Shuffle the data between the two groups based on proportion of males
  shuffled_data <- sim_data %>%
    mutate(shuffled_group = ifelse(runif(n()) < 0.5, "A", "B"))

# Calculate the test statistic on the shuffled data
  shuffled_statistic <- mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "A"]) -
                        mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "B"])

# Stores the simulated test statistic
  simulated_stats[i] <- shuffled_statistic
}

# Plot the null distribution
null_dist_plot <- ggplot() +
  geom_density(aes(x = simulated_stats), fill = "skyblue", alpha = 0.6) +
  geom_vline(xintercept = observed_statistic, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = quantile(simulated_stats, 0.975), linetype = "dotted", size = 1, color = "blue") +
  geom_vline(xintercept = quantile(simulated_stats, 0.025), linetype = "dotted", size = 1, color = "blue") +
  labs(title = "Monte Carlo Simulation of Home Costs Test",
       x = "Difference in Average Home Costs (Proportion of Males > Females)",
       y = "Density") +
  theme_minimal()

# Displaying the plot
print(null_dist_plot)

# Calculating the p-value
p_value <- mean(simulated_stats >= observed_statistic) + mean(simulated_stats <= -observed_statistic)

# Displaying the p-value
cat("Observed Test Statistic:", observed_statistic, "\n")
## Observed Test Statistic: 74.92805
cat("p-value:", sprintf("%.4f", p_value), "\n") # Rounding to 4 decimal places
## p-value: 0.0058
# Interpretation
if (p_value < 0.05) {
  cat("At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.\n")
} else {
  cat("At the 95% significance level, there is not statistically significant evidence that there is a difference in average home costs between states with higher proportions of males.\n")
}
## At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.

Bootstrap Methods of Inference

Bootstrapping is used to test for the confidence interval of median monthly home costs.

# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
bootstrap_data <- census_data %>% select(median_monthly_home_cost)

# Function to calculate the median
calculate_median <- function(data) {
  return(median(data))
}

# Number of bootstrap samples
num_bootstrap_samples <- 1000

# Initializing variable to store bootstrapped samples
bootstrap_medians <- numeric(num_bootstrap_samples)

# Running the bootstrap procedure
for (i in 1:num_bootstrap_samples) {
# Sampling with replacement
  bootstrap_sample <- sample(bootstrap_data$median_monthly_home_cost, replace = TRUE)
  
# Calculating the median for the bootstrap sample
  bootstrap_medians[i] <- calculate_median(bootstrap_sample)
}

# Calculating the 95% bootstrap confidence interval
ci_lower <- quantile(bootstrap_medians, 0.025) %>% round(., 2) # Rounding upper and lower to two decimal places to reflect currency
ci_upper <- quantile(bootstrap_medians, 0.975) %>% round(., 2)

# Displaying the bootstrap distribution
bootstrap_dist_plot <- ggplot() +
  geom_density(aes(x = bootstrap_medians), fill = "skyblue", alpha = 0.6) +
  geom_vline(xintercept = ci_lower, linetype = "dotted", size = 1, color = "blue") +
  geom_vline(xintercept = ci_upper, linetype = "dotted", size = 1, color = "blue") +
  labs(title = "Bootstrap Distribution of Median Home Costs",
       x = "Median Home Costs",
       y = "Density") +
  theme_minimal() + 
    scale_x_continuous(labels = scales::dollar_format(prefix = "$")) # Formatting to show currency in the x-axis

# Displaying the plot
print(bootstrap_dist_plot)

# Interpretation
cat("95% Bootstrap Confidence Interval for Median Home Costs:", 
    sprintf("[%.2f, %.2f]\n", ci_lower, ci_upper))
## 95% Bootstrap Confidence Interval for Median Home Costs: [962.00, 1025.54]

Concluding from the bootstrapping inference, we are 95% confident that the true median home costs is between $962 and $1025.54.

Conclusion

This analysis has shown some of the recent trends between the years 2008 and 2021 regarding income, demographics, education, and housing costs. The overall trends suggests that the cost of living, especially when renting, is on the rise, as well as education levels. It seems that in the near future, the number of degree holders will outweigh the number of non-degree holders. This as well has caused the median income to rise steadily with the growing education rates. The southern states, in particular, had the greatest growth in both education levels and median income. The median income in the southern states has nearly tripled between the years 2008 and 2021. The steady rise in median income has caused the percentage of people in poverty to steadily decrease as well. However, in 2021 there was a sharp rise in the percent change of people in poverty as well as the change in housing costs. This likely could have been due to the COVID-19 pandemic. At the beginning of this project, I was very concerned about the rising cost of living and housing, but after thoroughly analyzing the national data, the trajectory of costs and income are not as bad as I originally perceived them to be.