Final Project: The United States Census
Introduction
The stereotypical “American Dream” has always included owning property and your own house, but this ideal is becoming increasingly out of reach for the average American. The next generation of home buyers are having a difficult time purchasing property due to the high home prices, high rates of rent, and commercialized real estate. As a prospective home buyer within the next 5 to 10 years, I wanted to look into the trends of these soaring home prices and what demographics of people are being impacted by this hike in prices the most.
Data Dictionary
Loading Libraries
# Loading all of the libraries while hiding any extra output
library(tidyverse)
library(lubridate)
library(knitr)
library(skimr)
library(scales)
library(ggplot2)
library(gapminder)
library(arrow)
library(ggthemes)
library(rmdformats)
library(stringr)
library(flextable)
library(tidytext)
library(wordcloud2)
library(gtrendsR)
library(patchwork)
Importing the Data Set
filePath <- "https://raw.githubusercontent.com/dilernia/STA418-518/main/Data/census_data_state_2008-2021.csv" # Storing the filepath
census_data <- read_csv(filePath) # Reading in the file and storing it in census_data
## Rows: 676 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): geoid, county_state
## dbl (24): year, population, median_income, median_monthly_rent_cost, median_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Skimming the Data
## Rows: 676
## Columns: 26
## $ geoid <chr> "01", "02", "04", "05", "06", "08", "09", "10…
## $ county_state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "…
## $ year <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200…
## $ population <dbl> 4661900, 686293, 6500180, 2855390, 36756666, …
## $ median_income <dbl> 42666, 68460, 50958, 38815, 61021, 56993, 685…
## $ median_monthly_rent_cost <dbl> 631, 949, 866, 606, 1135, 848, 970, 917, 1011…
## $ median_monthly_home_cost <dbl> 742, 1393, 1177, 660, 1919, 1364, 1722, 1209,…
## $ prop_female <dbl> 0.5154499, 0.4789995, 0.4986763, 0.5102035, 0…
## $ prop_male <dbl> 0.4845501, 0.5210005, 0.5013237, 0.4897965, 0…
## $ prop_white <dbl> 0.7028360, 0.6911290, 0.8005912, 0.7871510, 0…
## $ prop_black <dbl> 0.261833802, 0.036257109, 0.036269457, 0.1550…
## $ prop_native <dbl> 0.005140822, 0.127259057, 0.044114009, 0.0052…
## $ prop_asian <dbl> 0.009768335, 0.046373779, 0.023755650, 0.0103…
## $ prop_hawaiin_islander <dbl> 2.417469e-04, 5.484538e-03, 1.436883e-03, 6.7…
## $ prop_other_race <dbl> 0.007177117, 0.012462607, 0.068086884, 0.0218…
## $ prop_multi_racial <dbl> 0.013002209, 0.081033902, 0.025745902, 0.0197…
## $ prop_highschool <dbl> 0.2604227, 0.2183097, 0.2106347, 0.2958009, 0…
## $ prop_GED <dbl> 0.05599385, 0.04658968, 0.03935913, 0.0582695…
## $ prop_some_college <dbl> 0.05957806, 0.08116479, 0.07719126, 0.0687879…
## $ prop_college_no_degree <dbl> 0.1546491, 0.2168623, 0.1816943, 0.1534883, 0…
## $ prop_associates <dbl> 0.06831511, 0.08030577, 0.07837909, 0.0558112…
## $ prop_bachelors <dbl> 0.1427926, 0.1758716, 0.1590928, 0.1251352, 0…
## $ prop_masters <dbl> 0.05542474, 0.06465035, 0.06474227, 0.0433444…
## $ prop_professional <dbl> 0.01338380, 0.01743700, 0.01623786, 0.0121880…
## $ prop_doctoral <dbl> 0.008333932, 0.015125876, 0.010927963, 0.0072…
## $ prop_poverty <dbl> 0.15695141, 0.08420041, 0.14718370, 0.1732244…
skim(census_data) # Using the skim function to skim through the data and look for missing values and outliers in the histograms
Name | census_data |
Number of rows | 676 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 2 |
numeric | 24 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
geoid | 0 | 1 | 2 | 2 | 0 | 52 | 0 |
county_state | 0 | 1 | 4 | 20 | 0 | 52 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2014.08 | 3.88 | 2008.00 | 2011.00 | 2014.00 | 2017.00 | 2021.00 | ▇▇▅▇▅ |
population | 0 | 1 | 6190132.93 | 6978987.03 | 532668.00 | 1786156.75 | 4257700.00 | 6972359.00 | 39557045.00 | ▇▂▁▁▁ |
median_income | 0 | 1 | 55793.43 | 12152.82 | 18314.00 | 47512.50 | 54740.00 | 63255.75 | 92266.00 | ▁▅▇▃▁ |
median_monthly_rent_cost | 0 | 1 | 900.19 | 230.04 | 419.00 | 735.75 | 847.50 | 1024.50 | 1774.00 | ▂▇▃▁▁ |
median_monthly_home_cost | 0 | 1 | 1084.63 | 360.03 | 241.00 | 833.50 | 987.50 | 1314.25 | 2215.00 | ▁▇▅▂▁ |
prop_female | 0 | 1 | 0.51 | 0.01 | 0.47 | 0.50 | 0.51 | 0.51 | 0.53 | ▁▁▇▇▁ |
prop_male | 0 | 1 | 0.49 | 0.01 | 0.47 | 0.49 | 0.49 | 0.50 | 0.53 | ▁▇▇▁▁ |
prop_white | 0 | 1 | 0.76 | 0.14 | 0.22 | 0.68 | 0.78 | 0.86 | 0.96 | ▁▁▃▇▇ |
prop_black | 0 | 1 | 0.11 | 0.11 | 0.00 | 0.03 | 0.07 | 0.16 | 0.53 | ▇▃▂▁▁ |
prop_native | 0 | 1 | 0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.01 | 0.16 | ▇▁▁▁▁ |
prop_asian | 0 | 1 | 0.04 | 0.05 | 0.00 | 0.01 | 0.02 | 0.04 | 0.39 | ▇▁▁▁▁ |
prop_hawaiin_islander | 0 | 1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.11 | ▇▁▁▁▁ |
prop_other_race | 0 | 1 | 0.03 | 0.03 | 0.00 | 0.01 | 0.02 | 0.05 | 0.30 | ▇▁▁▁▁ |
prop_multi_racial | 0 | 1 | 0.04 | 0.04 | 0.01 | 0.02 | 0.03 | 0.03 | 0.36 | ▇▁▁▁▁ |
prop_highschool | 0 | 1 | 0.24 | 0.04 | 0.12 | 0.22 | 0.24 | 0.27 | 0.35 | ▁▃▇▅▁ |
prop_GED | 0 | 1 | 0.04 | 0.01 | 0.02 | 0.04 | 0.04 | 0.05 | 0.07 | ▃▇▆▃▁ |
prop_some_college | 0 | 1 | 0.06 | 0.01 | 0.01 | 0.06 | 0.06 | 0.07 | 0.09 | ▁▁▆▇▂ |
prop_college_no_degree | 0 | 1 | 0.15 | 0.02 | 0.09 | 0.13 | 0.15 | 0.16 | 0.22 | ▂▅▇▃▁ |
prop_associates | 0 | 1 | 0.08 | 0.02 | 0.03 | 0.07 | 0.08 | 0.09 | 0.15 | ▁▃▇▂▁ |
prop_bachelors | 0 | 1 | 0.19 | 0.03 | 0.10 | 0.17 | 0.19 | 0.21 | 0.27 | ▁▅▇▅▁ |
prop_masters | 0 | 1 | 0.08 | 0.03 | 0.04 | 0.06 | 0.07 | 0.09 | 0.23 | ▇▆▁▁▁ |
prop_professional | 0 | 1 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.10 | ▇▁▁▁▁ |
prop_doctoral | 0 | 1 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.05 | ▇▂▁▁▁ |
prop_poverty | 0 | 1 | 0.14 | 0.05 | 0.07 | 0.11 | 0.14 | 0.16 | 0.46 | ▇▅▁▁▁ |
Data Set Variable Types
The variables within the data set are of two variable types, character and numeric of type double.
The two character variables are:
geoid: Geographic region ID
county_state: Geographic region
The twenty-four double variables in this data set are:
year: The year
Population: The population
median_income: Median income in dollars
median_monthly_rent_cost: Median monthly rent costs for renters in dollars
median_monthly_home_cost: Median monthly housing costs for homeowners in dollars
prop_female: Proportion of people who are female
prop_male: Proportion of people who are male
prop_white: Proportion of people who are white alone
prop_black: Proportion of people who are black or African American alone
prop_native: Proportion of people who are American Indian and Alaska Native alone
prop_asian: Proportion of people who are Asian alone
prop_hawaiin_islander: Proportion of people who are Natice Hawaiian and Other Pacific Islander alone
prop_other_race: Proportion of people who are some other race alone
prop_multi_racial: Proportion of people are two or more races
prop_highschool: Proportion of people 25 and older whose highest education-level is high school
prop_GED: Proportion of people 25 and older whose highest education-level is a GED
prop_some_college: Proportion of people 25 and older whose highest education-level is some, but less than 1 year of college
prop_college_no_degree: Proportion of people 25 and older whose highest education-level is greater than 1 year of college but no degree
prop_associates: Proportion of people 25 and older whose highest education-level is an Associates degree
prop_bachelors: Proportion of people 25 and older whose highest education-level is a Bachelors degree
prop_masters: Proportion of people 25 and older whose highest education-level is a Masters degree
prop_professional: Proportion of people 25 and older whose highest education-level is a Professional degree
prop_doctoral: Proportion of people 25 and older whose highest education-level is a Doctoral degree
prop_poverty: Proportion of people 25 and older living in poverty, defined by the Census Bureau as having an income below the poverty threshold for their family size
These 26 variables can also be broken up by categories and subcategories. The two categories are non-proportional and proportional.
Non-proportional:
- Demographics/Characteristics: geoid, county_state, year, population, and median_income
Proportional:
Gender: prop_female and prop_male
Race: prop_white, prop_black, prop_native, prop_asian, prop_hawaiin_islander, prop_other_race, and prop_multi_racial
Education: prop_highschool, prop_GED, prop_some_college, prop_college_no_degree, prop_associates, prop_bachelors, prop_masters, prop_professional, and prop_doctoral
Poverty: prop_poverty
Missing Values
After analyzing the raw data using the skim()
and
glimpse()
functions, there are no missing data values for
any of the variables.
This is very important to check, as it can heavily skew or misrepresent the sample or population, if extrapolated.
Creating Tables Summarizing the Data
Comparing median income from the years 2008 through 2021.
income_table <- census_data %>%
group_by(year) %>%
skim() %>%
dplyr::filter(skim_variable == "median_income") # Grouping by year, filtering by "median_income" and storing it in the variable income_table
print(income_table) # Printing the table
## ── Data Summary ────────────────────────
## Values
## Name Piped data
## Number of rows 676
## Number of columns 26
## _______________________
## Column type frequency:
## numeric 1
## ________________________
## Group variables year
##
## ── Variable type: numeric ──────────────────────────────────────────────────────
## skim_variable year n_missing complete_rate mean sd p0 p25 p50
## 1 median_income 2008 0 1 51454. 9667. 18401 45945. 50173
## 2 median_income 2009 0 1 49833. 9489. 18314 44470. 48152.
## 3 median_income 2010 0 1 49375. 9134. 18862 43449 48332.
## 4 median_income 2011 0 1 50279. 9479. 18660 44146. 49178
## 5 median_income 2012 0 1 51349. 9694. 19429 45132. 50482
## 6 median_income 2013 0 1 52518. 9938. 19183 46596. 51335
## 7 median_income 2014 0 1 53770. 10380. 18928 47512. 52563
## 8 median_income 2015 0 1 55680. 10783. 18626 49212. 54442
## 9 median_income 2016 0 1 57411. 11064. 20078 50791 56406
## 10 median_income 2017 0 1 59449. 11630. 19343 52712. 58286.
## 11 median_income 2018 0 1 61211. 11995. 20296 55116. 59760.
## 12 median_income 2019 0 1 64645. 12703. 20474 57392 63115
## 13 median_income 2021 0 1 68340. 12961. 22237 62190. 66539
## p75 p100 hist
## 1 57067. 70545 ▁▁▇▇▃
## 2 55476. 69272 ▁▁▇▅▃
## 3 54923. 68854 ▁▁▇▅▃
## 4 56450. 70004 ▁▁▇▅▃
## 5 57180 71122 ▁▁▇▆▃
## 6 58770. 72483 ▁▁▇▆▃
## 7 61017. 73971 ▁▁▇▆▃
## 8 63056 75847 ▁▁▇▇▃
## 9 65620. 78945 ▁▁▇▆▃
## 10 68366. 82372 ▁▁▇▆▅
## 11 70590. 85203 ▁▁▇▆▅
## 12 74810. 92266 ▁▁▇▅▃
## 13 77751. 90203 ▁▁▅▇▅
The table shows that the median income dipped slightly in 2009 and 2010, likely from the housing market crash, and then grew steadily through 2021.
Creating a Data Dictionary
# Data Dictionary
dataDictionary <- tibble(Variable = colnames(census_data), # Creating a tibble of the variable descriptions and storing it in dataDictionary
Description = c(
"Geographic Region ID",
"Geographic Region",
"Year (2008-2021)",
"Population",
"Median Income (Dollars)",
"Median Monthly Rent Costs for Renters (Dollars)",
"Median Monthly Housing Costs for Homeowners (Dollars)",
"Proportion of People who are Female",
"Proportion of People who are Male",
"Proportion of People who are White Alone",
"Proportion of People who are Black or African American Alone",
"Proportion of People who are American Indian and Alaska Native Alone",
"Proportion of People who are Asian Alone",
"Proportion of People who are Native Hawaiian and Other Pacific Islander Alone",
"Proportion of People who are Some Other Race Alone",
"Proportion of People who are Two or More Races",
"Proportion of People 25 and Older Whose Highest Education Level is High School",
"Proportion of People 25 and Older Whose Highest Education Level is a GED",
"Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College",
"Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree",
"Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree",
"Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree",
"Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree",
"Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree",
"Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree",
"Proportion of People 25 and Older Living in Poverty"
),
Type = map_chr(census_data, .f = function(x){typeof(x)[1]}), # Creating a character vector of the types of each variable
Class = map_chr(census_data, .f = function(x){class(x)[1]})) # Creating a character vector of the classes of each variable
Displaying the Data Dictionary
# Printing nicely in R Markdown (option 1)
flextable::flextable(dataDictionary, cwidth = 2) # Printing the dataDictionary table using the flextable R package
Variable | Description | Type | Class |
---|---|---|---|
geoid | Geographic Region ID | character | character |
county_state | Geographic Region | character | character |
year | Year (2008-2021) | double | numeric |
population | Population | double | numeric |
median_income | Median Income (Dollars) | double | numeric |
median_monthly_rent_cost | Median Monthly Rent Costs for Renters (Dollars) | double | numeric |
median_monthly_home_cost | Median Monthly Housing Costs for Homeowners (Dollars) | double | numeric |
prop_female | Proportion of People who are Female | double | numeric |
prop_male | Proportion of People who are Male | double | numeric |
prop_white | Proportion of People who are White Alone | double | numeric |
prop_black | Proportion of People who are Black or African American Alone | double | numeric |
prop_native | Proportion of People who are American Indian and Alaska Native Alone | double | numeric |
prop_asian | Proportion of People who are Asian Alone | double | numeric |
prop_hawaiin_islander | Proportion of People who are Native Hawaiian and Other Pacific Islander Alone | double | numeric |
prop_other_race | Proportion of People who are Some Other Race Alone | double | numeric |
prop_multi_racial | Proportion of People who are Two or More Races | double | numeric |
prop_highschool | Proportion of People 25 and Older Whose Highest Education Level is High School | double | numeric |
prop_GED | Proportion of People 25 and Older Whose Highest Education Level is a GED | double | numeric |
prop_some_college | Proportion of People 25 and Older Whose Highest Education Level is Some, but less than 1 Year of College | double | numeric |
prop_college_no_degree | Proportion of People 25 and Older Whose Highest Education Level is Greater than 1 Year of College but no Degree | double | numeric |
prop_associates | Proportion of People 25 and Older Whose Highest Education Level is an Associates Degree | double | numeric |
prop_bachelors | Proportion of People 25 and Older Whose Highest Education Level is a Bachelor's Degree | double | numeric |
prop_masters | Proportion of People 25 and Older Whose Highest Education Level is a Masters Degree | double | numeric |
prop_professional | Proportion of People 25 and Older Whose Highest Education Level is a Professional Degree | double | numeric |
prop_doctoral | Proportion of People 25 and Older Whose Highest Education Level is a Doctoral Degree | double | numeric |
prop_poverty | Proportion of People 25 and Older Living in Poverty | double | numeric |
Data Cleaning
To make the data ready for transformations and visualizations, the data must be first cleaned and organized. This will be achieved by merging data sets and string manipulation.
Merging Data Sets
Sorting States by Region
This function takes a list of every state, categorized by region, and matches each state in the data set with its respective region. This was made to be used later to be able to analyze data based on geographical region.
# Creating the region variable
# Used chatGPT to help list all states with their respective regions
census_data <- census_data %>%
mutate(
region = case_when(
str_detect(county_state, "California|Oregon|Washington|Alaska|Hawaii|Arizona|Colorado|Idaho|Montana|Nevada|New Mexico|Utah|Wyoming") ~ "West", # Detecting West regional states
str_detect(county_state, "North Dakota|South Dakota|Nebraska|Kansas|Minnesota|Iowa|Missouri|Wisconsin|Illinois|Michigan|Indiana|Ohio") ~ "Midwest", # Detecting Midwest regional states
str_detect(county_state, "Texas|Oklahoma|Arkansas|Louisiana|Mississippi|Alabama|Florida|Georgia|South Carolina|North Carolina|Tennessee|Kentucky|Virginia|West Virginia|Maryland|Delaware|District of Columbia|Puerto Rico") ~ "South", # Detecting Southern regional states
str_detect(county_state, "Maine|New Hampshire|Vermont|Massachusetts|Rhode Island|Connecticut|New York|New Jersey|Pennsylvania") ~ "Northeast", # Detecting Northeast regional states
TRUE ~ NA_character_
)
)
# Creating summary table
regions <- census_data %>%
group_by(region) %>%
summarize(Unique_States = n_distinct(county_state), .groups = "drop")
# Display the summary table
flextable::flextable(regions)
region | Unique_States |
---|---|
Midwest | 12 |
Northeast | 9 |
South | 18 |
West | 13 |
This table shows the number of states that were identified in each region. Only the unique states are included so that each state listed for each year is not included.
String Manipulation
# State Abbreviations
census_data <- census_data %>%
mutate(
# Extract State Abbreviation
state_abbreviation = str_extract(county_state, "[A-Za-z]{2}"), # Used chatGPT for code to find state abbreviations
# Capitalize State Name
capitalized_state_name = str_to_title(county_state),
# Combine Abbreviation and Capitalized Name
abbreviations = str_c(capitalized_state_name, "(", state_abbreviation, ")")
)
# Creating a data table
table_data <- slice_head(select(census_data, county_state, abbreviations), n = 8) # Showing the first 8 states in the table
# Printing the table
flextable(table_data)
county_state | abbreviations |
---|---|
Alabama | Alabama(Al) |
Alaska | Alaska(Al) |
Arizona | Arizona(Ar) |
Arkansas | Arkansas(Ar) |
California | California(Ca) |
Colorado | Colorado(Co) |
Connecticut | Connecticut(Co) |
Delaware | Delaware(De) |
Exploratory Data Analysis
This section consists of data transformations to display the data.
Tables of Summary Statistics
This sections displays tables with statistics that summarize the data.
Percentage of Each Race by Region
This table displays the average percentage of each race within each
region. The low percentage races were lumped together into the
Average_Other
variable.
# Creating the combined_other_race variable
census_data <- census_data %>%
mutate(
combined_other_race = prop_hawaiin_islander + prop_other_race + prop_multi_racial
)
# Calculating the average total percentage of race by region
raceByRegion <- census_data %>%
group_by(region) %>%
summarize(
Average_White = mean(prop_white),
Average_Black = mean(prop_black),
Average_Native = mean(prop_native),
Average_Asian = mean(prop_asian),
Average_Other = mean(combined_other_race),
.groups = "drop"
)
# Display the table
flextable::flextable(raceByRegion)
region | Average_White | Average_Black | Average_Native | Average_Asian | Average_Other |
---|---|---|---|---|---|
Midwest | 0.8358631 | 0.07499745 | 0.016292110 | 0.02477064 | 0.04807675 |
Northeast | 0.8114821 | 0.07479451 | 0.003074737 | 0.04279815 | 0.06785052 |
South | 0.6883685 | 0.21240546 | 0.007785882 | 0.02477454 | 0.06666563 |
West | 0.7439349 | 0.02991676 | 0.034617635 | 0.06905562 | 0.12247509 |
Finding the median income after home or rent expenses.
This table finds how much money the average American had in income each year after home expenses, broken up by mortgage expenses and rent expenses.
# Creating the expenses variable
expenses <- census_data %>%
group_by(year) %>%
summarize(
income_after_home_expenses = label_currency(mean((median_income - (12 * (median_monthly_home_cost))))), # Calculates the average median anual income after home costs
income_after_rent_expenses = label_currency(mean((median_income - (12 * (median_monthly_rent_cost))))) # Calculates the average median anual income after rent costs
)
# Having the year displayed without commas
expenses$year <- as.character(expenses$year)
# Displaying the table
flextable::flextable(expenses, cwidth = 2.5)
year | income_after_home_expenses | income_after_rent_expenses |
---|---|---|
2008 | $38,149.85 | $42,160.85 |
2009 | $36,626.94 | $40,314.63 |
2010 | $36,280.58 | $39,718.35 |
2011 | $37,274.35 | $40,386.96 |
2012 | $38,607.98 | $41,286.29 |
2013 | $40,146.69 | $42,186.00 |
2014 | $41,282.81 | $43,112.81 |
2015 | $42,990.21 | $44,745.67 |
2016 | $44,750.90 | $46,264.98 |
2017 | $46,529.56 | $47,965.63 |
2018 | $47,997.21 | $49,302.90 |
2019 | $51,144.75 | $52,364.60 |
2021 | $54,331.17 | $55,075.87 |
Obataining a Frequency Table
The overall average percentage of people in poverty was 14.5%. This table compares male vs. female dominated states and having a poverty percentage of greater than 14.5%
# Used chatgpt to help with ifelse statements
# Creating the freq_table variable
freq_table <- census_data %>%
mutate(Gender_Group = ifelse(prop_female > 0.5, "Female", "Male"), # Finds female or male dominated states
Poverty_Group = ifelse(prop_poverty > 0.145, "High Poverty", "Low Poverty")) %>% # Finds if they are above or below the national average poverty percentage
group_by(Gender_Group, Poverty_Group) %>%
summarize(Frequency = n(),
.groups = "drop")
# Displaying the frequency table using flextable
flextable::flextable(freq_table, cwidth = 1.5)
Gender_Group | Poverty_Group | Frequency |
---|---|---|
Female | High Poverty | 257 |
Female | Low Poverty | 286 |
Male | High Poverty | 17 |
Male | Low Poverty | 116 |
This table shows that most states are predominantly female, but about half of them are below the national poverty percent average. Male dominated states seem to have a much lower poverty rate.
Data Visualizations
This sections displays graphical figures that summarize the data.
Plotting Median Monthly Housing Costs Over the Years in the US
Plotting monthly home and rent expenses over from the years 2008 through 2021.
# Creating the variable avg_costs
avg_costs <- census_data %>%
pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost), # Used chatgpt to help combine these two plot together
names_to = "cost_type",
values_to = "value") %>%
group_by(year, cost_type) %>%
summarize(avg_value = mean(value),
.groups = "drop")
# Creating the variable housingLineChart to plot home and rent expenses
housingLineChart <- avg_costs %>%
ggplot(aes(x = year,
y = avg_value,
color = cost_type,
group = cost_type)) +
geom_line(linewidth = 1) +
labs(title = "Median Monthly Housing Costs Over the Years in the US",
x = "Year",
y = "Average Median Monthly Cost",
caption = "Data source: census_data_state_2008-2021.csv") +
scale_color_colorblind() +
theme_bw() +
theme(legend.position = "bottom") +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) # Formats the y-axis to use a currency format
# Displays the graph
housingLineChart
This graph shows that it used to be much cheaper in the US to rent property rather than own it. In 2021, the prices became increasingly comparable.
Percentage of Degree Holders Over the Years in the US
This graph displays the percentage of the population that had some kind of college degree vs. the percentage of the population with no degree over the years.
# Creating the combined_degree_holders variable combining the percentages of all degree holding Americans
census_data <- census_data %>%
mutate(
combined_degree_holders = prop_associates + prop_bachelors + prop_masters + prop_professional + prop_doctoral
)
# Creating the combined_non_degree_holders variable combining the percentages of all non-degree holding Americans
census_data <- census_data %>%
mutate(
combined_non_degree_holders = prop_highschool + prop_GED + prop_some_college + prop_college_no_degree
)
# Combining degree holders and non-degree holders
long_data <- census_data %>%
select(year, combined_degree_holders, combined_non_degree_holders) %>%
pivot_longer(cols = c(combined_degree_holders, combined_non_degree_holders),
names_to = "Education_Level",
values_to = "Percentage") %>%
group_by(year, Education_Level) %>%
summarize(avg_perc = mean(Percentage),
.groups = "drop")
# Plotting the data
degreeLineChart <- ggplot(long_data, aes(x = year,
y = avg_perc,
color = `Education_Level`,
group = `Education_Level`)) +
geom_line(linewidth = 1) +
labs(title = "Education Level Over the Years in the US",
x = "Year",
y = "Combined Percentage",
caption = "Data source: census_data_state_2008-2021.csv") +
scale_color_colorblind() +
theme_bw() +
theme(legend.position = "bottom")
# Plotting the graph
degreeLineChart
This graph shows that the percentage of degree holders in the US has almost surpassed the percentage of non-degree holders since 2008, starting with nearly a 20% difference in percentage.
Median Income by Region Over the Years in the US
This plot displays the median income by region in the US between the years 2008 and 2021.
# Calculate average median income per year and region
average_income_per_year <- census_data %>%
group_by(year, region) %>%
summarize(avg_median_income = mean(median_income), .groups = "drop")
# Plotting the data
incomeLineChart <- average_income_per_year %>%
ggplot(aes(x = year,
y = avg_median_income,
color = region,
group = region)) +
geom_line(linewidth = 1) +
labs(title = "Average Median Income by Region",
x = "Year",
y = "Average Median Income",
caption = "Data source: census_data_state_2008-2021.csv") +
scale_color_colorblind() +
theme_classic() +
theme(legend.position = "bottom") +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
guides(color = guide_legend(title = "Region"))
# Displaying the graph
incomeLineChart
This shows that growth and decline rates of average median income among each region over the years has stayed consistent with no overlap between any region.
Average Median Monthly Housing Costs by Region
This plot displays the average median monthly housing costs by region in the US between the years 2008 and 2021.
# Calculate average median home and rental costs per year and region
avg_costs_by_region <- census_data %>%
pivot_longer(cols = c(median_monthly_home_cost, median_monthly_rent_cost),
names_to = "cost_type",
values_to = "value") %>%
group_by(year, region) %>%
summarize(avg_value = mean(value), .groups = "drop")
# Plotting the data
livingCostsLineChart <- avg_costs_by_region %>%
ggplot(aes(x = year,
y = avg_value,
color = region,
group = region)) +
geom_line(linewidth = 1) +
labs(title = "Average Median Monthly Housing Costs by Region",
x = "Year",
y = "Average Median Monthly Cost",
caption = "Data source: census_data_state_2008-2021.csv") +
scale_color_colorblind() +
theme_classic() +
theme(legend.position = "bottom") +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) + # Formatting to show currency in the y-axis
guides(color = guide_legend(title = "Region"))
# Displaying the graph
livingCostsLineChart
This shows that growth and decline rates of average median monthly housing costs among each region over the years has stayed consistent with no overlap between any region.
It is worth noting that the Midwest average median income was consistently higher than that of the south, whereas when it comes to housing costs, it’s the opposite.
Median Income vs. Percentage of Degree Holders by Region, Sized by Population
This graph compares the average median income to the percentage of degree holders, organized by region, and the points are sized by the state populations.
# Creating the incomeDegree variable
incomeDegree <- census_data %>%
ggplot(aes(
x = combined_degree_holders,
y = median_income,
color = region, # Coloring by region
size = population # Setting the size to population size
)) +
geom_point(alpha = 0.2) +
scale_color_colorblind() +
labs(
title = "Median Income vs. Percentage of Degree Holders by Region",
x = "Combined Degree Holders (%)",
y = "Median Income",
caption = "Data source: census_data_state_2008-2021.csv",
color = NULL
) +
facet_grid(. ~ region) + # Creating the four individual graphs based on region
theme(legend.position = "bottom") +
geom_smooth(method = "lm", se = FALSE, size = 1) +
guides(
size = guide_legend(title = "Population Size (Millions of People)"),
color = FALSE
) +
scale_y_continuous(labels = scales::dollar_format(prefix = "$")) +
scale_size_continuous(labels = scales::comma_format(scale = 1e-6, suffix = "M")) # Changing the legend to show the number in millions of people rather than in scientific notation
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
This table shows that the south had the greatest growth in percentage of people that obtained degrees, which increased the median income most of any other region. The other three regions had consistent growth with one another.
Comparing Percent Poverty, Monthly Housing Costs, and Median Income Over the Years
This plot shows a grouped bar chart displaying the year over year percent change in percent poverty, monthly housing expenses, and median income between the years 2008 and 2021.
# Calculating average values per year
avg_data <- census_data %>%
group_by(year) %>%
summarize(
avg_percent_poverty = mean(prop_poverty),
avg_monthly_housing_costs = mean((median_monthly_home_cost + median_monthly_rent_cost) / 2),
avg_median_income = mean(median_income)
)
# Calculating percentage changes
avg_data_percent_change <- avg_data %>% # Used chatGPT to help with this
mutate(
median_income_change = (avg_median_income / lag(avg_median_income) - 1) * 100,
housing_costs_change = ((avg_monthly_housing_costs + lag(avg_monthly_housing_costs)) / (2 * lag(avg_monthly_housing_costs)) - 1) * 100,
percent_poverty_change = (avg_percent_poverty / lag(avg_percent_poverty) - 1) * 100
)
# Reshaping data for plotting
avg_data_change_long <- avg_data_percent_change %>%
pivot_longer(cols = c(median_income_change, housing_costs_change, percent_poverty_change),
names_to = "variable",
values_to = "percent_change")
# Plotting the percent change
percentChange <- avg_data_change_long %>%
ggplot(aes(x = year, y = percent_change, fill = variable)) +
geom_col(position = "dodge", width = 0.7, color = "white") +
labs(title = "Percentage Change in Median Income, Housing Costs, and Poverty Over the Years",
x = "Year",
y = "Percentage Change",
fill = "Variable",
caption = "Data source: census_data_state_2008-2021.csv") +
scale_fill_manual(values = c("skyblue", "orange", "lightgreen")) + # Changing the color of each bar
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
percentChange
Housing cost changes closely mimic median income changes. Percent poverty mirrored housing cost changes and median income changes. This makes sense that as median income increases, poverty rates decrease.It should also be noted that the year 2020 is missing from the plot. This is likely due to the COVID-19 pandemic.
Monte Carlo Methods of Inference
For this Monte Carlo Simulation, I will be testing whether or not there has been a statistically significant difference in average monthly home costs between male and female dominated states.
# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
sim_data <- census_data %>% select(year, prop_male, prop_female, median_monthly_home_cost)
# Specify your observed test statistic
observed_statistic <- mean(sim_data$median_monthly_home_cost[sim_data$prop_male > sim_data$prop_female]) -
mean(sim_data$median_monthly_home_cost[sim_data$prop_male < sim_data$prop_female])
# Number of simulations
num_simulations <- 10000
# Initialize an empty vector to store simulated test statistics
simulated_stats <- numeric(num_simulations)
# Run the Monte Carlo simulation
for (i in 1:num_simulations) {
# Shuffle the data between the two groups based on proportion of males
shuffled_data <- sim_data %>%
mutate(shuffled_group = ifelse(runif(n()) < 0.5, "A", "B"))
# Calculate the test statistic on the shuffled data
shuffled_statistic <- mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "A"]) -
mean(shuffled_data$median_monthly_home_cost[shuffled_data$shuffled_group == "B"])
# Stores the simulated test statistic
simulated_stats[i] <- shuffled_statistic
}
# Plot the null distribution
null_dist_plot <- ggplot() +
geom_density(aes(x = simulated_stats), fill = "skyblue", alpha = 0.6) +
geom_vline(xintercept = observed_statistic, color = "red", linetype = "dashed", size = 1) +
geom_vline(xintercept = quantile(simulated_stats, 0.975), linetype = "dotted", size = 1, color = "blue") +
geom_vline(xintercept = quantile(simulated_stats, 0.025), linetype = "dotted", size = 1, color = "blue") +
labs(title = "Monte Carlo Simulation of Home Costs Test",
x = "Difference in Average Home Costs (Proportion of Males > Females)",
y = "Density") +
theme_minimal()
# Displaying the plot
print(null_dist_plot)
# Calculating the p-value
p_value <- mean(simulated_stats >= observed_statistic) + mean(simulated_stats <= -observed_statistic)
# Displaying the p-value
cat("Observed Test Statistic:", observed_statistic, "\n")
## Observed Test Statistic: 74.92805
## p-value: 0.0058
# Interpretation
if (p_value < 0.05) {
cat("At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.\n")
} else {
cat("At the 95% significance level, there is not statistically significant evidence that there is a difference in average home costs between states with higher proportions of males.\n")
}
## At the 95% significance level, there is statistically significant evidence that there is a difference in average home costs between states with higher proportions of males is statistically significant.
Bootstrap Methods of Inference
Bootstrapping is used to test for the confidence interval of median monthly home costs.
# Used chatGPT to help with some sections and to format the code
# Extracting variables to use
bootstrap_data <- census_data %>% select(median_monthly_home_cost)
# Function to calculate the median
calculate_median <- function(data) {
return(median(data))
}
# Number of bootstrap samples
num_bootstrap_samples <- 1000
# Initializing variable to store bootstrapped samples
bootstrap_medians <- numeric(num_bootstrap_samples)
# Running the bootstrap procedure
for (i in 1:num_bootstrap_samples) {
# Sampling with replacement
bootstrap_sample <- sample(bootstrap_data$median_monthly_home_cost, replace = TRUE)
# Calculating the median for the bootstrap sample
bootstrap_medians[i] <- calculate_median(bootstrap_sample)
}
# Calculating the 95% bootstrap confidence interval
ci_lower <- quantile(bootstrap_medians, 0.025) %>% round(., 2) # Rounding upper and lower to two decimal places to reflect currency
ci_upper <- quantile(bootstrap_medians, 0.975) %>% round(., 2)
# Displaying the bootstrap distribution
bootstrap_dist_plot <- ggplot() +
geom_density(aes(x = bootstrap_medians), fill = "skyblue", alpha = 0.6) +
geom_vline(xintercept = ci_lower, linetype = "dotted", size = 1, color = "blue") +
geom_vline(xintercept = ci_upper, linetype = "dotted", size = 1, color = "blue") +
labs(title = "Bootstrap Distribution of Median Home Costs",
x = "Median Home Costs",
y = "Density") +
theme_minimal() +
scale_x_continuous(labels = scales::dollar_format(prefix = "$")) # Formatting to show currency in the x-axis
# Displaying the plot
print(bootstrap_dist_plot)
# Interpretation
cat("95% Bootstrap Confidence Interval for Median Home Costs:",
sprintf("[%.2f, %.2f]\n", ci_lower, ci_upper))
## 95% Bootstrap Confidence Interval for Median Home Costs: [962.00, 1025.54]
Concluding from the bootstrapping inference, we are 95% confident that the true median home costs is between $962 and $1025.54.
Conclusion
This analysis has shown some of the recent trends between the years 2008 and 2021 regarding income, demographics, education, and housing costs. The overall trends suggests that the cost of living, especially when renting, is on the rise, as well as education levels. It seems that in the near future, the number of degree holders will outweigh the number of non-degree holders. This as well has caused the median income to rise steadily with the growing education rates. The southern states, in particular, had the greatest growth in both education levels and median income. The median income in the southern states has nearly tripled between the years 2008 and 2021. The steady rise in median income has caused the percentage of people in poverty to steadily decrease as well. However, in 2021 there was a sharp rise in the percent change of people in poverty as well as the change in housing costs. This likely could have been due to the COVID-19 pandemic. At the beginning of this project, I was very concerned about the rising cost of living and housing, but after thoroughly analyzing the national data, the trajectory of costs and income are not as bad as I originally perceived them to be.