GITHUB PAGE

Final Tutorial: COVID19 and the United States Housing Market

CMPS-3160
Data Science
Rebekah Doochin and David Ziman
Fall 2020

To extract data we are using the Pandas library.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import pairwise_distances

1. Data Collection

We will be looking at various data sets to determine the impact that the COVID-19 pandemic has had on the housing market in the United States. We have chosen to use New York City as a case study and focused on data from the counties which comprise the five boroughs. Specifically, we are looking at population, crime rates, and building permits issued over time to see if we can gain insights about the various factors that drive pricing of housing. Our first step, as seen below, is importing the CSV files containing the data and storing them in dataframes.

# Importing datasets.

## Zillow data we generated ##
zipcode_all_homes = pd.read_csv('../rdoochin.github.io/data/generated_zillow/Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')

## NY population data 1970-present by county ##
ny_pop = pd.read_csv('../rdoochin.github.io/data/Annual_Population_Estimates_for_New_York_State_and_Counties__Beginning_1970 (2).csv')

## NY probations sentenced 1970-present by county ##
ny_crime = pd.read_csv('../rdoochin.github.io/data/Sentences_to_Probation_by_Year__Beginning_1970.csv')

## New York building permits ##
ny_permits = pd.read_csv('../rdoochin.github.io/data/ny_building_permits.csv',
                         dtype={"Job #": object, "Zip Code": object})

2. Data Processing

Home Prices in New York

All prices in the dataset from zillow.com (the dataframe which we have called 'zipcode_all_homes') are derived from the Zillow Home Value Index (ZHVI). After researching the website's methodology we determined that the ZHVI was an accurate and extensive representation of home values over time.
More information regarding the ZHVI can be found here.

The following piece of code begins to tidy the Zillow dataset by dropping unecessary columns such as RegionType and StateName and melting the various date columns into a singular column titled "Date." Corresponsing prices are found in the adjacent column.

# Dropping columns. 
zipcode_all_homes = zipcode_all_homes.drop(["RegionType", "StateName", "Metro"], axis = 1)

# Melting date columns. 
melted_housing = pd.melt(zipcode_all_homes,
                        ["RegionID", "SizeRank", "RegionName", "State", "City", "CountyName"],
                        var_name = "Date",
                        value_name = "Price")
melted_housing.head()

# Getting just the melted data for New York.
ny_df = melted_housing.loc[melted_housing["State"] == "NY"]
ny_df.head()

yearArr = []
for i in ny_df["Date"]:
    # Acquire year in date.
    i = i[0:4] 
    yearArr.append(i)
    
nyYearDF = ny_df.copy()
# Create new column in new dataframe with all years.
nyYearDF["Year"] = yearArr 
nyYearDF = nyYearDF.drop(["Date"], axis = 1)
nyYearDF

# Filter new dataframe to only contain data from select counties.
nyYearDF = nyYearDF[(nyYearDF.CountyName == 'New York County') | 
                    (nyYearDF.CountyName == 'Kings County') | 
                    (nyYearDF.CountyName == 'Richmond County') | 
                    (nyYearDF.CountyName == 'Bronx County') | 
                    (nyYearDF.CountyName == 'Queens County')] 
nyYearDF.head()

Here we are using the groupby function to display the mean prices in each borough for every year since 1996. We display these values in a dataframe and subsequently graph them using a pivot table below to help visualize the data.

groupedPricesNY = nyYearDF.groupby(["CountyName", "Year"])["Price"].mean().to_frame()
groupedPricesNY.head()

3. Exploratory Analysis and Data Visualization

# Created a pivot table in order to plot prices by county over time.
meanPrice = nyYearDF.pivot_table(
    index="Year", columns=["CountyName"],
    values="Price", aggfunc=np.mean) 

meanPrice.plot.line(title = "Prices by County Over Time")

<matplotlib.axes._subplots.AxesSubplot at 0x7ff2597c1a90>

The graph above summarizes the mean price of real estate in each of the five borough for each year. The graphs indicate recessionary periods with throughs and periods of exuberance with peaks. One conclusion drawn from the graph is that New York County experienced the greatest appreciation in real estate prices over time. Another important note is that Kings County (Brooklyn) has been subject to a wave of gentrification, which has led to prices rising in the area since the mid-2000s. Ultimately, we hope this overview of prices will help us to identify trends and predict future cycles in real estate pricing.

fig, (ax1,ax2,ax3,ax4,ax5) = plt.subplots(1,5, figsize=(20,4))
nyYearDF[(nyYearDF.Year == "2008") & (nyYearDF.CountyName == "Kings County")].Price.plot.box(ax = ax1, label = "Kings County 2008")
nyYearDF[(nyYearDF.Year == "2012") & (nyYearDF.CountyName == "Kings County")].Price.plot.box(ax = ax2, label = "Kings County 2012")
nyYearDF[(nyYearDF.Year == "2014") & (nyYearDF.CountyName == "Kings County")].Price.plot.box(ax = ax3, label = "Kings County 2014")
nyYearDF[(nyYearDF.Year == "2018") & (nyYearDF.CountyName == "Kings County")].Price.plot.box(ax = ax4, label = "Kings County 2018")
nyYearDF[(nyYearDF.Year == "2020") & (nyYearDF.CountyName == "Kings County")].Price.plot.box(ax = ax5, label = "Kings County 2020")
fig.tight_layout(pad=2.0)

Plotted in these graphs are the distributions of housing prices from 2008 to 2020; from them, we can easily see the effects of gentrification in Brooklyn (King’s County). In the 2008, sales above 600,000 dollars are far outside of the interquartile range. You can see the distribution becoming wider and moving upwards as time goes on. In 2020, the IQR includes sales over 1.2 million dollars. Ultimately, there was a population boom in Brooklyn which seemed to be a big driver of prices (see below on population data).

Population and Crime

Getting Data for the 5 borough in New York in order to look at their population over time. We first have to drop extra census data. In the dataset we used, there was an extra survey every 10 years, making it so that every decade we had two sets of data. Using the annual population data gave us a more extensive dataset to work with.

# Dropping extra census data.
for i in range(len(ny_pop)):
    if(ny_pop["Program Type"][i] == "Census Base Population"):
        ny_pop = ny_pop.drop(i)

burrow_list = ['New York County', 'Kings County', 'Richmond County', 'Bronx County', 'Queens County']

# Using only the counties that make up the 5 boroughs.
burrow_pop = ny_pop.loc[ny_pop["Geography"].isin(burrow_list)]

# Dropping unnecessary columns.
burrow_pop = burrow_pop.drop(["FIPS Code", "Program Type"], axis = 1)

# Preparing to display the dataframes side by side.
pop_df = burrow_pop.loc[burrow_pop["Geography"] == "New York County"]
pop_df2 = burrow_pop.loc[burrow_pop["Geography"] == "Kings County"]
pop_df3 = burrow_pop.loc[burrow_pop["Geography"] == "Richmond County"]
pop_df4 = burrow_pop.loc[burrow_pop["Geography"] == "Bronx County"]
pop_df5 = burrow_pop.loc[burrow_pop["Geography"] == "Queens County"]

# Creating a figure and axis object with subplots. 
fig, (ax1,ax2,ax3,ax4,ax5) = plt.subplots(1,5, figsize=(20,4))

pop_df.plot(kind='scatter', x='Year', y="Population", legend=False, title="New York County", ax=ax1)
pop_df2.plot(kind='scatter', x='Year', y="Population", title="Kings County", legend=False, ax=ax2)
pop_df3.plot(kind='scatter', x='Year', y="Population", legend=False, title="Richmond County", ax=ax3)
pop_df4.plot(kind='scatter', x='Year', y="Population", title="Bronx County", legend=False, ax=ax4)
pop_df5.plot(kind='scatter', x='Year', y="Population", legend=False, title="Queens County", ax=ax5)

fig.tight_layout(pad=2.0)
plt.show()

The latter are five scatterplots for each of the five boroughs in New York and contain information on their population over time. Population have the capacity to influence real estate prices and can be a proxy for demand. Evidently, there are population "cycles," which seem to somewhat align with economic activity and the real estate price cycles from the former graph.

# Using only the counties that make up the 5 boroughs in the crime dataframe.
burrow_list_4_crime = ['New York', 'Kings', 'Richmond', 'Bronx', 'Queens']
burrow_crime = ny_crime.loc[ny_crime["County"].isin(burrow_list_4_crime)]

Below we are separating the burrow_crime dataframe into 2 new dataframes based on whether or not the probation was sentenced because of a felony or a misdemeanor. We are doing this so we can later add up the total amount of crime in the city.

# Dataframe for felonies.
burrow_felony = burrow_crime.reset_index()
for i in burrow_felony.index:
    if burrow_felony['Conviction'][i] == 'Misdemeanor':
        burrow_felony = burrow_felony.drop(i)
        
# Tidying this dataframe.
burrow_felony = burrow_felony.rename(columns={'Total':'Fel_total'})
burrow_felony = burrow_felony.drop(columns={'Other', 'Property', 'DWI', 'Drugs'})

# Dataframe for Misdemeanors.
burrow_misdemeanor = burrow_crime.reset_index()
for i in burrow_misdemeanor.index:
    if burrow_misdemeanor['Conviction'][i] == 'Felony':
        burrow_misdemeanor = burrow_misdemeanor.drop(i) 
        
# Tidying this dataframe.
burrow_misdemeanor = burrow_misdemeanor.rename(columns={'Total':'Mis_total'})
burrow_misdemeanor = burrow_misdemeanor.drop(columns={'Violent Felony Offenses (VFO)', 'Other', 'Property', 'DWI',
                                                     'Drugs'})

# Merging to get the total number of probations.
Total_probations = burrow_felony.merge(burrow_misdemeanor, on=["County", "Year"], how="outer", suffixes=("-fel", "-mis"))
sum_column = Total_probations["Fel_total"] + Total_probations["Mis_total"]
Total_probations["Total"] = sum_column
Total_probations = Total_probations.drop(columns={'index-fel', 'index-mis', 'Conviction-fel', 'Conviction-mis'})

# Preparing to display the dataframes side by side.
prob_df = Total_probations.loc[Total_probations["County"] == "New York"]
prob_df2 = Total_probations.loc[Total_probations["County"] == "Kings"]
prob_df3 = Total_probations.loc[Total_probations["County"] == "Richmond"]
prob_df4 = Total_probations.loc[Total_probations["County"] == "Bronx"]
prob_df5 = Total_probations.loc[Total_probations["County"] == "Queens"]

fig, (ax1,ax2,ax3,ax4,ax5) = plt.subplots(1,5, figsize=(20,4), sharey=True)

prob_df.plot(kind='line', x='Year', y="Total", legend=False, title="New York County", ax=ax1)
prob_df2.plot(kind='line', x='Year', y="Total", legend=False, title="Kings County", ax=ax2)
prob_df3.plot(kind='line', x='Year', y="Total", legend=False, title="Richmond County", ax=ax3)
prob_df4.plot(kind='line', x='Year', y="Total", legend=False, title="Bronx County", ax=ax4)
prob_df5.plot(kind='line', x='Year', y="Total", legend=False, title="Queens County", ax=ax5)

ax1.set_ylabel("Total Probations Sentenced")
plt.show()

The former indicates crime rates, which we will ultimately compare to population and pricing of real estate in New York.

Population and Price

# Getting the mean prices. Will use in a later dataframe.
mean_price = nyYearDF.groupby(["CountyName", "Year"])["Price"].mean().to_frame()

Population and Crime

# Preparing to merge the two dataframes with population and probations.
burrow_pop = burrow_pop.rename(columns={'Geography':'County'})
Total_probations['County'] = (Total_probations['County'] + " County")

Pop_and_crime = Total_probations.merge(burrow_pop, on=["County", "Year"], how="outer", suffixes=("-pop", "-crime"))
Pop_and_crime.head()

The chart below deomonstrates how the rates of Violent Felony Offenses increases with the total number of probations rather than the population. Our initial thought was that having a greater number of people in a defined space would produce more violent crimes. After seeing the data appear in a graph, however, it is clear that population has little influence on severity of crimes.

fig, ax = plt.subplots()
Pop_and_crime.plot.scatter(x="Total", y="Population", c="Violent Felony Offenses (VFO)", alpha=.8, ax=ax)
ax.set_xlabel("Total Probations")
plt.show()

The graph below shows the number of felonies and misdemeanors for each year.

# Preparing to display the dataframes side by side.
df = burrow_crime.loc[burrow_crime["County"] == "New York"]
df2 = burrow_crime.loc[burrow_crime["County"] == "Kings"]
df3 = burrow_crime.loc[burrow_crime["County"] == "Richmond"]
df4 = burrow_crime.loc[burrow_crime["County"] == "Bronx"]
df5 = burrow_crime.loc[burrow_crime["County"] == "Queens"]

cols = df["Conviction"].map({"Felony": "blue","Misdemeanor": "red"})
cols2 = df2["Conviction"].map({"Felony": "blue","Misdemeanor": "red"})
cols3 = df3["Conviction"].map({"Felony": "blue","Misdemeanor": "red"})
cols4 = df4["Conviction"].map({"Felony": "blue","Misdemeanor": "red"})
cols5 = df5["Conviction"].map({"Felony": "blue","Misdemeanor": "red"})

fig, (ax1,ax2,ax3,ax4,ax5) = plt.subplots(1,5, figsize=(20,4), sharey=True) 

# Creating a second y axis for the population data. 
overlay_ax1 = ax1.twinx()
overlay_ax2 = ax2.twinx()
overlay_ax3 = ax3.twinx()
overlay_ax4 = ax4.twinx()
overlay_ax5 = ax5.twinx()

df.plot(kind='scatter', x='Year', y="Total", legend=False, title="New York County", ax=ax1, c=cols, alpha=.3)
df2.plot(kind='scatter', x='Year', y="Total", title="Kings County", legend=False, ax=ax2, c=cols, alpha=.3)
df3.plot(kind='scatter', x='Year', y="Total", legend=False, title="Richmond County", ax=ax3, c=cols, alpha=.3)
df4.plot(kind='scatter', x='Year', y="Total", title="Bronx County", legend=False, ax=ax4, c=cols, alpha=.3)
df5.plot(kind='scatter', x='Year', y="Total", legend=False, title="Queens County", ax=ax5, c=cols, alpha=.3)

pop_df.plot(kind='line', x='Year', y="Population", ax=overlay_ax1)
pop_df2.plot(kind='line', x='Year', y="Population", ax=overlay_ax2)
pop_df3.plot(kind='line', x='Year', y="Population", ax=overlay_ax3)
pop_df4.plot(kind='line', x='Year', y="Population", ax=overlay_ax4)
pop_df5.plot(kind='line', x='Year', y="Population", ax=overlay_ax5)

fig.tight_layout(pad=2.0)
plt.show()

In the graphs above we have overlayed the population with the two scatter plots of the crime rates. When we do this it becomes clear that the general trend is for the population to dip during years that the number of probations sentenced is high. One can use a correlation matrix to demonstrate this inverse quantitative relationship between the variables.

The red dots indicate felonies, the blue dots indicate misdemeanors, and the blue line is the population.

# Creating a figure and axis object with subplots. 
fig, (ax1,ax2,ax3,ax4,ax5) = plt.subplots(1,5, figsize=(22,6), sharey=True) 

# Creating a second y axis for the population data. 
overlay_ax1 = ax1.twinx()
overlay_ax2 = ax2.twinx()
overlay_ax3 = ax3.twinx()
overlay_ax4 = ax4.twinx()
overlay_ax5 = ax5.twinx()

prob_df.plot(kind='line', x='Year', y="Total", legend=True, title="New York County", color="red", ax=ax1)
prob_df2.plot(kind='line', x='Year', y="Total", legend=True, title="Kings County", color="red", ax=ax2)
prob_df3.plot(kind='line', x='Year', y="Total", legend=True, title="Richmond County", color="red", ax=ax3)
prob_df4.plot(kind='line', x='Year', y="Total", legend=True, title="Bronx County", color="red", ax=ax4)
prob_df5.plot(kind='line', x='Year', y="Total", legend=True, title="Queens County", color="red", ax=ax5)


pop_df.plot(kind='line', x='Year', y="Population", legend=True, ax=overlay_ax1)
pop_df2.plot(kind='line', x='Year', y="Population", legend=True, ax=overlay_ax2)
pop_df3.plot(kind='line', x='Year', y="Population", legend=True, ax=overlay_ax3)
pop_df4.plot(kind='line', x='Year', y="Population", legend=True, ax=overlay_ax4)
pop_df5.plot(kind='line', x='Year', y="Population", legend=True, ax=overlay_ax5)

fig.tight_layout(pad=2.0)
plt.show()

The graphs above overlay the population with the total number of probations sentenced (both felonies and misdemeanors). We displayed this in order to get a more clear representation of the relationship between population and crime.

# Preparing average housing prices. 
county_list = []
year_list = []
for index, row in mean_price.iterrows():
    if ('Bronx County')in(index):
        county_list.append("Bronx County")
    if ('Richmond County')in(index):
        county_list.append("Richmond County")
    if ('Queens County')in(index):
        county_list.append("Queens County")
    if ('Kings County')in(index):
        county_list.append("Kings County")
    if ('New York County')in(index):
        county_list.append("New York County")
    temp=str(index)
    year_list.append(temp[-6:-2])

mean_price['County'] = county_list
mean_price['year'] = year_list

#Dataframe without Richmond County
burrow_list_noRichmond = ['New York County', 'Kings County', 'Bronx County', 'Queens County']

# Using only the counties that make up the 5 boroughs.
pop_and_crime_noRichmond = Pop_and_crime.loc[Pop_and_crime["County"].isin(burrow_list_noRichmond)]
housing_noRichmond = mean_price.loc[mean_price["County"].isin(burrow_list_noRichmond)]

# Preparing to merge the two dataframes.
pop_and_crime_noRichmond = pop_and_crime_noRichmond.rename(columns={'Year':'year'})
housing_noRichmond = housing_noRichmond.astype({"year":int})

# Merge to combine all housing, crime, and population data for the boroughs not including Richmond County.
noRichmond_data = pop_and_crime_noRichmond.merge(housing_noRichmond, on=["County", "year"], how="outer",
                                                 suffixes=('_left', '_right'))
noRichmond_data = noRichmond_data.rename(columns={'year':'Year', 'Fel_total':'Felonies', 'Mis_total':'Misdemeanors',
                                                 'Total':'Total Prosecutions', 'Price':'Avg. Housing Price'})
# Using data from 1996 on. 
noRichmond_data1996 = noRichmond_data
for index, row in noRichmond_data1996.iterrows():
    if int(row['Year']) < 1996:
        noRichmond_data1996 = noRichmond_data1996.drop(index)

print("Correlation Data from 1996 to Present")
noRichmond_data1996.corr()

Correlation Data from 1996 to Present

# Calculate column correlations and make a seaborn heatmap.
corr = noRichmond_data1996.corr()

ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right');

Correlations by County

# Getting data from correlation matrix above and getting rid of Year.
copy_df = noRichmond_data1996.copy()
copy_df = copy_df.drop({"Year"}, axis=1)

# Grouping dataframes by county.
KC_df = copy_df[copy_df.County == "Kings County"].corr()
NYC_df = copy_df[copy_df.County == "New York County"].corr()
QC_df = copy_df[copy_df.County == "Queens County"].corr()
BC_df = copy_df[copy_df.County == "Bronx County"].corr()

corr2 = KC_df.corr()
ax = sns.heatmap(corr2, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True,
    annot=True, annot_kws={'fontsize': 10})
ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right');
print('Kings County')
plt.show()

Kings County

corr2 = NYC_df.corr()
ax = sns.heatmap(corr2, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True,
    annot=True, annot_kws={'fontsize': 10})
ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right');
print('New York County')
plt.show()

New York County

corr2 = QC_df.corr()
ax = sns.heatmap(corr2, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True,
    annot=True, annot_kws={'fontsize': 10})
ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right');
print('Queens County')
plt.show()

Queens County

corr2 = BC_df.corr()
ax = sns.heatmap(corr2, vmin=-1, vmax=1, center=0, cmap=sns.diverging_palette(20, 220, n=200), square=True,
    annot=True, annot_kws={'fontsize': 10})
ax.set_xticklabels(ax.get_xticklabels(),rotation=45,horizontalalignment='right');
print('Bronx County')
plt.show()

Bronx County

Construction Data for New York

ny_permits_test = ny_permits.copy()
ny_permits_test = ny_permits_test.drop(["Job Start Date"], axis = 1)

# Getting Year from the date the permit was filed. 
yearArr = []
for i in ny_permits_test["Filing Date"]:
    # Acquire year in date.
    i = i[-2:] 
    if int(i[0]) > 3:
        i = '19'+i
    else:
        i = '20'+i
    yearArr.append(int(i))
# Create new column in new dataframe with all years.
ny_permits_test["Year Filed"] = yearArr 
ny_permits_test = ny_permits_test.drop(["Filing Date"], axis = 1)

ny_construction_df = ny_permits_test.groupby(ny_permits_test["BOROUGH"])["Year Filed"].value_counts().to_frame()

# Getting County so we can merge with other NY data on that column. 
county_list = []
year_list = []
for index, row in ny_construction_df.iterrows():
    if ('BRONX')in(index):
        county_list.append("Bronx County")
    if ('STATEN ISLAND')in(index):
        county_list.append("Richmond County")
    if ('QUEENS')in(index):
        county_list.append("Queens County")
    if ('BROOKLYN')in(index):
        county_list.append("Kings County")
    if ('MANHATTAN')in(index):
        county_list.append("New York County")
    temp=str(index)
    year_list.append(int(temp[-6:-1]))
ny_construction_df['County'] = county_list
ny_construction_df['Year'] = year_list

# Tidying.
ny_construction_df = ny_construction_df.rename(columns={'Year Filed':'Permits Filed'})

# Merging construction data with other NY data.
total_nyDF = ny_construction_df.merge(noRichmond_data1996, how = "inner", on = ["Year", "County"])

# Preparing to display the dataframes side by side.
df = total_nyDF.loc[total_nyDF["County"] == "New York County"]
df2 = total_nyDF.loc[total_nyDF["County"] == "Kings County"]
df4 = total_nyDF.loc[total_nyDF["County"] == "Bronx County"]
df5 = total_nyDF.loc[total_nyDF["County"] == "Queens County"]

fig, (ax1,ax2,ax4,ax5) = plt.subplots(1,4, figsize=(20,6)) 

# Creating a second y axis for the population data. 
overlay_ax1 = ax1.twinx()
overlay_ax2 = ax2.twinx()
overlay_ax4 = ax4.twinx()
overlay_ax5 = ax5.twinx()

df.plot(kind='scatter', x='Year', y="Permits Filed", legend=True, title="New York County", ax=ax1, alpha=.6,
         color = "red", label='Permits')
df2.plot(kind='scatter', x='Year', y="Permits Filed", title="Kings County", legend=True, ax=ax2, alpha=.6,
          color = "red", label='Permits')
df4.plot(kind='scatter', x='Year', y="Permits Filed", title="Bronx County", legend=True, ax=ax4,alpha=.6,
          color = "red", label='Permits')
df5.plot(kind='scatter', x='Year', y="Permits Filed", legend=True, title="Queens County", ax=ax5, alpha=.6,
          color = "red", label='Permits')

df.plot(kind='scatter', x='Year', y="Population", ax=overlay_ax1, alpha=.3, label='Population')
df2.plot(kind='scatter', x='Year', y="Population", ax=overlay_ax2, alpha=.3, label='Population')
df4.plot(kind='scatter', x='Year', y="Population", ax=overlay_ax4, alpha=.3, label='Population')
df5.plot(kind='scatter', x='Year', y="Population", ax=overlay_ax5, alpha=.3, label='Population')

fig.tight_layout(pad=2.0)
print('Population and Building Permit Applications')
plt.show()

Population and Building Permit Applications

# Calculate column correlations and make a heatmap for the total NY dataframe.
temp_total_ny = total_nyDF.copy()
temp_total_ny = temp_total_ny.drop(["Year", "Violent Felony Offenses (VFO)"], axis = 1)
corr = temp_total_ny.corr()
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True,
    annot=True, annot_kws={'fontsize': 10})
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right');

4. Analysis, Hypothesis Testing, and ML

K-nearest Neighbors Prediction Model for Housing Prices in New York

NY_count = total_nyDF.loc[(total_nyDF.County == "New York County")]

# Using data from 2000 on. 
for index, row in NY_count.iterrows():
    if int(row['Year']) < 2000:
        NY_count = NY_count.drop(index)
print(NY_count['Total Prosecutions'].quantile(0.75))
print(NY_count['Permits Filed'].quantile(0.25))

2515.0
14525.0

Our Hypothesis: Given the story of crime, construction, and population in New York, we believe that if crime increases, prices will behave inversely, if population increases, prices will follow (although 2001-2006 was an outlier), and if construction increases, prices will follow. We predict that we could use population, crime, and construction permits to reasonably predict the average housing prices New York for 2021.

In the cell below, we train the model with New York data from 1996-present. We think the population will continue to steadily decline because mortgage applications for single family houses are still increasing, meaning that people are still moving to the suburbs. Taking into account that there may be a vaccine next year and that movement to suburbs may slow after this year, we extrapolated from 2019 to 2020 and assumed that the rate of population decline will slightly slow.

Although crime rates have been higher recently, they have been steadily declining overall since the 1990s. To get crime rates for our model to use in order to make a prediction about future price, we took the median of the last 20 years of data for New York County and used the upper bound of the IQR. We did the same with the data on the number of permits filed, and used the lower IQR to represent the decreased demand for construction and new housing. All of this being said, we made many assumptions. Our model is no where near perfect, but it does provide a reasonable picture for what the future of housing prices in New York may look like.

from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsRegressor

NYC = total_nyDF[total_nyDF.County == "New York County"]
NYC = NYC.dropna(subset = ["Population", "Total Prosecutions", "Permits Filed", "Avg. Housing Price"])

x_dict = NYC[["Population", "Total Prosecutions", "Permits Filed"]].to_dict(orient="records")
x_new_dict = [{
    "Population": 1625040.0,
    "Total Prosecutions": 2515.0,
    "Permits Filed": 14525.0}]
y_train = NYC["Avg. Housing Price"]

# Dummy encoding
vector = DictVectorizer(sparse=False)
vector.fit(x_dict)
x_train = vector.transform(x_dict)
new_x = vector.transform(x_new_dict)

# Standardize
scaler = StandardScaler()
scaler.fit(x_train)
x_train_sc = scaler.transform(x_train)
x_new_sc = scaler.transform(new_x)

# k-nearest neighbors
model = KNeighborsRegressor(n_neighbors=24)
model.fit(x_train_sc, y_train)
model.predict(x_new_sc)

array([690328.35965075])

The output of our model put mean New York housing prices for 2021 around 690,000 dollars. This number is similar to prices in the year 2012.

5. Insight and Policy Decision

When looking at the previous graphics, some interesting conclusions can be drawn. Firstly, the a declining population in the New York boroughs is not a new phenomenon. It has happened several times in recent history. Throughout the 1970s New York was plagued by economic stagnation, rising crime rates, and a fleeting population. The five counties became largely inhospitable and suffered from diminished tax revenue and teetered on the edge of bankruptcy.

Population started to rebound in the city throughout the 1980s leading to a construction boom, although it diminished in the late 1980s. Black Monday plagued the financial markets in 1987, marking the largest single day percentage decline in the stock market. This event shook the markets and was a catalyst for a decline in real estate prices in the early 1990s. This trend is reflected by the trough in population on the scatterplots above. In many counties, crime rates peaked at around point.

The woes of the late 1980s and early 1990s were washed away by an "urban renewal" in New York throughout the 1990s. Prices rebounded as people migrated en masse back to the five bouroughs, as evidenced by the population data graphics. Once population starts to rebound, economic activity returned to New York and tax revenue increased (find data).

Prices continued to increase throughout the 1990s and 2000s, until in 2008 the Great Depression struck. Given this financial crisis largely impacted real assets, housing prices were gutted. In much of New York, population started to decline in the early 2000s and bottomed around 2008/2009.

There seems to be close relationship in the pricing cycles and construction cycles. Intersetingly, construction boomed around this time, which indicates there was a dissonance between supply and demand. Irrationality in the market is evidenced by the divergence of population and construction, which ultimately culminated in the Great Recession. It seems as though a fleeting population is a good indicator of economic downturn and depressed real estate pricing.

Another interesting trend to note is the gentrification of Kings County (Brooklyn) following the Great Recession. Initially in 2008, Kings County data is skewed right, which signals the beginning of gentrification as there are some outliers in the pricing data. Over time, the data becomes less skewed as the mean and median pricing of the area move closer to each other. This signals that the area as a whole became more expensive and the effects of gentrification spread. The data indicates large population growth in the area post-Great Recession, which may explain the gentrifying effect on the area.

From the data, it seems as though there was a flight from the borrows before the onset of the pandemic. Furthermore, both construction permits and pricing fell. Research goes to show that a lack of international migration and high tax rates may explain this diminishing population. COVID only served to accelerate the trend. It would be interesting to explore data on rising mortgage applications for single family homes NY suburbs. This could potentially shed more light on the current movement from urban areas.

New York’s economy depends on its residents and the state relies tax revenue to avoid bankruptcy. This is why it is essential to enact policy to bring people back to the city and mitigate crime (which has been on the incline this year - could not find data on this in a dataset). Evidently this is challenging in a pandemic, but one idea is to make people feel more comfortable with temperature checks in buildings and signs on which guide the movement of people through public, indoor spaces. Diminished taxes may also incentivize businesses and people to return.

	RegionID	SizeRank	RegionName	State	City	CountyName	Date	Price
0	61639	0	10025	NY	New York	New York County	1996-01-31	223469.0
1	84654	1	60657	IL	Chicago	Cook County	1996-01-31	205864.0
2	61637	2	10023	NY	New York	New York County	1996-01-31	227596.0
3	91982	3	77494	TX	Katy	Harris County	1996-01-31	202808.0
4	84616	4	60614	IL	Chicago	Cook County	1996-01-31	276752.0

		Price
CountyName	Year
Bronx County	1996	140700.028037
	1997	141412.208333
	1998	146236.125000
	1999	155125.115741
	2000	170550.777778

	Year	Violent Felony Offenses (VFO)	Felonies	Misdemeanors	Total Prosecutions	Population	Avg. Housing Price
Year	1.000000	-0.891794	-0.855881	-0.872819	-0.905801	0.102109	0.602062
Violent Felony Offenses (VFO)	-0.891794	1.000000	0.726898	0.848808	0.810821	0.040121	-0.595548
Felonies	-0.855881	0.726898	1.000000	0.797493	0.972476	-0.297071	-0.286576
Misdemeanors	-0.872819	0.848808	0.797493	1.000000	0.916120	-0.042063	-0.567530
Total Prosecutions	-0.905801	0.810821	0.972476	0.916120	1.000000	-0.213645	-0.409604
Population	0.102109	0.040121	-0.297071	-0.042063	-0.213645	1.000000	0.011352
Avg. Housing Price	0.602062	-0.595548	-0.286576	-0.567530	-0.409604	0.011352	1.000000

Final Tutorial: COVID19 and the United States Housing Market

1. Data Collection

2. Data Processing

Home Prices in New York

3. Exploratory Analysis and Data Visualization

Population and Crime

Population and Price

Population and Crime

Construction Data for New York

4. Analysis, Hypothesis Testing, and ML

K-nearest Neighbors Prediction Model for Housing Prices in New York

5. Insight and Policy Decision

Updated Collaboration Plan

Links to Datasets

More Resources:

	County	Year	Violent Felony Offenses (VFO)	Fel_total	Mis_total	Total	Population
0	Bronx County	1970	9	138	826	964	1472841
1	Kings County	1970	47	316	442	758	2603413
2	New York County	1970	16	254	1060	1314	1535554
3	Queens County	1970	14	114	295	409	1987909
4	Richmond County	1970	0	2	23	25	297331