Analyzing COVID-19 Vulnerability Through Plot.ly Visualizations
Chapter 1: Introduction to COVID-19 Data Visualization
With the resurgence of COVID-19 cases across the United States, it's crucial to assess the varying levels of vulnerability among states. I realized that creating a brief guide on utilizing Plot.ly for geo-data visualization would be an engaging project. Today, we'll focus on developing choropleth maps using Plot.ly while examining FIPS data, alongside intriguing insights about our nation. You can access the data and notebook through these links:
Data
Notebook
Data Preparation and Cleaning
For our analysis, we are fortunate that the data is already organized. Although querying an API was an option, I chose to work with CSV data. To start, we will read the CSV file using the read_csv() function from the Pandas library:
import pandas as pd
import numpy as np
import plotly.express as px
# Read FIPS as text so that any leading zeros in the codes are preserved.
df = pd.read_csv("data/CDC_SVI.csv", dtype={"FIPS": str})
Next, we can preview the first ten entries in our dataframe:
df.head(10)
The dataset contains numerous columns, making it impractical to examine each one. Luckily, by referencing the dataset's descriptions, we can eliminate unnecessary columns. Instead of dropping columns one by one, I opted to create a new dataframe with only the relevant features:
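# Keep only the columns used in this analysis, under shorter names.
# In the SVI naming scheme, E_ columns are estimates and M_ columns are margins of error.
df = pd.DataFrame({"FIPS": df["FIPS"], "POP": df["E_TOTPOP"],
                   "HOUSEHOLDS": df["E_HH"], "POV": df["E_POV"],
                   "UNEMP": df["E_UNEMP"], "PCI": df["E_PCI"],
                   "DIS": df["E_DISABL"], "UNINS": df["E_UNINSUR"]})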
The selected features include:
- FIPS: Federal Information Processing Standards code identifying each county (essential for plotting).
- POP: Estimate of the total population in each county.
- HOUSEHOLDS: Estimated number of households.
- POV: Estimated number of individuals living below the poverty line.
- UNEMP: Estimate of unemployed individuals in the county.
- PCI: Estimated per capita income.
- DIS: Estimate of disabled individuals in the jurisdiction.
- UNINS: Estimate of uninsured individuals.
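The column selection above can also be sketched as a subset followed by a rename, which avoids repeating `df[...]` for every field. The rows below are made-up stand-in data rather than the CDC file, only a few of the columns are shown, and the names `raw`, `column_names`, and `subset` are chosen just for this illustration.

```python
import pandas as pd

# Made-up stand-in rows; the real frame comes from pd.read_csv("data/CDC_SVI.csv").
raw = pd.DataFrame({
    "FIPS": ["01001", "01003"],
    "E_TOTPOP": [55200, 208107],
    "E_HH": [21115, 78622],
    "E_POV": [8422, 21653],
    "STATE": ["ALABAMA", "ALABAMA"],  # an example of a column we do not keep
})

# Map original column names to the shorter names used in the article.
# Only a few of the renamed columns are shown here for brevity.
column_names = {
    "FIPS": "FIPS",
    "E_TOTPOP": "POP",
    "E_HH": "HOUSEHOLDS",
    "E_POV": "POV",
}

# Select the wanted columns, then rename them in one step.
subset = raw[list(column_names)].rename(columns=column_names)
print(subset.columns.tolist())
```

Keeping the mapping in one dictionary makes it easy to see which source column feeds each short name.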
I also checked for missing values in our dataset, and thankfully, it was entirely clean!
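One way such a check might look is sketched below. It counts ordinary missing values and also looks for the value -999, which the CDC SVI documentation is understood to use as a placeholder for unavailable data; treat that as an assumption and confirm it against the data dictionary for your file. The rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows, including one -999 placeholder.
df = pd.DataFrame({
    "FIPS": ["01001", "01003", "01005"],
    "POP": [55200, 208107, 25782],
    "POV": [8422, -999, 6787],
})

# Count true missing values (NaN) per column.
nan_counts = df.isna().sum()

# Count -999 placeholders in the numeric columns only.
numeric = df.select_dtypes(include="number")
sentinel_counts = (numeric == -999).sum()

# Convert placeholders to NaN so they cannot distort sums or color scales.
cleaned = df.replace(-999, np.nan)

print(nan_counts)
print(sentinel_counts)
```

A frame can show zero NaN values and still contain placeholders, so it is worth running both counts.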
Feature Visualization
To visualize the counties in the U.S., we need a set of JSON data that connects Plot.ly with the correct FIPS areas. This can be sourced from Plot.ly's GitHub repository:
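We can either download the data and read it using JSON or Pandas, or we can fetch it directly in Python using urllib:
from urllib.request import urlopen
import json
# Assumed URL: the county GeoJSON file used in Plotly's own choropleth examples.
with urlopen("https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json") as response:
    counties = json.load(response)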
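Plotly matches each row to a map region by comparing the `locations` value with the `id` of a GeoJSON feature, so the FIPS codes in the dataframe need to be five-character strings. The sketch below uses a tiny hand-written stand-in for the county GeoJSON and made-up rows; it assumes the FIPS column may have been read as integers and shows how to restore the leading zeros and list any codes that have no matching feature.

```python
import pandas as pd

# Tiny hand-written stand-in for the county GeoJSON: each feature carries
# an "id" holding a five-character FIPS string (geometry omitted here).
counties = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "01001", "properties": {}, "geometry": None},
        {"type": "Feature", "id": "48201", "properties": {}, "geometry": None},
    ],
}

# FIPS codes read as integers lose their leading zero (1001 instead of "01001").
df = pd.DataFrame({"FIPS": [1001, 48201, 99999], "POV": [8422, 700000, 10]})

# Restore the five-character, zero-padded string form.
df["FIPS"] = df["FIPS"].astype(str).str.zfill(5)

# Report rows whose FIPS code has no matching feature id in the GeoJSON.
geo_ids = {feature["id"] for feature in counties["features"]}
unmatched = df.loc[~df["FIPS"].isin(geo_ids), "FIPS"].tolist()
print(unmatched)
```

Counties that end up in `unmatched` would simply be left blank on the map, so this check is a quick way to explain unexpected gaps.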
Next, I will define a variable called current_feature to represent the data we want to visualize. This will simplify the process of creating additional choropleths. We'll also calculate the minimum and maximum values for our color scale:
current_feature = "POV"
curr_min = min(df[current_feature])
curr_max = max(df[current_feature])
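County counts tend to be heavily skewed, and a handful of very large counties can push everything else toward the pale end of the color scale. As a possible alternative (not what the article's maps use), the upper bound can be capped at a high percentile. The 90th percentile and the name `capped_max` are arbitrary choices for this sketch, and the data is made up.

```python
import pandas as pd

# Made-up counts: nineteen small counties and one very large one.
df = pd.DataFrame({"POV": list(range(100, 2000, 100)) + [500000]})

current_feature = "POV"

# Full range, equivalent to min(...) and max(...) on the column.
curr_min = df[current_feature].min()
curr_max = df[current_feature].max()

# Alternative upper bound: the 90th percentile (an arbitrary choice).
capped_max = df[current_feature].quantile(0.90)

print(curr_min, curr_max, capped_max)
```

Passing `range_color=(curr_min, capped_max)` would then draw every county above the cap in the darkest shade, which keeps more contrast among the smaller counties.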
Now we can generate the choropleth using Plot.ly:
fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
color_continuous_scale="reds",
range_color=(curr_min, curr_max),
scope="usa",
labels={current_feature: current_feature}
)
To add a title to the visualization, we can use the fig.update_layout() method:
fig.update_layout(title_text = "US Poverty by County")
Finally, we can display the choropleth:
fig.show()
Upon inspection, the number of people living below the poverty line appears fairly even across most of the country, with notable exceptions like Harris County, Texas, which has also experienced significant COVID-19 transmission and hospitalization rates. Keep in mind that POV is a raw count, so populous counties stand out regardless of their poverty rate. Let's explore our other features similarly.
The next visualization will focus on population:
current_feature = "POP"
curr_min = min(df[current_feature])
curr_max = max(df[current_feature])
fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
color_continuous_scale="blues",
range_color=(curr_min, curr_max),
scope="usa",
labels={current_feature: "Population"}
)
fig.update_layout(title_text = "US Population by County")
fig.show()
When comparing the two visualizations, there seems to be a correlation between population and poverty across counties. In fact, overlaying the red poverty map on the blue population map turns many of the shaded (non-white) areas purple, indicating that high poverty counts and large populations tend to coincide.
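Because POV and POP are both raw counts, large counties will tend to score high on both maps. A rough numerical check is sketched below, together with a per-capita poverty rate that removes the effect of county size. The rows are made up, and `POV_RATE`, `count_corr`, and `rate_corr` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up stand-in rows.
df = pd.DataFrame({
    "FIPS": ["00001", "00002", "00003", "00004"],
    "POP": [1000, 5000, 20000, 100000],
    "POV": [300, 500, 4000, 10000],
})

# Pearson correlation between the two raw counts.
count_corr = df["POP"].corr(df["POV"])

# Share of each county's population below the poverty line.
df["POV_RATE"] = df["POV"] / df["POP"]

# Correlation between population and the poverty rate.
rate_corr = df["POP"].corr(df["POV_RATE"])

print(round(count_corr, 3), round(rate_corr, 3))
```

Mapping `POV_RATE` instead of `POV` would show where poverty is concentrated relative to population, which is a different question from where the most people in poverty live.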
Now, let's examine the impact of households on COVID-19 risk:
current_feature = "HOUSEHOLDS"
curr_min = min(df[current_feature])
curr_max = max(df[current_feature])
fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
color_continuous_scale="greens",
range_color=(curr_min, curr_max),
scope="usa",
labels={current_feature: "Household Count"}
)
fig.update_layout(title_text = "US Household Count By County")
fig.show()
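The three maps so far differ only in the feature, color scale, and label, so the shared arguments could be collected in a small helper. The sketch below is one way to do that; `choropleth_args` is a hypothetical helper name, the rows are made up, and the Plotly call is left as comments so the example runs without the map data.

```python
import pandas as pd


def choropleth_args(df, feature, scale, label):
    """Build the keyword arguments shared by the county choropleths."""
    return {
        "locations": "FIPS",
        "color": feature,
        "color_continuous_scale": scale,
        "range_color": (df[feature].min(), df[feature].max()),
        "scope": "usa",
        "labels": {feature: label},
    }


# Made-up stand-in rows.
df = pd.DataFrame({"FIPS": ["01001", "48201"], "HOUSEHOLDS": [21115, 1600000]})

args = choropleth_args(df, "HOUSEHOLDS", "greens", "Household Count")
print(args)

# With plotly.express imported as px and the county GeoJSON loaded as `counties`:
# fig = px.choropleth(df, geojson=counties, **args)
# fig.update_layout(title_text="US Household Count By County")
# fig.show()
```

Each new map then needs only one call with a different feature, scale, and label.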
From these analyses, we can qualitatively suggest that population density and shared housing are significant risk factors for COVID-19 transmission. However, poverty and systemic disparities also play a crucial role.
Feature Engineering
To gain deeper insights from the data, I aimed to create new features that might better represent the underlying issues. One key measurement I wanted to explore was the average number of people per household. To achieve this, I wrote a lambda function to compute this value:
POPPERH = lambda pop, hh: pop / hh
df["POPPERH"] = POPPERH(df["POP"], df["HOUSEHOLDS"])
If you're interested in learning more about Python's lambda functions, I have an article that covers their usage in detail.
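The ratio of population to households is undefined for any county that reports zero households. A defensive variant might look like the sketch below, which assumes such rows could exist and treats them as missing; the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows; the last county reports zero households.
df = pd.DataFrame({
    "POP": [55200, 208107, 120],
    "HOUSEHOLDS": [21115, 78622, 0],
})

# Treat zero households as missing so the ratio becomes NaN instead of inf.
households = df["HOUSEHOLDS"].replace(0, np.nan)
df["POPPERH"] = df["POP"] / households

print(df["POPPERH"].round(2).tolist())
```

Leaving those rows as NaN keeps an infinite value from stretching the color range of the resulting map.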
Now, let's visualize this new feature:
Interestingly, this analysis revealed that high population and poverty rates do not necessarily correlate with larger families living together. While this observation cannot be definitively asserted, it provides valuable insights, especially in light of earlier trends observed in Italy.
For my final analysis, I wanted to create a composite score to indicate the risk level of various regions based on several features. This score, which I call the Respiratory Virus Risk Index (RVRI), will be a calculated value derived from the significant features in our dataset.
To establish this index, I formulated the following equation:
RVRI = lambda df: (df["UNEMP"] + df["POV"] + df["UNINS"] / df["POPPERH"]) + df["POP"]
df["RVRI"] = RVRI(df)
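It may help to see how Python evaluates this expression. Division binds more tightly than addition, so as written only UNINS is divided by POPPERH. The sketch below uses one made-up row with easy numbers and compares the expression as written with a different grouping, shown purely for contrast and not taken from the article.

```python
import pandas as pd

# One made-up row with easy numbers.
df = pd.DataFrame({
    "UNEMP": [10.0], "POV": [20.0], "UNINS": [30.0],
    "POPPERH": [2.0], "POP": [100.0],
})

# As written in the article: division binds tighter than addition,
# so only UNINS is divided by POPPERH.
as_written = (df["UNEMP"] + df["POV"] + df["UNINS"] / df["POPPERH"]) + df["POP"]

# A different grouping (not the article's formula) that divides the whole sum.
grouped = (df["UNEMP"] + df["POV"] + df["UNINS"]) / df["POPPERH"] + df["POP"]

print(as_written.iloc[0], grouped.iloc[0])
```

Since every term is a raw count, POP will usually dominate the sum, which is worth bearing in mind when reading the resulting map.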
Now, let's visualize the RVRI to see how it reflects the data:
# Visualization code here
Conclusion
The visualizations created using the Plot.ly API effectively highlight key vulnerabilities related to COVID-19 across the U.S. The insights derived from the RVRI and other features underscore the interconnectedness of poverty, population density, and health risks. This analysis emphasizes the importance of addressing these disparities to mitigate the impact of the virus, particularly in a challenging year like 2020.