karasms.com

Analyzing COVID-19 Vulnerability Through Plot.ly Visualizations

Chapter 1: Introduction to COVID-19 Data Visualization

With the resurgence of COVID-19 cases across the United States, it's crucial to assess the varying levels of vulnerability among states. I realized that creating a brief guide on utilizing Plot.ly for geo-data visualization would be an engaging project. Today, we'll focus on developing choropleth maps using Plot.ly while examining FIPS data, alongside intriguing insights about our nation. You can access the data and notebook through these links:

Data

Notebook

Data Preparation and Cleaning

For our analysis, we are fortunate that the data is already organized. Although querying an API was an option, I chose to work with CSV data. To start, we will read the CSV file using the read_csv() function from the Pandas library:

import pandas as pd
import numpy as np
import plotly.express as px

df = pd.read_csv("data/CDC_SVI.csv")

Next, we can preview the first ten entries in our dataframe:

df.head(10)

Preview of COVID-19 vulnerability data

The dataset contains numerous columns, making it impractical to examine each one. Luckily, by referencing the dataset's descriptions, we can eliminate unnecessary columns. Instead of dropping columns one by one, I opted to create a new dataframe with only the relevant features:

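# Keep only the columns of interest, under shorter names.
# In the SVI files, E_ columns are estimates and M_ columns are margins of error,
# so per capita income is taken from E_PCI.
df = pd.DataFrame({"FIPS": df["FIPS"], "POP": df["E_TOTPOP"],
                   "HOUSEHOLDS": df["E_HH"], "POV": df["E_POV"],
                   "UNEMP": df["E_UNEMP"], "PCI": df["E_PCI"],
                   "DIS": df["E_DISABL"], "UNINS": df["E_UNINSUR"]})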

The selected features include:

  • FIPS: Federal Information Processing Standards code identifying each county (essential for plotting).
  • POP: Estimate of the total population in each county.
  • HOUSEHOLDS: Estimated number of households.
  • POV: Estimated number of individuals living below the poverty line.
  • UNEMP: Estimate of unemployed individuals in the county.
  • PCI: Estimated per capita income.
  • DIS: Estimate of disabled individuals in the jurisdiction.
  • UNINS: Estimate of uninsured individuals.
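Because FIPS codes are identifiers rather than quantities, they are easiest to plot when stored as five-character strings. If Pandas parses the column as integers, leading zeros (as in Alabama's 01001) are lost, and the codes may no longer match the county identifiers in the boundary file. A minimal sketch of one way to normalize them, using a small made-up frame rather than the real CSV:

```python
import pandas as pd

# Toy stand-in for the real data: FIPS values as they might look after
# being parsed as integers (leading zeros dropped). POP values are made up.
toy = pd.DataFrame({"FIPS": [1001, 6037, 48201], "POP": [100, 200, 300]})

# Convert to strings and left-pad with zeros to the standard five characters.
toy["FIPS"] = toy["FIPS"].astype(str).str.zfill(5)

print(toy["FIPS"].tolist())
```

Alternatively, passing dtype={"FIPS": str} to read_csv() keeps the column as text from the start, assuming the file itself stores the leading zeros.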

I also checked for missing values in our dataset, and thankfully, it was entirely clean!
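The check itself is not shown here. One way to perform it, sketched on a made-up frame, is to count nulls per column. It can also be worth looking for sentinel values: the SVI files are commonly described as using -999 as a placeholder for missing data, which isnull() would not catch. Treat that sentinel as an assumption to verify against the dataset's documentation.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "POP": [55200, 208107, -999],
    "POV": [8422.0, np.nan, 1200.0],
})

# Count true nulls (NaN) in each column.
null_counts = toy.isnull().sum()

# Count the assumed -999 sentinel separately, since Pandas sees it as a valid number.
sentinel_counts = (toy == -999).sum()

print(null_counts)
print(sentinel_counts)
```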

Feature Visualization

To draw the counties, Plot.ly needs GeoJSON data that maps each FIPS code to its county boundary. Such a file is available in Plot.ly's GitHub repository.

We can either download the file and read it with the json module or Pandas, or fetch it directly in Python using urllib:

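from urllib.request import urlopen
import json

# counties_url must be set to the address of the county GeoJSON file
# in Plot.ly's GitHub repository.
with urlopen(counties_url) as response:
    counties = json.load(response)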
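For the download-and-read route, a minimal sketch follows. The file name counties.json is hypothetical, and the tiny two-feature collection written here is only a stand-in so the example runs on its own. It assumes that each feature carries its FIPS code in a top-level id field, which is the key px.choropleth matches against by default.

```python
import json
import os
import tempfile

# Stand-in for a downloaded boundary file (hypothetical name and contents).
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "01001", "properties": {},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
        {"type": "Feature", "id": "48201", "properties": {},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[2, 2], [3, 2], [3, 3], [2, 2]]]}},
    ],
}
path = os.path.join(tempfile.mkdtemp(), "counties.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Read the local file back, as one would with a real download.
with open(path) as f:
    counties_local = json.load(f)

# Collect the feature ids so they can be compared with the FIPS column.
geo_ids = {feature["id"] for feature in counties_local["features"]}
print(sorted(geo_ids))
```

Comparing set(df["FIPS"]) with geo_ids is a quick way to spot codes that will not be drawn.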

Next, I will define a variable called current_feature to represent the data we want to visualize. This will simplify the process of creating additional choropleths. We'll also calculate the minimum and maximum values for our color scale:

current_feature = "POV"

curr_min = min(df[current_feature])

curr_max = max(df[current_feature])
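The full min-to-max range works, but county counts are heavily skewed, so a few very large counties can wash out the rest of the map. One optional alternative, not used in this article, is to clip the color range at percentiles. The values below are made up for illustration.

```python
import pandas as pd

# Made-up, skewed values: twenty small counts and one very large one.
values = pd.Series(list(range(1, 21)) + [5000])

# Full range, equivalent to the min() and max() calls above.
full_range = (values.min(), values.max())

# Clipped range: 5th to 95th percentile, so the outlier saturates the
# scale instead of stretching it.
clipped_range = (values.quantile(0.05), values.quantile(0.95))

print(full_range, clipped_range)
```

Passing clipped_range as range_color would be the only change needed in the plotting call.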

Now we can generate the choropleth using Plot.ly:

fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
                    color_continuous_scale="reds",
                    range_color=(curr_min, curr_max),
                    scope="usa",
                    labels={current_feature: current_feature}
                    )

To add a title to the visualization, we can use the fig.update_layout() method:

fig.update_layout(title_text = "US Poverty by County")

Finally, we can display the choropleth:

fig.show()

US Poverty by County visualization

Upon inspection, poverty counts look fairly uniform across most of the country, with notable exceptions such as Harris County, Texas, which has experienced significant COVID-19 transmission and hospitalization rates. Let's explore our other features similarly.

The next visualization will focus on population:

current_feature = "POP"
curr_min = min(df[current_feature])
curr_max = max(df[current_feature])

fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
                    color_continuous_scale="blues",
                    range_color=(curr_min, curr_max),
                    scope="usa",
                    labels={current_feature: "Population"}
                    )
fig.update_layout(title_text = "US Population by County")
fig.show()

US Population by County visualization

When comparing the two visualizations, population and poverty appear to be correlated across many counties. In fact, overlaying the red poverty map on the blue population map turns many of the shaded (non-white) areas purple, indicating that high poverty counts and large populations tend to coincide.
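The visual impression can be checked numerically. Because POV is a count, it will tend to rise with POP almost by construction, so it can also help to look at a rate. A small sketch with made-up numbers (not taken from the SVI file); POV_RATE is a hypothetical column name introduced here:

```python
import pandas as pd

# Made-up counts for four imaginary counties.
toy = pd.DataFrame({
    "POP": [1000, 5000, 20000, 100000],
    "POV": [150, 600, 3000, 12000],
})

# Pearson correlation between the raw counts.
count_corr = toy["POP"].corr(toy["POV"])

# Share of each county's population living below the poverty line.
toy["POV_RATE"] = toy["POV"] / toy["POP"]

print(round(count_corr, 3))
print(toy["POV_RATE"].tolist())
```

Mapping a rate column such as this one instead of the raw count would show where poverty is concentrated relative to population size.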

Now, let's examine the impact of households on COVID-19 risk:

current_feature = "HOUSEHOLDS"
curr_min = min(df[current_feature])
curr_max = max(df[current_feature])

fig = px.choropleth(df, geojson=counties, locations='FIPS', color=current_feature,
                    color_continuous_scale="greens",
                    range_color=(curr_min, curr_max),
                    scope="usa",
                    labels={current_feature: "Household Count"}
                    )
fig.update_layout(title_text = "US Household Count By County")
fig.show()

US Household Count by County visualization

From these analyses, we can qualitatively suggest that population density and shared housing are significant risk factors for COVID-19 transmission. However, poverty and systemic disparities also play a crucial role.

Feature Engineering

To gain deeper insights from the data, I wanted to create new features that might better represent the underlying issues. One measurement I wanted to explore was the average number of people per household. To compute it, I wrote a lambda function:

POPPERH = lambda pop, hh: pop / hh

df["POPPERH"] = POPPERH(df["POP"], df["HOUSEHOLDS"])

If you're interested in learning more about Python's lambda functions, I have an article that covers their usage in detail.
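If any county reports zero households, the division yields inf rather than raising an error, which would distort the color scale of a map. A defensive variant, shown on made-up data, treats zero households as missing:

```python
import numpy as np
import pandas as pd

# Made-up rows: the second one has no households recorded.
toy = pd.DataFrame({"POP": [1000, 50], "HOUSEHOLDS": [400, 0]})

# Replace zero with NaN so the ratio becomes NaN instead of inf.
households = toy["HOUSEHOLDS"].replace(0, np.nan)
toy["POPPERH"] = toy["POP"] / households

print(toy["POPPERH"].tolist())
```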

Now, let's visualize this new feature:

Average Population Per Household visualization

Interestingly, this analysis revealed that high population and poverty rates do not necessarily correlate with larger families living together. While this observation cannot be definitively asserted, it provides valuable insights, especially in light of earlier trends observed in Italy.

For my final analysis, I wanted to create a composite score to indicate the risk level of various regions based on several features. This score, which I call the Respiratory Virus Risk Index (RVRI), will be a calculated value derived from the significant features in our dataset.

To establish this index, I formulated the following equation:

RVRI = lambda df: (df["UNEMP"] + df["POV"] + df["UNINS"] / df["POPPERH"]) + df["POP"]

df["RVRI"] = RVRI(df)
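One thing to keep in mind when reading the expression: division binds more tightly than addition in Python, so as written only UNINS is divided by POPPERH, and the parentheses do not change that. Whether that grouping is intended is not stated. The toy row below (made-up numbers) shows the difference between the expression as written and a hypothetical variant that divides the whole sum:

```python
import pandas as pd

# One made-up row, chosen so the arithmetic is easy to follow.
toy = pd.DataFrame({
    "UNEMP": [10.0], "POV": [20.0], "UNINS": [30.0],
    "POPPERH": [2.0], "POP": [100.0],
})

# As written in the article: only UNINS is divided by POPPERH.
as_written = (toy["UNEMP"] + toy["POV"] + toy["UNINS"] / toy["POPPERH"]) + toy["POP"]

# Hypothetical alternative grouping: the whole sum is divided.
grouped = (toy["UNEMP"] + toy["POV"] + toy["UNINS"]) / toy["POPPERH"] + toy["POP"]

print(as_written.iloc[0], grouped.iloc[0])
```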

Now, let's visualize the RVRI to see how it reflects the data:

# Visualization code here

Conclusion

The visualizations created using the Plot.ly API effectively highlight key vulnerabilities related to COVID-19 across the U.S. The insights derived from the RVRI and other features underscore the interconnectedness of poverty, population density, and health risks. This analysis emphasizes the importance of addressing these disparities to mitigate the impact of the virus, particularly in a challenging year like 2020.

