Data Cleaning and Exploratory Data Analysis


Data Cleaning

The raw dataset required significant preprocessing to make it suitable for analysis. The original Excel file contained metadata and headers that needed to be separated from the actual data.

Key cleaning steps:

  1. Column extraction: Removed metadata rows and extracted proper column names from row 4, with units of measurement from row 5
  2. Data subset selection: Retained 22 relevant columns from the original 57 columns, focusing on temporal, geographic, demographic, and outage characteristics
  3. Data type conversion: Converted OUTAGE.DURATION and CUSTOMERS.AFFECTED to numeric values, handling invalid entries as NaN
  4. Date parsing: Converted date columns to datetime objects for temporal analysis
  5. Missing value handling: Replaced ‘NA’ strings with proper NaN values, and filled missing CUSTOMERS.AFFECTED values with 0 (assuming unreported outages had minimal impact)

Feature Engineering:

Created additional derived features to enhance analysis:

  • OUTAGE_SEASON: Categorized outages by season based on start date
  • IS_WEEKEND: Boolean flag for weekend occurrences
  • CUSTOMER_DENSITY: Ratio of total customers to population
  • URBANIZATION_RATIO: Decimal representation of urban population percentage
  • POPULATION_DENSITY: Weighted average of urban and rural population densities
  • SEVERITY_CATEGORY: Classified outages as “Small” (<10K customers), “Medium” (10K-100K), “Large” (>100K), or “Unknown/Minor”
  • IS_EXTREME_WEATHER: Boolean flag for weather-related causes

Here’s the head of the cleaned DataFrame (limited to a few columns):

OBS YEAR MONTH U.S._STATE ANOMALY.LEVEL CUSTOMERS.AFFECTED OUTAGE.DURATION
1 2011 7 Minnesota -0.3 70000.0 3060
2 2014 5 Minnesota -0.1 68200.0 1
3 2010 10 Minnesota -1.5 70000.0 3000
4 2012 6 Minnesota -0.1 68200.0 2550
5 2015 7 Minnesota 1.2 250000.0 1740

Univariate Analysis

Summary Statistics:

  • Average outage duration: 2,625 minutes (43.75 hours)
  • Median customers affected: 30,534
  • Most common cause: Severe weather
  • Most affected season: Summer

The distribution of outage durations shows a right-skewed pattern with most outages lasting under 4,000 minutes, but with a long tail of extended outages. The median duration of 701 minutes indicates that half of all outages are resolved within a day’s time, while the mean being higher suggests some extremely long-duration events significantly impact the average.

Generally, this data seems to lack instances of small-scale outages, but does have a lot of missing/unclassified data.

Bivariate Analysis

Analysis of outage duration by cause category reveals significant differences between causes. Severe weather events tend to produce the longest outages with the highest variability, while intentional attacks typically result in shorter, more predictable durations. Equipment failures show moderate duration with less variability than weather events.

The relationship between customers affected and outage duration shows a weak positive correlation, suggesting that larger outages don’t necessarily last longer. This indicates that the scope and duration of outages are influenced by different factors - scope likely depends on grid interconnectedness and population density, while duration depends more on the nature of the damage and repair complexity.

I also plotted the seasonality to ascertain whether it made a difference in terms of outage distribution.

Interesting Aggregates

State-by-State Analysis (Top 10 by Average Customers Affected):

State Avg Customers Affected Avg Duration (min) Avg Population
Florida 282,939 4,095 18.1M
South Carolina 251,913 3,135 4.5M
Illinois 198,026 1,602 12.8M
District of Columbia 175,238 4,304 622K
Texas 165,227 2,705 25.2M

Florida leads in both average customers affected and duration, likely due to frequent severe weather events and hurricanes.

Cause Category Analysis:

Cause Category Mean Duration Median Duration Mean Customers Median Customers
Severe Weather 3,884 2,460 177,206 105,000
Fuel Supply Emergency 13,484 3,960 0.02 0
System Operability 729 215 137,941 25,000
Equipment Failure 1,817 221 50,968 0

This table reveals that severe weather causes both the most widespread outages (highest customer impact) and among the longest duration outages, making it the most significant outage category for overall grid reliability.