Data Cleaning and Exploratory Data Analysis
Data Cleaning
The raw dataset required significant preprocessing to make it suitable for analysis. The original Excel file contained metadata and headers that needed to be separated from the actual data.
Key cleaning steps:
- Column extraction: Removed metadata rows and extracted proper column names from row 4, with units of measurement from row 5
- Data subset selection: Retained 22 relevant columns from the original 57 columns, focusing on temporal, geographic, demographic, and outage characteristics
- Data type conversion: Converted
OUTAGE.DURATIONandCUSTOMERS.AFFECTEDto numeric values, handling invalid entries as NaN - Date parsing: Converted date columns to datetime objects for temporal analysis
- Missing value handling: Replaced ‘NA’ strings with proper NaN values, and filled missing
CUSTOMERS.AFFECTEDvalues with 0 (assuming unreported outages had minimal impact)
Feature Engineering:
Created additional derived features to enhance analysis:
- OUTAGE_SEASON: Categorized outages by season based on start date
- IS_WEEKEND: Boolean flag for weekend occurrences
- CUSTOMER_DENSITY: Ratio of total customers to population
- URBANIZATION_RATIO: Decimal representation of urban population percentage
- POPULATION_DENSITY: Weighted average of urban and rural population densities
- SEVERITY_CATEGORY: Classified outages as “Small” (<10K customers), “Medium” (10K-100K), “Large” (>100K), or “Unknown/Minor”
- IS_EXTREME_WEATHER: Boolean flag for weather-related causes
Here’s the head of the cleaned DataFrame (limited to a few columns):
| OBS | YEAR | MONTH | U.S._STATE | ANOMALY.LEVEL | CUSTOMERS.AFFECTED | OUTAGE.DURATION |
|---|---|---|---|---|---|---|
| 1 | 2011 | 7 | Minnesota | -0.3 | 70000.0 | 3060 |
| 2 | 2014 | 5 | Minnesota | -0.1 | 68200.0 | 1 |
| 3 | 2010 | 10 | Minnesota | -1.5 | 70000.0 | 3000 |
| 4 | 2012 | 6 | Minnesota | -0.1 | 68200.0 | 2550 |
| 5 | 2015 | 7 | Minnesota | 1.2 | 250000.0 | 1740 |
Univariate Analysis
Summary Statistics:
- Average outage duration: 2,625 minutes (43.75 hours)
- Median customers affected: 30,534
- Most common cause: Severe weather
- Most affected season: Summer
The distribution of outage durations shows a right-skewed pattern with most outages lasting under 4,000 minutes, but with a long tail of extended outages. The median duration of 701 minutes indicates that half of all outages are resolved within a day’s time, while the mean being higher suggests some extremely long-duration events significantly impact the average.
Generally, this data seems to lack instances of small-scale outages, but does have a lot of missing/unclassified data.
Bivariate Analysis
Analysis of outage duration by cause category reveals significant differences between causes. Severe weather events tend to produce the longest outages with the highest variability, while intentional attacks typically result in shorter, more predictable durations. Equipment failures show moderate duration with less variability than weather events.
The relationship between customers affected and outage duration shows a weak positive correlation, suggesting that larger outages don’t necessarily last longer. This indicates that the scope and duration of outages are influenced by different factors - scope likely depends on grid interconnectedness and population density, while duration depends more on the nature of the damage and repair complexity.
I also plotted the seasonality to ascertain whether it made a difference in terms of outage distribution.
Interesting Aggregates
State-by-State Analysis (Top 10 by Average Customers Affected):
| State | Avg Customers Affected | Avg Duration (min) | Avg Population |
|---|---|---|---|
| Florida | 282,939 | 4,095 | 18.1M |
| South Carolina | 251,913 | 3,135 | 4.5M |
| Illinois | 198,026 | 1,602 | 12.8M |
| District of Columbia | 175,238 | 4,304 | 622K |
| Texas | 165,227 | 2,705 | 25.2M |
Florida leads in both average customers affected and duration, likely due to frequent severe weather events and hurricanes.
Cause Category Analysis:
| Cause Category | Mean Duration | Median Duration | Mean Customers | Median Customers |
|---|---|---|---|---|
| Severe Weather | 3,884 | 2,460 | 177,206 | 105,000 |
| Fuel Supply Emergency | 13,484 | 3,960 | 0.02 | 0 |
| System Operability | 729 | 215 | 137,941 | 25,000 |
| Equipment Failure | 1,817 | 221 | 50,968 | 0 |
This table reveals that severe weather causes both the most widespread outages (highest customer impact) and among the longest duration outages, making it the most significant outage category for overall grid reliability.