Final Model
Added Features and Rationale:
- Enhanced preprocessing for CUSTOMER_DENSITY: Applied a log transformation to compress the skewed distribution of customer density ratios, so that the relationship between customer concentration and outage characteristics is easier to capture.
- Improved numeric preprocessing: Used StandardScaler for most numeric features to put them on a common scale, while applying the specialized log transformation to customer density. Tree-based models split on thresholds and are largely insensitive to feature scale, so the scaling mainly keeps the pipeline consistent rather than changing the trees' decisions.
- Advanced categorical handling: Maintained one-hot encoding but with improved imputation strategies.
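The preprocessing described above could be sketched roughly as follows. This is a minimal illustration, not the exact pipeline used: only CUSTOMER_DENSITY is named in this write-up, so the other column names are hypothetical placeholders, the imputation strategies (median, most frequent) are assumptions, and `np.log1p` is assumed in place of a plain log so that zero densities do not produce infinities.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Only CUSTOMER_DENSITY comes from the write-up; the other names are hypothetical.
LOG_COLS = ["CUSTOMER_DENSITY"]
NUMERIC_COLS = ["POPULATION", "ANOMALY_LEVEL"]
CATEGORICAL_COLS = ["CLIMATE_REGION", "CAUSE_CATEGORY"]


def build_preprocessor():
    """Return a ColumnTransformer with one sub-pipeline per feature group."""
    # Skewed ratio: fill gaps, then compress the long right tail with log(1 + x).
    log_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumed strategy
        ("log", FunctionTransformer(np.log1p)),
    ])
    # Remaining numeric features: fill gaps, then standardize.
    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumed strategy
        ("scale", StandardScaler()),
    ])
    # Categorical features: fill gaps with the mode, then one-hot encode.
    categorical_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    return ColumnTransformer([
        ("log", log_pipe, LOG_COLS),
        ("num", numeric_pipe, NUMERIC_COLS),
        ("cat", categorical_pipe, CATEGORICAL_COLS),
    ])
```

Keeping the log transform in its own branch means CUSTOMER_DENSITY is never passed through StandardScaler, which matches the split between "most numeric features" and the specialized treatment described above.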
Model Algorithm: Random Forest Classifier with hyperparameter tuning
Hyperparameter Optimization:
- Used GridSearchCV with 5-fold cross-validation
- Search space included:
  - n_estimators: [90, 100, 200, 300]
  - max_depth: [None, 10, 20, 30, 40]
  - min_samples_split: [2, 5, 10]
  - min_samples_leaf: [1, 2, 4, 8]
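As a rough sketch, the search above might be set up like this. The grid values and the 5-fold setting come from this write-up; the `accuracy` scoring, the fixed `random_state`, and searching over a bare classifier (rather than a full preprocessing pipeline with prefixed parameter names) are assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Search space as listed above.
PARAM_GRID = {
    "n_estimators": [90, 100, 200, 300],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4, 8],
}


def build_search(param_grid=PARAM_GRID, random_state=42):
    """Return an unfitted 5-fold grid search over a Random Forest."""
    # random_state and scoring are assumptions, not taken from the write-up.
    forest = RandomForestClassifier(random_state=random_state)
    return GridSearchCV(forest, param_grid, cv=5, scoring="accuracy")
```

The full grid has 4 x 5 x 3 x 4 = 240 combinations, so 5-fold cross-validation fits 1,200 forests before the final refit. After calling `fit` on the training data, the selected values are available as `best_params_`.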
Best Hyperparameters:
- n_estimators: 300
- max_depth: 20
- min_samples_split: 5
- min_samples_leaf: 1
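For reference, a classifier with these settings could be instantiated directly, as in the sketch below. The four hyperparameter values are the ones reported above; the helper name `build_final_model` and the `random_state` are hypothetical additions for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier

# Values reported under "Best Hyperparameters".
BEST_PARAMS = {
    "n_estimators": 300,
    "max_depth": 20,
    "min_samples_split": 5,
    "min_samples_leaf": 1,
}


def build_final_model(random_state=42):
    """Return an unfitted Random Forest using the selected hyperparameters."""
    # random_state is an assumption; the write-up does not state a seed.
    return RandomForestClassifier(random_state=random_state, **BEST_PARAMS)
```

Note that `min_samples_leaf=1` is also the scikit-learn default, so the tuning mainly changed the number of trees, the depth cap, and the split threshold.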
Final Model Performance:
                  precision    recall  f1-score   support
            Long       0.82      0.82      0.82       170
           Short       0.76      0.76      0.76       126
        accuracy                           0.80       296
Improvement Over Baseline:
- Overall accuracy improved from 76% to 80%
- Precision for Long outages improved from 0.80 to 0.82
- Precision for Short outages improved from 0.71 to 0.76
- More balanced performance across both classes: the precision gap between Long and Short outages narrowed from 0.09 to 0.06
The Random Forest approach with proper hyperparameter tuning provides better generalization than the single Decision Tree, while the enhanced feature preprocessing better captures the underlying relationships in the data.