Step 1: Predictive Task
Problem Statement:
Given a user's historical workout data from the Endomondo Fitness Tracking Data, this project aims to predict which activity they are most likely to engage in for their next workout session.
Model Evaluation Strategy:
To evaluate this machine learning model, standard evaluation metrics like accuracy, F1-score (macro-averaged to handle class imbalance), and per-class precision and recall will be used. These metrics, along with a confusion matrix, should inform which activity types the model is accurate at predicting.
Model Validation Approach:
For this model, 70% of the existing data will be used for training, with the validation and testing sets comprising 15% each. Given the temporal nature of this data, splits will be performed chronologically instead of randomly, i.e., the first 70% of workout data will be used for training, the next 15% for validation, and the final 15% for testing. Additionally, features will be engineered only from information available before the workout being predicted, to prevent any sort of look-ahead bias.
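As a minimal sketch of this chronological split (illustrative only; the concrete split on the engineered features appears in Step 3), assuming a DataFrame df of workouts already sorted by timestamp:
# Illustrative 70/15/15 chronological split on a timestamp-sorted DataFrame (hypothetical df).
n = len(df)
train_end, val_end = int(0.70 * n), int(0.85 * n)
train_df = df.iloc[:train_end]          # oldest 70% of workouts
val_df = df.iloc[train_end:val_end]     # next 15%
test_df = df.iloc[val_end:]             # most recent 15%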
Baseline Approaches:
To establish model performance benchmarks, the following baselines will be used as focal points of comparison:
- Frequency Baseline: User's most common activity type predicted.
- Recency Baseline: User's most recent activity type predicted.
- Weighted Frequency-Recency Baseline: User's most common activity in last N workouts predicted.
Classification Models:
The following machine learning models will be used for classification:
- Logistic Regression
- Naïve Bayes Classifier
- Support Vector Machine (SVM)
- Random Forest
The first 3 models serve as preliminary approaches, whereas Random Forest is the approach that I chose to implement for this classification task and eventually fine-tune.
Step 2: Data Preprocessing and Exploratory Data Analysis (EDA)
Dataset Context:
Description
This dataset is a collection of user workout records from the fitness tracking platform Endomondo. The data include multiple sources of sequential sensor data such as heart rate, speed, and GPS, as well as sport type, user gender, and weather conditions (i.e., temperature, humidity). The data were collected for academic purposes, and heuristics were used to clean and filter out abnormal samples, such as values of overly large magnitude and mismatched timestamps.
Citation
Modeling heart rate and activity data for personalized fitness recommendation
Jianmo Ni, Larry Muhlstein, Julian McAuley
WWW, 2019
Data Preprocessing:
Although the dataset has been cleaned beforehand using heuristics, I conduct downstream wrangling to prepare the data for this project; before analyzing anything, I need to retrieve it from its JSON file into a more usable DataFrame format.
First, I store the JSON file on Google Drive and navigate to the required directory.
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
warnings.filterwarnings("ignore")
drive.mount('/content/drive')
Mounted at /content/drive
Then, I parse the data into a Python list, restricting the number of loaded workouts to 50,000 (RAM restrictions).
data = []
with open("/content/drive/MyDrive/endomondoHR_proper.json") as f:
    for i, line in enumerate(f):
        if i == 50000:
            break
        data.append(eval(line))  # each line is a Python-style dict record
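As a quick sanity check, I can peek at the fields of a single record. The sketch below assumes, as the use of eval suggests, that each line is a plain Python dict literal rather than strict JSON; in that case ast.literal_eval is a safer drop-in for eval.
import ast

# Parse just the first record and inspect its fields (sanity check only).
with open("/content/drive/MyDrive/endomondoHR_proper.json") as f:
    first_record = ast.literal_eval(next(f))

print(sorted(first_record.keys()))             # available fields, e.g. sport, heart_rate, timestamp
print(len(first_record.get("timestamp", [])))  # length of one sensor stream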
Finally, I get the data into a usable DataFrame object filtered to contain only the required columns from this dataset.
def summarize_workout(r):
    """Flatten one raw workout record into a single summary row."""
    w = {
        'user_id': r.get('userId'),
        'gender': r.get('gender'),
        'sport': r.get('sport'),
        'workout_id': r.get('id'),
    }
    ts = r.get('timestamp', [])
    if ts:
        w['start'] = min(ts)
        w['end'] = max(ts)
        w['duration_min'] = (w['end'] - w['start']) / 60
    else:
        w['start'] = None
        w['end'] = None
        w['duration_min'] = None
    hr = r.get('heart_rate', [])
    hr_pos = [x for x in hr if x > 0]  # ignore zero/invalid heart-rate readings
    w['avg_heart_rate'] = np.mean(hr_pos) if hr_pos else None
    return w

df = pd.DataFrame([summarize_workout(d) for d in data])
df['datetime'] = pd.to_datetime(df['start'], unit='s')
df = df.drop(columns=["start", "end"])
df
| | user_id | gender | sport | workout_id | duration_min | avg_heart_rate | datetime |
|---|---|---|---|---|---|---|---|
| 0 | 10921915 | male | bike | 396826535 | 126.483333 | 152.650 | 2014-08-24 16:45:46 |
| 1 | 10921915 | male | bike | 392337038 | 74.000000 | 147.710 | 2014-08-16 20:41:22 |
| 2 | 10921915 | male | bike | 389643739 | 112.483333 | 140.554 | 2014-08-12 15:47:39 |
| 3 | 10921915 | male | bike | 386729739 | 75.316667 | 147.020 | 2014-08-07 17:20:42 |
| 4 | 10921915 | male | bike (transport) | 383186560 | 22.616667 | 167.154 | 2014-08-01 16:10:34 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 4399772 | male | mountain bike | 183825234 | 130.083333 | 164.816 | 2013-05-01 15:00:45 |
| 49996 | 4399772 | male | bike | 183288370 | 106.666667 | 111.954 | 2013-04-30 15:05:54 |
| 49997 | 4399772 | male | indoor cycling | 182873807 | 91.200000 | 127.910 | 2013-04-29 17:39:55 |
| 49998 | 4399772 | male | bike | 181847177 | 176.333333 | 120.300 | 2013-04-26 22:01:36 |
| 49999 | 4399772 | male | bike | 181075198 | 192.200000 | 126.388 | 2013-04-24 22:23:48 |
50000 rows × 7 columns
Understanding the number of null values within this dataset to ensure nothing too extreme...
df.isna().sum()
| column | null count |
|---|---|
| user_id | 0 |
| gender | 0 |
| sport | 0 |
| workout_id | 0 |
| duration_min | 0 |
| avg_heart_rate | 88 |
| datetime | 0 |
Exploratory Data Analysis (EDA):
To begin EDA, there are 4 key metrics that I look at given my processed DataFrame object:
- Activity Type: What sport was this activity registered as?
- Activity Duration: How long did this activity last for?
- Avg. Heart Rate: How energy-intensive was this activity?
- Date & Time: When exactly did this activity take place?
1. Activity Type
Understanding the frequency distribution of sports/activity types within the dataset...
df['sport'].value_counts()
| sport | count |
|---|---|
| run | 22297 |
| bike | 20125 |
| mountain bike | 3382 |
| bike (transport) | 2358 |
| indoor cycling | 428 |
| orienteering | 416 |
| walk | 340 |
| skate | 170 |
| cross-country skiing | 118 |
| core stability training | 109 |
| fitness walking | 63 |
| rowing | 53 |
| hiking | 46 |
| kayaking | 26 |
| soccer | 13 |
| weight training | 11 |
| circuit training | 8 |
| treadmill running | 7 |
| roller skiing | 5 |
| downhill skiing | 5 |
| horseback riding | 3 |
| gymnastics | 3 |
| elliptical | 3 |
| tennis | 2 |
| snowboarding | 2 |
| swimming | 2 |
| basketball | 1 |
| snowshoeing | 1 |
| yoga | 1 |
| aerobics | 1 |
| stair climing | 1 |
Visualized as a bar chart...
df['sport'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Sport")
plt.ylabel("Count")
plt.show()
In this situation, multiple activity types simply do not have a representative sample. As such, I chose to filter the data to retain only the sports with more than 300 samples across the data I work with.
sport_counts = df['sport'].value_counts()
valid_sports = sport_counts[sport_counts > 300].index
df = df[df['sport'].isin(valid_sports)]
df
| | user_id | gender | sport | workout_id | duration_min | avg_heart_rate | datetime |
|---|---|---|---|---|---|---|---|
| 0 | 10921915 | male | bike | 396826535 | 126.483333 | 152.650 | 2014-08-24 16:45:46 |
| 1 | 10921915 | male | bike | 392337038 | 74.000000 | 147.710 | 2014-08-16 20:41:22 |
| 2 | 10921915 | male | bike | 389643739 | 112.483333 | 140.554 | 2014-08-12 15:47:39 |
| 3 | 10921915 | male | bike | 386729739 | 75.316667 | 147.020 | 2014-08-07 17:20:42 |
| 4 | 10921915 | male | bike (transport) | 383186560 | 22.616667 | 167.154 | 2014-08-01 16:10:34 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 4399772 | male | mountain bike | 183825234 | 130.083333 | 164.816 | 2013-05-01 15:00:45 |
| 49996 | 4399772 | male | bike | 183288370 | 106.666667 | 111.954 | 2013-04-30 15:05:54 |
| 49997 | 4399772 | male | indoor cycling | 182873807 | 91.200000 | 127.910 | 2013-04-29 17:39:55 |
| 49998 | 4399772 | male | bike | 181847177 | 176.333333 | 120.300 | 2013-04-26 22:01:36 |
| 49999 | 4399772 | male | bike | 181075198 | 192.200000 | 126.388 | 2013-04-24 22:23:48 |
49346 rows × 7 columns
Visualizing this as a bar chart (representative samples maintained)...
df['sport'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Sport")
plt.ylabel("Count")
plt.show()
2. Activity Duration
Understanding the spread of time spent per workout...
df["duration_min"].describe()
| statistic | duration_min |
|---|---|
| count | 4.934600e+04 |
| mean | 2.342148e+02 |
| std | 1.827681e+04 |
| min | 8.350000e+00 |
| 25% | 4.853333e+01 |
| 50% | 6.903333e+01 |
| 75% | 1.138500e+02 |
| max | 2.479691e+06 |
There seems to be an outlier, specifically an activity lasting 2,479,691 minutes, or approximately 5 years!
As such, I filter for activities with a duration of less than 3 days (4,320 minutes) and regenerate the summary statistics.
df = df[df["duration_min"] < 4320]
df["duration_min"].describe()
| statistic | duration_min |
|---|---|
| count | 49342.000000 |
| mean | 87.959696 |
| std | 55.899893 |
| min | 8.350000 |
| 25% | 48.533333 |
| 50% | 69.033333 |
| 75% | 113.845833 |
| max | 299.866667 |
Visualized as a histogram...
df['duration_min'].dropna().hist(bins=40, figsize=(6,4))
plt.title("Distribution of Workout Duration (min)")
plt.show()
3. Avg. Heart Rate
Understanding the distribution of mean heart rate within a workout session...
df["avg_heart_rate"].describe()
| avg_heart_rate | |
|---|---|
| count | 49254.000000 |
| mean | 140.693914 |
| std | 16.155742 |
| min | 40.788000 |
| 25% | 130.612000 |
| 50% | 141.530000 |
| 75% | 151.686000 |
| max | 210.194000 |
Visualized as a histogram...
df['avg_heart_rate'].dropna().hist(bins=40, figsize=(6,4))
plt.title("Distribution of Avg. Heart Rate (bpm)")
plt.show()
4. Date & Time
Understanding the frequency distribution of days of the week...
df['date'] = df['datetime'].dt.date
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['day_of_week'].value_counts()
| day_of_week | count |
|---|---|
| Sunday | 8644 |
| Wednesday | 7301 |
| Tuesday | 7299 |
| Thursday | 7226 |
| Saturday | 7033 |
| Friday | 6595 |
| Monday | 5244 |
Visualized as a bar chart...
df['day_of_week'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Day of Week")
plt.ylabel("Count")
plt.show()
Plotting the total number of workouts across time, to understand when app popularity peaked...
df.groupby('date').size().plot(figsize=(10,4))
plt.title("Workouts Over Time")
plt.ylabel("Number of Workouts")
plt.show()
Finally, understanding the time at which people typically tend to work out (24-hour clock)...
df['hour'].hist(bins=24, figsize=(6,4))
plt.title("Workouts by Hour of Day")
plt.xlabel("Hour")
plt.ylabel("Frequency")
plt.show()
Step 3: Model Development
Model Context:
Problem Formulation
I formulated predicting the next workout type a user will perform, based on their historical workout data, as a supervised multiclass classification problem.
Inputs (X)
- prev_sport: Latest activity type
- prev_duration: Most recent activity duration
- prev_avg_heart_rate: Avg. heart rate during the latest workout
- prev_hour: Hour at which latest activity began
- prev_dayofweek: Day of the week of the most recent workout
- days_since_prev: Number of days since the most recent activity
Output (y)
- target_sport: Activity type of the next workout (the prediction target)
To begin the model development process, I first generate an updated DataFrame object with the required inputs and outputs as discussed above.
df = df.sort_values(["user_id", "datetime"]).reset_index(drop=True)[["user_id", "datetime","workout_id","gender","sport","duration_min","avg_heart_rate","date","hour","day_of_week"]]
rows = []
# Build one (previous workout -> next workout) example from each consecutive pair of workouts per user.
for user, user_df in df.groupby("user_id"):
    user_df = user_df.sort_values("datetime").reset_index(drop=True)
    for i in range(1, len(user_df)):
        prev = user_df.loc[i-1]
        curr = user_df.loc[i]
        rows.append({
            "user_id": user,
            "prev_sport": prev["sport"],
            "prev_duration": prev["duration_min"],
            "prev_avg_heart_rate": prev["avg_heart_rate"],
            "prev_hour": prev["datetime"].hour,
            "prev_dayofweek": prev["datetime"].dayofweek,
            "days_since_prev": (curr["datetime"] - prev["datetime"]).total_seconds() / 86400,
            "target_sport": curr["sport"]
        })
df_ml = pd.DataFrame(rows)
df_ml = df_ml.dropna()
df_ml
| | user_id | prev_sport | prev_duration | prev_avg_heart_rate | prev_hour | prev_dayofweek | days_since_prev | target_sport |
|---|---|---|---|---|---|---|---|---|
| 0 | 5844 | mountain bike | 132.900000 | 103.558 | 22 | 4 | 103.849780 | bike |
| 1 | 5844 | bike | 42.750000 | 115.716 | 18 | 3 | 34.096134 | bike |
| 2 | 5844 | bike | 62.916667 | 102.902 | 20 | 2 | 1.981898 | bike |
| 3 | 5844 | bike | 70.783333 | 104.298 | 20 | 4 | 7.812234 | bike |
| 4 | 5844 | bike | 107.266667 | 96.570 | 15 | 5 | 31.363368 | bike |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49015 | 15279967 | run | 110.033333 | 157.782 | 19 | 4 | 11.298333 | run |
| 49016 | 15279967 | run | 172.000000 | 153.150 | 2 | 2 | 6.892731 | run |
| 49017 | 15279967 | run | 166.066667 | 151.474 | 23 | 1 | 4.783669 | run |
| 49018 | 15279967 | run | 84.566667 | 135.420 | 18 | 6 | 28.941308 | run |
| 49019 | 15279967 | run | 49.466667 | 156.118 | 17 | 0 | 38.009201 | run |
48933 rows × 8 columns
Model Approach
Optimization Objective
My goal is to maximize prediction accuracy of the next workout activity (target_sport), while preventing look-ahead bias and handling class imbalances within the data.
Modeling Choices
Since this is a classification problem, I will first look at the 3 baselines, analyzing their complexity, efficiency, and challenges in implementation:
- Frequency Baseline: User's most common activity type predicted.
- Recency Baseline: User's most recent activity type predicted.
- Weighted Frequency-Recency Baseline: User's most common activity in last N workouts predicted.
1. Frequency Baseline
Predicts a user's most common activity for every instance.
- Complexity: Θ(Users × Activity Types) setup and Θ(1) per prediction.
- Efficiency: No iterative training so quick to compute.
- Advantages: Very simple heuristic, captures user's long-term preference.
- Disadvantages: Ignores recency bias and performs poorly for rare activities.
most_freq = df_ml.groupby('user_id')['prev_sport'].agg(lambda x: x.mode()[0])
df_ml['freq_pred'] = df_ml['user_id'].map(most_freq)
freq_acc = accuracy_score(df_ml['target_sport'], df_ml['freq_pred'])
freq_f1_macro = f1_score(df_ml['target_sport'], df_ml['freq_pred'], average='macro')
freq_f1_weighted = f1_score(df_ml['target_sport'], df_ml['freq_pred'], average='weighted')
print("Frequency Baseline:")
print("Accuracy:", round(freq_acc,4))
print("F1 Macro:", round(freq_f1_macro,4))
print("F1 Weighted:", round(freq_f1_weighted,4))
Frequency Baseline:
Accuracy: 0.7923
F1 Macro: 0.5324
F1 Weighted: 0.7819
Based on these results, the Frequency Baseline captures the user's dominant activity fairly well (moderately high accuracy), but performs relatively poorly on less common activities (lower F1 Macro).
2. Recency Baseline
Predicts a user's most recent activity for every instance.
- Complexity: Θ(Users) setup and Θ(1) per prediction.
- Efficiency: No iterative training and minimal memory so quick to compute.
- Advantages: Very simple heuristic, captures user's recency bias.
- Disadvantages: Ignores long-term preferences and sensitive to rare activities.
df_ml['recency_pred'] = df_ml.groupby('user_id')['prev_sport'].shift(1)
recency_df = df_ml.dropna(subset=['recency_pred'])
recency_acc = accuracy_score(recency_df['target_sport'], recency_df['recency_pred'])
recency_f1_macro = f1_score(recency_df['target_sport'], recency_df['recency_pred'], average='macro')
recency_f1_weighted = f1_score(recency_df['target_sport'], recency_df['recency_pred'], average='weighted')
print("Recency Baseline:")
print("Accuracy:", round(recency_acc,4))
print("F1 Macro:", round(recency_f1_macro,4))
print("F1 Weighted:", round(recency_f1_weighted,4))
Recency Baseline:
Accuracy: 0.7967
F1 Macro: 0.6631
F1 Weighted: 0.7967
Based on these results, the Recency Baseline has a slightly higher accuracy and F1 Macro, showing that recent activities have more predictive power.
3. Weighted Frequency-Recency Baseline
Predicts a user's most frequent activity in the last N workouts.
- Complexity: Θ(Users × N) setup and Θ(N) per prediction.
- Efficiency: Slightly higher computational cost due to maintenance of a sliding window.
- Advantages: Good balance of long-term preferences and short-term trends.
- Disadvantages: Choice of N is crucial, as small N overfits to recent activities while large N ignores short-term trends.
N = 5
def weighted_freq_recent_sports(sports):
    preds = []
    past_activities = []
    for sport in sports:
        if past_activities:
            last_n = past_activities[-N:]
            mode_sport = pd.Series(last_n).mode()[0]  # most common sport in the last N workouts
            preds.append(mode_sport)
        else:
            preds.append(None)  # no history yet for this user
        past_activities.append(sport)
    return preds
df_ml['weighted_pred'] = df_ml.groupby('user_id')['prev_sport'].transform(weighted_freq_recent_sports)
weighted_df = df_ml.dropna(subset=['weighted_pred'])
weighted_acc = accuracy_score(weighted_df['target_sport'], weighted_df['weighted_pred'])
weighted_f1_macro = f1_score(weighted_df['target_sport'], weighted_df['weighted_pred'], average='macro')
weighted_f1_weighted = f1_score(weighted_df['target_sport'], weighted_df['weighted_pred'], average='weighted')
print("Weighted Frequency-Recency Baseline:")
print("Accuracy:", round(weighted_acc,4))
print("F1 Macro:", round(weighted_f1_macro,4))
print("F1 Weighted:", round(weighted_f1_weighted,4))
Weighted Frequency-Recency Baseline:
Accuracy: 0.8127
F1 Macro: 0.6874
F1 Weighted: 0.8112
These results for the Weighted Frequency-Recency Baseline seem pretty promising! It seems to balance long-term preference and recent trends, achieving the highest accuracy (approx. 81%), F1 Macro (approx. 69%), and F1 Weighted (approx. 81%) scores. As such, it's probably a good idea to use this weighted frequency-recency approach as a starting point for the machine learning models.
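Since the choice of N is the main knob of this baseline, a quick sweep over a few window sizes (reusing the helper above, which reads the global N) shows how sensitive the results are to this choice; the values tried below are illustrative.
# Sweep a few window sizes for the weighted frequency-recency baseline (illustrative values).
for n in [3, 5, 10, 20]:
    N = n  # weighted_freq_recent_sports reads the global N
    preds = df_ml.groupby('user_id')['prev_sport'].transform(weighted_freq_recent_sports)
    mask = preds.notna()
    acc = accuracy_score(df_ml.loc[mask, 'target_sport'], preds[mask])
    print(f"N={n}: accuracy={round(acc, 4)}")
N = 5  # restore the default used above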
Feature Engineering
Before training any model, I first conduct generic feature-engineering tasks so that the data can be consumed by each algorithm. The first step is to encode the categorical variables (target_sport, prev_sport) using label encoding.
le_sport = LabelEncoder()
df_ml['target_sport_enc'] = le_sport.fit_transform(df_ml['target_sport'])
df_ml['prev_sport_enc'] = le_sport.transform(df_ml['prev_sport'])
Next, I can leverage the well-performing Weighted Frequency-Recency Baseline as a numeric feature, dropping null rows.
df_ml = df_ml.dropna(subset=['weighted_pred'])
df_ml.loc[:, 'weighted_pred_enc'] = le_sport.transform(df_ml['weighted_pred'])
Then, I cast the numeric columns to floating-point values.
numeric_cols = ['prev_duration', 'prev_avg_heart_rate', 'prev_hour', 'prev_dayofweek', 'days_since_prev']
for col in numeric_cols:
    df_ml[col] = df_ml[col].astype(float)
Subsequently, I add two subtle interaction terms: a flag for whether the most recent sport matches the baseline prediction, and an intensity proxy (duration multiplied by average heart rate) for the most recent activity.
df_ml['prev_eq_weighted'] = (df_ml['prev_sport_enc'] == df_ml['weighted_pred_enc']).astype(int)
df_ml['prev_intensity'] = df_ml['prev_duration'] * df_ml['prev_avg_heart_rate']
Finally, I can drop null values and prepare the final feature DataFrame object.
df_ml = df_ml.dropna(subset=numeric_cols)
feature_cols = ['prev_sport_enc','weighted_pred_enc','prev_duration','prev_avg_heart_rate',
'prev_hour','prev_dayofweek','days_since_prev','prev_eq_weighted','prev_intensity','target_sport_enc']
df_features = df_ml[feature_cols].copy()
df_features
| | prev_sport_enc | weighted_pred_enc | prev_duration | prev_avg_heart_rate | prev_hour | prev_dayofweek | days_since_prev | prev_eq_weighted | prev_intensity | target_sport_enc |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 42.750000 | 115.716 | 18.0 | 3.0 | 34.096134 | 0 | 4946.859000 | 0 |
| 2 | 0 | 0 | 62.916667 | 102.902 | 20.0 | 2.0 | 1.981898 | 1 | 6474.250833 | 0 |
| 3 | 0 | 0 | 70.783333 | 104.298 | 20.0 | 4.0 | 7.812234 | 1 | 7382.560100 | 0 |
| 4 | 0 | 0 | 107.266667 | 96.570 | 15.0 | 5.0 | 31.363368 | 1 | 10358.742000 | 0 |
| 5 | 0 | 0 | 67.633333 | 101.834 | 0.0 | 2.0 | 2.855729 | 1 | 6887.372867 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49015 | 5 | 5 | 110.033333 | 157.782 | 19.0 | 4.0 | 11.298333 | 1 | 17361.279400 | 5 |
| 49016 | 5 | 5 | 172.000000 | 153.150 | 2.0 | 2.0 | 6.892731 | 1 | 26341.800000 | 5 |
| 49017 | 5 | 5 | 166.066667 | 151.474 | 23.0 | 1.0 | 4.783669 | 1 | 25154.782267 | 5 |
| 49018 | 5 | 5 | 84.566667 | 135.420 | 18.0 | 6.0 | 28.941308 | 1 | 11452.018000 | 5 |
| 49019 | 5 | 5 | 49.466667 | 156.118 | 17.0 | 0.0 | 38.009201 | 1 | 7722.637067 | 5 |
48616 rows × 10 columns
Now that I have incorporated the relevant baseline and conducted the required feature-engineering tasks, I can isolate the target variable and split the data into training, validation, and testing sets.
- For this project, I have chosen 70% of the loaded data to be training data, 15% to serve as a validation set, and the final 15% as the testing data.
- I also used a chronological split to prevent look-ahead bias and respect the time-series nature of the dataset.
- I chose to standardize numeric features so that no single variable dominates and obscures signal in the other inputs. Only training data was used to fit the scaler, ensuring fair evaluation on the validation/test sets.
feature_cols = ['prev_sport_enc', 'weighted_pred_enc', 'prev_duration', 'prev_avg_heart_rate',
'prev_hour', 'prev_dayofweek', 'days_since_prev', 'prev_eq_weighted', 'prev_intensity']
X = df_features[feature_cols].copy()
y = df_features['target_sport_enc'].copy()
n = len(df_features)
train_end = int(0.7 * n)
val_end = int(0.85 * n)
X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]
numeric_cols = ['prev_duration', 'prev_avg_heart_rate', 'prev_hour', 'prev_dayofweek', 'days_since_prev', 'prev_intensity']
scaler = StandardScaler()
X_train.loc[:, numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val.loc[:, numeric_cols] = scaler.transform(X_val[numeric_cols])
X_test.loc[:, numeric_cols] = scaler.transform(X_test[numeric_cols])
print(f"Train size: {X_train.shape[0]}\nValidation size: {X_val.shape[0]}\nTest size: {X_test.shape[0]}")
Train size: 34031
Validation size: 7292
Test size: 7293
Now that I split the data into training/validating/testing sets, I can attempt to solve the classification problem using the machine learning algorithms discussed above, namely:
- Logistic Regression
- Naïve Bayes Classifier
- Support Vector Machine (SVM)
- Random Forest
1. Logistic Regression
For the Logistic Regression model, I use the multinomial setting, as the target variable includes multiple sport labels. This uses softmax regression, allowing the model to learn all classes jointly instead of one-by-one. I used the limited-memory BFGS (lbfgs) solver, which converges well on datasets with a large number of samples.
- Complexity: Θ(Samples × Features) per iteration.
- Efficiency: Very fast to train and works well with high-dimensional sparse data.
- Advantages: Simple, with interpretable coefficients.
- Disadvantages: Struggles to capture non-linear relationships, particularly with class imbalances.
lr = LogisticRegression(
multi_class='multinomial',
solver='lbfgs',
max_iter=1000,
n_jobs=-1
)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)
val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)
print("Logistic Regression Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_lr), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_lr, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_lr, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
y_val,
y_pred_lr,
labels=val_classes,
target_names=target_names,
zero_division=0
))
Logistic Regression Performance (VAL)
Accuracy: 0.7592
F1 Macro: 0.2509
F1 Weighted: 0.7044
Detailed Classification Report:
precision recall f1-score support
bike 0.78 0.86 0.82 2659
bike (transport) 0.00 0.00 0.00 49
indoor cycling 0.00 0.00 0.00 239
mountain bike 1.00 0.00 0.00 521
orienteering 0.00 0.00 0.00 170
run 0.76 0.91 0.82 3589
walk 0.67 0.06 0.11 65
accuracy 0.76 7292
macro avg 0.46 0.26 0.25 7292
weighted avg 0.73 0.76 0.70 7292
The Logistic Regression model reached an accuracy of about 76%, which is reasonable but still below the baselines. In particular, its low F1 Macro score of 0.2509 indicates that the model performs well only for certain classes and is heavily skewed by the existing class imbalance.
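One lightweight remedy worth noting would be to re-fit the same model with class weighting, which reweights each class inversely to its frequency and typically trades a little accuracy for macro F1. A minimal sketch, assuming the split and encoder defined above (lr_balanced and y_pred_lr_bal are hypothetical names):
# Hypothetical variant: inverse-frequency class weights to lift macro F1 on rare sports.
lr_balanced = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    max_iter=1000,
    class_weight='balanced',
    n_jobs=-1
)
lr_balanced.fit(X_train, y_train)
y_pred_lr_bal = lr_balanced.predict(X_val)
print("Balanced Logistic Regression (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_lr_bal), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_lr_bal, average='macro'), 4))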
2. Naïve Bayes Classifier
I then used the baseline Naïve Bayes Classifier, again as a comparison point for the classification problem.
- Complexity: Θ(Samples × Features) setup and Θ(Features) per prediction.
- Efficiency: Very fast to train and uses far less memory than the other classifiers.
- Advantages: Very memory-efficient and works well with categorical data.
- Disadvantages: The Gaussian assumption is a poor fit for the label-encoded features, and there is little scope for hyperparameter tuning.
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_val)
val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)
print("Naïve Bayes Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_nb), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_nb, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_nb, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
y_val,
y_pred_nb,
labels=val_classes,
target_names=target_names,
zero_division=0
))
Naïve Bayes Performance (VAL)
Accuracy: 0.6529
F1 Macro: 0.2414
F1 Weighted: 0.6458
Detailed Classification Report:
precision recall f1-score support
bike 0.73 0.61 0.66 2659
bike (transport) 0.02 0.41 0.04 49
indoor cycling 0.00 0.00 0.00 239
mountain bike 0.34 0.03 0.05 521
orienteering 1.00 0.01 0.01 170
run 0.76 0.87 0.81 3589
walk 0.23 0.08 0.11 65
accuracy 0.65 7292
macro avg 0.44 0.28 0.24 7292
weighted avg 0.69 0.65 0.65 7292
The results of the Naïve Bayes Classifier are rather underwhelming, only hitting an accuracy of 65%, with significantly worse F1 Macro and F1 Weighted scores too.
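To see where the model collapses, one can compare the distribution of its validation-set predictions against the true labels; a quick diagnostic sketch, assuming the variables defined above (nb_diag is a hypothetical name):
# Compare predicted vs. true label counts on the validation set.
nb_diag = pd.DataFrame({
    "true": pd.Series(le_sport.inverse_transform(y_val)).value_counts(),
    "predicted": pd.Series(le_sport.inverse_transform(y_pred_nb)).value_counts(),
}).fillna(0).astype(int)
print(nb_diag)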
3. Support Vector Machine (SVM)
For the Support Vector Machine Classifier, I used a radial basis function (RBF) kernel, which is better equipped to capture non-linear relationships. I also used a one-vs-one (OvO) strategy for multiclass classification, wherein the model learns a decision boundary for each pair of classes.
- Complexity: Training ranges from roughly Θ(n²) to Θ(n³) in the number of samples.
- Efficiency: Heavy compute required, so slower than the other models considered.
- Advantages: Can capture non-linear relationships with the RBF kernel.
- Disadvantages: Very memory-intensive and less interpretable than other classifiers.
svm = SVC(
kernel='rbf',
decision_function_shape='ovo',
probability=False,
)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_val)
val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)
print("Support Vector Machine (SVM) Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_svm), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_svm, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_svm, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
y_val,
y_pred_svm,
labels=val_classes,
target_names=target_names,
zero_division=0
))
Support Vector Machine (SVM) Performance (VAL)
Accuracy: 0.7999
F1 Macro: 0.4466
F1 Weighted: 0.7802
Detailed Classification Report:
precision recall f1-score support
bike 0.81 0.86 0.84 2659
bike (transport) 0.12 0.45 0.19 49
indoor cycling 0.00 0.00 0.00 239
mountain bike 0.79 0.61 0.69 521
orienteering 0.00 0.00 0.00 170
run 0.82 0.89 0.85 3589
walk 0.72 0.45 0.55 65
accuracy 0.80 7292
macro avg 0.47 0.46 0.45 7292
weighted avg 0.77 0.80 0.78 7292
My implementation of the Support Vector Machine (SVM) Classifier reached a reasonable accuracy of around 80%, with a substantial improvement in the F1 Weighted score. This also extends to the F1 Macro score, from which I can infer that the model is better at distinguishing minority classes.
At this point, I have a good understanding of classification approaches without finetuning and can use these results and takeaways to optimize my final model, Random Forest.
4. Random Forest
Random Forest uses a large collection of decision trees built on different bootstrap samples of the data, with each tree grown independently and the final prediction obtained by aggregating the outputs of all trees. This reduces variance, improves generalization, and makes the model robust to overfitting, all of which matter given the heavily imbalanced data.
For this model, I trained 1,000 decision trees with a max depth of 20, enforcing minimum split and leaf constraints to prevent overly complex trees. I also set max_features='sqrt', so that each tree considers only a random subset of features at each split, which decorrelates the trees. Finally, I used balanced_subsample weighting to handle class imbalance, and trained with bootstrap aggregation and out-of-bag evaluation enabled.
- Complexity: Roughly Θ(Trees × Samples × log(Samples)) for training, which dominates the prediction cost.
- Efficiency: Significant compute required, but parallelizable, so moderately efficient.
- Advantages: Robust to overfitting, naturally handles class imbalances, and good for capturing complex relationships.
- Disadvantages: Significant hyperparameter tuning and feature scaling required to optimize results.
rf_model = RandomForestClassifier(
n_estimators=1000,
max_depth=20,
min_samples_split=60,
min_samples_leaf=2,
max_features='sqrt',
class_weight='balanced_subsample',
bootstrap=True,
oob_score=True,
n_jobs=-1,
random_state=42,
)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_val)
print("Random Forest Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
y_val,
y_pred,
labels=val_classes,
target_names=target_names,
zero_division=0
))
Random Forest Performance (VAL)
Accuracy: 0.8206
F1 Macro: 0.6972
F1 Weighted: 0.824
Detailed Classification Report:
precision recall f1-score support
bike 0.85 0.82 0.83 2659
bike (transport) 0.43 0.47 0.45 49
indoor cycling 0.53 0.75 0.62 239
mountain bike 0.66 0.80 0.73 521
orienteering 0.61 0.81 0.69 170
run 0.88 0.84 0.86 3589
walk 0.61 0.80 0.69 65
accuracy 0.82 7292
macro avg 0.65 0.75 0.70 7292
weighted avg 0.83 0.82 0.82 7292
The results of Random Forest are substantially better than the other 3 modeling approaches, with accuracy reaching 82%, higher than the best heuristic baseline of roughly 81%. While this gain may seem modest, the more important takeaway is the dramatic increase in the F1 Macro score (roughly 70%), which means I finally have a non-heuristic multiclass classification model that adapts sufficiently well to the imbalanced classes.
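One way hyperparameters like these can be settled is a small sweep against the validation set; the sketch below shows how such a sweep could be run (the grid values are illustrative, not the exact settings I searched).
# Illustrative validation-set sweep over a few Random Forest settings.
for depth in [10, 20, 30]:
    for n_trees in [200, 500, 1000]:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_depth=depth,
            max_features='sqrt',
            class_weight='balanced_subsample',
            n_jobs=-1,
            random_state=42,
        )
        rf.fit(X_train, y_train)
        val_f1 = f1_score(y_val, rf.predict(X_val), average='macro')
        print(f"depth={depth}, trees={n_trees}: F1 Macro={round(val_f1, 4)}")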
Step 4: Model Evaluation
Evaluating the Model on Testing Data
I can now apply this model to the testing data to see whether its performance holds up, and then conduct more rigorous evaluation to identify strengths and weaknesses.
y_test_pred = rf_model.predict(X_test)
print("\nRandom Forest Performance (TEST)")
print("Accuracy:", round(accuracy_score(y_test, y_test_pred), 4))
print("F1 Macro:", round(f1_score(y_test, y_test_pred, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_test, y_test_pred, average='weighted'), 4))
print("\nDetailed Classification Report (TEST):")
print(classification_report(
y_test,
y_test_pred,
labels=val_classes,
target_names=target_names,
zero_division=0
))
Random Forest Performance (TEST)
Accuracy: 0.8719
F1 Macro: 0.5933
F1 Weighted: 0.8757
Detailed Classification Report (TEST):
precision recall f1-score support
bike 0.93 0.92 0.92 4336
bike (transport) 0.52 0.70 0.60 70
indoor cycling 0.23 0.75 0.35 8
mountain bike 0.53 0.75 0.62 229
orienteering 0.22 0.32 0.26 22
run 0.87 0.82 0.85 2527
walk 0.49 0.64 0.55 101
accuracy 0.87 7293
macro avg 0.54 0.70 0.59 7293
weighted avg 0.88 0.87 0.88 7293
Fortunately, the model does well on the testing data too, with accuracy and F1 Weighted score exceeding 87%. The dip in the F1 Macro score can be attributed to the disproportionately small support for indoor cycling and orienteering in the testing set, which leaves the model little room to get those classes right.
Evaluation Metrics
For this predictive task, I chose to evaluate model performance using 3 metrics, namely:
- Accuracy, which provides an overall view of correct predictions across all classes.
- F1 Macro Score, which weights performance on every class equally, making it important for an imbalanced dataset such as this one (formalized below).
- F1 Weighted Score, which balances the contribution of each class according to the number of occurrences, accounting for class biases while representing overall predictive power.
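For reference, writing $\mathrm{F1}_c$ for the per-class F1 score and $n_c$ for the support (number of true instances) of class $c$ over $C$ classes:

$$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{F1}_c, \qquad \mathrm{F1}_{\mathrm{weighted}} = \frac{\sum_{c=1}^{C} n_c\,\mathrm{F1}_c}{\sum_{c=1}^{C} n_c}$$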
Arguably, Accuracy and F1 Macro are the 2 most appropriate metrics, as the former provides a standard measure of overall model performance, while the latter exposes how well the model handles the class imbalance in the data. I can better compare and contrast the baselines and machine learning models with a summary DataFrame object.
models = ["Frequency Baseline", "Recency Baseline", "Weighted Frequency-Recency Baseline",
          "Logistic Regression (VAL)", "Naïve Bayes (VAL)", "SVM (VAL)", "Random Forest (VAL)", "Random Forest (TEST)"]
accuracy = [0.7923, 0.7967, 0.8127, 0.7592, 0.6529, 0.7999, 0.8206, 0.8719]
f1_macro = [0.5324, 0.6631, 0.6874, 0.2509, 0.2414, 0.4466, 0.6972, 0.5933]
f1_weighted = [0.7819, 0.7967, 0.8112, 0.7044, 0.6458, 0.7802, 0.8240, 0.8757]
df_summary = pd.DataFrame({"Model": models, "Accuracy": accuracy, "F1 Macro": f1_macro, "F1 Weighted": f1_weighted})
df_summary[["Accuracy", "F1 Macro", "F1 Weighted"]] = df_summary[["Accuracy", "F1 Macro", "F1 Weighted"]].round(4)
df_summary
| | Model | Accuracy | F1 Macro | F1 Weighted |
|---|---|---|---|---|
| 0 | Frequency Baseline | 0.7923 | 0.5324 | 0.7819 |
| 1 | Recency Baseline | 0.7967 | 0.6631 | 0.7967 |
| 2 | Weighted Frequency-Recency Baseline | 0.8127 | 0.6874 | 0.8112 |
| 3 | Logistic Regression (VAL) | 0.7592 | 0.2509 | 0.7044 |
| 4 | Naïve Bayes (VAL) | 0.6529 | 0.2414 | 0.6458 |
| 5 | SVM (VAL) | 0.7999 | 0.4466 | 0.7802 |
| 6 | Random Forest (VAL) | 0.8206 | 0.6972 | 0.8240 |
| 7 | Random Forest (TEST) | 0.8719 | 0.5933 | 0.8757 |
As such, the Random Forest model exceeds the baselines in Accuracy and F1 Weighted score on both the validation and testing sets. It also exceeds the baselines in F1 Macro score on the validation set; the dip on the testing set can be attributed to a few classes with very small, non-representative support.
Although I also generated detailed classification reports where appropriate, they are not the most digestible way to evaluate the model. That said, the best way to visually understand my model is through a confusion matrix.
y_test_orig = le_sport.inverse_transform(y_test)
y_test_pred_orig = le_sport.inverse_transform(y_test_pred)
val_classes_orig = le_sport.inverse_transform(val_classes)
cm = confusion_matrix(y_test_orig, y_test_pred_orig, labels=val_classes_orig)
cm_pct = cm / cm.sum(axis=1, keepdims=True)
plt.figure(figsize=(8,8))
plt.imshow(cm_pct, cmap="Blues")
plt.title("Confusion Matrix (Percent) – Test Set")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.xticks(np.arange(len(val_classes_orig)), val_classes_orig, rotation=90)
plt.yticks(np.arange(len(val_classes_orig)), val_classes_orig)
for i in range(cm_pct.shape[0]):
    for j in range(cm_pct.shape[1]):
        value = cm_pct[i, j] * 100
        text = f"{value:.1f}%" if value > 0 else ""
        plt.text(j, i, text, ha="center", va="center", fontsize=9)
plt.colorbar(label="Percent")
plt.tight_layout()
plt.show()
This puts into perspective how the model gets tripped up by orienteering in the test set (only 22 samples) and poses the question of whether orienteering and running can be better segmented in downstream analysis.
From a practical lens, this slip-up actually captures a salient pattern that the model is picking up on. Orienteering is a group of sports in which participants use a map and compass to navigate from point to point in unfamiliar terrain as quickly as possible. This involves a lot of running, which can explain why the model struggles with this particular classification. Although the disparity is not as significant, similar logic can explain the confusion between running and walking, as pace varies greatly from person to person.
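To round out the evaluation, the Random Forest's impurity-based feature importances give a rough view of which inputs drive these predictions; a quick sketch using the fitted model and feature list from above (impurity-based importances are only indicative):
# Impurity-based feature importances from the fitted Random Forest.
importances = pd.Series(rf_model.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(importances.round(4))
importances.plot(kind='barh', figsize=(6, 4))
plt.title("Random Forest Feature Importances")
plt.xlabel("Importance")
plt.show()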
Step 5: Discussion of Related Work
How has this dataset (or similar datasets) been used before?
The Endomondo Fitness Tracking Data were collected by researchers at the University of California, San Diego for academic use. The paper published alongside the data collection effort focuses on modeling heart rate and activity data to deliver personalized fitness recommendations.
Specifically, this introductory paper focuses on FitRec, an LSTM-based model that estimates a user's heart rate profile and accordingly recommends suitable activities. The key takeaways from this study suggest that the LSTM model is able to learn personalized, contextual, and activity-specific user dynamics simply based on heart rate profiling.
While FitRec was implemented with a specific purpose in mind, the strong results from that study validate the use of fitness data such as Endomondo's for activity recommendation.
How has prior work approached the same (or similar) tasks?
Specific to the Endomondo dataset, previous work has focused on predicting heart rate trajectory to determine the expected response to an activity, in tandem with other factors such as weather, duration, elevation, etc. These predictions have then been used to conduct downstream analyses which typically have to do with predictive modeling or recommender systems.
That said, there have been instances of historical fitness data being used to predict a user's subsequent activity, the same task explored in this project. These typically make use of deep learning architectures, such as Recurrent Neural Networks (RNNs), or Temporal Convolutional Networks (TCNs), which makes sense given the complex underlying patterns.
On the simpler end, N-gram models and Markov chains have also been used, although with less success due to their inability to capture long-term temporal trends. There has also been recent work using BERT-style Transformer models, which remains a relatively unexplored approach to this classification problem.
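As a rough illustration of the simplest of these approaches, a first-order Markov chain over this project's data amounts to a transition table from the previous sport to the next one; the sketch below is evaluated in-sample, without the chronological split, so the number it prints is only indicative.
# Hypothetical first-order Markov baseline: predict the most likely next sport given the previous one.
transitions = pd.crosstab(df_ml['prev_sport'], df_ml['target_sport'])
markov_pred = df_ml['prev_sport'].map(transitions.idxmax(axis=1))
print("Markov-chain accuracy (in-sample):", round(accuracy_score(df_ml['target_sport'], markov_pred), 4))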
How do your results match or differ from what has been reported in related work?
The biggest difference in my results likely arises from the specific Endomondo dataset I used. Given the significant class imbalance and RAM issues encountered, the evaluation metrics were limited to F1 scores and Accuracy. In comparison, most research uses Top-k accuracy or Recall@K to recommend a couple of suggested next activities. Implementing these metrics in this project would likely prove futile due to the limited dataset size.
The models used in this investigation are exploratory and not meant to build on existing work, which typically rests on stronger theoretical frameworks. That said, even if the entire dataset were used, or a different fitness dataset considered, one finding this project shares with prior work is the problematic structure of fitness data. Oftentimes, class imbalances are introduced by people trying a new activity only a handful of times before quitting. This consensus, while agreeable, is unfortunately a reflection of most fitness data, wherein unique user-activity dynamics have to be compromised to generate a more holistic sample.
As such, despite differences in evaluation metrics, the common takeaway is that a user-balanced, activity-biased dataset is preferable to an activity-balanced, user-biased one. In simpler terms, more emphasis should go toward handling class imbalance than toward standardizing or normalizing the specific data collected.