πŸƒβ€β™‚οΈ NextMove: Predicting Next Workout ActivityΒΆ

This notebook covers data preprocessing, exploratory data analysis (EDA), and the development and evaluation of a machine learning model that predicts a user's next workout activity from their historical fitness data on Endomondo.¶

Step 1: Predictive TaskΒΆ

Problem Statement:

Given a user's historical workout data from the Endomondo Fitness Tracking Data, this project aims to predict which activity they are most likely to engage in during their next workout session.

Model Evaluation Strategy:

To evaluate this machine learning model, standard evaluation metrics will be used: accuracy, F1-score (macro-averaged to handle class imbalance), and per-class precision and recall. These metrics, along with a confusion matrix, will reveal which activity types the model predicts accurately.
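To illustrate why macro-averaging matters under class imbalance, here is a minimal toy example (the labels are invented purely for illustration): a classifier that always predicts the majority class still scores high accuracy, but the macro F1 exposes the ignored minority class.

In [ ]:
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 8 "run", 2 "yoga"; the classifier never predicts "yoga".
y_true = ["run"] * 8 + ["yoga"] * 2
y_pred = ["run"] * 10

print(accuracy_score(y_true, y_pred))                # 0.8, looks strong
print(f1_score(y_true, y_pred, average="macro"))     # ~0.44, exposes the missed class
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.71, downweights the rare class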

Model Validation Approach:

For this model, 70% of the existing data will be used for training, with validation and testing sets comprising 15% each. Given the temporal nature of this data, splits will be performed chronologically instead of randomly, i.e., the first 70% of workout data will be used for training, the next 15% for validation, and the final 15% for testing. Additionally, features will be engineered only from information available at prediction time to prevent look-ahead bias.
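As a preview of the split described above (the actual split is implemented in Step 3), a minimal sketch of a chronological 70/15/15 split might look like this; the time column name is illustrative:

In [ ]:
import pandas as pd

def chronological_split(df: pd.DataFrame, time_col: str = "datetime"):
    # Sort by time, then cut at the 70% and 85% marks; never shuffle.
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    train_end, val_end = int(0.7 * n), int(0.85 * n)
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]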

Baseline Approaches:

To establish model performance benchmarks, the following baselines will be used as focal points of comparison:

  1. Frequency Baseline: User's most common activity type predicted.
  2. Recency Baseline: User's most recent activity type predicted.
  3. Weighted Frequency-Recency Baseline: User's most common activity in last N workouts predicted.

Classification Models:

The following machine learning models will be used for classification:

  1. Logistic Regression
  2. NaΓ―ve Bayes Classifier
  3. Support Vector Machine (SVM)
  4. Random Forest

The first 3 models serve as preliminary approaches, whereas Random Forest is the more novel approach I chose to implement for this classification task and ultimately fine-tune.

ΒΆ


Step 2: Data Preprocessing and Exploratory Data Analysis (EDA)¶

Dataset Context:ΒΆ

Description
This dataset is a collection of user workout records from the fitness tracking platform Endomondo. It includes multiple sources of sequential sensor data, such as heart rate, speed, and GPS, as well as sport type, user gender, and weather conditions (i.e., temperature and humidity). The data were collected for academic purposes and cleaned with heuristics that filter out abnormal samples, such as overly large magnitudes and mismatched timestamps.

Citation
Modeling heart rate and activity data for personalized fitness recommendation
Jianmo Ni, Larry Muhlstein, Julian McAuley
WWW, 2019

Data Preprocessing:ΒΆ

Although the dataset has been cleaned beforehand using heuristics, I conduct downstream wrangling to prepare the data for this project. Before analyzing anything, I need to load the JSON file into a more usable DataFrame format.
First, I store the JSON file on Google Drive and navigate to the required directory.

InΒ [Β ]:
from google.colab import drive
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
warnings.filterwarnings("ignore")
drive.mount('/content/drive')
Mounted at /content/drive

Then, I parse the data into a Python list, restricting the number of loaded workouts to 50,000 due to RAM restrictions.

InΒ [Β ]:
data = []
# Each line of the file is a Python dict literal rather than strict JSON,
# so it is parsed with eval(); acceptable only because the file is trusted.
with open("/content/drive/MyDrive/endomondoHR_proper.json") as f:
    for i, line in enumerate(f):
        if i == 50000:  # cap loaded workouts at 50,000 (RAM restrictions)
            break
        data.append(eval(line))

Finally, I get the data into a usable DataFrame object filtered to contain only the required columns from this dataset.

InΒ [Β ]:
def summarize_workout(r):
    # Flatten one raw workout record into a single summary row.
    w = {
        'user_id': r.get('userId'),
        'gender': r.get('gender'),
        'sport': r.get('sport'),
        'workout_id': r.get('id'),
    }

    # Derive start, end, and duration from the timestamp sequence (seconds).
    ts = r.get('timestamp', [])
    if ts:
        w['start'] = min(ts)
        w['end'] = max(ts)
        w['duration_min'] = (w['end'] - w['start']) / 60
    else:
        w['start'] = None
        w['end'] = None
        w['duration_min'] = None

    # Average heart rate over positive readings only (non-positive values are treated as invalid).
    hr = r.get('heart_rate', [])
    hr_pos = [x for x in hr if x > 0]
    w['avg_heart_rate'] = np.mean(hr_pos) if hr_pos else None

    return w

df = pd.DataFrame([summarize_workout(d) for d in data])
df['datetime'] = pd.to_datetime(df['start'], unit='s')
df = df.drop(columns=["start", "end"])
df
Out[Β ]:
user_id gender sport workout_id duration_min avg_heart_rate datetime
0 10921915 male bike 396826535 126.483333 152.650 2014-08-24 16:45:46
1 10921915 male bike 392337038 74.000000 147.710 2014-08-16 20:41:22
2 10921915 male bike 389643739 112.483333 140.554 2014-08-12 15:47:39
3 10921915 male bike 386729739 75.316667 147.020 2014-08-07 17:20:42
4 10921915 male bike (transport) 383186560 22.616667 167.154 2014-08-01 16:10:34
... ... ... ... ... ... ... ...
49995 4399772 male mountain bike 183825234 130.083333 164.816 2013-05-01 15:00:45
49996 4399772 male bike 183288370 106.666667 111.954 2013-04-30 15:05:54
49997 4399772 male indoor cycling 182873807 91.200000 127.910 2013-04-29 17:39:55
49998 4399772 male bike 181847177 176.333333 120.300 2013-04-26 22:01:36
49999 4399772 male bike 181075198 192.200000 126.388 2013-04-24 22:23:48

50000 rows Γ— 7 columns

Checking the number of null values within this dataset to ensure nothing is too extreme...

InΒ [Β ]:
df.isna().sum()
Out[Β ]:
0
user_id 0
gender 0
sport 0
workout_id 0
duration_min 0
avg_heart_rate 88
datetime 0

Exploratory Data Analysis (EDA):ΒΆ

To begin EDA, there are 4 key metrics that I look at given my processed DataFrame object:

  1. Activity Type: What sport was this activity registered as?
  2. Activity Duration: How long did this activity last?
  3. Avg. Heart Rate: How energy-intensive was this activity?
  4. Date & Time: When exactly did this activity take place?

1. Activity Type
Understanding the frequency distribution of sports/activity types within the dataset...
InΒ [Β ]:
df['sport'].value_counts()
Out[Β ]:
count
sport
run 22297
bike 20125
mountain bike 3382
bike (transport) 2358
indoor cycling 428
orienteering 416
walk 340
skate 170
cross-country skiing 118
core stability training 109
fitness walking 63
rowing 53
hiking 46
kayaking 26
soccer 13
weight training 11
circuit training 8
treadmill running 7
roller skiing 5
downhill skiing 5
horseback riding 3
gymnastics 3
elliptical 3
tennis 2
snowboarding 2
swimming 2
basketball 1
snowshoeing 1
yoga 1
aerobics 1
stair climing 1

Visualized as a bar chart...

InΒ [Β ]:
df['sport'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Sport")
plt.ylabel("Count")
plt.show()

In this situation, multiple activity types simply do not have a representative sample. As such, I filter the data to keep only sports with more than 300 samples.

InΒ [Β ]:
sport_counts = df['sport'].value_counts()
valid_sports = sport_counts[sport_counts > 300].index
df = df[df['sport'].isin(valid_sports)]
df
Out[Β ]:
user_id gender sport workout_id duration_min avg_heart_rate datetime
0 10921915 male bike 396826535 126.483333 152.650 2014-08-24 16:45:46
1 10921915 male bike 392337038 74.000000 147.710 2014-08-16 20:41:22
2 10921915 male bike 389643739 112.483333 140.554 2014-08-12 15:47:39
3 10921915 male bike 386729739 75.316667 147.020 2014-08-07 17:20:42
4 10921915 male bike (transport) 383186560 22.616667 167.154 2014-08-01 16:10:34
... ... ... ... ... ... ... ...
49995 4399772 male mountain bike 183825234 130.083333 164.816 2013-05-01 15:00:45
49996 4399772 male bike 183288370 106.666667 111.954 2013-04-30 15:05:54
49997 4399772 male indoor cycling 182873807 91.200000 127.910 2013-04-29 17:39:55
49998 4399772 male bike 181847177 176.333333 120.300 2013-04-26 22:01:36
49999 4399772 male bike 181075198 192.200000 126.388 2013-04-24 22:23:48

49346 rows Γ— 7 columns

Visualizing this as a bar chart (representative samples maintained)...

InΒ [Β ]:
df['sport'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Sport")
plt.ylabel("Count")
plt.show()

2. Activity Duration
Understanding the spread of time spent per workout...

InΒ [Β ]:
df["duration_min"].describe()
Out[Β ]:
duration_min
count 4.934600e+04
mean 2.342148e+02
std 1.827681e+04
min 8.350000e+00
25% 4.853333e+01
50% 6.903333e+01
75% 1.138500e+02
max 2.479691e+06

There seems to be an outlier: an activity lasting 2,479,691 minutes, or approximately 5 years!
As such, I filter for activities with a duration of less than 3 days (4,320 minutes) and regenerate summary statistics.

InΒ [Β ]:
df = df[df["duration_min"] < 4320]
df["duration_min"].describe()
Out[Β ]:
duration_min
count 49342.000000
mean 87.959696
std 55.899893
min 8.350000
25% 48.533333
50% 69.033333
75% 113.845833
max 299.866667

Visualized as a histogram...

InΒ [Β ]:
df['duration_min'].dropna().hist(bins=40, figsize=(6,4))
plt.title("Distribution of Workout Duration (min)")
plt.show()

3. Avg. Heart Rate
Understanding the distribution of mean heart rate within a workout session...

InΒ [Β ]:
df["avg_heart_rate"].describe()
Out[Β ]:
avg_heart_rate
count 49254.000000
mean 140.693914
std 16.155742
min 40.788000
25% 130.612000
50% 141.530000
75% 151.686000
max 210.194000

Visualized as a histogram...

InΒ [Β ]:
df['avg_heart_rate'].dropna().hist(bins=40, figsize=(6,4))
plt.title("Distribution of Avg. Heart Rate (bpm)")
plt.show()

4. Date & Time
Understanding the frequency distribution of days of the week...

InΒ [Β ]:
df['date'] = df['datetime'].dt.date
df['hour'] = df['datetime'].dt.hour
df['day_of_week'] = df['datetime'].dt.day_name()
df['day_of_week'].value_counts()
Out[Β ]:
count
day_of_week
Sunday 8644
Wednesday 7301
Tuesday 7299
Thursday 7226
Saturday 7033
Friday 6595
Monday 5244

Visualized as a bar chart...

InΒ [Β ]:
df['day_of_week'].value_counts().plot(kind='bar', figsize=(6,4))
plt.title("Workout Count by Day of Week")
plt.ylabel("Count")
plt.show()

Plotting the total number of workouts across time, to understand when app popularity peaked...

InΒ [Β ]:
df.groupby('date').size().plot(figsize=(10,4))
plt.title("Workouts Over Time")
plt.ylabel("Number of Workouts")
plt.show()

Finally, understanding the time at which people typically tend to work out (24-hour clock)...

InΒ [Β ]:
df['hour'].hist(bins=24, figsize=(6,4))
plt.title("Workouts by Hour of Day")
plt.xlabel("Hour")
plt.ylabel("Frequency")
plt.show()
At this stage, I have a solid exploratory foundation of the data, and I can begin the model development process, i.e., fine-tuning a model to predict which sporting activity is most likely the next time a user exercises.¶

ΒΆ


Step 3: Model DevelopmentΒΆ

Model Context:ΒΆ

Problem Formulation
I formulated predicting the next workout type a user will perform, based on their historical workout data, as a supervised multiclass classification problem.

Inputs (X)

  1. prev_sport: Latest activity type
  2. prev_duration: Most recent activity duration
  3. prev_avg_heart_rate: Avg. heart rate during latest workout
  4. prev_hour: Hour at which latest activity began
  5. prev_dayofweek: Day of week of the most recent workout
  6. days_since_prev: Number of days since the most recent activity

Output (y)

  1. target_sport: Predicted activity type for next workout

To begin the model development process, I first generate an updated DataFrame object with the required inputs and outputs as discussed above.

InΒ [Β ]:
cols = ["user_id", "datetime", "workout_id", "gender", "sport",
        "duration_min", "avg_heart_rate", "date", "hour", "day_of_week"]
df = df.sort_values(["user_id", "datetime"]).reset_index(drop=True)[cols]
rows = []

# Build one (previous workout -> next workout) training example per consecutive pair.
for user, user_df in df.groupby("user_id"):
    user_df = user_df.sort_values("datetime").reset_index(drop=True)

    for i in range(1, len(user_df)):
        prev = user_df.loc[i-1]
        curr = user_df.loc[i]

        rows.append({

            "user_id": user,

            "prev_sport": prev["sport"],
            "prev_duration": prev["duration_min"],
            "prev_avg_heart_rate": prev["avg_heart_rate"],
            "prev_hour": prev["datetime"].hour,
            "prev_dayofweek": prev["datetime"].dayofweek,

            "days_since_prev": (curr["datetime"] - prev["datetime"]).total_seconds() / 86400,

            "target_sport": curr["sport"]
        })

df_ml = pd.DataFrame(rows)
df_ml = df_ml.dropna()
df_ml
Out[Β ]:
user_id prev_sport prev_duration prev_avg_heart_rate prev_hour prev_dayofweek days_since_prev target_sport
0 5844 mountain bike 132.900000 103.558 22 4 103.849780 bike
1 5844 bike 42.750000 115.716 18 3 34.096134 bike
2 5844 bike 62.916667 102.902 20 2 1.981898 bike
3 5844 bike 70.783333 104.298 20 4 7.812234 bike
4 5844 bike 107.266667 96.570 15 5 31.363368 bike
... ... ... ... ... ... ... ... ...
49015 15279967 run 110.033333 157.782 19 4 11.298333 run
49016 15279967 run 172.000000 153.150 2 2 6.892731 run
49017 15279967 run 166.066667 151.474 23 1 4.783669 run
49018 15279967 run 84.566667 135.420 18 6 28.941308 run
49019 15279967 run 49.466667 156.118 17 0 38.009201 run

48933 rows Γ— 8 columns

Model ApproachΒΆ

Optimization Objective
My goal is to maximize prediction accuracy of the next workout activity (target_sport), while preventing look-ahead bias and handling class imbalances within the data.

Modeling Choices
Since this is a classification problem, I will first look at the 3 baselines, analyzing their complexity, efficiency, and challenges in implementation:

  1. Frequency Baseline: User's most common activity type predicted.
  2. Recency Baseline: User's most recent activity type predicted.
  3. Weighted Frequency-Recency Baseline: User's most common activity in last N workouts predicted.

1. Frequency Baseline
Predicts a user's most common activity for every instance.

  • Complexity: Θ(Users x TypesOfActivities) setup and Θ(1) per prediction.
  • Efficiency: No iterative training so quick to compute.
  • Advantages: Very simple heuristic, captures user's long-term preference.
  • Disadvantages: Ignores recency bias and performs poorly for rare activities.
InΒ [Β ]:
most_freq = df_ml.groupby('user_id')['prev_sport'].agg(lambda x: x.mode()[0])
df_ml['freq_pred'] = df_ml['user_id'].map(most_freq)

freq_acc = accuracy_score(df_ml['target_sport'], df_ml['freq_pred'])
freq_f1_macro = f1_score(df_ml['target_sport'], df_ml['freq_pred'], average='macro')
freq_f1_weighted = f1_score(df_ml['target_sport'], df_ml['freq_pred'], average='weighted')

print("Frequency Baseline:")
print("Accuracy:", round(freq_acc,4))
print("F1 Macro:", round(freq_f1_macro,4))
print("F1 Weighted:", round(freq_f1_weighted,4))
Frequency Baseline:
Accuracy: 0.7923
F1 Macro: 0.5324
F1 Weighted: 0.7819

Based on these results, the Frequency Baseline captures the user's dominant activity well (moderately high accuracy) but performs relatively poorly on less common activities, as reflected in the lower F1 Macro score.

2. Recency Baseline
Predicts a user's most recent activity for every instance.

  • Complexity: Θ(Users) setup and Θ(1) per prediction.
  • Efficiency: No iterative training and minimal memory so quick to compute.
  • Advantages: Very simple heuristic, captures user's recency bias.
  • Disadvantages: Ignores long-term preferences and sensitive to rare activities.
InΒ [Β ]:
df_ml['recency_pred'] = df_ml.groupby('user_id')['prev_sport'].shift(1)

recency_df = df_ml.dropna(subset=['recency_pred'])

recency_acc = accuracy_score(recency_df['target_sport'], recency_df['recency_pred'])
recency_f1_macro = f1_score(recency_df['target_sport'], recency_df['recency_pred'], average='macro')
recency_f1_weighted = f1_score(recency_df['target_sport'], recency_df['recency_pred'], average='weighted')

print("Recency Baseline:")
print("Accuracy:", round(recency_acc,4))
print("F1 Macro:", round(recency_f1_macro,4))
print("F1 Weighted:", round(recency_f1_weighted,4))
Recency Baseline:
Accuracy: 0.7967
F1 Macro: 0.6631
F1 Weighted: 0.7967

Based on these results, the Recency Baseline has a slightly higher accuracy and F1 Macro, showing that recent activities have more predictive power.

3. Weighted Frequency-Recency Baseline
Predicts a user's most frequent activity in the last N workouts.

  • Complexity: Θ(Users x N) setup and Θ(N) per prediction.
  • Efficiency: Slightly higher computational cost due to maintenance of a sliding window.
  • Advantages: Good balance of long-term preferences and short-term trends.
  • Disadvantages: Choice of N is crucial, as small N overfits to recent activities while large N ignores short-term trends.
InΒ [Β ]:
N = 5
def weighted_freq_recent_sports(sports):
    # Predict the modal sport among the user's previous N workouts (None if no history yet).
    preds = []
    past_activities = []
    for sport in sports:
        if past_activities:
            last_n = past_activities[-N:]
            mode_sport = pd.Series(last_n).mode()[0]
            preds.append(mode_sport)
        else:
            preds.append(None)
        past_activities.append(sport)
    return preds

df_ml['weighted_pred'] = df_ml.groupby('user_id')['prev_sport'].transform(weighted_freq_recent_sports)

weighted_df = df_ml.dropna(subset=['weighted_pred'])

weighted_acc = accuracy_score(weighted_df['target_sport'], weighted_df['weighted_pred'])
weighted_f1_macro = f1_score(weighted_df['target_sport'], weighted_df['weighted_pred'], average='macro')
weighted_f1_weighted = f1_score(weighted_df['target_sport'], weighted_df['weighted_pred'], average='weighted')

print("Weighted Frequency-Recency Baseline:")
print("Accuracy:", round(weighted_acc,4))
print("F1 Macro:", round(weighted_f1_macro,4))
print("F1 Weighted:", round(weighted_f1_weighted,4))
Weighted Frequency-Recency Baseline:
Accuracy: 0.8127
F1 Macro: 0.6874
F1 Weighted: 0.8112

These results for the Weighted Frequency-Recency Baseline seem quite promising! It balances long-term preference and recent trends, achieving the highest accuracy (approx. 81%), F1 Macro (approx. 69%), and F1 Weighted (approx. 81%) scores. As such, this weighted frequency-recency approach is a good starting point for the machine learning models.
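Since the choice of N is crucial (as noted above), a quick sweep over candidate window sizes is a cheap sanity check. Here is a hedged sketch, run as a side experiment: it reimplements the windowed mode with an explicit n parameter and does not modify the pipeline, which keeps N = 5.

In [ ]:
def mode_of_last_n(sports, n):
    # Modal sport among the previous n workouts; None when there is no history.
    preds, past = [], []
    for sport in sports:
        preds.append(pd.Series(past[-n:]).mode()[0] if past else None)
        past.append(sport)
    return preds

for n in [3, 5, 10, 20]:
    preds = df_ml.groupby('user_id')['prev_sport'].transform(lambda s: mode_of_last_n(s, n))
    mask = preds.notna()
    print(f"N={n}: accuracy={accuracy_score(df_ml.loc[mask, 'target_sport'], preds[mask]):.4f}")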

Feature EngineeringΒΆ

Before training any model, I first conduct generic feature engineering to make the data usable by each algorithm. The first step is to encode the categorical variables (target_sport, prev_sport) using label encoding.

InΒ [Β ]:
le_sport = LabelEncoder()
df_ml['target_sport_enc'] = le_sport.fit_transform(df_ml['target_sport'])
df_ml['prev_sport_enc'] = le_sport.transform(df_ml['prev_sport'])

Next, I can leverage the well-performing Weighted Frequency-Recency Baseline as a numeric feature, dropping null rows.

InΒ [Β ]:
df_ml = df_ml.dropna(subset=['weighted_pred'])
df_ml.loc[:, 'weighted_pred_enc'] = le_sport.transform(df_ml['weighted_pred'])

Then, I cast the numeric columns to floats.

InΒ [Β ]:
numeric_cols = ['prev_duration', 'prev_avg_heart_rate', 'prev_hour', 'prev_dayofweek', 'days_since_prev']
for col in numeric_cols:
    df_ml[col] = df_ml[col].astype(float)

Subsequently, I add two subtle interaction terms: a flag for whether the most recent sport matches the baseline prediction, and an intensity proxy for the previous activity (duration Γ— avg. heart rate).

InΒ [Β ]:
df_ml['prev_eq_weighted'] = (df_ml['prev_sport_enc'] == df_ml['weighted_pred_enc']).astype(int)
df_ml['prev_intensity'] = df_ml['prev_duration'] * df_ml['prev_avg_heart_rate']

Finally, I can drop null values and prepare the final feature DataFrame object.

InΒ [Β ]:
df_ml = df_ml.dropna(subset=numeric_cols)
feature_cols = ['prev_sport_enc','weighted_pred_enc','prev_duration','prev_avg_heart_rate',
    'prev_hour','prev_dayofweek','days_since_prev','prev_eq_weighted','prev_intensity','target_sport_enc']

df_features = df_ml[feature_cols].copy()
df_features
Out[Β ]:
prev_sport_enc weighted_pred_enc prev_duration prev_avg_heart_rate prev_hour prev_dayofweek days_since_prev prev_eq_weighted prev_intensity target_sport_enc
1 0 3 42.750000 115.716 18.0 3.0 34.096134 0 4946.859000 0
2 0 0 62.916667 102.902 20.0 2.0 1.981898 1 6474.250833 0
3 0 0 70.783333 104.298 20.0 4.0 7.812234 1 7382.560100 0
4 0 0 107.266667 96.570 15.0 5.0 31.363368 1 10358.742000 0
5 0 0 67.633333 101.834 0.0 2.0 2.855729 1 6887.372867 0
... ... ... ... ... ... ... ... ... ... ...
49015 5 5 110.033333 157.782 19.0 4.0 11.298333 1 17361.279400 5
49016 5 5 172.000000 153.150 2.0 2.0 6.892731 1 26341.800000 5
49017 5 5 166.066667 151.474 23.0 1.0 4.783669 1 25154.782267 5
49018 5 5 84.566667 135.420 18.0 6.0 28.941308 1 11452.018000 5
49019 5 5 49.466667 156.118 17.0 0.0 38.009201 1 7722.637067 5

48616 rows Γ— 10 columns

Now that I have incorporated the relevant baseline and completed the required feature engineering, I can isolate the target variable and split the data into training, validation, and testing sets.

  • For this project, I have chosen 70% of our loaded data to be training data, 15% to serve as a validation set, and the final 15% as the testing data.
  • I also used a chronological split to prevent look-ahead bias and respect the time-series nature of the dataset.
  • I chose to standardize numeric features so that no single variable dominates and obscures salient signals in the other inputs. Only training data was used to fit the scaler, ensuring fair evaluation on the validation/test sets.
InΒ [Β ]:
feature_cols = ['prev_sport_enc', 'weighted_pred_enc', 'prev_duration', 'prev_avg_heart_rate',
                'prev_hour', 'prev_dayofweek', 'days_since_prev', 'prev_eq_weighted', 'prev_intensity']

X = df_features[feature_cols].copy()
y = df_features['target_sport_enc'].copy()

n = len(df_features)
train_end = int(0.7 * n)
val_end = int(0.85 * n)

X_train, y_train = X.iloc[:train_end], y.iloc[:train_end]
X_val, y_val = X.iloc[train_end:val_end], y.iloc[train_end:val_end]
X_test, y_test = X.iloc[val_end:], y.iloc[val_end:]

numeric_cols = ['prev_duration', 'prev_avg_heart_rate', 'prev_hour', 'prev_dayofweek', 'days_since_prev', 'prev_intensity']
scaler = StandardScaler()
X_train.loc[:, numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
X_val.loc[:, numeric_cols]  = scaler.transform(X_val[numeric_cols])
X_test.loc[:, numeric_cols] = scaler.transform(X_test[numeric_cols])
print(f"Train size: {X_train.shape[0]}\nValidation size: {X_val.shape[0]}\nTest size: {X_test.shape[0]}")
Train size: 34031
Validation size: 7292
Test size: 7293

Now that I have split the data into training/validation/testing sets, I can attempt to solve the classification problem using the machine learning algorithms discussed above, namely:

  1. Logistic Regression
  2. NaΓ―ve Bayes Classifier
  3. Support Vector Machine (SVM)
  4. Random Forest

1. Logistic Regression
For the Logistic Regression model, I use the multinomial setting, as the target variable includes multiple sport labels. This uses softmax regression, allowing the model to learn all classes jointly instead of one-by-one. I made use of the Limited-memory BFGS (lbfgs) solver, as it converges efficiently on datasets with many samples.

  • Complexity: Θ(Samples x Features) per iteration.
  • Efficiency: Very fast to train and works well with high-dimensional sparse data.
  • Advantages: Simple, with interpretable coefficients.
  • Disadvantages: Struggles to capture non-linear relationships, particularly with class imbalances.
InΒ [Β ]:
lr = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    max_iter=1000,
    n_jobs=-1
)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)

print("Logistic Regression Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_lr), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_lr, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_lr, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
    y_val,
    y_pred_lr,
    labels=val_classes,
    target_names=target_names,
    zero_division=0
))
Logistic Regression Performance (VAL)
Accuracy: 0.7592
F1 Macro: 0.2509
F1 Weighted: 0.7044

Detailed Classification Report:
                  precision    recall  f1-score   support

            bike       0.78      0.86      0.82      2659
bike (transport)       0.00      0.00      0.00        49
  indoor cycling       0.00      0.00      0.00       239
   mountain bike       1.00      0.00      0.00       521
    orienteering       0.00      0.00      0.00       170
             run       0.76      0.91      0.82      3589
            walk       0.67      0.06      0.11        65

        accuracy                           0.76      7292
       macro avg       0.46      0.26      0.25      7292
    weighted avg       0.73      0.76      0.70      7292

The Logistic Regression model saw an accuracy of about 76%, which is reasonable but still below the baselines. In particular, its low F1 Macro score of 0.2509 indicates that the model performs well only on certain classes, heavily skewed by the existing class imbalance.
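One common mitigation for this skew, sketched below but not pursued further here, is to reweight classes inversely to their frequency via class_weight='balanced'; this typically trades some overall accuracy for minority-class recall.

In [ ]:
# Hypothetical variant: penalize mistakes on rare sports more heavily.
lr_balanced = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    max_iter=1000,
    class_weight='balanced',
    n_jobs=-1,
)
lr_balanced.fit(X_train, y_train)
print("F1 Macro (balanced):", round(f1_score(y_val, lr_balanced.predict(X_val), average='macro'), 4))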

2. NaΓ―ve Bayes Classifier
I then used the baseline NaΓ―ve Bayes Classifier, again as a comparison point for the classification problem.

  • Complexity: Θ(Samples x Features) setup and Θ(Features) per prediction.
  • Efficiency: Very fast to train and uses far less memory than the other classifiers.
  • Advantages: Very memory-efficient and works well with categorical data.
  • Disadvantages: Assumes conditionally independent, Gaussian-distributed features, which the label-encoded categoricals violate, and offers little scope for hyperparameter tuning.
InΒ [Β ]:
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_val)

val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)

print("NaΓ―ve Bayes Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_nb), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_nb, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_nb, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
    y_val,
    y_pred_nb,
    labels=val_classes,
    target_names=target_names,
    zero_division=0
))
NaΓ―ve Bayes Performance (VAL)
Accuracy: 0.6529
F1 Macro: 0.2414
F1 Weighted: 0.6458

Detailed Classification Report:
                  precision    recall  f1-score   support

            bike       0.73      0.61      0.66      2659
bike (transport)       0.02      0.41      0.04        49
  indoor cycling       0.00      0.00      0.00       239
   mountain bike       0.34      0.03      0.05       521
    orienteering       1.00      0.01      0.01       170
             run       0.76      0.87      0.81      3589
            walk       0.23      0.08      0.11        65

        accuracy                           0.65      7292
       macro avg       0.44      0.28      0.24      7292
    weighted avg       0.69      0.65      0.65      7292

The results of the NaΓ―ve Bayes Classifier are rather underwhelming, only hitting an accuracy of 65%, with significantly worse F1 Macro and F1 Weighted scores too.

3. Support Vector Machine (SVM)
For the Support Vector Machine Classifier, I used a radial basis function (RBF) kernel, which is better equipped to capture non-linear relationships. I also used a one-vs-one (OvO) strategy for multiclass classification, wherein a binary classifier is trained for every pair of classes.

  • Complexity: Ranges from Θ(n^2) to Θ(n^3) in the number of training samples.
  • Efficiency: Heavy compute required, so slower than the other models considered.
  • Advantages: Can capture non-linear relationships with the RBF kernel.
  • Disadvantages: Very memory-intensive and less interpretable than other classifiers.
InΒ [Β ]:
svm = SVC(
    kernel='rbf',
    decision_function_shape='ovo',
    probability=False,
)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_val)

val_classes = np.unique(y_val)
target_names = le_sport.inverse_transform(val_classes)

print("Support Vector Machine (SVM) Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred_svm), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred_svm, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred_svm, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
    y_val,
    y_pred_svm,
    labels=val_classes,
    target_names=target_names,
    zero_division=0
))
Support Vector Machine (SVM) Performance (VAL)
Accuracy: 0.7999
F1 Macro: 0.4466
F1 Weighted: 0.7802

Detailed Classification Report:
                  precision    recall  f1-score   support

            bike       0.81      0.86      0.84      2659
bike (transport)       0.12      0.45      0.19        49
  indoor cycling       0.00      0.00      0.00       239
   mountain bike       0.79      0.61      0.69       521
    orienteering       0.00      0.00      0.00       170
             run       0.82      0.89      0.85      3589
            walk       0.72      0.45      0.55        65

        accuracy                           0.80      7292
       macro avg       0.47      0.46      0.45      7292
    weighted avg       0.77      0.80      0.78      7292

My implementation of the Support Vector Machine (SVM) Classifier saw a reasonable accuracy of around 80%, with substantial improvement in the F1 Weighted score. This also extends to the F1 Macro score, from which I can infer that the model is better at distinguishing minority classes.



At this point, I have a good understanding of classification approaches without finetuning and can use these results and takeaways to optimize my final model, Random Forest.

4. Random Forest
Random Forest builds a large collection of decision trees on different bootstrap samples of the data, with each tree grown independently and the final prediction obtained by aggregating the outputs of all trees. This reduces variance, improves generalization, and makes the model robust to overfitting, all necessary given our heavily imbalanced data.

For this model, I trained 1,000 decision trees with a max. depth of 20, enforcing minimum split and leaf constraints to prevent overly complex trees. I also set max_features='sqrt', so that each tree considers only a random subset of features, which helps decorrelate the trees. Finally, I used balanced_subsample weighting to handle class imbalance and trained with bootstrap aggregation, with out-of-bag evaluation enabled.

  • Complexity: Roughly Θ(Trees x Samples x log(Samples) x Features) for training, which dominates prediction cost.
  • Efficiency: Significant compute required, but parallelizable, so moderately efficient.
  • Advantages: Robust to overfitting, naturally handles class imbalances, and good for capturing complex relationships.
  • Disadvantages: Requires significant hyperparameter tuning to optimize results.
InΒ [Β ]:
rf_model = RandomForestClassifier(
    n_estimators=1000,
    max_depth=20,
    min_samples_split=60,
    min_samples_leaf=2,
    max_features='sqrt',
    class_weight='balanced_subsample',
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_val)

print("Random Forest Performance (VAL)")
print("Accuracy:", round(accuracy_score(y_val, y_pred), 4))
print("F1 Macro:", round(f1_score(y_val, y_pred, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_val, y_pred, average='weighted'), 4))
print("\nDetailed Classification Report:")
print(classification_report(
    y_val,
    y_pred,
    labels=val_classes,
    target_names=target_names,
    zero_division=0
))
Random Forest Performance (VAL)
Accuracy: 0.8206
F1 Macro: 0.6972
F1 Weighted: 0.824

Detailed Classification Report:
                  precision    recall  f1-score   support

            bike       0.85      0.82      0.83      2659
bike (transport)       0.43      0.47      0.45        49
  indoor cycling       0.53      0.75      0.62       239
   mountain bike       0.66      0.80      0.73       521
    orienteering       0.61      0.81      0.69       170
             run       0.88      0.84      0.86      3589
            walk       0.61      0.80      0.69        65

        accuracy                           0.82      7292
       macro avg       0.65      0.75      0.70      7292
    weighted avg       0.83      0.82      0.82      7292

The results of Random Forest are substantially better than the other 3 modeling approaches, with accuracy reaching 82%, above the heuristic baseline of 81%. While this gain may seem modest, the more important takeaway is the dramatic increase in the F1 Macro score (70%), which means I finally have a non-heuristic multiclass classification model that adequately adapts to imbalanced classes.
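As a quick supplementary check (not part of the formal evaluation), the fitted forest exposes impurity-based feature importances, which give a rough sense of which inputs drive these predictions:

In [ ]:
# Impurity-based importances from the fitted forest; treat the ordering as a rough guide only.
importances = pd.Series(rf_model.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).round(4))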


Step 4: Model EvaluationΒΆ

Evaluating Model on Testing DataΒΆ

I can now apply this model to the testing data to see if it holds up, and then conduct rigorous model evaluation to identify strengths and weaknesses.

InΒ [Β ]:
y_test_pred = rf_model.predict(X_test)

print("\nRandom Forest Performance (TEST)")
print("Accuracy:", round(accuracy_score(y_test, y_test_pred), 4))
print("F1 Macro:", round(f1_score(y_test, y_test_pred, average='macro'), 4))
print("F1 Weighted:", round(f1_score(y_test, y_test_pred, average='weighted'), 4))

print("\nDetailed Classification Report (TEST):")
print(classification_report(
    y_test,
    y_test_pred,
    labels=val_classes,
    target_names=target_names,
    zero_division=0
))
Random Forest Performance (TEST)
Accuracy: 0.8719
F1 Macro: 0.5933
F1 Weighted: 0.8757

Detailed Classification Report (TEST):
                  precision    recall  f1-score   support

            bike       0.93      0.92      0.92      4336
bike (transport)       0.52      0.70      0.60        70
  indoor cycling       0.23      0.75      0.35         8
   mountain bike       0.53      0.75      0.62       229
    orienteering       0.22      0.32      0.26        22
             run       0.87      0.82      0.85      2527
            walk       0.49      0.64      0.55       101

        accuracy                           0.87      7293
       macro avg       0.54      0.70      0.59      7293
    weighted avg       0.88      0.87      0.88      7293

Fortunately, the model does very well on the testing data too, with accuracy and F1 Weighted score exceeding 87%. The dip in F1 Macro score can be attributed to disproportionately small test samples for indoor cycling and orienteering, which leave the model with little to go on.

Evaluation MetricsΒΆ

For this predictive task, I chose to evaluate model performance using 3 metrics, namely:

  1. Accuracy, which provides an overall view of correct predictions across all classes.
  2. F1 Macro Score, which captures performance across all classes equally, making it important to analyze an imbalanced dataset as such.
  3. F1 Weighted Score, which balances the contribution of each class according to the number of occurrences, accounting for class biases while representing overall predictive power.

Arguably, Accuracy and F1 Macro are the 2 most appropriate metrics, as the former provides a standard for absolute model performance, while the latter exposes how well the model handles the class imbalance in the data. I can better compare and contrast the baselines and machine learning models with a summative DataFrame object.

InΒ [Β ]:
models = ["Frequency Baseline", "Recency Baseline", "Weighted Frequency-Recency Baseline",
    "Logistic Regression (VAL)", "NaΓ―ve Bayes (VAL)", "SVM (VAL)", "Random Forest (VAL)", "Random Forest (TEST)"]

accuracy = [0.7923, 0.7967, 0.8127, 0.7592, 0.6529, 0.7999, 0.8206, 0.8719]
f1_macro = [0.5324, 0.6631, 0.6874, 0.2509, 0.2414, 0.4466, 0.6972, 0.5933]
f1_weighted = [0.7819, 0.7967, 0.8112, 0.7044, 0.6458, 0.7802, 0.8240, 0.8757]

df_summary = pd.DataFrame({"Model": models, "Accuracy": accuracy, "F1 Macro": f1_macro, "F1 Weighted": f1_weighted})
df_summary[["Accuracy", "F1 Macro", "F1 Weighted"]] = df_summary[["Accuracy", "F1 Macro", "F1 Weighted"]].round(4)
df_summary
Out[Β ]:
Model Accuracy F1 Macro F1 Weighted
0 Frequency Baseline 0.7923 0.5324 0.7819
1 Recency Baseline 0.7967 0.6631 0.7967
2 Weighted Frequency-Recency Baseline 0.8127 0.6874 0.8112
3 Logistic Regression (VAL) 0.7592 0.2509 0.7044
4 NaΓ―ve Bayes (VAL) 0.6529 0.2414 0.6458
5 SVM (VAL) 0.7999 0.4466 0.7802
6 Random Forest (VAL) 0.8206 0.6972 0.8240
7 Random Forest (TEST) 0.8719 0.5933 0.8757

As such, the Random Forest model exceeds the baselines in Accuracy and F1 Weighted score on both the validation and testing sets. While it also exceeds the baselines in F1 Macro score on the validation set, the dip on the testing set can be attributed to classes with very few, non-representative test samples.

Although I also generated detailed classification reports where appropriate, they are not too relevant for model evaluation purposes. That said, the best way to visually understand my model would be through the use of a confusion matrix.

InΒ [Β ]:
y_test_orig = le_sport.inverse_transform(y_test)
y_test_pred_orig = le_sport.inverse_transform(y_test_pred)
val_classes_orig = le_sport.inverse_transform(val_classes)

cm = confusion_matrix(y_test_orig, y_test_pred_orig, labels=val_classes_orig)
cm_pct = cm / cm.sum(axis=1, keepdims=True)

plt.figure(figsize=(8,8))
plt.imshow(cm_pct, cmap="Blues")
plt.title("Confusion Matrix (Percent) β€” Test Set")
plt.xlabel("Predicted")
plt.ylabel("True")

plt.xticks(np.arange(len(val_classes_orig)), val_classes_orig, rotation=90)
plt.yticks(np.arange(len(val_classes_orig)), val_classes_orig)

for i in range(cm_pct.shape[0]):
    for j in range(cm_pct.shape[1]):
        value = cm_pct[i, j] * 100
        text = f"{value:.1f}%" if value > 0 else ""
        plt.text(j, i, text, ha="center", va="center", fontsize=9)

plt.colorbar(label="Percent")
plt.tight_layout()
plt.show()
No description has been provided for this image

This puts into perspective how the model gets tripped up by orienteering in the test set (only 22 samples) and poses the question of whether orienteering and running can be better segmented in downstream analysis.

From a practical lens, this slip-up actually reflects a salient pattern the model is detecting. Orienteering is a group of sports in which participants use a map and compass to navigate between points in unfamiliar terrain as quickly as possible. That involves a lot of running, which explains why the model struggles with this particular classification. Although the disparity is less pronounced, similar logic applies to the confusion between running and walking, as pace varies greatly from person to person.
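To quantify these confusions, the off-diagonal cells of the test confusion matrix can be ranked directly; a minimal sketch using the cm computed above:

In [ ]:
# Rank the largest off-diagonal (misclassified) cells of the test confusion matrix.
pairs = [(val_classes_orig[i], val_classes_orig[j], cm[i, j])
         for i in range(len(val_classes_orig))
         for j in range(len(val_classes_orig))
         if i != j and cm[i, j] > 0]
for true_label, pred_label, count in sorted(pairs, key=lambda p: -p[2])[:5]:
    print(f"{true_label} -> predicted {pred_label}: {count}")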


Step 5: Discussion of Related WorkΒΆ

How has this dataset (or similar datasets) been used before?ΒΆ

The Endomondo Fitness Tracking Data were collected by researchers at the University of California, San Diego for academic use. The paper published alongside the dataset focuses on modeling heart rate and activity data to deliver personalized fitness recommendations.

Specifically, this introductory paper focuses on FitRec, an LSTM-based model that estimates a user's heart rate profile and accordingly recommends suitable activities. The key takeaways from this study suggest that the LSTM model is able to learn personalized, contextual, and activity-specific user dynamics simply based on heart rate profiling.

While FitRec was implemented with a specific purpose in mind, the strong results from that study validate the use of fitness data such as Endomondo's for activity recommendations.



How has prior work approached the same (or similar) tasks?ΒΆ

Specific to the Endomondo dataset, previous work has focused on predicting heart rate trajectory to determine the expected response to an activity, in tandem with other factors such as weather, duration, elevation, etc. These predictions have then been used to conduct downstream analyses which typically have to do with predictive modeling or recommender systems.

That said, there have been instances of historical fitness data being used to predict a user's subsequent activity, the same task explored in this project. These typically make use of deep learning architectures, such as Recurrent Neural Networks (RNNs), or Temporal Convolutional Networks (TCNs), which makes sense given the complex underlying patterns.

At the simpler end, N-gram models and Markov chains have also been used, although with less success due to their inability to capture long-term temporal trends. There has also been recent work using BERT-adjacent Transformer models, a relatively unexplored approach to this classification problem.
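For intuition, a first-order Markov chain for this task reduces to a table of transition counts; a minimal, hypothetical sketch on this project's df_ml (ignoring train/test splits, purely for illustration):

In [ ]:
from collections import Counter, defaultdict

# Count sport -> next-sport transitions, then predict the most common successor.
transitions = defaultdict(Counter)
for prev_sport, target in zip(df_ml['prev_sport'], df_ml['target_sport']):
    transitions[prev_sport][target] += 1

def markov_predict(prev_sport):
    counts = transitions[prev_sport]
    return counts.most_common(1)[0][0] if counts else None  # None for unseen states

print(markov_predict('run'))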



How do your results match or differ from what has been reported in related work?ΒΆ

The biggest difference in my results likely arises from the specific Endomondo subset I used. Given the significant class imbalance and RAM constraints encountered, the evaluation metrics were limited to Accuracy and F1 scores. In comparison, most research uses Top-k accuracy or Recall@K to recommend several candidate next activities. Implementing these metrics in this project would likely prove futile due to the limited dataset size.

The models used in this investigation are exploratory and not meant to build on existing work, which implements stronger theoretical frameworks. That said, even if the entire dataset were used, or a different fitness dataset considered, one finding this project shares with prior work is the problematic structure of fitness data. Oftentimes, class imbalances arise from people trying a new activity only a few times before quitting. This consensus, while agreeable, is unfortunately a reflection of most fitness data, wherein unique user-activity dynamics must be compromised to obtain a more holistic sample.

As such, despite differences in evaluation metrics, the common takeaway is to prefer a user-balanced, activity-biased dataset over an activity-balanced, user-biased one. In simpler terms, place more emphasis on handling class imbalances than on standardizing or normalizing the specific data collected.


Dataset Source:ΒΆ

Modeling heart rate and activity data for personalized fitness recommendation
Jianmo Ni, Larry Muhlstein, Julian McAuley
WWW, 2019

AuthorΒΆ

Dylan DsouzaΒΆ
GitHub Β· Website Β· LinkedInΒΆ