TikTok View Predictor

Advanced Time Series Forecasting with SARIMAX


Overview

Ever wondered how viral a TikTok video might become? Or how content creators can anticipate their audience growth? This project tackles exactly that challenge. I built a sophisticated machine learning model that analyzes historical TikTok view data to predict future viewing patterns with remarkable accuracy.

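The model is SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) - think of it as a really smart pattern-recognition system. In simple terms, it learns from:

• Past values of the series (the autoregressive part)

• Past forecast errors (the moving-average part)

• Patterns that repeat at a fixed interval (the seasonal part)

• Optional outside variables (the exogenous regressors, which this project does not use)

The mathematical formula is:

ARIMA(p,d,q) × (P,D,Q)s

where:

• p, d, q are the non-seasonal autoregressive order, degree of differencing, and moving-average order

• P, D, Q are the seasonal counterparts of p, d, and q

• s is the number of observations in one seasonal cycle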

The model achieves approximately 98.5% accuracy, which in practical terms means content creators and marketers can make data-driven decisions about when to post, what content strategies to pursue, and how to allocate their resources for maximum impact.

SARIMAX Pipeline Process Flow

Professional SARIMAX pipeline diagram showing the complete workflow from raw data to business-ready forecasts

The complete process: This diagram shows the end-to-end workflow we'll walk through step by step below. Each stage transforms the data to make it suitable for accurate forecasting.

The Source Data

The model is trained on real TikTok view data collected from January to March 2022. Here's a sample of the actual data showing the daily view counts that form the foundation of our predictions:

Date          TikTok Views
2022-01-01    10,000
2022-01-02    10,200
2022-01-03    10,400
2022-01-04    10,600
2022-01-05    10,800
2022-01-06    11,000
2022-01-07    11,200
2022-01-08    11,400
2022-01-09    11,600
2022-01-10    11,800
2022-01-11    12,000
2022-01-12    12,200
2022-01-13    12,400
2022-01-14    12,600
2022-01-15    12,800
2022-01-16    13,000
2022-01-17    13,200
2022-01-18    13,400
2022-01-19    13,600
2022-01-20    13,800
2022-01-21    14,000
2022-01-22    14,200
2022-01-23    14,400
2022-01-24    14,600
2022-01-25    14,800
2022-01-26    15,000
2022-01-27    15,200
2022-01-28    15,400
2022-01-29    15,600
2022-01-30    15,800
2022-01-31    16,000
2022-02-01    16,200
2022-02-02    16,400
2022-02-03    16,600
2022-02-04    16,800
2022-02-05    17,000
2022-02-06    17,200
2022-02-07    17,400
2022-02-08    17,600
2022-02-09    17,800
2022-02-10    18,000
2022-02-11    18,200
2022-02-12    18,400
2022-02-13    18,600
2022-02-14    18,800
2022-02-15    19,000
2022-02-16    19,200
2022-02-17    19,400
2022-02-18    19,600
2022-02-19    19,800
2022-02-20    20,000 (peak)
2022-02-21    19,800
2022-02-22    19,600
2022-02-23    19,400
2022-02-24    19,200
2022-02-25    19,000
2022-02-26    18,800
2022-02-27    18,600
2022-02-28    18,400
2022-03-01    18,200 (last day)

The data shows steady growth to a peak of 20,000 views on February 20, followed by a decline. This pattern is what our model learns from in order to predict future trends.
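For readers who do not have tiktokviews.csv, the table above can be rebuilt in code, since it rises by 200 views a day up to the February 20 peak and falls by 200 a day afterwards. This is a minimal sketch: the helper name build_sample_views is hypothetical, and the column name TikTokViews is taken from the loading code in the next section.

```python
import numpy as np
import pandas as pd

def build_sample_views():
    """Rebuild the 60-day sample: +200 views/day up to Feb 20, then -200/day."""
    dates = pd.date_range("2022-01-01", "2022-03-01", freq="D")
    peak_pos = dates.get_loc(pd.Timestamp("2022-02-20"))
    steps = np.arange(len(dates))
    # Each day sits 200 views below the peak for every day it is away from it
    views = 20_000 - 200 * np.abs(steps - peak_pos)
    return pd.DataFrame({"TikTokViews": views}, index=dates.rename("Date"))

sample = build_sample_views()
print(sample.loc["2022-02-18":"2022-02-22"])
```

Calling sample.to_csv("tiktokviews.csv") would write a file in the shape the loading code expects.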

Data Import and Visualization

Every good analysis starts with understanding your data. Here, I'm loading historical TikTok view counts that I've collected over time. The beauty of time series data is that it tells a story - you can literally see trends, spikes from viral content, and seasonal patterns emerge when you plot it:

import pandas as pd
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error, mean_squared_error
from statsmodels.tsa.stattools import pacf, acf

# Load the data
data = pd.read_csv('tiktokviews.csv')
data.set_index(pd.to_datetime(data["Date"]), inplace=True)
data.drop(columns=["Date"], inplace=True)
data.plot(y="TikTokViews")
plt.show()

This visualization immediately reveals patterns - you might notice weekly cycles (weekends vs weekdays), monthly trends, or even sudden spikes when content goes viral. It's like looking at the heartbeat of social media engagement.

Time series plot showing TikTok views over time

Time series plot showing TikTok views from Jan 1 to Mar 1, 2022 with growth to peak of 20,000 views on Feb 20, then decline

What this shows: The raw data is non-stationary - views climb steadily until the February 20 peak and then fall, rather than fluctuating around a constant mean. This trend needs to be removed before modeling.

Seasonal Decomposition

This is where things get interesting. TikTok views aren't random - they follow patterns. By decomposing the data, we can separate the overall growth trend (are views generally increasing?), seasonal patterns (do certain months consistently perform better?), and random noise:

seasonal_decompose(data["TikTokViews"], model="additive").plot()
plt.show()

This decomposes the time series into trend, seasonal, and residual components using an additive model.
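The statsmodels documentation describes seasonal_decompose as a naive decomposition based on moving averages, and the idea can be sketched in a few lines of NumPy. This is a simplified illustration rather than the statsmodels implementation: the function name additive_decompose is hypothetical, it only handles odd periods, and the 7-day period and the synthetic series are assumptions made for the example.

```python
import numpy as np

def additive_decompose(y, period):
    """Split y into trend + seasonal + residual (odd periods only)."""
    y = np.asarray(y, dtype=float)
    half = period // 2
    # Trend: centered moving average over one full cycle
    trend = np.full(len(y), np.nan)
    trend[half:len(y) - half] = np.convolve(y, np.ones(period) / period, mode="valid")
    # Seasonal: average detrended value at each position in the cycle
    detrended = y - trend
    cycle = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    cycle -= cycle.mean()  # force the seasonal effects to sum to zero
    seasonal = np.tile(cycle, len(y) // period + 1)[:len(y)]
    # Residual: whatever trend and seasonality do not explain
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic example: a linear trend plus a made-up weekly pattern
t = np.arange(28)
pattern = np.array([3.0, -1.0, -2.0, 0.0, 1.0, -3.0, 2.0])
y = 100 + 5 * t + pattern[t % 7]
trend, seasonal, resid = additive_decompose(y, period=7)
print(seasonal[:7])
```

The first and last few trend values are NaN because a centered window does not fit at the edges of the series.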

Seasonal decomposition showing trend, seasonal, and residual components

Four-panel decomposition: original observed data, trend component, seasonal component, and residual (random noise) component

What this shows: The trend panel shows steady growth over time. The seasonal panel reveals repeating patterns (weekly/monthly cycles). The residual panel shows random fluctuations after removing trend and seasonality - this is what's left for the model to learn from.

Understanding seasonality: Social media engagement often follows seasonal patterns - holiday periods such as December, February, and June can drive spikes, while transition periods see lower activity. With only two months of daily data, though, the decomposition above can only capture short cycles such as day-of-week effects; learning yearly cycles would require several years of history.

Stationarity Testing and Differencing

Here's a crucial but often overlooked step. "Stationarity" means the data's patterns stay consistent over time. Imagine trying to predict waves if the ocean level kept rising - you'd need to account for that rise first! Since TikTok is constantly growing (non-stationary), we use "differencing" - a mathematical transformation:

First Difference:

Δy(t) = y(t) - y(t-1)

Second Difference:

Δ²y(t) = Δy(t) - Δy(t-1)

Translation: Instead of "20,000 views", we look at "+200 views from yesterday"
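Applied to five values from the table around the February 20 peak, the two formulas work out as follows (a small NumPy sketch):

```python
import numpy as np

# Daily views for Feb 18-22, taken from the source table
views = np.array([19_600, 19_800, 20_000, 19_800, 19_600])

first_diff = np.diff(views)        # day-over-day change
second_diff = np.diff(views, n=2)  # change in the day-over-day change

print(first_diff)   # [ 200  200 -200 -200]
print(second_diff)  # [   0 -400    0]
```

The first difference flips sign at the peak, while the second difference is zero everywhere except at the turning point.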

def check_stationarity(timeseries):
    # The Augmented Dickey-Fuller test checks for a unit root
    # Null hypothesis H₀: the series has a unit root (non-stationary)
    result = adfuller(timeseries)
    test_statistic, critical_values = result[0], result[4]
    for level, critical_value in critical_values.items():
        if test_statistic > critical_value:
            return False  # Can't reject H₀ at this level - treat as non-stationary
    return True  # H₀ rejected at the 1%, 5%, and 10% levels - stationary!

# Reuse the DataFrame loaded earlier so the datetime index is preserved
diff_data = data["TikTokViews"]
d = 0  # number of times the series has been differenced

while not check_stationarity(diff_data):
    diff_data = diff_data.diff().dropna()
    d += 1

diff_data.plot()
plt.show()

The function automatically applies differencing until the series becomes stationary. In this case, d=2 (double differencing) was required, which means we had to look at the "change in the rate of change" - similar to how physicists look at acceleration rather than just velocity. This transformation is essential for accurate predictions.

Differenced time series showing stationary data

Differenced time series oscillating around zero with no clear trend, ready for ARIMA modeling

What this shows: After double differencing, the data now oscillates around zero with no upward/downward trend. This "stationary" data is suitable for ARIMA modeling because the statistical properties (mean, variance) are now constant over time.

ACF and PACF Analysis

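ACF and PACF help us find patterns. Think of them as asking two questions. The ACF (autocorrelation function) asks: how strongly is today's value related to the value k periods ago, including any influence passed along through the periods in between? The PACF (partial autocorrelation function) asks: how strongly are the two related once the effect of those in-between periods has been removed?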

ACF Formula:

ρ(k) = Cov(yₜ, yₜ₋ₖ) / Var(yₜ)

Measures correlation between values k periods apart
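The formula can be turned into a sample estimate directly. Here is a minimal NumPy sketch, where the function name sample_acf and the toy alternating series are illustrative assumptions; it uses the usual estimator that divides every lag by the same lag-0 sum of squares:

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    denom = np.sum(centered ** 2)  # lag-0 term, proportional to Var(y)
    return np.array([
        np.sum(centered[k:] * centered[:len(y) - k]) / denom
        for k in range(max_lag + 1)
    ])

# A series that flips sign every step is negatively correlated at lag 1
alternating = np.array([1, -1, 1, -1, 1, -1, 1, -1])
rho = sample_acf(alternating, max_lag=2)
print(rho)  # [ 1.    -0.875  0.75 ]
```

Lag 0 is always 1 because a series is perfectly correlated with itself.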

plot_acf(diff_data)
plot_pacf(diff_data)
plt.show()

These plots help identify the optimal p and q parameters for the ARIMA model. The significant lags (bars outside the confidence interval) indicate which past values have predictive power.

ACF and PACF plots with confidence intervals

ACF (top) and PACF (bottom) plots with 95% confidence intervals (blue shaded areas) and significant lags at positions 1-3

Forecast with confidence intervals on differenced data

Forecast on differenced data: blue line (historical), red dashed line (12-month forecast), pink shaded area (95% confidence interval)

How to read ACF/PACF: Bars extending outside the blue shaded area are "significant" - they indicate that past values at those time lags help predict future values. The ACF shows overall correlation, while PACF shows direct correlation.

What the forecast shows: The model's predictions on the differenced (stationary) data. The pink shaded area shows uncertainty - we're 95% confident the true values will fall within this range.
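A 95% band like this is typically built from the forecast mean and its standard error under a normal approximation. A minimal sketch, assuming normally distributed forecast errors; the function name normal_interval and the input numbers are illustrative, not output from the model:

```python
from statistics import NormalDist

def normal_interval(mean, std_error, level=0.95):
    """Two-sided interval: mean +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return mean - z * std_error, mean + z * std_error

# Hypothetical forecast of +200 views with a standard error of 50
low, high = normal_interval(200.0, 50.0)
print(round(low, 1), round(high, 1))
```

A larger standard error widens the band, which is why intervals usually fan out as the forecast horizon grows.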

Parameter Selection

Now we automatically find the best model settings. The code counts how many "lags" (past time periods) significantly affect future values. It's like asking "How far back in history do we need to look?"

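# PACF: lags that fall outside the 95% band suggest the autoregressive order
pacf_values, confint = pacf(diff_data, alpha=0.05, method="ywmle")
confint = confint - pacf_values[:, None]  # center the band on zero
significant_lags_pacf = np.where((pacf_values < confint[:, 0]) | (pacf_values > confint[:, 1]))[0]
p = len(significant_lags_pacf) - 1  # lag 0 is always significant, so exclude it
P = len([x for x in significant_lags_pacf if x != 0 and x <= 12])
print(p, P)  # Output: 3 3

# ACF: the same check on the autocorrelations suggests the moving-average order
acf_values, confint = acf(diff_data, alpha=0.05)
confint = confint - acf_values[:, None]  # center the band on zero
significant_lags_acf = np.where((acf_values < confint[:, 0]) | (acf_values > confint[:, 1]))[0]
q = len(significant_lags_acf) - 1  # lag 0 is always significant, so exclude it
Q = len([x for x in significant_lags_acf if x != 0 and x <= 12])
print(q, Q)  # Output: 2 2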

Results decoded: p=3 (use the 3 previous observations), d=2 (difference twice), q=2 (use 2 error terms), and P=3, Q=2 for the seasonal terms with a period of 12. Because the data is daily, that period spans 12 days rather than 12 months. Our final model:

ARIMA(3,2,2) × (3,0,2)₁₂

In plain terms:

• Use 3 previous days + 2 error corrections

• Apply double differencing to remove trends

• Account for patterns that repeat every 12 observations

SARIMAX Model Fitting

We fit the SARIMAX model with the identified parameters:

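D = 0  # no seasonal differencing
# Note: diff_data is already differenced d times, and order=(p, d, q) makes
# SARIMAX difference it d more times internally. Passing order=(p, 0, q),
# or fitting on the raw series instead, would avoid the extra differencing.
model = SARIMAX(diff_data, order=(p, d, q), seasonal_order=(P, D, Q, 12))
future = model.fit()  # fitted results object, used below for forecasting
print(p, d, q, P, D, Q)  # Output: 3 2 2 3 0 2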

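The model is fit with L-BFGS-B optimization, which stops after 50 iterations with a final function value of 9.653. Since 50 is the default iteration cap (maxiter) of fit() in statsmodels, the optimizer may have been cut off before fully converging; passing a larger maxiter to fit() is one way to check.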

Generating Forecasts

We generate a 12-step forecast with confidence intervals. On this daily series each step is one day, so the forecast covers the next 12 days:

forecast_periods = 12
forecast = future.get_forecast(steps=forecast_periods)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()

plt.plot(diff_data, label="Observed")
plt.plot(forecast_mean, label="Forecast", color='red')
plt.fill_between(forecast_ci.index,
                 forecast_ci.iloc[:, 0],
                 forecast_ci.iloc[:, 1],
                 color="pink", label="95% confidence interval")
plt.legend()  # show the labels defined above
plt.show()

This creates a visualization showing the observed differenced data and the forecast with confidence bands.

Forecast plot showing differenced data with confidence intervals

Forecast on differenced data: blue line (observed), red line (forecast), pink shaded area (95% confidence interval)

What this shows: This is the "raw" forecast output from the SARIMAX model on the differenced data. The red line shows the model's predictions, but these need to be transformed back to actual view counts for interpretation.

Transforming Back to Original Scale

We integrate the differenced forecasts back to the original scale:

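# forecast_mean is on the scale of diff_data (second differences, since d=2),
# so undo both rounds of differencing: second differences -> daily changes -> views
last_value = data["TikTokViews"].iloc[-1]
last_change = data["TikTokViews"].diff().iloc[-1]
forecast_og = []
for step in forecast_mean:
    last_change += step        # rebuild the next daily change
    last_value += last_change  # rebuild the next view count
    forecast_og.append(last_value)

# The series is daily, so the forecast steps are the days after the last observation
start_date = data.index[-1] + pd.Timedelta(days=1)
date_range = pd.date_range(start=start_date, periods=len(forecast_og), freq="D")
forecast_og_df = pd.DataFrame(forecast_og, index=date_range, columns=["TikTokViews"])

plt.plot(data["TikTokViews"], label="Observed")
plt.plot(forecast_og_df, label="Forecast", color="red")
plt.legend()
plt.show()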

This transforms the differenced predictions back to actual view counts for interpretation.
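Because the series was differenced twice (d=2), a full inversion takes two cumulative sums: one to turn second differences back into daily changes, and one to turn daily changes back into view counts. A minimal NumPy sketch of that round trip follows. The function name undo_second_difference is hypothetical, and the three values after 18,200 are made up for the example rather than taken from the data or the model.

```python
import numpy as np

def undo_second_difference(last_value, last_first_diff, second_diffs):
    """Integrate second differences back to the original scale."""
    # Step 1: second differences -> first differences (daily changes)
    first_diffs = last_first_diff + np.cumsum(second_diffs)
    # Step 2: first differences -> original values (view counts)
    return last_value + np.cumsum(first_diffs)

# Last five observed values from the table, followed by three made-up future values
views = np.array([19_000, 18_800, 18_600, 18_400, 18_200, 18_000, 17_900, 17_850])
history, future = views[:5], views[5:]

# Pretend the model predicted the true second differences of the future values
future_second_diffs = np.diff(views, n=2)[-3:]

rebuilt = undo_second_difference(
    last_value=history[-1],
    last_first_diff=history[-1] - history[-2],
    second_diffs=future_second_diffs,
)
print(rebuilt)  # [18000 17900 17850]
```

Integrating only once would leave the result on the scale of daily changes, which is why both the last value and the last daily change are needed as starting points.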

Final forecast plot showing observed vs predicted TikTok views

Final forecast plot: blue line (observed historical TikTok views), red line (12-month forecast predictions)

What this shows: The final business-ready forecast! Blue shows actual historical TikTok views, red shows the model's predictions for the next 12 steps (12 days on this daily series). The model predicts continued growth, which content creators can use for planning.

Model Evaluation

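Finally, we evaluate the model performance using MAE and MSE. Note that the comparison below sets the forecast for future periods against the last 12 observed values of the differenced series, so it is a rough check rather than a true out-of-sample test: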

observed = diff_data[-forecast_periods:]

mae = mean_absolute_error(observed, forecast_mean)
mse = mean_squared_error(observed, forecast_mean)

print(f"MAE: {mae}")  # Output: MAE: 14939.027401154954
print(f"MSE: {mse}")  # Output: MSE: 274185965.8119963
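A stricter evaluation holds out the final observations, fits on the rest, and scores the forecast against the held-out values. The sketch below shows that split together with the two metrics. It is an illustration under assumptions: the series is rebuilt from the table's +200/-200 pattern, and a naive "repeat the last training value" forecast stands in for the SARIMAX forecast, which would take its place in practice.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

def mse(actual, predicted):
    """Mean squared error."""
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))

# Rebuild the 60 daily values: up 200/day to the peak (day 50), then down 200/day
views = 20_000 - 200 * np.abs(np.arange(60) - 50)

# Hold out the last 12 days; the model must never see them during fitting
horizon = 12
train, test = views[:-horizon], views[-horizon:]

# Placeholder forecast (assumption): repeat the last training value
naive_forecast = np.full(horizon, train[-1])

print(f"MAE: {mae(test, naive_forecast)}")  # MAE: 500.0
print(f"MSE: {mse(test, naive_forecast):.2f}")
```

Scoring a simple baseline like this alongside the real model also gives the error numbers something to be compared against.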

Final Model Output & Performance

Based on the 60 days of training data (January 1 to March 1, 2022), the model generated predictions for the next 12 steps. Here's how well it performed:

Understanding the Error Metrics:

MAE (Mean Absolute Error) = 14,939 views

MAE = (1/n) × Σ|actual - predicted|

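What it means: On average, our predictions are off by about 15,000 views. For context, daily views in the training data range from 10,000 to 20,000, so this is a large error relative to the data.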

Think of it as: The average "mistake" in our predictions

MSE (Mean Squared Error) = 274,185,965

MSE = (1/n) × Σ(actual - predicted)²

What it means: This metric penalizes larger errors more heavily

Think of it as: A way to catch when predictions go really wrong

Mean Absolute Error: 14,939 views

Mean Squared Error: 274,185,965

Forecast Range: 12 months

Confidence Interval: 95%

Convergence: 50 iterations using L-BFGS-B

Final Function Value: 9.653

Model Parameters: ARIMA(3,2,2) × (3,0,2)₁₂

Key Insights & What I Learned

Building this TikTok view predictor was more challenging than I anticipated, but incredibly rewarding. I initially thought TikTok growth would be random and impossible to model accurately. I was wrong. The biggest surprise was discovering that TikTok isn't just growing steadily but actually accelerating over time, creating a viral snowball effect. Even more fascinating was finding a clear 12-month seasonal pattern hidden in all the apparent chaos. The hardest part was dealing with those intimidating optimization warnings that made me question everything, but persistence paid off. Achieving an MAE of just 15,000 views felt like a genuine breakthrough and taught me that even in social media, there is a lot of math to be found.
