Why Log Returns? A Mathematical Foundation
Understanding the Correct Transformation for Financial Time Series
Before we can detect market regimes, we must answer a fundamental question:
How should we transform price data for statistical analysis?
This notebook provides the mathematical proof for why log returns are the theoretically correct choice.
Table of Contents
- The Two Competing Approaches
- Data Collection
- Visual Comparison
- Property 1: Stationarity
- Property 2: Time Additivity
- Property 3: Scale Invariance
- Property 4: Symmetric Treatment
- Property 5: Cross-Asset Comparability
- Distribution Properties
- Conclusion
Note: This notebook contains NO regime detection. We’re proving mathematical properties only.
The next notebook will show how to use these log returns for HMM-based regime detection.
See this notebook on GitHub
```python
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Statistical tools
from scipy import stats
from statsmodels.tsa.stattools import adfuller, kpss, acf, pacf
from statsmodels.stats.diagnostic import het_arch

# Plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette(sns.color_palette())
```
1. The Two Competing Approaches
When transforming price data $P_t$ into returns, we have two options:
Simple (Percentage) Returns
$$r_t = \frac{P_t - P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}} - 1$$
Log Returns
$$R_t = \log\left(\frac{P_t}{P_{t-1}}\right) = \log(P_t) - \log(P_{t-1})$$
Why This Matters
Statistical models (including HMMs) require stationary observations. The wrong choice breaks fundamental assumptions.
We'll prove that log returns satisfy 5 critical mathematical properties that percentage returns do not.
2. Data Collection
We'll use 4 assets with different characteristics:
- SPY: S&P 500 ETF (market benchmark, ~$500)
- AAPL: Apple (~$180)
- TSLA: Tesla (~$250, high volatility)
- KO: Coca-Cola (~$60, defensive)
The different price levels, ranging from $60 to $500, will demonstrate scale invariance.
```python
# Download 2 years of data
tickers = ['SPY', 'AAPL', 'TSLA', 'KO']
end_date = datetime.now()
start_date = end_date - timedelta(days=730)

print(f"Downloading data from {start_date.date()} to {end_date.date()}\n")

data = {}
for ticker in tickers:
    print(f"Downloading {ticker}...", end=' ')
    df = yf.download(ticker, start=start_date, end=end_date, progress=False, auto_adjust=True)
    data[ticker] = df
    price_delta = df['Close'].max() - df['Close'].min()
    print(f"({len(df)} days, price range: ${price_delta[ticker]:.2f})")

print(f"\nDownloaded {len(tickers)} assets")
```
Downloading data from 2023-10-12 to 2025-10-11
Downloading SPY... (501 days, price range: $272.78)
Downloading AAPL... (501 days, price range: $94.28)
Downloading TSLA... (501 days, price range: $337.81)
Downloading KO... (501 days, price range: $23.09)
Downloaded 4 assets
```python
# Visualize raw prices to show scale differences
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, ticker in enumerate(tickers):
    ax = axes[idx]
    prices = data[ticker]['Close'].squeeze()
    prices.plot(ax=ax, linewidth=2, color=f'C{idx}')

    ax.set_title(f'{ticker} - Raw Price Data', fontsize=14, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Price ($)')
    ax.grid(True, alpha=0.3)

    # Add price statistics
    mean_price = prices.mean()
    min_price = prices.min()
    max_price = prices.max()

    ax.text(0.02, 0.98,
            f'Mean: ${mean_price:.2f}\nRange: ${min_price:.2f}-${max_price:.2f}',
            transform=ax.transAxes, fontsize=10, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.suptitle('Raw Price Data - Notice Vastly Different Scales',
             y=1.02, fontsize=16, fontweight='bold')
plt.show()
```
Note how the very different price levels make raw prices (and dollar moves) hard to compare across assets - this is exactly the problem that returns solve.
3. Visual Comparison
Let’s calculate both types of returns and compare them visually.
```python
# Demonstration: Correct vs Incorrect Display of Log Returns
ticker = 'SPY'
prices = data[ticker]['Close'].squeeze()
log_returns = np.log(prices / prices.shift(1)).dropna()
simple_returns = prices.pct_change().dropna()

print("HOW TO DISPLAY LOG RETURNS")
print("=" * 80)

# Calculate statistics
log_mean = log_returns.mean()
log_std = log_returns.std()
simple_mean = simple_returns.mean()
simple_std = simple_returns.std()

print("\n1. MEAN (Expected Daily Return)")
print("-" * 80)
print(f" Log return mean (log space): {log_mean:.6f}")
print(f" Simple return mean: {simple_mean:.6f}")
print(f"\n CORRECT: (e^{log_mean:.6f} - 1) * 100 = {(np.exp(log_mean) - 1) * 100:.4f}%")
print(f" WRONG: {log_mean:.6f} * 100 = {log_mean * 100:.4f}%")
print(f" Difference: {abs((np.exp(log_mean) - 1) * 100 - log_mean * 100):.4f}%")

print("\n2. STANDARD DEVIATION (Volatility)")
print("-" * 80)
print(f" Log return std (log space): {log_std:.6f}")
print(f" Simple return std: {simple_std:.6f}")
print(f"\n Approximation: {log_std:.6f} * 100 ≈ {log_std * 100:.4f}%")
print(f" Simple std: {simple_std:.6f} * 100 = {simple_std * 100:.4f}%")
print(f" Difference: {abs(log_std * 100 - simple_std * 100):.4f}% (very small!)")
print(f"\n Note: For daily returns (~1%), the approximation std*100 is acceptable")
print(f" But ALWAYS use (e^mean - 1)*100 for the mean!")

print("\n3. EXAMPLE: Large Return")
print("-" * 80)
large_log_return = 0.10  # 10% in log space
print(f" Log return: {large_log_return:.4f}")
print(f" CORRECT: (e^{large_log_return:.4f} - 1) * 100 = {(np.exp(large_log_return) - 1) * 100:.2f}%")
print(f" WRONG: {large_log_return:.4f} * 100 = {large_log_return * 100:.2f}%")
print(f" Error: {abs((np.exp(large_log_return) - 1) * 100 - large_log_return * 100):.2f}%")

print("\n" + "=" * 80)
print("KEY TAKEAWAYS:")
print("=" * 80)
print("Mean: ALWAYS convert with (e^mean - 1) * 100")
print("Std: Multiplying by 100 is acceptable for small daily returns")
print("Individual returns: ALWAYS convert with (e^r - 1) * 100 for display")
print("NEVER just multiply log returns by 100 without conversion!")
print("=" * 80)
```
HOW TO DISPLAY LOG RETURNS
================================================================================
1. MEAN (Expected Daily Return)
--------------------------------------------------------------------------------
Log return mean (log space): 0.000870
Simple return mean: 0.000923
CORRECT: (e^0.000870 - 1) * 100 = 0.0870%
WRONG: 0.000870 * 100 = 0.0870%
Difference: 0.0000%
2. STANDARD DEVIATION (Volatility)
--------------------------------------------------------------------------------
Log return std (log space): 0.010280
Simple return std: 0.010336
Approximation: 0.010280 * 100 ≈ 1.0280%
Simple std: 0.010336 * 100 = 1.0336%
Difference: 0.0057% (very small!)
Note: For daily returns (~1%), the approximation std*100 is acceptable
But ALWAYS use (e^mean - 1)*100 for the mean!
3. EXAMPLE: Large Return
--------------------------------------------------------------------------------
Log return: 0.1000
CORRECT: (e^0.1000 - 1) * 100 = 10.52%
WRONG: 0.1000 * 100 = 10.00%
Error: 0.52%
================================================================================
KEY TAKEAWAYS:
================================================================================
Mean: ALWAYS convert with (e^mean - 1) * 100
Std: Multiplying by 100 is acceptable for small daily returns
Individual returns: ALWAYS convert with (e^r - 1) * 100 for display
NEVER just multiply log returns by 100 without conversion!
================================================================================
Important: How to Display Log Returns
CRITICAL DISTINCTION: Log returns live in "log space" - they are NOT percentages!
Converting Log Returns to Percentages
CORRECT way to display a log return as a percentage:
```python
log_return = 0.01  # This is in log space
percentage = (np.exp(log_return) - 1) * 100  # ≈ 1.005%
```
WRONG way (common mistake):
```python
log_return = 0.01
percentage = log_return * 100  # = 1.0% (INCORRECT!)
```
Why the Difference Matters
- Small returns: The approximation log(1+r) ≈ r is valid. Example: 1% simple → 0.00995 log (0.5 bp difference)
- Large returns: The approximation breaks down. Example: 10% simple → 0.0953 log (47 bp difference!)
Volatility (Standard Deviation)
For small returns, the approximation std(log returns) ≈ std(simple returns) is valid:
- Multiplying by 100 gives approximate volatility in percentage terms
- SPY daily volatility of ~1% makes this approximation reasonable
Rule of Thumb:
- Daily log return mean: ALWAYS use (np.exp(mean) - 1) * 100
- Daily log return std: Multiplying by 100 is an acceptable approximation for typical daily returns
- For display/reporting: Stay in log space OR clearly label the value as an approximation
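A minimal helper capturing the rule of thumb above (not part of the original notebook; it only assumes numpy):
```python
import numpy as np

def log_return_to_pct(log_r):
    """Exact conversion of a log return to a simple-return percentage."""
    return (np.exp(log_r) - 1) * 100

print(log_return_to_pct(0.01))   # ~1.005  -> close to the 1.0 approximation
print(log_return_to_pct(0.10))   # ~10.52  -> the 10.0 approximation is noticeably too low
```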
Let’s demonstrate this with SPY data…
```python
# Calculate both return types for SPY
ticker = 'SPY'
prices = data[ticker]['Close'].squeeze()

# Simple returns: (P_t - P_{t-1}) / P_{t-1}
simple_returns = prices.pct_change().dropna()

# Log returns: log(P_t / P_{t-1})
log_returns = np.log(prices / prices.shift(1)).dropna()

print(f"{ticker} Returns Summary:")
print("=" * 60)
print(f"{'Metric':<30} {'Simple':<15} {'Log':<15}")
print("=" * 60)
# Convert log returns to percentage: (e^r - 1) * 100
print(f"{'Mean (daily %)':<30} {simple_returns.mean()*100:>14.4f} {(np.exp(log_returns.mean()) - 1)*100:>14.4f}")
print(f"{'Std Dev (daily %, approx.)':<30} {simple_returns.std()*100:>14.4f} {log_returns.std()*100:>14.4f}")
print(f"{'Min (daily %)':<30} {simple_returns.min()*100:>14.2f} {(np.exp(log_returns.min()) - 1)*100:>14.2f}")
print(f"{'Max (daily %)':<30} {simple_returns.max()*100:>14.2f} {(np.exp(log_returns.max()) - 1)*100:>14.2f}")
print(f"{'Correlation':<30} {np.corrcoef(simple_returns, log_returns)[0,1]:>14.6f}")
print("=" * 60)
```
SPY Returns Summary:
============================================================
Metric Simple Log
============================================================
Mean (daily %) 0.0923 0.0870
Std Dev (daily %, approx.) 1.0336 1.0280
Min (daily %) -5.85 -5.85
Max (daily %) 10.50 10.50
Correlation 0.999698
============================================================
Interpretation
The table above shows that simple and log returns are nearly identical for daily data:
- Correlation: 0.9997 - they move together almost perfectly
- Mean difference: < 0.01% - negligible for daily returns
- Std dev difference: < 0.01% - volatility measures are essentially equivalent
Why the approximation works: for small returns, the relationship log(1+r) ≈ r is extremely accurate:
- 1% simple return → 0.009950 log return
- Difference: only 0.50 basis points (0.005%)
Important caveat:
- The mean should always use the proper conversion (e^mean - 1) * 100
- The std dev can use the approximation std * 100 for daily returns - this is why we labeled it "approx." in the table
For daily returns around 1%, this approximation introduces negligible error while maintaining computational simplicity.
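As a quick sanity check of the numbers quoted above, here is a small sketch (not part of the original notebook) that recomputes the approximation gap at 1% and 10%:
```python
import numpy as np

for r in (0.01, 0.10):                 # 1% and 10% simple returns
    log_r = np.log(1 + r)              # equivalent log return
    gap_bp = (r - log_r) * 1e4         # difference in basis points
    print(f"simple={r:.2%}  log={log_r:.6f}  gap={gap_bp:.1f} bp")
# Roughly 0.5 bp at 1% and 47 bp at 10%, matching the figures above
```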
```python
# Visual comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# 1. Time series comparison
ax = axes[0, 0]
simple_returns.plot(ax=ax, label='Simple Returns', alpha=0.7, linewidth=1)
log_returns.plot(ax=ax, label='Log Returns', alpha=0.7, linewidth=1)
ax.set_title('Returns Over Time', fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('Return')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Scatter plot showing relationship
# Convert log returns to percentage for comparison
log_returns_pct = (np.exp(log_returns) - 1) * 100
ax = axes[0, 1]
ax.scatter(simple_returns * 100, log_returns_pct, alpha=0.3, s=10)
ax.plot([-5, 5], [-5, 5], 'r--', label='y=x line', linewidth=2)
ax.set_xlabel('Simple Returns (%)')
ax.set_ylabel('Log Returns (converted to %)')
ax.set_title('Log vs Simple Returns', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.text(0.05, 0.95, f'Correlation: {np.corrcoef(simple_returns, log_returns)[0,1]:.6f}',
        transform=ax.transAxes, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# 3. Distribution comparison
ax = axes[0, 2]
ax.hist(simple_returns * 100, bins=50, alpha=0.5, label='Simple', density=True, color='blue')
ax.hist(log_returns_pct, bins=50, alpha=0.5, label='Log', density=True, color='orange')
ax.set_xlabel('Return (%)')
ax.set_ylabel('Density')
ax.set_title('Distribution Comparison', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Q-Q plot: Simple returns
ax = axes[1, 0]
stats.probplot(simple_returns, dist="norm", plot=ax)
ax.set_title('Q-Q Plot: Simple Returns', fontweight='bold')
ax.grid(True, alpha=0.3)

# 5. Q-Q plot: Log returns
ax = axes[1, 1]
stats.probplot(log_returns, dist="norm", plot=ax)
ax.set_title('Q-Q Plot: Log Returns', fontweight='bold')
ax.grid(True, alpha=0.3)

# 6. Cumulative returns (testing additivity)
ax = axes[1, 2]
cum_simple = (1 + simple_returns).cumprod() - 1
cum_log = np.exp(log_returns.cumsum()) - 1
cum_simple.plot(ax=ax, label='From Simple Returns', linewidth=2)
cum_log.plot(ax=ax, label='From Log Returns', linewidth=2, linestyle='--')
ax.set_title('Cumulative Returns', fontweight='bold')
ax.set_xlabel('Date')
ax.set_ylabel('Cumulative Return')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
While simple and log returns are highly correlated and visually similar at daily horizons, log returns have superior mathematical properties. The next five sections prove them one by one.
4. Property #1: Stationarity
A stationary series has statistical properties (mean, variance) that do not change over time - a core assumption of HMMs and most time series models. We test this formally with the ADF and KPSS tests.
```python
# Comprehensive stationarity testing
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tools.sm_exceptions import InterpolationWarning

def test_stationarity(series, name):
    """Run multiple stationarity tests with graceful handling of edge cases"""
    print(f"\n{name}:")
    print("-" * 70)
    # ADF Test
    adf_result = adfuller(series.dropna(), autolag='AIC')
    print(f"ADF Test:")
    print(f"  Statistic: {adf_result[0]:.4f}")
    print(f"  p-value: {adf_result[1]:.4f}")
    print(f"  Result: {'STATIONARY' if adf_result[1] < 0.05 else 'NON-STATIONARY'}")
    # KPSS Test with edge case handling
    print(f"\nKPSS Test:")
    # Use conservative nlags to avoid edge cases while remaining statistically valid
    n_obs = len(series.dropna())
    nlags = min(int(12 * (n_obs / 100) ** 0.25), n_obs // 4)
    # Catch interpolation warnings (they actually reinforce our conclusions!)
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always", InterpolationWarning)
        kpss_result = kpss(series.dropna(), regression='c', nlags=nlags)
        interp_warnings = [warn for warn in w if issubclass(warn.category, InterpolationWarning)]

    print(f"  Statistic: {kpss_result[0]:.4f}")
    print(f"  p-value: {kpss_result[1]:.4f}")
    print(f"  Critical value (5%): {kpss_result[3]['5%']:.4f}")
    # Explain if test statistic was extreme
    if interp_warnings:
        if kpss_result[0] > kpss_result[3]['10%']:
            print(f"  [WARNING] Extremely high statistic (very strong non-stationarity)")
        else:
            print(f"  [INFO] Extremely low statistic (very strong stationarity)")

    print(f"  Result: {'STATIONARY' if kpss_result[1] > 0.05 else 'NON-STATIONARY'}")

    return {
        'adf_stat': adf_result[0],
        'adf_pvalue': adf_result[1],
        'kpss_stat': kpss_result[0],
        'kpss_pvalue': kpss_result[1],
        'is_stationary': (adf_result[1] < 0.05) and (kpss_result[1] > 0.05)
    }

# Test both prices and log returns for SPY
print("=" * 70)
print("STATIONARITY TESTS (SPY)")
print("=" * 70)
prices_result = test_stationarity(prices, "Raw Prices")
log_result = test_stationarity(log_returns, "Log Returns")
print("=" * 70)
```
======================================================================
STATIONARITY TESTS (SPY)
======================================================================
Raw Prices:
----------------------------------------------------------------------
ADF Test:
Statistic: -1.2480
p-value: 0.6526
Result: NON-STATIONARY
KPSS Test:
Statistic: 2.4675
p-value: 0.0100
Critical value (5%): 0.4630
[WARNING] Extremely high statistic (very strong non-stationarity)
Result: NON-STATIONARY
Log Returns:
----------------------------------------------------------------------
ADF Test:
Statistic: -12.7847
p-value: 0.0000
Result: STATIONARY
KPSS Test:
Statistic: 0.1046
p-value: 0.1000
Critical value (5%): 0.4630
[INFO] Extremely low statistic (very strong stationarity)
Result: STATIONARY
======================================================================
Statistical Conclusion
The test results above provide definitive evidence:
Raw Prices (NON-STATIONARY):
- ADF test: p-value = 0.65 → Cannot reject unit root hypothesis
- KPSS test: Statistic = 2.47 » critical value (0.46) → Reject stationarity
- Interpretation: Prices have a unit root - they follow a random walk
Log Returns (STATIONARY):
- ADF test: p-value < 0.0001 → Strongly reject unit root hypothesis
- KPSS test: Statistic = 0.10 « critical value (0.46) → Cannot reject stationarity
- Interpretation: Log returns are mean-reverting with constant variance
Why This Matters:
Statistical models (including HMMs) require stationary inputs. Using prices would violate this assumption and produce invalid inferences. Log returns satisfy the stationarity requirement, making them suitable for:
- Hidden Markov Models
- ARIMA models
- Mean-variance optimization
- Any time series model assuming stationarity
This is not just a nice-to-have property - it’s a fundamental requirement for valid statistical inference.
```python
# Rolling statistics to visualize stationarity
window = 60  # 60-day rolling window

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Top row: Prices
# 1. Price level (shows trend)
ax = axes[0, 0]
prices.plot(ax=ax, linewidth=1.5, label='Price', color='blue')
prices.rolling(window).mean().plot(ax=ax, linewidth=2, label=f'{window}-day MA', color='red')
ax.set_title('Prices Show Trend (Non-Stationary)', fontweight='bold')
ax.set_ylabel('Price ($)')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Price rolling std (shows changing variance)
ax = axes[0, 1]
prices.rolling(window).std().plot(ax=ax, linewidth=2, color='purple')
ax.set_title('Price Volatility Changes Over Time', fontweight='bold')
ax.set_ylabel('Rolling Std Dev ($)')
ax.grid(True, alpha=0.3)

# Bottom row: Log Returns
# 3. Log returns (no trend)
ax = axes[1, 0]
log_returns.plot(ax=ax, linewidth=0.8, alpha=0.7, label='Log Return', color='green')
log_returns.rolling(window).mean().plot(ax=ax, linewidth=2, label=f'{window}-day MA', color='red')
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
ax.set_title('Log Returns Mean-Reverting (Stationary)', fontweight='bold')
ax.set_ylabel('Log Return')
ax.legend()
ax.grid(True, alpha=0.3)

# 4. Log return rolling std (relatively stable)
ax = axes[1, 1]
log_returns.rolling(window).std().plot(ax=ax, linewidth=2, color='orange')
ax.set_title('Log Return Volatility More Stable', fontweight='bold')
ax.set_ylabel('Rolling Std Dev')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```
Visual Evidence of Stationarity
The four panels above clearly illustrate why prices are non-stationary while log returns are stationary:
Top Row - Prices (NON-STATIONARY):
- Left panel: Mean wanders significantly over time (upward trend visible)
- Right panel: Variance changes dramatically - volatility in dollar terms increases with price level
Bottom Row - Log Returns (STATIONARY):
- Left panel: Mean oscillates around zero (no persistent trend)
- Right panel: Variance relatively stable over time (some variation due to volatility clustering, but no persistent trend)
Key Insight: Stationarity means statistical properties don’t change over time. Prices clearly violate this (they trend and their volatility grows), while log returns satisfy it (mean-reverting with relatively constant variance).
This visual evidence confirms what the statistical tests told us: use log returns, not prices, for statistical modeling.
5. Property #2: Time Additivity
The Problem
Multi-period returns are common:
- Weekly returns (5 days)
- Monthly returns (~21 days)
- Quarterly returns (~63 days)
Simple returns require compounding: $$r_{total} = (1 + r_1)(1 + r_2) \cdots (1 + r_T) - 1$$
Log returns simply add: $$R_{total} = R_1 + R_2 + \cdots + R_T$$
This additivity is mathematically elegant and computationally efficient.
```python
# Demonstrate time additivity
print("TIME ADDITIVITY DEMONSTRATION")
print("=" * 70)

# Pick a random 5-day period
start_idx = 100
period = 5
period_dates = prices.index[start_idx:start_idx+period+1]
period_prices = prices.iloc[start_idx:start_idx+period+1]

print(f"\n5-Day Period: {period_dates[0].date()} to {period_dates[-1].date()}")
print("-" * 70)

# Calculate daily returns
daily_simple = period_prices.pct_change().dropna()
daily_log = np.log(period_prices / period_prices.shift(1)).dropna()

print("\nDaily Returns:")
for i, (date, simple_ret, log_ret) in enumerate(zip(daily_simple.index, daily_simple, daily_log)):
    # Convert log return to percentage for display
    log_ret_pct = (np.exp(log_ret) - 1) * 100
    print(f"  Day {i+1} ({date.date()}): Simple={simple_ret*100:>6.2f}%, Log={log_ret_pct:>6.2f}%")

# Total return (direct calculation)
total_return_direct = (period_prices.iloc[-1] / period_prices.iloc[0]) - 1

# Total return from simple returns (compounding)
total_simple_compound = (1 + daily_simple).prod() - 1

# Total return from log returns (summing)
total_log_sum = daily_log.sum()
total_log_to_simple = np.exp(total_log_sum) - 1

print("\n" + "=" * 70)
print("MULTI-PERIOD RETURN CALCULATION:")
print("=" * 70)
print(f"\nDirect calculation: {total_return_direct*100:>6.2f}%")
print(f"Simple returns (compound): {total_simple_compound*100:>6.2f}% <- Requires multiplication")
print(f"Log returns (sum): {total_log_to_simple*100:>6.2f}% <- Just add them up!")

print("\nLog returns: R_total = R_1 + R_2 + R_3 + R_4 + R_5")
print("Simple returns: r_total = (1+r_1)(1+r_2)(1+r_3)(1+r_4)(1+r_5) - 1")
print("\nNote: Log returns are linearly additive across time!")
```
TIME ADDITIVITY DEMONSTRATION
======================================================================
5-Day Period: 2024-03-07 to 2024-03-14
----------------------------------------------------------------------
Daily Returns:
Day 1 (2024-03-08): Simple= -0.60%, Log= -0.60%
Day 2 (2024-03-11): Simple= -0.09%, Log= -0.09%
Day 3 (2024-03-12): Simple= 1.08%, Log= 1.08%
Day 4 (2024-03-13): Simple= -0.16%, Log= -0.16%
Day 5 (2024-03-14): Simple= -0.20%, Log= -0.20%
======================================================================
MULTI-PERIOD RETURN CALCULATION:
======================================================================
Direct calculation: 0.03%
Simple returns (compound): 0.03% <- Requires multiplication
Log returns (sum): 0.03% <- Just add them up!
Log returns: R_total = R_1 + R_2 + R_3 + R_4 + R_5
Simple returns: r_total = (1+r_1)(1+r_2)(1+r_3)(1+r_4)(1+r_5) - 1
Note: Log returns are linearly additive across time!
6. Property #3: Scale Invariance
The Problem
Our 4 assets have very different price levels:
- SPY: ~$500
- KO: ~$60
Question: If both move $5, should we consider that the “same” move?
Answer: No! A $5 move on a $60 stock (8.3%) is much more significant than on a $500 stock (1%).
Log returns automatically account for this - they’re scale invariant.
```python
# Demonstrate scale invariance
print("SCALE INVARIANCE DEMONSTRATION")
print("=" * 70)

# Compare all 4 assets
print("\nAsset Price Levels:")
for ticker in tickers:
    price = data[ticker]['Close'].squeeze().iloc[-1]
    print(f"  {ticker}: ${price:.2f}")

print("\n" + "-" * 70)
print("Scenario: Each asset moves $5 in one day")
print("-" * 70)

dollar_move = 5.0

for ticker in tickers:
    current_price = data[ticker]['Close'].squeeze().iloc[-1]
    new_price = current_price + dollar_move

    # Simple return
    simple_ret = dollar_move / current_price

    # Log return
    log_ret = np.log(new_price / current_price)

    # Convert log return to percentage for display
    log_ret_pct = (np.exp(log_ret) - 1) * 100

    print(f"\n{ticker} (${current_price:.2f} → ${new_price:.2f}):")
    print(f"  Simple return: {simple_ret*100:>6.2f}%")
    print(f"  Log return: {log_ret_pct:>6.2f}%")

print("\n" + "=" * 70)
print("OBSERVATION:")
print("=" * 70)
print("Both simple and log returns correctly show:")
print("  - KO ($60): $5 move = larger percentage (8.3%)")
print("  - SPY ($500): $5 move = smaller percentage (1.0%)")
print("\nBoth adjust for scale differences")
print("But log returns have better mathematical properties (additivity, etc.)")
```
SCALE INVARIANCE DEMONSTRATION
======================================================================
Asset Price Levels:
SPY: $653.02
AAPL: $245.27
TSLA: $413.49
KO: $67.04
----------------------------------------------------------------------
Scenario: Each asset moves $5 in one day
----------------------------------------------------------------------
SPY ($653.02 → $658.02):
Simple return: 0.77%
Log return: 0.77%
AAPL ($245.27 → $250.27):
Simple return: 2.04%
Log return: 2.04%
TSLA ($413.49 → $418.49):
Simple return: 1.21%
Log return: 1.21%
KO ($67.04 → $72.04):
Simple return: 7.46%
Log return: 7.46%
======================================================================
OBSERVATION:
======================================================================
Both simple and log returns correctly show:
- KO ($60): $5 move = larger percentage (8.3%)
- SPY ($500): $5 move = smaller percentage (1.0%)
Both adjust for scale differences
But log returns have better mathematical properties (additivity, etc.)
7. Property #4: Symmetric Treatment
The Problem
Consider these two scenarios:
- Stock goes from $100 → $110 (gain +10%)
- Stock goes from $110 → $100 (loss -9.09%)
These are inverse movements (same magnitude, opposite direction), but:
- Simple returns: +10% vs -9.09% (asymmetric)
- Log returns: +9.53% vs -9.53% (symmetric!)
Log returns treat gains and losses symmetrically.
```python
# Demonstrate symmetric treatment
print("SYMMETRIC TREATMENT OF GAINS AND LOSSES")
print("=" * 70)

# Example: $100 stock
price_0 = 100
moves = [1.10, 1.20, 1.50, 2.00]  # 10%, 20%, 50%, 100% gains

print("\nScenario: Up then down by same factor\n")
print(f"{'Factor':<10} {'Up (Simple)':<15} {'Down (Simple)':<15} {'Up (Log)':<15} {'Down (Log)':<15}")
print("-" * 70)

for factor in moves:
    # Up move
    price_up = price_0 * factor
    simple_up = (price_up / price_0) - 1
    log_up = np.log(price_up / price_0)

    # Down move (reverse)
    price_down = price_up / factor  # Should return to price_0
    simple_down = (price_down / price_up) - 1
    log_down = np.log(price_down / price_up)

    # Display log returns in log points (x100), NOT converted back to simple %,
    # so their symmetry is actually visible in the table
    print(f"{factor:<10.2f} {simple_up*100:>13.1f}% {simple_down*100:>13.1f}% {log_up*100:>13.1f}% {log_down*100:>13.1f}%")

print("\n" + "=" * 70)
print("OBSERVATION:")
print("=" * 70)
print("Simple returns: Gains and losses have different magnitudes")
print("  Example: +100% gain requires -50% loss to break even (asymmetric)")
print("\nLog returns: Gains and losses have equal magnitudes (symmetric)")
print("  Example: +100% log gain requires -100% log loss to break even")
print("\nNote: This symmetry is crucial for statistical modeling!")
```
SYMMETRIC TREATMENT OF GAINS AND LOSSES
======================================================================
Scenario: Up then down by same factor
Factor Up (Simple) Down (Simple) Up (Log) Down (Log)
----------------------------------------------------------------------
1.10 10.0% -9.1% 9.5% -9.5%
1.20 20.0% -16.7% 18.2% -18.2%
1.50 50.0% -33.3% 40.5% -40.5%
2.00 100.0% -50.0% 69.3% -69.3%
======================================================================
OBSERVATION:
======================================================================
Simple returns: Gains and losses have different magnitudes
Example: +100% gain requires -50% loss to break even (asymmetric)
Log returns: Gains and losses have equal magnitudes (symmetric)
Example: +100% log gain requires -100% log loss to break even
Note: This symmetry is crucial for statistical modeling!
8. Property #5: Cross-Asset Comparability
The Problem
We want to:
- Compare regime behavior across different assets
- Build models that work for multiple assets
- Identify when assets are in similar regimes
Log returns enable valid cross-asset comparison because they’re scale-invariant and have consistent statistical properties.
```python
# Calculate log returns for all assets
all_log_returns = {}
for ticker in tickers:
    prices = data[ticker]['Close'].squeeze()
    all_log_returns[ticker] = np.log(prices / prices.shift(1)).dropna()

# Create comparison table
print("CROSS-ASSET LOG RETURN STATISTICS")
print("=" * 80)
print(f"{'Asset':<8} {'Mean (daily %)':<15} {'Std (daily %)':<15} {'Ann. Return %':<15} {'Ann. Vol %':<15}")
print("=" * 80)

for ticker in tickers:
    lr = all_log_returns[ticker]
    # Convert mean log return to percentage properly: (e^mean - 1) * 100
    mean_daily = (np.exp(lr.mean()) - 1) * 100
    # Std can be multiplied by 100 (volatility approximation is valid for small returns)
    std_daily = lr.std() * 100
    # Annualized return: convert from log space
    ann_return = (np.exp(lr.mean() * 252) - 1) * 100
    ann_vol = lr.std() * np.sqrt(252) * 100

    print(f"{ticker:<8} {mean_daily:>14.3f} {std_daily:>14.2f} {ann_return:>14.1f} {ann_vol:>14.1f}")

print("=" * 80)
print("\nNote: All assets expressed in same units (log returns)")
print("      Can now meaningfully compare volatilities and distributions")
```
CROSS-ASSET LOG RETURN STATISTICS
================================================================================
Asset Mean (daily %) Std (daily %) Ann. Return % Ann. Vol %
================================================================================
SPY 0.087 1.03 24.5 16.3
AAPL 0.063 1.75 17.2 27.8
TSLA 0.094 3.97 26.6 63.0
KO 0.060 0.94 16.2 14.8
================================================================================
Note: All assets expressed in same units (log returns)
Can now meaningfully compare volatilities and distributions
```python
# Visual comparison of distributions
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Distribution overlay
ax = axes[0, 0]
for ticker in tickers:
    all_log_returns[ticker].hist(bins=50, alpha=0.5, label=ticker, ax=ax, density=True)
ax.set_xlabel('Log Return')
ax.set_ylabel('Density')
ax.set_title('Log Return Distributions (Comparable!)', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)

# 2. Volatility comparison
ax = axes[0, 1]
vols = [all_log_returns[t].std() * np.sqrt(252) * 100 for t in tickers]
colors = ['C0', 'C1', 'C2', 'C3']
ax.bar(tickers, vols, color=colors, alpha=0.7)
ax.set_ylabel('Annualized Volatility (%)')
ax.set_title('Volatility Comparison (Directly Comparable)', fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# 3. Correlation matrix
ax = axes[1, 0]
returns_df = pd.DataFrame(all_log_returns)
corr = returns_df.corr()
im = ax.imshow(corr, cmap='RdYlGn', aspect='auto', vmin=-1, vmax=1)
ax.set_xticks(range(len(tickers)))
ax.set_yticks(range(len(tickers)))
ax.set_xticklabels(tickers)
ax.set_yticklabels(tickers)
for i in range(len(tickers)):
    for j in range(len(tickers)):
        text = ax.text(j, i, f'{corr.iloc[i, j]:.2f}',
                       ha="center", va="center", color="black", fontweight='bold')
plt.colorbar(im, ax=ax)
ax.set_title('Log Return Correlation Matrix', fontweight='bold')

# 4. Rolling correlation (SPY vs others)
ax = axes[1, 1]
window = 60
for ticker in ['AAPL', 'TSLA', 'KO']:
    rolling_corr = returns_df['SPY'].rolling(window).corr(returns_df[ticker])
    rolling_corr.plot(ax=ax, label=f'SPY-{ticker}', linewidth=2)
ax.set_ylabel('Rolling Correlation')
ax.set_title(f'{window}-Day Rolling Correlation with SPY', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)

plt.tight_layout()
plt.show()

print("\nLog returns enable meaningful cross-asset comparison")
print("Can identify common patterns and regime correlations")
print("Foundation for multi-asset regime detection")
```
Log returns enable meaningful cross-asset comparison
Can identify common patterns and regime correlations
Foundation for multi-asset regime detection
9. Distribution Properties
Are Log Returns Normally Distributed?
Theory suggests log returns should be approximately normal (Geometric Brownian Motion assumption).
Reality: log returns are NOT normally distributed - they exhibit significant fat tails and skewness.
This is not a flaw - it’s a well-known property of financial returns documented since the 1960s.
Let’s test this rigorously and understand the implications for HMM modeling.
```python
# Comprehensive distribution tests
from scipy.stats import shapiro, jarque_bera, normaltest

print("DISTRIBUTION ANALYSIS")
print("=" * 80)

for ticker in tickers:
    lr = all_log_returns[ticker]

    print(f"\n{ticker}:")
    print("-" * 80)

    # Moments
    print(f"  Mean: {lr.mean():.6f}")
    print(f"  Std Dev: {lr.std():.6f}")
    print(f"  Skewness: {lr.skew():.4f} (0 = symmetric)")
    print(f"  Kurtosis: {lr.kurtosis():.4f} (0 = normal, >0 = fat tails)")

    # Normality tests
    jb_statistic, jb_pval = jarque_bera(lr)
    print(f"\n  Jarque-Bera test p-value: {jb_pval:.4f} and statistic: {jb_statistic:.4f}")
    print(f"  {'Reject normality (p < 0.05)' if jb_pval < 0.05 else 'Cannot reject normality'}")

    shapiro_statistic, shapiro_pvalue = shapiro(lr)
    print(f"\n  Shapiro-Wilk test p-value: {shapiro_pvalue:.4f} and statistic: {shapiro_statistic:.4f}")
    print(f"  {'Reject normality (p < 0.05)' if shapiro_pvalue < 0.05 else 'Cannot reject normality'}")
```
DISTRIBUTION ANALYSIS
================================================================================
SPY:
--------------------------------------------------------------------------------
Mean: 0.000870
Std Dev: 0.010280
Skewness: 0.7646 (0 = symmetric)
Kurtosis: 21.0208 (0 = normal, >0 = fat tails)
Jarque-Bera test p-value: 0.0000 and statistic: 9061.0586
Reject normality (p < 0.05)
Shapiro-Wilk test p-value: 0.0000 and statistic: 0.8397
Reject normality (p < 0.05)
AAPL:
--------------------------------------------------------------------------------
Mean: 0.000630
Std Dev: 0.017507
Skewness: 0.6114 (0 = symmetric)
Kurtosis: 11.6025 (0 = normal, >0 = fat tails)
Jarque-Bera test p-value: 0.0000 and statistic: 2774.1240
Reject normality (p < 0.05)
Shapiro-Wilk test p-value: 0.0000 and statistic: 0.8926
Reject normality (p < 0.05)
TSLA:
--------------------------------------------------------------------------------
Mean: 0.000937
Std Dev: 0.039717
Skewness: 0.2935 (0 = symmetric)
Kurtosis: 3.7372 (0 = normal, >0 = fat tails)
Jarque-Bera test p-value: 0.0000 and statistic: 290.4843
Reject normality (p < 0.05)
Shapiro-Wilk test p-value: 0.0000 and statistic: 0.9578
Reject normality (p < 0.05)
KO:
--------------------------------------------------------------------------------
Mean: 0.000597
Std Dev: 0.009350
Skewness: 0.1202 (0 = symmetric)
Kurtosis: 2.4398 (0 = normal, >0 = fat tails)
Jarque-Bera test p-value: 0.0000 and statistic: 121.5465
Reject normality (p < 0.05)
Shapiro-Wilk test p-value: 0.0000 and statistic: 0.9786
Reject normality (p < 0.05)
STATISTICAL CONCLUSION:
- Log returns are NOT normally distributed: all normality tests reject the null hypothesis (p < 0.05)
- Significant fat tails (excess kurtosis) are present
- Positive skewness indicates asymmetric distributions
- This is a well-known property of financial returns (Mandelbrot 1963, Fama 1965)
- Log returns are MUCH more normal than prices or dollar changes
- But they still exhibit 'stylized facts': fat tails, volatility clustering
IMPLICATION FOR HMM MODELING:
- Gaussian HMMs are still useful despite non-normality
- Multiple regimes (a mixture of Gaussians) can approximate fat tails
- Regime-switching explains SOME of the non-normality
- Trade-off: simplicity/interpretability vs. perfect distributional fit
ALTERNATIVES (if tail risk is critical):
- Student-t emission distributions (heavier tails)
- Mixture models
- Non-parametric methods
For regime detection purposes, Gaussian HMMs provide reasonable results while maintaining interpretability. This is a standard approach in quantitative finance.
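As a preview of the next notebook, here is a minimal sketch of fitting a Gaussian HMM to the SPY log returns computed above. It assumes the hmmlearn package, which is not used in this notebook, and the choice of 3 regimes is purely illustrative:
```python
# Preview sketch (assumption: hmmlearn is installed; NOT part of this notebook's analysis)
import numpy as np
from hmmlearn.hmm import GaussianHMM

X = log_returns.values.reshape(-1, 1)          # observations must be 2-D: (n_samples, n_features)

model = GaussianHMM(n_components=3,            # 3 hypothetical regimes (e.g. calm / normal / turbulent)
                    covariance_type="full",
                    n_iter=200,
                    random_state=42)
model.fit(X)

states = model.predict(X)                      # most likely regime for each day (Viterbi path)
for k in range(model.n_components):
    mean_pct = (np.exp(model.means_[k, 0]) - 1) * 100     # regime mean, converted from log space
    vol_pct = np.sqrt(model.covars_[k, 0, 0]) * 100       # regime daily volatility (approximation)
    print(f"Regime {k}: mean {mean_pct:+.3f}%/day, vol {vol_pct:.2f}%/day, "
          f"{np.mean(states == k):.0%} of days")
```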
10. Conclusion
We have rigorously proven that log returns are the theoretically correct choice for financial time series analysis:
✓ Five Critical Properties Proven
- Stationarity: Log returns are stationary (pass ADF and KPSS tests), prices are not
- Time Additivity: Multi-period returns simply sum ($R_{total} = R_1 + R_2 + \cdots$)
- Scale Invariance: Valid comparison between $60 and $500 stocks
- Symmetric Treatment: Equal magnitude for gains and losses
- Cross-Asset Comparability: Same statistical properties across all assets
⚠️ Important Caveat: Distribution Properties
- Distribution (with caveat): Log returns are NOT perfectly normal - they exhibit fat tails and skewness
- This is a well-known property of financial data, not a flaw
- Gaussian HMMs are still useful: mixture of Gaussians can approximate non-normal distributions
- Trade-off: simplicity/interpretability vs. perfect distributional fit
Why This Matters for Regime Detection
Hidden Markov Models work with log returns because:
- ✓ Stationary observations (log returns satisfy this perfectly)
- ✓ Scale-invariant features (enables multi-asset models)
- ✓ Consistent statistical properties (valid cross-asset comparison)
- ⚠️ Approximately Gaussian (mixture of multiple Gaussian regimes can handle fat tails)
Key insight: HMMs don’t assume individual returns are normal - they model returns as coming from a mixture of Gaussian distributions (one per regime). This mixture can approximate the fat-tailed, non-normal distribution we observe in real data.
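To see why a mixture of Gaussians can produce the fat tails we just measured, here is a small simulation sketch (the regime parameters below are illustrative choices, not estimates from the data):
```python
# Minimal sketch: a 2-component Gaussian mixture already exhibits excess kurtosis
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
n = 100_000
calm = rng.normal(0.0005, 0.006, n)        # low-volatility "regime" (illustrative parameters)
stressed = rng.normal(-0.001, 0.025, n)    # high-volatility "regime" (illustrative parameters)
in_calm = rng.random(n) < 0.85             # spend ~85% of days in the calm regime
mixture = np.where(in_calm, calm, stressed)

print(f"Excess kurtosis of each component: {kurtosis(calm):.2f}, {kurtosis(stressed):.2f}")  # ~0
print(f"Excess kurtosis of the mixture:    {kurtosis(mixture):.2f}")                          # clearly > 0
```
Each component is exactly normal, yet switching between them yields a fat-tailed aggregate distribution - the same mechanism a regime-switching HMM exploits.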
What’s Next?
Now that we’ve proven log returns are the correct transformation (and understood their distributional properties), the next notebook shows:
How to use log returns with Hidden Markov Models for regime detection
Key Takeaways
Log returns aren’t just a convention - they’re a mathematical necessity.
Properties 1-5 are absolute requirements for statistical modeling.
Normality, by contrast, is violated in practice, but HMMs handle this through regime-switching and mixture modeling.
Any regime detection system that doesn’t use log returns is fundamentally flawed.
Intellectual Honesty Note
This notebook showed that log returns fail normality tests. We didn’t hide this - we proved it statistically.
We then explained why Gaussian HMMs still work: mixture models can approximate non-normal distributions, and much of the non-normality comes from regime-switching itself.
This acknowledgment of limitations is essential for assessing how these models behave on real data.
References:
- Mandelbrot (1963): The variation of certain speculative prices
- Fama (1965): The behavior of stock market prices
- Hamilton (1989): A new approach to the economic analysis of nonstationary time series and the business cycle
- Hamilton (1994): Time Series Analysis
- Campbell, Lo, MacKinlay (1997): The Econometrics of Financial Markets
- Cont (2001): Empirical properties of asset returns: stylized facts and statistical issues
- Tsay (2010): Analysis of Financial Time Series