Dollar Bars and Fractional Differentiation: The Data Your ML Model Actually Needs
Most ML pipelines for financial data start wrong. Not wrong in the model choice or the hyperparameters. Wrong in how they represent the raw data.
I spent months building a quantitative trading system and the two techniques that mattered most were not in the model layer. They were in the data layer. Dollar bars replaced time bars. Fractional differentiation replaced standard returns. Everything downstream improved because the inputs finally respected the statistical properties that ML models assume.
The Problem with Time Bars
A 1-hour OHLCV bar samples at fixed intervals regardless of market activity. At 3 AM UTC, a bar might contain 12 trades across a $2 range. At 12 PM UTC during London-New York overlap, a bar might contain 8,000 trades across a $30 range. Both are treated as equally informative observations.
This violates the IID (independent and identically distributed) assumption that most ML models rely on. The variance of returns per bar is not constant. Serial correlation changes with activity. A model trained on these bars learns a mixture of two different statistical processes: "quiet market" and "active market," blended together in a way that depends on what time zone the training data happens to cover.
The standard fix is to increase the bar interval (use daily bars). This reduces noise but also reduces sample count. For a 6-year dataset, you get around 1,500 daily bars. That is not enough to train a model with more than 5-6 free parameters without overfitting.
Dollar Bars: Sample on Information, Not Time
Dollar bars emit a new bar when cumulative dollar volume (price times traded volume) exceeds a threshold. During quiet periods, a single bar might span 4 hours. During active sessions, you might get 10 bars in 30 minutes.
```python
def generate_dollar_bars(df, threshold="auto", target_bars_per_day=50):
    if threshold == "auto":
        threshold = _auto_threshold(df, target_bars_per_day)
    closes = df["close"].to_numpy()
    volumes = df["volume"].to_numpy()
    bars = []
    cum_dollar_vol, cum_volume = 0.0, 0.0
    bar_open, bar_open_price = df.index[0], closes[0]
    bar_high, bar_low = closes[0], closes[0]
    for i in range(len(df)):
        cum_dollar_vol += closes[i] * volumes[i]
        cum_volume += volumes[i]
        bar_high = max(bar_high, closes[i])
        bar_low = min(bar_low, closes[i])
        if cum_dollar_vol >= threshold:
            bars.append({
                "timestamp": bar_open,
                "open": bar_open_price,
                "high": bar_high,
                "low": bar_low,
                "close": closes[i],
                "volume": cum_volume,
                "dollar_volume": cum_dollar_vol,
            })
            # Reset the accumulators and open the next bar
            cum_dollar_vol, cum_volume = 0.0, 0.0
            if i + 1 < len(df):
                bar_open, bar_open_price = df.index[i + 1], closes[i + 1]
                bar_high, bar_low = closes[i + 1], closes[i + 1]
    return bars
```

The auto threshold divides average daily dollar volume by the target bars per day:
```python
def _auto_threshold(df, target_bars_per_day=50):
    dollar_vol = df["close"] * df["volume"]
    daily_dollar_vol = dollar_vol.resample("D").sum()
    avg_daily_dollar_vol = daily_dollar_vol[daily_dollar_vol > 0].mean()
    return avg_daily_dollar_vol / target_bars_per_day
```

A target of 50 bars per day produces roughly hourly resolution during active sessions and multi-hour bars during quiet periods. Each bar contains approximately the same dollar-weighted information content.
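To see the sampling behaviour concretely, here is a condensed, self-contained sketch (a simplified stand-in for the generator above, not our production code) run on synthetic minute data where activity jumps 50x halfway through:

```python
import numpy as np

def dollar_bars(close, volume, threshold):
    """Condensed dollar-bar sampler: assign each observation a bar id,
    incrementing the id whenever cumulative price*volume crosses threshold."""
    bar_ids = np.zeros(len(close), dtype=int)
    cum, bar = 0.0, 0
    for i in range(len(close)):
        cum += close[i] * volume[i]
        bar_ids[i] = bar
        if cum >= threshold:
            cum = 0.0
            bar += 1
    return bar_ids

# Synthetic minute data: a quiet half and an active half (volume 50x higher).
rng = np.random.default_rng(0)
n = 2880  # two days of minutes
close = 2000 + rng.normal(0, 1, n).cumsum() * 0.1
volume = np.where(np.arange(n) < n // 2, 10.0, 500.0)

bar_ids = dollar_bars(close, volume, threshold=1_000_000)
quiet_bars = len(np.unique(bar_ids[: n // 2]))
active_bars = len(np.unique(bar_ids[n // 2:]))
print(quiet_bars, active_bars)  # far more bars during the active half
```

The quiet half collapses into a handful of multi-hour bars, while the active half produces close to one bar per minute, which is exactly the behaviour that makes the per-bar information content roughly uniform.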
The 20-Year Problem
One catch: dollar volume is not stationary over long periods. Our dataset spans 2003-2026. Daily dollar volume in the asset we trade grew roughly 100x over that period. A fixed threshold produces 5 bars per day in 2003 and 500 per day in 2025.
The fix is adaptive thresholds. We recalibrate quarterly based on a rolling window of recent daily dollar volume:
```python
import pandas as pd

def generate_adaptive_dollar_bars(df, target_bars_per_day=50, recalib_freq="Q"):
    bars = []
    # Recalibrate the threshold on each calendar quarter of data
    for _, period_data in df.groupby(pd.Grouper(freq=recalib_freq)):
        if period_data.empty:
            continue
        threshold = _auto_threshold(period_data, target_bars_per_day)
        bars.extend(generate_dollar_bars(period_data, threshold))
    return bars
```

This maintains roughly uniform bar counts across decades of data. From 7.6 million M1 records, we generated 323,127 adaptive dollar bars.
Empirical Comparison
We tested both representations on the same ML model (XGBoost, identical features and hyperparameters):
| Metric | Time Bars (H1) | Dollar Bars |
|---|---|---|
| Return distribution normality (Jarque-Bera p) | 0.000 | 0.003 |
| Serial correlation (Ljung-Box p) | 0.000 | 0.12 |
| Variance ratio (closer to 1.0 = better) | 0.76 | 0.91 |
Dollar bars are not perfectly IID, but they are substantially closer. The serial correlation test is the big one. Standard time bars have extreme serial correlation (p < 0.001), which means the model is partially learning temporal artifacts rather than genuine patterns.
The Differentiation Problem
Prices are non-stationary. You cannot feed raw prices into an ML model and expect it to generalise. The distribution shifts over time. A model trained on prices between $1,800 and $2,000 has no idea what to do when prices hit $3,000.
The standard fix is to take returns: r_t = (p_t - p_{t-1}) / p_{t-1}. This makes the series stationary but destroys long-range memory. The price level, the trend, the distance from a prior high, all of that information is gone. Each return is essentially independent of what happened 50 or 100 bars ago.
This matters because financial markets exhibit long-range dependence. A trend that has been running for 200 bars carries different information than a trend that started 5 bars ago. Returns cannot distinguish between these.
Fractional Differentiation: The Middle Ground
Instead of differencing with d=1 (standard returns), use a fractional d between 0 and 1. The smaller the d, the more memory is preserved. The larger the d, the more stationary the result.
The weights are computed recursively:
w_0 = 1
w_k = -w_{k-1} * (d - k + 1) / k
For d=0.4, the first few weights are: 1.0, -0.4, -0.12, -0.064, -0.042, ... They decay but never reach zero. A fixed-width window truncates when |w_k| < 1e-4, giving a lookback of roughly 280 bars at d=0.4.
```python
import numpy as np
import pandas as pd

def frac_diff_ffd(series, d=0.4, threshold=1e-4):
    weights = _get_weights_ffd(d, threshold)  # truncated weights, oldest first
    width = len(weights)
    values = series.to_numpy(dtype=float)
    result = np.full(len(values), np.nan)  # first width-1 entries stay NaN
    for i in range(width - 1, len(values)):
        window = values[i - width + 1 : i + 1]
        result[i] = np.dot(weights, window)
    return pd.Series(result, index=series.index)
```

Finding the Right d
The goal is the minimum d where the ADF (Augmented Dickey-Fuller) test rejects the unit root hypothesis at p < 0.05.
```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def find_min_d(series, d_range=np.arange(0.0, 1.05, 0.05)):
    for d in d_range:
        ffd = frac_diff_ffd(np.log(series), d, threshold=1e-4).dropna()
        adf_stat, pvalue, *_ = adfuller(ffd, maxlag=1)
        corr = np.corrcoef(series[-len(ffd):], ffd)[0, 1]
        if pvalue < 0.05:
            return d, corr  # minimum d achieving stationarity
    return None, None
```

For the asset we tested, d* landed between 0.35 and 0.45. At d=0.4:
| Metric | Returns (d=1.0) | FFD (d=0.4) | Raw Prices (d=0.0) |
|---|---|---|---|
| ADF p-value | < 0.001 | 0.02 | 0.89 |
| Correlation with original | 0.02 | 0.87 | 1.00 |
| Lookback memory | 1 bar | ~280 bars | Infinite |
Returns are stationary but memoryless. Raw prices have full memory but are non-stationary. FFD at d=0.4 is stationary AND preserves 87% correlation with the original series. The ML model can see both the local dynamics and the longer-term context.
CUSUM Event Filter
One more piece that connects dollar bars and FFD: the CUSUM (Cumulative Sum) filter for event detection. Instead of predicting at every bar (which generates a lot of noise and many zero-information labels), we only predict when something interesting happens.
The filter accumulates deviations from the expected return and triggers when the cumulative deviation exceeds a threshold:
S_t = S_{t-1} + |y_t - E[y_t]|
Trigger when S_t > h, then reset S_t = 0
We set h = 0.5 * daily volatility, which produces 50-100 events per day. Each event gets a triple barrier label (profit target, stop loss, or time expiry). The model only trains and predicts on these events, not on every single bar.
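A minimal sketch of this filter, assuming E[y_t] = 0 and a rough sqrt(1440) minute-to-day volatility scaling (both simplifying assumptions, not our production calibration):

```python
import numpy as np

def cusum_events(returns, h):
    """CUSUM event filter: accumulate absolute deviations from the expected
    return; record an event and reset the running sum when it exceeds h."""
    events = []
    s = 0.0
    for i, r in enumerate(returns):
        s += abs(r)  # deviation from E[y_t] = 0 (assumption)
        if s > h:
            events.append(i)
            s = 0.0
    return events

# Synthetic minute returns; h = 0.5 * daily volatility as in the text.
rng = np.random.default_rng(7)
minute_vol = 0.001
returns = rng.normal(0, minute_vol, 10_000)
h = 0.5 * minute_vol * np.sqrt(1440)
events = cusum_events(returns, h)
```

Because the sum is reset after each trigger, the event rate scales with realised volatility: calm stretches produce few events, volatile stretches produce many.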
This is the full pre-processing chain: M1 ticks to dollar bars to FFD features to CUSUM-filtered events to triple barrier labels. By the time data reaches the ML model, it is approximately IID, stationary, memory-preserving, and sampled at information-driven intervals.
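The triple barrier labels mentioned above can be sketched as follows; the barrier widths and holding period here are illustrative placeholders, not our production parameters:

```python
import numpy as np

def triple_barrier_label(prices, event_idx, pt=0.01, sl=0.01, max_hold=50):
    """Label one event: +1 if the profit target is hit first, -1 if the
    stop loss is hit first, 0 if the holding period expires untouched.
    pt and sl are fractional moves from the entry price."""
    entry = prices[event_idx]
    end = min(event_idx + max_hold, len(prices) - 1)
    for i in range(event_idx + 1, end + 1):
        ret = prices[i] / entry - 1.0
        if ret >= pt:
            return 1   # upper barrier (profit target)
        if ret <= -sl:
            return -1  # lower barrier (stop loss)
    return 0           # vertical barrier (time expiry)

prices = np.array([100.0, 100.2, 100.5, 101.2, 100.8])
label = triple_barrier_label(prices, 0, pt=0.01, sl=0.01, max_hold=4)
print(label)  # 1: the +1% target is hit at 101.2 before the stop or expiry
```

In practice the barrier widths are usually scaled by recent volatility so that labels stay comparable across regimes.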
Does It Actually Help?
Honestly, these techniques did not save our ML direction predictor from failing (I wrote about that failure separately). The model failed for reasons that no amount of data engineering can fix: regime shifts that invalidate historical patterns.
But the data engineering was not wasted. When we pivoted to regime detection (HMM on dollar bar features), the model's state classification was substantially cleaner than on time bars. When we tested relative value strategies, the FFD features provided trend context that raw returns could not. And when we built the risk management overlays, the CUSUM filter correctly identified volatility regime changes 2-3 bars earlier than fixed-interval detection.
The lesson is that data representation matters independently of what you do with it downstream. Getting the inputs right does not guarantee the model works. But getting them wrong guarantees it does not.
If you are feeding hourly OHLCV bars and standard returns into a gradient boosted model and wondering why it does not generalise, the answer is probably not in the model. It is in the bars.