We Built a 65-Feature ML Pipeline and It Lost 89%
I want to tell you about the most expensive lesson I have learned building AI systems. Not expensive in dollars (though it would have been, had we gone live). Expensive in assumptions it destroyed.
We built a proper ML pipeline for trading commodity futures. Not a weekend hack. Months of research across 50+ open-source repos, 40+ textbooks (Lopez de Prado, Chan, Jansen), and 15+ research papers. The pipeline had fractionally differentiated features, triple barrier labels, purged cross-validation with embargo, sequential bootstrap sample weights, XGBoost with SHAP interpretability, and a meta-labeling layer for bet sizing.
The model achieved 55% directional accuracy on out-of-sample data.
It lost 89%.
Buy-and-hold, the strategy that requires zero intelligence, gained 258% over the same period.
The Pipeline (What We Built)
I will walk through the full stack because the failure is only instructive if you understand the sophistication of what failed.
Dollar Bars Instead of Time Bars
Standard OHLCV bars sample at fixed time intervals (1-minute, 1-hour, daily). This violates IID assumptions because market activity is not uniformly distributed across time. A bar at 3 AM contains almost no information. A bar during London open contains a lot.
Dollar bars emit a new bar when cumulative dollar volume exceeds a threshold:
bars = []
cum_dollar_vol = cum_volume = 0.0
bar_open = bar_open_price = bar_high = bar_low = None

for i in range(len(df)):
    price = closes[i]
    if bar_open is None:  # start a new bar
        bar_open, bar_open_price = timestamps[i], price
        bar_high = bar_low = price
    bar_high = max(bar_high, price)
    bar_low = min(bar_low, price)
    cum_volume += volumes[i]
    cum_dollar_vol += price * volumes[i]
    if cum_dollar_vol >= threshold:
        bars.append({
            "timestamp": bar_open,
            "open": bar_open_price,
            "high": bar_high,
            "low": bar_low,
            "close": price,
            "volume": cum_volume,
        })
        cum_dollar_vol = cum_volume = 0.0  # reset for the next bar
        bar_open = None

We calibrated the threshold to produce roughly 50 bars per day, recalibrating quarterly to account for changes in market volume across a 20-year dataset.
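The calibration step can be sketched like this. The helper name `calibrate_threshold` and the choice of the median daily dollar volume are our illustration, not the production code:

```python
import numpy as np

def calibrate_threshold(closes, volumes, day_ids, bars_per_day=50):
    """Choose a dollar-volume threshold that emits ~bars_per_day bars per day."""
    daily = {}
    for day, c, v in zip(day_ids, closes, volumes):
        daily[day] = daily.get(day, 0.0) + c * v  # dollar volume traded that day
    # Median is robust to outlier days; recalibrate quarterly as volume drifts
    return float(np.median(list(daily.values()))) / bars_per_day
```

Recalibrating quarterly then just means re-running this over the most recent quarter's data.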
Fractional Differentiation (FFD)
Raw prices are non-stationary. Standard differencing (returns) makes them stationary but destroys long-range memory. Fractional differentiation finds the minimum differencing parameter d that achieves stationarity (ADF test, p < 0.05) while preserving as much memory as possible.
import numpy as np

def _get_weights_ffd(d, threshold):
    # Weights: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, truncated at |w_k| < threshold
    w, k = [1.0], 1
    while abs(w[-1] * (d - k + 1) / k) >= threshold:
        w.append(-w[-1] * (d - k + 1) / k)
        k += 1
    return np.array(w[::-1])  # oldest observation first

def frac_diff_ffd(series, d=0.4, threshold=1e-4):
    values = np.asarray(series, dtype=float)
    weights = _get_weights_ffd(d, threshold)
    width = len(weights)
    result = np.full(len(values), np.nan)  # not enough history for the first bars
    for i in range(width - 1, len(values)):
        window = values[i - width + 1: i + 1]
        result[i] = np.dot(weights, window)
    return result

We found the minimum d for our asset was around 0.35-0.45, preserving a correlation of 0.8-0.95 with the original series while passing the ADF test.
Triple Barrier Labels
Instead of binary up/down classification, we used Lopez de Prado's triple barrier method. Three barriers define the outcome of each trade:
- Upper barrier: Price rises by 1.5x volatility (profit target hit)
- Lower barrier: Price falls by 1.0x volatility (stop loss hit)
- Vertical barrier: 3 trading days pass (time expired, no edge)
The asymmetric profit/loss targets (1.5 vs 1.0) were calibrated to the asset's volatility skew. Labels: +1 (upper hit first), -1 (lower hit first), 0 (time expired).
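A minimal per-trade sketch of the labeling, assuming a per-bar volatility estimate `vol` and one bar per trading day (the function name and signature are ours):

```python
def triple_barrier_label(closes, entry, vol, pt_mult=1.5, sl_mult=1.0, horizon=3):
    """Label one trade entered at index `entry`: +1 profit target, -1 stop, 0 time-out."""
    entry_price = closes[entry]
    upper = entry_price * (1 + pt_mult * vol)    # profit target: 1.5x volatility
    lower = entry_price * (1 - sl_mult * vol)    # stop loss: 1.0x volatility
    end = min(entry + horizon, len(closes) - 1)  # vertical barrier: 3 bars out
    for i in range(entry + 1, end + 1):
        if closes[i] >= upper:
            return 1
        if closes[i] <= lower:
            return -1
    return 0  # time expired, no edge
```

Whichever barrier is touched first determines the label, which is what makes the labels path-dependent rather than a simple forward return sign.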
Purged K-Fold Cross-Validation
Standard k-fold CV on time series leaks future information because labels overlap in time. If a training sample's label extends into the test period, you are training on the future.
We implemented purged CV with a 1% embargo buffer:
# PURGE: remove training samples whose labels overlap the test period
purge_mask = [False] * len(train_idx)
for i, ti in enumerate(train_idx):
    label_end = t1.loc[X.index[ti]]  # time at which this sample's label resolves
    if label_end >= test_start_time:
        purge_mask[i] = True  # remove this sample
train_idx = [ti for ti, purged in zip(train_idx, purge_mask) if not purged]

# EMBARGO: buffer zone after the test set
embargo_end = min(test_end + embargo, n)
train_idx = [i for i in train_idx if i not in range(test_end, embargo_end)]

XGBoost + Meta-Labeling
The primary model (XGBoost, 500 trees, max depth 4, heavy L1/L2 regularization) predicted direction. A secondary Random Forest predicted whether to take the trade at all. The secondary model's probability mapped to position size via a half-Kelly formula:
p = P(correct) from secondary model
z = (p - 0.5) / sqrt(p * (1-p))
size = base_size * (2 * Phi(z) - 1) / 2
This is textbook two-model bet sizing. Decouple "which direction" from "how confident."
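The sizing formula above, written out in stdlib Python (Phi computed via `math.erf`; the zero-size cutoff at p = 0.5 is our reading of the formula, since there is no edge below it):

```python
from math import erf, sqrt

def bet_size(p, base_size=1.0):
    """Map the secondary model's P(correct) to a position size, halved (half-Kelly style)."""
    if p <= 0.5:
        return 0.0  # no edge: no trade
    z = (p - 0.5) / sqrt(p * (1 - p))
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF
    return base_size * (2 * phi - 1) / 2
```

The mapping is deliberately concave: going from 60% to 90% confidence scales the position up far more than going from 50% to 60%.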
65 Features Across 8 Categories
Price features (FFD-transformed), technical indicators (EMA ratios, RSI, MACD, Bollinger, ATR, ADX), microstructure (Kyle's Lambda, Amihud illiquidity, VPIN), structural (Hurst exponent, Shannon entropy, variance ratio), macro (real yields, DXY, VIX, fed funds), sentiment (finBERT on headlines, COT positioning), and calendar features (session flags, event days, sin/cos hour encoding).
Feature importance via three-method intersection: MDI, MDA (permutation), and SFI. Only features ranking top-15 in at least two methods survived.
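The intersection rule is simple enough to sketch directly (the function name and dict layout are ours):

```python
from collections import Counter

def surviving_features(rankings, top_k=15, min_methods=2):
    """Keep features that rank in the top_k of at least min_methods importance methods.

    rankings: dict of method name -> feature names sorted best-first,
    e.g. {"MDI": [...], "MDA": [...], "SFI": [...]}.
    """
    counts = Counter()
    for ranked in rankings.values():
        counts.update(ranked[:top_k])  # one vote per method a feature places in
    return {feat for feat, n in counts.items() if n >= min_methods}
```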
Where It All Went Wrong
The model's SHAP analysis showed it learned real economics. Real yield was the #1 feature at 18.3% importance. DXY momentum was #2 at 12.1%. These are genuine drivers. The model was not fitting noise. It understood the fundamental relationships.
But it had a structural sell bias. 84% of its predictions were short.
Here is why: the training data spanned 2014-2023, a period where the asset was range-bound to mildly bearish. 56% of training labels were sells. The model learned that "the default state is down or sideways." That was true for that period.
The test period was 2024-2026, one of the strongest bull runs in the asset's history. The fundamental relationships themselves changed. Real yields were positive (historically bearish for this asset) but prices rallied anyway, driven by central bank buying and geopolitical flows that the model had never seen.
Being short 84% of the time during a bull market that returned 258% produces a return of roughly -89%.
The meta-labeling layer made it worse. It was trained on the primary model's in-sample predictions, which were 70%+ correct almost by construction. So it learned "always trade." The confidence filter that was supposed to keep us out of low-conviction trades instead said "yes" to everything.
What This Taught Us
1. Accuracy is not profitability
55% directional accuracy sounds good. In a regime where 56% of labels are sells and the market goes up 258%, it is catastrophic. Your 55% correct predictions are mostly correct shorts during a brief dip. Your 45% wrong predictions are wrong shorts during the sustained rally.
2. Backtests from one regime do not generalise to another
The model learned "real yield up means price down." That relationship held for a decade. Then it broke. Not because the model was wrong about the correlation. Because new, larger forces (sovereign buying, de-dollarisation) overwhelmed the historical pattern.
Walk-forward testing caught this. In-sample Sharpe was 1.4. Out-of-sample Sharpe was -0.3. If we had only looked at in-sample results, we would have deployed this.
3. Meta-labeling can amplify failure
The secondary model is supposed to add a precision filter on top of a high-recall primary. But when the primary model has a structural directional bias, the secondary model just learns to agree with it. You end up with two models confidently doing the wrong thing.
4. Complexity is not a feature
We had 65 features, purged CV, sequential bootstrap, meta-labeling. None of it saved us from a regime shift. A simple 50/200 EMA crossover (go flat when 50 EMA is below 200 EMA) delivered 83% of buy-and-hold return with 56% of the drawdown. One rule. Two inputs.
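For contrast, the entire baseline fits in a few lines (a sketch with pandas; the function name is ours):

```python
import pandas as pd

def ema_crossover_positions(closes, fast=50, slow=200):
    """Long when the fast EMA is above the slow EMA, flat otherwise: the one-rule baseline."""
    c = pd.Series(closes, dtype=float)
    fast_ema = c.ewm(span=fast, adjust=False).mean()
    slow_ema = c.ewm(span=slow, adjust=False).mean()
    return (fast_ema > slow_ema).astype(int)  # 1 = long, 0 = flat
```

One rule, two inputs, and nothing for a regime shift to invalidate except the trend itself.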
What Actually Worked
After this failure, we pivoted completely. Instead of predicting direction, we focused on three things:
Regime detection (HMM): A 3-state Hidden Markov Model correctly classified bull, range, and bear periods. We used it not to predict direction but to adjust position sizing and risk parameters.
Relative value signals: Instead of asking "will price go up?", we asked "is this asset outperforming or underperforming a correlated asset?" Momentum in the ratio between two correlated instruments turned out to be the only signal that beat buy-and-hold out of sample (Sharpe 1.07 vs 0.96), holding up across walk-forward and Monte Carlo validation, with 92% of parameter combinations profitable.
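The ratio-momentum idea reduces to a few lines (a sketch; the lookback and the -1/0/+1 signal encoding are our illustration):

```python
import numpy as np

def ratio_momentum_signal(asset_a, asset_b, lookback=20):
    """Relative-value signal: long A / short B when the A/B ratio has positive momentum."""
    log_ratio = np.log(np.asarray(asset_a, dtype=float) / np.asarray(asset_b, dtype=float))
    signal = np.zeros(len(log_ratio))
    for i in range(lookback, len(log_ratio)):
        momentum = log_ratio[i] - log_ratio[i - lookback]
        signal[i] = 1 if momentum > 0 else (-1 if momentum < 0 else 0)
    return signal  # first `lookback` bars stay flat: no history yet
```

Because the signal is on the ratio, it is insulated from the level of the market: a regime shift that lifts both instruments together mostly cancels out.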
Risk management overlays: The simplest addition. Reduce position when trend structure breaks. This does not improve returns much. It reduces maximum drawdown from 21% to 13%. The risk-adjusted return (Sharpe per unit of max drawdown) improved by 30%.
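A sketch of the overlay; the specific "trend break" trigger here (fast EMA dropping below slow EMA) and the size levels are our assumptions for illustration:

```python
import pandas as pd

def risk_overlay(closes, base_size=1.0, reduced_size=0.25, fast=50, slow=200):
    """Scale the position down when trend structure breaks (here: fast EMA < slow EMA)."""
    c = pd.Series(closes, dtype=float)
    fast_ema = c.ewm(span=fast, adjust=False).mean()
    slow_ema = c.ewm(span=slow, adjust=False).mean()
    return (fast_ema >= slow_ema).map({True: base_size, False: reduced_size})
```

Note that the overlay multiplies whatever position the signal produces; it never flips direction, which is why it cuts drawdown without much changing returns.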
The Uncomfortable Conclusion
The ML pipeline was not wrong. It was solving the wrong problem. Direction prediction in trending markets is a losing proposition because regime shifts invalidate historical patterns, and you cannot know you are in a new regime until it is too late.
The things that actually work are boring. Trend following. Risk management. Relative value. Position sizing. None of them require a 65-feature XGBoost model. They require discipline and honest validation.
I still use ML in the system. For regime classification (not prediction), parameter optimization (Bayesian, daily batch), and anomaly detection (autoencoders on feature distributions). But never for the signal itself.
A 2-3% R-squared on return prediction is genuinely valuable at scale. But a model that confidently predicts the wrong direction is worse than no model at all.
If your backtest looks too good, it probably is. If your model agrees with itself across all validation methods, check whether your validation methods share the same blind spot.