Pro Feature. Requires a Pro or Ultra subscription. Get started at api.mathematicalcompany.com

Horizon Backtesting

Horizon provides a full backtesting engine via hz.backtest(). It uses the same pipeline, risk engine, and paper exchange as live trading. Your strategy code runs identically in both modes.
By default, backtesting uses mid-price matching against the paper exchange. For higher-fidelity simulation, you can enable L2 orderbook replay, probabilistic fill models, market impact, and latency simulation. All matching logic runs in Rust.

Quick Start

import horizon as hz

def model(ctx):
    return ctx.feed.price * 1.02

def quoter(ctx, fair):
    if fair > ctx.feed.price:
        return hz.quotes(ctx.feed.price, spread=0.04, size=10)

result = hz.backtest(
    name="simple-backtest",
    markets=["my-market"],
    data=[
        {"timestamp": 1000, "price": 0.50},
        {"timestamp": 1001, "price": 0.52},
        {"timestamp": 1002, "price": 0.48},
        {"timestamp": 1003, "price": 0.55},
        {"timestamp": 1004, "price": 0.60},
    ],
    pipeline=[model, quoter],
)

print(result.summary())

hz.backtest() Signature

hz.backtest(
    name: str = "backtest",
    markets: list[str] = ["market"],
    data = None,
    feeds = None,
    pipeline: list[Callable] = [...],
    risk = None,
    params = None,
    paper_fee_rate: float = 0.001,
    paper_maker_fee_rate: float | None = None,
    paper_taker_fee_rate: float | None = None,
    initial_capital: float = 1000.0,
    outcomes: dict[str, float] | None = None,
    # L2 Book Simulation
    book_data: dict[str, list[dict]] | None = None,
    fill_model: str = "deterministic",
    fill_model_params: dict[str, float] | None = None,
    impact_temporary_bps: float = 0.0,
    impact_permanent_fraction: float = 0.0,
    latency_ms: float = 0.0,
    rng_seed: int | None = None,
)
| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Name for this backtest run |
| markets | list[str] | Market IDs to simulate |
| data | various | Historical data (see formats below) |
| feeds | dict | Feed name mapping |
| pipeline | list[Callable] | Same pipeline functions as hz.run() |
| risk | RiskConfig | Risk configuration (same as live) |
| params | dict | Strategy parameters passed to the pipeline |
| paper_fee_rate | float | Fee rate applied to paper fills (default 0.1%) |
| paper_maker_fee_rate | float or None | Maker fee rate (overrides paper_fee_rate for maker fills) |
| paper_taker_fee_rate | float or None | Taker fee rate (overrides paper_fee_rate for taker fills) |
| initial_capital | float | Starting capital (default $1000) |
| outcomes | dict[str, float] | Market outcomes for Brier score (0.0 or 1.0 per market) |
| book_data | dict | L2 orderbook snapshots per market (see below) |
| fill_model | str | "deterministic", "probabilistic", or "glft" |
| fill_model_params | dict | Fill model parameters (see below) |
| impact_temporary_bps | float | Temporary market impact in basis points |
| impact_permanent_fraction | float | Fraction of temporary impact that persists (0-1) |
| latency_ms | float | Simulated order-to-fill latency in milliseconds |
| rng_seed | int | Random seed for stochastic fill models |

Data Formats

Horizon accepts historical data in multiple formats. The simplest is a list of dicts, where each dict represents one tick:
data = [
    {"timestamp": 1700000000, "price": 0.55, "bid": 0.54, "ask": 0.56, "volume": 100},
    {"timestamp": 1700000001, "price": 0.56, "bid": 0.55, "ask": 0.57, "volume": 150},
    {"timestamp": 1700000002, "price": 0.54, "bid": 0.53, "ask": 0.55, "volume": 80},
]

result = hz.backtest(
    markets=["my-market"],
    data=data,
    pipeline=[model, quoter],
)
Required fields: timestamp, price. Optional: bid, ask, volume.
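If you assemble tick data by hand, a quick sanity check before running the backtest can save a confusing debugging session. This helper is not part of the horizon API; it is a sketch that assumes prediction-market prices lie in (0, 1):

```python
# Hypothetical helper (not part of the horizon API): sanity-check tick
# dicts before handing them to hz.backtest().
REQUIRED = ("timestamp", "price")
OPTIONAL = ("bid", "ask", "volume")

def validate_ticks(ticks):
    """Return the ticks sorted by timestamp, raising on malformed rows."""
    for i, tick in enumerate(ticks):
        missing = [k for k in REQUIRED if k not in tick]
        if missing:
            raise ValueError(f"tick {i} missing required field(s): {missing}")
        # Prediction-market prices are probabilities (an assumption here).
        if not 0.0 < tick["price"] < 1.0:
            raise ValueError(f"tick {i} price {tick['price']} outside (0, 1)")
    return sorted(ticks, key=lambda t: t["timestamp"])

ticks = validate_ticks([
    {"timestamp": 1700000001, "price": 0.56},
    {"timestamp": 1700000000, "price": 0.55, "bid": 0.54, "ask": 0.56},
])
print([t["timestamp"] for t in ticks])  # → [1700000000, 1700000001]
```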

L2 Orderbook Simulation

For realistic prediction market backtesting, replay historical L2 orderbook snapshots. Orders are matched by walking the book at each tick, not at a single mid-price.

Book Data Format

Pass book_data as a dict mapping market IDs to lists of orderbook snapshots:
book_data = {
    "election-winner": [
        {
            "timestamp": 1700000000,
            "bids": [(0.54, 100), (0.53, 200), (0.52, 500)],
            "asks": [(0.56, 100), (0.57, 200), (0.58, 500)],
        },
        {
            "timestamp": 1700000001,
            "bids": [(0.55, 150), (0.54, 250)],
            "asks": [(0.57, 150), (0.58, 250)],
        },
    ],
}

result = hz.backtest(
    data=tick_data,
    pipeline=[my_strategy],
    book_data=book_data,
)
Each snapshot has timestamp (float), bids (a list of (price, size) tuples in descending price order), and asks (in ascending price order). Book state carries forward between snapshots. When book_data is provided, the engine automatically switches to the BookSim exchange, which walks the L2 book to fill orders. BookSim supports split maker/taker fees via paper_maker_fee_rate and paper_taker_fee_rate, computing the mid from the best bid and ask to determine each fill's maker/taker status.
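The maker/taker split can be illustrated in a few lines of plain Python. This is not the BookSim implementation (that lives in Rust); it is only a sketch of a mid-based classification, and the exact rule at the mid itself is an assumption:

```python
# Illustrative sketch of mid-based maker/taker classification.
def classify_fill(order_price, side, best_bid, best_ask):
    """An order priced passively relative to mid counts as a maker fill."""
    mid = (best_bid + best_ask) / 2
    if side == "buy":
        return "maker" if order_price <= mid else "taker"
    return "maker" if order_price >= mid else "taker"

print(classify_fill(0.54, "buy", best_bid=0.54, best_ask=0.56))  # → maker
print(classify_fill(0.56, "buy", best_bid=0.54, best_ask=0.56))  # → taker
```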

Fill Models

Control how realistically orders are filled against the book.
The default model is deterministic: orders fill whenever their price crosses a book level, at a 100% fill rate per level.
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    fill_model="deterministic",
)
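What "walking the book" means for a buy order under the deterministic model can be sketched in a few lines (illustrative only; the production matching logic runs in Rust):

```python
# Sketch of deterministic matching: a buy order consumes ask levels in
# ascending price order until its size is exhausted.
def walk_asks(asks, size):
    """Fill `size` against ascending (price, depth) ask levels.
    Returns (filled_quantity, average_fill_price)."""
    filled, notional = 0.0, 0.0
    for price, depth in asks:
        take = min(depth, size - filled)
        filled += take
        notional += take * price
        if filled >= size:
            break
    return filled, (notional / filled if filled else 0.0)

asks = [(0.56, 100), (0.57, 200), (0.58, 500)]
filled, avg = walk_asks(asks, 150)
print(filled, round(avg, 4))  # → 150.0 0.5633
```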

Market Impact

Simulate price impact from your own orders. Both temporary (during fill) and permanent (persists after fill) impact are supported.
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    impact_temporary_bps=5.0,           # 5 bps temporary impact per unit
    impact_permanent_fraction=0.3,      # 30% of temporary impact persists
)
| Parameter | Description |
| --- | --- |
| impact_temporary_bps | Price shift per unit during the book walk (basis points). The effective fill price worsens as you consume more depth. |
| impact_permanent_fraction | Fraction of temporary impact that becomes permanent. Shifts the book for subsequent ticks. |
How it works: When your buy order walks the ask side, each level’s effective price increases by filled_so_far * temporary_bps / 10000. After the fill, total_notional * permanent_fraction * temporary_bps / 10000 is added as a persistent book displacement.
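The arithmetic above can be made concrete with a small sketch that applies both expressions during a simulated buy (illustrative, not the engine's code):

```python
# Worked sketch of the impact formulas described above.
def impacted_buy(asks, size, temporary_bps=5.0, permanent_fraction=0.3):
    """Walk ascending (price, depth) ask levels with temporary impact,
    then compute the permanent book displacement left behind."""
    filled, notional = 0.0, 0.0
    for price, depth in asks:
        # Effective price worsens with the quantity already consumed.
        eff_price = price + filled * temporary_bps / 10_000
        take = min(depth, size - filled)
        filled += take
        notional += take * eff_price
        if filled >= size:
            break
    # Persistent displacement applied to the book for subsequent ticks.
    permanent_shift = notional * permanent_fraction * temporary_bps / 10_000
    return notional / filled, permanent_shift

avg_price, shift = impacted_buy([(0.56, 100), (0.57, 200)], 150)
print(round(avg_price, 4), round(shift, 5))  # → 0.58 0.01305
```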

Latency Simulation

Simulate the delay between order submission and arrival at the exchange:
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    latency_ms=50.0,     # 50ms order latency
)
Latency is converted to ticks based on the average tick interval in your data. Orders enter a pending queue and only become active after the specified delay. This models the real-world effect of network latency on fill rates.
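The conversion can be sketched as follows; the rounding and the minimum of one tick are assumptions here, not documented behavior:

```python
# Sketch of latency-to-ticks conversion based on the average tick interval.
def latency_ticks(timestamps, latency_ms):
    """Convert a millisecond delay into a whole number of data ticks."""
    spans = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg_interval_s = sum(spans) / len(spans)
    # Assume a pending order is delayed by at least one tick.
    return max(1, round((latency_ms / 1000.0) / avg_interval_s))

# 1-second ticks with 50 ms latency delay an order by one tick.
print(latency_ticks([1000, 1001, 1002, 1003], latency_ms=50.0))  # → 1
```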

Calibration Analytics

Evaluate prediction accuracy with Rust-powered analytics. Available when outcomes are provided.

Calibration Curve

from horizon._horizon import calibration_curve

# predictions: your entry prices (probability estimates)
# outcomes: actual results (0.0 or 1.0)
result = calibration_curve(
    predictions=[0.3, 0.7, 0.9, 0.1, 0.6, 0.8],
    outcomes=[0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    n_bins=5,
)

print(f"Brier Score: {result.brier_score:.4f}")
print(f"Log Loss:    {result.log_loss:.4f}")
print(f"ECE:         {result.ece:.4f}")     # Expected Calibration Error

# Bins: (bin_center, actual_frequency, count)
for center, freq, count in result.bins:
    print(f"  Predicted ~{center:.1%}: Actual {freq:.1%} (n={count})")
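For reference, these statistics follow the standard definitions. A pure-Python sketch (the real computation is in Rust, and the bin-assignment convention here is an assumption):

```python
# Sketch of the Brier score and Expected Calibration Error definitions.
def brier(preds, outs):
    """Mean squared error between probability forecasts and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(preds, outs)) / len(preds)

def ece(preds, outs, n_bins=5):
    """Depth-weighted |avg prediction - avg outcome| across equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(preds, outs):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))
    err = 0.0
    for b in bins:
        if b:
            avg_p = sum(p for p, _ in b) / len(b)
            avg_o = sum(o for _, o in b) / len(b)
            err += len(b) / len(preds) * abs(avg_p - avg_o)
    return err

preds = [0.3, 0.7, 0.9, 0.1, 0.6, 0.8]
outs = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
print(round(brier(preds, outs), 4))  # → 0.1667
```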

Log Loss

from horizon._horizon import log_loss

ll = log_loss(
    predictions=[0.7, 0.3, 0.9],
    outcomes=[1.0, 0.0, 1.0],
)
print(f"Log Loss: {ll:.4f}")  # Lower is better

Edge Decay

Measure how your edge decays as events approach resolution:
from horizon._horizon import edge_decay

result = edge_decay(
    entry_prices=[0.45, 0.55, 0.40, 0.60],
    outcomes=[1.0, 1.0, 0.0, 0.0],
    entry_ts=[1000.0, 2000.0, 3000.0, 4000.0],
    resolution_ts=[5000.0, 5000.0, 5000.0, 5000.0],
    n_buckets=10,
)

print(f"Edge half-life: {result.half_life_hours:.1f} hours")
for hours, avg_edge in result.decay_curve:
    print(f"  {hours:.0f}h before resolution: {avg_edge:.4f} avg edge")

Walk-Forward Optimization

Avoid overfitting with rolling out-of-sample testing. The walk_forward() function splits your data into train/test windows, runs grid search on each training window, and evaluates the best parameters on the held-out test window.
from horizon.walkforward import walk_forward

def pipeline_factory(params):
    """Create a pipeline from parameter dict."""
    spread = params["spread"]
    size = params["size"]

    def quoter(ctx):
        fair = ctx.feed.price
        return hz.quotes(fair=fair, spread=spread, size=size)

    return [quoter]

result = walk_forward(
    data=tick_data,
    pipeline_factory=pipeline_factory,
    param_grid={
        "spread": [0.02, 0.04, 0.06, 0.08],
        "size": [5, 10, 20],
    },
    n_splits=5,
    train_ratio=0.7,
    expanding=True,          # Anchored expanding window
    objective="sharpe_ratio", # Optimize for Sharpe
    purge_gap=3600.0,        # 1 hour purge between train/test
)

# Per-window results
for i, (window, params) in enumerate(zip(result.windows, result.best_params_per_window)):
    test = result.test_results[i]
    print(f"Window {i}: best params={params}, OOS Sharpe={test.metrics.sharpe_ratio:.3f}")

# Aggregate out-of-sample performance
m = result.aggregate_metrics
print(f"\nAggregate OOS: Return={m.total_return_pct:.2%}, Sharpe={m.sharpe_ratio:.3f}")

walk_forward() Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| data | various | required | Same formats as hz.backtest() |
| pipeline_factory | Callable | required | params_dict -> pipeline |
| param_grid | dict | required | {param_name: [values]} for grid search |
| n_splits | int | 5 | Number of train/test splits |
| train_ratio | float | 0.7 | Fraction used for training |
| expanding | bool | True | Anchored expanding (True) or rolling (False) |
| objective | str | "sharpe_ratio" | Metric to optimize |
| purge_gap | float | 0.0 | Seconds to purge between train/test |
| markets | list[str] | None | Passed to backtest |
| risk | RiskConfig | None | Passed to backtest |
| initial_capital | float | 1000.0 | Starting capital |
All additional **kwargs are passed through to each backtest() call (e.g., fill_model, impact_temporary_bps).
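To build intuition for how train/test windows are laid out, here is a hypothetical sketch of expanding splits with a purge gap. The exact boundary arithmetic inside horizon.walkforward may differ:

```python
# Illustrative layout of walk-forward windows (not horizon's actual code).
def split_windows(t0, t1, n_splits, train_ratio, purge_gap=0.0, expanding=True):
    """Yield (train_start, train_end, test_start, test_end) per split."""
    span = (t1 - t0) / n_splits
    windows = []
    for i in range(n_splits):
        test_end = t0 + (i + 1) * span
        test_start = test_end - span * (1 - train_ratio)
        train_end = test_start - purge_gap  # purge leakage near the boundary
        train_start = t0 if expanding else test_end - span
        windows.append((train_start, train_end, test_start, test_end))
    return windows

for w in split_windows(0.0, 100.0, n_splits=2, train_ratio=0.7, purge_gap=1.0):
    print(tuple(round(x, 1) for x in w))
```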

WalkForwardResult

| Field | Type | Description |
| --- | --- | --- |
| windows | list[WalkForwardWindow] | Train/test time boundaries |
| best_params_per_window | list[dict] | Optimal parameters per window |
| test_results | list[BacktestResult] | Out-of-sample results per window |
| aggregate_equity | list[tuple] | Chained OOS equity curve |
| aggregate_metrics | Metrics | Combined OOS performance metrics |

BacktestResult

hz.backtest() returns a BacktestResult object with full analytics.

result.metrics

The metrics property returns a lazy-computed Metrics object with all performance statistics.
result = hz.backtest(...)
m = result.metrics

print(f"Total Return:     ${m.total_return:.2f}")
print(f"Total Return %:   {m.total_return_pct:.2%}")
print(f"CAGR:             {m.cagr:.2%}")
print(f"Sharpe Ratio:     {m.sharpe_ratio:.3f}")
print(f"Sortino Ratio:    {m.sortino_ratio:.3f}")
print(f"Calmar Ratio:     {m.calmar_ratio:.3f}")
print(f"Max Drawdown:     ${m.max_drawdown:.2f}")
print(f"Max Drawdown %:   {m.max_drawdown_pct:.2%}")
print(f"Max DD Duration:  {m.max_drawdown_duration_secs:.0f}s")
print(f"Total Trades:     {m.total_trades}")
print(f"Win Rate:         {m.win_rate:.1f}%")
print(f"Profit Factor:    {m.profit_factor:.2f}")
print(f"Expectancy:       ${m.expectancy:.4f}")
print(f"Avg Win:          ${m.avg_win:.4f}")
print(f"Avg Loss:         ${m.avg_loss:.4f}")
print(f"Largest Win:      ${m.largest_win:.4f}")
print(f"Largest Loss:     ${m.largest_loss:.4f}")
print(f"Total Fees:       ${m.total_fees:.4f}")

Full Metrics Reference

| Metric | Type | Description |
| --- | --- | --- |
| total_return | float | Absolute PnL in dollars |
| total_return_pct | float | Percentage return on initial capital |
| cagr | float | Compound annual growth rate |
| sharpe_ratio | float | Annualized Sharpe ratio |
| sortino_ratio | float | Annualized Sortino ratio (downside deviation only) |
| calmar_ratio | float | CAGR / max drawdown |
| max_drawdown | float | Largest peak-to-trough decline in dollars |
| max_drawdown_pct | float | Largest peak-to-trough decline as a percentage |
| max_drawdown_duration_secs | float | Longest drawdown duration in seconds |
| total_trades | int | Number of fills |
| win_rate | float | Percentage of profitable trades (0-100) |
| profit_factor | float | Gross profit / gross loss |
| expectancy | float | Average profit per trade |
| avg_win | float | Average winning trade size |
| avg_loss | float | Average losing trade size |
| largest_win | float | Best single trade |
| largest_loss | float | Worst single trade |
| total_fees | float | Total fees paid |
| brier_score | float or None | Brier score (only if outcomes provided) |
| avg_edge | float or None | Average predicted edge across trades |

result.summary()

Returns a formatted string summary of all metrics, ready for printing.
print(result.summary())
=== Backtest: simple-backtest ===
Total Return:     $142.50 (14.25%)
CAGR:             87.32%
Sharpe:           2.145
Sortino:          3.012
Calmar:           4.231
Max Drawdown:     $20.65 (2.07%)
Trades:           48 (Win Rate: 62.50%)
Profit Factor:    1.87
Fees:             $4.80

result.pnl_by_market()

Returns a dict mapping each market ID to its realized PnL.
pnl = result.pnl_by_market()
for market, realized in pnl.items():
    print(f"{market}: ${realized:.2f}")

result.equity_curve

A list of (timestamp, equity) tuples showing the portfolio value over time.
curve = result.equity_curve
for ts, equity in curve[:5]:
    print(f"t={ts}: ${equity:.2f}")
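Because the raw curve is exposed, you can recompute drawdown-style statistics yourself; a minimal sketch:

```python
# Recompute max drawdown from a (timestamp, equity) curve.
def max_drawdown(curve):
    """Largest peak-to-trough decline in dollars."""
    peak, worst = float("-inf"), 0.0
    for _, equity in curve:
        peak = max(peak, equity)           # running high-water mark
        worst = max(worst, peak - equity)  # deepest decline from the peak
    return worst

curve = [(1000, 100.0), (1001, 110.0), (1002, 95.0), (1003, 120.0)]
print(max_drawdown(curve))  # → 15.0
```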

result.trades

A list of Fill objects representing every trade executed during the backtest.
for fill in result.trades[:5]:
    print(f"{fill.side} {fill.order_side} {fill.size} @ {fill.price}")

result.to_csv()

Export results to CSV files for further analysis.
# Export equity curve
result.to_csv("equity.csv", what="equity")

# Export trade log
result.to_csv("trades.csv", what="trades")

Examples

Basic Backtest

import horizon as hz

def mean_reversion(ctx):
    """Simple mean-reversion model."""
    fair = 0.50  # Assume fair value is 0.50; the quoter trades deviations
    return fair

def quoter(ctx, fair):
    edge = fair - ctx.feed.price
    if abs(edge) < 0.03:
        return None  # No edge, skip

    return hz.quotes(ctx.feed.price, spread=0.04, size=10)

# Generate sample data
import random
random.seed(42)
price = 0.50
data = []
for i in range(1000):
    price += random.gauss(0, 0.01)
    price = max(0.01, min(0.99, price))
    data.append({"timestamp": 1700000000 + i, "price": round(price, 4)})

result = hz.backtest(
    name="mean-reversion",
    markets=["test-market"],
    data=data,
    pipeline=[mean_reversion, quoter],
    initial_capital=1000.0,
    paper_fee_rate=0.002,
)

print(result.summary())

Multi-Feed Backtest

import horizon as hz

def cross_market_model(ctx):
    """Use a secondary feed as a signal for a prediction market."""
    signal_price = ctx.feeds["binance"].price

    # Simple threshold model
    if signal_price > 0.60:
        return 0.75
    else:
        return 0.35

def quoter(ctx, fair):
    edge = fair - ctx.feed.price
    if abs(edge) > 0.05:  # Quote whenever the model and market disagree enough
        return hz.quotes(ctx.feed.price, spread=0.04, size=20)

data = {
    "polymarket_book": [
        {"timestamp": t, "price": 0.50 + (t % 10) * 0.02, "bid": 0.49, "ask": 0.53}
        for t in range(1700000000, 1700000500)
    ],
    "binance": [
        {"timestamp": t, "price": 0.55 + (t % 20) * 0.01}
        for t in range(1700000000, 1700000500)
    ],
}

result = hz.backtest(
    name="cross-market",
    markets=["btc-above-100k"],
    data=data,
    feeds={"btc-above-100k": "polymarket_book"},
    pipeline=[cross_market_model, quoter],
    initial_capital=5000.0,
)

print(result.summary())
print("\nPnL by market:")
for market, pnl in result.pnl_by_market().items():
    print(f"  {market}: ${pnl:.2f}")

DataFrame Input

import pandas as pd
import horizon as hz

# Load real historical data
df = pd.read_csv("historical_prices.csv")

# Ensure required columns exist
assert "timestamp" in df.columns
assert "price" in df.columns

def momentum(ctx):
    return ctx.feed.price * 1.01

def quoter(ctx, fair):
    if fair > ctx.feed.ask:
        return hz.quotes(ctx.feed.ask, spread=0.04, size=5)

result = hz.backtest(
    name="momentum-df",
    markets=["my-market"],
    data=df,
    pipeline=[momentum, quoter],
)

# Export for analysis in pandas
result.to_csv("equity_curve.csv", what="equity")
result.to_csv("trade_log.csv", what="trades")

Brier Score with Outcomes

For prediction markets, you can evaluate calibration by providing known outcomes.
import horizon as hz

def probability_model(ctx):
    """Model that estimates event probability."""
    return 0.65

def quoter(ctx, fair):
    if fair > ctx.feed.price + 0.03:
        return hz.quotes(ctx.feed.price, spread=0.04, size=10)

data = [
    {"timestamp": t, "price": 0.55 + (t % 5) * 0.01}
    for t in range(1700000000, 1700000200)
]

result = hz.backtest(
    name="calibration-test",
    markets=["will-it-rain"],
    data=data,
    pipeline=[probability_model, quoter],
    outcomes={"will-it-rain": 1.0},  # Event resolved Yes
)

m = result.metrics
print(f"Brier Score: {m.brier_score:.4f}")  # Lower is better (0 = perfect)
print(f"Avg Edge:    {m.avg_edge:.4f}")
print(result.summary())
A Brier score of 0.0 means perfect calibration; 0.25 is equivalent to always predicting 50%. Scores below 0.2 indicate meaningful predictive power.
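The 0.25 baseline follows directly from the definition: a constant 50% forecast scores (0.5 - outcome)^2 = 0.25 on every market, whatever the outcome:

```python
# A constant 0.5 prediction yields a Brier score of exactly 0.25.
outcomes = [1.0, 0.0, 1.0, 1.0, 0.0]
baseline = sum((0.5 - o) ** 2 for o in outcomes) / len(outcomes)
print(baseline)  # → 0.25
```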

With Risk Configuration

import horizon as hz
from horizon import RiskConfig

risk = RiskConfig(
    max_position_per_market=100.0,
    max_order_size=20.0,
    max_portfolio_notional=5000.0,
    max_daily_drawdown_pct=10.0,
)

def model(ctx):
    return 0.60

def aggressive_quoter(ctx, fair):
    return hz.quotes(ctx.feed.price, spread=0.04, size=50)

data = [
    {"timestamp": t, "price": 0.50 + (t % 10) * 0.005}
    for t in range(1700000000, 1700001000)
]

result = hz.backtest(
    name="risk-limited",
    markets=["test-market"],
    data=data,
    pipeline=[model, aggressive_quoter],
    risk=risk,
    initial_capital=2000.0,
)

print(result.summary())
For more realistic results, enable L2 book simulation with probabilistic fills, market impact, and latency. This significantly reduces the gap between backtest and live performance. Use walk-forward optimization to validate that your strategy parameters are robust out-of-sample.

Tearsheet

Generate performance reports from backtest results.
from horizon import backtest, generate_tearsheet

result = backtest(...)
tearsheet = result.tearsheet()

# Or directly from equity curve (list of (timestamp, equity) tuples):
tearsheet = generate_tearsheet(
    equity_curve=[(1000, 100), (1001, 101), (1002, 99), (1003, 102), (1004, 105)],
    trades=result.trades if result else [],
    initial_capital=100.0,
)

print(tearsheet.monthly_returns)    # {"YYYY-MM": return_pct}
print(tearsheet.drawdowns)          # List of DrawdownRecord (top 5)
print(tearsheet.avg_win)            # Average winning trade PnL
print(tearsheet.avg_loss)           # Average losing trade PnL
print(tearsheet.largest_win)        # Largest single win
print(tearsheet.largest_loss)       # Largest single loss
print(tearsheet.win_streak)         # Max consecutive wins
print(tearsheet.loss_streak)        # Max consecutive losses
print(tearsheet.tail_ratio)         # 95th / 5th percentile ratio
print(tearsheet.time_of_day)        # {hour: avg_return}
print(tearsheet.rolling_sharpe)     # [(timestamp, sharpe)]
print(tearsheet.rolling_sortino)    # [(timestamp, sortino)]
Even with L2 simulation, backtests cannot perfectly replicate live trading. Your own orders would have changed the book in real time (market impact feedback), and fill probabilities are estimates. Always apply a conservative discount to backtest results.