Horizon Backtesting
Horizon provides a full backtesting engine via hz.backtest(). It uses the same pipeline, risk engine, and paper exchange as live trading. Your strategy code runs identically in both modes.
By default, backtesting uses mid-price matching against the paper exchange. For higher-fidelity simulation, you can enable L2 orderbook replay, probabilistic fill models, market impact, and latency simulation. All matching logic runs in Rust.
Quick Start
```python
import horizon as hz

def model(ctx):
    return ctx.feed.price * 1.02

def quoter(ctx, fair):
    if fair > ctx.feed.price:
        return hz.quotes(ctx.feed.price, spread=0.04, size=10)

result = hz.backtest(
    name="simple-backtest",
    markets=["my-market"],
    data=[
        {"timestamp": 1000, "price": 0.50},
        {"timestamp": 1001, "price": 0.52},
        {"timestamp": 1002, "price": 0.48},
        {"timestamp": 1003, "price": 0.55},
        {"timestamp": 1004, "price": 0.60},
    ],
    pipeline=[model, quoter],
)

print(result.summary())
```
hz.backtest() Signature
```python
hz.backtest(
    name: str = "backtest",
    markets: list[str] = ["market"],
    data=None,
    feeds=None,
    pipeline: list[Callable] = [...],
    risk=None,
    params=None,
    paper_fee_rate: float = 0.001,
    paper_maker_fee_rate: float | None = None,
    paper_taker_fee_rate: float | None = None,
    initial_capital: float = 1000.0,
    outcomes: dict[str, float] | None = None,
    # L2 Book Simulation
    book_data: dict[str, list[dict]] | None = None,
    fill_model: str = "deterministic",
    fill_model_params: dict[str, float] | None = None,
    impact_temporary_bps: float = 0.0,
    impact_permanent_fraction: float = 0.0,
    latency_ms: float = 0.0,
    rng_seed: int | None = None,
)
```
| Parameter | Type | Description |
|---|---|---|
| name | str | Name for this backtest run |
| markets | list[str] | Market IDs to simulate |
| data | various | Historical data (see formats below) |
| feeds | dict | Feed name mapping |
| pipeline | list[Callable] | Same pipeline functions as hz.run() |
| risk | RiskConfig | Risk configuration (same as live) |
| params | dict | Strategy parameters passed to pipeline |
| paper_fee_rate | float | Fee rate applied to paper fills (default 0.1%) |
| paper_maker_fee_rate | float or None | Maker fee rate (overrides paper_fee_rate for maker fills) |
| paper_taker_fee_rate | float or None | Taker fee rate (overrides paper_fee_rate for taker fills) |
| initial_capital | float | Starting capital (default $1000) |
| outcomes | dict[str, float] | Market outcomes for Brier score (0.0 or 1.0 per market) |
| book_data | dict | L2 orderbook snapshots per market (see below) |
| fill_model | str | "deterministic", "probabilistic", or "glft" |
| fill_model_params | dict | Fill model parameters (see below) |
| impact_temporary_bps | float | Temporary market impact in basis points |
| impact_permanent_fraction | float | Fraction of temporary impact that persists (0-1) |
| latency_ms | float | Simulated order-to-fill latency in milliseconds |
| rng_seed | int | Random seed for stochastic fill models |
Data Formats
Horizon accepts historical data in multiple formats.
List of Dicts
The simplest format. Each dict represents one tick. Required fields: timestamp, price. Optional: bid, ask, volume.

```python
data = [
    {"timestamp": 1700000000, "price": 0.55, "bid": 0.54, "ask": 0.56, "volume": 100},
    {"timestamp": 1700000001, "price": 0.56, "bid": 0.55, "ask": 0.57, "volume": 150},
    {"timestamp": 1700000002, "price": 0.54, "bid": 0.53, "ask": 0.55, "volume": 80},
]

result = hz.backtest(
    markets=["my-market"],
    data=data,
    pipeline=[model, quoter],
)
```

CSV File
Pass a file path to a CSV with the same columns.

```python
# data.csv:
# timestamp,price,bid,ask,volume
# 1700000000,0.55,0.54,0.56,100
# 1700000001,0.56,0.55,0.57,150

result = hz.backtest(
    markets=["my-market"],
    data="data.csv",
    pipeline=[model, quoter],
)
```

Pandas DataFrame
Pass a DataFrame directly. Column names must match the dict format above.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": [1700000000, 1700000001, 1700000002],
    "price": [0.55, 0.56, 0.54],
    "bid": [0.54, 0.55, 0.53],
    "ask": [0.56, 0.57, 0.55],
    "volume": [100, 150, 80],
})

result = hz.backtest(
    markets=["my-market"],
    data=df,
    pipeline=[model, quoter],
)
```

Multi-Feed Dict
For strategies using multiple feeds, pass a dict mapping feed names to data.

```python
data = {
    "polymarket_book": [
        {"timestamp": 1700000000, "price": 0.55, "bid": 0.54, "ask": 0.56},
        {"timestamp": 1700000001, "price": 0.56, "bid": 0.55, "ask": 0.57},
    ],
    "binance": [
        {"timestamp": 1700000000, "price": 0.60},
        {"timestamp": 1700000001, "price": 0.62},
    ],
}

result = hz.backtest(
    markets=["btc-above-100k"],
    data=data,
    feeds={"btc-above-100k": "polymarket_book"},
    pipeline=[model, quoter],
)
```
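Feeds often tick at different rates. How the engine aligns them internally is not specified here, but if you want to pre-align feeds yourself before a backtest, a forward-fill merge is a common approach. `align_feeds` below is a hypothetical helper, not part of the Horizon API:

```python
def align_feeds(feeds):
    """Forward-fill every feed onto the union of all timestamps.

    Hypothetical pre-processing helper -- not part of the Horizon API.
    """
    all_ts = sorted({tick["timestamp"] for ticks in feeds.values() for tick in ticks})
    aligned = {}
    for name, ticks in feeds.items():
        by_ts = {tick["timestamp"]: tick for tick in ticks}
        last = None
        rows = []
        for ts in all_ts:
            last = by_ts.get(ts, last)
            if last is not None:  # skip timestamps before the feed's first tick
                rows.append({**last, "timestamp": ts})
        aligned[name] = rows
    return aligned

feeds = {
    "polymarket_book": [{"timestamp": 1, "price": 0.55}],
    "binance": [{"timestamp": 1, "price": 0.60}, {"timestamp": 2, "price": 0.62}],
}
aligned = align_feeds(feeds)  # polymarket_book's 0.55 is carried forward to t=2
```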
L2 Orderbook Simulation
For realistic prediction market backtesting, replay historical L2 orderbook snapshots. Orders are matched by walking the book at each tick, not at a single mid-price.
Pass book_data as a dict mapping market IDs to lists of orderbook snapshots:
```python
book_data = {
    "election-winner": [
        {
            "timestamp": 1700000000,
            "bids": [(0.54, 100), (0.53, 200), (0.52, 500)],
            "asks": [(0.56, 100), (0.57, 200), (0.58, 500)],
        },
        {
            "timestamp": 1700000001,
            "bids": [(0.55, 150), (0.54, 250)],
            "asks": [(0.57, 150), (0.58, 250)],
        },
    ],
}

result = hz.backtest(
    data=tick_data,
    pipeline=[my_strategy],
    book_data=book_data,
)
```
Each snapshot has timestamp (float), bids (list of (price, size) tuples, descending), and asks (list of (price, size) tuples, ascending). Book state carries forward between snapshots.
When book_data is provided, the engine automatically switches to the BookSim exchange which walks the L2 book to fill orders. BookSim supports split maker/taker fees via paper_maker_fee_rate and paper_taker_fee_rate, computing mid from the best bid/ask to determine each fill’s maker/taker status.
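To build intuition for how a marketable order is filled against L2 depth, here is a minimal pure-Python sketch of walking the ask side for a buy. This is illustrative only; the actual matching runs in Rust inside BookSim:

```python
def walk_book(asks, size):
    """Walk ask levels (price ascending) to fill a buy of `size`.

    Returns (filled_size, average_fill_price). Illustrative sketch only.
    """
    filled = 0.0
    notional = 0.0
    for price, depth in asks:
        if filled >= size:
            break
        take = min(depth, size - filled)
        notional += take * price
        filled += take
    return filled, (notional / filled if filled else None)

# Buying 150: takes 100 @ 0.56, then 50 @ 0.57
asks = [(0.56, 100), (0.57, 200), (0.58, 500)]
filled, avg = walk_book(asks, 150)
```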
Fill Models
Control how realistically orders are filled against the book.
Deterministic
Default behavior. Orders fill if the price crosses the book level, with a 100% fill rate at each level.

```python
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    fill_model="deterministic",
)
```

Probabilistic
Models queue position and distance-based fill probability:

P(fill) = exp(-lambda * distance) * (1 - queue_frac)

where distance is the price distance from mid, lambda controls the decay rate, and queue_frac models your queue position (0.5 = middle of queue).

```python
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    fill_model="probabilistic",
    fill_model_params={"lambda": 1.0, "queue_frac": 0.5},
    rng_seed=42,
)
```

| Parameter | Default | Description |
|---|---|---|
| lambda | 1.0 | Exponential decay rate |
| queue_frac | 0.5 | Queue position fraction (0 = front, 1 = back) |

GLFT
Guéant-Lehalle-Fernandez-Tapia fill model, based on the academic market-making framework:

P(fill) = intensity * exp(-kappa * delta)

where delta is the distance from mid-price.

```python
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    fill_model="glft",
    fill_model_params={"intensity": 1.0, "kappa": 1.5},
    rng_seed=42,
)
```

| Parameter | Default | Description |
|---|---|---|
| intensity | 1.0 | Base fill intensity |
| kappa | 1.5 | Exponential decay parameter |
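Both stochastic fill rules are direct transcriptions of the formulas above; the sketch below reproduces them in pure Python (the engine's Rust implementation is authoritative):

```python
import math
import random

def prob_fill(distance, lam=1.0, queue_frac=0.5):
    # P(fill) = exp(-lambda * distance) * (1 - queue_frac)
    return math.exp(-lam * distance) * (1.0 - queue_frac)

def glft_fill(delta, intensity=1.0, kappa=1.5):
    # P(fill) = intensity * exp(-kappa * delta), capped at 1
    return min(1.0, intensity * math.exp(-kappa * delta))

# A seeded RNG (cf. rng_seed) makes stochastic fills reproducible
rng = random.Random(42)
is_filled = rng.random() < prob_fill(distance=0.01)
```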
Market Impact
Simulate price impact from your own orders. Both temporary (during fill) and permanent (persists after fill) impact are supported.
```python
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    impact_temporary_bps=5.0,       # 5 bps temporary impact per unit
    impact_permanent_fraction=0.3,  # 30% of temporary impact persists
)
```
| Parameter | Description |
|---|---|
| impact_temporary_bps | Price shift per unit during book walk (basis points). Effective fill price worsens as you consume more depth. |
| impact_permanent_fraction | Fraction of temporary impact that becomes permanent. Shifts the book for subsequent ticks. |
How it works: When your buy order walks the ask side, each level’s effective price increases by filled_so_far * temporary_bps / 10000. After the fill, total_notional * permanent_fraction * temporary_bps / 10000 is added as a persistent book displacement.
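Reading the formulas above literally, a buy that walks the book with impact enabled could be sketched as follows. Treating the shift as an absolute price offset is an assumption here; check against the engine if precision matters:

```python
def walk_with_impact(asks, size, temp_bps, perm_frac):
    """Fill a buy of `size`; return (avg_fill_price, permanent_book_shift)."""
    filled = 0.0
    notional = 0.0
    for price, depth in asks:
        if filled >= size:
            break
        take = min(depth, size - filled)
        # each level's effective price worsens by filled_so_far * temp_bps / 10000
        eff = price + filled * temp_bps / 10_000
        notional += take * eff
        filled += take
    # a fraction of the temporary impact persists as a book displacement
    perm_shift = notional * perm_frac * temp_bps / 10_000
    return notional / filled, perm_shift

# 100 @ 0.56, then 50 @ (0.57 + 100 * 5 / 10000) = 0.62
avg, shift = walk_with_impact([(0.56, 100), (0.57, 200)], 150,
                              temp_bps=5.0, perm_frac=0.3)
```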
Latency Simulation
Simulate the delay between order submission and arrival at the exchange:
```python
result = hz.backtest(
    data=data,
    pipeline=[my_strategy],
    book_data=book_data,
    latency_ms=50.0,  # 50ms order latency
)
```
Latency is converted to ticks based on the average tick interval in your data. Orders enter a pending queue and only become active after the specified delay. This models the real-world effect of network latency on fill rates.
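The conversion can be sketched as below; the exact rounding rule (rounding up so any nonzero latency delays at least one tick) is an assumption, not documented engine behavior:

```python
import math

def latency_in_ticks(timestamps, latency_ms):
    """Convert latency to whole ticks using the average tick interval (seconds)."""
    if latency_ms <= 0:
        return 0
    avg_dt = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)
    # assumption: round up so nonzero latency delays activation by >= 1 tick
    return math.ceil((latency_ms / 1000.0) / avg_dt)

ts = list(range(1700000000, 1700000100))  # 1-second ticks
# 50ms latency -> 1 tick; 2500ms -> 3 ticks
```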
Calibration Analytics
Evaluate prediction accuracy with Rust-powered analytics. Available when outcomes are provided.
Calibration Curve
```python
from horizon._horizon import calibration_curve

# predictions: your entry prices (probability estimates)
# outcomes: actual results (0.0 or 1.0)
result = calibration_curve(
    predictions=[0.3, 0.7, 0.9, 0.1, 0.6, 0.8],
    outcomes=[0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    n_bins=5,
)

print(f"Brier Score: {result.brier_score:.4f}")
print(f"Log Loss: {result.log_loss:.4f}")
print(f"ECE: {result.ece:.4f}")  # Expected Calibration Error

# Bins: (bin_center, actual_frequency, count)
for center, freq, count in result.bins:
    print(f"  Predicted ~{center:.1%}: Actual {freq:.1%} (n={count})")
```
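For reference, the Brier score and ECE can be reproduced in a few lines of pure Python. The bin-assignment rule here is an assumption; the Rust implementation may bin edge values differently:

```python
def brier_score(predictions, outcomes):
    # mean squared error between predicted probability and 0/1 outcome
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

def expected_calibration_error(predictions, outcomes, n_bins=5):
    # bucket predictions into equal-width bins over [0, 1]
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(predictions, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        bins[idx].append((p, o))
    n = len(predictions)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        avg_acc = sum(o for _, o in members) / len(members)
        ece += len(members) / n * abs(avg_conf - avg_acc)
    return ece

preds = [0.3, 0.7, 0.9, 0.1, 0.6, 0.8]
outs = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
```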
Log Loss
```python
from horizon._horizon import log_loss

ll = log_loss(
    predictions=[0.7, 0.3, 0.9],
    outcomes=[1.0, 0.0, 1.0],
)
print(f"Log Loss: {ll:.4f}")  # Lower is better
```
Edge Decay
Measure how your edge decays as events approach resolution:
```python
from horizon._horizon import edge_decay

result = edge_decay(
    entry_prices=[0.45, 0.55, 0.40, 0.60],
    outcomes=[1.0, 1.0, 0.0, 0.0],
    entry_ts=[1000.0, 2000.0, 3000.0, 4000.0],
    resolution_ts=[5000.0, 5000.0, 5000.0, 5000.0],
    n_buckets=10,
)

print(f"Edge half-life: {result.half_life_hours:.1f} hours")
for hours, avg_edge in result.decay_curve:
    print(f"  {hours:.0f}h before resolution: {avg_edge:.4f} avg edge")
```
Walk-Forward Optimization
Avoid overfitting with rolling out-of-sample testing. The walk_forward() function splits your data into train/test windows, runs grid search on each training window, and evaluates the best parameters on the held-out test window.
```python
import horizon as hz
from horizon.walkforward import walk_forward

def pipeline_factory(params):
    """Create a pipeline from parameter dict."""
    spread = params["spread"]
    size = params["size"]

    def quoter(ctx):
        fair = ctx.feed.price
        return hz.quotes(fair=fair, spread=spread, size=size)

    return [quoter]

result = walk_forward(
    data=tick_data,
    pipeline_factory=pipeline_factory,
    param_grid={
        "spread": [0.02, 0.04, 0.06, 0.08],
        "size": [5, 10, 20],
    },
    n_splits=5,
    train_ratio=0.7,
    expanding=True,            # Anchored expanding window
    objective="sharpe_ratio",  # Optimize for Sharpe
    purge_gap=3600.0,          # 1 hour purge between train/test
)

# Per-window results
for i, (window, params) in enumerate(zip(result.windows, result.best_params_per_window)):
    test = result.test_results[i]
    print(f"Window {i}: best params={params}, OOS Sharpe={test.metrics.sharpe_ratio:.3f}")

# Aggregate out-of-sample performance
m = result.aggregate_metrics
print(f"\nAggregate OOS: Return={m.total_return_pct:.2%}, Sharpe={m.sharpe_ratio:.3f}")
```
walk_forward() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | various | required | Same formats as hz.backtest() |
| pipeline_factory | Callable | required | params_dict -> pipeline |
| param_grid | dict | required | {param_name: [values]} for grid search |
| n_splits | int | 5 | Number of train/test splits |
| train_ratio | float | 0.7 | Fraction used for training |
| expanding | bool | True | Anchored expanding (True) or rolling (False) |
| objective | str | "sharpe_ratio" | Metric to optimize |
| purge_gap | float | 0.0 | Seconds to purge between train/test |
| markets | list[str] | None | Passed to backtest |
| risk | RiskConfig | None | Passed to backtest |
| initial_capital | float | 1000.0 | Starting capital |
All additional **kwargs are passed through to each backtest() call (e.g., fill_model, impact_temporary_bps).
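The windowing described above can be sketched as follows. This is a hypothetical reconstruction of the split logic, not Horizon's actual code:

```python
def make_windows(t0, t1, n_splits=5, train_ratio=0.7, purge_gap=0.0, expanding=True):
    """Return (train_start, train_end, test_start, test_end) per split."""
    span = (t1 - t0) / n_splits
    windows = []
    for i in range(n_splits):
        # anchored at t0 when expanding, else a rolling segment start
        train_start = t0 if expanding else t0 + i * span
        seg_end = t0 + (i + 1) * span
        train_end = train_start + (seg_end - train_start) * train_ratio
        # the purge gap separates training data from out-of-sample testing
        windows.append((train_start, train_end, train_end + purge_gap, seg_end))
    return windows

windows = make_windows(0.0, 100.0, n_splits=5, purge_gap=2.0)
```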
WalkForwardResult
| Field | Type | Description |
|---|---|---|
| windows | list[WalkForwardWindow] | Train/test time boundaries |
| best_params_per_window | list[dict] | Optimal parameters per window |
| test_results | list[BacktestResult] | Out-of-sample results per window |
| aggregate_equity | list[tuple] | Chained OOS equity curve |
| aggregate_metrics | Metrics | Combined OOS performance metrics |
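One plausible way a chained out-of-sample equity curve can be assembled is to rebase each window's segment so it continues from where the previous one ended. The rebasing rule here is an assumption, not Horizon's documented behavior:

```python
def chain_equity(segments):
    """Chain per-window equity curves into one continuous OOS curve."""
    chained = []
    for seg in segments:
        if not seg:
            continue
        base = seg[0][1]                       # each window starts from its own capital
        start = chained[-1][1] if chained else base
        for ts, eq in seg:
            chained.append((ts, start + (eq - base)))  # carry PnL forward
    return chained

curve = chain_equity([[(0, 100.0), (1, 110.0)], [(2, 100.0), (3, 95.0)]])
```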
BacktestResult
hz.backtest() returns a BacktestResult object with full analytics.
result.metrics
The metrics property returns a lazy-computed Metrics object with all performance statistics.
```python
result = hz.backtest(...)
m = result.metrics

print(f"Total Return: ${m.total_return:.2f}")
print(f"Total Return %: {m.total_return_pct:.2%}")
print(f"CAGR: {m.cagr:.2%}")
print(f"Sharpe Ratio: {m.sharpe_ratio:.3f}")
print(f"Sortino Ratio: {m.sortino_ratio:.3f}")
print(f"Calmar Ratio: {m.calmar_ratio:.3f}")
print(f"Max Drawdown: ${m.max_drawdown:.2f}")
print(f"Max Drawdown %: {m.max_drawdown_pct:.2%}")
print(f"Max DD Duration: {m.max_drawdown_duration_secs:.0f}s")
print(f"Total Trades: {m.total_trades}")
print(f"Win Rate: {m.win_rate:.1f}%")
print(f"Profit Factor: {m.profit_factor:.2f}")
print(f"Expectancy: ${m.expectancy:.4f}")
print(f"Avg Win: ${m.avg_win:.4f}")
print(f"Avg Loss: ${m.avg_loss:.4f}")
print(f"Largest Win: ${m.largest_win:.4f}")
print(f"Largest Loss: ${m.largest_loss:.4f}")
print(f"Total Fees: ${m.total_fees:.4f}")
```
Full Metrics Reference
| Metric | Type | Description |
|---|---|---|
| total_return | float | Absolute PnL in dollars |
| total_return_pct | float | Percentage return on initial capital |
| cagr | float | Compound annual growth rate |
| sharpe_ratio | float | Annualized Sharpe ratio |
| sortino_ratio | float | Annualized Sortino ratio (downside deviation only) |
| calmar_ratio | float | CAGR / max drawdown |
| max_drawdown | float | Largest peak-to-trough decline in dollars |
| max_drawdown_pct | float | Largest peak-to-trough decline as percentage |
| max_drawdown_duration_secs | float | Longest drawdown duration in seconds |
| total_trades | int | Number of fills |
| win_rate | float | Percentage of profitable trades (0-100) |
| profit_factor | float | Gross profit / gross loss |
| expectancy | float | Average profit per trade |
| avg_win | float | Average winning trade size |
| avg_loss | float | Average losing trade size |
| largest_win | float | Best single trade |
| largest_loss | float | Worst single trade |
| total_fees | float | Total fees paid |
| brier_score | float or None | Brier score (only if outcomes provided) |
| avg_edge | float or None | Average predicted edge across trades |
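Most of these metrics can be cross-checked from result.equity_curve. Max drawdown, for example, is a single pass over the curve (a reference sketch; the engine computes metrics in Rust):

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, in dollars and as a fraction of the peak."""
    peak = float("-inf")
    mdd = 0.0
    mdd_pct = 0.0
    for _, equity in equity_curve:
        peak = max(peak, equity)
        dd = peak - equity
        if dd > mdd:
            mdd = dd
            mdd_pct = dd / peak if peak > 0 else 0.0
    return mdd, mdd_pct

curve = [(0, 100.0), (1, 120.0), (2, 90.0), (3, 130.0), (4, 110.0)]
mdd, pct = max_drawdown(curve)  # worst decline: 120 -> 90
```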
result.summary()
Returns a formatted string summary of all metrics, ready for printing.
```text
=== Backtest: simple-backtest ===
Total Return: $142.50 (14.25%)
CAGR: 87.32%
Sharpe: 2.145
Sortino: 3.012
Calmar: 4.231
Max Drawdown: $20.65 (2.07%)
Trades: 48 (Win Rate: 62.50%)
Profit Factor: 1.87
Fees: $4.80
```
result.pnl_by_market()
Returns a dict mapping each market ID to its realized PnL.
```python
pnl = result.pnl_by_market()
for market, realized in pnl.items():
    print(f"{market}: ${realized:.2f}")
```
result.equity_curve
A list of (timestamp, equity) tuples showing the portfolio value over time.
```python
curve = result.equity_curve
for ts, equity in curve[:5]:
    print(f"t={ts}: ${equity:.2f}")
```
result.trades
A list of Fill objects representing every trade executed during the backtest.
```python
for fill in result.trades[:5]:
    print(f"{fill.side} {fill.order_side} {fill.size} @ {fill.price}")
```
result.to_csv()
Export results to CSV files for further analysis.
```python
# Export equity curve
result.to_csv("equity.csv", what="equity")

# Export trade log
result.to_csv("trades.csv", what="trades")
```
Examples
Basic Backtest
```python
import horizon as hz

def mean_reversion(ctx):
    """Simple mean-reversion model."""
    fair = 0.50  # Assume fair value is 0.50
    return fair

def quoter(ctx, fair):
    edge = fair - ctx.feed.price
    if abs(edge) < 0.03:
        return None  # No edge, skip
    return hz.quotes(ctx.feed.price, spread=0.04, size=10)

# Generate sample data
import random
random.seed(42)
price = 0.50
data = []
for i in range(1000):
    price += random.gauss(0, 0.01)
    price = max(0.01, min(0.99, price))
    data.append({"timestamp": 1700000000 + i, "price": round(price, 4)})

result = hz.backtest(
    name="mean-reversion",
    markets=["test-market"],
    data=data,
    pipeline=[mean_reversion, quoter],
    initial_capital=1000.0,
    paper_fee_rate=0.002,
)

print(result.summary())
```
Multi-Feed Backtest
```python
import horizon as hz

def cross_market_model(ctx):
    """Use a secondary feed as a signal for a prediction market."""
    signal_price = ctx.feeds["binance"].price
    # Simple threshold model
    if signal_price > 0.60:
        return 0.75
    else:
        return 0.35

def quoter(ctx, fair):
    edge = fair - ctx.feed.price
    if abs(edge) > 0.05:
        return hz.quotes(ctx.feed.price, spread=0.04, size=20)

data = {
    "polymarket_book": [
        {"timestamp": t, "price": 0.50 + (t % 10) * 0.02, "bid": 0.49, "ask": 0.53}
        for t in range(1700000000, 1700000500)
    ],
    "binance": [
        {"timestamp": t, "price": 0.55 + (t % 20) * 0.01}
        for t in range(1700000000, 1700000500)
    ],
}

result = hz.backtest(
    name="cross-market",
    markets=["btc-above-100k"],
    data=data,
    feeds={"btc-above-100k": "polymarket_book"},
    pipeline=[cross_market_model, quoter],
    initial_capital=5000.0,
)

print(result.summary())
print("\nPnL by market:")
for market, pnl in result.pnl_by_market().items():
    print(f"  {market}: ${pnl:.2f}")
```
DataFrame Backtest with CSV Export

```python
import pandas as pd
import horizon as hz

# Load real historical data
df = pd.read_csv("historical_prices.csv")

# Ensure required columns exist
assert "timestamp" in df.columns
assert "price" in df.columns

def momentum(ctx):
    return ctx.feed.price * 1.01

def quoter(ctx, fair):
    if fair > ctx.feed.ask:
        return hz.quotes(ctx.feed.ask, spread=0.04, size=5)

result = hz.backtest(
    name="momentum-df",
    markets=["my-market"],
    data=df,
    pipeline=[momentum, quoter],
)

# Export for analysis in pandas
result.to_csv("equity_curve.csv", what="equity")
result.to_csv("trade_log.csv", what="trades")
```
Brier Score with Outcomes
For prediction markets, you can evaluate calibration by providing known outcomes.
```python
import horizon as hz

def probability_model(ctx):
    """Model that estimates event probability."""
    return 0.65

def quoter(ctx, fair):
    if fair > ctx.feed.price + 0.03:
        return hz.quotes(ctx.feed.price, spread=0.04, size=10)

data = [
    {"timestamp": t, "price": 0.55 + (t % 5) * 0.01}
    for t in range(1700000000, 1700000200)
]

result = hz.backtest(
    name="calibration-test",
    markets=["will-it-rain"],
    data=data,
    pipeline=[probability_model, quoter],
    outcomes={"will-it-rain": 1.0},  # Event resolved Yes
)

m = result.metrics
print(f"Brier Score: {m.brier_score:.4f}")  # Lower is better (0 = perfect)
print(f"Avg Edge: {m.avg_edge:.4f}")
print(result.summary())
```
A Brier score of 0.0 means perfect prediction; 0.25 is what you get by always predicting 50%. Scores below 0.2 indicate meaningful predictive power.
With Risk Configuration
```python
import horizon as hz
from horizon import RiskConfig

risk = RiskConfig(
    max_position_per_market=100.0,
    max_order_size=20.0,
    max_portfolio_notional=5000.0,
    max_daily_drawdown_pct=10.0,
)

def model(ctx):
    return 0.60

def aggressive_quoter(ctx, fair):
    return hz.quotes(ctx.feed.price, spread=0.04, size=50)

data = [
    {"timestamp": t, "price": 0.50 + (t % 10) * 0.005}
    for t in range(1700000000, 1700001000)
]

result = hz.backtest(
    name="risk-limited",
    markets=["test-market"],
    data=data,
    pipeline=[model, aggressive_quoter],
    risk=risk,
    initial_capital=2000.0,
)

print(result.summary())
```
For more realistic results, enable L2 book simulation with probabilistic fills, market impact, and latency. This significantly reduces the gap between backtest and live performance. Use walk-forward optimization to validate that your strategy parameters are robust out-of-sample.
Tearsheet
Generate performance reports from backtest results.
```python
from horizon import backtest, generate_tearsheet

result = backtest(...)
tearsheet = result.tearsheet()

# Or directly from equity curve (list of (timestamp, equity) tuples):
tearsheet = generate_tearsheet(
    equity_curve=[(1000, 100), (1001, 101), (1002, 99), (1003, 102), (1004, 105)],
    trades=result.fills if result else [],
    initial_capital=100.0,
)

print(tearsheet.monthly_returns)  # {"YYYY-MM": return_pct}
print(tearsheet.drawdowns)        # List of DrawdownRecord (top 5)
print(tearsheet.avg_win)          # Average winning trade PnL
print(tearsheet.avg_loss)         # Average losing trade PnL
print(tearsheet.largest_win)      # Largest single win
print(tearsheet.largest_loss)     # Largest single loss
print(tearsheet.win_streak)       # Max consecutive wins
print(tearsheet.loss_streak)      # Max consecutive losses
print(tearsheet.tail_ratio)       # 95th / 5th percentile ratio
print(tearsheet.time_of_day)      # {hour: avg_return}
print(tearsheet.rolling_sharpe)   # [(timestamp, sharpe)]
print(tearsheet.rolling_sortino)  # [(timestamp, sortino)]
```
Even with L2 simulation, backtests cannot perfectly replicate live trading. Your own orders would have changed the book in real time (market impact feedback), and fill probabilities are estimates. Always apply a conservative discount to backtest results.