Pro Feature. Requires a Pro or Ultra subscription. Get started at api.mathematicalcompany.com

Robustness Testing

Three statistical tests designed specifically for prediction markets. They answer a question every trader must ask: “Is my edge real, or did I just get lucky?”

Event Permutation

Shuffle which outcomes map to which markets. Tests whether your market selection was skillful.

Outcome Randomization

Keep trades fixed, re-draw binary outcomes from Bernoulli(p). Tests whether you got lucky on resolutions.

Path Simulation

Shuffle trade ordering to test if your drawdown and loss streaks were unusually good or bad.

Quick Start

import horizon as hz

result = hz.backtest(
    data=historical_data,
    pipeline=[my_strategy],
    markets=["mkt_a", "mkt_b", "mkt_c"],
    outcomes={"mkt_a": 1.0, "mkt_b": 0.0, "mkt_c": 1.0},
)

# Run all applicable tests
report = result.robustness(n_simulations=10000, seed=42)
print(report.summary())
print(f"Strategy is robust: {report.is_robust}")
Output:
==================================================
  ROBUSTNESS REPORT
==================================================

Permutation Test (10000 permutations)
  Observed PnL:   +12.5000
  Mean permuted:  +2.1000
  Std permuted:   8.3000
  p-value:        0.0340 (SIGNIFICANT)

Outcome Randomization (10000 simulations)
  Observed PnL:   +12.5000
  Mean random:    -1.2000
  Std random:     6.8000
  p-value:        0.0120 (SIGNIFICANT)

Path Simulation (10000 simulations)
  Observed max DD:     3.2000
  Mean simulated DD:   5.1000
  Std simulated DD:    1.8000
  DD percentile:       0.7800 (FAVORABLE)
  Observed max consec losses: 2
  Mean max consec losses:     3.1

  Overall: ROBUST
==================================================

Event Permutation Test

Tests whether your strategy’s market selection was genuinely skillful. Because prediction market events are independent, which resolved outcome is attached to which market shouldn’t matter under the null hypothesis of no selection skill. This test shuffles the outcome-to-market assignment and recomputes PnL for each permutation. If your observed PnL falls in the upper tail of the permuted distribution, your selection was skillful (not lucky).

Signature

result = hz.permutation_test(
    result,                  # BacktestResult with outcomes
    n_permutations=1000,     # Number of permutations
    seed=42,                 # Reproducibility
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["election", "fed-rate", "gdp"],
    outcomes={"election": 1.0, "fed-rate": 0.0, "gdp": 1.0},
)

perm = hz.permutation_test(backtest_result, n_permutations=5000, seed=42)

print(f"Observed PnL:  {perm.observed_pnl:+.2f}")
print(f"Mean permuted: {perm.mean_permuted_pnl:+.2f}")
print(f"p-value:       {perm.p_value:.4f}")
print(f"Significant:   {perm.is_significant}")

PermutationTestResult

Field              Type         Description
observed_pnl       float        Actual total PnL from the backtest
mean_permuted_pnl  float        Mean PnL across permuted orderings
std_permuted_pnl   float        Standard deviation of permuted PnLs
p_value            float        Fraction of permutations with PnL >= observed
n_permutations     int          Number of permutations executed
permuted_pnls      list[float]  All permuted PnLs (for histograms)
is_significant     bool         True if p_value < 0.05
Let Π be the set of all permutations of outcome assignments. For each permutation σ ∈ Π:

  PnL_σ = Σ_m PnL(m, σ(o_m))

The p-value is computed conservatively:

  p = (#{σ : PnL_σ >= PnL_observed} + 1) / (N + 1)

Under H_0 (no selection skill), the observed PnL should be typical of the permuted distribution. A small p-value rejects H_0.
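The conservative p-value above can be sketched in a few lines. This is an illustrative standalone implementation, not the horizon internals; `pnl_fn` (a callable mapping a {market: outcome} dict to total PnL) and the helper name are assumptions for the sketch:

```python
import numpy as np

def permutation_pnl_test(pnl_fn, outcomes, n_permutations=1000, seed=42):
    """Illustrative sketch of the conservative permutation p-value.

    pnl_fn: callable taking a {market_id: outcome} dict, returning total PnL.
    outcomes: the actually observed {market_id: outcome} mapping.
    """
    rng = np.random.default_rng(seed)
    markets = list(outcomes)
    values = np.array([outcomes[m] for m in markets])
    observed = pnl_fn(dict(zip(markets, values)))

    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle which outcome is assigned to which market
        permuted[i] = pnl_fn(dict(zip(markets, rng.permutation(values))))

    # Conservative estimator: p = (#{PnL_sigma >= PnL_observed} + 1) / (N + 1)
    p_value = (np.count_nonzero(permuted >= observed) + 1) / (n_permutations + 1)
    return observed, permuted, p_value
```

The +1 terms guarantee p > 0 even when no permutation beats the observed PnL, keeping the estimator valid for small N.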

Outcome Randomization

Tests whether your realized outcomes were luckier than expected given market pricing. Keeps every trade exactly as-is (same prices, sizes, timing). For each simulation, re-draws each market’s binary outcome from Bernoulli(p) where p is the market’s implied probability (average buy price). If your strategy has genuinely better calibration than the market, your real PnL will sit in the upper tail.

Signature

result = hz.outcome_randomization(
    result,                  # BacktestResult with outcomes
    n_simulations=10000,     # Number of Bernoulli resamples
    seed=42,                 # Reproducibility
    probabilities=None,      # Optional explicit {market_id: prob}
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["election", "fed-rate"],
    outcomes={"election": 1.0, "fed-rate": 0.0},
)

oc = hz.outcome_randomization(backtest_result, n_simulations=10000, seed=42)

print(f"Observed PnL:  {oc.observed_pnl:+.2f}")
print(f"Mean random:   {oc.mean_random_pnl:+.2f}")
print(f"p-value:       {oc.p_value:.4f}")
print(f"Significant:   {oc.is_significant}")

OutcomeRandomizationResult

Field            Type         Description
observed_pnl     float        Actual total PnL
mean_random_pnl  float        Mean PnL under randomized outcomes
std_random_pnl   float        Standard deviation of randomized PnLs
p_value          float        Fraction of random runs with PnL >= observed
n_simulations    int          Number of randomization runs
simulated_pnls   list[float]  All simulated PnLs
is_significant   bool         True if p_value < 0.05
For closed round-trips (buy then sell before expiry), the PnL is fixed regardless of the final outcome. Only open positions held to resolution contribute to variance across simulations. For each simulation i:
  1. For each market m, draw o_m^(i) ~ Bernoulli(p_m), where p_m is the implied probability
  2. Compute resolution PnL from the open positions using o_m^(i)
  3. Total PnL = (fixed round-trip PnL) + (random resolution PnL)
This decomposition makes the simulation fast: only the open positions need to be re-evaluated per simulation.
If all your positions are closed before resolution, every simulation produces the same PnL. In that case, outcome randomization is uninformative — use the permutation test instead.
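A minimal sketch of this decomposition, assuming very simple position bookkeeping. The function and argument names are illustrative assumptions, not the horizon API:

```python
import numpy as np

def randomized_pnls(fixed_pnl, open_positions, probs,
                    n_simulations=10000, seed=42):
    """Illustrative sketch of outcome randomization via the
    fixed/random decomposition.

    fixed_pnl:      PnL locked in by closed round-trips (the constant term)
    open_positions: {market_id: (size, avg_buy_price)} held to resolution
    probs:          {market_id: implied probability p_m}
    """
    rng = np.random.default_rng(seed)
    markets = list(open_positions)
    sizes = np.array([open_positions[m][0] for m in markets])
    prices = np.array([open_positions[m][1] for m in markets])
    p = np.array([probs[m] for m in markets])

    # Step 1: draw o_m^(i) ~ Bernoulli(p_m) for every simulation at once
    draws = rng.random((n_simulations, len(markets))) < p
    # Step 2: resolution PnL per open position is size * (outcome - buy price)
    resolution_pnl = (draws - prices) @ sizes
    # Step 3: constant round-trip PnL plus the random resolution PnL
    return fixed_pnl + resolution_pnl
```

Because only the open positions vary, each simulation costs one vectorized draw rather than a full backtest replay.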

PnL Path Simulation

Tests whether your drawdown and loss streaks were unusually good or bad. Total PnL is invariant to ordering (same sum), but path-dependent statistics like max drawdown and consecutive losses depend heavily on which order trades occurred. This test shuffles the order of round-trip PnLs and computes a distribution of these path statistics.

Signature

result = hz.path_simulation(
    result,                  # BacktestResult with trades
    n_simulations=10000,     # Number of shuffled paths
    seed=42,                 # Reproducibility
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["market"],
)

ps = hz.path_simulation(backtest_result, n_simulations=10000, seed=42)

print(f"Observed max DD:     {ps.observed_max_drawdown:.2f}")
print(f"Mean simulated DD:   {ps.mean_max_drawdown:.2f}")
print(f"DD percentile:       {ps.p_value_drawdown:.2%}")
print(f"Max consec losses:   {ps.max_consecutive_losses_observed}")
print(f"Mean consec losses:  {ps.mean_consecutive_losses:.1f}")

# If p_value_drawdown > 0.5, your path was luckier than average
# If p_value_drawdown < 0.5, your path had worse drawdowns than average

PathSimulationResult

Field                             Type         Description
observed_max_drawdown             float        Actual max drawdown
observed_terminal_equity          float        Final equity (invariant across shuffles)
mean_max_drawdown                 float        Mean max DD across shuffled paths
std_max_drawdown                  float        Std of simulated max drawdowns
p_value_drawdown                  float        Fraction of paths with DD >= observed
max_consecutive_losses_observed   int          Actual longest loss streak
mean_consecutive_losses           float        Mean longest loss streak across paths
n_simulations                     int          Number of shuffled paths
simulated_max_drawdowns           list[float]  All simulated max DDs
simulated_max_consecutive_losses  list[int]    All simulated loss streaks
Consider two orderings of the same trade PnLs: [+10, -5, +10, -5] and [-5, -5, +10, +10]. Both have the same total PnL (+10), but:
  • Sequence 1: max drawdown = 5 (the equity path never falls more than 5 below its running peak)
  • Sequence 2: max drawdown = 10 (the path sinks to -10 before recovering)
With real trade sequences containing dozens or hundreds of trades, the variance in max drawdown across orderings can be substantial. Path simulation tells you where your actual sequence falls in this distribution.
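The shuffling procedure can be sketched directly. `max_drawdown` and `drawdown_percentile` here are illustrative helpers, not the horizon API:

```python
import numpy as np

def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative PnL path (starting at 0)."""
    equity = np.concatenate(([0.0], np.cumsum(pnls)))
    peaks = np.maximum.accumulate(equity)
    return float(np.max(peaks - equity))

def drawdown_percentile(pnls, n_simulations=10000, seed=42):
    """Shuffle trade order and report the fraction of shuffled paths whose
    max drawdown is at least as bad as the observed one (illustrative)."""
    rng = np.random.default_rng(seed)
    observed = max_drawdown(pnls)
    sims = np.array([max_drawdown(rng.permutation(pnls))
                     for _ in range(n_simulations)])
    return observed, float(np.mean(sims >= observed))
```

Total PnL is the same for every permutation; only path statistics such as the max drawdown change, which is exactly what this test measures.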

Convenience Wrapper

Run all three tests with a single call:
report = hz.robustness_test(
    result,                          # BacktestResult
    tests=["permutation", "outcome", "path"],  # Which tests to run
    n_simulations=1000,              # Simulations per test
    seed=42,                         # Reproducibility
)
Or call it directly on the result:
report = result.robustness(n_simulations=1000, seed=42)
The wrapper automatically skips tests that require unavailable data:
  • permutation and outcome require outcomes with >= 2 markets
  • path requires at least 1 trade
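The skip rules can be pictured with a small guard function. The attribute names (`outcomes`, `trades`) are assumptions for illustration, not the documented BacktestResult API:

```python
from types import SimpleNamespace

def applicable_tests(result):
    """Illustrative sketch of the wrapper's skip logic described above."""
    tests = []
    outcomes = getattr(result, "outcomes", None) or {}
    if len(outcomes) >= 2:
        tests += ["permutation", "outcome"]  # need outcomes for >= 2 markets
    if getattr(result, "trades", None):
        tests.append("path")                 # need at least 1 trade
    return tests

# A result with two resolved markets and one trade qualifies for all three;
# SimpleNamespace stands in for a BacktestResult here.
demo = SimpleNamespace(outcomes={"mkt_a": 1.0, "mkt_b": 0.0}, trades=["t1"])
```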

RobustnessReport

Field        Type                               Description
permutation  PermutationTestResult | None       Event permutation result
outcome      OutcomeRandomizationResult | None  Outcome randomization result
path         PathSimulationResult | None        Path simulation result
is_robust    bool                               True if all executed tests pass

Interpreting Results

p-values

p-value      Interpretation
< 0.01       Strong evidence of genuine edge
0.01 - 0.05  Significant evidence
0.05 - 0.10  Marginal — investigate further
> 0.10       Insufficient evidence that edge is real

Which test tells you what

Test         Question answered
Permutation  “Did I pick the right markets?”
Outcome      “Were the resolutions luckier than market pricing implied?”
Path         “Was my drawdown path unusually good/bad?”
  1. Run all three tests with >= 1000 simulations
  2. If permutation test fails: your market selection may be random
  3. If outcome test fails: your calibration edge may not be real
  4. If path test shows unfavorable drawdown: size down — your worst drawdown is likely ahead
A passing robustness test does not guarantee future profitability. It only confirms that your backtest results are unlikely to be explained by luck alone. Out-of-sample validation via walk-forward optimization remains essential.