Pro Feature. Requires a Pro or Ultra subscription. Get started at api.mathematicalcompany.com

Robustness Testing

Three statistical tests designed specifically for prediction markets. They answer a question every trader must ask: “Is my edge real, or did I just get lucky?”

Event Permutation

Shuffle which outcomes map to which markets. Tests whether your market selection was skillful.

Outcome Randomization

Keep trades fixed, re-draw binary outcomes from Bernoulli(p). Tests whether you got lucky on resolutions.

Path Simulation

Shuffle trade ordering to test if your drawdown and loss streaks were unusually good or bad.

Quick Start

import horizon as hz

result = hz.backtest(
    data=historical_data,
    pipeline=[my_strategy],
    markets=["mkt_a", "mkt_b", "mkt_c"],
    outcomes={"mkt_a": 1.0, "mkt_b": 0.0, "mkt_c": 1.0},
)

# Run all applicable tests
report = result.robustness(n_simulations=10000, seed=42)
print(report.summary())
print(f"Strategy is robust: {report.is_robust}")
Output:
==================================================
  ROBUSTNESS REPORT
==================================================

Permutation Test (10000 permutations)
  Observed PnL:   +12.5000
  Mean permuted:  +2.1000
  Std permuted:   8.3000
  p-value:        0.0340 (SIGNIFICANT)

Outcome Randomization (10000 simulations)
  Observed PnL:   +12.5000
  Mean random:    -1.2000
  Std random:     6.8000
  p-value:        0.0120 (SIGNIFICANT)

Path Simulation (10000 simulations)
  Observed max DD:     3.2000
  Mean simulated DD:   5.1000
  Std simulated DD:    1.8000
  DD percentile:       0.7800 (FAVORABLE)
  Observed max consec losses: 2
  Mean max consec losses:     3.1

  Overall: ROBUST
==================================================

Event Permutation Test

Tests whether your strategy’s market selection was genuinely skillful. Because prediction market events are independent, which resolved outcome is attached to which market shouldn’t matter under the null hypothesis of no selection skill. This test shuffles the outcome-to-market assignment and recomputes PnL for each permutation. If your observed PnL falls in the upper tail of the permuted distribution, your selection was skillful (not lucky).

Signature

result = hz.permutation_test(
    result,                  # BacktestResult with outcomes
    n_permutations=1000,     # Number of permutations
    seed=42,                 # Reproducibility
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["election", "fed-rate", "gdp"],
    outcomes={"election": 1.0, "fed-rate": 0.0, "gdp": 1.0},
)

perm = hz.permutation_test(backtest_result, n_permutations=5000, seed=42)

print(f"Observed PnL:  {perm.observed_pnl:+.2f}")
print(f"Mean permuted: {perm.mean_permuted_pnl:+.2f}")
print(f"p-value:       {perm.p_value:.4f}")
print(f"Significant:   {perm.is_significant}")

PermutationTestResult

Field              Type         Description
observed_pnl       float        Actual total PnL from the backtest
mean_permuted_pnl  float        Mean PnL across permuted orderings
std_permuted_pnl   float        Standard deviation of permuted PnLs
p_value            float        Fraction of permutations with PnL >= observed
n_permutations     int          Number of permutations executed
permuted_pnls      list[float]  All permuted PnLs (for histograms)
is_significant     bool         True if p_value < 0.05
Let Π be the set of all permutations of outcome assignments. For each permutation σ ∈ Π:

  PnL_σ = Σ_m PnL(m, σ(o_m))

The p-value is computed conservatively:

  p = (#{σ : PnL_σ >= PnL_observed} + 1) / (N + 1)

Under H_0 (no selection skill), the observed PnL should be typical of the permuted distribution. A small p-value rejects H_0.
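The conservative p-value above can be sketched in a few lines. This is an illustrative standalone implementation, not the horizon internals; `pnl_fn` (a callable mapping a {market: outcome} dict to total PnL) and the helper name are assumptions for the sketch:

```python
import numpy as np

def permutation_pnl_test(pnl_fn, outcomes, n_permutations=1000, seed=42):
    """Illustrative sketch of the conservative permutation p-value.

    pnl_fn: callable taking a {market_id: outcome} dict, returning total PnL.
    outcomes: the actually observed {market_id: outcome} mapping.
    """
    rng = np.random.default_rng(seed)
    markets = list(outcomes)
    values = np.array([outcomes[m] for m in markets])
    observed = pnl_fn(dict(zip(markets, values)))

    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle which outcome is assigned to which market
        permuted[i] = pnl_fn(dict(zip(markets, rng.permutation(values))))

    # Conservative estimator: p = (#{PnL_sigma >= PnL_observed} + 1) / (N + 1)
    p_value = (np.count_nonzero(permuted >= observed) + 1) / (n_permutations + 1)
    return observed, permuted, p_value
```

The +1 terms guarantee p > 0 even when no permutation beats the observed PnL, keeping the estimator valid for small N.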

Outcome Randomization

Tests whether your realized outcomes were luckier than expected given market pricing. Keeps every trade exactly as-is (same prices, sizes, timing). For each simulation, re-draws each market’s binary outcome from Bernoulli(p) where p is the market’s implied probability (average buy price). If your strategy has genuinely better calibration than the market, your real PnL will sit in the upper tail.

Signature

result = hz.outcome_randomization(
    result,                  # BacktestResult with outcomes
    n_simulations=10000,     # Number of Bernoulli resamples
    seed=42,                 # Reproducibility
    probabilities=None,      # Optional explicit {market_id: prob}
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["election", "fed-rate"],
    outcomes={"election": 1.0, "fed-rate": 0.0},
)

oc = hz.outcome_randomization(backtest_result, n_simulations=10000, seed=42)

print(f"Observed PnL:  {oc.observed_pnl:+.2f}")
print(f"Mean random:   {oc.mean_random_pnl:+.2f}")
print(f"p-value:       {oc.p_value:.4f}")
print(f"Significant:   {oc.is_significant}")

OutcomeRandomizationResult

Field            Type         Description
observed_pnl     float        Actual total PnL
mean_random_pnl  float        Mean PnL under randomized outcomes
std_random_pnl   float        Standard deviation of randomized PnLs
p_value          float        Fraction of random runs with PnL >= observed
n_simulations    int          Number of randomization runs
simulated_pnls   list[float]  All simulated PnLs
is_significant   bool         True if p_value < 0.05
For closed round-trips (buy then sell before expiry), the PnL is fixed regardless of the final outcome. Only open positions held to resolution contribute to variance across simulations. For each simulation i:
  1. For each market m, draw o_m^(i) ~ Bernoulli(p_m), where p_m is the implied probability
  2. Compute resolution PnL from the open positions using o_m^(i)
  3. Total PnL = (fixed round-trip PnL) + (random resolution PnL)
This decomposition makes the simulation fast: only the open positions need to be re-evaluated per simulation.
If all your positions are closed before resolution, every simulation produces the same PnL. In that case, outcome randomization is uninformative — use the permutation test instead.
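A minimal sketch of this decomposition, assuming very simple position bookkeeping. The function and argument names are illustrative assumptions, not the horizon API:

```python
import numpy as np

def randomized_pnls(fixed_pnl, open_positions, probs,
                    n_simulations=10000, seed=42):
    """Illustrative sketch of outcome randomization via the
    fixed/random decomposition.

    fixed_pnl:      PnL locked in by closed round-trips (the constant term)
    open_positions: {market_id: (size, avg_buy_price)} held to resolution
    probs:          {market_id: implied probability p_m}
    """
    rng = np.random.default_rng(seed)
    markets = list(open_positions)
    sizes = np.array([open_positions[m][0] for m in markets])
    prices = np.array([open_positions[m][1] for m in markets])
    p = np.array([probs[m] for m in markets])

    # Step 1: draw o_m^(i) ~ Bernoulli(p_m) for every simulation at once
    draws = rng.random((n_simulations, len(markets))) < p
    # Step 2: resolution PnL per open position is size * (outcome - buy price)
    resolution_pnl = (draws - prices) @ sizes
    # Step 3: constant round-trip PnL plus the random resolution PnL
    return fixed_pnl + resolution_pnl
```

Because only the open positions vary, each simulation costs one vectorized draw rather than a full backtest replay.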

PnL Path Simulation

Tests whether your drawdown and loss streaks were unusually good or bad. Total PnL is invariant to ordering (same sum), but path-dependent statistics like max drawdown and consecutive losses depend heavily on which order trades occurred. This test shuffles the order of round-trip PnLs and computes a distribution of these path statistics.

Signature

result = hz.path_simulation(
    result,                  # BacktestResult with trades
    n_simulations=10000,     # Number of shuffled paths
    seed=42,                 # Reproducibility
)

Example

import horizon as hz

backtest_result = hz.backtest(
    data=data,
    pipeline=[strategy],
    markets=["market"],
)

ps = hz.path_simulation(backtest_result, n_simulations=10000, seed=42)

print(f"Observed max DD:     {ps.observed_max_drawdown:.2f}")
print(f"Mean simulated DD:   {ps.mean_max_drawdown:.2f}")
print(f"DD percentile:       {ps.p_value_drawdown:.2%}")
print(f"Max consec losses:   {ps.max_consecutive_losses_observed}")
print(f"Mean consec losses:  {ps.mean_consecutive_losses:.1f}")

# If p_value_drawdown > 0.5, your path was luckier than average
# If p_value_drawdown < 0.5, your path had worse drawdowns than average

PathSimulationResult

Field                             Type         Description
observed_max_drawdown             float        Actual max drawdown
observed_terminal_equity          float        Final equity (invariant across shuffles)
mean_max_drawdown                 float        Mean max DD across shuffled paths
std_max_drawdown                  float        Std of simulated max drawdowns
p_value_drawdown                  float        Fraction of paths with DD >= observed
max_consecutive_losses_observed   int          Actual longest loss streak
mean_consecutive_losses           float        Mean longest loss streak across paths
n_simulations                     int          Number of shuffled paths
simulated_max_drawdowns           list[float]  All simulated max DDs
simulated_max_consecutive_losses  list[int]    All simulated loss streaks
Consider two orderings of the same trade PnLs: [+10, -5, +10, -5] and [-5, -5, +10, +10]. Both have the same total PnL (+10), but:
  • Sequence 1: max drawdown = 5 (the equity path never falls more than 5 below its running peak)
  • Sequence 2: max drawdown = 10 (the path sinks to -10 before recovering)
With real trade sequences containing dozens or hundreds of trades, the variance in max drawdown across orderings can be substantial. Path simulation tells you where your actual sequence falls in this distribution.
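The shuffling procedure can be sketched directly. `max_drawdown` and `drawdown_percentile` here are illustrative helpers, not the horizon API:

```python
import numpy as np

def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative PnL path (starting at 0)."""
    equity = np.concatenate(([0.0], np.cumsum(pnls)))
    peaks = np.maximum.accumulate(equity)
    return float(np.max(peaks - equity))

def drawdown_percentile(pnls, n_simulations=10000, seed=42):
    """Shuffle trade order and report the fraction of shuffled paths whose
    max drawdown is at least as bad as the observed one (illustrative)."""
    rng = np.random.default_rng(seed)
    observed = max_drawdown(pnls)
    sims = np.array([max_drawdown(rng.permutation(pnls))
                     for _ in range(n_simulations)])
    return observed, float(np.mean(sims >= observed))
```

Total PnL is the same for every permutation; only path statistics such as the max drawdown change, which is exactly what this test measures.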

Convenience Wrapper

Run all three tests with a single call:
report = hz.robustness_test(
    result,                          # BacktestResult
    tests=["permutation", "outcome", "path"],  # Which tests to run
    n_simulations=1000,              # Simulations per test
    seed=42,                         # Reproducibility
)
Or call it directly on the result:
report = result.robustness(n_simulations=1000, seed=42)
The wrapper automatically skips tests that require unavailable data:
  • permutation and outcome require outcomes with >= 2 markets
  • path requires at least 1 trade
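The skip rules can be pictured with a small guard function. The attribute names (`outcomes`, `trades`) are assumptions for illustration, not the documented BacktestResult API:

```python
from types import SimpleNamespace

def applicable_tests(result):
    """Illustrative sketch of the wrapper's skip logic described above."""
    tests = []
    outcomes = getattr(result, "outcomes", None) or {}
    if len(outcomes) >= 2:
        tests += ["permutation", "outcome"]  # need outcomes for >= 2 markets
    if getattr(result, "trades", None):
        tests.append("path")                 # need at least 1 trade
    return tests

# A result with two resolved markets and one trade qualifies for all three;
# SimpleNamespace stands in for a BacktestResult here.
demo = SimpleNamespace(outcomes={"mkt_a": 1.0, "mkt_b": 0.0}, trades=["t1"])
```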

RobustnessReport

Field        Type                               Description
permutation  PermutationTestResult | None       Event permutation result
outcome      OutcomeRandomizationResult | None  Outcome randomization result
path         PathSimulationResult | None        Path simulation result
is_robust    bool                               True if all executed tests pass

Interpreting Results

p-values

p-value      Interpretation
< 0.01       Strong evidence of genuine edge
0.01 - 0.05  Significant evidence
0.05 - 0.10  Marginal — investigate further
> 0.10       Insufficient evidence that edge is real

Which test tells you what

Test         Question answered
Permutation  “Did I pick the right markets?”
Outcome      “Were the resolutions luckier than market pricing implied?”
Path         “Was my drawdown path unusually good/bad?”
  1. Run all three tests with >= 1000 simulations
  2. If permutation test fails: your market selection may be random
  3. If outcome test fails: your calibration edge may not be real
  4. If path test shows unfavorable drawdown: size down — your worst drawdown is likely ahead
A passing robustness test does not guarantee future profitability. It only confirms that your backtest results are unlikely to be explained by luck alone. Out-of-sample validation via walk-forward optimization remains essential.