# Market Data Collector
Polymarket and Kalshi don’t retain all historical data indefinitely. Orderbook depth, trade history, and market metadata can disappear. The collector module creates a persistent point-in-time (PIT) data archive that captures:

- Full L2 orderbooks: every bid/ask level at configurable intervals
- Trade-by-trade history: individual trades with deduplication
- Market metadata: names, tokens, volumes, liquidity, status
- Event metadata: event-level data with associated markets
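These four data types map naturally onto archive tables. As a minimal sketch of what the storage could look like, here is an illustrative SQLite schema whose column names follow the Schema Reference below (the DDL itself, and the choice of SQLite, are assumptions for illustration, not the module's actual backend):

```python
import sqlite3

# Illustrative SQLite schema mirroring the collector's archive tables.
# Column names follow the Schema Reference; the DDL is a sketch.
DDL = """
CREATE TABLE IF NOT EXISTS orderbook_snapshots (
    market_id TEXT,
    exchange  TEXT,
    bids_json TEXT,
    asks_json TEXT,
    best_bid  REAL,
    best_ask  REAL,
    mid_price REAL,
    spread    REAL,
    timestamp REAL
);
CREATE TABLE IF NOT EXISTS market_trades (
    trade_id  TEXT PRIMARY KEY,  -- unique id enables trade deduplication
    market_id TEXT,
    price     REAL,
    size      REAL,
    side      TEXT,
    timestamp REAL
);
CREATE TABLE IF NOT EXISTS market_metadata (
    market_id TEXT,
    exchange  TEXT,
    question  TEXT,
    volume    REAL,
    liquidity REAL,
    status    TEXT,
    PRIMARY KEY (market_id, exchange)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The `trade_id` primary key is what makes trade deduplication cheap: re-inserting an already-seen trade simply violates the constraint and can be ignored.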
## Two Modes
### Pipeline Mode
Integrate directly into your `hz.run()` strategy to record orderbooks alongside trading:
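The pattern can be sketched as follows. The callback shape, the snapshot fields, and the `record_snapshot` helper are illustrative assumptions, not the library's actual API; in a real strategy, `hz.run()` would deliver the book updates rather than the simulated dict used here:

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orderbook_snapshots (
           market_id TEXT, exchange TEXT, bids_json TEXT, asks_json TEXT,
           best_bid REAL, best_ask REAL, mid_price REAL, spread REAL,
           timestamp REAL)"""
)

def record_snapshot(conn, market_id, exchange, bids, asks, ts):
    """Persist one L2 snapshot; bids/asks are [[price, size], ...] lists."""
    best_bid, best_ask = bids[0][0], asks[0][0]
    conn.execute(
        "INSERT INTO orderbook_snapshots VALUES (?,?,?,?,?,?,?,?,?)",
        (market_id, exchange, json.dumps(bids), json.dumps(asks),
         best_bid, best_ask, (best_bid + best_ask) / 2,
         best_ask - best_bid, ts),
    )

def on_orderbook(book):
    # Record first, then hand off to trading logic (omitted here).
    record_snapshot(conn, book["market_id"], book["exchange"],
                    book["bids"], book["asks"], book["ts"])

# Simulated update, standing in for what the strategy loop would deliver:
on_orderbook({"market_id": "mkt-1", "exchange": "polymarket",
              "bids": [[0.42, 100.0]], "asks": [[0.44, 80.0]],
              "ts": time.time()})
```

Recording before the trading logic runs keeps the archive complete even when a strategy raises or skips an update.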
### Standalone Daemon
Run as a 24/7 passive collector without any trading:
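A daemon of this shape is essentially a polling loop. In the sketch below, the `fetch_book` stub and the SQLite backend are assumptions standing in for the real exchange client; `max_iterations` exists only so the example terminates, where a real daemon would loop until stopped:

```python
import sqlite3
import time

def run_collector(conn, fetch_book, interval_s, max_iterations=None):
    """Poll fetch_book() every interval_s seconds and archive each snapshot."""
    n = 0
    while max_iterations is None or n < max_iterations:
        book = fetch_book()
        conn.execute(
            "INSERT INTO orderbook_snapshots (market_id, best_bid, best_ask, timestamp) "
            "VALUES (?,?,?,?)",
            (book["market_id"], book["bids"][0][0], book["asks"][0][0], time.time()),
        )
        n += 1
        time.sleep(interval_s)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orderbook_snapshots "
    "(market_id TEXT, best_bid REAL, best_ask REAL, timestamp REAL)"
)

def fake_fetch():  # stands in for a real exchange API call
    return {"market_id": "mkt-1", "bids": [[0.42, 100.0]], "asks": [[0.44, 80.0]]}

run_collector(conn, fake_fetch, interval_s=0, max_iterations=3)
```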
## Parquet Export

Export collected data to Parquet for batch analysis with pandas, polars, or DuckDB. Requires `pyarrow`:
## Query Helpers
Query collected data directly:
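Helpers of this kind reduce to small parameterized queries against the archive tables. `latest_mid` and `trades_between` below are hypothetical examples, not the module's actual helper names:

```python
import sqlite3

def latest_mid(conn, market_id):
    """Most recent mid price for a market, or None if no snapshots exist."""
    row = conn.execute(
        "SELECT mid_price FROM orderbook_snapshots "
        "WHERE market_id = ? ORDER BY timestamp DESC LIMIT 1",
        (market_id,),
    ).fetchone()
    return row[0] if row else None

def trades_between(conn, market_id, t0, t1):
    """All trades for a market in [t0, t1), oldest first."""
    return conn.execute(
        "SELECT trade_id, price, size, side, timestamp FROM market_trades "
        "WHERE market_id = ? AND timestamp >= ? AND timestamp < ? "
        "ORDER BY timestamp",
        (market_id, t0, t1),
    ).fetchall()

# Demonstration on a small in-memory archive:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderbook_snapshots (market_id TEXT, mid_price REAL, timestamp REAL)")
conn.execute(
    "CREATE TABLE market_trades "
    "(trade_id TEXT, market_id TEXT, price REAL, size REAL, side TEXT, timestamp REAL)"
)
conn.execute("INSERT INTO orderbook_snapshots VALUES ('mkt-1', 0.43, 1700000000.0)")
conn.execute("INSERT INTO orderbook_snapshots VALUES ('mkt-1', 0.45, 1700000005.0)")
conn.execute("INSERT INTO market_trades VALUES ('t1', 'mkt-1', 0.44, 10.0, 'buy', 1700000002.0)")
```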
## Programmatic Access (Rust)

For maximum control, use the `RustCollector` directly:
## Schema Reference
### orderbook_snapshots
| Column | Type | Description |
|---|---|---|
| market_id | TEXT | Market identifier |
| exchange | TEXT | Exchange name |
| bids_json | TEXT | JSON array `[[price, size], ...]` |
| asks_json | TEXT | JSON array `[[price, size], ...]` |
| best_bid | REAL | Top-of-book bid price |
| best_ask | REAL | Top-of-book ask price |
| mid_price | REAL | `(best_bid + best_ask) / 2` |
| spread | REAL | `best_ask - best_bid` |
| timestamp | REAL | Source timestamp (UNIX epoch) |
### market_trades
| Column | Type | Description |
|---|---|---|
| trade_id | TEXT | Unique trade identifier |
| market_id | TEXT | Market identifier |
| price | REAL | Trade price |
| size | REAL | Trade size |
| side | TEXT | `buy` or `sell` |
| timestamp | REAL | Trade timestamp |
### market_metadata
| Column | Type | Description |
|---|---|---|
| market_id | TEXT | Market identifier (PK) |
| exchange | TEXT | Exchange name (PK) |
| question | TEXT | Market question |
| volume | REAL | Total volume |
| liquidity | REAL | Current liquidity |
| status | TEXT | Market status |
## Storage Estimates
| Data Type | Interval | Per Market/Day | 10 Markets/Month |
|---|---|---|---|
| Orderbook (20 levels) | 5s | ~35 MB | ~10.5 GB |
| Orderbook (20 levels) | 30s | ~6 MB | ~1.8 GB |
| Trades | 10s polls | ~5 MB | ~1.5 GB |
| Metadata | 5 min | ~0.5 MB | ~150 MB |
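The per-day figures above are consistent with the snapshot rate times an assumed on-disk size of roughly 2 KB per 20-level snapshot row (the 2 KB figure is an assumption chosen here to reproduce the table, not a measured value):

```python
SNAPSHOT_BYTES = 2048  # assumed size of one 20-level snapshot row

def per_day_mb(interval_s, row_bytes):
    """Storage per market per day, in decimal MB."""
    snapshots_per_day = 86_400 / interval_s
    return snapshots_per_day * row_bytes / 1e6

def per_month_gb(interval_s, row_bytes, markets=10, days=30):
    """Storage for several markets over a month, in decimal GB."""
    return per_day_mb(interval_s, row_bytes) * markets * days / 1e3

# 5 s interval: ~35 MB per market/day, ~10.6 GB for 10 markets over a month
print(round(per_day_mb(5, SNAPSHOT_BYTES), 1), round(per_month_gb(5, SNAPSHOT_BYTES), 1))
# → 35.4 10.6
```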
Use `retention_hours` and `purge()` to control storage growth. Parquet export followed by `purge()` is recommended for long-term archiving.
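A retention purge reduces to deleting rows older than the cutoff. The function below matches the `retention_hours`/`purge()` knobs in spirit, but its implementation is a sketch over an assumed SQLite backend, not the library's:

```python
import sqlite3
import time

def purge(conn, retention_hours, now=None):
    """Delete snapshots older than the retention window; returns rows removed."""
    now = time.time() if now is None else now
    cutoff = now - retention_hours * 3600
    cur = conn.execute("DELETE FROM orderbook_snapshots WHERE timestamp < ?", (cutoff,))
    return cur.rowcount

# Demonstration: one stale row (48 h old) and one fresh row (60 s old).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orderbook_snapshots (market_id TEXT, timestamp REAL)")
now = 1_700_000_000.0
conn.execute("INSERT INTO orderbook_snapshots VALUES ('mkt-1', ?)", (now - 48 * 3600,))
conn.execute("INSERT INTO orderbook_snapshots VALUES ('mkt-1', ?)", (now - 60.0,))
removed = purge(conn, retention_hours=24, now=now)
```

Exporting to Parquet first, then purging, keeps the live database small while preserving the full history in the archive files.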