How to Scrape Prediction Markets Data Using Python + MCP
Prediction markets are moving serious money - Kalshi alone hit $2.3 billion in trading volume in a single week in December 2025, with Polymarket and Kalshi combining for over $44 billion annually. Both platforms offer APIs, but here's the catch: Polymarket's free tier caps you at 1,000 calls/hour, and WebSocket feeds sit behind a $99/month paywall. For real-time arbitrage detection across multiple markets, those limits don't cut it.
Quick Summary (TL;DR)
1. Official APIs exist, but rate limits (1,000/hr) and paywalled WebSocket feeds make high-frequency arbitrage detection impractical
2. Playwright enables protocol-level network interception to capture real-time market data beyond API constraints
3. Mobile proxies maintain stable sessions for sustained polling without triggering rate limits or IP blocks
4. MCP AI agents adapt to frontend changes automatically, eliminating brittle CSS selector maintenance
5. Store scraped data in time-series databases for cross-platform arbitrage signal detection
Most tutorials point you to the official APIs and call it a day. That works for casual monitoring, but real-time arbitrage requires polling dozens of markets per second across both platforms. At 1,000 calls/hour, you're stuck checking each market once every few minutes - an eternity when price discrepancies close in seconds.
Why API Limits Push You Toward Scraping
Polymarket's API is solid for basic use, but high-frequency traders hit walls quickly. The free tier's rate limits mean you can't monitor 50+ markets in real-time simultaneously. Premium WebSocket access ($99/month) helps, but still throttles during high-volatility events when you need data most. And their bot detection kicks in if you hammer endpoints too aggressively.
Kalshi requires OAuth tokens with short refresh cycles for API access. Scraping the public frontend seems easier until market IDs start rotating and the DOM structure changes based on A/B test cohorts - a fun discovery to make at 2am when the scraper stops working.
Apify templates promise "no-code prediction market scraping" but they're just Puppeteer scripts with residential proxy rotation. Common scenario: actors hit memory limits and crash after a handful of successful extractions, burning through proxy credits for minimal data yield.
Rate Limits Are Dynamic
Both platforms adjust rate limits during high-traffic events (elections, major sports). A scraper that works fine on a quiet Tuesday might get throttled hard during a presidential debate. Build in adaptive backoff and multiple fallback strategies.
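The adaptive backoff mentioned above can be sketched as a small retry helper. This is a generic pattern, not tied to either platform's API: `fetch` here is a stand-in for whatever request function you use, and treating HTTP 429 as the throttle signal is an assumption about how rate limiting surfaces.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: window grows 1s, 2s, 4s..., capped."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, max_attempts: int = 6):
    """Call fetch() -> (status, body); on a 429, sleep a jittered delay and retry."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("still rate-limited after retries")
```

Full jitter (a uniform draw over the whole window) spreads retries out so multiple scraper workers don't all hammer the endpoint again at the same instant.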
The Playwright + Mobile Proxy Approach
Playwright solves the JavaScript execution problem. Unlike Selenium, it intercepts network traffic at the protocol level, which means capturing WebSocket frames, modifying headers mid-flight, and blocking unnecessary resources to speed up page loads.
Scraping Kalshi starts with launching a persistent browser context that mimics real user behavior: random mouse movements, variable scroll speeds, and 2-4 second pauses between clicks. The critical piece, though, is network interception.
```python
from playwright.sync_api import sync_playwright

captured_data = []

def intercept_market_data(route):
    if '/api/v2/markets' in route.request.url:
        response = route.fetch()
        data = response.json()
        captured_data.append(data)        # Store for processing
        route.fulfill(response=response)  # Forward original response
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": "http://mobile-proxy.voidmob.com:8080",
        "username": "user",
        "password": "pass"
    })
    page = browser.new_page()
    page.route('**/*', intercept_market_data)
    page.goto('https://kalshi.com/markets')
```
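The human-like behavior described earlier (random mouse paths, variable pauses) can be sketched as a few helpers on top of Playwright's sync mouse API. The jitter amounts, step count, and pause range here are arbitrary illustrative choices, not values either platform is known to check for.

```python
import random
import time

def human_pause(lo: float = 2.0, hi: float = 4.0) -> float:
    """Random 2-4 second pause between actions."""
    return random.uniform(lo, hi)

def mouse_path(x0, y0, x1, y1, steps: int = 12):
    """Jittered straight-line path so the cursor doesn't teleport in one jump."""
    path = []
    for i in range(steps + 1):
        t = i / steps
        path.append((x0 + (x1 - x0) * t + random.uniform(-3, 3),
                     y0 + (y1 - y0) * t + random.uniform(-3, 3)))
    return path

def humanize_click(page, x: int, y: int) -> None:
    """Walk the cursor along the path, pause, then click (Playwright sync API)."""
    for px, py in mouse_path(0, 0, x, y):
        page.mouse.move(px, py)
    time.sleep(human_pause())
    page.mouse.click(x, y)
```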
Mobile proxies matter more than most guides admit. Datacenter IPs get flagged quickly on Polymarket - sustained traffic through AWS proxies typically results in rapid blocking. Residential proxies work better but typically cost $3-9/GB and throttle after sustained traffic.
Carrier-grade mobile proxies from real 4G/5G devices handle sustained polling because they present as authentic smartphone traffic. CGNAT places many customers behind a single IP address, making mobile traffic look like legitimate multi-user patterns. Mobile IPs maintain stable sessions for hours compared to datacenter alternatives that get flagged within minutes.
| Proxy Type | Block Rate | Cost/GB | Session Stability |
|---|---|---|---|
| Datacenter | High | $1-3 | Under 1 minute |
| Residential | Moderate | $3-9 | 3-5 minutes |
| Mobile (carrier) | Low | $5-10 | 60+ minutes |
Adding MCP for Adaptive Scraping
Model Context Protocol (MCP) changes the game for prediction market odds scraping. Instead of hardcoding CSS selectors that break when Polymarket updates its frontend, MCP AI agents discover DOM structures dynamically and adjust extraction logic in real time.
The point: you're no longer hand-writing brittle XPath queries. The MCP server exposes "tools" the AI can invoke - find_odds_container, extract_volume_data, detect_market_status. When the page structure shifts, the agent queries the available tools, tests selectors, and picks whichever method works.
```python
# MCP tool definition
@mcp.tool()
async def extract_polymarket_odds(market_url: str):
    """Dynamically locate and extract current odds for a market"""
    page = await browser.new_page()
    await page.goto(market_url)

    # AI agent decides which selector works
    selectors = [
        '.market-odds-primary',
        '[data-testid="odds-display"]',
        'div.odds-container > span'
    ]

    for selector in selectors:
        element = await page.query_selector(selector)
        if element and await element.is_visible():
            return await element.inner_text()

    return None
```
When running this approach on multiple Polymarket events over extended periods, traditional scrapers typically break when CSS classes get renamed. MCP versions adapt automatically, maintaining high uptime. Research shows AI-based scrapers reduce maintenance overhead by 30-40% by adapting to dynamic webpage structures without manual intervention.
Handling Rate Limits and Session Rotation
Playwright proxies need rotation logic or IP reputation burns fast. Prediction markets track request patterns - hitting the same endpoint repeatedly from one IP gets flagged regardless of proxy quality.
Smart rotation means switching IPs based on behavior triggers, not fixed intervals. After a certain number of requests, after a 429 response, after sustained traffic periods. Sticky sessions are needed for markets that require login state because rotating mid-session kills cookies and forces re-authentication.
```python
proxy_pool = [
    {"server": f"http://us-{i}.voidmob.com:8080", "sticky": True}
    for i in range(1, 6)
]

current_proxy = 0
request_count = 0

async def rotate_if_needed():
    global current_proxy, request_count
    if request_count > 40 or detect_rate_limit():
        current_proxy = (current_proxy + 1) % len(proxy_pool)
        request_count = 0
        await browser.new_context(proxy=proxy_pool[current_proxy])
```
Real-time volumes require WebSocket handling. Polymarket streams tick data over WSS connections that standard HTTP clients can't intercept. Playwright captures these natively through the page.on('websocket') event listener.
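A minimal sketch of that listener, assuming JSON frames. The URL filter, helper names, and frame shape are illustrative guesses - Polymarket's actual stream format isn't documented here - but the `page.on("websocket", ...)` / `ws.on("framereceived", ...)` hooks are Playwright's real event API.

```python
import json

tick_buffer = []  # parsed frames accumulate here for downstream processing

def parse_frame(payload):
    """Parse one WebSocket frame as JSON; skip frames that aren't valid JSON."""
    try:
        msg = json.loads(payload)
    except (ValueError, TypeError):
        return None
    tick_buffer.append(msg)
    return msg

def on_websocket(ws):
    """Attach only to market-data sockets (the URL substrings are assumptions)."""
    if "clob" in ws.url or "stream" in ws.url:
        ws.on("framereceived", parse_frame)

# Register before navigation so early sockets are caught:
# page.on("websocket", on_websocket)
# page.goto("https://polymarket.com/...")
```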
For more on time-based proxy rotation strategies, check our deep-dive on safer scraping patterns.
Storing and Processing Market Data
Scraped odds lose value fast. Arbitrage windows close quickly on liquid markets. Streaming data into a time-series database like TimescaleDB or InfluxDB allows querying historical spreads and detecting patterns.
The schema needs timestamps at millisecond precision - market ID, event slug, yes/no odds, total volume, last trade price. Index on timestamp and market_id for fast range queries.
```sql
CREATE TABLE market_snapshots (
    timestamp   TIMESTAMPTZ NOT NULL,
    market_id   TEXT NOT NULL,
    yes_odds    DECIMAL(5,2),
    no_odds     DECIMAL(5,2),
    volume_24h  BIGINT,
    PRIMARY KEY (timestamp, market_id)
);

SELECT create_hypertable('market_snapshots', 'timestamp');
```
Processing pipelines should calculate implied probabilities and compare across platforms. A market trading at different prices on Polymarket versus Kalshi signals potential arbitrage (if both sides can be executed before the spread tightens).
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MarketOdds:
    platform: str
    market_id: str
    yes_price: float  # 0.0 to 1.0
    no_price: float
    timestamp: datetime

def detect_arbitrage(polymarket: MarketOdds, kalshi: MarketOdds, min_spread: float = 0.02):
    """
    Detect arbitrage when you can buy YES on one platform
    and NO on another for combined cost < 1.0
    """
    opportunities = []

    # Strategy 1: Buy YES on Polymarket, NO on Kalshi
    cost_1 = polymarket.yes_price + kalshi.no_price
    if cost_1 < (1.0 - min_spread):
        opportunities.append({
            "strategy": "YES@Polymarket + NO@Kalshi",
            "cost": cost_1,
            "profit_margin": 1.0 - cost_1,
            "polymarket_action": f"BUY YES @ {polymarket.yes_price:.3f}",
            "kalshi_action": f"BUY NO @ {kalshi.no_price:.3f}"
        })

    # Strategy 2: Buy NO on Polymarket, YES on Kalshi
    cost_2 = polymarket.no_price + kalshi.yes_price
    if cost_2 < (1.0 - min_spread):
        opportunities.append({
            "strategy": "NO@Polymarket + YES@Kalshi",
            "cost": cost_2,
            "profit_margin": 1.0 - cost_2,
            "polymarket_action": f"BUY NO @ {polymarket.no_price:.3f}",
            "kalshi_action": f"BUY YES @ {kalshi.yes_price:.3f}"
        })

    return opportunities

# Example: Same event priced differently across platforms
poly_odds = MarketOdds("polymarket", "btc-100k-jan", 0.62, 0.40, datetime.now())
kalshi_odds = MarketOdds("kalshi", "btc-100k-jan", 0.58, 0.44, datetime.now())

# The margin here is exactly 2%, so the threshold must sit below it
# (the strict < comparison excludes spreads equal to min_spread)
arb = detect_arbitrage(poly_odds, kalshi_odds, min_spread=0.01)
# Returns: 2% profit margin buying NO@Polymarket + YES@Kalshi
```
The key insight: prediction markets should price YES + NO = 1.0 (minus fees). When the combined cost across platforms drops below that, you've found an arbitrage window. These typically last seconds, which is why high-frequency polling matters.
Common Issues and Fixes
Cookie persistence breaks when switching proxies mid-session. The solution is using Playwright's storageState to save auth tokens and reload them after rotation.
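A sketch of that pattern using Playwright's real `storage_state` API. The file name and the one-hour freshness threshold are assumptions; the context-level calls are the documented sync-API signatures.

```python
import os
import time

def save_state(context, path: str) -> None:
    """Persist cookies and localStorage to disk before rotating proxies."""
    context.storage_state(path=path)  # Playwright sync API

def restore_context(browser, proxy: dict, path: str):
    """Open a fresh context on the new proxy, preloading the saved auth state."""
    return browser.new_context(proxy=proxy, storage_state=path)

def state_is_fresh(path: str, max_age_s: float = 3600.0) -> bool:
    """Heuristic: treat a saved state file older than an hour as expired."""
    try:
        return (time.time() - os.path.getmtime(path)) < max_age_s
    except OSError:
        return False  # file missing: re-authenticate from scratch
```

Check `state_is_fresh` before reusing a state file; if it's stale, log in again rather than loading expired tokens into the new context.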
WebSocket connections drop randomly on mobile networks. Implementing reconnection logic with exponential backoff helps - first retry after 2 seconds, then 4, then 8, max 60.
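That reconnection schedule can be sketched as a generator plus a retry wrapper; `connect` is a placeholder for whatever opens your WebSocket, and `ConnectionError` stands in for the failure your client actually raises.

```python
import time

def reconnect_delays(first: float = 2.0, cap: float = 60.0):
    """Yield 2, 4, 8, ... seconds, capped at 60, matching the schedule above."""
    delay = first
    while True:
        yield delay
        delay = min(delay * 2, cap)

def reconnect(connect, max_tries: int = 8):
    """Retry connect() on failure, sleeping per the backoff schedule."""
    delays = reconnect_delays()
    for _ in range(max_tries):
        try:
            return connect()
        except ConnectionError:
            time.sleep(next(delays))
    raise ConnectionError("gave up after retries")
```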
MCP agents hallucinate selectors when pages load slowly. Adding explicit waits for network idle state before invoking extraction tools fixes this.
```python
await page.wait_for_load_state('networkidle')
await page.wait_for_selector('.market-container', state='visible')
```
Cloudflare challenges occasionally require CAPTCHA solving. Integrating 2Captcha or CapSolver adds several seconds per solve, which needs to be factored into request timing.
FAQ
1. Why not just use the official Polymarket API?
The free tier works fine for casual monitoring, but caps at 1,000 calls/hour. For real-time arbitrage across 50+ markets, you need faster polling than rate limits allow. Premium WebSocket access ($99/mo) helps but still throttles during peak events.
2. How often should mobile proxies be rotated?
Every 35-50 requests or 6-8 minutes of sustained traffic. Carrier IPs carry higher trust scores thanks to CGNAT, so you can push longer sessions than with residential proxies.
3. Can MCP AI agents handle JavaScript-heavy SPAs?
Yes, because they operate on fully rendered DOM after Playwright executes all scripts. The agent sees the same page state a human user would.
4. What's the minimum hardware for real-time scraping?
4 CPU cores, 8GB RAM handles 3-4 concurrent browser contexts. Scraping 50+ markets simultaneously needs 16GB and SSD storage for browser cache.
5. Do prediction markets block mobile IPs?
Rarely. Mobile traffic from carrier networks appears identical to legitimate app users checking odds on their phones. Mobile IPs show significantly lower block rates compared to datacenter alternatives.
Wrapping Up
Official APIs are fine for casual monitoring, but real-time arbitrage demands more than 1,000 calls/hour across dozens of markets. The winning stack combines Playwright for protocol-level interception, carrier-grade mobile proxies for sustained high-frequency polling, and MCP for adaptive extraction when frontends change.
Real-time volumes and arbitrage signals demand WebSocket handling and millisecond-precision storage. For anyone serious about prediction market data at scale, the API-only approach hits walls fast - building adaptive scraping infrastructure that complements rate-limited APIs is the way forward.
Need reliable mobile proxies for prediction market scraping?
VoidMob provides carrier-grade 4G/5G proxies with sticky sessions and instant rotation - no datacenter fingerprints, no VPN detection. Start scraping with real mobile infrastructure.