How to Scrape Prediction Markets Data Using Python + MCP
Prediction markets are moving serious money - Kalshi alone hit $2.3 billion in trading volume in a single week in December 2025, with Polymarket and Kalshi combining for over $44 billion annually. Both platforms offer APIs, but here's the catch: Polymarket's free tier caps you at 1,000 calls/hour, and WebSocket feeds sit behind a $99/month paywall. For real-time arbitrage detection across multiple markets, those limits don't cut it.
Quick Summary (TL;DR)
1. Official APIs exist, but rate limits (1,000/hr) and paywalled WebSocket feeds make high-frequency arbitrage detection impractical
2. Playwright enables protocol-level network interception to capture real-time market data beyond API constraints
3. Mobile proxies maintain stable sessions for sustained polling without triggering rate limits or IP blocks
4. MCP AI agents adapt to frontend changes automatically, eliminating brittle CSS selector maintenance
5. Store scraped data in time-series databases for cross-platform arbitrage signal detection
Most tutorials point you to the official APIs and call it a day. That works for casual monitoring, but real-time arbitrage requires polling dozens of markets per second across both platforms. At 1,000 calls/hour, you're stuck checking each market once every few minutes - an eternity when price discrepancies close in seconds.
Why API Limits Push You Toward Scraping
Polymarket's API is solid for basic use, but high-frequency traders hit walls quickly. The free tier's rate limits mean you can't monitor 50+ markets in real-time simultaneously. Premium WebSocket access ($99/month) helps, but still throttles during high-volatility events when you need data most. And their bot detection kicks in if you hammer endpoints too aggressively.
Kalshi requires OAuth tokens with short refresh cycles for API access. Scraping the public frontend seems easier until market IDs start rotating and the DOM structure changes based on A/B test cohorts - a fun discovery to make at 2am when the scraper stops working.
Apify templates promise "no-code prediction market scraping" but they're just Puppeteer scripts with residential proxy rotation. Common scenario: actors hit memory limits and crash after a handful of successful extractions, burning through proxy credits for minimal data yield.
Rate Limits Are Dynamic
Both platforms adjust rate limits during high-traffic events (elections, major sports). A scraper that works fine on a quiet Tuesday might get throttled hard during a presidential debate. Build in adaptive backoff and multiple fallback strategies.
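The adaptive backoff mentioned above can be sketched as a small retry helper. This is a generic pattern, not tied to either platform's API: `fetch` here is a stand-in for whatever request function you use, and treating HTTP 429 as the throttle signal is an assumption about how rate limiting surfaces.

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: window grows 1s, 2s, 4s..., capped."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_backoff(fetch, max_attempts: int = 6):
    """Call fetch() -> (status, body); on a 429, sleep a jittered delay and retry."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        time.sleep(backoff_delay(attempt))
    raise RuntimeError("still rate-limited after retries")
```

Full jitter (a uniform draw over the whole window) spreads retries out so multiple scraper workers don't all hammer the endpoint again at the same instant.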
The Playwright + Mobile Proxy Approach
Playwright solves the JavaScript execution problem. Unlike Selenium, it intercepts network traffic at the protocol level, which means capturing WebSocket frames, modifying headers mid-flight, and blocking unnecessary resources to speed up page loads.
Scraping Kalshi starts with launching a persistent browser context that mimics real user behavior: random mouse movements, variable scroll speeds, and 2-4 second pauses between clicks. The critical piece, though, is network interception.
```python
from playwright.sync_api import sync_playwright

captured_data = []

def intercept_market_data(route):
    if '/api/v2/markets' in route.request.url:
        response = route.fetch()
        data = response.json()
        captured_data.append(data)        # Store for processing
        route.fulfill(response=response)  # Forward original response
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": "http://mobile-proxy.voidmob.com:8080",
        "username": "user",
        "password": "pass"
    })
    page = browser.new_page()
    page.route('**/*', intercept_market_data)
    page.goto('https://kalshi.com/markets')
```
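The human-like behavior described earlier (random mouse paths, variable pauses) can be sketched as a few helpers on top of Playwright's sync mouse API. The jitter amounts, step count, and pause range here are arbitrary illustrative choices, not values either platform is known to check for.

```python
import random
import time

def human_pause(lo: float = 2.0, hi: float = 4.0) -> float:
    """Random 2-4 second pause between actions."""
    return random.uniform(lo, hi)

def mouse_path(x0, y0, x1, y1, steps: int = 12):
    """Jittered straight-line path so the cursor doesn't teleport in one jump."""
    path = []
    for i in range(steps + 1):
        t = i / steps
        path.append((x0 + (x1 - x0) * t + random.uniform(-3, 3),
                     y0 + (y1 - y0) * t + random.uniform(-3, 3)))
    return path

def humanize_click(page, x: int, y: int) -> None:
    """Walk the cursor along the path, pause, then click (Playwright sync API)."""
    for px, py in mouse_path(0, 0, x, y):
        page.mouse.move(px, py)
    time.sleep(human_pause())
    page.mouse.click(x, y)
```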
Mobile proxies matter more than most guides admit. Datacenter IPs get flagged quickly on Polymarket - sustained traffic through AWS proxies typically results in rapid blocking. Residential proxies work better but typically cost $3-9/GB and throttle after sustained traffic.
Carrier-grade mobile proxies from real 4G/5G devices handle sustained polling because they present as authentic smartphone traffic. CGNAT places many customers behind a single IP address, making mobile traffic look like legitimate multi-user patterns. Mobile IPs maintain stable sessions for hours compared to datacenter alternatives that get flagged within minutes.
| Proxy Type | Block Rate | Cost/GB | Session Stability |
|---|---|---|---|
| Datacenter | High | $1-3 | Under 1 minute |
| Residential | Moderate | $3-9 | 3-5 minutes |
| Mobile (carrier) | Low | $5-10 | 60+ minutes |
Adding MCP for Adaptive Scraping
Model Context Protocol (MCP) changes the game for prediction market odds scraping. Instead of hardcoding CSS selectors that break when Polymarket updates its frontend, MCP AI agents discover DOM structures dynamically and adjust extraction logic in real time.
The point: you're no longer hand-writing brittle XPath queries. The MCP server exposes "tools" the AI can invoke - find_odds_container, extract_volume_data, detect_market_status. When the page structure shifts, the agent queries the available tools, tests selectors, and picks whichever method works.
```python
# MCP tool definition
@mcp.tool()
async def extract_polymarket_odds(market_url: str):
    """Dynamically locate and extract current odds for a market"""
    page = await browser.new_page()
    await page.goto(market_url)

    # AI agent decides which selector works
    selectors = [
        '.market-odds-primary',
        '[data-testid="odds-display"]',
        'div.odds-container > span'
    ]

    for selector in selectors:
        element = await page.query_selector(selector)
        if element and await element.is_visible():
            return await element.inner_text()

    return None
```
When running this approach on multiple Polymarket events over extended periods, traditional scrapers typically break when CSS classes get renamed. MCP versions adapt automatically, maintaining high uptime. Research shows AI-based scrapers reduce maintenance overhead by 30-40% by adapting to dynamic webpage structures without manual intervention.
Handling Rate Limits and Session Rotation
Playwright proxies need rotation logic or IP reputation burns fast. Prediction markets track request patterns - hitting the same endpoint repeatedly from one IP gets flagged regardless of proxy quality.
Smart rotation means switching IPs based on behavior triggers, not fixed intervals. After a certain number of requests, after a 429 response, after sustained traffic periods. Sticky sessions are needed for markets that require login state because rotating mid-session kills cookies and forces re-authentication.
```python
proxy_pool = [
    {"server": f"http://us-{i}.voidmob.com:8080", "sticky": True}
    for i in range(1, 6)
]

current_proxy = 0
request_count = 0

async def rotate_if_needed():
    global current_proxy, request_count
    if request_count > 40 or detect_rate_limit():
        current_proxy = (current_proxy + 1) % len(proxy_pool)
        request_count = 0
        await browser.new_context(proxy=proxy_pool[current_proxy])
```
Real-time volumes require WebSocket handling. Polymarket streams tick data over WSS connections that standard HTTP clients can't intercept. Playwright captures these natively through the page.on('websocket') event listener.
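A minimal sketch of that listener, assuming JSON frames. The URL filter, helper names, and frame shape are illustrative guesses - Polymarket's actual stream format isn't documented here - but the `page.on("websocket", ...)` / `ws.on("framereceived", ...)` hooks are Playwright's real event API.

```python
import json

tick_buffer = []  # parsed frames accumulate here for downstream processing

def parse_frame(payload):
    """Parse one WebSocket frame as JSON; skip frames that aren't valid JSON."""
    try:
        msg = json.loads(payload)
    except (ValueError, TypeError):
        return None
    tick_buffer.append(msg)
    return msg

def on_websocket(ws):
    """Attach only to market-data sockets (the URL substrings are assumptions)."""
    if "clob" in ws.url or "stream" in ws.url:
        ws.on("framereceived", parse_frame)

# Register before navigation so early sockets are caught:
# page.on("websocket", on_websocket)
# page.goto("https://polymarket.com/...")
```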
For more on time-based proxy rotation strategies, check our deep-dive on safer scraping patterns.
Storing and Processing Market Data
Scraped odds lose value fast. Arbitrage windows close quickly on liquid markets. Streaming data into a time-series database like TimescaleDB or InfluxDB allows querying historical spreads and detecting patterns.
The schema needs timestamps at millisecond precision - market ID, event slug, yes/no odds, total volume, last trade price. Index on timestamp and market_id for fast range queries.
```sql
CREATE TABLE market_snapshots (
    timestamp   TIMESTAMPTZ NOT NULL,
    market_id   TEXT NOT NULL,
    yes_odds    DECIMAL(5,2),
    no_odds     DECIMAL(5,2),
    volume_24h  BIGINT,
    PRIMARY KEY (timestamp, market_id)
);

SELECT create_hypertable('market_snapshots', 'timestamp');
```
Processing pipelines should calculate implied probabilities and compare across platforms. A market trading at different prices on Polymarket versus Kalshi signals potential arbitrage (if both sides can be executed before the spread tightens).
```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MarketOdds:
    platform: str
    market_id: str
    yes_price: float  # 0.0 to 1.0
    no_price: float
    timestamp: datetime

def detect_arbitrage(polymarket: MarketOdds, kalshi: MarketOdds, min_spread: float = 0.02):
    """
    Detect arbitrage when you can buy YES on one platform
    and NO on another for combined cost < 1.0
    """
    opportunities = []

    # Strategy 1: Buy YES on Polymarket, NO on Kalshi
    cost_1 = polymarket.yes_price + kalshi.no_price
    if cost_1 < (1.0 - min_spread):
        opportunities.append({
            "strategy": "YES@Polymarket + NO@Kalshi",
            "cost": cost_1,
            "profit_margin": 1.0 - cost_1,
            "polymarket_action": f"BUY YES @ {polymarket.yes_price:.3f}",
            "kalshi_action": f"BUY NO @ {kalshi.no_price:.3f}"
        })

    # Strategy 2: Buy NO on Polymarket, YES on Kalshi
    cost_2 = polymarket.no_price + kalshi.yes_price
    if cost_2 < (1.0 - min_spread):
        opportunities.append({
            "strategy": "NO@Polymarket + YES@Kalshi",
            "cost": cost_2,
            "profit_margin": 1.0 - cost_2,
            "polymarket_action": f"BUY NO @ {polymarket.no_price:.3f}",
            "kalshi_action": f"BUY YES @ {kalshi.yes_price:.3f}"
        })

    return opportunities

# Example: Same event priced differently across platforms
poly_odds = MarketOdds("polymarket", "btc-100k-jan", 0.62, 0.40, datetime.now())
kalshi_odds = MarketOdds("kalshi", "btc-100k-jan", 0.58, 0.44, datetime.now())

# The margin here is exactly 2%, so the threshold must sit below it
# (the strict < comparison excludes spreads equal to min_spread)
arb = detect_arbitrage(poly_odds, kalshi_odds, min_spread=0.01)
# Returns: 2% profit margin buying NO@Polymarket + YES@Kalshi
```
The key insight: prediction markets should price YES + NO = 1.0 (minus fees). When the combined cost across platforms drops below that, you've found an arbitrage window. These typically last seconds, which is why high-frequency polling matters.
Common Issues and Fixes
Cookie persistence breaks when switching proxies mid-session. The solution is using Playwright's storageState to save auth tokens and reload them after rotation.
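A sketch of that pattern using Playwright's real `storage_state` API. The file name and the one-hour freshness threshold are assumptions; the context-level calls are the documented sync-API signatures.

```python
import os
import time

def save_state(context, path: str) -> None:
    """Persist cookies and localStorage to disk before rotating proxies."""
    context.storage_state(path=path)  # Playwright sync API

def restore_context(browser, proxy: dict, path: str):
    """Open a fresh context on the new proxy, preloading the saved auth state."""
    return browser.new_context(proxy=proxy, storage_state=path)

def state_is_fresh(path: str, max_age_s: float = 3600.0) -> bool:
    """Heuristic: treat a saved state file older than an hour as expired."""
    try:
        return (time.time() - os.path.getmtime(path)) < max_age_s
    except OSError:
        return False  # file missing: re-authenticate from scratch
```

Check `state_is_fresh` before reusing a state file; if it's stale, log in again rather than loading expired tokens into the new context.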
WebSocket connections drop randomly on mobile networks. Implementing reconnection logic with exponential backoff helps - first retry after 2 seconds, then 4, then 8, max 60.
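That reconnection schedule can be sketched as a generator plus a retry wrapper; `connect` is a placeholder for whatever opens your WebSocket, and `ConnectionError` stands in for the failure your client actually raises.

```python
import time

def reconnect_delays(first: float = 2.0, cap: float = 60.0):
    """Yield 2, 4, 8, ... seconds, capped at 60, matching the schedule above."""
    delay = first
    while True:
        yield delay
        delay = min(delay * 2, cap)

def reconnect(connect, max_tries: int = 8):
    """Retry connect() on failure, sleeping per the backoff schedule."""
    delays = reconnect_delays()
    for _ in range(max_tries):
        try:
            return connect()
        except ConnectionError:
            time.sleep(next(delays))
    raise ConnectionError("gave up after retries")
```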
MCP agents hallucinate selectors when pages load slowly. Adding explicit waits for network idle state before invoking extraction tools fixes this.
```python
await page.wait_for_load_state('networkidle')
await page.wait_for_selector('.market-container', state='visible')
```
Cloudflare challenges occasionally require CAPTCHA solving. Integrating 2Captcha or CapSolver adds several seconds per solve, which needs to be factored into request timing.
FAQ
1. Why not just use the official Polymarket API?
The free tier works fine for casual monitoring, but caps at 1,000 calls/hour. For real-time arbitrage across 50+ markets, you need faster polling than rate limits allow. Premium WebSocket access ($99/mo) helps but still throttles during peak events.
2. How often should mobile proxies be rotated?
Every 35-50 requests or 6-8 minutes of sustained traffic. Carrier IPs carry higher trust scores thanks to CGNAT, so you can push longer sessions than with residential proxies.
3. Can MCP AI agents handle JavaScript-heavy SPAs?
Yes, because they operate on fully rendered DOM after Playwright executes all scripts. The agent sees the same page state a human user would.
4. What's the minimum hardware for real-time scraping?
4 CPU cores, 8GB RAM handles 3-4 concurrent browser contexts. Scraping 50+ markets simultaneously needs 16GB and SSD storage for browser cache.
5. Do prediction markets block mobile IPs?
Rarely. Mobile traffic from carrier networks appears identical to legitimate app users checking odds on their phones. Mobile IPs show significantly lower block rates compared to datacenter alternatives.
Wrapping Up
Official APIs are fine for casual monitoring, but real-time arbitrage demands more than 1,000 calls/hour across dozens of markets. The winning stack combines Playwright for protocol-level interception, carrier-grade mobile proxies for sustained high-frequency polling, and MCP for adaptive extraction when frontends change.
Real-time volumes and arbitrage signals demand WebSocket handling and millisecond-precision storage. For anyone serious about prediction market data at scale, the API-only approach hits walls fast - building adaptive scraping infrastructure that complements rate-limited APIs is the way forward.
Need reliable mobile proxies for prediction market scraping?
VoidMob provides carrier-grade 4G/5G proxies with sticky sessions and instant rotation - no datacenter fingerprints, no VPN detection. Start scraping with real mobile infrastructure.