39% of hotels are missing from the data that trains AI
We checked 108,109 hotel websites against Common Crawl — the open web archive behind much of what large language models learn from. Nearly two in five aren’t in it at all.
There are two ways an AI can know about your hotel. It can fetch you live at query time (retrieval — Google Places, OTAs, reviews), or it can already know you from training. The training layer is built largely on Common Crawl, the open monthly web archive LLM builders pull from. If you’re not in the crawl, the model can’t learn you — it can only look you up. Nobody had measured how many hotels actually make it in. So we did.
Who’s in, who’s out
A representative sample of the hotels we checked. Green dots are in Common Crawl; red are absent. The pattern is geographic — coverage is denser in some markets than others.
Representative sample of ~6,200 hotels (of 108,109 domains checked). Green = present in Common Crawl (May 2026), red = absent.
Present isn’t the same as known
Being in the crawl is one thing; being in it deeply is another. We counted how many pages of each hotel the May 2026 crawl captured.
Only ~3,400 hotels (3%) are captured deeply (100+ pages). The bulk of those that are “in” sit at 5–99 pages — enough to be known, not enough to be richly represented.
Independents beat chains
The intuitive guess — big chains dominate the crawl — is backwards. Independent hotels are markedly more likely to be in Common Crawl than chain properties.
More than half of chain domains are absent. The likely reason is structural: chain and corporate sites are more often JavaScript-rendered single-page apps or sit behind a CDN/WAF that turns crawlers away — both make a site hard for Common Crawl to capture. An independent hotel on a plain WordPress site is, paradoxically, easier for the training crawl to read than a global brand’s booking platform.
By country and TLD
Coverage varies sharply by market.
But the TLD tells a sharper story — and one that ties straight into how AI under-serves non-English markets. Local European TLDs are present but shallow: a .de hotel is well-crawled (71%) but at only ~39 pages on average, where a .com hotel averages ~109. And .es is the clear laggard at 37% — Spanish hotels on .com do fine, but the .es TLD itself is poorly crawled.
| TLD | In crawl | Avg pages (when present) |
|---|---|---|
| .de | 71.4% | 39 |
| .fr | 64.8% | 42 |
| .com | 60% | 109 |
| .it | 59.5% | 35 |
| .nl | 59.1% | 56 |
| .co.uk | 54.3% | 59 |
| .es | 37% | 87 |
The crawl knows local-market hotels — but thinly. It’s the same English-leaning tilt we see in live AI answers, showing up a layer earlier, in the training data itself.
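One way to read the two columns together is to multiply them: the share of domains in the crawl times the average pages when present gives a rough figure for captured pages per hotel domain, counting absent domains as zero. The sketch below does that arithmetic on the table's own numbers. The combined metric and the name `expected_pages` are illustrative assumptions, not part of the study.

```python
# (TLD, share in crawl as %, average pages when present), copied from the table.
rows = [
    (".de", 71.4, 39),
    (".fr", 64.8, 42),
    (".com", 60.0, 109),
    (".it", 59.5, 35),
    (".nl", 59.1, 56),
    (".co.uk", 54.3, 59),
    (".es", 37.0, 87),
]


def expected_pages(in_crawl_pct: float, avg_pages: float) -> float:
    """Rough captured pages per domain, treating absent domains as 0 pages."""
    return in_crawl_pct / 100 * avg_pages


# Rank TLDs by the combined figure, highest first.
ranked = sorted(
    ((tld, expected_pages(pct, avg)) for tld, pct, avg in rows),
    key=lambda item: item[1],
    reverse=True,
)

for tld, pages in ranked:
    print(f"{tld:7s} {pages:5.1f}")
```

Treat the result as a back-of-the-envelope figure: it assumes the "In crawl" share and the page average describe the same set of domains.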
Why some hotels are missing
Absence isn’t random. Three things keep a hotel out of the crawl:
- Rendering. If your site only assembles its content after JavaScript runs, the crawler often captures an empty shell. Static HTML gets read; SPA booking widgets frequently don’t.
- Access. A CDN or firewall that blocks bots — often a default no one chose — turns the crawler away before it reaches a page. This is invisible in robots.txt, which is why our AI-blocking study’s ~3% figure is a floor.
- Connectivity. Common Crawl prioritises well-linked domains. A site few others link to gets crawled rarely and shallowly. Metehan Yeşilyurt’s work on Common Crawl rank (Harmonic Centrality) lays this out — and you can check a domain’s rank at webgraph.metehan.ai.
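To make the connectivity point concrete, here is a minimal sketch of harmonic centrality on a toy link graph. It uses the textbook definition: a domain's score is the sum of 1/distance from every other domain that can reach it by following links. The domains and links are hypothetical, and this is not Common Crawl's actual ranking pipeline.

```python
from collections import deque


def harmonic_centrality(links: dict[str, list[str]]) -> dict[str, float]:
    """H(v) = sum over u != v of 1 / d(u, v), measured along incoming link paths."""
    nodes = set(links) | {t for targets in links.values() for t in targets}

    # Reverse the graph so a BFS from v walks backwards along incoming links.
    incoming = {node: [] for node in nodes}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    scores = {}
    for v in nodes:
        dist = {v: 0}
        queue = deque([v])
        while queue:
            node = queue.popleft()
            for prev in incoming[node]:
                if prev not in dist:
                    dist[prev] = dist[node] + 1
                    queue.append(prev)
        # Domains that cannot reach v contribute nothing (1 / infinity).
        scores[v] = sum(1 / d for node, d in dist.items() if node != v)
    return scores


# Toy graph (hypothetical domains): each key links to the listed targets.
links = {
    "guide.example": ["hotel-a.example", "hotel-b.example"],
    "blog.example": ["guide.example", "hotel-a.example"],
    "hotel-a.example": [],
    "hotel-b.example": [],
    "hotel-c.example": [],
}
scores = harmonic_centrality(links)
```

In this toy graph the hotel with two direct inbound links scores highest, and the hotel nobody links to scores zero, which is the situation the connectivity point describes.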
What to do: the fixes are the cheap, structural ones — serve real HTML (not JS-only), make sure no CDN/WAF setting is quietly blocking AI crawlers, and earn a few quality links so you’re worth crawling. Check your own site in seconds with the Common Crawl checker. For hotels, this is the secondary lever — most AI hotel answers are built from live retrieval, not trained memory (see the two-layers guide) — but it costs nothing to not wall yourself out of it.
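A quick way to test the "serve real HTML" point is to look at what a non-rendering crawler would see: the raw HTML with scripts and styles ignored. The sketch below counts visible words in an HTML string using only the standard library. The 50-word threshold and the function names are assumptions for illustration, and a low count is a hint to investigate, not proof that a page is missing from the crawl.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that appears outside <script> and <style> tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped tags
        self.chunks = []  # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_words(html: str) -> int:
    """Count words a crawler could read without running JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)


def looks_like_empty_shell(html: str, min_words: int = 50) -> bool:
    """Flag pages whose raw HTML carries very little text (threshold is arbitrary)."""
    return visible_words(html) < min_words


# Hypothetical pages: a JS-only shell and a static page with real copy.
spa_page = (
    '<html><head><title>Hotel</title><script>var a = "app code";</script></head>'
    '<body><div id="root"></div><script src="/app.js"></script></body></html>'
)
static_page = (
    "<html><body><h1>Hotel Example</h1><p>"
    + "Rooms with sea view and breakfast included. " * 10
    + "</p></body></html>"
)
```

To try it on a real site, save the page source (not the rendered DOM) and pass it to `looks_like_empty_shell`.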
Method & limits
We took every hotel in our 200K-property index with its own website, a Google Place ID, and at least 10 reviews: 142,405 hotels, resolving to 108,109 distinct domains (properties in a chain share one). Junk (parked domains, closed properties, non-hotels) was filtered out. Each domain was checked against the columnar URL index of the May 2026 Common Crawl snapshot (CC-MAIN-2026-21), counting captured pages for the host and its www variant. Present = 5+ pages, shallow = 1–4, absent = 0.
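The classification step can be sketched as follows. The thresholds come from the method above; the per-host page counts are hypothetical stand-ins for what an index lookup would return, and the function names are ours for illustration rather than the study's actual code.

```python
from collections import Counter


def host_variants(domain: str) -> tuple[str, str]:
    """Return the bare host and its www variant for a domain."""
    bare = domain.lower().removeprefix("www.")
    return bare, "www." + bare


def classify(pages: int) -> str:
    """Label a domain with the thresholds stated in the method."""
    if pages >= 5:
        return "present"
    if pages >= 1:
        return "shallow"
    return "absent"


def label_domains(domains, pages_by_host):
    """Sum captured pages over both host variants, then label each domain."""
    labels = {}
    for domain in domains:
        bare, www = host_variants(domain)
        total = pages_by_host.get(bare, 0) + pages_by_host.get(www, 0)
        labels[bare] = classify(total)
    return labels


# Hypothetical per-host page counts, standing in for an index lookup.
pages_by_host = {
    "www.hotel-a.example": 40,
    "hotel-a.example": 2,
    "hotel-b.example": 3,
}
domains = ["hotel-a.example", "www.hotel-b.example", "hotel-c.example"]

labels = label_domains(domains, pages_by_host)
shares = Counter(labels.values())
```

Summing both variants matters because a site may be captured only under `www.` or only under the bare host; checking a single form would undercount it.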
FAQ
Of 108,109 hotel websites checked against the May 2026 snapshot, 60.6% are in Common Crawl and 39.4% are absent. That 60.6% splits into 41.4% of all domains with real depth (5+ pages captured) and 19.2% that are shallow (1–4 pages).
Check your own hotel
See whether your site is in Common Crawl — and how deeply — in a few seconds.
Run the Common Crawl checker