{"@context":"https://schema.org","@type":"BlogPosting","headline":"Are Hotels in Common Crawl? 39% Are Missing From AI Training Data (2026)","description":"108,109 hotel websites checked against the May 2026 Common Crawl columnar index. 60.6% present (41.4% with depth, 19.2% shallow), 39.4% absent. Independent hotels (61%) are more crawled than chains (45.9%); coverage runs 69% (Germany) to 47% (Indonesia); local-market TLDs are present but shallow and .es lags at 37%.","datePublished":"2026-06-09","dateModified":"2026-06-09","url":"https://nicolassitter.com/research/hotels-in-common-crawl-2026","category":"research","keywords":["hotels in common crawl","AI training data hotels","CCBot hotel coverage","is my site in common crawl","LLM training data hotels"],"articleSection":"Research","wordCount":1400,"readTime":"6 min","articleBody":"Research · June 2026\n\n# 39% of hotels are missing from the data that trains AI\n\nWe checked **108,109 hotel websites** against Common Crawl, the open web archive behind much of what large language models learn from. Nearly two in five aren’t in it at all.\n\n-   **60.6%** in Common Crawl\n-   **39.4%** absent entirely\n-   **108,109** domains checked\n\nThere are two ways an AI can know about your hotel. It can _fetch_ you live at query time (retrieval: Google Places, OTAs, reviews), or it can already _know_ you from training. The training layer is built largely on [Common Crawl](https://commoncrawl.org), the open monthly web archive LLM builders pull from. If you’re not in the crawl, the model can’t learn you; it can only look you up. Nobody had measured how many hotels actually make it in. So we did.\n\n**The result:** of 108,109 hotel websites (every hotel in our index with its own domain, a Google Place ID, and ≥10 reviews), **60.6% are in Common Crawl** and **39.4% are absent**. Measured against all 108,109 domains, 41.4% are there with real depth (5+ pages captured) and 19.2% only shallowly (1–4 pages). 
So a large minority of legitimate, reviewed hotels are invisible to the training layer of every major model.\n\n## Who’s in, who’s out\n\nThe map shows a representative sample of ~6,200 hotels (of 108,109 domains checked). Green dots are present in Common Crawl (May 2026); red dots are absent. The pattern is geographic: coverage is denser in some markets than others.\n\n## Present isn’t the same as known\n\nBeing in the crawl is one thing; being in it _deeply_ is another. We counted how many pages of each hotel the May 2026 crawl captured.\n\n| Pages captured | Domains |\n| --- | --- |\n| 0 (absent) | 42,575 |\n| 1–4 (shallow) | 20,749 |\n| 5–19 | 26,096 |\n| 20–99 | 15,267 |\n| 100–499 | 2,802 |\n| 500+ | 620 |\n\nOnly ~3,400 hotels (3%) are captured deeply (100+ pages). The bulk of those that are “in” sit at 5–99 pages: enough to be known, not enough to be richly represented.\n\n## Independents beat chains\n\nThe intuitive guess, that big chains dominate the crawl, is backwards. Independent hotels are markedly _more_ likely to be in Common Crawl than chain properties.\n\n| Segment | In crawl | n (domains) |\n| --- | --- | --- |\n| Independents | 61.0% | 105,483 |\n| Chains | 45.9% | 2,735 |\n\nMore than half of chain domains are absent. The likely reason is structural: chain and corporate sites are more often JavaScript-rendered single-page apps or sit behind a CDN/WAF that turns crawlers away. Both make a site hard for Common Crawl to capture. 
An independent hotel on a plain WordPress site is, paradoxically, easier for the training crawl to read than a global brand’s booking platform.\n\n## By country and TLD\n\nCoverage varies sharply by market.\n\n| Country | In crawl |\n| --- | --- |\n| Germany | 69.3% |\n| France | 64.6% |\n| Netherlands | 58.8% |\n| Spain | 58.6% |\n| Italy | 58.5% |\n| United States | 57.4% |\n| United Kingdom | 53.6% |\n| Indonesia | 47.4% |\n\nBut the TLD tells a sharper story, one that ties straight into how AI under-serves non-English markets. Local European TLDs are _present but shallow_: a `.de` hotel is well-crawled (71%) but at only ~39 pages on average, where a `.com` hotel averages ~109. And `.es` is the clear laggard at 37%: Spanish hotels on `.com` do fine, but the `.es` TLD itself is poorly crawled.\n\n| TLD | In crawl | Avg pages (when present) |\n| --- | --- | --- |\n| .de | 71.4% | 39 |\n| .fr | 64.8% | 42 |\n| .com | 60% | 109 |\n| .it | 59.5% | 35 |\n| .nl | 59.1% | 56 |\n| .co.uk | 54.3% | 59 |\n| .es | 37% | 87 |\n\nThe crawl knows local-market hotels, but thinly. It’s the same English-leaning tilt we see in [live AI answers](/research/bookstores-tokyo-ai-search-2026), showing up a layer earlier, in the training data itself.\n\n## Why some hotels are missing\n\nAbsence isn’t random. Three things keep a hotel out of the crawl:\n\n-   **Rendering.** If your site only assembles its content after JavaScript runs, the crawler often captures an empty shell. Static HTML gets read; SPA booking widgets frequently don’t.\n-   **Access.** A CDN or firewall that blocks bots (often a default no one chose) turns the crawler away before it reaches a page. This is invisible in robots.txt, which is why our [AI-blocking study](/research/hotel-robots-ai-blocking-study-2026)’s ~3% figure is a floor.\n-   **Connectivity.** Common Crawl prioritises well-linked domains. A site few others link to gets crawled rarely and shallowly. 
Metehan Yeşilyurt’s [work on Common Crawl rank](https://metehan.ai/blog/cc-rank/) (Harmonic Centrality) lays this out, and you can check a domain’s rank at [webgraph.metehan.ai](https://webgraph.metehan.ai).\n\n**What to do:** the fixes are the cheap, structural ones: serve real HTML (not JS-only), make sure no CDN/WAF setting is quietly blocking AI crawlers, and earn a few quality links so you’re worth crawling. Check your own site in seconds with the [Common Crawl checker](/tools/common-crawl). For hotels, this is the _secondary_ lever, because most AI hotel answers are built from live retrieval, not trained memory (see the [two-layers guide](/guide/ai-search-for-hotels#two-layers)). Still, it costs nothing to avoid walling yourself out of it.\n\n## Method & limits\n\nWe took every hotel in our 200K-property index with its own website, a Google Place ID, and at least 10 reviews: 142,405 hotels, resolving to **108,109 distinct domains** (chain properties can share one domain). Junk (parked domains, closed properties, non-hotels) was filtered out. Each domain was checked against the **columnar URL index of the May 2026 Common Crawl snapshot** (CC-MAIN-2026-21), counting captured pages for the host and its `www` variant. _Present with depth_ = 5+ pages, _shallow_ = 1–4, _absent_ = 0.\n\n**Limits.** It’s one monthly snapshot, so a site absent in May may appear next month. Content served on an unrelated sub-domain isn’t counted. “Absent” means _not captured_, which could be a block, JS-only rendering, low connectivity, or a new domain; we don’t attribute the cause per hotel. And presence in the crawl is necessary, not sufficient, for a model to surface you from trained memory.\n\n## FAQ\n\nOf 108,109 hotel websites checked against the May 2026 snapshot, 60.6% are in Common Crawl and 39.4% are absent. 
Of all domains checked, 41.4% have real depth (5+ pages captured) and 19.2% are shallow (1–4 pages).\n\n## Check your own hotel\n\nSee whether your site is in Common Crawl, and how deeply, in a few seconds.\n\n[Run the Common Crawl checker](/tools/common-crawl)","author":{"@type":"Person","name":"Nicolas Sitter","url":"https://nicolassitter.com/about","sameAs":["https://www.linkedin.com/in/nicolassitternolleau/","https://github.com/Nicositter88","https://hotelrank.ai"]},"publisher":{"@type":"Person","name":"Nicolas Sitter","url":"https://nicolassitter.com"},"image":"https://nicolassitter.com/api/og/hotels-in-common-crawl-2026","mainEntityOfPage":{"@type":"WebPage","@id":"https://nicolassitter.com/research/hotels-in-common-crawl-2026"},"tags":["Common Crawl","AI Training Data","Hotels","AI Visibility"],"sameAs":["https://hotelrank.ai/research/hotels-in-common-crawl-2026"],"alternateFormat":{"html":"https://nicolassitter.com/research/hotels-in-common-crawl-2026","json":"https://nicolassitter.com/api/post/hotels-in-common-crawl-2026","rss":"https://nicolassitter.com/rss.xml"},"datasets":[{"name":"summary","contentUrl":"https://nicolassitter.com/data/hotels-in-common-crawl-2026/summary.csv","encodingFormat":"text/csv"}]}