{"@context":"https://schema.org","@type":"BlogPosting","headline":"Are Hotels in Common Crawl? 39% Are Missing From AI Training Data (2026)","description":"108,109 hotel websites checked against the May 2026 Common Crawl columnar index. 60.6% present (41.4% with depth, 19.2% shallow), 39.4% absent. Independent hotels (61%) are more crawled than chains (45.9%); coverage runs 69% (Germany) to 47% (Indonesia); local-market TLDs are present but shallow and .es lags at 37%.","datePublished":"2026-06-09","dateModified":"2026-06-09","url":"https://nicolassitter.com/research/hotels-in-common-crawl-2026","category":"research","keywords":["hotels in common crawl","AI training data hotels","CCBot hotel coverage","is my site in common crawl","LLM training data hotels"],"articleSection":"Research","wordCount":1400,"readTime":"6 min","articleBody":"Research · June 2026\n\n# 39% of hotel websites are missing from Common Crawl\n\nWe checked **108,109 hotel website domains** against Common Crawl — one of the major open-web datasets AI training is built on. Nearly two in five aren’t in the May 2026 snapshot at all.\n\n60.6%\n\nin Common Crawl\n\n39.4%\n\nabsent entirely\n\n108,109\n\ndomains checked\n\n## Quick answers\n\n### How can I check if my hotel website is in Common Crawl?\n\nTo check if your hotel website is in Common Crawl, use the free Common Crawl checker at nicolassitter.com/tools/common-crawl: enter your domain and it queries the public columnar index for recent snapshots. A 5-minute self-check matters because in a study of 108,109 hotel website domains against the May 2026 snapshot (CC-MAIN-2026-21), 39.4% were absent entirely. To improve coverage, fix the three things that keep hotels out: serve real static HTML rather than JavaScript-only pages, make sure no CDN or firewall setting silently blocks AI crawlers like CCBot, and earn a few quality inbound links, since Common Crawl decides what to capture by Harmonic Centrality, a link-graph connectivity score. 
— Nicolas Sitter, Are Hotels in Common Crawl? 39% of Hotel Websites Are Missing (2026)\n\n### Are hotel websites in Common Crawl and AI training data?\n\nMost hotel websites are in Common Crawl, but a large minority are not: of 108,109 distinct hotel website domains (resolved from 142,405 reviewed properties) checked against the May 2026 snapshot, 60.6% are captured and 39.4% are absent. Of all domains checked, 41.4% are deep (5+ pages) and 19.2% shallow (1–4 pages); the median in-crawl hotel holds just 9 pages, often the whole small site. Coverage varies by market: Germany 69.3%, France 64.6%, down to the .es TLD at 37%. Independents (61%) beat chains (45.9%), though that gap is driven almost entirely by Louvre Hotels' thin microsites at 11.8%. Being in Common Crawl is necessary but not sufficient for a model to surface you. — Nicolas Sitter, Are Hotels in Common Crawl? 39% of Hotel Websites Are Missing (2026)\n\nThere are two ways an AI can know about your hotel. It can _fetch_ you live at query time (retrieval: a web search, with or without page scraping, plus Google Places, OTAs and reviews), or it can already _know_ you from training. The open-web part of that training layer often starts with [Common Crawl](https://commoncrawl.org), the monthly web archive many LLM builders use. If your own site isn’t in the crawl, a model can’t learn your hotel _from that source_ — though it may still pick you up indirectly through OTAs, reviews and travel guides. I hadn’t seen anyone look at this for hotels specifically — so I dug into it.\n\n**The result:** we started from 142,405 hotel properties and resolved them to **108,109 distinct website domains** (every hotel in our index with its own domain, a Google Place ID, and ≥10 reviews). Against the May 2026 snapshot, **60.6% are captured** in Common Crawl and **39.4% are absent**. Overall, 41.4% are captured deeply (5+ pages) and 19.2% only shallowly (1–4 pages). 
So the official websites of a large minority of legitimate, reviewed hotels are invisible to this major open-web training layer. All figures below are domain-level, not property-level, unless stated.\n\nTwo things to keep straight before the numbers. First, being in Common Crawl isn’t the same as being in a model’s final training set — that data gets filtered, deduplicated and down-weighted. But within this source, absence is decisive: a page that was never captured can’t survive any later filtering. Second, the point isn’t that AI can’t find these hotels at all. It’s that **the hotel’s own website is missing from one of the main open-web memory layers**, leaving OTAs, directories and review platforms to define the property instead.\n\n## Who’s in, who’s out\n\nA representative sample of the hotels we checked. Green dots are in Common Crawl; red are absent. The pattern is geographic — coverage is denser in some markets than others.\n\nRepresentative sample of ~6,200 hotels (of 108,109 domains checked). Green = present in Common Crawl (May 2026), red = absent.\n\n## How many pages — and why that matters less here\n\nWe counted how many pages of each hotel the May 2026 crawl captured. A caveat before reading this chart: a hotel website is a _small_ site. Home, a rooms page, a few room types, a gallery, contact, maybe a restaurant — that’s often the whole thing. The median hotel that’s in the crawl is captured at just **9 pages**, and 9 pages can be the entire site. So unlike a news or e-commerce domain, depth isn’t really the worry for a hotel — presence is. The line that matters is 0 vs. 1+ — or rather, that’s the first line. 
The second is whether those captured pages carry real hotel content and not just a JavaScript shell (more on rendering below).\n\n| Pages captured | Domains |\n| --- | --- |\n| 0 (absent) | 42,575 |\n| 1–4 (shallow) | 20,749 |\n| 5–19 | 26,096 |\n| 20–99 | 15,267 |\n| 100–499 | 2,802 |\n| 500+ | 620 |\n\nThe ~3,400 hotels (3%) captured at 100+ pages are mostly not small properties at all — they’re hotels whose listed website sits on a big shared platform (a chain domain, an OTA, a directory). For an ordinary independent on its own domain, 5–30 captured pages is a complete reading, not a shallow one. Which is why the real story below isn’t depth — it’s the 39% at zero.\n\n## “Chains” underperform — but it’s really one budget group\n\nThe headline split looks backwards: independent hotels are in Common Crawl far more often than chain properties. But that average hides almost everything interesting, so we broke it down by brand.\n\n| Segment | In crawl | Domains (n) |\n| --- | --- | --- |\n| Independents | 61.0% | 105,483 |\n| All chains | 45.9% | 2,735 |\n\n### Coverage by brand\n\n| Brand | In crawl |\n| --- | --- |\n| Hilton | 81.5% |\n| Best Western | 78.8% |\n| Accor | 71.4% |\n| IHG | 65.3% |\n| Marriott | 63.9% |\n| Wyndham | 57.5% |\n| Choice | 44.2% |\n| Brit Hotel | 14.4% |\n| Louvre Hotels | 11.8% |\n| WorldHotels | 7.5% |\n\nBrands with ≥15 domains in the run. Overall line: 60.6%.\n\nThe marquee global brands are _fine_: Hilton (81.5%), Best Western (78.8%), Accor (71.4%) and Marriott (63.9%) all beat the 60.6% overall line. The chain average is dragged down almost entirely by one budget group: **Louvre Hotels** — Campanile, Kyriad, Première Classe — whose 779 properties often sit on templated per-location microsites (`lille-est-hem.kyriad.com`) and land at just 11.8% coverage. The pattern is consistent with low-link, cookie-cutter microsites. Drop that one group and chains rise to **59.4%**, right in line with everyone else — a domain-architecture effect as much as a brand one.\n\n## What absence actually looks like\n\nSix real rows from the run. 
The two at the top are recognisable independents with their own sites — and they’re simply not in the crawl. The middle two are budget-chain microsites at zero. The last two are small hotels captured at only one to four pages.\n\n| Hotel | Domain | Pages in crawl |\n| --- | --- | --- |\n| Hotel Chapter Roma (Rome · design hotel) | chapter-roma.com | 0 |\n| Maison Tremé (New Orleans · boutique) | maisontreme.com | 0 |\n| Kyriad Lille Est Hem (Louvre Hotels microsite) | lille-est-hem.kyriad.com | 0 |\n| Campanile Brive-la-Gaillarde (Louvre Hotels microsite) | brive-la-gaillarde-ouest.campanile.com | 0 |\n| Hôtel L’Aubergade (Gérardmer · independent) | laubergade-gerardmer.fr | 4 |\n| Hotel Villa dei Mosaici (Spello · independent) | hotelvilladeimosaicispello.it | 1 |\n\nA reviewed, bookable hotel whose official site is at 0 is invisible _through that site_ to any model learning from this archive — it can only be reached live, if the engine runs a search.\n\n## By country and TLD\n\nCoverage varies sharply by market. Among the largest markets in the dataset:\n\n| Country | In crawl |\n| --- | --- |\n| Germany | 69.3% |\n| France | 64.6% |\n| Netherlands | 58.8% |\n| Spain | 58.6% |\n| Italy | 58.5% |\n| United States | 57.4% |\n| United Kingdom | 53.6% |\n\nOne distinction first, to avoid a false contradiction: **country is the hotel’s location; TLD is the website’s domain.** Spain reads 58.6% by country yet 37% on the `.es` TLD, because many Spanish hotels sit on `.com` — the two measure different things.\n\nWith that in mind, the TLD tells a sharper story about how the open-web training layer may under-represent non-English markets. Local European TLDs are _present but shallow_: a `.de` hotel is well-crawled (71%) but at only ~39 pages on average, where a `.com` hotel averages ~109. 
And `.es` is the clear laggard at 37% — Spanish hotels on `.com` do fine, but the `.es` TLD itself is poorly crawled.\n\n| TLD | In crawl | Avg pages (when present) |\n| --- | --- | --- |\n| .de | 71.4% | 39 |\n| .fr | 64.8% | 42 |\n| .com | 60% | 109 |\n| .it | 59.5% | 35 |\n| .nl | 59.1% | 56 |\n| .co.uk | 54.3% | 59 |\n| .es | 37% | 87 |\n\nThe crawl knows local-market hotels — but thinly. This looks like one mechanism behind the English-leaning tilt we see in [live AI answers](/research/bookstores-tokyo-ai-search-2026) — showing up a layer earlier, in the training data itself.\n\n## Why some hotels are missing\n\nAbsence isn’t random. Three things keep a hotel out of the crawl:\n\n-   **Rendering.** If your site only assembles its content after JavaScript runs, the crawler often captures an empty shell. Static HTML gets read; SPA booking widgets frequently don’t.\n-   **Access.** A CDN or firewall that blocks bots — often a default no one chose — turns the crawler away before it reaches a page. This is invisible in robots.txt, which is why our [AI-blocking study](/research/hotel-robots-ai-blocking-study-2026)’s ~3% figure is a floor.\n-   **Connectivity — and this is the main lever.** Common Crawl doesn’t crawl the web evenly. It decides which domains to capture, and how deeply, by **Harmonic Centrality** — a link-graph score for how well-connected a domain is. High-scoring domains get crawled often and deep; low-scoring long-tail sites get crawled rarely or not at all (documented in Mozilla’s _“Training Data for the Price of a Sandwich”_). For a hotel this is usually decisive: a property with few inbound links scores low and gets skipped, however good the site is. 
Metehan Yeşilyurt’s [work on Common Crawl rank](https://metehan.ai/blog/cc-rank/) lays this out, and you can look up your domain’s Harmonic Centrality and PageRank at [webgraph.metehan.ai](https://webgraph.metehan.ai).\n\n**What to do:** the fixes are the cheap, structural ones: serve real HTML (not JS-only), make sure no CDN/WAF setting is quietly blocking AI crawlers, and earn a few quality links so you’re worth crawling. Check your own site in seconds with the [Common Crawl checker](/tools/common-crawl). For hotels, this is the _secondary_ lever, since most AI hotel answers are built from live retrieval, not trained memory (see the [two-layers guide](/guide/ai-search-for-hotels#two-layers)), but it costs nothing to avoid walling yourself out of it.\n\n### A 5-minute self-check\n\n-   Fetch your homepage and a room page as plain text (curl, or the checker above): is the content there _before_ any JavaScript runs?\n-   Scan robots.txt and your CDN/WAF logs for blocked crawler user-agents, including CCBot and other AI crawlers.\n-   Keep a clean sitemap submitted to search engines (Common Crawl discovery mostly follows links, but it doesn’t hurt).\n-   Make sure room, location and content pages are internally linked, not reachable only through the booking widget.\n-   Don’t trap your core content behind a JavaScript booking iframe.\n\n## Method & limits\n\nWe took every hotel in our 200K-property index with its own website, a Google Place ID, and at least 10 reviews: 142,405 hotels, resolving to **108,109 distinct domains** (some chains consolidate many properties under fewer domains). Junk (parked domains, closed properties, non-hotels) was filtered out. Each domain was checked against the **columnar URL index of the May 2026 Common Crawl snapshot** (CC-MAIN-2026-21), counting captured pages for the host and its `www` variant. 
_Captured_ = 1+ page (what we count as “in Common Crawl”); _deep capture_ = 5+ pages; _shallow_ = 1–4; _absent_ = 0.\n\n**Limits.** It’s one monthly snapshot — a site absent in May may appear next month. Content served on an unrelated sub-domain isn’t counted. “Absent” means _not captured_, which could be a block, JS-only rendering, low connectivity, or a new domain — we don’t attribute the cause per hotel. And presence in the crawl is necessary, not sufficient, for a model to actually surface you.\n\n## FAQ\n\nOf 108,109 hotel website domains checked against the May 2026 snapshot, 60.6% are captured in Common Crawl and 39.4% are absent. Overall, 41.4% are captured deeply (5+ pages) and 19.2% shallow (1–4 pages). Figures are domain-level, not property-level. And to be upfront: that’s a slice of our own dataset, not every hotel alive — but at 108,109 domains it’s already a pretty big slice to draw a line through.\n\n## Check your own hotel\n\nSee whether your site is in Common Crawl — and how deeply — in a few seconds.\n\n[Run the Common Crawl checker](/tools/common-crawl)","author":{"@type":"Person","name":"Nicolas Sitter","url":"https://nicolassitter.com/about","sameAs":["https://www.linkedin.com/in/nicolassitternolleau/","https://github.com/Nicositter88","https://hotelrank.ai"]},"publisher":{"@type":"Person","name":"Nicolas Sitter","url":"https://nicolassitter.com"},"image":"https://nicolassitter.com/api/og/hotels-in-common-crawl-2026","mainEntityOfPage":{"@type":"WebPage","@id":"https://nicolassitter.com/research/hotels-in-common-crawl-2026"},"tags":["Common Crawl","AI Training Data","Hotels","AI 
Visibility"],"sameAs":["https://hotelrank.ai/research/hotels-in-common-crawl-2026"],"alternateFormat":{"html":"https://nicolassitter.com/research/hotels-in-common-crawl-2026","json":"https://nicolassitter.com/api/post/hotels-in-common-crawl-2026","rss":"https://nicolassitter.com/rss.xml"},"datasets":[{"name":"summary","contentUrl":"https://nicolassitter.com/data/hotels-in-common-crawl-2026/summary.csv","encodingFormat":"text/csv"}]}