How Google Crawls & Indexes

Introduction

All SEO work (metadata, structured data, sitemaps) serves one purpose: helping Google find, understand, and index your content. But how does Google actually work? Understanding the Crawl → Index → Rank pipeline tells you WHY each SEO technique works, so you are not just following a checklist blindly.

Goal: Understand the three stages of Google Search (Crawling, Indexing, Ranking), Googlebot behavior, and why statically exported sites have an SEO advantage.

The 3 Stages of Google Search

1. CRAWLING          2. INDEXING          3. RANKING
   │                    │                    │
   ├─ Discover URLs     ├─ Parse HTML        ├─ Match query
   ├─ Fetch pages       ├─ Extract content   ├─ Evaluate signals
   ├─ Follow links      ├─ Understand meaning├─ Score relevance
   └─ Respect robots    └─ Store in index    └─ Display results

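Stage 1: Crawling

Googlebot is an automated crawler that continuously visits URLs:

Discover URLs: where do they come from?
- XML sitemap (the sitemap.xml you submit through GSC)
- Links from pages that are already indexed
- Links from other websites
- Possibly anonymized Chrome usage data (often cited, but not confirmed in Google's documentation)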
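
To make the sitemap source concrete, here is a minimal sketch of how a crawler could read URLs out of a sitemap. It is not how Googlebot is implemented; the sample sitemap content and the `/vi/blog/another-post` URL are hypothetical, and only the XML namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content; a real one would be read from sitemap.xml.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://leduykhuong.com/vi/blog/learning-in-public</loc></url>
  <url><loc>https://leduykhuong.com/vi/blog/another-post</loc></url>
</urlset>
"""


def sitemap_urls(xml_bytes: bytes) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_bytes)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

Each URL returned this way is a candidate for the crawl queue; whether and when it is fetched is still up to the crawler.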

Fetch pages: Googlebot sends an HTTP GET request:

GET /vi/blog/learning-in-public HTTP/2
Host: leduykhuong.com
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Render JavaScript: Googlebot runs headless Chromium. It CAN execute JavaScript, but rendering is deferred through the Web Rendering Service queue. Static HTML is an advantage because the content is already in the initial response and does not wait on rendering.
Follow links: Parse <a href> elements and add newly found URLs to the crawl queue.
Respect robots.txt: Check the rules before crawling. For example, Disallow: /_next/ tells Googlebot to skip the /_next/ directory.
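
The "follow links" step can be sketched in a few lines. This is a toy model under stated assumptions, not Googlebot's actual logic: `LinkExtractor`, `discover_links`, and `enqueue_new` are hypothetical names, and a real crawler would also apply robots.txt rules, canonicalization, and scheduling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(page_url: str, html: str) -> list[str]:
    """Return absolute URLs linked from the page, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(page_url, href)).url for href in parser.hrefs]


def enqueue_new(queue: deque, seen: set, urls: list[str]) -> None:
    """Append URLs to the crawl queue, skipping any that were already seen."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)
```

Relative links are resolved against the page URL and duplicates are dropped, so each page enters the queue only once.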

Crawl Budget

Google does not crawl without limit. Each site has a crawl budget, the number of pages Google is willing to crawl in a given period. Several factors raise or lower it:

Factor	Tăng budget	Giảm budget
Site speed	Fast response	Slow response
Server errors	None	Many 5xx
Content freshness	Updated often	Stale content
Site importance	High authority	Low authority

For leduykhuong.com (~200 pages): crawl budget is not a concern. Google can crawl a site this small in full very quickly. Crawl budget matters for large sites (>10,000 pages).

Stage 2: Indexing

After fetching the HTML, Google indexes the content:

Parse HTML: Extract the title, description, headings, body text, and links
Read structured data: Parse JSON-LD (BlogPosting, BreadcrumbList...)
Understand meaning: NLP models analyze the content and identify topics and entities
Evaluate quality: Content quality signals (E-E-A-T: Experience, Expertise, Authoritativeness, Trustworthiness)
Store in index: Save the result in Google's distributed database
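
As a rough illustration of the first two steps (parsing HTML and reading structured data), the sketch below pulls the title, meta description, and JSON-LD blocks out of a page. It only mimics what an indexer might extract; the `PageSignals` class, `extract_signals` helper, and the sample description and headline values are hypothetical.

```python
import json
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collect the title, meta description, and JSON-LD blocks of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self.json_ld: list = []
        self._in_title = False
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content") or ""
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Invalid JSON-LD is skipped rather than crashing the parse.


def extract_signals(html: str) -> PageSignals:
    """Parse an HTML string and return the collected signals."""
    parser = PageSignals()
    parser.feed(html)
    return parser


# Hypothetical page; the description and headline values are made up.
SAMPLE_PAGE = """<html><head>
<title>Learning in Public | Le Duy Khuong</title>
<meta name="description" content="Hypothetical description" />
<script type="application/ld+json">{"@type": "BlogPosting", "headline": "Learning in Public"}</script>
</head><body><h1>Learning in Public</h1></body></html>"""
```

If any of these fields come back empty for a statically exported page, the problem is in the build output, not in JavaScript rendering.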

Note: Indexed ≠ Ranked. Google can index a page and still rank it at position 100 (page 10). The page is technically indexed, but nobody sees it.

Stage 3: Ranking

When a user searches, Google:

Match query: Find indexed pages relevant to the query
Evaluate 200+ signals: Content quality, backlinks, user experience, freshness, relevance...
Personalize: User location, search history, device
Display results: The SERP with title, description, and rich results

Static Export vs Server Rendering — SEO Perspective

Aspect	Static Export (leduykhuong.com)	Server Rendering
Crawl speed	Instant (HTML ready)	Depends on server
JavaScript needed	No (pre-rendered)	Sometimes
Content availability	100% in HTML	May need JS rendering
Cache-friendly	Very (CDN)	Varies
Dynamic content	Build-time only	Real-time

Static export is an SEO advantage because:

Googlebot receives the full HTML right away, with no wait for JavaScript rendering
CDN serving means fast responses, which is good for crawl budget
Content is deterministic: the same URL always returns the same content

Googlebot Rendering — Common Misconceptions

Myth: "Googlebot không render JavaScript"

Reality: Googlebot uses Chromium (latest stable) and CAN render JS. But there is a delay: the Web Rendering Service queue can take anywhere from seconds to hours.

Myth: "CSR sites không được index"

Reality: Client-side rendered sites CAN BE indexed. But:

The delay between crawl and render slows content discovery
Some JavaScript patterns cause issues (infinite scroll, lazy loading)
Google recommends server-side rendering or static rendering

leduykhuong.com advantage

Static export means every page is complete HTML:

<!-- /vi/blog/learning-in-public.html -->
<html>
  <head>
    <title>Learning in Public | Le Duy Khuong</title>
    <meta name="description" content="..." />
    <meta property="og:title" content="..." />
    <script type="application/ld+json">{"@type":"BlogPosting",...}</script>
  </head>
  <body>
    <!-- Full content rendered -->
  </body>
</html>

Googlebot fetches the page, receives 100% of the content right away, and can index it without waiting for JavaScript rendering.

Monitoring Crawling — GSC

Crawl Stats (GSC → Settings → Crawl stats)

Total crawl requests: 1,200/day
Average response time: 120ms
Host status: Healthy

What Google sees (URL Inspection → "View crawled page")

Click "View crawled page" to see the exact HTML Googlebot received. Compare it with the live page to verify that the content matches.

Practice

Exercise 1: Check crawl stats

GSC → Settings → Crawl stats:

How many crawl requests per day?
Average response time?
Status code distribution (200, 301, 404)?
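
If you also have raw server access logs, you can cross-check the status code distribution yourself. The sketch below assumes logs in the common "combined" format, which a static site behind a CDN may not expose. The regex, the function name, and the sample lines are all hypothetical, and a User-Agent containing "Googlebot" can be spoofed, so treat the counts as approximate.

```python
import re
from collections import Counter

# Matches the request, status, and user-agent fields of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_counts(lines) -> Counter:
    """Count HTTP status codes for requests whose User-Agent mentions Googlebot."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("status")] += 1
    return counts


# Hypothetical log lines using documentation-only IP addresses.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET /vi/blog/learning-in-public HTTP/1.1" 200 5123 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.1 - - [01/Jan/2025:00:00:02 +0000] "GET /vi/blog/missing HTTP/1.1" 404 312 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.9 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    print(googlebot_status_counts(SAMPLE_LOG))
```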

Exercise 2: View crawled page

GSC → URL Inspection → enter a blog post URL → "View crawled page":

Does the HTML contain the full content?
Is the structured data present?
Compare it with the live page. What differs?
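
For the comparison step, one option is to copy the crawled HTML out of GSC, save the live page's HTML, and diff their visible text. The following is a minimal sketch under that assumption; `TextExtractor`, `visible_text`, and `text_diff` are hypothetical helper names, and the approach ignores markup-only differences by design.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> list[str]:
    """Return the non-empty text chunks of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.chunks


def text_diff(crawled_html: str, live_html: str) -> list[str]:
    """Return '- ' lines (only in crawled) and '+ ' lines (only in live)."""
    delta = difflib.ndiff(visible_text(crawled_html), visible_text(live_html))
    return [line for line in delta if line.startswith(("- ", "+ "))]
```

An empty result means the two snapshots carry the same visible text, which is what you would expect from a statically exported page.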

Exercise 3: Check robots.txt effectiveness

# View deployed robots.txt
curl -s https://leduykhuong.com/robots.txt

Questions: Can Googlebot access every blog post? Is /_next/ blocked?
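
You can answer both questions programmatically with Python's standard `urllib.robotparser`. This is a sketch with a hypothetical robots.txt body and a hypothetical `/_next/static/chunk.js` path, so paste in the output of the curl command above for a real check. One caveat: Python's parser applies rules in file order, while Google uses the most specific matching rule, so results can differ for files that mix Allow and Disallow on overlapping paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; replace with the deployed file's content.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /_next/
"""


def allowed_for_googlebot(robots_txt: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to whether the given robots.txt lets Googlebot fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch("Googlebot", url) for url in urls}


if __name__ == "__main__":
    checks = allowed_for_googlebot(
        SAMPLE_ROBOTS,
        [
            "https://leduykhuong.com/vi/blog/learning-in-public",
            "https://leduykhuong.com/_next/static/chunk.js",
        ],
    )
    for url, allowed in checks.items():
        print("ALLOWED" if allowed else "BLOCKED", url)
```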

Summary

3 stages: Crawling (discover + fetch) → Indexing (parse + understand) → Ranking (match + score)
Googlebot uses Chromium and CAN render JS, but static HTML is faster to process
Crawl budget: not a concern for small sites (<10K pages)
Static export is an SEO advantage: full HTML ready, fast CDN delivery, no JS rendering delay
GSC Crawl Stats: monitor crawl health, response time, and error rates
URL Inspection: see exactly what Google sees for any URL

Next lesson

Lesson 18: Sitemap Strategy for Static Sites. Beyond the basics: sitemap priorities, update strategy, multilingual sitemaps, and when you need multiple sitemaps.
