Key Takeaways
For RAG scenarios, this article explains why scheduled crawling of vertical websites requires proxies and deduplication, and provides a complete architecture and Python implementation covering scheduling, proxy requests, parsing, vectorization, and database storage.
Proxy and Crawling in RAG Knowledge Base Construction: Technical Implementation from Scheduled Crawling to Vectorized Storage
RAG (Retrieval-Augmented Generation) relies on high-quality, updatable knowledge bases. These knowledge bases are often sourced from scheduled crawls of vertical or document-based sites. High request volumes and numerous target sites can easily trigger anti-crawling measures and bans. Proxies, by distributing IPs at the crawling layer and coordinating with scheduling and deduplication, are crucial for ensuring stable, ongoing updates to the RAG data pipeline. This article first provides a combined analysis (why RAG scenarios require proxies and scheduled crawling), then presents a technical implementation (an architecture combining scheduling, proxy crawling, parsing, and vectorized storage, along with Python examples).
I. Combined Analysis: Why RAG Requires Proxies and Controlled Crawling
1.1 Data Characteristics of RAG Knowledge Bases
- Primary sources are web pages/documents: help documentation, product pages, blogs, forums, etc., requiring HTML or API scraping from target sites.
- Requires Continuous Updates: Knowledge evolves, and RAG effectiveness depends on data freshness, necessitating scheduled incremental crawling rather than one-time full crawls.
- Multi-site, multi-page: A single site may contain tens of thousands of URLs. Aggregating requests across multiple sites creates massive traffic volumes. Concentrating requests from a single IP exit point easily triggers throttling or bans.
Direct high-frequency scraping with a fixed IP may result in 429 errors/captchas at best, or IP bans at worst, causing scraping interruptions and failed knowledge base updates. Therefore, proxies (dynamically rotated or session-based) must be introduced at the scraping layer, combined with scheduling, deduplication, and throttling to balance "update frequency" with "operational stability."
1.2 Value of Proxies in the RAG Pipeline
| Capability | Role in RAG |
|------|------------------|
| IP Diversification | Distributes requests across multiple sites and URLs to different exit IPs, reducing single-IP request density and mitigating target site risk controls. |
| Multi-Region (Optional) | For sites returning region-specific content, region-bound proxies retrieve "localized" pages to enrich knowledge base perspectives. |
| Session Persistence (Optional) | For document sites requiring login or cookies, reuse the same IP within a set timeframe to avoid frequent logins. |
| Observability & Retries | Combined with status codes, retries, and circuit breakers, it enables tracking success rates and latency for SLA and capacity planning. |
Conclusion: In RAG knowledge base pipelines, proxies should be standard components of the scraping layer, forming a complete chain alongside schedulers, deduplication, parsing, and vectorization.
1.3 Comparison with Proxyless and One-Time Crawling
- Proxyless, single-machine crawling: Simple to implement but prone to bans at scale, unsuitable for production-grade RAG updates.
- One-time full crawl: Fails to reflect knowledge changes, leading to declining RAG effectiveness over time; full crawls often generate higher request volumes, increasing blocking risks.
- Proxy + Scheduled Incremental Crawling: Scheduled based on policies (e.g., by site or URL priority), combined with deduplication and change detection to crawl only new or updated pages, keeping request volume manageable. Proxies ensure distributed exit points and high stability, making this the recommended approach.
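As a sketch of how incremental scheduling can select "only new or updated pages," the snippet below parses a sitemap and keeps only URLs whose `lastmod` is on or after the last crawl date. The sitemap fragment and cutoff date are illustrative.

```python
from datetime import date
from xml.etree import ElementTree

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_updated_since(sitemap_xml: str, last_crawl: date) -> list[str]:
    """Return sitemap URLs whose <lastmod> is on or after last_crawl."""
    root = ElementTree.fromstring(sitemap_xml)
    fresh = []
    for url_el in root.iter(f"{SITEMAP_NS}url"):
        loc = url_el.findtext(f"{SITEMAP_NS}loc")
        lastmod = url_el.findtext(f"{SITEMAP_NS}lastmod")
        if loc and lastmod and date.fromisoformat(lastmod[:10]) >= last_crawl:
            fresh.append(loc)
    return fresh

# Illustrative sitemap fragment
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/page1</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/docs/page2</loc><lastmod>2023-01-15</lastmod></url>
</urlset>"""

print(urls_updated_since(sitemap, date(2024, 1, 1)))
# only page1 was modified after the cutoff
```

The filtered list then becomes the input to the crawl queue, keeping per-run request volume proportional to actual change rate.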
II. Technical Implementation: Overall Architecture and Data Flow
2.1 Pipeline Overview
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│    Scheduler    │─────▶│   Crawl Worker   │─────▶│  Proxy Service  │
│  (cron/queue)   │      │ (proxy + retry + │      │ (dynamic proxy  │
│                 │      │   rate limit)    │      │      API)       │
└─────────────────┘      └────────┬─────────┘      └─────────────────┘
                                  │
                                  ▼
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│  Dedup/Change   │◀─────│  Raw HTML/Text   │      │  Target Sites   │
│   Detection     │      └──────────────────┘      └─────────────────┘
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌──────────────────┐
│  Parse & Chunk  │─────▶│ Vectorize+Store  │ ← used by RAG retrieval
└─────────────────┘      └──────────────────┘
- Scheduler: Driven by cron or queues, generates the list of URLs to crawl (readable from DB/config files/sitemaps), supporting prioritization and deduplication (prevents duplicate entries).
- Crawling Worker: Retrieves URLs from scheduler or queue, initiates HTTP requests via proxy with retry logic and rate limiting; outputs raw HTML or cleaned text.
- Proxy Service: Provides dynamic proxy endpoints (e.g., BytesFlows ), rotates IPs per request or session, with optional region binding.
- Deduplication/Change Detection: Determines whether a page is new or modified based on URL or content fingerprinting. Only processes new/modified pages for subsequent parsing and vectorization, conserving computational resources and storage.
- Parsing and Segmentation: Convert HTML to plain text or structured segments (e.g., by heading, paragraph), facilitating segmentation strategies (fixed-length, paragraph-based, etc.).
- Vectorization + Storage: Invokes Embedding API to generate vectors, writes to vector databases (e.g., Milvus, Chroma, pgvector) for RAG retrieval.
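To make the "vectorize + store" step concrete, here is a minimal in-memory stand-in for a vector store, ranking stored chunks by cosine similarity. A real pipeline would call an Embedding API and a database such as Milvus, Chroma, or pgvector; the character-frequency `embed` function below is purely illustrative.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemoryVectorStore:
    """Toy stand-in for Milvus/Chroma/pgvector: upsert vectors, query by similarity."""
    def __init__(self):
        self.rows: dict[str, tuple[list[float], str]] = {}

    def upsert(self, doc_id: str, vector: list[float], text: str) -> None:
        self.rows[doc_id] = (vector, text)

    def query(self, vector: list[float], top_k: int = 3) -> list[str]:
        ranked = sorted(self.rows.items(), key=lambda kv: cosine(vector, kv[1][0]), reverse=True)
        return [text for _, (_, text) in ranked[:top_k]]

# Toy "embedding": character-frequency vector (illustrative only; use a real Embedding API)
def embed(text: str) -> list[float]:
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

store = InMemoryVectorStore()
store.upsert("d1", embed("proxy rotation for crawling"), "proxy rotation for crawling")
store.upsert("d2", embed("vector database retrieval"), "vector database retrieval")
print(store.query(embed("rotating proxies"), top_k=1))
```

The `upsert`/`query` shape mirrors what production vector databases expose, so swapping in a real client later is a local change.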
The proxy appears only in the "Crawling Worker → Proxy Service → Target Site" segment; downstream parsing and vectorization are decoupled from the proxy.
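The dedup/change-detection component can be sketched as a persistent fingerprint store. The example below uses stdlib `sqlite3` to remember each URL's last content fingerprint across runs; the table name and schema are illustrative.

```python
import hashlib
import sqlite3

def open_fingerprint_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the fingerprint store; use a file path for persistence across runs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS page_fp (url TEXT PRIMARY KEY, fp TEXT)")
    return conn

def has_changed(conn: sqlite3.Connection, url: str, text: str) -> bool:
    """True if the page is new or its content fingerprint differs from the last run."""
    fp = hashlib.sha256(text.encode()).hexdigest()
    row = conn.execute("SELECT fp FROM page_fp WHERE url = ?", (url,)).fetchone()
    if row and row[0] == fp:
        return False  # unchanged: skip parsing/vectorization
    conn.execute("INSERT OR REPLACE INTO page_fp (url, fp) VALUES (?, ?)", (url, fp))
    conn.commit()
    return True

conn = open_fingerprint_db()
print(has_changed(conn, "https://example.com/docs/page1", "v1"))  # True: new page
print(has_changed(conn, "https://example.com/docs/page1", "v1"))  # False: unchanged
print(has_changed(conn, "https://example.com/docs/page1", "v2"))  # True: content changed
```

Only pages for which `has_changed` returns True need to flow into parsing and vectorization, which is what conserves compute and storage.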
2.2 Proxy Integration Methods (same as previous article)
- Request-based rotation: Each URL uses a different IP, offering strong anti-blocking effectiveness and suitability for most document sites.
- Session-based: Reuse IPs within the same site and session window, suitable for document sources requiring login.
- Region binding: When a region-specific version of a page is required, append parameters such as `country=xx` to the proxy URL.
Implementation: Within the crawling worker, configure the HTTP client's proxy to point to the proxy gateway provided by the proxy service provider.
III. Python Implementation: Scheduling, Proxy Crawling, Parsing, and Database Integration Overview
The following Python code demonstrates: scheduled URL generation → proxy-based crawling → basic parsing and segmentation → embedding and database insertion. Production environments may substitute Celery/Redis queues, Scrapy, Playwright, etc.; this example aims for a minimal working pipeline.
3.1 Dependencies and Configuration
# pip install requests beautifulsoup4 openai  # OpenAI used for the vectorization example; swap in another provider as needed
import os
import time
import hashlib
import requests
from bs4 import BeautifulSoup
from typing import Iterator
# Proxy gateway (replace with your actual proxy service address, e.g., BytesFlows)
PROXY_GATEWAY = os.getenv("PROXY_GATEWAY", "http://user:pass@proxy.example.com:8080")
REQUEST_TIMEOUT = 30
MAX_RETRIES = 3
RETRY_BACKOFF = 2
CHUNK_SIZE = 500  # characters; adjust as needed
CHUNK_OVERLAP = 50  # overlap between chunks
3.2 Proxy-Based Scraping of a Single URL (with Retries)
def fetch_with_proxy(url: str) -> str | None:
    proxies = {"http": PROXY_GATEWAY, "https": PROXY_GATEWAY}
    for attempt in range(MAX_RETRIES):
        try:
            r = requests.get(
                url,
                proxies=proxies,
                timeout=REQUEST_TIMEOUT,
                headers={"User-Agent": "RAG-Crawler/1.0 (Knowledge Base)"},
            )
            if r.status_code == 429:
                time.sleep(RETRY_BACKOFF ** attempt)
                continue
            r.raise_for_status()
            return r.text
        except requests.RequestException:
            time.sleep(RETRY_BACKOFF ** attempt)
    return None
3.3 HTML Parsing and Segmentation
def extract_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

def content_fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start = end - overlap
    return chunks
3.4 Single URL Full Workflow: Crawl → Dedupe → Segment → Vectorize and Store (Illustrative)
# Placeholder for "write to vector store"; replace with Milvus/Chroma/OpenAI, etc.
def embed_and_store(chunks: list[str], url: str, doc_id: str):
    for i, chunk in enumerate(chunks):
        # Call an Embedding API to get the vector (omitted in this example)
        # vector = openai.Embedding.create(input=chunk, model="text-embedding-3-small")["data"][0]["embedding"]
        # vector_db.upsert(collection, [{"id": f"{doc_id}_{i}", "url": url, "text": chunk, "vector": vector}])
        print(f"  stored chunk {i+1}/{len(chunks)} for {url}")

def process_one_url(url: str, seen_fingerprints: set) -> bool:
    html = fetch_with_proxy(url)
    if not html:
        return False
    text = extract_text(html)
    fp = content_fingerprint(text)
    if fp in seen_fingerprints:
        return True  # skip duplicate content
    seen_fingerprints.add(fp)
    chunks = chunk_text(text)
    doc_id = hashlib.sha256(url.encode()).hexdigest()[:12]
    embed_and_store(chunks, url, doc_id)
    return True
3.5 Scheduling Entry Point: Sequential Batch URL Crawling (can be adapted to queue consumption)
def run_scheduled_crawl(urls: list[str]):
    seen = set()
    for url in urls:
        ok = process_one_url(url, seen)
        print(f"{'OK' if ok else 'FAIL'}: {url}")
        time.sleep(1)  # simple rate limiting; use a token bucket or similar in production

if __name__ == "__main__":
    urls = [
        "https://example.com/docs/page1",
        "https://example.com/docs/page2",
    ]
    run_scheduled_crawl(urls)
Replace PROXY_GATEWAY with an actual proxy address (e.g., BytesFlows) to obtain a functional RAG crawling pipeline: scheduling → proxy-based crawling → deduplication → parsing and segmentation → vectorization and storage. For production environments, replace `run_scheduled_crawl` with URL consumption from Redis/DB, use Celery for scheduled tasks, and connect `embed_and_store` to actual vector stores and Embedding services.
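The queue-driven variant can be sketched with Python's stdlib `queue.Queue` standing in for Redis/Celery; the worker loop and sentinel-based shutdown below are illustrative, and `results.append(...)` stands in for calling `process_one_url`.

```python
import queue
import threading

def worker(q: queue.Queue, results: list) -> None:
    """Consume URLs until a None sentinel; in production, each URL is fetched via the proxy."""
    while True:
        url = q.get()
        if url is None:
            q.task_done()
            break
        results.append(f"processed {url}")  # stand-in for process_one_url(url, ...)
        q.task_done()

q: queue.Queue = queue.Queue()
results: list[str] = []
for u in ["https://example.com/docs/page1", "https://example.com/docs/page2"]:
    q.put(u)

threads = [threading.Thread(target=worker, args=(q, results)) for _ in range(2)]
for t in threads:
    t.start()
for _ in threads:
    q.put(None)  # one sentinel per worker
for t in threads:
    t.join()
print(sorted(results))
```

With a real broker, the scheduler pushes URLs onto the queue on its own cadence, and workers scale horizontally without touching the downstream parsing/vectorization code.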
IV. Observability and Stability Recommendations
- Logging: Record each URL's status code, processing time, and deduplication hit status. Track success rates and latency per site.
- Rate limiting: Set per-site QPS caps and global concurrency limits to avoid maxing out proxy quotas and reduce load on target sites.
- Circuit Breaker: Pause tasks for a site with excessively high failure rates within a short timeframe before retrying.
- Incremental Crawling & Prioritization: Prioritize URLs with newer lastmod dates in sitemaps or sort by citation popularity to enhance update efficiency.
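The per-site rate-limiting recommendation above can be sketched as a simple token bucket (the capacity and refill rate are illustrative):

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per target site keeps per-site QPS under control
bucket = TokenBucket(rate=2.0, capacity=5.0)
allowed = sum(bucket.allow() for _ in range(10))
print(allowed)  # the initial burst of 5 is allowed; the rest are throttled
```

Keeping one bucket per site (plus a global one) matches the "per-site QPS caps and global concurrency limits" recommendation.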
V. Summary
- Combined Analysis: RAG knowledge bases rely on multi-site, sustainably updated crawling. Fixed IP high-frequency crawling easily triggers risk controls; proxies distribute requests at the crawling layer, ensuring stable updates. Combined with scheduled tasks and deduplication, this forms a sustainable RAG data pipeline.
- Technical Implementation: Scheduler generates URLs → crawling workers make proxy requests → deduplication/change detection → parsing and segmentation → vectorization and database storage. The Python examples herein cover the complete pipeline from crawling to storage. Replace the proxy gateway with a solution like BytesFlows, integrate queues and real vector databases as needed, and you can implement the closed-loop "proxy + scheduled crawling + vectorization" workflow for RAG scenarios.
Complementing the previous article "Dynamic Proxies in AI Data Pipelines": The previous article focused on generic data collection + dynamic proxies + retry circuit breakers; this article emphasizes RAG scenarios + scheduling + deduplication + parsing and vectorization. Together, they cover typical technical implementations of "proxy integration with AI technologies."