
41K Stars! This Python Scraping Library Wipes the Floor with Cloudflare

2026-05-03 20:16:19　Source: 侃故事的阿慶


One library = Requests + BeautifulSoup + Scrapy + Playwright + anti-bot evasion, with parsing 784 times faster than BS4

1. The scraping world is buzzing: 41K stars in just 18 months

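If you write Python scrapers, you have almost certainly hit these breaking points:

The target site gets redesigned: all your XPath selectors stop working
The site adds Cloudflare Turnstile: your code simply dies
A single thread is fine for a small site: at tens of thousands of pages it cannot keep up
Requests cannot handle JS rendering: switching to Selenium is slow and heavy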

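Until now you needed Requests + BeautifulSoup + Scrapy + Playwright + anti-bot middleware + a proxy pool. Just getting that stack installed and working together could take half a day.

Now a single library covers all of it.

That library is Scrapling, an "adaptive web scraping framework" built by security researcher Karim Shoair (D4Vinci). It was open-sourced in October 2024, and in just 18 months its GitHub star count has climbed to 41K, an average of about 75 new stars per day. It is hands down the hottest project in the scraping world.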

2. Why is it so popular? Three core techniques

2.1 Adaptive parsing engine: it survives site redesigns

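This is Scrapling's most striking feature.

However well a traditional scraper is written, one site redesign can break it completely. Scrapling's parser can learn how a site's structure has changed and automatically relocate your elements.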

    # First scrape: save the element's features
    products = page.css('.product', auto_save=True)
    # After the site is redesigned: relocate the data adaptively
    products = page.css('.product', adaptive=True)

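Under the hood this relies on a similarity algorithm. With auto_save=True, Scrapling stores the element's identifying features; a later call with adaptive=True matches against those saved features automatically. In short, your scraper learns to play "spot the difference".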
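To make the idea concrete, here is a minimal sketch of similarity-based relocation using only the standard library. It is not Scrapling's actual algorithm: the `fingerprint`, `similarity`, and `relocate` helpers, the choice of features, and the equal weighting are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fingerprint(tag, attrs, text, path):
    """Record properties of an element that tend to survive a redesign (hypothetical feature set)."""
    return {"tag": tag, "attrs": dict(attrs), "text": text.strip(), "path": tuple(path)}


def _ratio(a, b):
    """String similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def similarity(saved, candidate):
    """Average four per-property scores; equal weights are an arbitrary choice."""
    scores = [
        1.0 if saved["tag"] == candidate["tag"] else 0.0,
        _ratio(saved["text"], candidate["text"]),
        _ratio(" ".join(sorted(saved["attrs"].values())),
               " ".join(sorted(candidate["attrs"].values()))),
        _ratio("/".join(saved["path"]), "/".join(candidate["path"])),
    ]
    return sum(scores) / len(scores)


def relocate(saved, candidates):
    """Return the candidate that best matches the saved fingerprint."""
    return max(candidates, key=lambda c: similarity(saved, c))


# Fingerprint saved on the first scrape (the role auto_save=True plays in the article).
saved = fingerprint("div", {"class": "product"}, "Blue Widget $9.99",
                    ["html", "body", "div", "div"])

# Elements found after a redesign: the class name and the DOM path have both changed.
after_redesign = [
    fingerprint("nav", {"class": "menu"}, "Home About", ["html", "body", "nav"]),
    fingerprint("div", {"class": "product-card"}, "Blue Widget $9.99",
                ["html", "body", "main", "div"]),
    fingerprint("footer", {"class": "site-footer"}, "Contact us", ["html", "body", "footer"]),
]

best = relocate(saved, after_redesign)
print(best["attrs"]["class"])
```

The old selector `.product` no longer matches anything, yet the renamed element still scores highest because its tag, text, and most of its path are unchanged.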

2.2 Four fetchers: from plain requests to heavily protected sites

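| Fetcher | Use case | Anti-detection capability |
| Fetcher | Plain HTTP requests | TLS fingerprint impersonation + HTTP/3 |
| AsyncFetcher | High-concurrency async requests | Same as above |
| StealthyFetcher | Sites with strong anti-bot protection | Bypasses Cloudflare Turnstile |
| DynamicFetcher | JS-rendered dynamic pages | Full browser automation |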

StealthyFetcher deserves special attention: it can bypass Cloudflare Turnstile verification out of the box, which is close to a hard requirement for scraper developers in China.

    from scrapling.fetchers import StealthyFetcher

    # A single call gets past Cloudflare
    page = StealthyFetcher.fetch(
        'https://nopecha.com/demo/cloudflare',
        solve_cloudflare=True,
    )

2.3 Spider framework: a smooth step up from single pages to large crawls

Scrapling's Spider API closely mirrors Scrapy's design, but adds a long list of modern features:

    from scrapling.spiders import Spider, Response

    class QuotesSpider(Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]
        concurrent_requests = 10  # 10 concurrent requests

        async def parse(self, response: Response):
            for quote in response.css('.quote'):
                yield {
                    "text": quote.css('.text::text').get(),
                    "author": quote.css('.author::text').get(),
                }

    result = QuotesSpider().start()
    result.items.to_json("quotes.json")

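The key features at a glance:

Concurrent crawling: configurable concurrency and per-domain rate limiting
Multi-session management: plain requests and stealth browsers can be mixed in the same spider
Pause/resume: Ctrl+C pauses gracefully, and a restart picks up where the crawl left off
Streaming mode: receive items while the crawl is running and watch statistics in real time
Automatic block detection: blocked requests are detected and retried automatically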

    # Multi-session in practice: ordinary pages take the fast lane, protected pages take the stealth lane
    from scrapling.spiders import Spider, Request, Response
    from scrapling.fetchers import FetcherSession, AsyncStealthySession

    class MultiSessionSpider(Spider):
        def configure_sessions(self, manager):
            manager.add("fast", FetcherSession(impersonate="chrome"))
            manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

        async def parse(self, response: Response):
            for link in response.css('a::attr(href)').getall():
                if "protected" in link:
                    yield Request(link, sid="stealth")  # use the stealth session
                else:
                    yield Request(link, sid="fast")
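Two of the features listed above, a configurable concurrency cap and automatic retry on blocked responses, can be sketched with plain asyncio. This is not Scrapling's implementation: `fetch_with_retry`, `crawl`, `make_fake_fetch`, and the `BLOCKED_STATUSES` set are hypothetical names, and the fake fetcher stands in for real network calls so the example runs offline.

```python
import asyncio

BLOCKED_STATUSES = {403, 429}  # assumption: these status codes mean "blocked"


async def fetch_with_retry(fetch, url, semaphore, max_retries=3):
    """Fetch url under a concurrency cap, retrying while the response looks blocked."""
    for attempt in range(1, max_retries + 1):
        async with semaphore:          # at most `concurrency` fetches run at once
            status, body = await fetch(url)
        if status not in BLOCKED_STATUSES:
            return status, body, attempt
        await asyncio.sleep(0)         # a real crawler would back off here
    return status, body, max_retries


async def crawl(fetch, urls, concurrency=10):
    """Run all fetches concurrently, bounded by a shared semaphore."""
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [fetch_with_retry(fetch, url, semaphore) for url in urls]
    return await asyncio.gather(*tasks)


def make_fake_fetch():
    """Stand-in for a network call: URLs containing 'protected' are blocked once, then succeed."""
    seen = {}

    async def fake_fetch(url):
        seen[url] = seen.get(url, 0) + 1
        if "protected" in url and seen[url] == 1:
            return 403, ""
        return 200, f"content of {url}"

    return fake_fetch


urls = ["https://example.com/a", "https://example.com/protected"]
results = asyncio.run(crawl(make_fake_fetch(), urls, concurrency=2))
for status, body, attempts in results:
    print(status, attempts, body)
```

The semaphore bounds in-flight requests regardless of how many URLs are queued, and the retry loop keeps the blocked-page handling out of the parsing code.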

3. Parsing performance: 784 times faster than BS4

The figures below come from the official benchmarks (averaged over 100+ runs):

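| Rank | Library | Time (ms) | vs Scrapling |
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01x |
| 3 | Raw Lxml | 2.54 | 1.26x |
| 4 | PyQuery | - | ~12x |
| 5 | Selectolax | - | ~41x |
| 6 | MechanicalSoup | - | ~767x |
| 7 | BS4 + Lxml | - | ~784x |
| 8 | BS4 + html5lib | - | ~1679x |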

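The conclusion is clear: Scrapling parses about as fast as Parsel/Scrapy, roughly 784 times faster than BeautifulSoup (with the lxml backend), and about 12 times faster than PyQuery.

Adaptive element lookup beats the competition as well: Scrapling takes 2.39 ms against AutoScraper's 12.45 ms, more than 5 times faster.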
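The ratios quoted above are simply each library's time divided by Scrapling's. The sketch below reproduces that arithmetic and shows one way to average timings over many runs with the standard library; `average_ms` and `relative` are hypothetical helper names, and this is not the official benchmark script.

```python
import timeit


def average_ms(func, rounds=100):
    """Average wall-clock time of func() over `rounds` calls, in milliseconds."""
    total_seconds = timeit.timeit(func, number=rounds)
    return total_seconds / rounds * 1000


def relative(time_ms, baseline_ms):
    """How many times slower than the baseline, rounded to two decimals."""
    return round(time_ms / baseline_ms, 2)


# Figures quoted in the article
print(relative(2.04, 2.02))    # Parsel/Scrapy vs Scrapling -> 1.01
print(relative(2.54, 2.02))    # raw lxml vs Scrapling -> 1.26
print(relative(12.45, 2.39))   # AutoScraper vs Scrapling, adaptive lookup -> 5.21

# Timing any callable the same way (a trivial workload as a placeholder)
print(average_ms(lambda: sum(range(1000)), rounds=100) >= 0)
```

Averaging over many rounds smooths out one-off noise such as cache warm-up, which matters when the measured times are only a few milliseconds.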

4. Better still: a built-in CLI and MCP server

One-command scraping from the command line

    # Export a page's content as Markdown without writing a line of code
    scrapling extract get 'https://example.com' content.md
    # CSS selector + stealth mode
    scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' \
        captchas.html --css-selector '#padded_content a' --solve-cloudflare

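MCP server: a scraping assistant for AI agents

This is the most interesting feature added in 2025: Scrapling ships with an MCP (Model Context Protocol) server that AI agents (Claude, Cursor, and others) can call directly to scrape web pages. The server first uses Scrapling to extract the target content and passes only the essential data to the AI, which substantially reduces token usage and cost.

There is also a dedicated Agent Skill that can be installed from ClawHub.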
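The saving comes from sending the model a targeted extract rather than the whole page. The standard-library sketch below illustrates that idea only; it is not the MCP server's code. `TextById` and `extract_text` are hypothetical helpers, and character count is used as a rough proxy for tokens.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches target_id."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # nesting depth inside the target element; 0 means outside it
        self.done = False    # True once the target element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, target_id):
    """Return only the visible text of the element with the given id."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


html = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/about'>About</a></nav>"
    "<div id='article'><h1>Release notes</h1><p>Version 2 adds a spider API.</p></div>"
    "<footer>Copyright and links</footer></body></html>"
)

payload = extract_text(html, "article")
print(payload)
print(len(payload), "characters sent instead of", len(html))
```

On real pages, where navigation, scripts, and styling usually dwarf the content, the gap between the extract and the raw HTML is far larger than in this toy document.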

5. Installation

    # Core parsing engine only
    pip install scrapling
    # With fetchers and browsers
    pip install "scrapling[fetchers]"
    scrapling install
    # With the CLI shell
    pip install "scrapling[shell]"
    # Everything
    pip install "scrapling[all]"

Requires Python 3.10+. A ready-made Docker image is also available:

docker pull pyd4vinci/scrapling

6. Project at a glance

| Metric | Value |
| Stars | 41,405 |
| Forks | 3,730 |
| License | BSD-3-Clause (free for commercial use) |
| Author | Karim Shoair (D4Vinci) |
| Created | 2024-10-13 |
| Requires | Python 3.10+ |
| Test coverage | 92% |
| Core topics | AI, MCP, Cloudflare bypass, Playwright |

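Project: github.com/D4Vinci/Scrapling
Documentation: scrapling.readthedocs.io

Until now, no scraping library had combined requests, parsing, a crawling framework, anti-bot evasion, and AI integration in a single package. Scrapling does, and each of its modules holds up well on its own.

For Python scraper developers, this may be the open-source project most worth learning in 2025. 41K stars in 18 months did not happen by accident.

Reminder: follow the target site's robots.txt and terms of service, and use scraping responsibly. This library is intended only for lawful data collection and for education and research.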
