主要资料
- Scrapy A Fast and Powerful Scraping and Web Crawling Framework
- python3网络爬虫开发实践(第二版)
- Scrapy Tutorial — Scrapy 2.3.0 documentation
- Scrapy 教程 — Scrapy 2.5.0 文档 (翻译的很差)
- Scrapy 入门教程 菜鸟教程
第一个例子
版本为Scrapy 2.3.0,爬取目的网站为quotes.toscrape.com。
scrapy startproject mytest
cd mytest
scrapy genspider quotes quotes.toscrape.com
目录结构如下:
mytest/
scrapy.cfg
mytest/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
quotes.py
生成的quotes.py文件如下:
# -*- coding: utf-8 -*-
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes" #爬虫id
allowed_domains = ["quotes.toscrape.com"] #允许爬取的域名
start_urls = ['http://quotes.toscrape.com/'] #初始请求
def parse(self, response): #用于解析响应结果
pass
修改items.py,用于指定爬取目的字段
import scrapy
class MytestItem(scrapy.Item):
# define the fields for your item here like:
text = scrapy.Field()
author = scrapy.Field()
tags = scrapy.Field()
可以利用scrapy提供的工具来测试自己是否选择了合适的爬取目标:
scrapy shell 'http://quotes.toscrape.com/page/1/'
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7fd9cdc17fd0>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7fd9cd93d198>
[s] spider <DefaultSpider 'default' at 0x7fd9cd512978>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> quotes = response.css('.quote')
>>> quotes[0].css('.text::text').extract()
['\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d']
>>> quotes[0].css('.author::text').extract_first()
'Albert Einstein'
>>> quotes[0].css('.tags .tag::text').extract()
['change', 'deep-thoughts', 'thinking', 'world']
>>> response.css('.pager .next a::attr(href)').extract_first()
'/page/2/'
修改parse函数
def parse(self, response): #用于解析响应结果
quotes = response.css('.quote') #css选择器
for quote in quotes:
text = quote.css('.text::text').extract_first()
author = quote.css('.author::text').extract_first()
tags = quote.css('.tags .tag::text').extract()
yield {
'text': text,
'author': author,
'tags': tags
}
# 爬取下一页
next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)
现在就可以直接运行爬虫了:
root@debian:/home/kody/test_spiders/mytest# scrapy crawl quotes
2024-01-26 07:43:53 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: mytest)
......
2024-01-26 07:43:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2024-01-26 07:43:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2024-01-26 07:43:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
......
2024-01-26 07:43:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
2024-01-26 07:43:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
......
2024-01-26 07:43:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '\u201c... a mind needs books as a sword needs a whetstone, if it is to keep its edge.\u201d', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2024-01-26 07:43:58 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/page/10/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2024-01-26 07:43:58 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-26 07:43:58 [scrapy.core.engine] INFO: Spider closed (finished)
想要把爬取到的内容输出到文件,只需要在命令中增加-o
选项即可
例如: scrapy crawl quotes -o quotes.json
输出结果到MongoDB
利用scrapy提供的组件,只需要修改pipelines.py即可
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
import pymongo
# 这个类稍微处理了下text的长度
class TextPipeline(object):
def __init__(self):
self.limit = 50
def process_item(self, item, spider):
if item['text']:
if len(item['text']) > self.limit:
# 截取前50个字符
item['text'] = item['text'][0:self.limit].rstrip() + '...'
return item
else:
return DropItem('Missing Text')
class MongoPipeline(object):
def __init__(self, mongo_uri, mongo_db):
self.mongo_uri = mongo_uri #数据库地址
self.mongo_db = mongo_db #数据库名称
@classmethod
def from_crawler(cls, crawler):
return cls(
mongo_uri=crawler.settings.get('MONGO_URI'), #从settings.py中读取
mongo_db=crawler.settings.get('MONGO_DB', 'items')
)
def open_spider(self, spider):
# 连接数据库
self.client = pymongo.MongoClient(self.mongo_uri)
# 选择数据库
self.db = self.client[self.mongo_db]
def process_item(self, item, spider):
# 在集合temp_collection中插入数据
self.db['temp_collection'].insert_one(dict(item))
return item
def close_spider(self, spider):
# 关闭数据库连接
self.client.close()
另外在settings.py中添加相关的配置:
ITEM_PIPELINES = {
'mytest.pipelines.TextPipeline': 300,
'mytest.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'mongodb://alice:alice@localhost:27017/temp'
MONGO_DB = 'temp'
这样,执行scrapy crawl quotes
后,即将数据保存在了MongoDB中:
root@debian:/home/kody/test_spiders/mytest# mongo
MongoDB shell version: 3.2.9
connecting to: test
rs0:PRIMARY> use temp
switched to db temp
rs0:PRIMARY> db.auth('alice','alice')
1
rs0:PRIMARY> show tables;
temp_collection
rs0:PRIMARY> db.temp_collection.find().pretty()
{
"_id" : ObjectId("65b36e2a038992ab6b85cfac"),
"text" : "“The world as we have created it is a process of o...",
"author" : "Albert Einstein",
"tags" : [
"change",
"deep-thoughts",
"thinking",
"world"
]
}
......