Getting Started with Scrapy

Posted by Kody Black on January 26, 2024

Main references

First example

The Scrapy version used here is 2.3.0, and the target site is quotes.toscrape.com.

scrapy startproject mytest
cd mytest
scrapy genspider quotes quotes.toscrape.com

The generated directory structure looks like this:

mytest/
    scrapy.cfg
    mytest/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes.py

The generated quotes.py file looks like this:

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"  # spider name, used with "scrapy crawl"
    allowed_domains = ["quotes.toscrape.com"]  # domains the spider may crawl
    start_urls = ['http://quotes.toscrape.com/']  # initial request(s)

    def parse(self, response):  # called to parse each downloaded response
        pass

Next, modify items.py to define the fields we want to scrape:

import scrapy

class MytestItem(scrapy.Item):
    # define the fields for your item here like:
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

You can use the interactive shell that Scrapy ships with to check whether your selectors pick out the right content:

scrapy shell 'http://quotes.toscrape.com/page/1/'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fd9cdc17fd0>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fd9cd93d198>
[s]   spider     <DefaultSpider 'default' at 0x7fd9cd512978>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> quotes = response.css('.quote')
>>> quotes[0].css('.text::text').extract()
['\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d']
>>> quotes[0].css('.author::text').extract_first()
'Albert Einstein'
>>> quotes[0].css('.tags .tag::text').extract()
['change', 'deep-thoughts', 'thinking', 'world']
>>> response.css('.pager .next a::attr(href)').extract_first()
'/page/2/'
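
A side note on the API: extract() and extract_first() still work in Scrapy 2.x, but they are legacy aliases of getall() and get(), which are the recommended names in newer code. In the same shell session, the following calls return the same results as the ones above:

# equivalent selector calls using the newer get()/getall() API
quotes = response.css('.quote')
quotes[0].css('.text::text').getall()             # same as extract()
quotes[0].css('.author::text').get()              # same as extract_first()
response.css('.pager .next a::attr(href)').get()  # same as extract_first()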

Now modify the parse function:

    def parse(self, response):  # called to parse each downloaded response
        quotes = response.css('.quote')  # CSS selector for each quote block
        for quote in quotes:
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            yield {
                'text': text,
                'author': author,
                'tags': tags
            }
        # follow the link to the next page
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next_page)
        # on the last page next_page is None, urljoin() falls back to the current
        # URL, and the resulting duplicate request is dropped by the dupefilter
        # (visible in the crawl log below)
        yield scrapy.Request(url=url, callback=self.parse)
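
Note that parse() yields plain dicts here, so the MytestItem class defined in items.py is never actually used. If you want the field names to be checked, the loop can yield items instead. A minimal sketch, assuming "from mytest.items import MytestItem" is added at the top of quotes.py:

    def parse(self, response):
        for quote in response.css('.quote'):
            item = MytestItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item  # assigning a field not declared in MytestItem raises KeyError
        # next-page handling stays the same as above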

Now the spider can be run directly:

root@debian:/home/kody/test_spiders/mytest# scrapy crawl quotes
2024-01-26 07:43:53 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: mytest)
......
2024-01-26 07:43:54 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2024-01-26 07:43:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2024-01-26 07:43:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
......
2024-01-26 07:43:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: http://quotes.toscrape.com/)
2024-01-26 07:43:55 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
......
2024-01-26 07:43:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/10/>
{'text': '\u201c... a mind needs books as a sword needs a whetstone, if it is to keep its edge.\u201d', 'author': 'George R.R. Martin', 'tags': ['books', 'mind']}
2024-01-26 07:43:58 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/page/10/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2024-01-26 07:43:58 [scrapy.core.engine] INFO: Closing spider (finished)
2024-01-26 07:43:58 [scrapy.core.engine] INFO: Spider closed (finished)

To write the scraped items to a file, just add the -o option to the crawl command, for example:

scrapy crawl quotes -o quotes.json
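
The export format is inferred from the file extension (json, jsonlines, csv and xml all work), and -o appends to the file if it already exists. Since Scrapy 2.1 the same export can also be configured once in settings.py through the FEEDS setting; a minimal sketch equivalent to the command above:

# settings.py -- feed export equivalent of "scrapy crawl quotes -o quotes.json"
FEEDS = {
    'quotes.json': {
        'format': 'json',
        'encoding': 'utf8',
    },
}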

Saving the results to MongoDB

Thanks to the item pipeline mechanism that Scrapy provides, this only takes a couple of pipelines in pipelines.py, plus a few lines in settings.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
import pymongo

# This pipeline trims overly long text fields
class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                # keep only the first 50 characters
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned, to discard the item
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri  # MongoDB connection URI
        self.mongo_db = mongo_db    # database name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),  # read from settings.py
            mongo_db=crawler.settings.get('MONGO_DB', 'items')
        )

    def open_spider(self, spider):
        # connect to MongoDB
        self.client = pymongo.MongoClient(self.mongo_uri)
        # select the database
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert the item into the temp_collection collection
        self.db['temp_collection'].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()

Then add the corresponding configuration to settings.py. The numbers in ITEM_PIPELINES control execution order (lower runs first), so TextPipeline truncates the text before MongoPipeline stores it:

ITEM_PIPELINES = {
    'mytest.pipelines.TextPipeline': 300,
    'mytest.pipelines.MongoPipeline': 400,
}

MONGO_URI = 'mongodb://alice:alice@localhost:27017/temp'
MONGO_DB = 'temp'
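
Before running the spider, it can help to confirm that the URI actually works. A quick connectivity check from Python, assuming MongoDB is running locally with the alice user from the URI above:

import pymongo

# connect with the same URI that MongoPipeline will use
client = pymongo.MongoClient('mongodb://alice:alice@localhost:27017/temp')
print(client['temp'].list_collection_names())  # should not raise if auth works
client.close()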

Now, after running scrapy crawl quotes, the data ends up in MongoDB:

root@debian:/home/kody/test_spiders/mytest# mongo
MongoDB shell version: 3.2.9
connecting to: test
rs0:PRIMARY> use temp
switched to db temp
rs0:PRIMARY> db.auth('alice','alice')
1
rs0:PRIMARY> show tables;
temp_collection
rs0:PRIMARY> db.temp_collection.find().pretty()
{
        "_id" : ObjectId("65b36e2a038992ab6b85cfac"),
        "text" : "“The world as we have created it is a process of o...",
        "author" : "Albert Einstein",
        "tags" : [
                "change", 
                "deep-thoughts",
                "thinking",
                "world"
        ]
}
......