Overview of Python’s crawler framework

丨Overview

After getting started with crawlers, there are two paths we can take.

One is to keep studying in depth: learn some design patterns, strengthen your Python fundamentals, build your own wheels, and keep adding features such as distributed crawling and multi-threading to your crawlers. The other is to learn some excellent existing frameworks: use them first so you can handle basic crawling tasks, which solves the basic "food and clothing" problem, and then study their source code and internals to strengthen your skills further.

Personally speaking, the first path is essentially building the wheel yourself. Predecessors have already written some very good frameworks that can be used directly, but doing more by yourself gives you a deeper and more comprehensive understanding of crawlers. The second path is to take those well-written frameworks and use them well: first make sure you can complete the tasks you need to complete, and then study them in depth at your own pace. With the first path, the more you explore, the more thorough your knowledge of crawlers becomes. With the second, you build on other people's work, which is convenient, but you may never be in the mood to study the framework in depth, and your thinking may be constrained by it.

Personally, though, I prefer the latter. Building wheels is good, but even when you build your own wheel, aren't you still building it on top of the basic class libraries? Use what can be used. The point of learning a framework is to make sure you can meet real crawling needs, which is the most basic "food and clothing" problem. If you keep building wheels and end up with nothing, and when someone asks you to write a crawler you still cannot deliver after studying for so long, isn't that a bit of a loss? So for advancing as a crawler developer, I still recommend learning a framework or two as weapons for yourself. At the very least you can get the job done, just like going to the battlefield with a gun: at least you can hit the enemy, which is much better than staying home forever sharpening your knife, right?

丨Framework overview

The blogger has worked with several crawler frameworks, of which Scrapy and PySpider are the most useful. Personally speaking, PySpider is easier to get started with and easier to operate, because it adds a web interface, lets you write crawlers quickly, and integrates PhantomJS, which can be used to crawl JavaScript-rendered pages. Scrapy offers a high degree of customization and is lower-level than PySpider, so it is well suited to learning and research; it requires more background knowledge to learn, but it is a very good base for studying distributed crawling and multi-threading on your own.

Here the blogger will write down his own learning experience and share it with everyone, in the hope that you will like it and that it will give you some help.

丨PySpider

PySpider is an open-source implementation of a crawler architecture by binux. Its main functional requirements are:

· Capture, update and schedule specific pages of multiple sites

· Need to extract structured information from pages

· Flexible, scalable, stable, and monitorable

This is also what most Python crawlers need: directional crawling and structured parsing. However, when facing all kinds of websites with different structures, a single crawling pattern may not be sufficient, and flexible crawl control is a must. To achieve this goal, a simple configuration file is usually not flexible enough, so controlling the crawl through scripts is the final choice.

Deduplication, scheduling, queuing, fetching, exception handling, monitoring and other functions are provided by the framework as services to the crawl scripts, which guarantees flexibility. Finally, a web editing and debugging environment and web task monitoring are added, forming this framework.

The design basis of pyspider is a crawling-loop model driven by Python scripts:

· Extract structured information, follow links, and control scheduling and crawling through Python scripts, achieving maximum flexibility

· Provide a web-based script editing and debugging environment, and display scheduling status on the web

· The crawling-loop model is mature and stable; the modules are independent of each other, connected through message queues, and can be flexibly expanded from a single process to a multi-machine distributed deployment

[Figure: pyspider architecture (pyspider-arch)]

The architecture of pyspider is mainly divided into three parts: the scheduler, the fetcher, and the processor (which executes the scripts):

· The components are connected by message queues. Except for the scheduler, which is a single point, both the fetcher and the processor can be deployed as multiple distributed instances. The scheduler is responsible for overall scheduling control.

· Tasks are scheduled by the scheduler, the fetcher fetches the web page content, and the processor executes the pre-written Python script and outputs results or generates new follow-up crawl tasks (sent back to the scheduler), forming a closed loop.

· Each script can flexibly use various Python libraries to parse the page, use the framework API to control the next crawling action, and control parsing by setting callbacks. A minimal script illustrating this follows below.
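
As a rough illustration of how these callbacks fit together, here is a minimal PySpider script modeled on the project's quick-start template; the target URL and the result fields are placeholder assumptions, not from the original article:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # Entry point: schedule the start page (placeholder URL)
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every outgoing link; the callback controls how each page is parsed
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Return structured information; pyspider records the dict as the result
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

In the PySpider web UI, a script like this is created as a project and then run, edited and debugged directly from the browser.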

丨Scrapy

Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of applications including data mining, information processing and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

Scrapy mainly includes the following components:

· Scrapy Engine: handles the data flow of the whole system and triggers events (the core of the framework)

· Scheduler: accepts requests from the engine and pushes them into a queue, returning them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses of the pages to be fetched); it decides which URL to crawl next and removes duplicate URLs

· Downloader: downloads web content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model)

· Spiders: spiders extract the information they need from specific web pages, that is, the so-called items (Item). Users can also extract links from them and let Scrapy continue to crawl the next page (a minimal spider sketch follows this list)

· Item Pipeline: responsible for processing the items extracted from web pages by the spiders. Its main jobs are persisting items, validating them, and removing unneeded information. When a page has been parsed by a spider, its items are sent to the pipeline and processed through several specific stages in sequence

· Downloader Middlewares: a framework layer between the Scrapy engine and the downloader that mainly processes the requests and responses passing between them

· Spider Middlewares: a framework layer between the Scrapy engine and the spiders whose main job is to process the spiders' response input and request output

· Scheduler Middlewares: middleware between the Scrapy engine and the scheduler that handles the requests and responses sent between them
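
To make the spider and item roles concrete, here is a minimal Scrapy spider sketch; the spider name, start URL and CSS selectors below are illustrative assumptions (they target the public Scrapy tutorial site), not part of the original article:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical spider: name, start URL and selectors are placeholders
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract structured items from the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Hand newly discovered links back to the engine/scheduler
        # so Scrapy continues crawling the next page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with something like `scrapy runspider quotes_spider.py -o quotes.json` collects the yielded items into a file (the file names here are also placeholders).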

The Scrapy running process is roughly as follows:

· First, the engine takes a link (URL) from the scheduler for the next crawl

· The engine wraps the URL in a request (Request) and passes it to the downloader, which downloads the resource and wraps it in a response package (Response)

· Then, the spider parses the Response

· If items (Item) are parsed out, they are handed to the item pipeline for further processing (see the pipeline sketch after this list)

· If links (URL) are parsed out, they are handed to the scheduler to wait for crawling
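
As a small sketch of that last hand-off to the item pipeline, the snippet below validates and cleans items before they are persisted; the class name and the field checks are illustrative assumptions, not from the original article:

```python
from scrapy.exceptions import DropItem


class ValidateAndCleanPipeline:
    """Hypothetical pipeline: validate items and strip unneeded whitespace."""

    def process_item(self, item, spider):
        # Drop items missing the (assumed) required field
        if not item.get("text"):
            raise DropItem("Missing 'text' field")

        # Clean the remaining fields before they are persisted
        item["text"] = item["text"].strip()
        if item.get("author"):
            item["author"] = item["author"].strip()
        return item
```

A pipeline like this is enabled in settings.py with an ITEM_PIPELINES entry pointing at its import path (the project module name depends on your own project).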

Text | Cui Qingcai Source | Jingmi