
What can a Python web crawler do?

A Python crawler starts from a certain page of a website (usually the home page), reads the content of that page, finds the other link addresses it contains, follows those links to the next pages, and so on, until all the pages of the site have been crawled. If the whole Internet is regarded as one website, a web spider can use the same principle to crawl every page on the Internet.
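As an illustration, here is a minimal sketch of that traversal in Python, using the third-party requests library and the standard-library HTMLParser. The start URL and page limit are placeholders, and a real crawler would also respect robots.txt and rate limits.

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests  # third-party; install with `pip install requests`


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, max_pages=100):
    """Breadth-first crawl of one site, starting from its home page."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages = {}  # url -> downloaded HTML

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that fail to download
        pages[url] = resp.text

        parser = LinkExtractor()
        parser.feed(resp.text)
        for href in parser.links:
            absolute = urljoin(url, href)
            # stay on the same site and avoid revisiting pages
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


if __name__ == "__main__":
    downloaded = crawl_site("https://example.com/")  # placeholder start URL
    print(f"Downloaded {len(downloaded)} pages")

The queue gives a breadth-first order: the home page is fetched first, then every page it links to, then the pages those link to, which matches the "page by page until the whole site is covered" behaviour described above.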

A web crawler (also called a web spider or web robot, and in the FOAF community more often called a web page chaser) is a program or script that automatically crawls information on the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm. A crawler automatically traverses the pages of a website and downloads their content.

With the rapid development of the Internet, the World Wide Web has become the carrier of a huge amount of information, and how to extract and use this information effectively has become an enormous challenge. Search engines, such as the traditional general-purpose engines AltaVista, Yahoo!, and Google, serve as tools that help people retrieve information and have become the entrance and guide for users to access the World Wide Web. However, these general-purpose search engines also have some limitations:

(1) Users from different fields and backgrounds often have different retrieval purposes and needs, and the results returned by general-purpose search engines contain a large number of web pages that users do not care about.

(2) The goal of a general-purpose search engine is to cover as much of the web as possible, so the contradiction between the limited resources of search engine servers and the unlimited resources of network data will only deepen.

(3) As data formats on the World Wide Web grow richer and network technology continues to develop, large amounts of data of different kinds, such as pictures, databases, audio, video, and other multimedia, appear, and general-purpose search engines are often unable to find and obtain this information-dense and structured data.

(4) Most general-purpose search engines provide keyword-based retrieval, and it is difficult for them to support queries based on semantic information.

To address these problems, the focused crawler emerged, crawling related web resources in a targeted way. A focused crawler is a program that automatically downloads web pages; guided by an established crawling goal, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Compared with a general-purpose web crawler, a focused crawler does not pursue broad coverage; instead, it aims to crawl the pages related to a specific topic and to prepare data resources for topic-oriented user queries.
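To make the difference concrete, the following is a minimal sketch of a focused crawl in Python: it only stores, and follows links from, pages whose text matches enough topic keywords. The keyword set, the threshold, and the regex-based link extraction are illustrative assumptions rather than a standard focused-crawling algorithm.

import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests  # third-party; install with `pip install requests`

# Hypothetical topic keywords; in practice these come from the crawling goal.
TOPIC_KEYWORDS = {"python", "crawler", "scraping"}


def relevance(html_text):
    """Fraction of topic keywords that appear in the page text."""
    words = set(re.findall(r"[a-z]+", html_text.lower()))
    return len(TOPIC_KEYWORDS & words) / len(TOPIC_KEYWORDS)


def focused_crawl(start_url, threshold=0.5, max_pages=50):
    """Keep (and expand from) only pages whose relevance meets the threshold."""
    queue = deque([start_url])
    seen = {start_url}
    relevant_pages = {}  # url -> HTML of on-topic pages only

    while queue and len(relevant_pages) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        if relevance(html) < threshold:
            continue  # off-topic: do not store it or follow its links
        relevant_pages[url] = html
        for href in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, href)
            if absolute not in seen and urlparse(absolute).scheme in ("http", "https"):
                seen.add(absolute)
                queue.append(absolute)
    return relevant_pages

Unlike the site-wide crawl sketched earlier, this version discards off-topic pages as soon as they are downloaded, so the crawl stays close to the chosen topic instead of trying to cover everything.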