Sina weather

0x00. Motivation

I am taking part in a college student innovation competition in which we study the emotions expressed in Weibo posts, so I need a large number of posts. I could not find a suitable program anywhere, whether on Baidu and CSDN at home or on Google, GitHub and CodeProject abroad, so I had to write my own.

I did find a similar program on Climbing Alliance, but it runs only on Windows and is closed source. Moreover, when I opened the files it had crawled and saved in Notepad++, there were many strange problems, so I gave up on it.

0x01. Basic knowledge

This program is written in python, so basic Python knowledge is necessary. In addition, if you have a certain computer network foundation, you will take fewer detours in the preliminary preparation.

For crawlers, you need to be clear about the following points:

1. Crawl targets fall into the following categories. The first category requires no login, such as the China Weather Network that I crawled when practicing. Pages of this kind are easy to crawl, so novices are advised to start with them. The second category requires login, such as Douban and Sina Weibo, and is harder to crawl. The third category is independent of the first two: the information you want is refreshed dynamically, for example through AJAX or embedded resources. This kind is the hardest to crawl, and I have never studied it, so I won't elaborate here (according to classmates, Taobao's product reviews belong to this category).

2. If the same data source comes in several forms (desktop version, mobile version, client, and so on), prefer the "purer" presentation. For example, Sina Weibo has a web version and a mobile version, and the mobile version can be opened in a desktop browser, so I prefer the mobile version.

3. A crawler usually downloads web pages locally and then extracts the interesting information in some way. In other words, crawling the page is only half the job; you still need to extract the information you care about from the downloaded HTML file, which requires some knowledge of XML. In this project I use XPath to extract information; other technologies such as XQuery can also be used. See w3cschool for details.

4. A crawler should imitate a human as closely as possible. Anti-crawling mechanisms are now well developed, from verification codes to IP bans, and crawler technology and anti-crawler technology are locked in a continuous game.
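To make point 3 concrete, here is a minimal sketch of extracting information from a downloaded page with XPath-style queries. It is not the project's code: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath and needs well-formed markup) instead of lxml, and the page snippet, class names and function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a downloaded page.
PAGE = """
<html><body>
  <div class="post" id="p1"><span class="text">first post</span></div>
  <div class="post" id="p2"><span class="text">second post</span></div>
  <div class="post">footer without an id</div>
</body></html>
"""

def extract_posts(page):
    """Return (id, text) pairs for every <div class="post"> that has an id."""
    root = ET.fromstring(page.strip())
    posts = []
    # Two predicates: the class must equal "post" and an id attribute must exist.
    for div in root.findall(".//div[@class='post'][@id]"):
        span = div.find("./span[@class='text']")
        posts.append((div.get("id"), span.text))
    return posts

print(extract_posts(PAGE))
```

Real pages are rarely well-formed XML, which is one reason to prefer a tolerant HTML parser such as lxml for actual crawling.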

0x02. Setting out

After deciding on the target, first visit the target web page to find out which of the categories above it belongs to. Also record the steps needed to reach the information you are interested in: whether you need to log in and, if so, whether a verification code is required; what actions are needed to get the information; whether any forms must be submitted; what rules the URLs of the pages containing the information follow; and so on.

The rest of this post uses my project as an example. The project crawls all posts of a specific Sina Weibo user since registration, and crawls 100 pages of posts (about 1000 posts) by keyword.

0x03. Collect the necessary information

Visiting the target web page shows that you need to log in. The login page looks as follows (figure: Sina Weibo mobile login page).

Note that there are many escape characters like "%xx" in the second half of the url, which will be discussed later in this article.

As you can see from this page, you need to fill in the account number, password and verification code to log in to Sina Weibo Mobile Edition.

The verification code has only been required recently (this article was written on 2016.3.11). If no verification code were required, there would be two ways to log in.

The first is to simulate the page with JavaScript: fill in the account and password and click the "Login" button. I once wrote a Java crawler this way, but I can no longer find that project, so I won't go into details here.

The second requires some HTTP background: submit an HTTP POST request containing the required information. We need Wireshark to capture the packets sent and received when logging in to Weibo. The figure below shows the packets I captured during login (figure: Wireshark capture result 1).

The URL of one page of a user's posts follows the pattern "/(displayID)?page=(pagenum)". This will be the basis on which our crawler splices URLs.
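The URL pattern just described can be sketched as a small helper. This is only an illustration: the host in BASE_URL is an assumption (the text gives only the path pattern), and build_user_url and the sample displayID are made-up names.

```python
BASE_URL = "http://weibo.cn"  # assumed host of the mobile site; not stated in the text

def build_user_url(display_id, pagenum):
    """Splice the URL of one page of a user's posts: /(displayID)?page=(pagenum)."""
    return "%s/%s?page=%d" % (BASE_URL, display_id, pagenum)

print(build_user_url("someuser", 3))
```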

Next, look at the page source and find where the information we want is located. Open the browser's developer tools and inspect a post directly to find its location, as shown below.

xpath

Looking at the HTML, every post sits in a <div> tag that carries two attributes: a class attribute with the value "c" and a unique id attribute. Knowing this helps us extract the information we need.

In addition, there are some factors that need special attention.

* Posts are divided into original posts and reposts.

* Depending on how far the release time is from the current time, the page displays time in several ways, such as "MM minutes ago", "today HH:MM", "MM month dd day HH:MM" and "yyyy-MM-dd HH:mm:ss".

* One page of the mobile version of Sina Weibo shows about 10 posts, so pay attention to the total number.

0x04. Coding

1. Crawl a user's Weibo

This project is developed in Python 2.7 and uses some third-party libraries, which can be installed with pip.

Because the verification code rules out automatic login, the user has to supply cookies in order to visit a specific user's Weibo pages.

The first library is Python's requests module, which can send URL requests with cookies.

    import requests

    print(requests.get(url, cookies=cookies).content)

This code prints the page returned by a URL request that carries cookies.
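Since the cookies are supplied by the user, they usually arrive as one raw string copied from the browser. The following is a minimal sketch of turning such a string into the dict that requests accepts for its cookies argument. The helper name and the sample cookie names are hypothetical, not part of the project.

```python
def parse_cookie_string(raw):
    """Turn 'name1=value1; name2=value2' into a dict usable as requests' cookies argument."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # split only on the first '='
        cookies[name.strip()] = value.strip()
    return cookies

cook = parse_cookie_string("name1=xxx; name2=yyy")
print(cook)
# The dict could then be sent along with a request, e.g.:
# requests.get(url, cookies=cook)
```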

First, get the number of pages of the user's Weibo. By checking the page source we can find the element that holds the page count and extract it with XPath or a similar technique.

number of pages

This project uses the lxml module to extract data from HTML through XPath.

First import the lxml module. Only etree is used in the project, so the import is: from lxml import etree.

Then the page count is returned as follows.

    def getpagenum(self):
        url = self.geturl(pagenum=1)
        # Visit the first page to get the page count.
        html = requests.get(url, cookies=self.cook).content
        selector = etree.HTML(html)
        pagenum = selector.xpath('//input[@name="mp"]/@value')[0]
        return int(pagenum)

The next step is to repeat the cycle: splice the URL -> visit the URL -> download the page.

Note that because of Sina's anti-crawling mechanism, if the same cookies visit pages too frequently, they enter something like a "cooling-off period" in which a useless page is returned. Analyzing this useless page shows that it carries specific information in a specific place, so XPath can be used to judge whether a page is useful to us.

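    @staticmethod  # takes no self, but is called as self.ispageneeded(html)
    def ispageneeded(html):
        selector = etree.HTML(html)
        try:
            title = selector.xpath('//title')[0]
        except:
            return False
        # Titles are shown as translated here; the live pages carry the Chinese originals.
        return title.text != 'Weibo Square' and title.text != 'Weibo'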

If a useless page comes back, you just need to request it again. Later experiments showed, however, that after a long period of frequent visits every returned page is useless, and the program falls into an infinite loop. To prevent this, I set a trycount threshold; once the number of attempts reaches it, the method returns automatically.
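The trycount idea can be sketched in isolation as a bounded retry loop. This is a simplified illustration rather than the project's code: fetch_with_retry, fetch and is_needed are hypothetical names, and the fake responses stand in for real HTTP requests.

```python
def fetch_with_retry(fetch, is_needed, trycount=20):
    """Call fetch() until is_needed(page) is true or trycount attempts are used up.

    Returns the page on success, or None if every attempt returned a useless page.
    """
    for _ in range(trycount):
        page = fetch()
        if is_needed(page):
            return page
    return None

# Hypothetical stand-ins: the first two responses are "cooling-off" pages.
responses = iter(["useless", "useless", "real page"])
page = fetch_with_retry(lambda: next(responses), lambda p: p != "useless", trycount=5)
print(page)
```

Returning None instead of looping forever lets the caller decide what to do next, for example wait a while or switch cookies.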

The following code fragment shows the method of single-threaded crawler.

    def startcrawling(self, startpage=1, trycount=20):
        attempt = 0
        try:
            os.mkdir(sys.path[0] + '/Weibo_raw/' + self.wanted)
        except Exception as e:
            print(str(e))
        # Get the page count, giving up after trycount failures.
        isdone = False
        while not isdone and attempt < trycount:
            try:
                pagenum = self.getpagenum()
                isdone = True
            except Exception as e:
                attempt += 1
        if attempt == trycount:
            return False
        i = startpage
        while i <= pagenum:
            attempt = 0
            isneeded = False
            html = ''
            # Request the page again until it is useful or trycount is reached.
            while not isneeded and attempt < trycount:
                html = self.getpage(self.geturl(i))
                isneeded = self.ispageneeded(html)
                if not isneeded:
                    attempt += 1
            if attempt == trycount:
                return False
            self.savehtml(sys.path[0] + '/Weibo_raw/' + self.wanted + '/' + str(i) + '.txt', html)
            print(str(i) + '/' + str(pagenum - 1))
            i += 1
        return True

With time efficiency in mind, after finishing the single-threaded crawler I wrote a multi-threaded version. The basic idea is to divide the number of Weibo pages by the number of threads. For example, if a user has 100 pages of posts and the program runs 10 threads, each thread is responsible for crawling only 10 pages. The rest is similar to the single-threaded version; only the boundary values need careful handling, so I won't go into details here. In addition, because multithreading is efficient and the concurrency is high, the server easily returns invalid pages, so the trycount setting matters even more.

While writing this post, I tested the crawler with a fresh cookie on the Weibo account of Beijing University of Posts and Telecommunications. All 3976 posts were crawled and extracted successfully in only 15 s, although the actual time probably depends on how fresh the cookies are and on the network environment. The command line was as follows; its meaning is explained on the project website:

python main.py _ t _ wm = xxxshub = xxxsub = xxxgsid _ ctandwm = xxxubupptm2020

That concludes the basic introduction to crawling; next comes the second part of the crawler, parsing. The project offers multi-threaded crawling, and threads generally finish out of order, whereas Weibo posts are sorted by time. The project therefore adopts a compromise: downloaded pages are saved to the local file system, each under its page number as the file name, and after crawling all files in the folder are traversed and parsed.
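The division of pages among threads mentioned above, including the boundary handling, could look roughly like this. It is a sketch under my own assumptions, not the project's implementation; split_pages is a made-up name, and pages are assumed to be numbered from 1.

```python
def split_pages(total_pages, nthreads):
    """Split pages 1..total_pages into at most nthreads contiguous (start, end) ranges, inclusive."""
    base, extra = divmod(total_pages, nthreads)
    ranges = []
    start = 1
    for k in range(nthreads):
        # The first `extra` threads take one additional page each.
        size = base + (1 if k < extra else 0)
        if size == 0:
            break  # more threads than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges

print(split_pages(100, 10))
```

Each (start, end) pair could then be handed to one thread, which crawls its pages and saves each one under its page number.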

From the earlier observations we know what features a Weibo post has. With XPath it is not hard to extract every tag with those features from a page.

As noted before, posts are divided into reposts and original posts, and time is displayed in several ways. In addition, because our research is only interested in the text of posts, illustrations are ignored.

    # Note: the default parsingtime is evaluated once, when the method is defined.
    def startparsing(self, parsingtime=datetime.datetime.now()):
        basepath = sys.path[0] + '/Weibo_raw/' + self.uid
        for filename in os.listdir(basepath):
            if filename.startswith('.'):
                continue  # skip hidden files
            path = basepath + '/' + filename
            f = open(path, 'r')
            html = f.read()
            selector = etree.HTML(html)
            # Every post is a <div class="c"> that also has an id.
            weiboitems = selector.xpath('//div[@class="c"][@id]')
            for item in weiboitems:
                weibo = Weibo()
                weibo.id = item.xpath('./@id')[0]
                # A "cmt" span marks a repost.
                cmt = item.xpath('./div/span[@class="cmt"]')
                if len(cmt) != 0:
                    weibo.isrepost = True
                    weibo.content = cmt[0].text
                else:
                    weibo.isrepost = False
                ctt = item.xpath('./div/span[@class="ctt"]')[0]
                if ctt.text is not None:
                    weibo.content += ctt.text
                # Links inside the post body: keep their text and the text after them.
                for a in ctt.xpath('./a'):
                    if a.text is not None:
                        weibo.content += a.text
                    if a.tail is not None:
                        weibo.content += a.tail
                if len(cmt) != 0:
                    reason = cmt[1].text.split(u'\xa0')
                    if len(reason) != 1:
                        weibo.reportstroy = reason[0]
                ct = item.xpath('./div/span[@class="ct"]')[0]
                time = ct.text.split(u'\xa0')[0]
                weibo.time = self.gettime(self, time, parsingtime)
                self.weibos.append(weibo.__dict__)
            f.close()

The parsingtime parameter exists because, early in development, crawling and parsing might not run at the same time (not strictly "simultaneously"), while Weibo displays time relative to the moment of access. For example, if a page is crawled at 10:00 and a post on it is shown as released five minutes ago, parsing it at 10:30 would yield the wrong time unless the crawl time is passed in.
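The gettime method itself is not shown in this post. Below is a rough sketch of what such a conversion might look like, written as a plain function. The Chinese patterns are my assumption about the original forms of the time formats listed earlier, and the project's actual logic may differ.

```python
import datetime
import re

def gettime(display, parsingtime):
    """Convert a displayed time string into a datetime, relative to parsingtime.

    The Chinese patterns below are assumed originals of the formats listed in the text.
    """
    m = re.match(r"(\d+)分钟前", display)  # "MM minutes ago"
    if m:
        return parsingtime - datetime.timedelta(minutes=int(m.group(1)))
    m = re.match(r"今天\s*(\d{1,2}):(\d{2})", display)  # "today HH:MM"
    if m:
        return parsingtime.replace(hour=int(m.group(1)), minute=int(m.group(2)),
                                   second=0, microsecond=0)
    m = re.match(r"(\d{1,2})月(\d{1,2})日\s*(\d{1,2}):(\d{2})", display)  # "MM month dd day HH:MM"
    if m:
        month, day, hour, minute = (int(g) for g in m.groups())
        return datetime.datetime(parsingtime.year, month, day, hour, minute)
    # Otherwise assume a full timestamp such as "2015-03-11 10:00:00".
    return datetime.datetime.strptime(display, "%Y-%m-%d %H:%M:%S")

crawled_at = datetime.datetime(2016, 3, 11, 10, 0, 0)
print(gettime(u"5分钟前", crawled_at))
```

Passing the crawl time rather than calling datetime.now() inside the function is what keeps relative times such as "5 minutes ago" correct when parsing happens later.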

The parsing results are kept in a list. Finally, the list is saved to the file system in JSON format and the intermediate folder is deleted.

    def save(self):
        f = open(sys.path[0] + '/Weibo_parsed/' + self.uid + '.txt', 'w')
        # ensure_ascii=False keeps Chinese text readable in the output file.
        jsonstr = json.dumps(self.weibos, indent=4, ensure_ascii=False)
        f.write(jsonstr)
        f.close()
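As a side note, when non-ASCII text is written with ensure_ascii=False it is safer to state the file encoding explicitly. Here is a minimal sketch of a save and load round trip; save_json, load_json and the sample record are made up for illustration, and the file is written to a temporary directory rather than the project's folders.

```python
import io
import json
import os
import tempfile

def save_json(path, records):
    """Write records as indented JSON, keeping non-ASCII characters readable."""
    text = json.dumps(records, indent=4, ensure_ascii=False)
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)

def load_json(path):
    """Read the records back from a UTF-8 encoded JSON file."""
    with io.open(path, "r", encoding="utf-8") as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "example.txt")
save_json(path, [{"id": "M_1", "content": u"开心", "isrepost": False}])
print(load_json(path))
```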

2. Crawl by keyword

As before, collect the necessary information first. Enter "python" on the search page of mobile Weibo, observe the URL, and study its pattern. The first page shows no obvious pattern, but the second page does, and the pattern can be applied back to the first page.

The second page

The first page after applying the pattern

Observing the URL, the only variables are keywords and page (hideSearchFrame in fact has no effect on our search results or the crawler), so these two variables can be controlled in code.

In addition, if the keyword is Chinese, the Chinese characters in the URL need to be converted. For example, if we type "开心" (happy) in the search box and search, the address bar displays the keyword as typed.

But when copied, the URL becomes

/search/mblog?hideSearchFrame=&keywords=%E5%BC%80%E5%BF%83&page=1

Fortunately, Python's urllib library has a quote function that handles the conversion of Chinese (English is left unchanged), so the keyword is passed through it before the URL is spliced.
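A small sketch of this conversion follows. Note that it is written for Python 3, where the function lives in urllib.parse (in Python 2.7, which the project uses, it is urllib.quote). The helper name build_search_path is made up, and the query parameter names are copied from the URL above.

```python
from urllib.parse import quote

def build_search_path(keyword, page):
    """Percent-encode the keyword and splice it into the search path."""
    return "/search/mblog?hideSearchFrame=&keywords=%s&page=%d" % (quote(keyword), page)

print(build_search_path("python", 2))
print(build_search_path(u"开心", 1))
```

ASCII keywords pass through unchanged, while each Chinese character becomes three percent-encoded UTF-8 bytes.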

In addition, since keyword search belongs to the data collection stage, only single-threaded page download is provided here. If you need multithreading, you can rewrite it yourself following the multi-threaded method for crawling a user's Weibo. Finally, the downloaded pages are extracted and saved (I know this module design is a bit strange; I plan to change it when I refactor).

    def keywordcrawl(self, keyword):
        realkeyword = urllib.quote(keyword)  # percent-encode Chinese keywords
        try:
            os.mkdir(sys.path[0] + '/keywords')
        except Exception as e:
            print(str(e))
        weibos = []
        try:
            # Handles emoji, but it doesn't seem to work.
            highpoints = re.compile(u'[\U00010000-\U0010ffff]')
        except re.error:
            highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
        pagenum = 0
        isneeded = False
        # Request the first page until it is useful, then read the page count.
        while not isneeded:
            html = self.getpage('/search/mblog?keywords=%s&page=1' % realkeyword)
            isneeded = self.ispageneeded(html)
            if isneeded:
                selector = etree.HTML(html)
                try:
                    pagenum = int(selector.xpath('//input[@name="mp"]/@value')[0])
                except:
                    pagenum = 1
        for i in range(1, pagenum + 1):
            try:
                isneeded = False
                while not isneeded:
                    html = self.getpage('/search/mblog?keywords=%s&page=%s' % (realkeyword, str(i)))
                    isneeded = self.ispageneeded(html)
                selector = etree.HTML(html)
                weiboitems = selector.xpath('//div[@class="c"][@id]')
                for item in weiboitems:
                    cmt = item.xpath('./div/span[@class="cmt"]')
                    if len(cmt) == 0:  # original posts only
                        ctt = item.xpath('./div/span[@class="ctt"]')[0]
                        if ctt.text is not None:
                            text = etree.tostring(ctt, method='text', encoding='unicode')
                            tail = ctt.tail
                            if text.endswith(tail):
                                index = -len(tail)
                                text = text[1:index]
                            # The handling of emoji seems to be unworkable.
                            text = highpoints.sub(u'\u25FD', text)
                            weibotext = text
                            weibos.append(weibotext)
                print(str(i) + '/' + str(pagenum))
            except Exception as e:
                print(str(e))
        f = open(sys.path[0] + '/keywords/' + keyword + '.txt', 'w')
        try:
            f.write(json.dumps(weibos, indent=4, ensure_ascii=False))
        except Exception as ex:
            print(str(ex))
        finally:
            f.close()
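The emoji handling in the code above is marked as not working. For comparison, here is a minimal sketch of the same substitution on Python 3, where strings are sequences of full Unicode code points and the first pattern can be used directly without the surrogate-pair fallback. This is an illustration on a different Python version than the project's, and replace_emoji is a made-up name.

```python
import re

# Matches any code point outside the Basic Multilingual Plane (where most emoji live).
HIGHPOINTS = re.compile(u"[\U00010000-\U0010ffff]")

def replace_emoji(text, placeholder=u"\u25FD"):
    """Replace every character outside the Basic Multilingual Plane with a placeholder square."""
    return HIGHPOINTS.sub(placeholder, text)

print(replace_emoji(u"nice \U0001F600 day"))
```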

I had never written a crawler before. To get Sina Weibo posts, I wrote three different crawlers, in Python and in Java. It is normal for a crawler to stop working, so don't be discouraged. Crawlers and anti-crawling mechanisms are constantly playing a game against each other.

In addition, please let me know if you reprint this post. If you think Bo is the boss, you don't need to tell me.