Baidu Crawler

1. Purpose

Use a crawler script to search Baidu for a keyword and scrape the resolved link addresses (r.url) and domain information of the results.
The keyword can be combined with GHDB (Google Hacking Database) syntax,
e.g. inurl:php?id=
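
For example, assuming the script in section 3 is saved as baidu_spider.py (a filename chosen here for illustration), a GHDB-style run would look like:

python baidu_spider.py "inurl:php?id="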

2. Key Techniques

2.1 Use the threading & Queue modules for multithreaded processing with a configurable thread count
2.2 Use the BeautifulSoup & re modules to match the href attributes of result links
2.3 Use the requests module to issue web requests & obtain the real address behind each result after redirects (r.url); see the sketch after this list
2.4 Baidu serves at most 76 result pages, so pn goes up to 750
2.5 Write the results to text files, with domains deduplicated
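
A minimal sketch of point 2.3, kept in the same Python 2 style as the script below. resolve_real_url is a helper name introduced here for illustration; the urlparse-based domain extraction is an alternative to the split('/') approach the script uses:

import requests
from urlparse import urlparse  # Python 2 module; urllib.parse in Python 3

headers = {'User-Agent': 'Mozilla/5.0'}

def resolve_real_url(link_url):
    # requests follows Baidu's redirect; r.url holds the final address
    r = requests.get(link_url, headers=headers, timeout=3)
    parsed = urlparse(r.url)
    domain = '%s://%s' % (parsed.scheme, parsed.netloc)
    return r.url, domain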

3. Crawler Script


#coding=utf-8

import requests
import re
import Queue
import threading
from bs4 import BeautifulSoup as bs
import os,sys,time

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}


class BaiduSpider(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while not self._queue.empty():
            url = self._queue.get_nowait()
            try:
                self.spider(url)
            except Exception, e:
                print e

    def spider(self, url):
        # 'self' is required here; without it: takes exactly 1 argument (2 given)
        r = requests.get(url=url, headers=headers)
        soup = bs(r.content, 'lxml')
        # Baidu result links carry a 'data-click' attribute and no class
        urls = soup.find_all(name='a', attrs={'data-click': re.compile('.'), 'class': None})
        for url in urls:
            new_r = requests.get(url=url['href'], headers=headers, timeout=3)
            if new_r.status_code == 200:
                # new_r.url is the real address behind Baidu's redirect link
                url_para = new_r.url
                url_index_tmp = url_para.split('/')
                url_index = url_index_tmp[0] + '//' + url_index_tmp[2]  # scheme://domain
                print url_para + '\n' + url_index
                with open('url_para.txt', 'a+') as f1:
                    f1.write(url_para + '\n')
                with open('url_index.txt', 'a+') as f2:
                    with open('url_index.txt', 'r') as f3:
                        # naive dedup: append the domain only if not already in the file
                        if url_index not in f3.read():
                            f2.write(url_index + '\n')
            else:
                print 'no access', url['href']


def main(keyword):
    queue = Queue.Queue()
    # re-encode the console keyword as UTF-8 for the query string
    de_keyword = keyword.decode(sys.stdin.encoding).encode('utf-8')
    print keyword
    # Baidu serves at most 76 result pages, so pn=750 max
    for i in range(0, 760, 10):
        queue.put('https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d' % (de_keyword, i))
    threads = []
    thread_count = 4
    for i in range(thread_count):
        threads.append(BaiduSpider(queue))
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print 'Usage: %s keyword' % sys.argv[0]
        sys.exit(-1)
    else:
        main(sys.argv[1])
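
One caveat in the script above: spider() re-reads url_index.txt on every hit, and with several threads two writes can interleave, so duplicate domains may slip through. A possible alternative (a sketch introduced here, not part of the original) keeps the seen domains in a shared set guarded by a lock; seen_domains and record_domain are names chosen for this sketch:

import threading

seen_domains = set()
domain_lock = threading.Lock()

def record_domain(url_index):
    # returns True the first time a domain is seen, False afterwards
    with domain_lock:
        if url_index in seen_domains:
            return False
        seen_domains.add(url_index)
        with open('url_index.txt', 'a') as f:
            f.write(url_index + '\n')
        return True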

Screenshot of the results (image omitted)

4. Areas to Improve

4.1 Support for multiple search engines (see the sketch after this list)
4.2 Handling of multiple search parameters
4.3 Combining with payloads
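
For 4.1, one possible direction (a sketch with assumptions, not taken from the original script) is a table of per-engine URL templates that main() could loop over. The Baidu entry mirrors the script above; the Bing entry uses Bing's public q/first query parameters:

ENGINES = {
    'baidu': 'https://www.baidu.com/s?ie=utf-8&wd=%s&pn=%d',
    'bing':  'https://www.bing.com/search?q=%s&first=%d',
}

def build_urls(engine, keyword, pages=76, step=10):
    # one search-result URL per page for the chosen engine
    template = ENGINES[engine]
    return [template % (keyword, i) for i in range(0, pages * step, step)]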

5. References

5.1. ADO, ichunqiu course: Python Security Tool Development and Application
5.2. https://github.com/sharpdeep/CrawlerBaidu/blob/master/CrawlerBaidu.py