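Work has been quiet lately, so I took the chance to relearn Python crawlers. That got me thinking about a few things:

  1. Is learning Python really just a passing fad that people follow because everyone else does? What is it good for, and can it change my situation? And if I don't learn it, what should I be doing instead?

  2. My Python learning has reached a watershed: I have touched on almost every area, so which direction should I actually go in?

  3. Right now I honestly feel a bit lost. But people always want change, and I have no better idea at the moment, so I will learn what I can and hope it comes in handy.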





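Crawler Tips

"Life is short, I use Python." Maybe because it is simple, maybe because it is productive... the language has caught on worldwide, and some schools have already made it a required course.

Downloader: which framework to crawl with (required)

requests + BeautifulSoup (regex or XPath also work)

  • Parsing with plain regular expressions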
import random
import re
import urllib.request
from urllib.parse import urlencode
from urllib.request import Request

base_url = 'https://www.douban.com/doulist/3936288/'
# Raw string so \s reaches the regex engine unchanged; re.S lets . match newlines
pattern = re.compile(r'<div\sclass="title">\s.*?<a.*?>(.*?)</a>', re.S)
user_agent = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]
for i in range(0, 250, 25):
    # The paging offset goes in the query string; passing it as `data` would turn the request into a POST
    query = urlencode({'start': i})
    # Pick a random User-Agent so the request looks like it comes from a browser
    headers = {'User-Agent': random.choice(user_agent)}
    requ = Request(base_url + '?' + query, headers=headers)
    html = urllib.request.urlopen(requ).read().decode('utf-8')
    results = re.findall(pattern, html)
    for result in results:
        # Drop the line breaks and padding around the link text
        print(result.replace('\n', '').strip())
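
To see what the regular expression actually captures without hitting the network, here is a minimal sketch that runs the same pattern against a small HTML string. The sample markup, the example.com links, and the helper name extract_titles are made up for illustration; the real page may be laid out differently.

```python
import re

# Same pattern as in the crawler above: capture the text inside the first <a>
# that follows <div class="title">. re.S lets . match across line breaks.
TITLE_PATTERN = re.compile(r'<div\sclass="title">\s.*?<a.*?>(.*?)</a>', re.S)

# Hypothetical sample markup, only meant to resemble a list page
SAMPLE_HTML = '''
<div class="title">
    <a href="https://example.com/subject/1/" target="_blank">
        Example Movie One
    </a>
</div>
<div class="title">
    <a href="https://example.com/subject/2/" target="_blank">
        Example Movie Two
    </a>
</div>
'''


def extract_titles(html):
    """Return the link text of every title block, with padding removed."""
    return [match.strip() for match in TITLE_PATTERN.findall(html)]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

The captured group still carries the newlines and indentation around the link text, which is why the crawler cleans each result before printing it.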


  • Parsing with BeautifulSoup
import random
import urllib.error
import urllib.request
from urllib.parse import urlencode
from urllib.request import Request

from bs4 import BeautifulSoup

base_url = 'https://www.douban.com/doulist/3936288/'
user_agent = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
]

for i in range(0, 250, 25):
    # The paging offset goes in the query string; passing it as `data` would turn the request into a POST
    query = urlencode({'start': i})
    header = {'User-Agent': random.choice(user_agent)}  # request headers
    req = Request(base_url + '?' + query, headers=header)  # URL + parameters + headers, disguised as a browser
    try:
        response = urllib.request.urlopen(req)
    except urllib.error.HTTPError as e:
        print(e.reason)
    else:
        html = response.read().decode('utf-8')
        # The interesting part: parse the page with BeautifulSoup
        soup = BeautifulSoup(html, 'html.parser')
        for item in soup.select('.title'):  # select the divs by class
            a = item.select('a')[0]  # the <a> tag inside the div
            title = a.get_text()  # text content of the <a> tag
            print(title)
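
Both listings build a urllib Request from a URL, a paging parameter, and a User-Agent header. As a rough sketch, the request object can be inspected offline to check what would be sent. The helper names build_get_request and build_post_request are mine, and nothing here contacts the site.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = 'https://www.douban.com/doulist/3936288/'
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 '
              '(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')


def build_get_request(start):
    """Put the paging offset in the query string, so the request stays a GET."""
    query = urlencode({'start': start})
    return Request(BASE_URL + '?' + query, headers={'User-Agent': USER_AGENT})


def build_post_request(start):
    """Pass the same parameter as a body: urllib then sends a POST."""
    body = urlencode({'start': start}).encode('utf-8')
    return Request(BASE_URL, data=body, headers={'User-Agent': USER_AGENT})


get_req = build_get_request(25)
post_req = build_post_request(25)

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url)
# urllib stores header names capitalized, so look the header up as 'User-agent'
print(get_req.get_header('User-agent'))
```

Checking get_method() and full_url like this is a cheap way to confirm that the header is really attached and that the parameter ends up where the server expects it.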

scrapy + BeautifulSoup (regex or XPath also work)

Scrapy tutorial (in Chinese)

  • BeautifulSoup
html = response.text
soup = BeautifulSoup(html, 'html.parser')  # same idea as above

  • Regex
html = response.text
# re.S lets . match newlines, so one pattern can span several lines of markup
pattern = re.compile('<div.*?class="author.*?>.*?<a.*?</a>.*?<a.*?>(.*?)</a>.*?<div.*?class' +
                     '="content".*?title="(.*?)">(.*?)</div>(.*?)<div class="stats.*?class="number">(.*?)</i>', re.S)
items = re.findall(pattern, html)
for item in items:
    print(item)
  • XPath
# Douban Movie Top 250
import json

import scrapy


class DoubanDianYing(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    datalist = []

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        }
        # 10 pages of 25 movies each: start=0, 25, ..., 225
        for page in range(0, 250, 25):
            if page != 0:
                url = 'https://movie.douban.com/top250?start=%d&filter=' % page
            else:
                url = 'https://movie.douban.com/top250'
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        # Practicing XPath: each query returns one list covering the whole page
        titles = response.xpath('//div/a/span[1]/text()').extract()
        imgs = response.xpath('//div[@class="pic"]/a/img/@src').extract()
        links = response.xpath('//div[@class="hd"]/a/@href').extract()
        actors = response.xpath('//div[@class="bd"]/p/text()').extract()
        descs = response.xpath('//span[@class="inq"]/text()').extract()
        rating = response.xpath('//div[@class="star"]/span[@class="rating_num"]/text()').extract()
        hots = response.xpath('//div[@class="star"]/span[4]/text()').extract()

        # These parallel lists only line up if every movie on the page has every field
        for index in range(len(imgs)):
            dic = {}
            dic['title'] = titles[index]
            dic['img'] = imgs[index]
            dic['actors'] = str(actors[index]).strip().replace("\n", "")
            dic['rating'] = rating[index]
            dic['hot'] = hots[index]
            dic['link'] = links[index]
            dic['desc'] = descs[index]
            self.datalist.append(dic)
        # Write the file once all 250 movies have been collected
        if len(self.datalist) == 250:
            with open('dataSource.json', 'w', encoding='utf-8') as f:
                json.dump(self.datalist, f, ensure_ascii=False, indent=4)
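
To try the XPath idea without running a spider, here is a minimal sketch using the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree only understands a subset of XPath and needs well-formed markup, so this is a stand-in, not what Scrapy does internally. The sample markup and the example.com URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that loosely mimics a movie list.
# The second movie has no <span class="inq"> quote on purpose.
SAMPLE = '''
<ol>
  <li>
    <div class="pic"><a href="https://example.com/subject/1/"><img src="https://example.com/1.jpg"/></a></div>
    <div class="hd"><a href="https://example.com/subject/1/"><span>Example Movie One</span></a></div>
    <div class="star"><span class="rating_num">9.0</span></div>
    <span class="inq">A short quote</span>
  </li>
  <li>
    <div class="pic"><a href="https://example.com/subject/2/"><img src="https://example.com/2.jpg"/></a></div>
    <div class="hd"><a href="https://example.com/subject/2/"><span>Example Movie Two</span></a></div>
    <div class="star"><span class="rating_num">8.5</span></div>
  </li>
</ol>
'''.strip()

root = ET.fromstring(SAMPLE)

movies = []
for item in root.findall('./li'):
    # Query inside each <li>, so a missing field cannot shift the other fields
    movies.append({
        'title': item.find('.//div[@class="hd"]/a/span').text,
        'link': item.find('.//div[@class="hd"]/a').get('href'),
        'img': item.find('.//div[@class="pic"]/a/img').get('src'),
        'rating': item.find('.//div[@class="star"]/span[@class="rating_num"]').text,
        # findtext returns the default when the element is absent
        'desc': item.findtext('.//span[@class="inq"]', default=''),
    })

for movie in movies:
    print(movie)
```

Querying per item rather than collecting page-wide lists is a design choice: when one movie lacks a field, only that movie's entry is affected.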

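selenium + PhantomJS + python3

from selenium import webdriver
# PhantomJS support was deprecated in Selenium 3.x and removed in Selenium 4;
# on newer versions use a headless Chrome or Firefox driver instead
driver = webdriver.PhantomJS()
driver.get('URL')  # replace 'URL' with the address of the target page
html = driver.page_source  # parse it the same way as above

selenium also supports its own element-lookup syntax.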

Reference blogs

Cui Qingcai's blog: https://cuiqingcai.com/2577.html

Zhihu article on selenium: https://zhuanlan.zhihu.com/p/29435831

selenium blog post: https://cuiqingcai.com/2599.html

Using selenium + python + PhantomJS: http://www.cnblogs.com/jinxiao-pu/p/6677782.html#_label0

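Target site, i.e. the URL (required)

  • Know what you want to crawl before you start!

Proxy IPs (optional)

  • A quick search turns up plenty of free proxy IPs, but they are not very stable. They are good enough for learning; for commercial use, paying for stable ones is the better choice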
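
As a rough sketch of how a proxy can be plugged into a urllib-based crawler: pick one address from a pool and build an opener around it. The addresses in PROXY_POOL are placeholders, not working proxies, and the helper name build_proxy_opener is mine.

```python
import random
import urllib.request

# Placeholder addresses; replace them with proxies you actually have access to
PROXY_POOL = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
]


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    opener = urllib.request.build_opener(handler)
    return opener, handler


proxy_url = random.choice(PROXY_POOL)
opener, handler = build_proxy_opener(proxy_url)
print('Using proxy:', handler.proxies)

# opener.open('https://example.com', timeout=10) would send the request through
# the chosen proxy; it is left commented out because the pool is fake.
```

Since free proxies drop out often, choosing a new entry from the pool whenever a request fails is a common way to keep a learning project running.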

Using cookies (optional)

  • See the python study notes, which cover how to set a proxy IP, how to set headers, and how to read, write, and delete cookies

  • Managing cookies with selenium

# cookies
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
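
The same read, add, and delete cycle also works without a browser. Below is a minimal sketch, assuming a crawler built on the standard library instead of selenium, that stores a cookie in http.cookiejar, saves it to disk, loads it back, and clears it. The helper make_cookie and the file name cookies.txt are made up for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Build a session-style cookie for the given domain (illustrative values)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Write: add a cookie to the jar and save it to a file
jar = MozillaCookieJar()
jar.set_cookie(make_cookie('name', 'germey', 'www.zhihu.com'))
cookie_path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
# ignore_discard keeps session cookies, which would otherwise be skipped on save
jar.save(cookie_path, ignore_discard=True, ignore_expires=True)

# Read: load the file into a fresh jar
restored = MozillaCookieJar()
restored.load(cookie_path, ignore_discard=True, ignore_expires=True)
names = [cookie.name for cookie in restored]
values = [cookie.value for cookie in restored]
print(names, values)

# Delete: remove every cookie from the jar
restored.clear()
print(len(restored))
```

To send these cookies with real requests, the jar can be handed to urllib.request.HTTPCookieProcessor and passed to build_opener.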

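How do you handle AJAX and JS-rendered content?

  • AJAX is just a form submission or network request. Find the relevant element and JS function, drive it through selenium's JS actions, then take the result and parse it
  • JS-rendered content works the same way

# Execute JS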
from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')

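How do you get around anti-crawler mechanisms?

  • Getting past anti-crawler checks comes down to making the crawler behave like a real user: the request frequency and the reported hardware and browser information should look similar. For CAPTCHAs, pop-ups, and the like, the usual approach is to open the page the way a browser would and work through it step by step until you reach the page you want, then pull the data from it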
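
The two knobs mentioned above, request frequency and browser-like identity, can be sketched as follows. The delay values, the Accept-Language header, and the helper names are my assumptions, not tuned numbers; the actual fetch is left out so the sketch runs offline.

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
]


def polite_delay(base=2.0, jitter=1.5, rng=random):
    """Seconds to wait before the next request: a fixed base plus random jitter."""
    return base + rng.uniform(0, jitter)


def browser_like_headers(rng=random):
    """Headers that resemble a browser, with a User-Agent picked at random."""
    return {
        'User-Agent': rng.choice(USER_AGENTS),
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }


def crawl_plan(urls, rng=random):
    """Pair every URL with the headers to send and the pause to take before it."""
    return [(url, browser_like_headers(rng), polite_delay(rng=rng)) for url in urls]


pages = ['https://movie.douban.com/top250?start=%d&filter=' % n for n in (25, 50, 75)]
plan = crawl_plan(pages, rng=random.Random(0))
for url, headers, delay in plan:
    # A real crawler would call time.sleep(delay) here and then fetch url with headers
    print('%.2fs  %s' % (delay, url))
```

Random jitter matters because perfectly regular intervals are easy to tell apart from a person clicking through pages.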

The code from my practice sessions is hosted on GitHub.