requests请求模块详解
requests模块
概念
一个机遇网络请求的模块,作用就是用来模拟浏览器发起请求
asdsa
编码流程
– 指定url
– 进行请求的发送
– 获取相应数据(即,爬取到的数据)
– 持久化存储
** 环境的安装
** pip install requests
代码示例
1 | #导入request模块 |
易错点:
- 请求时,并不一定要传入数据,url是必须的,同时,要根据网站的要求进行对应的转码
- 当爬虫被反爬时,西药进行ua伪装
requests使用post请求:
- 使用post请求时,需要传入的参数基本上没有变化,唯一的区别就是params变成了data
- 除此以外和get请求基本一致,但是要注意,当使用axios请求时,需要将你要用到的参数放进去,如页码,每页最大数量等
- 由于axios返回的数据实际上是json格式的字符串,因此,在进行数据解析的时候,不能再使用text
- 要使用json来进行解析,然后才能进行操作
request概念
请求方法
- GET : 请求页面,并返回页面内容
- POST : 用于提交表单数据或上传文件,数据包含在请求体重
- PUT : 从客户端想服务器传送的数据取代指定文档中的内容
- DELETE : 请求服务器删除指定的页面
- HEAD : 类似于GET请求,只不过返回的响应中没有具体的内容,用于获取报头
- CONNECT : 把服务器当成跳板,让服务器代替客户端访问其他网页
- OPTIONS : 允许客户端查看服务器的性能
- TRACE : 会先服务器收到的请求,用于测试或诊断
GET POST 区别:
- GET请求包含在URL里面,明文传输,安全性低
- POST请求数据封装到了请求体中,安全性较高
关于访问请求的方法:
- text
- 查看返回的TEXT/HTML格式代码
- json
- 将获取到的json字符串转化为dict格式
- content
- 将获取到的文件流进行转化存储
UA伪装:
关于UA伪装:
- 模拟headers
- 用以模拟浏览器对网页url进行访问,以此来避开页面的反爬机制
- user_agent 存储用户使用的系统版本号与浏览器信息等数据,用来模拟浏览器访问
request
- get/post
- url
- data/params (对请求参数进行封装)
- headers (UA伪装)
- 动态加载
- ajax
- js
- 鉴定是否具备动态加载
- 局部搜索
- 全局搜索
re模块匹配:
使用re模块对获取到的页面进行匹配,用到的比较少,但也是个重点
1 | import re |
requests高阶应用:
文件上传功能
```python
import requests
url = ‘http://www.httpbin.org/'
Headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
f = open(‘./baidu.html’,’rb’)
files = {'file': f
}
response = requests.get(url=url,headers=Headers,files=files)
if response:print('上传成功')
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
- 文件上传功能
2. 模拟cookie
1. 手动添加cookie
- 在headers中添加cookie属性
- ```python
Headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
'Cookie': "BAIDUID=97DA851F88C92BE45D7963ADA75C3434:FG=1; BIDUPSID=97DA851F88C92BE45D7963ADA75C3434; PSTM=1567649375; BD_UPN=12314753; BDUSS=3ZzTkJOTHRNVkpqVGR4cE9BaWlOVlJUcX5JbHVXMWEwWVQ3V1pmbUJEbVQxQmRlRUFBQUFBJCQAAAAAAAAAAAEAAABbuGacAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJNH8F2TR~Bdb; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; BDSFRCVID=cvFOJeC62CrXf4cwU0e2tW9rYM4hm03TH6aIl2P5JFJeo_dcX41gEG0PDf8g0KubMVkPogKKBeOTHg_F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; H_BDCLCKID_SF=JRA8oKPhtCvhDRTvhCcjh-FSMgTBKI62aKDs2ROYBhcqEIL4W6-aKPIp5nO8W4KL-6nH2IOK5DoEHxbSj4QzQU0zDPvl0RQuWI3bhqvztp5nhMJFXj7JDMP0qJ7j2RQy523ion6vQpn-KqQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0j63-DaAJt6njJTRe0Rj-KbP_hDL9eKTjhPrMjHrJWMT-0bFH_JR2QJP5fUbJQpPVKJ3-MMc9BMvuJan7_JjYbRDVMfQhWhJChtLeyU6r3fQxtNRB-CnjtpvhKJjjWtcobUPUDMc9LUvqHmcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLtD05MDIwjTt3-4LtMxrXKK6QKCjj3toObnnqq6rkbJ83yhFzXP6-35KHf6rMKR_bbhrBj5CzKJOb5MrXeJb-5h37JD6yb-n5Lb8KDRPGehrAMJFYQ4oxJpO7BRbMopvaKquVoJQvbURvD-ug3-7qex5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIEoD-2tKD5MIDr5nJbq4I85M5H54cX--o2WbCQaM7O8pcNLTDKLnLNjb72-MC8BNrJ-KTua-PMjbCxjqO1j4_PMa8OKhFeLHIebbQJyIb-hl5jDh3Ub6ksD-Rt5frp2aRy0hvc0J5cShnkDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDG8JJ5_eJbCsL-35HJcqfbo4-tr_KICShUFs5b5lB2Q-5KL-0-oieCbn5p_2bjD33NrHXxuJ2IDjoMbdJJjzDKoMjf6Py4LeBUO2XT0D52TxoUJg5DnJhhvG-xc4Mp8ebPRi3tQ9QgbMMhQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0HPonHjLBDTb33H; H_PS_PSSID=1460_21110_30210_30284_22160; yjs_js_security_passport=3b54287a45b25629d2f7cfdb10008d7e9d67b5bd_1576637318_js; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; delPer=0; BD_CK_SAM=1; PSINO=2; BD_HOME=1; H_PS_645EC=b428Vxu0iZX7iP6ukpImwh5YVC5tlT7Iwn5ZIzZa5ijoKNVsMoi%2FCv%2FtNME"
}如上图所示
```python
import requestsurl = ‘https://www.baidu.com/s'
创建requests.cookies.RequestsCookiesJar对象
jar = requests.cookies.RequestsCookieJar()
Cookie = “BAIDUID=97DA851F88C92BE45D7963ADA75C3434:FG=1; “ \"BIDUPSID=97DA851F88C92BE45D7963ADA75C3434; " \ "PSTM=1567649375; BD_UPN=12314753; " \ "BDUSS=3ZzTkJOTHRNVkpqVGR4cE9BaWlOVlJUcX5JbHVXMWEwWVQ3V1pmbUJEbVQxQmRlRUFBQUFBJCQAAAAAAAAAAAEAAABbuGacAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJNH8F2TR~Bdb; " \ "BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; " \ "BDSFRCVID=cvFOJeC62CrXf4cwU0e2tW9rYM4hm03TH6aIl2P5JFJeo_dcX41gEG0PDf8g0KubMVkPogKKBeOTHg_F_2uxOjjg8UtVJeC6EG0Ptf8g0M5; " \ "H_BDCLCKID_SF=JRA8oKPhtCvhDRTvhCcjh-FSMgTBKI62aKDs2ROYBhcqEIL4W6-aKPIp5nO8W4KL-6nH2IOK5DoEHxbSj4QzQU0zDPvl0RQuWI3bhqvztp5nhMJFXj7JDMP0qJ7j2RQy523ion6vQpn-KqQ3DRoWXPIqbN7P-p5Z5mAqKl0MLPbtbb0xXj_0j63-DaAJt6njJTRe0Rj-KbP_hDL9eKTjhPrMjHrJWMT-0bFH_JR2QJP5fUbJQpPVKJ3-MMc9BMvuJan7_JjYbRDVMfQhWhJChtLeyU6r3fQxtNRB-CnjtpvhKJjjWtcobUPUDMc9LUvqHmcdot5yBbc8eIna5hjkbfJBQttjQn3hfIkj2CKLtD05MDIwjTt3-4LtMxrXKK6QKCjj3toObnnqq6rkbJ83yhFzXP6-35KHf6rMKR_bbhrBj5CzKJOb5MrXeJb-5h37JD6yb-n5Lb8KDRPGehrAMJFYQ4oxJpO7BRbMopvaKquVoJQvbURvD-ug3-7qex5dtjTO2bc_5KnlfMQ_bf--QfbQ0hOhqP-jBRIEoD-2tKD5MIDr5nJbq4I85M5H54cX--o2WbCQaM7O8pcNLTDKLnLNjb72-MC8BNrJ-KTua-PMjbCxjqO1j4_PMa8OKhFeLHIebbQJyIb-hl5jDh3Ub6ksD-Rt5frp2aRy0hvc0J5cShnkDMjrDRLbXU6BK5vPbNcZ0l8K3l02V-bIe-t2XjQhDG8JJ5_eJbCsL-35HJcqfbo4-tr_KICShUFs5b5lB2Q-5KL-0-oieCbn5p_2bjD33NrHXxuJ2IDjoMbdJJjzDKoMjf6Py4LeBUO2XT0D52TxoUJg5DnJhhvG-xc4Mp8ebPRi3tQ9QgbMMhQ7tt5W8ncFbT7l5hKpbt-q0x-jLTnhVn0MBCK0HPonHjLBDTb33H; " \ "H_PS_PSSID=1460_21110_30210_30284_22160; " \ "yjs_js_security_passport=3b54287a45b25629d2f7cfdb10008d7e9d67b5bd_1576637318_js; " \ "BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; " \ "delPer=0; " \ "BD_CK_SAM=1; " \ "PSINO=2; " \ "BD_HOME=1; " \ "H_PS_645EC=b428Vxu0iZX7iP6ukpImwh5YVC5tlT7Iwn5ZIzZa5ijoKNVsMoi%2FCv%2FtNME"
将准备好的cookie进行处理,并添加到jar对象中去
for i in Cookie.split(‘;’):key,value = i.split('=',1) jar.set(key,value)
Headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
params = {
"wd" : '666'
}
发送请求时添加cookies参数
response = requests.get(url=url,params=params,headers=Headers,cookies=jar)
response.encoding = ‘utf8’text = response.text
print(text)1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
- 上述为第二种cookies的使用方法
2. 全自动添加cookie
- ```python
from requests import Session
# 构建全安心的Session实例
session = Session()
url = 'https://www.baidu.com/s'
Headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
}
params = {
"wd" : '666'
}
response = session.get(url=url,headers=Headers,params=params)
response.encoding = 'utf8'
text = response.text使用session的时候,将会全自动生成cookies
同时,相比较requests访问,session访问将会存储状态,
requests访问时无状态访问,session访问将会进行状态保持,维持会话
省略了大量的cookie代码,相比较来说,更加方便
```python
requests访问的时候,每次都需要传入cookie
如果需要访问的网站时需要维持登陆状态的话,使用resquests将会很不方便,因为没有状态保持
session可以维持用户的登陆状态
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
## SSL证书验证:
```python
import requests
import urllib3
url = 'https://www.tutumanhua.com/gaoxiao/'
# 关闭网页证书报错
urllib3.disable_warnings()
res = requests.get(url=url, verify=False)
# verify=False
# 关闭SSL证书验证
Request,Session:
1 | from requests import Session,Request |
Urllib3:
- request