
Both HTTP and HTTPS are protocols for transferring text data; HTTPS adds an encryption layer on top of HTTP, which makes it more secure.

HTTP

Common request headers

  1. User-Agent: identifies the client ("request carrier") sending the request
  2. Connection: whether to close the connection or keep it alive after the request completes

Where to see them: browser -> developer tools -> Network tab

Common response headers

  1. Content-Type: the type of data the server returns to the client
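
Besides the browser's developer tools, you can also inspect both sides of this exchange from Python. A minimal sketch using the requests library covered later in this post (the URL is just an illustration):

import requests

response = requests.get('https://www.baidu.com')
print(response.request.headers)   # request headers actually sent (includes User-Agent)
print(response.headers)           # response headers returned by the server
print(response.headers.get('Content-Type'))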

HTTPS

The secure version of the Hypertext Transfer Protocol

Encryption schemes

  • Symmetric-key encryption
  • Asymmetric-key encryption
  • Certificate-based key encryption
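
In practice, the requests library introduced below handles the certificate side of HTTPS for you: server certificates are verified by default, and the behavior can be adjusted via the verify parameter. A minimal sketch (the CA bundle path is a placeholder, not a real file):

import requests

response = requests.get('https://example.com')                             # verify=True is the default
response = requests.get('https://example.com', verify='/path/to/ca.pem')   # hypothetical custom CA bundle
response = requests.get('https://example.com', verify=False)               # skip verification (not recommended)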

The requests module

Overview: a third-party Python library that is powerful, simple to use, and very efficient; it is commonly used in place of the standard-library urllib module.

Purpose: simulate a browser sending requests

Installation: pip install requests

Usage workflow:

- Specify the URL
- Send the request
- Get the response data
- Persist the data

Hands-on example 1: fetch the Baidu homepage data

# -*- coding: utf-8 -*-
"""
@Time : 2022/12/15 15:39
@Author : daokunn
@File : 爬取百度首页.py
@IDE : PyCharm
@Motto: Don’t cry over spilt milk.
Purpose: fetch the Baidu homepage data
"""
import requests

if __name__ == "__main__":
    # 1. Specify the URL
    url = 'https://baidu.com'
    # 2. Send the request
    response = requests.get(url=url)
    # 3. Get the response data
    page_text = response.text
    print(page_text)
    # 4. Persist the data
    with open('./baidu.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Finished crawling!")

Hands-on example 2: scrape the Baidu search results page for a user-specified keyword (a simple web page collector)


Add User-Agent (UA) spoofing so the request looks like it comes from a normal browser.

# -*- coding: utf-8 -*-
"""
@Time : 2022/12/15 16:14
@Author : daokunn
@File : 简易网页采集器.py
@IDE : PyCharm
@Motto: Don’t cry over spilt milk.
"""
import requests

if __name__ == '__main__':
    # UA spoofing: wrap the headers in a dict
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46'
    }
    url = 'http://www.baidu.com/s?'
    # Wrap the query parameters carried by the URL in a dict
    kw = input('Enter a keyword: ')
    param = {
        'wd': kw
    }
    # Let requests attach the parameters to the request
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    print(page_text)
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(fileName, "saved successfully!")
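
The params dict is URL-encoded and appended to the base URL by requests itself; if you want to confirm what was actually requested, the final URL is available on the response object. A short follow-up sketch reusing the variables above:

response = requests.get(url=url, params=param, headers=headers)
print(response.url)  # e.g. http://www.baidu.com/s?wd=... with the keyword percent-encoded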

Hands-on example 3: crack the Baidu Translate interface

# -*- coding: utf-8 -*-
"""
@Time : 2022/12/15 19:58
@Author : daokunn
@File : 百度翻译破解.py
@IDE : PyCharm
@Motto: Don’t cry over spilt milk.
"""

# Baidu Translate sends a POST request
# The response body is JSON
import json

import requests

if __name__ == '__main__':
    # 1. Specify the URL
    post_url = 'https://fanyi.baidu.com/sug'
    # 2. UA spoofing
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46'
    }

    # 3. Prepare the POST parameters
    word = input('Enter a word: ')
    data = {
        'kw': word
    }
    # 4. Send the request
    response = requests.post(url=post_url, data=data, headers=headers)

    # 5. Get the response data (only call .json() when the response really is JSON)
    dic_obj = response.json()

    # 6. Persist the data
    fileName = word + '.json'
    with open(fileName, 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)  # Chinese characters cannot be ASCII-encoded
    print(dic_obj)
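
As the comment in step 5 notes, response.json() only works when the server really returns JSON. A defensive variant of that step could check the Content-Type header first; this is a sketch, not part of the original script:

content_type = response.headers.get('Content-Type', '')
if 'application/json' in content_type:
    dic_obj = response.json()
else:
    raise ValueError('unexpected response type: ' + content_type)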

Scraping a single page

# -*- coding: utf-8 -*-
"""
@Time : 2022/12/16 21:57
@Author : daokunn
@File : 壁纸爬取.py
@IDE : PyCharm
@Motto: Don’t cry over spilt milk.
"""
import os.path
import re

import requests

if __name__ == '__main__':
    # Create the output directory
    if not os.path.exists('./wallhaven'):
        os.makedirs('./wallhaven')

    url = 'https://wallhaven.cc/toplist?page=1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46'
    }

    page_text = requests.get(url=url, headers=headers).text
    # print(page_text)  # for debugging

    # Observed URL pattern in the page source:
    #   detail page: https://wallhaven.cc/w/vqg28m
    #   full image:  https://w.wallhaven.cc/full/vq/wallhaven-vqg28m.png
    ex = '"https://wallhaven.cc/w/.*?"'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)

    for src in img_src_list:
        # Rebuild the full-image URL from the detail-page URL (try .jpg first)
        src = 'https://w.wallhaven.cc/full/' + src[24:26] + '/wallhaven-' + src[24:-1] + '.jpg'
        try:
            test = requests.get(url=src, headers=headers)
            test.raise_for_status()
        except requests.RequestException:
            # If the .jpg does not exist, fall back to .png
            src = src[:-3] + 'png'
        # Request the binary image data
        img_data = requests.get(url=src, headers=headers).content
        # Build the file name
        img_name = src.split('/')[-1]
        # Build the file path
        img_path = './wallhaven/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, "downloaded successfully")
        # print(src)  # for debugging
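
The fixed-offset slicing (src[24:26], src[24:-1]) only works because every match starts with the opening quote plus the 23-character https://wallhaven.cc/w/ prefix. A slightly more readable alternative, as a sketch assuming the same page layout and reusing page_text from the script above, is to capture the wallpaper id directly:

import re

# capture just the id, e.g. 'vqg28m', instead of slicing by fixed offsets
ex = r'"https://wallhaven\.cc/w/(.*?)"'
for wall_id in re.findall(ex, page_text, re.S):
    # the first two characters of the id name the directory on the image host
    full_url = 'https://w.wallhaven.cc/full/' + wall_id[:2] + '/wallhaven-' + wall_id + '.jpg'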

Paginated scraping

# -*- coding: utf-8 -*-
"""
@Time : 2022/12/17 19:34
@Author : daokunn
@File : 分页爬取.py
@IDE : PyCharm
@Motto: Don’t cry over spilt milk.
"""

import os.path
import re

import requests

if __name__ == '__main__':
    if not os.path.exists('./wallhaven'):
        os.makedirs('./wallhaven')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46'
    }
    # Generic URL template
    url = 'https://wallhaven.cc/toplist?page=%d'
    num = int(input('Crawl up to which page? '))
    for page_num in range(1, num + 1):  # include the last requested page
        # URL for the current page
        new_url = url % page_num

        page_text = requests.get(url=new_url, headers=headers).text

        ex = '"https://wallhaven.cc/w/.*?"'
        img_src_list = re.findall(ex, page_text, re.S)
        print(img_src_list)
        for src in img_src_list:
            # Rebuild the full-image URL from the detail-page URL (try .jpg first)
            src = 'https://w.wallhaven.cc/full/' + src[24:26] + '/wallhaven-' + src[24:-1] + '.jpg'
            try:
                test = requests.get(url=src, headers=headers)
                test.raise_for_status()
            except requests.RequestException:
                # If the .jpg does not exist, fall back to .png
                src = src[:-3] + 'png'
            # Request the binary image data
            img_data = requests.get(url=src, headers=headers).content
            # Build the file name
            img_name = src.split('/')[-1]
            # Build the file path
            img_path = './wallhaven/' + img_name
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
            print(img_name, "downloaded successfully")
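
When looping over many pages like this, it can be worth pausing briefly between page requests so the crawler does not hammer the site. A small optional addition to the outer loop above, not part of the original script:

import time

for page_num in range(1, num + 1):
    # ... fetch and save the page as above ...
    time.sleep(1)  # wait one second between toplist pages (arbitrary value, adjust as needed)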