HTTP and HTTPS are both used to transfer text data; HTTPS adds encryption on top of HTTP, so it is more secure.
HTTP
Common request headers
- User-Agent: identifies the client that is making the request
- Connection: whether to close the connection or keep it alive after the request finishes
To inspect them: browser -> Developer Tools -> Network
Common response headers
- Content-Type: the type of data the server sends back to the client
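Both sets of headers can also be inspected from code. Here is a minimal sketch using the requests module introduced below (httpbin.org is just an assumed test endpoint):

```python
import requests

# Send a simple GET request and look at the headers on both sides
response = requests.get('https://httpbin.org/get')

# Request headers that were sent (requests fills in defaults for these)
print(response.request.headers.get('User-Agent'))   # e.g. python-requests/2.x
print(response.request.headers.get('Connection'))   # keep-alive by default

# Response headers the server sent back
print(response.headers.get('Content-Type'))         # e.g. application/json
```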
HTTPS
Secure Hypertext Transfer Protocol: HTTP with an encryption layer (TLS/SSL) underneath.
Encryption methods
The requests module
Overview: a powerful, simple, and efficient third-party Python module for sending HTTP requests; it is commonly used in place of the built-in urllib module.
Purpose: simulate a browser sending requests.
Installation: pip install requests
Usage:
- Specify the URL
- Send the request
- Get the response data
- Persist the data (save it)
Practical example 1: fetch the Baidu homepage data
| """ @Time : 2022/12/15 15:39 @Author : daokunn @File :爬取百度首页.py @IDE :PyCharm @Motto: Don’t cry over spilt milk. 功能: 爬取百度首页页面数据 """ import requests if __name__ == "__main__": url = 'https://baidu.com' response = requests.get(url=url) page_text = response.text print(page_text) with open('./baidu.html','w',encoding = 'utf-8') as fp: fp.write(page_text) print("爬取数据结束!")
|
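Besides .text, the response object carries several other useful attributes; here is a short sketch (reusing the Baidu URL above) of the ones that come up most often:

```python
import requests

response = requests.get('https://baidu.com')

print(response.status_code)                   # HTTP status code, e.g. 200
print(response.encoding)                      # encoding guessed from the response headers
print(response.headers.get('Content-Type'))   # the response header described earlier

page_text = response.text      # body decoded to str using response.encoding
raw_bytes = response.content   # raw body as bytes (used later for image downloads)
```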
Practical example 2: fetch the Baidu search results page for a given keyword (a simple web page collector)
Add UA (User-Agent) spoofing so the request looks like it comes from a normal browser.
| """ @Time : 2022/12/15 16:14 @Author : daokunn @File :简易网页采集器.py @IDE :PyCharm @Motto: Don’t cry over spilt milk. """ import requests if __name__ == '__main__': headers = { 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46' } url = 'http ://www.baidu.com/s?' kw = input('输入关键字:') param = { 'wd':kw } response = requests.get(url=url,params=param,headers=headers) page_text = response.text print(page_text) fileName = kw + '.html' with open(fileName,'w',encoding='utf-8') as fp: fp.write(page_text) print(fileName,"保存成功!")
|
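The params dict is percent-encoded and appended to the URL by requests itself, so the keyword never needs to be encoded by hand. A small sketch to confirm what is actually sent (the keyword is just an example, and the printed URL assumes no redirect happened):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}   # shortened UA, for illustration only
param = {'wd': '爬虫'}
response = requests.get('https://www.baidu.com/s?', params=param, headers=headers)

# requests builds and percent-encodes the final URL for you
print(response.url)   # e.g. https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
```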
Practical example 3: crack Baidu Translate
| """ @Time : 2022/12/15 19:58 @Author : daokunn @File :百度翻译破解.py @IDE :PyCharm @Motto: Don’t cry over spilt milk. """
import json
import requests if __name__ == '__main__': post_url = 'https://fanyi.baidu.com/sug' headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46' }
word = input('输入单词:') data = { 'kw':word } response = requests.post(url=post_url,data=data,headers=headers)
dic_obj = response.json()
fileName = word + '.json' fp = open(fileName, 'w', encoding='utf-8') json.dump(dic_obj,fp=fp,ensure_ascii=False) print(dic_obj)
|
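Two details worth noting about this POST: the kw payload is sent as form data (that is what the data= argument does; requests also has a json= argument that would send a JSON body instead), and response.json() is only safe because the sug endpoint answers with JSON, which can be confirmed from its Content-Type header. A quick sketch:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}   # shortened UA, for illustration only
response = requests.post('https://fanyi.baidu.com/sug',
                         data={'kw': 'dog'}, headers=headers)

content_type = response.headers.get('Content-Type', '')
print(content_type)            # expected to contain 'application/json'
if 'json' in content_type:
    print(response.json())     # a dict, ready to use without manual parsing
```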
Single-page crawl (wallhaven wallpapers)
| """ @Time : 2022/12/16 21:57 @Author : daokunn @File :壁纸爬取.py @IDE :PyCharm @Motto: Don’t cry over spilt milk. """ import os.path
import requests import re if __name__ == '__main__': if not os.path.exists('./wallhaven'): os.makedirs('./wallhaven')
url = 'https://wallhaven.cc/toplist?page=1' headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46' }
page_text = requests.get(url=url,headers=headers).text
ex = '"https://wallhaven.cc/w/.*?"'
img_src_list = re.findall(ex,page_text,re.S) print(img_src_list) for src in img_src_list: src = 'https://w.wallhaven.cc/full/' +src[24:26]+'/wallhaven-'+ src[24:-1] + '.jpg' try: test = requests.get(url=src,headers=headers) test.raise_for_status() except: src = src[:-3] + 'png' img_data =requests.get(url=src,headers=headers).content img_name = src.split('/')[-1] img_path = './wallhaven/'+ img_name with open(img_path,'wb') as fp: fp.write(img_data) print(img_name,"下载成功")
|
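The string slicing above is compact but easy to misread: each regex match still carries its surrounding quotes, src[24:-1] is the wallpaper id, and the full-size image sits in a folder named after the first two characters of that id. This layout is an observation from the page source, not a documented API; the same construction written as a small helper for readability:

```python
def full_image_url(detail_link: str, ext: str = 'jpg') -> str:
    """Build the full-size image URL from a matched detail link.

    detail_link looks like '"https://wallhaven.cc/w/abc123"' (quotes included,
    because the regex above matches them).
    """
    wallpaper_id = detail_link[24:-1]
    return f'https://w.wallhaven.cc/full/{wallpaper_id[:2]}/wallhaven-{wallpaper_id}.{ext}'

# Example:
# full_image_url('"https://wallhaven.cc/w/abc123"')
# -> 'https://w.wallhaven.cc/full/ab/wallhaven-abc123.jpg'
```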
Paginated crawl
| """ @Time : 2022/12/17 19:34 @Author : daokunn @File :分页爬取.py @IDE :PyCharm @Motto: Don’t cry over spilt milk. """
import os.path
import requests import re if __name__ == '__main__':
if not os.path.exists('./wallhaven'): os.makedirs('./wallhaven') headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.46' } url = 'https://wallhaven.cc/toplist?page=%d' page_num = 1 num = int(input('你要爬取到第几页:')) for page_num in range(1,num): new_url = format(url%page_num)
page_text = requests.get(url=new_url,headers=headers).text
ex = '"https://wallhaven.cc/w/.*?"'
img_src_list = re.findall(ex,page_text,re.S) print(img_src_list) for src in img_src_list: src = 'https://w.wallhaven.cc/full/' +src[24:26]+'/wallhaven-'+ src[24:-1] + '.jpg' try: test = requests.get(url=src,headers=headers) test.raise_for_status() except: src = src[:-3] + 'png' img_data =requests.get(url=src,headers=headers).content img_name = src.split('/')[-1] img_path = './wallhaven/'+ img_name with open(img_path,'wb') as fp: fp.write(img_data) print(img_name,"下载成功")
|
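Since a paginated crawl fires many requests against the same host, it can help to reuse one connection with requests.Session instead of calling requests.get each time; the Connection: keep-alive header mentioned at the top is what makes that possible. A minimal sketch, not part of the original script:

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'
}

# One Session keeps the underlying connection alive and re-sends the same
# headers on every request made through it.
with requests.Session() as session:
    session.headers.update(headers)
    for page_num in range(1, 3):   # just two pages, for illustration
        page_text = session.get('https://wallhaven.cc/toplist?page=%d' % page_num).text
        print(page_num, len(page_text))   # show that each page was fetched
```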