Python Data Analysis and Visualization Examples: Asynchronous Programming with asyncio

Published: 2021-12-03 (public article)

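I recently needed to scrape data (MP3 files) from a target website that does not rate-limit requests. I tried multithreading and multiprocessing, but neither was satisfactory, so I finally brought out the big gun: asyncio.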

Library to install (asyncio has been part of the Python standard library since Python 3.4, so only aiohttp needs to be installed):

pip install aiohttp -i https://pypi.tuna.tsinghua.edu.cn/simple

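During this process, if your concurrency reaches about 2000, the program fails with: ValueError: too many file descriptors in select(). Read literally, the error says that the select call used by Python caps the number of open file descriptors. This is really an operating-system limit: on Linux the default maximum number of open files is 1024, and on Windows the default is 509. Once that number is exceeded, the program starts raising errors. My recommendation is to cap the concurrency at somewhere between 100 and 500, which also gives faster processing.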
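
To see the descriptor limit mentioned above on your own machine, a small sketch like the following may help. It assumes a Unix-like system, where the standard-library `resource` module is available; the helper name `open_file_limit` is made up for illustration, and on Windows it simply returns None.

```python
def open_file_limit():
    """Return (soft, hard) limits on open file descriptors, or None if unavailable."""
    try:
        import resource  # Unix-only standard-library module
    except ImportError:
        return None  # e.g. on Windows
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == '__main__':
    print(open_file_limit())
```

The soft limit is the one a process actually runs into, so keeping the concurrency cap well below it leaves room for the other files the program has open.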

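import asyncio
import aiohttp

# Target site
url = 'https://www.baidu.com/'

# Fetch one page; the semaphore caps how many requests are in flight
async def hello(session, url, semaphore):
    async with semaphore:
        async with session.get(url) as response:
            return await response.read()

# Build the tasks and run them
async def run():
    semaphore = asyncio.Semaphore(500)  # limit concurrency to 500
    # One shared session reuses connections across all requests
    async with aiohttp.ClientSession() as session:
        to_get = [hello(session, url, semaphore) for _ in range(1000)]  # 1000 tasks in total
        # gather() accepts coroutines directly; asyncio.wait() no longer does (Python 3.11+)
        await asyncio.gather(*to_get)


if __name__ == '__main__':
    # asyncio.run() creates the event loop, runs run(), and closes the loop
    asyncio.run(run())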
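
To check that a semaphore really caps the number of requests in flight, here is a minimal offline sketch. It replaces the HTTP request with `asyncio.sleep`, and the names `fake_fetch`, `measure_peak`, and the `state` dictionary are assumptions introduced for illustration, not part of the scraper above.

```python
import asyncio


async def fake_fetch(semaphore, state):
    # Stand-in for an HTTP request: wait for a slot, then "work" briefly
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)
        state['active'] -= 1


async def measure_peak(total, limit):
    # Run `total` fake requests with at most `limit` in flight at once
    semaphore = asyncio.Semaphore(limit)
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(fake_fetch(semaphore, state) for _ in range(total)))
    return state['peak']


if __name__ == '__main__':
    print(asyncio.run(measure_peak(total=50, limit=5)))
```

All 50 coroutines are created up front, but the recorded peak never rises above the semaphore's limit, which is the property the scraper relies on to stay under the descriptor limit.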

 

'''
# Variant: POST with custom headers, cookies, and a timeout
# (a fragment; it belongs inside an async function)
async with aiohttp.ClientSession(headers=headers, cookies=cookies) as session:
    result_text = None
    try:
        result = await session.post(url, timeout=timeout, data=data)
        result_text = await result.text()
    except Exception:
        raise
    return result_text
'''
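
The `timeout` argument in the fragment above bounds how long a request may take. As a rough, library-independent sketch of the same idea, `asyncio.wait_for` can wrap any coroutine. Here `slow_request` and `fetch_or_none` are hypothetical names, and the sleep stands in for a real aiohttp call.

```python
import asyncio


async def slow_request(delay):
    # Stand-in for a network call that takes `delay` seconds
    await asyncio.sleep(delay)
    return 'ok'


async def fetch_or_none(delay, timeout):
    # Return the result, or None if it does not arrive within `timeout` seconds
    try:
        return await asyncio.wait_for(slow_request(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return None


if __name__ == '__main__':
    print(asyncio.run(fetch_or_none(delay=0.01, timeout=1)))
    print(asyncio.run(fetch_or_none(delay=1, timeout=0.01)))
```

Returning None on a timeout, rather than re-raising, lets a large batch keep going when a few requests stall; whether that is appropriate depends on how the results are used.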

 

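Requests made with aiohttp look much the same as those made with requests. The tasks list has a fixed length, and the first page and the subsequent pages are crawled with separate rules.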
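
As an illustration of a fixed-length task list in which the first page and the later pages follow different URL rules, consider the sketch below. The URL patterns (`index.html`, `index_N.html`), the `example.com` base, and the function name `page_urls` are all hypothetical; substitute the target site's real rules.

```python
def page_urls(base, total_pages):
    """Build one URL per page: page 1 uses one rule, later pages another."""
    urls = []
    for page in range(1, total_pages + 1):
        if page == 1:
            urls.append(f'{base}/index.html')         # rule for the first page
        else:
            urls.append(f'{base}/index_{page}.html')  # rule for subsequent pages
    return urls


if __name__ == '__main__':
    for u in page_urls('https://example.com/list', 3):
        print(u)
```

Because the number of pages is known in advance, the list of URLs (and therefore the list of tasks) has a fixed length before any request is sent.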

 

 

import time
import asyncio
import aiohttp

# Start time
start_time = time.time()

# URLs to fetch
urls = [
    'https://www.baidu.com',
    'https://www.sogou.com',
    'https://www.csdn.net/'
]


async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # text() returns a str, read() returns bytes, json() returns the parsed JSON object
            # Reading the response body is a coroutine, so it must be awaited
            page_text = await response.text()
            # print(page_text)
            print(url)


async def main():
    # Wrap each coroutine in a Task so that they all run concurrently
    tasks = [asyncio.ensure_future(get_page(url)) for url in urls]
    # Wait until every task has finished
    await asyncio.wait(tasks)


# asyncio.run() creates the event loop, runs main(), and closes the loop
asyncio.run(main())

# Elapsed time
end_time = time.time()
print(end_time - start_time)
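
To see why the concurrent version finishes in roughly the time of the slowest request rather than the sum of all of them, here is an offline sketch that uses `asyncio.sleep` in place of real downloads. The delays and the names `fake_page`, `sequential`, `concurrent`, and `timed` are assumptions for illustration only.

```python
import asyncio
import time


async def fake_page(delay):
    # Stand-in for downloading one page
    await asyncio.sleep(delay)
    return delay


async def sequential(delays):
    # Await each "download" one after another
    return [await fake_page(d) for d in delays]


async def concurrent(delays):
    # Start all "downloads" together and wait for every one of them
    return await asyncio.gather(*(fake_page(d) for d in delays))


def timed(coro):
    # Run a coroutine to completion and return (result, elapsed seconds)
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    delays = [0.1, 0.2, 0.3]
    print(timed(sequential(delays))[1])
    print(timed(concurrent(delays))[1])
```

With real network requests the gap depends on the server and the connection, but the shape is the same: the waits overlap instead of adding up.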