Batch-Crawling Novels with Python Multiprocessing - HelloWorld开发者社区

This post uses Python multiprocessing to run the same code in several processes at once.

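Multithreading in Python is not truly parallel: in the standard CPython build, the global interpreter lock (GIL) lets only one thread execute Python bytecode at a time. To make full use of a multi-core CPU, most cases call for multiple processes instead. Python ships a very handy package for this, multiprocessing: you define a function, and the package takes care of creating and running the process, so moving from a single process to concurrent execution takes little effort. multiprocessing supports child processes, communication and data sharing between them, and several forms of synchronization, through components such as Process, Queue, Pipe, and Lock.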
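As a rough illustration of "define a function and let the package run it", here is a minimal sketch. The function names square and report are made up for this example; only Process, start() and join() come from multiprocessing.

```python
from multiprocessing import Process
import os


def square(n):
    """Stand-in for real work (hypothetical example task)."""
    return n * n


def report(n):
    # Runs inside the child process, so os.getpid() differs from the parent's PID.
    print(f"pid={os.getpid()} square({n})={square(n)}")


if __name__ == "__main__":
    # One child process per input; all of them run concurrently.
    procs = [Process(target=report, args=(n,)) for n in (1, 2, 3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for every child to finish
```

The `if __name__ == "__main__":` guard matters on platforms that start children by re-importing the main module; without it, each child would try to start children of its own.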

1. Process

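The class that creates a process: Process([group [, target [, name [, args [, kwargs]]]]]). target is the callable to run, args is the tuple of positional arguments passed to it, kwargs is the dictionary of keyword arguments passed to it, name is an alias for the process, and group is effectively unused.
Methods: is_alive(), join([timeout]), run(), start(), terminate(). A Process is launched by calling start().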
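The constructor arguments described above can be sketched like this. The target function greet and the process name greeter-1 are hypothetical, chosen only to show where each argument goes.

```python
from multiprocessing import Process


def greet(greeting, who, punctuation="!"):
    """Hypothetical target: builds and prints a message in the child."""
    message = f"{greeting}, {who}{punctuation}"
    print(message)
    return message


if __name__ == "__main__":
    p = Process(
        target=greet,                 # the callable to run in the child
        name="greeter-1",             # alias, useful in logs
        args=("Hello", "world"),      # positional arguments, as a tuple
        kwargs={"punctuation": "?"},  # keyword arguments, as a dict
    )
    print(p.name)
    p.start()   # launch the child process
    p.join()    # wait for it to exit
```

Note that the return value of the target is not passed back to the parent; to get results out of a child you need something like a Queue or a Pipe.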

is_alive(): reports whether the process is still alive.

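join([timeout]): blocks the calling (parent) process until the child process exits, or until timeout seconds have passed if a timeout is given. It can only be called on a process that has been started. (The rule that join must come after close or terminate applies to Pool.join(), not to Process.join().)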

run(): when start() is called on a process p, run() is invoked automatically in the child process.
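To see how is_alive(), join() and run() fit together, here is a small sketch. The Sleeper subclass and the lifecycle helper are hypothetical names for this example, and the sleep durations are arbitrary.

```python
from multiprocessing import Process
import time


class Sleeper(Process):
    """Hypothetical Process subclass; start() causes run() to execute in the child."""

    def __init__(self, seconds):
        super().__init__()
        self.seconds = seconds

    def run(self):
        time.sleep(self.seconds)


def lifecycle(seconds=0.5):
    """Sample is_alive() before start, while running, after a short join, and at the end."""
    p = Sleeper(seconds)
    before = p.is_alive()   # not started yet
    p.start()
    during = p.is_alive()   # child is running
    p.join(timeout=0.01)    # gives up after the timeout; the child keeps running
    still = p.is_alive()
    p.join()                # blocks until the child exits
    after = p.is_alive()
    return before, during, still, after


if __name__ == "__main__":
    print(lifecycle())
```

On a typical machine this should show the process as not alive before start(), alive while it sleeps (including after the short timed join), and not alive after the final join().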

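Attributes: authkey, daemon, exitcode (None while the process is running; -N means it was terminated by signal N), name, pid. A process whose daemon flag is set is terminated automatically when its parent process exits, and it cannot create child processes of its own. The daemon flag must be set before start() is called.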
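The exitcode and daemon behavior can be tried out with a sketch like the one below. The helper describe and both target functions are hypothetical; the negative exit code after terminate() is what POSIX systems report, and other platforms may differ.

```python
from multiprocessing import Process
import sys
import time


def describe(exitcode):
    """Turn an exitcode value into a readable label (hypothetical helper)."""
    if exitcode is None:
        return "not finished"
    if exitcode < 0:
        return f"terminated by signal {-exitcode}"
    return f"exited with status {exitcode}"


def exit_with(code):
    sys.exit(code)  # the child's exit status becomes the parent's p.exitcode


def sleep_forever():
    while True:
        time.sleep(1)


if __name__ == "__main__":
    done = Process(target=exit_with, args=(3,), name="exits-with-3")
    done.start()
    done.join()
    print(done.name, done.pid, describe(done.exitcode))

    bg = Process(target=sleep_forever)
    bg.daemon = True               # must be set before start()
    bg.start()
    print(describe(bg.exitcode))   # still running, so exitcode is None
    bg.terminate()                 # sends SIGTERM on POSIX
    bg.join()
    print(describe(bg.exitcode))
```

Setting daemon after start() raises an error, which is why the assignment sits right after the constructor.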

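The demo below crawls the Biquge (笔趣阁) novel site. It fetches only four novels and starts four processes at the same time. The way the processes are started is a bit crude; it is written that way to make timing the run easy. If you know a better approach, leave a comment; suggestions are welcome.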


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/1/3 17:15
# @Author  : jia.zhao
# @Desc    :
# @File    : process_spider.py
# @Software: PyCharm
import os
import time
from multiprocessing import Process, Lock, Queue
from queue import Empty

import requests
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

q = Queue()
chrome_options = Options()
chrome_options.add_argument('--headless')


class scrapy_biquge():
    def get_url(self):
        browser = webdriver.Chrome(options=chrome_options)
        browser.get('http://www.xbiquge.la/xuanhuanxiaoshuo/')
        # Locate the list of novels
        content = browser.find_element(By.CLASS_NAME, "r")
        content = content.find_elements(By.XPATH, '//ul/li/span[@class="s2"]/a')
        for i in range(len(content)):
            # Novel title
            title = content[i].text
            # Novel URL
            href = content[i].get_attribute('href')
            print(title + '+' + href)
            # Put "title+url" on the queue
            q.put(title + '+' + href)
            # Only take the first four novels
            if i == 3:
                break
        browser.quit()


def save_chapters(path, chapters):
    # Append a batch of chapters to the novel's text file
    try:
        with open(path, 'a', encoding='utf8') as f:
            f.write('\n'.join(chapters))
    except Exception as e:
        print(e)


def get_dir(title, href):
    time.sleep(2)
    res = requests.get(href, timeout=60)
    res.encoding = 'utf8'
    novel_contents = etree.HTML(res.text)
    # Chapter titles and chapter links from the table of contents
    novel_dir = novel_contents.xpath('//div[@id="list"]/dl/dd/a//text()')
    novel_dir_href = novel_contents.xpath('//div[@id="list"]/dl/dd/a/@href')
    os.makedirs('novel', exist_ok=True)
    path = 'novel/' + title + '.txt'
    list_content = []
    for novel in range(len(novel_dir)):
        novel_dir_content = get_content('http://www.xbiquge.la' + novel_dir_href[novel])
        print(title, novel_dir[novel])
        list_content.append(novel_dir[novel] + '\n' + ''.join(novel_dir_content) + '\n')
        # Write to disk every two chapters
        if len(list_content) == 2:
            save_chapters(path, list_content)
            list_content = []
    # Write any chapter left over after the last full batch
    if list_content:
        save_chapters(path, list_content)


def get_content(novel_dir_href):
    time.sleep(2)
    res = requests.get(novel_dir_href, timeout=60)
    res.encoding = 'utf8'
    html_contents = etree.HTML(res.text)
    novel_dir_content = html_contents.xpath('//div[@id="content"]//text()')
    return novel_dir_content


class MyProcess(Process):
    def __init__(self, q, lock):
        Process.__init__(self)
        self.q = q
        self.lock = lock

    def run(self):
        # qsize() is approximate and may be unavailable on some platforms (e.g. macOS)
        print(self.q.qsize(), 'queue size')
        print('Pid: ' + str(self.pid))
        while True:
            # Take one item under the lock; stop once the queue stays empty
            with self.lock:
                try:
                    item = self.q.get(timeout=1)
                except Empty:
                    break
                print(item)
            title, href = item.rsplit('+', 1)
            try:
                get_dir(title, href)
            except Exception as e:
                print(e, 'error, skipping this novel')
                continue


if __name__ == '__main__':
    start_time = time.time()
    print(start_time)
    scrapy_biquge().get_url()
    lock = Lock()
    p0 = MyProcess(q, lock)
    p0.start()
    p1 = MyProcess(q, lock)
    p1.start()
    p2 = MyProcess(q, lock)
    p2.start()
    p3 = MyProcess(q, lock)
    p3.start()
    p0.join()
    p1.join()
    p2.join()
    p3.join()
    end_time = time.time()
    print(start_time, end_time, end_time - start_time, 'elapsed')

The crawler uses a multiprocessing queue to share data between processes. The code should run as is; leave a comment if you hit a problem.
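One possible tidier way to start the workers and share data through queues is sketched below. It is not the code from this post: handle, worker, run_workers, the STOP sentinel and the example.invalid URLs are all assumptions made for illustration, and the download step is replaced by a stand-in that only splits the "title+url" string.

```python
from multiprocessing import Process, Queue

STOP = None  # sentinel meaning "no more work" (an assumption of this sketch)


def handle(item):
    """Stand-in for the real download step: split a 'title+url' string."""
    title, href = item.rsplit("+", 1)
    return title, href


def worker(tasks, results):
    # Keep calling tasks.get() until the sentinel arrives; no empty() check needed.
    for item in iter(tasks.get, STOP):
        results.put(handle(item))


def run_workers(items, n_workers=4):
    """Process a list of items with n_workers processes and return sorted results."""
    tasks, results = Queue(), Queue()
    for item in items:
        tasks.put(item)
    for _ in range(n_workers):
        tasks.put(STOP)  # one sentinel per worker
    procs = [Process(target=worker, args=(tasks, results)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    # Collect results before join(): a child may not exit until its queued data is read.
    out = [results.get() for _ in items]
    for p in procs:
        p.join()
    return sorted(out)


if __name__ == "__main__":
    demo = [f"novel-{i}+http://example.invalid/{i}/" for i in range(4)]
    start_workers_result = run_workers(demo)
    print(start_workers_result)
```

Keeping the processes in a list means the worker count becomes a single parameter, and the timing code around the start and join loops stays the same as in the demo.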

Further reading: https://cuiqingcai.com/3335.html
