Scraping Jin Yong's Novels with a Crawler -- Revisiting the Classics
Views: 6248
Published: 2019-06-22

This article is about 3,393 characters long; reading it takes roughly 11 minutes.

Jin Yong Novel Site Crawler

I'm new to web scraping; bug reports are very welcome.

Links

requests + BeautifulSoup + multiprocessing

GitHub: https://github.com/Frankssss/NovelSpider

 

import random
import time
from queue import Empty
from multiprocessing import Process, Queue

import requests
from bs4 import BeautifulSoup as bs


class NovelDownload(Process):
    def __init__(self, name, url_queue):
        super(NovelDownload, self).__init__()
        self.name = name
        self.url_queue = url_queue

    def run(self):
        print('%s started' % self.name)
        time.sleep(1)
        while True:
            # get_nowait() avoids the race between empty() and get():
            # another worker may take the last item between the two calls.
            try:
                book = self.url_queue.get_nowait()
            except Empty:
                break
            print('Downloading %s' % book[0])
            for chapter in self.getChapterList(book):
                with open('d:\\data\\金庸\\%s.txt' % book[0], 'a', encoding='utf8') as f:
                    f.write(chapter[0] + '\n')
                    for content in self.getContent(chapter):
                        f.write(content + '\n')
            print('%s finished' % book[0])
        print('%s done' % self.name)

    @staticmethod
    def getBookList(url_queue):
        """Scrape the book index page and enqueue [name, url] pairs."""
        book_list = []
        url = 'http://www.jinyongwang.com/book/'
        html = NovelDownload.getHtmlText(url)
        soup = bs(html, 'lxml')
        booklist = soup.find('div', attrs={'class': 'booklist'})
        ul = booklist.find('ul', attrs={'class': 'list'})
        for li in ul.find_all('li'):
            book_url = url.rsplit('/', 2)[0] + li.find('a').get('href')
            book_name = li.find('img').get('alt')
            book_list.append([book_name, book_url])
            url_queue.put([book_name, book_url])
        return book_list

    def getChapterList(self, book):
        """Yield [chapter_name, chapter_url] for every chapter of a book."""
        html = NovelDownload.getHtmlText(book[1])
        soup = bs(html, 'lxml')
        # attrs must be a dict; {'class', 'mlist'} (a set) was a bug
        ul = soup.find('ul', attrs={'class': 'mlist'})
        for li in ul.find_all('li'):
            chapter_url = 'http://www.jinyongwang.com' + li.find('a').get('href')
            chapter_name = li.find('a').get_text()
            yield [chapter_name, chapter_url]

    def getContent(self, chapter):
        """Yield the text of each paragraph on a chapter page."""
        html = NovelDownload.getHtmlText(chapter[1])
        soup = bs(html, 'lxml')
        div = soup.find('div', attrs={'class': 'vcon'})
        for p in div.find_all('p'):
            yield p.get_text()

    @staticmethod
    def getHtmlText(url):
        # Random short delay to be polite to the server.
        time.sleep(random.random())
        headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/70.0.3538.77 Safari/537.36'}
        try:
            r = requests.get(url, headers=headers)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            return None


def create_pool(url_queue):
    pool_list = []
    for name in ['worker-1', 'worker-2', 'worker-3', 'worker-4']:
        pool_list.append(NovelDownload(name, url_queue))
    return pool_list


def create_queue():
    return Queue()


def main():
    url_queue = create_queue()
    NovelDownload.getBookList(url_queue)
    pool_list = create_pool(url_queue)
    for p in pool_list:
        p.start()
    for p in pool_list:
        p.join()


if __name__ == '__main__':
    temp = time.time()
    main()
    print(time.time() - temp)
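The three parsing methods all follow the same find/find_all pattern, which can be tried offline without hitting the site. The HTML fragment below is hypothetical -- it only mirrors the tag structure getBookList() expects (the real layout on jinyongwang.com may differ), and html.parser is used so the example does not require the lxml extra:

```python
from bs4 import BeautifulSoup as bs

# Hypothetical fragment mimicking the book index structure.
html = '''
<div class="booklist">
  <ul class="list">
    <li><a href="/book/1"><img alt="射雕英雄传"></a></li>
    <li><a href="/book/2"><img alt="天龙八部"></a></li>
  </ul>
</div>
'''

soup = bs(html, 'html.parser')
ul = soup.find('div', attrs={'class': 'booklist'}).find('ul', attrs={'class': 'list'})
books = [[li.find('img').get('alt'),                            # book title
          'http://www.jinyongwang.com' + li.find('a').get('href')]  # absolute URL
         for li in ul.find_all('li')]
print(books)
```

Running this prints the [name, url] pairs that the real crawler would push into the shared queue.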

  

Reposted from: https://www.cnblogs.com/frank-shen/p/9910272.html
