Downloading Qidian novels as txt (how to download novels from Qidian)

Source: Internet | 2023-04-30


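People who enjoy web novels often want to download them. Some readers do not want to pay the official site and do not want to register accounts on other sites either, so for less popular novels, or for novels that are still being serialized, it is hard to find a txt file or any other downloadable format.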

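So I decided to fetch a novel with a crawler. This is a beginner-level case with very little code, which makes it a reasonable first exercise in web scraping.

A note from the author: if you can afford it, please read the licensed edition. Authors spend a great deal of time and effort writing these stories, so please respect their work. You do not need to tip them; supporting the licensed edition is enough.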

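Creating a new Scrapy project

Scrapy is a crawler framework for Python. Install it with:

    pip install scrapy

After installation, open a command-line window, change to the directory where you want to create the project, and create a new Scrapy project with the command below.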

    scrapy startproject ebook

This generates a basic code skeleton. Next, run:

    cd ebook
    scrapy genspider example example.com

genspider creates a spider from a template, so a new example.py file appears in the project's spiders directory.

Example 1

For the first example I chose Qidian (起点中文网) and picked an arbitrary novel on the site.

    scrapy genspider qxzz qidian.com

This command creates the file qxzz.py.

[Figure: the generated qxzz.py opened in an editor]

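Change the contents of start_urls to the address of this novel.

Crawling the chapter URLs

In Chrome, open the menu in the upper-right corner and choose "More tools" > "Developer tools" to inspect the page.

[Figure: the novel's page with the developer tools open]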

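In the right-hand pane, find the tag that holds the table of contents. It is the first box in the figure; each chapter's entry is in the second box.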

    # -*- coding: utf-8 -*-
    import scrapy


    class QxzzSpider(scrapy.Spider):
        name = "qxzz"
        allowed_domains = ["qidian.com"]
        start_urls = ["https://book.qidian.com/info/1011146676/"]

        def parse(self, response):
            # Get the list of catalog entries
            pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
            for page in pages:
                # For each <li>, read the href attribute of its child <a>
                url = page.xpath("./child::a/attribute::href").extract_first()
                print(url)
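The XPath above can be tried offline before running the spider. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree (which supports a subset of XPath) instead of Scrapy's selectors, and the SAMPLE markup and chapter paths are hypothetical stand-ins for the real catalog page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the catalog markup.
SAMPLE = """
<html><body>
  <div id="j-catalogWrap">
    <ul class="cf">
      <li><a href="//read.qidian.com/chapter/aaa">Chapter 1</a></li>
      <li><a href="//read.qidian.com/chapter/bbb">Chapter 2</a></li>
    </ul>
  </div>
</body></html>
"""


def chapter_links(markup):
    """Return the href of the <a> inside every catalog <li>."""
    root = ET.fromstring(markup)
    # Same path as in the spider, written in ElementTree's XPath subset.
    items = root.findall(".//div[@id='j-catalogWrap']//ul[@class='cf']/li")
    return [li.find("a").get("href") for li in items]


links = chapter_links(SAMPLE)
print(links)
```

ElementTree only parses well-formed XML, so this works for a hand-written snippet but not necessarily for a real page; inside the spider, Scrapy's own selectors remain the right tool.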

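After writing the program, run the following command on the command line to see the result:

    scrapy crawl qxzz

The spider prints one chapter link per line.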

If you run into the error "No module named win32api", running pip install pypiwin32 fixes it.

Crawling each chapter's content

As the figure shows, the whole chapter sits inside the main-text-wrap element (the first box), the chapter title is in the second box, and the body text is in the third box. The modified code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy


    class QxzzSpider(scrapy.Spider):
        name = "qxzz"
        allowed_domains = ["qidian.com"]
        start_urls = ["https://book.qidian.com/info/1011146676/"]

        def parse(self, response):
            pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
            for page in pages:
                url = page.xpath("./child::a/attribute::href").extract_first()
                # Follow each chapter link and parse it with parse_chapter
                req = response.follow(url, callback=self.parse_chapter)
                yield req

        def parse_chapter(self, response):
            # Chapter title (text of the h3 heading)
            title = response.xpath(
                '//div[@class="main-text-wrap"]//h3[@class="j_chapterName"]/text()'
            ).extract_first().strip()
            # Chapter body, kept as the HTML of the content div
            content = response.xpath(
                '//div[@class="main-text-wrap"]//div[@class="read-content j_readContent"]'
            ).extract_first().strip()
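response.follow accepts relative links, so the href values taken from the catalog do not have to be absolute URLs. As a rough illustration of how such links resolve against the page address, the sketch below uses urllib.parse.urljoin from the standard library; the three href values are hypothetical examples, not links copied from the site.

```python
from urllib.parse import urljoin

# The catalog page from start_urls serves as the base address.
BASE = "https://book.qidian.com/info/1011146676/"

# Hypothetical href values in three common relative forms.
hrefs = [
    "//read.qidian.com/chapter/aaa",  # protocol-relative: keeps the scheme only
    "/chapter/bbb",                   # host-relative: keeps scheme and host
    "ccc.html",                       # path-relative: appended to the base path
]

resolved = [urljoin(BASE, href) for href in hrefs]
for url in resolved:
    print(url)
```

A host such as read.qidian.com is a subdomain of qidian.com, so requests to it still fall within the allowed_domains setting used by the spider.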

Saving the content

Once the content has been fetched, it can be saved as HTML. To make sorting easier, the file name needs a sequence number. On Qidian, each li tag happens to carry a "data-rid" attribute that can serve as that number. The modified code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy


    class QxzzSpider(scrapy.Spider):
        name = "qxzz"
        allowed_domains = ["qidian.com"]
        start_urls = ["https://book.qidian.com/info/1011146676/"]

        def parse(self, response):
            pages = response.xpath('//div[@id="j-catalogWrap"]//ul[@class="cf"]/li')
            for page in pages:
                url = page.xpath("./child::a/attribute::href").extract_first()
                # data-rid on the <li> is used as the chapter's sequence number
                idx = page.xpath("./attribute::data-rid").extract_first()
                req = response.follow(url, callback=self.parse_chapter)
                # Pass the sequence number along to the chapter callback
                req.meta["idx"] = idx
                yield req

        def parse_chapter(self, response):
            idx = response.meta["idx"]
            title = response.xpath(
                '//div[@class="main-text-wrap"]//h3[@class="j_chapterName"]/text()'
            ).extract_first().strip()
            content = response.xpath(
                '//div[@class="main-text-wrap"]//div[@class="read-content j_readContent"]'
            ).extract_first().strip()
            filename = "./down/%s_%s.html" % (idx, title)
            # The markup of this template was lost in the source; wrapping the
            # title in <h1> is an assumption
            cnt = "<h1>%s</h1>\n%s" % (title, content)
            with open(filename, "wb") as f:
                f.write(cnt.encode("utf-8"))

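Remember to create the down folder in the working directory first; otherwise the spider may fail when it tries to write the files. Pages for different books on the same site should share the same markup structure, so changing the single line that holds the URL is enough to crawl other books from that site. After the crawl has finished and the files are saved, you can build an EPUB e-book with Sigil or convert the files to other formats.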
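The manual folder step and the file naming can also be handled in code. The following is a minimal sketch, not part of the original spider: chapter_path is a hypothetical helper that creates the output folder when it is missing, replaces characters that are not valid in file names, and zero-pads the index so that alphabetical order matches chapter order. It assumes the index is a decimal string.

```python
import os
import re
import tempfile


def chapter_path(out_dir, idx, title):
    """Build a sortable, filesystem-safe path for one chapter (hypothetical helper)."""
    # Create the output folder if it is missing instead of relying on a manual step.
    os.makedirs(out_dir, exist_ok=True)
    # Replace characters that are invalid in file names on common systems.
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    # Zero-pad the index so that alphabetical order matches chapter order.
    # Assumption: idx is a decimal string.
    return os.path.join(out_dir, "%05d_%s.html" % (int(idx), safe_title))


# Demonstration in a temporary directory, so nothing is written to the project.
out_dir = os.path.join(tempfile.mkdtemp(), "down")
path = chapter_path(out_dir, "7", "Chapter 1: A/B?")
with open(path, "wb") as f:
    f.write("<h1>Chapter 1</h1>".encode("utf-8"))
print(os.path.basename(path))
```

Without the padding, a file named 10_... would sort before 2_... in most file managers, which matters when the files are later assembled into an e-book.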

Example 2

The second example introduces nothing new; it is just further practice with the same material. It uses the novel below.

The tags that hold the table of contents and the chapter text are both boxed in the figure below.

    # -*- coding: utf-8 -*-
    import scrapy


    class DmhsSpider(scrapy.Spider):
        name = "dmhs"
        allowed_domains = ["m.x23us.com"]
        start_urls = ["https://m.x23us.com/html/51/51940/"]

        def parse(self, response):
            pages = response.xpath('//div[@class="cover"]//ul[@class="chapter"]/li')
            # pages = response.xpath('//ul[@class="chapter"]/li')
            for page in pages:
                url = page.xpath("./child::a/attribute::href").extract_first()
                # The first 8 characters of the href are the chapter number
                idx = str(url[0:8])
                req = response.follow(url, callback=self.parse_chapter)
                req.meta["idx"] = idx
                yield req

        def parse_chapter(self, response):
            idx = response.meta["idx"]
            title = response.xpath(
                '//div[@class="content"]//h1[@id="nr_title"]/text()'
            ).extract_first().strip()
            content = response.xpath(
                '//div[@class="content"]//div[@class = "txt"]'
            ).extract_first().strip()
            filename = "./down/%s_%s.html" % (idx, title)
            # The markup of this template was lost in the source; wrapping the
            # title in <h1> is an assumption
            cnt = "<h1>%s</h1>\n%s" % (title, content)
            with open(filename, "wb") as f:
                f.write(cnt.encode("utf-8"))

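This site has nothing like "data-rid", but the number in each chapter's page address increases with the chapter number, so that number is cut out of the address and used to order the saved HTML files. Everything else is the same as in Example 1. A real need really is the best motivation for learning.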
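Taking the first eight characters works only while every link starts with an eight-digit number. A slightly more tolerant variant, shown here as a sketch with hypothetical href values, pulls the trailing number out with a regular expression and sorts on its integer value.

```python
import re


def chapter_number(url):
    """Return the number just before '.html' in a chapter link, or None."""
    match = re.search(r"(\d+)\.html$", url)
    return int(match.group(1)) if match else None


# Hypothetical chapter links in the two forms a page might use.
hrefs = [
    "25330910.html",
    "25330907.html",
    "/html/51/51940/25330908.html",
]

# Sorting by the integer value avoids the pitfalls of comparing digit strings.
ordered = sorted(hrefs, key=chapter_number)
print(ordered)
```

Links without a trailing number make chapter_number return None, which cannot be compared with integers, so they should be filtered out before sorting.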

