
Scraping daomubiji.com (is daomubiji.com the official Daomubiji site?) — a practical walkthrough

Scrape the novels into the directory structure below. items.py, settings.py, and middlewares.py use their normal configuration; only the spider and the pipeline need custom code, and both are shown next.
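For reference, the only project-level wiring the spider strictly needs is registering the pipeline. A minimal sketch of settings.py, assuming the project module is named daomubiji (the module name and the politeness settings are assumptions, not from the original post):

```python
# settings.py -- minimal sketch; only ITEM_PIPELINES is strictly required.
# The module name "daomubiji" is an assumption about the project layout.
BOT_NAME = 'daomubiji'

ROBOTSTXT_OBEY = False   # the site's robots.txt may disallow crawling
DOWNLOAD_DELAY = 1       # be polite: at most one request per second

ITEM_PIPELINES = {
    # class name matches the DaomuPipeline defined in pipelines.py below
    'daomubiji.pipelines.DaomuPipeline': 300,
}
```

With this in place the crawl is started from the project root with `scrapy crawl dmbj`.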


spider:

import scrapy
import os


class DmbjSpider(scrapy.Spider):
    name = 'dmbj'
    allowed_domains = ['www.daomubiji.com']

    def start_requests(self):
        # Eleven entry pages: books 1-8, the 2015 updates, Sha Hai, Zang Hai Hua.
        for i in range(1, 12):
            if i < 9:
                start_url = 'http://www.daomubiji.com/dao-mu-bi-ji-{}'.format(i)
            elif i == 9:
                start_url = 'http://www.daomubiji.com/dao-mu-bi-ji-2015'
            elif i == 10:
                start_url = 'http://www.daomubiji.com/sha-hai'
            elif i == 11:
                start_url = 'http://www.daomubiji.com/zang-hai-hua'
            yield scrapy.Request(start_url, callback=self.list_parse)

    def list_parse(self, response):
        list_urls = response.xpath('//article[@class="excerpt excerpt-c3"]/a/@href')
        for url in list_urls:
            # item must be created inside the loop; a single shared dict would be
            # overwritten so every request ends up carrying the last url.
            item = {}
            detail_url = url.get()
            item['url'] = detail_url
            if 'qi-xing-lu-wang' in item['url']:
                item['path'] = '盗墓笔记/七星鲁王/'
            elif 'nu-hai-qian-sha' in item['url']:
                item['path'] = '盗墓笔记/怒海潜沙/'
            elif 'qin-ling-shen-shu' in item['url']:
                item['path'] = '盗墓笔记/秦岭神树/'
            elif 'yun-ding-tian-gong' in item['url']:
                item['path'] = '盗墓笔记/云顶天宫/'
            elif 'she-zhao-gui-cheng' in item['url']:
                item['path'] = '盗墓笔记/蛇沼鬼城/'
            elif 'mi-hai-gui-chao' in item['url']:
                item['path'] = '盗墓笔记/谜海归巢/'
            elif '2-yin-zi' in item['url']:
                item['path'] = '盗墓笔记/第二季/引子/'
            elif 'yin-shan-gu-lou' in item['url']:
                item['path'] = '盗墓笔记/第二季/阴山古楼/'
            elif 'qiong-long-shi-ying' in item['url']:
                item['path'] = '盗墓笔记/第二季/邛笼石影/'
            elif 'dao-mu-bi-ji-7' in item['url']:
                item['path'] = '盗墓笔记/第二季/盗墓笔记7/'
            elif 'dajieju' in item['url']:
                item['path'] = '盗墓笔记/第二季/大结局/'
            elif '2015' in item['url']:
                item['path'] = '盗墓笔记/2015年更新/'
            elif 'shahai' in item['url']:
                item['path'] = '盗墓笔记/沙海/'
            elif 'zang-hai-hua' in item['url']:
                item['path'] = '盗墓笔记/藏海花/'
            else:
                print('no save path matched for this page:', item['url'])
            if not os.path.exists(item['path']):
                os.makedirs(item['path'])
            yield scrapy.Request(detail_url, meta={'item': item}, callback=self.parse)

    def parse(self, response, **kwargs):
        item = response.meta['item']
        # strip '?' so the chapter title is a legal file name
        item['name'] = response.xpath('//h1/text()').get().replace('?', '')
        contents = response.xpath('//article//text()')
        content = ''
        for i in contents:
            # drop full-width spaces (U+3000) used for paragraph indentation
            content += i.get().strip().replace('\u3000', '') + '\n'
        item['content'] = content
        yield item

pipelines:

class DaomuPipeline:
    def process_item(self, item, spider):
        file_name = item['name'] + '.txt'
        with open(item['path'] + file_name, 'w', encoding='utf-8') as f:
            f.write(item['content'])
        print(file_name + ' --> saved to /{} --> done!'.format(item['path']))
        return item
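The long if/elif ladder in list_parse can be collapsed into an ordered lookup table. A standalone sketch (the slug/path pairs are copied from the spider; insertion order is preserved by Python dicts, so more specific slugs such as 'dao-mu-bi-ji-7' are tested before broader ones such as '2015'):

```python
# Ordered slug -> directory table, copied from the spider's if/elif chain.
PATH_MAP = {
    'qi-xing-lu-wang': '盗墓笔记/七星鲁王/',
    'nu-hai-qian-sha': '盗墓笔记/怒海潜沙/',
    'qin-ling-shen-shu': '盗墓笔记/秦岭神树/',
    'yun-ding-tian-gong': '盗墓笔记/云顶天宫/',
    'she-zhao-gui-cheng': '盗墓笔记/蛇沼鬼城/',
    'mi-hai-gui-chao': '盗墓笔记/谜海归巢/',
    '2-yin-zi': '盗墓笔记/第二季/引子/',
    'yin-shan-gu-lou': '盗墓笔记/第二季/阴山古楼/',
    'qiong-long-shi-ying': '盗墓笔记/第二季/邛笼石影/',
    'dao-mu-bi-ji-7': '盗墓笔记/第二季/盗墓笔记7/',
    'dajieju': '盗墓笔记/第二季/大结局/',
    '2015': '盗墓笔记/2015年更新/',
    'shahai': '盗墓笔记/沙海/',
    'zang-hai-hua': '盗墓笔记/藏海花/',
}

def path_for(url):
    """Return the save directory for a chapter URL, or None if no slug matches."""
    for slug, path in PATH_MAP.items():
        if slug in url:
            return path
    return None
```

Inside list_parse this reduces the whole chain to `item['path'] = path_for(item['url'])` plus a None check, and adding a new book becomes a one-line change to the table instead of a new elif branch.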
