Author: 陈志玲 · Copy editor: 余术玲 · Technical editor: 张邯

In the earlier post《爬虫实战——聚募网股权众筹信息爬取》(Scraping in Practice: Equity-Crowdfunding Data from dreammove.cn), we used the requests and time libraries to scrape the site. The script got the job done, but it was long and hard to follow. Today we tidy it up by wrapping each step in a function, so that in the end all we need to do is call those functions to collect the information.

In Python, a function is defined with the `def` statement: write the function name, a pair of parentheses, the parameter list inside the parentheses, and a colon, then put the function body in an indented block. The function's result is sent back with a `return` statement:

```python
def function_name(parameter_list):
    # function body
    return value
```

As before, we first import the three libraries we need:

```python
import requests
import time
import json
```
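To see `def` and `return` in isolation before we get to the scraper, here is a minimal self-contained example (the function name and values are invented for illustration):

```python
def GetGreeting(name):
    # build a greeting string and return it to the caller
    return "Hello, " + name

message = GetGreeting("World")  # call the function with one argument
print(message)  # → Hello, World
```

Defining the function creates it; nothing runs until it is called with an argument.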
Now we wrap the code that fetches a single page of listings. First we choose a function name, GetPageProjectInfo (use only letters and underscores), and give it the parameters the function needs to do its job — here just the page number `page`:

```python
def GetPageProjectInfo(page):
    # headers is omitted here; see the complete program at the end
    timestamp = int(round(time.time() * 1000))
    url = (f"https://www.dreammove.cn/list/get_list.html"
           f"?type=8&industry=0&city=0&offset={page}&keyword=&_={timestamp}")
    raw_html = requests.get(url, headers=headers)  # requests.get(url, params=None, **kwargs)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["list"]  # return just the value we need
```

At this point GetPageProjectInfo() has only been declared in memory. To actually fetch the first page, call the function with the argument 1 and print its return value:

```python
print(GetPageProjectInfo(1))
```
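The last line of GetPageProjectInfo relies on json.loads() turning the response body into nested dictionaries and lists that can then be indexed step by step. A self-contained sketch with an invented payload (the field names imitate the API's shape, but the values are made up):

```python
import json

# An invented response body with the same nesting as the listing API.
sample = '{"data": {"list": [{"id": "97", "project_name": "demo project"}]}}'

projects = json.loads(sample)["data"]["list"]
print(projects[0]["project_name"])  # → demo project
```

Each `[...]` step walks one level down the parsed structure, which is why a change in the API's JSON layout would break the function at exactly that line.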
Next we wrap the second step — collecting the projects from every page. Again we start by choosing a function name:

```python
def GetAllProjectInfo():
    # no parameters are needed to call this function
    ProjectInfo = []
    for i in range(1, 23):
        # splice the value returned by GetPageProjectInfo(i) onto the list ProjectInfo
        ProjectInfo.extend(GetPageProjectInfo(i))
    return ProjectInfo
```

Following the order of the earlier post, we now wrap the code that writes the content returned by GetAllProjectInfo() to a csv file. We define a function Json2csv_Project(Info, VarName, FileName) and pass it three arguments: Info is the ProjectInfo returned above (the information on every project), VarName is the list of variable names that becomes the header row of the csv file, and FileName is the name of the file we end up with:

```python
def Json2csv_Project(Info, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("\t".join(VarName) + "\n")
        for EachInfo in Info:
            tempInfo = []
            for key in VarName:
                if key in EachInfo:
                    # convert the value to a string and strip newlines, tabs and carriage returns
                    tempInfo.append(str(EachInfo[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                else:
                    tempInfo.append("")
            f.write("\t".join(tempInfo) + "\n")
```

Then we need to pull the project ids back out of the csv file, and we wrap that step as well:

```python
def GetId(FileName):
    # the csv file name is passed in as the parameter
    with open(FileName, "r", encoding="gb18030") as f:
        final_Info = f.readlines()
    ProjectId = []
    for i in range(1, len(final_Info)):  # start at 1 to skip the header row
        ProjectId.append(final_Info[i].split("\t")[0])
    return ProjectId  # return the ids of all projects
```
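Json2csv_Project and GetId form a write-then-read round trip through a tab-separated file. A minimal sketch of that round trip using two invented records and a temporary file (rather than the real scrape output):

```python
import os
import tempfile

records = [{"id": "97", "project_name": "demo"}, {"id": "98"}]  # invented data
VarName = ["id", "project_name"]
path = os.path.join(tempfile.gettempdir(), "round_trip_demo.csv")

# write: a header row, then one tab-separated line per record ("" for missing keys)
with open(path, "w", encoding="gb18030") as f:
    f.write("\t".join(VarName) + "\n")
    for rec in records:
        f.write("\t".join(str(rec.get(key, "")) for key in VarName) + "\n")

# read back: skip the header, keep the first column of every remaining line
with open(path, "r", encoding="gb18030") as f:
    lines = f.readlines()
ids = [line.split("\t")[0] for line in lines[1:]]
print(ids)  # → ['97', '98']
```

This is also why the values must have tabs and newlines stripped before writing: a stray `\t` or `\n` inside a field would shift every column after it.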
With the ids in hand, we carry on as before and scrape the second-level pages, defining a function GetTeamInfo(). Because the code is now wrapped in functions, we can pass the ProjectId list obtained above straight in as a parameter — there is no need to loop over the 198 ids one by one the way we did in the earlier post.
The function looks like this:

```python
def GetTeamInfo(ProjectId):
    timestamp = int(round(time.time() * 1000))
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId}?_={timestamp}"
    # headers is omitted here; see the complete program at the end
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["team_list"]
```

Now we call this function to get each project's team information and splice the results together into a single list. So that we can inspect what we have scraped at any time, today we go one step further and save the information to a csv file as well.
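Splicing per-project (or per-page) results onto one flat list is what list.extend() does; append() would nest each sub-list instead. A quick illustration with invented team data:

```python
all_members = []
team_a = [{"name": "Ann"}, {"name": "Bob"}]  # invented team lists
team_b = [{"name": "Cai"}]

for team in (team_a, team_b):
    all_members.extend(team)  # extend flattens each team's members into the list

print(len(all_members))  # → 3
```

With append() the result would be `[[...], [...]]` — two nested lists of length 2, not three member dictionaries.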
This is much like writing all the project information to a csv file above. The program:

```python
def Json2csv_Team(Team_Id, VarName, FileName):
    # Team_Id is the list of all ids; VarName holds the variable names that
    # become the header row; FileName is the csv file we end up with
    with open(FileName, "w", encoding="gb18030") as f:  # write to FileName in gb18030 encoding
        f.write("id\t" + "\t".join(VarName) + "\n")
        for Eachid in Team_Id:
            TeamInfo = GetTeamInfo(Eachid)  # get the team info with GetTeamInfo()
            if TeamInfo.__class__ == list:  # check whether TeamInfo is a list
                for Eachperson in TeamInfo:  # if it is, walk the list element by element
                    print(Eachperson)
                    tempInfo = [Eachid]  # put the id from the outer loop first
                    for key in VarName:  # walk the variable names
                        if key in Eachperson:  # if this element (a dict) has the key
                            # convert the value to a string and strip
                            # newlines, tabs and carriage returns
                            tempInfo.append(str(Eachperson[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                        else:
                            tempInfo.append("")
                    f.write("\t".join(tempInfo) + "\n")
```

At this point every piece of the program has been wrapped. All that remains is to call the functions — the so-called main program.
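The list check above compares `__class__` directly. The more idiomatic way to write such a type test is isinstance(), which also accepts subclasses; a small comparison (MyList is an invented subclass for illustration):

```python
class MyList(list):
    pass

data = MyList([1, 2, 3])

print(data.__class__ == list)  # → False: the exact class is MyList, not list
print(isinstance(data, list))  # → True: MyList is a subclass of list
```

For this scraper the two behave the same, since json.loads() only ever produces plain lists, but isinstance() is the safer habit.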
The `if __name__ == "__main__":` test lets the script tell whether it is being run directly or imported as a module; when the file is imported, the code under the `if` is not executed.

```python
if __name__ == "__main__":
    # call GetAllProjectInfo() to fetch the project info, then call
    # Json2csv_Project() to write it to the csv file Project_FileName
    VarName = ["id", "update_time", "province_name", "subsite_id", "is_open",
               "industry", "type", "open_flag", "project_name", "step",
               "seo_string", "abstract", "cover", "project_phase", "member_count",
               "province", "city", "address", "company_name", "project_url",
               "uid", "over_time", "vote_leader_step", "stage", "is_agree",
               "is_del", "agreement_id", "barcode", "sort", "display_subsite_id",
               "need_fund", "real_fund", "project_valuation", "final_valuation",
               "min_lead_fund", "min_follow_fund", "total_fund", "agree_total_fund",
               "leader_flag", "leader_id", "read_cnt", "follow_cnt", "inverstor_cnt",
               "comment_cnt", "nickname", "short_name", "site_url", "site_logo",
               "storelevel", "industry_name"]
    Project_FileName = "C:\\CrowdFunding\\dreammove\\ProjectInfo.csv"
    Json2csv_Project(GetAllProjectInfo(), VarName, Project_FileName)

    # fetch the second-level team info and store it in TeamInfo.csv:
    # GetId() pulls the project ids out of Project_FileName, then
    # Json2csv_Team() fetches and writes the team information
    VarName = ["name", "duty", "src", "intro", "is_fulltime", "relationship",
               "short_intro", "shared_rate", "amount", "member_type"]
    Team_FileName = "C:\\CrowdFunding\\dreammove\\TeamInfo.csv"
    Json2csv_Team(GetId(Project_FileName), VarName, Team_FileName)
```

Note that the header names must be written as quoted strings — without the quotes, Python would go looking for variables named id, update_time and so on.

To sum up: when defining a function, we need to settle on the function name and the number of parameters. A function is really just an organized, reusable segment of code that carries out a single task or a group of related ones.
Functions raise code reuse, keep behavior consistent, and make a program easier to maintain and extend — which is exactly why we took one long, messy script and wrapped it up. Finally, here is the complete program:

```python
import requests
import time
import json


def GetPageProjectInfo(page):
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Referer": "https://www.dreammove.cn/list/index.html?industry=0&type=8&city=0",
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }
    timestamp = int(round(time.time() * 1000))
    url = (f"https://www.dreammove.cn/list/get_list.html"
           f"?type=8&industry=0&city=0&offset={page}&keyword=&_={timestamp}")
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["list"]
# print(GetPageProjectInfo(1))


def GetAllProjectInfo():
    ProjectInfo = []
    for i in range(1, 23):
        ProjectInfo.extend(GetPageProjectInfo(i))
    return ProjectInfo
# print(GetAllProjectInfo())


def Json2csv_Project(Info, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("\t".join(VarName) + "\n")
        for EachInfo in Info:
            tempInfo = []
            for key in VarName:
                if key in EachInfo:
                    tempInfo.append(str(EachInfo[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                else:
                    tempInfo.append("")
            f.write("\t".join(tempInfo) + "\n")


def GetId(FileName):
    with open(FileName, "r", encoding="gb18030") as f:
        final_Info = f.readlines()
    ProjectId = []
    for i in range(1, len(final_Info)):
        ProjectId.append(final_Info[i].split("\t")[0])
    return ProjectId


def GetTeamInfo(ProjectId):
    timestamp = int(round(time.time() * 1000))
    url = f"https://www.dreammove.cn/project/project_team/id/{ProjectId}?_={timestamp}"
    headers = {
        "Accept": "application/json, text/javascript, */*;q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Connection": "keep-alive",
        "Cookie": "PHPSESSID=bk25qacnlg8g68d205h4pqeq56;Hm_lvt_c18b08cac9b94bf4628c0277d3a4d7de=1562549437;,jumu_web_idu=MDAwMDAwMDAwMLGGhpiGr36zsa96r7WEvXE;jumu_web_idp=MDAwMDAwMDAwMMafpd-afJ2NtZ9-r7OXoXE;Hm_lpvt_c18b08cac9b94bf4628c0277d3a4d7de=1562561558",
        "Host": "www.dreammove.cn",
        "Referer": "https://www.dreammove.cn/project/detail/id/97",
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
    }
    raw_html = requests.get(url, headers=headers)
    html_text = raw_html.text
    return json.loads(html_text)["data"]["team_list"]


def Json2csv_Team(Team_Id, VarName, FileName):
    with open(FileName, "w", encoding="gb18030") as f:
        f.write("id\t" + "\t".join(VarName) + "\n")
        for Eachid in Team_Id:
            # print(Eachid)
            TeamInfo = GetTeamInfo(Eachid)
            if TeamInfo.__class__ == list:
                for Eachperson in TeamInfo:
                    print(Eachperson)
                    tempInfo = [Eachid]
                    for key in VarName:
                        if key in Eachperson:
                            tempInfo.append(str(Eachperson[key]).replace("\n", "").replace("\t", "").replace("\r", ""))
                        else:
                            tempInfo.append("")
                    f.write("\t".join(tempInfo) + "\n")


if __name__ == "__main__":
    VarName = ["id", "update_time", "province_name", "subsite_id", "is_open",
               "industry", "type", "open_flag", "project_name", "step",
               "seo_string", "abstract", "cover", "project_phase", "member_count",
               "province", "city", "address", "company_name", "project_url",
               "uid", "over_time", "vote_leader_step", "stage", "is_agree",
               "is_del", "agreement_id", "barcode", "sort", "display_subsite_id",
               "need_fund", "real_fund", "project_valuation", "final_valuation",
               "min_lead_fund", "min_follow_fund", "total_fund", "agree_total_fund",
               "leader_flag", "leader_id", "read_cnt", "follow_cnt", "inverstor_cnt",
               "comment_cnt", "nickname", "short_name", "site_url", "site_logo",
               "storelevel", "industry_name"]
    Project_FileName = "C:\\CrowdFunding\\dreammove\\ProjectInfo.csv"
    Json2csv_Project(GetAllProjectInfo(), VarName, Project_FileName)
    VarName = ["name", "duty", "src", "intro", "is_fulltime", "relationship",
               "short_intro", "shared_rate", "amount", "member_type"]
    Team_FileName = "C:\\CrowdFunding\\dreammove\\TeamInfo.csv"
    Json2csv_Team(GetId(Project_FileName), VarName, Team_FileName)
```
About us: the WeChat public account "Stata and Python数据分析" shares practical data-processing tips for Stata, Python and related software. We are a team of graduate and undergraduate students working on big-data processing and analysis under Professor 李春涛, and we welcome reposts as well as reader submissions on Stata and Python data-handling techniques.