python - Scrapy spider not crawling the required pages -
here website link trying crawl. http://search.epfoservices.in/est_search_display_result.php?pagenum_search=1&totalrows_search=72045&old_rg_id=ap&office_name=&pincode=&estb_code=&estb_name=&paging=paging , below scrapper,as 1 of first attempts scrapping, pardon silly mistakes. kindly have , suggest changes make code running.
items.py
import scrapy class epfocrawl2item(scrapy.item): # define fields item here like: # name = scrapy.field() scrapy.item import item, field s_no = field() old_region_code = field() region_code = field() name = field() address = field() pin = field() epfo_office = field() under_ro = field() under_acc = field() payment = field() pass
epfocrawl1_spider.py
import scrapy scrapy.selector import htmlxpathselector class epfocrawlspider(scrapy.spider): """spider regularly updated search.epfoservices.in""" name = "pfdata" allowed_domains = ["search.epfoservices.in"] starturls = ["http://search.epfoservices.in/est_search_display_result.php?pagenum_search=1&totalrows_search=72045&old_rg_id=ap&office_name=&pincode=&estb_code=&estb_name=&paging=paging"] def parse(self,response): hxs = htmlxpathselector(response) rows = hxs.select('//tr"]') items = [] val in rows: item = val() item['s_no'] = val.select('/td[0]/text()').extract() item['old_region_code'] = val.select('/td[1]/text').extract() item['region_code'] = val.select('/td[2]/text()').extract() item['name'] = val.select('/td[3]/text()').extract() item['address'] = val.select('/td[4]/text()').extract() item['pin'] = val.select('/td[5]/text()').extract() item['epfo_office'] = val.select('/td[6]/text()').extract() item['under_ro'] = val.select('/td[7]/text()').extract() item['under_acc'] = val.select('/td[8]/text()').extract() item['payment'] = val.select('a/@href').extract() items.append(item) yield items
and below log after running "scrapy crawl pfdata"
016-05-25 13:45:11+0530 [scrapy] info: enabled item pipelines: 2016-05-25 13:45:11+0530 [pfdata] info: spider opened 2016-05-25 13:45:11+0530 [pfdata] info: crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-05-25 13:45:11+0530 [scrapy] debug: telnet console listening on 127.0.0.1:6023 2016-05-25 13:45:11+0530 [scrapy] debug: web service listening on 127.0.0.1:6080 2016-05-25 13:45:11+0530 [pfdata] info: closing spider (finished) 2016-05-25 13:45:11+0530 [pfdata] info: dumping scrapy stats: {'finish_reason': 'finished', 'finish_time': datetime.datetime(2016, 5, 25, 8, 15, 11, 343313), 'log_count/debug': 2, 'log_count/info': 7, 'start_time': datetime.datetime(2016, 5, 25, 8, 15, 11, 341872)} 2016-05-25 13:45:11+0530 [pfdata] info: spider closed (finished)
suggestions requested.
the start urls list has start_urls
, not starturls
Comments
Post a Comment