python - Using Scrapy to extract multiple data in table td elements
I am new to Scrapy, and I am trying to extract the following data: "name", "address", "state", and "postal_code". A sample of the HTML is below:
<div id="superheroes">
  <table width="100%" border="0">
    <tr>
      <td valign="top">
        <h2>superheroes in new york</h2>
        <hr/>
      </td>
    </tr>
    <tr valign="top">
      <td width="75%">
        <h2>peter parker</h2>
        <hr />
        <table width="100%">
          <tr valign="top">
            <td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
            <td width="87%" valign="top"><strong>address:</strong> new york city<br/>
              <strong>state:</strong>new york<br/>
              <strong>postal code:</strong>12345<br/>
              <strong>telephone:</strong> 555-123-4567</td>
          </tr>
          <tr>
            <td height="18" valign="top"> </td>
            <td align="right" valign="top"><a href="spiderman"><strong>read more</strong></a></td>
          </tr>
        </table>
        <h2>tony stark</h2>
        <hr />
        <table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
          <tr valign="top">
            <td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
            <td width="87%" valign="top"><strong>address:</strong> new york city<br/>
              <strong>state:</strong> new york<br/>
              <strong>postal code:</strong> 54321<br/>
              <strong>telephone:</strong> 555-987-6543</td>
          </tr>
          <tr>
            <td height="18" valign="top"> </td>
            <td align="right" valign="top"><a href="iron_man"><strong>read more</strong></a></td>
          </tr>
        </table>
      </td>
      <td width="25%">
        <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
      </td>
    </tr>
  </table>
</div>
My superheroes.py contains the following code:
from scrapy.spider import CrawlSpider, Rule
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import Superheroes

items = []

class MySpider(CrawlSpider):
    name = "superheroes"
    allowed_domains = ["www.somedomain.com"]
    start_urls = ["http://www.somedomain.com/ny"]
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
        for table in tables:
            item = Superheroes()
            item['name'] = table.xpath('h2/text()').extract()
            item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
            item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
            item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
            items.append(item)
        return items
And items.py contains:
import scrapy

class Superheroes(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    state = scrapy.Field()
    postal_code = scrapy.Field()
When I run "scrapy runspider superheroes.py -o super_db -t csv", the output file is empty.
Could someone point out the error in the code above?
Thanks for any help!
You should change the XPath expressions inside the for loop, and yield every item instead of returning an array:
def parse_item(self, response):
    sel = Selector(response)
    tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
    for name, data in zip(tables.xpath('./h2/text()'), tables.xpath('./table')):
        item = Superheroes()
        item['name'] = name.extract()
        item['address'] = data.xpath('.//strong[1]/text()').extract()
        item['state'] = data.xpath('.//strong[2]/text()').extract()
        item['postal_code'] = data.xpath('.//strong[3]/text()').extract()
        yield item
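One follow-up note: strong[N]/text() selects the label text ("address:", "state:", and so on) rather than the value that comes after it. Below is a minimal sketch of how the expressions can be checked against the sample HTML without running a full crawl. It assumes a reasonably recent Scrapy where Selector(text=...) is available, sample_html is a placeholder for the <div id="superheroes"> markup from the question, and following-sibling::text()[1] is one possible way (not from the answer above) to pick up the value that follows each label:

from scrapy.selector import Selector

# Placeholder: paste the <div id="superheroes"> markup from the question here.
sample_html = """<div id="superheroes"> ... </div>"""

tables = Selector(text=sample_html).xpath(
    '//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')

for name, data in zip(tables.xpath('./h2/text()'), tables.xpath('./table')):
    # './/td[2]/strong[1]' matches the "address:" label inside each hero's table;
    # the value itself ("new york city") is the text node immediately after it.
    address = data.xpath('.//td[2]/strong[1]/following-sibling::text()[1]').extract()
    print({'name': name.extract(), 'address': address})

Running this with the real markup pasted in should print one dict per hero, which makes it easier to see whether an XPath is pointing at the label or at the value before wiring it into the spider.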