python - Using Scrapy to extract multiple data in table td elements


I am new to Scrapy and am trying to extract the following data: "name", "address", "state", "postal_code" from the sample HTML code below:

<div id="superheroes">
<table width="100%" border="0" ">
  <tr>
    <td valign="top">
      <h2>superheroes in new york</h2>
      <hr/>
    </td>
  </tr>
  <tr valign="top">
    <td width="75%">
      <h2>peter parker</h2>
      <hr />
      <table width="100%">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/spidey.jpg"/></td>
          <td width="87%" valign="top"><strong>address:</strong> new york city<br/>
            <strong>state:</strong>new york<br/>
            <strong>postal code:</strong>12345<br/>
            <strong>telephone:</strong> 555-123-4567</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="spiderman"><strong>read more</strong></a></td>
        </tr>
      </table>
      <h2>tony stark</h2>
      <hr />
      <table width="100%" border="0" cellpadding="2" cellspacing="2" valign="top">
        <tr valign="top">
          <td width="13%" height="70" valign="top"><img src="/img/ironman.jpg"/></td>
          <td width="87%" valign="top"><strong>address:</strong> new york city<br/>
            <strong>state:</strong> new york<br/>
            <strong>postal code:</strong> 54321<br/>
            <strong>telephone:</strong> 555-987-6543</td>
        </tr>
        <tr>
          <td height="18" valign="top">&nbsp;</td>
          <td align="right" valign="top"><a href="iron_man"><strong>read more</strong></a></td>
        </tr>
      </table>
    </td>
    <td width="25%">
      <script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
      </script>
    </td>
  </tr>
</table>
</div>

My superheroes.py contains the following code:

from scrapy.spider import CrawlSpider, Rule
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from superheroes.items import superheroes

items = []

class myspider(CrawlSpider):
    name = "superheroes"
    allowed_domains = ["www.somedomain.com"]
    start_urls = ["http://www.somedomain.com/ny"]
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
        for table in tables:
            item = superheroes()
            item['name'] = table.xpath('h2/text()').extract()
            item['address'] = table.xpath('/tr[1]/td[2]/strong[1]/text()').extract()
            item['state'] = table.xpath('/tr[1]/td[2]/strong[2]/text()').extract()
            item['postal_code'] = table.xpath('/tr[1]/td[2]/strong[3]/text()').extract()
            items.append(item)
        return items

And items.py contains:

import scrapy

class superheroes(scrapy.Item):
    name = scrapy.Field()
    address = scrapy.Field()
    state = scrapy.Field()
    postal_code = scrapy.Field()

When I ran "scrapy runspider superheroes.py -o super_db -t csv", the output file was empty.

Could anyone point out the error in the code above?

Thanks for your help!

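You should change the XPath expressions in the for loop and yield each item instead of returning a list. An XPath that starts with "/" is evaluated from the document root rather than from the current selector, so the paths inside the loop need to be relative (starting with "./" or ".//"):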

def parse_item(self, response):
    sel = Selector(response)
    tables = sel.xpath('//div[contains(@id, "superheroes")]/table/tr[2]/td[1]')
    # Pair each <h2> name with the <table> that follows it
    for name, data in zip(tables.xpath('./h2/text()'), tables.xpath('./table')):
        item = superheroes()
        item['name'] = name.extract()
        item['address'] = data.xpath('.//strong[1]/text()').extract()
        item['state'] = data.xpath('.//strong[2]/text()').extract()
        item['postal_code'] = data.xpath('.//strong[3]/text()').extract()
        yield item
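One thing to watch: 'strong[n]/text()' selects the text inside the <strong> tag, which is the label ("address:"), not the value that comes after it. In Scrapy the value is probably reachable with something like './/strong[1]/following-sibling::text()[1]'. The sketch below illustrates the same label/value distinction and the h2/table pairing using only the standard library, so it can be run without Scrapy. The trimmed fragment and the helper name parse_heroes are made up for illustration and are not part of the original spider.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the <td> that holds the <h2>/<table> pairs
# (hypothetical fragment; the real page has more markup around it).
FRAGMENT = """
<td>
  <h2>peter parker</h2>
  <table><tr><td><strong>address:</strong> new york city<br/>
    <strong>state:</strong>new york<br/>
    <strong>postal code:</strong>12345<br/>
    <strong>telephone:</strong> 555-123-4567</td></tr></table>
  <h2>tony stark</h2>
  <table><tr><td><strong>address:</strong> new york city<br/>
    <strong>state:</strong> new york<br/>
    <strong>postal code:</strong> 54321<br/>
    <strong>telephone:</strong> 555-987-6543</td></tr></table>
</td>
"""


def parse_heroes(fragment):
    """Return one dict per hero, pairing each <h2> with the <table> after it."""
    cell = ET.fromstring(fragment)
    heroes = []
    for heading, table in zip(cell.findall("h2"), cell.findall("table")):
        # strong.text is the label ("address:"); strong.tail is the text node
        # that follows the tag, i.e. the value we actually want.
        fields = {
            strong.text.strip().rstrip(":"): (strong.tail or "").strip()
            for strong in table.iter("strong")
        }
        heroes.append({
            "name": heading.text.strip(),
            "address": fields.get("address", ""),
            "state": fields.get("state", ""),
            "postal_code": fields.get("postal code", ""),
        })
    return heroes


heroes = parse_heroes(FRAGMENT)
```

Looking fields up by label rather than by position (strong[1], strong[2], ...) also keeps working if a hero's table lists the fields in a different order.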
