Parse Text in Python (Django)
I have text that looks like this:
link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite - professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite - professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite - professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')])
Questions:
Does anyone know the format of this text?
How do I go about parsing out the values of the url element? Example (from the above text): http://www.somesite.com/prof.php?pid=478 http://www.somesite.com/prof.php?pid=527
What Python library (or libraries) would you recommend for parsing this type of output: XML, JSON, etc.?
I am trying to loop through the entries and parse out the value of url only.
Keep in mind that I'm using Django.
Thank you for any help you can provide.
Edit: *current code:*

    import re
    from django.http import HttpResponse

    # fragment from inside the view function
    domainlinkoutputasstring = str(domainlinkoutput)
    r = re.compile(r" url='(.*?)',")  # erroneous, must be 're' compliant
    properdomains = r.findall(domainlinkoutputasstring)
    return HttpResponse(properdomains)
You can use a Python regular expression:

    import re

    text = (
        "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite - professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])"
        "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite - professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])"
        "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite - professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')])"
    )

    # create a regexp object that matches the value of 'url'
    # (the leading space keeps it from matching 'base_url')
    r = re.compile(r" url='(.*?)',")

    # print the matches
    print(r.findall(text))
    # ['http://www.somesite.com/prof.php?pid=478', 'http://www.somesite.com/prof.php?pid=527', 'http://www.somesite.com/prof.php?pid=645']
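If the same extraction is needed in more than one place, the regular expression above could be wrapped in a small helper. This is only a sketch: the names `URL_PATTERN` and `extract_urls` are made up for illustration, and the sample string is a shortened version of the text from the question.

```python
import re

# Same pattern as in the answer: the leading space keeps it from
# matching "base_url=u'...'". URL_PATTERN is a hypothetical name.
URL_PATTERN = re.compile(r" url='(.*?)',")


def extract_urls(link_text):
    """Return every url='...' value found in the string form of the links.

    extract_urls is an illustrative helper, not part of any library.
    """
    return URL_PATTERN.findall(link_text)


# Shortened sample in the same shape as the text from the question.
sample = (
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=478', text='a', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pid=478')])"
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=527', text='b', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pid=527')])"
)

urls = extract_urls(sample)
for url in urls:
    print(url)
```

In a Django view, the resulting list could then be joined into one string, for example with "\n".join(urls), before being passed to HttpResponse, so that the URLs do not run together in the response body.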