Parse Text in Python (Django)


I have text that looks like this:

link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite -  professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite -  professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite -  professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')]) 

Questions

  1. Does anyone know the format of this text?

  2. How would I go about parsing out the values of the url element? For example (from the above text): http://www.somesite.com/prof.php?pid=478 http://www.somesite.com/prof.php?pid=527

  3. What Python library(s) would you recommend for parsing this type of output (XML, JSON, etc.)?

I am trying to loop through each link and parse out the value of url only.

Keep in mind I'm using Django.

Thanks for any help you can provide.

Edit: *current code:*

    domainlinkoutputasstring = str(domainlinkoutput)
    r = re.compile(" url='(.*?)',")  # erroneous, must be 're' compliant
    properdomains = r.findall(domainlinkoutputasstring)
    return HttpResponse(properdomains)

You can use a Python regular expression:

    import re

    text = "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite -  professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite -  professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite -  professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')])"

    # Create a regexp object that matches the value of 'url'.
    # The leading space keeps it from matching 'base_url'.
    r = re.compile(" url='(.*?)',")

    # Print all matches
    print(r.findall(text))

    # Output:
    # ['http://www.somesite.com/prof.php?pid=478', 'http://www.somesite.com/prof.php?pid=527', 'http://www.somesite.com/prof.php?pid=645']
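If it helps, the same idea can be wrapped in a small helper. This is only a sketch: the names `URL_FIELD` and `extract_urls` are made up for illustration, the sample string is a shortened version of the text in the question, and the pattern assumes the url value is always single-quoted and never contains a single quote itself. Instead of relying on a leading space, it uses a lookbehind so that `base_url` is skipped.

```python
import re

# Match url='...' (optionally url=u'...') but not base_url=u'...':
# the lookbehind rejects a match when "url" is preceded by a word
# character, and "_" counts as a word character.
URL_FIELD = re.compile(r"(?<!\w)url=u?'([^']*)'")


def extract_urls(link_dump):
    """Return every url value found in the dumped link text, in order."""
    return URL_FIELD.findall(link_dump)


# Shortened sample in the same shape as the text from the question.
sample = (
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=478', text='a', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pid=478')])"
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=527', text='b', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pid=527')])"
)

urls = extract_urls(sample)
print(urls)
```

In the Django view, returning `HttpResponse("\n".join(urls), content_type="text/plain")` is probably closer to what you want than passing the list directly, since a list handed to `HttpResponse` is joined with no separator between the items.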
