Parse Text in Python (Django)

I have text that looks like this:

link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite -  professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite -  professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite -  professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')])

Questions:

Does anyone know what format this text is in?
How would I go about parsing out the values of the url element? For example (from the text above): http://www.somesite.com/prof.php?pid=478 and http://www.somesite.com/prof.php?pid=527
What Python library (or libraries) would you recommend for parsing this type of output (XML, JSON, etc.)?

I am trying to loop through the links and parse out only the value of url.

Keep in mind that I'm using Django.

Thanks for any help you can provide.

Edit: current code:

    domainlinkoutputasstring = str(domainlinkoutput)
    r = re.compile(" url='(.*?)',")  # erroneous, must be 're' compliant
    properdomains = r.findall(domainlinkoutputasstring)
    return HttpResponse(properdomains)

You can use a Python regular expression:

    import re

    text = "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=478', text='somesite -  professor rating of louis scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=478'), ('h', 'id=serp,5105.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=527', text='somesite -  professor rating of jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=527'), ('h', 'id=serp,5118.1')])link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', url='http://www.somesite.com/prof.php?pid=645', text='somesite -  professor rating of david kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pid=645'), ('h', 'id=serp,5131.1')])"

    # Create a regexp object that matches the value of 'url'.
    # The leading space keeps 'base_url=' from matching.
    r = re.compile(" url='(.*?)',")

    # Print the matches
    print(r.findall(text))

    # Output:
    # ['http://www.somesite.com/prof.php?pid=478', 'http://www.somesite.com/prof.php?pid=527', 'http://www.somesite.com/prof.php?pid=645']
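
The regex above can be wrapped in a small helper so that a Django view only has to call one function. What follows is a minimal sketch, not a definitive implementation: `extract_urls`, `URL_PATTERN`, `Link`, and the shortened `sample` string are names introduced here for illustration only. It also assumes, which the question does not confirm, that the printed text is the repr of link objects exposing a `url` attribute; if that holds, reading the attribute avoids string parsing altogether.

```python
import re
from collections import namedtuple

# Same pattern as in the answer; the leading space stops "base_url=" from matching.
URL_PATTERN = re.compile(r" url='(.*?)',")


def extract_urls(text):
    """Return every url='...' value found in the printed link text."""
    return URL_PATTERN.findall(text)


# Shortened sample built from the first two links in the question
# (the text= and attrs= fields are omitted for brevity).
sample = (
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=478', tag='a')"
    "link(base_url=u'http://www.bing.com/search?q=site%3asomesite.com', "
    "url='http://www.somesite.com/prof.php?pid=527', tag='a')"
)

urls = extract_urls(sample)

# One URL per line is a convenient string to hand to a response object.
body = "\n".join(urls)
print(body)

# Hypothetical stand-in for the real link objects (an assumption, see above).
Link = namedtuple("Link", ["base_url", "url"])
links = [Link("http://www.bing.com/search?q=site%3asomesite.com", u) for u in urls]

# If the original objects are still available, no parsing is needed at all.
urls_from_objects = [link.url for link in links]
```

In the view from the question, `return HttpResponse(body)` would then send one URL per line instead of passing the raw list.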
