Need Python Lxml Syntax Help For Parsing Html
Solution 1:
Okay, first, with regard to parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the BeautifulSoup support that ships with lxml (lxml.html.soupparser). That way you will also reap the benefit of a nice XPath or CSS selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
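As a rough illustration of why the lxml.html parser is convenient for real-world pages, here is a minimal sketch that feeds it markup with unclosed tags. The sample string and variable names are made up for this example and do not come from the original question.

```python
from lxml.html import fromstring, tostring

# Hypothetical, deliberately sloppy markup: no closing </p> or </b> tags.
BROKEN_HTML = '<p>first<p>second <b>bold'

# fromstring() repairs the markup while parsing; because the fragment holds
# more than one top-level element, lxml wraps the result in a container element.
tree = fromstring(BROKEN_HTML)

paragraphs = tree.xpath('.//p')
bold_text = tree.xpath('.//b')[0].text

# Serializing the tree shows the repaired, properly closed markup.
print(tostring(tree, encoding='unicode'))
```

The parsed tree can then be queried with the same find(), xpath() and cssselect() calls discussed in the rest of this answer.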
Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.
Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.
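To make the difference concrete, here is a small sketch that contrasts findall() with xpath() on the same tree. The sample table and the variable names are assumptions made up for illustration.

```python
from lxml.html import fromstring

# Hypothetical sample markup, not taken from the original question.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/one">One</a></td></tr>
  <tr><td><a href="/two">Two</a></td><td>no link</td></tr>
</table>
"""

tree = fromstring(SAMPLE_HTML)

# findall() handles simple ElementTree-style paths.
links_via_findall = tree.findall('.//tr/td/a')

# xpath() accepts full XPath 1.0, including functions such as contains().
links_via_xpath = tree.xpath('.//tr/td/a[contains(@href, "two")]')

# xpath() can also return attribute values directly as strings.
hrefs = tree.xpath('.//tr/td/a/@href')
```

The last two queries have no findall() equivalent: the ElementTree path syntax has no function calls and cannot select attribute values.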
Here are some examples, with an HTML string parsed like this:
from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
Using the CSS selector interface, your program would look roughly like this:
# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
The equivalent using the xpath() method would be:
# Find all 'a' elements inside 'tr' table rows with XPath
for a in mySearchTree.xpath('.//tr//a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
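One detail worth checking when translating the CSS selector 'tr a' to XPath is how deep the links sit below the row. The sketch below, using made-up sample markup, compares a path with exactly one intermediate element ('/*/') against the descendant step ('//'), which is the closer match for a CSS descendant selector.

```python
from lxml.html import fromstring

# Hypothetical sample markup for illustration only.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/direct">direct</a></td></tr>
  <tr><td><span><a href="/nested">nested</a></span></td></tr>
</table>
"""

tree = fromstring(SAMPLE_HTML)

# '/*/' requires exactly one element between the row and the link.
one_level = [a.get('href') for a in tree.xpath('.//tr/*/a')]

# '//' matches a link at any depth below the row, like the CSS selector 'tr a'.
any_depth = [a.get('href') for a in tree.xpath('.//tr//a')]

for href in any_depth:
    print('found link to href "%s"' % href)
```

If every link in your table sits directly inside a cell, both paths return the same elements; they only diverge once links are wrapped in extra markup.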
Solution 2:
Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.