Using BeautifulSoup To Extract Text Between Line Breaks (e.g. <br /> Tags)
Solution 1:
If you just want any text that sits between two <br /> tags, you could do something like the following (Python 2 with BeautifulSoup 3):
from BeautifulSoup import BeautifulSoup, NavigableString, Tag
input = '''<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br />'''
soup = BeautifulSoup(input)
for br in soup.findAll('br'):
    next_s = br.nextSibling
    # Skip this <br /> unless it is immediately followed by a text node.
    if not (next_s and isinstance(next_s, NavigableString)):
        continue
    next2_s = next_s.nextSibling
    # Keep the text only if another <br /> closes it off.
    if next2_s and isinstance(next2_s, Tag) and next2_s.name == 'br':
        text = str(next_s).strip()
        if text:
            print "Found:", next_s
But perhaps I misunderstand your question? Your description of the problem doesn't seem to match up with the "important" / "non important" in your example data, so I've gone with the description ;)
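For readers on Python 3 with the bs4 package, the same sibling walk might look roughly like the sketch below. The function name text_between_brs and the shortened sample markup are my own choices for illustration, and "html.parser" is just one parser option; none of these come from the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened sample markup (an assumption, trimmed from the question's data).
html = """<br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />"""


def text_between_brs(markup):
    """Return stripped text nodes that sit directly between two <br> tags."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for br in soup.find_all("br"):
        next_s = br.next_sibling
        # The node right after this <br> must be a text node...
        if not isinstance(next_s, NavigableString):
            continue
        next2_s = next_s.next_sibling
        # ...and the node after that must be another <br>.
        if isinstance(next2_s, Tag) and next2_s.name == "br":
            text = next_s.strip()
            if text:  # ignore whitespace-only gaps between adjacent <br> tags
                found.append(text)
    return found


print(text_between_brs(html))
```

Like the original loop, this keeps every non-empty text node bracketed by two br tags, so it does not try to tell "important" from "not important" text.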
Solution 2:
So, for test purposes, let's assume that this chunk of HTML is inside a span tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
Now I'm going to parse it and find my span tag:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(x)
y = soup.find('span')
If you iterate over the generator in y.childGenerator(), you will get both the br tags and the text:
In [4]: for a in y.childGenerator(): print type(a), str(a)
   ....:
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 1
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Not Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 2
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 3
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Non Important Text
<type 'instance'> <br />
<class 'BeautifulSoup.NavigableString'>
Important Text 4
<type 'instance'> <br />
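The session above uses BeautifulSoup 3. In bs4 the .children attribute plays the role of childGenerator(), so a rough Python 3 sketch of the same idea, keeping only the non-empty text children, could look like this. The helper name child_texts and the shortened markup are assumptions made for the example.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened sample markup (an assumption, trimmed from the example above).
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br /></span>"""


def child_texts(markup, tag_name="span"):
    """Collect stripped, non-empty text nodes that are direct children of tag_name."""
    soup = BeautifulSoup(markup, "html.parser")
    parent = soup.find(tag_name)
    texts = []
    for child in parent.children:
        # Direct children are either Tag objects (the <br> tags) or text nodes.
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:  # drop the whitespace-only nodes between adjacent <br> tags
                texts.append(text)
    return texts


print(child_texts(x))
```

Because only direct children are inspected, text nested inside other tags within the span would be skipped by this sketch.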
Solution 3:
A slight improvement to Ken Kinder's answer. You can instead access the stripped_strings attribute of the BeautifulSoup element. For example, let's say your particular chunk of HTML is inside a span tag:
x = """<span><br />
Important Text 1
<br />
<br />
Not Important Text
<br />
Important Text 2
<br />
Important Text 3
<br />
<br />
Non Important Text
<br />
Important Text 4
<br /></span>"""
First we parse x with BeautifulSoup, then look for the element (in this case span), and then access its stripped_strings attribute, like so:
from bs4 import BeautifulSoup
# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")
text = list(span.stripped_strings)
Now print(text) will give the following output:
['Important Text 1',
'Not Important Text',
'Important Text 2',
'Important Text 3',
'Non Important Text',
'Important Text 4']
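One thing worth knowing is that stripped_strings walks every descendant, not just direct children, so inline markup inside a line splits it into several pieces. The sketch below illustrates this with a hypothetical b tag that is not in the original data, and also shows get_text, which joins the same stripped pieces with a separator of your choosing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <b> tag is added here only to show the nesting behaviour.
x = """<span><br />
Important Text 1
<br />
<br />
Important <b>Text</b> 2
<br /></span>"""

soup = BeautifulSoup(x, "html.parser")
span = soup.find("span")

# Every descendant string is yielded separately, stripped, with empty ones skipped.
pieces = list(span.stripped_strings)

# get_text joins those same stripped pieces using the given separator.
joined = span.get_text(separator=" ", strip=True)

print(pieces)
print(joined)
```

If each line between br tags must stay in one piece even when it contains inline tags, iterating over the direct children and calling get_text on each nested tag is probably the safer route.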
Solution 4:
The following worked for me:
for br in soup.findAll('br'):
    # Skip empty <br /> tags; indexing contents[0] on them would raise IndexError.
    if br.contents and str(type(br.contents[0])) == '<class \'BeautifulSoup.NavigableString\'>':
        print br.contents[0]