Adding a fix in soup2text for a common pathological case: <br><br> used instead
of <p /> to indicate paragraph breaks.

This changes the failed diff for /iesg/telechat/detail/354/ to show only three
differences: two are whitespace differences, and the third is a difference
between '@ietf.org. The' and '@ietf.org . The', which is an artifact of the text
extraction.  Will look at fixing that next.
 - Legacy-Id: 300
Henrik Levkowetz 2007-06-11 03:36:08 +00:00
parent da2de838c9
commit a7a6d956af


@@ -66,6 +66,8 @@ class TextSoup(BeautifulSoup):
        return str

def soup2text(html):
    # some preprocessing to handle common pathological cases
    html = re.sub("<br */?>[ \t\r\n]*(<br */?>)+", "<p/>", html)
    soup = TextSoup(html)
    return str(soup)
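
To see what the preprocessing step does in isolation, here is a minimal sketch that applies the same regular expression outside of soup2text. The helper name collapse_br_runs and the sample strings are hypothetical and not part of this commit; TextSoup is left out so the snippet runs with the standard library alone.

```python
import re

# Same pattern as in the commit: one <br> tag, optional whitespace, then one
# or more further <br> tags placed directly after each other.
BR_RUN = re.compile("<br */?>[ \t\r\n]*(<br */?>)+")


def collapse_br_runs(html):
    """Replace a run of two or more <br> tags with a single <p/>.

    Hypothetical helper for illustration; the commit does this inline
    in soup2text before handing the markup to TextSoup.
    """
    return BR_RUN.sub("<p/>", html)


if __name__ == "__main__":
    print(collapse_br_runs("one<br><br>two"))       # one<p/>two
    print(collapse_br_runs("one<br />\n<br/>two"))  # one<p/>two
    print(collapse_br_runs("one<br>two"))           # one<br>two
```

A single <br> is left alone, so ordinary line breaks survive. Two things worth noting about the pattern as written: whitespace is only allowed after the first tag of a run, and matching is case-sensitive, so uppercase <BR> tags are not collapsed.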