Adding a fix in soup2text for a common pathological case: <br><br> used instead of <p /> to indicate paragraph breaks. With this change, the failed diff for /iesg/telechat/detail/354/ shows only three differences: two are whitespace differences, and the third is a difference between '@ietf.org. The' and '@ietf.org . The', which is an artifact of the text extraction. Will look at fixing that next.
- Legacy-Id: 300
parent da2de838c9
commit a7a6d956af
@@ -66,6 +66,8 @@ class TextSoup(BeautifulSoup):
         return str
 
 def soup2text(html):
+    # some preprocessing to handle common pathological cases
+    html = re.sub("<br */?>[ \t\r\n]*(<br */?>)+", "<p/>", html)
     soup = TextSoup(html)
     return str(soup)
 
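The effect of the preprocessing step can be seen in isolation with a small sketch. It reuses the regular expression from the diff unchanged; the helper name collapse_br_runs and the sample string are hypothetical and not part of the commit, and TextSoup is left out so the snippet runs on its own.

```python
import re

# Same pattern as in the diff: one <br> (optionally self-closed), optional
# whitespace, then one or more directly adjacent <br> tags.
BR_RUN = re.compile("<br */?>[ \t\r\n]*(<br */?>)+")


def collapse_br_runs(html):
    """Hypothetical helper: replace each run of <br> tags with one <p/>."""
    return BR_RUN.sub("<p/>", html)


# Hypothetical input mixing the pathological case with an ordinary line break.
sample = (
    "First paragraph.<br><br>Second paragraph.<br />\n<br/>"
    "Third.<br>Same paragraph."
)

print(collapse_br_runs(sample))
# First paragraph.<p/>Second paragraph.<p/>Third.<br>Same paragraph.
```

A single <br> is left alone, so ordinary line breaks survive. Two properties of the pattern as written are worth keeping in mind: it is case-sensitive, so <BR><BR> is not rewritten, and whitespace is only allowed after the first tag, so "<br> <br> <br>" collapses to "<p/> <br>" rather than to a single <p/>.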