python - Combine multiple tags with lxml
I have an HTML file that looks like:
    ...
    <p>
    <strong>this </strong>
    <strong>a lin</strong>
    <strong>e want </strong>
    <strong>join.</strong>
    </p>
    <p>
    2. <strong>but not </strong>
    <strong>touch this</strong>
    <em>maybe other tags well.</em>
    bla bla blah...
    </p>
    ...
What I need is: if all the tags in a 'p' block are 'strong', combine them into one line, i.e.
    <p> <strong>this a line want join.</strong> </p>
without touching the other block, since it contains something else.
Any suggestions? I am using lxml.
Update:
So far I have tried:
    for p in self.tree.xpath('//body/p'):
        if p.tail is None:  # no text before first element
            children = p.getchildren()
            for child in children:
                if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                    break
            else:
                # the loop finished without a break: every child is a bare <strong>
                etree.strip_tags(p, 'strong')
With this code I was able to strip off the strong tags in the desired part, giving:
    <p> this a line want join. </p>
So I need a way to put the tag back in...
I was able to do this with bs4 (BeautifulSoup):
    from bs4 import BeautifulSoup as bs

    html = """<p>
    <strong>this </strong>
    <strong>a lin</strong>
    <strong>e want </strong>
    <strong>join.</strong>
    </p>
    <p>
    <strong>but not </strong>
    <strong>touch this</strong>
    </p>"""

    # The lxml parser is named explicitly; it adds the <html><body> wrapper
    # seen in the final output.
    soup = bs(html, 'lxml')
    s = ''
    # Note: this uses the 0th <p> block ...[0];
    # make the appropriate change in your code.
    for t in soup.find_all('p')[0].text:
        # iterate character by character, dropping the newlines
        s = s + t.strip('\n')
    s = '<p><strong>' + s + '</strong></p>'
    print(s)  # prints: <p><strong>this a line want join.</strong></p>
Then use replace_with():
    # Replace the first <p> with the rebuilt markup.
    p_tag = soup.p
    p_tag.replace_with(bs(s, 'html.parser'))
    print(soup)
This prints:
    <html><body><p><strong>this a line want join.</strong></p> <p> <strong>but not </strong> <strong>touch this</strong> </p></body></html>
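The snippet above only handles the 0th 'p' block and would need adjusting for a whole file. A rough sketch of how the same idea might be applied to every 'p' block, while skipping blocks that contain anything besides 'strong' tags, is shown below. The function name merge_strong is hypothetical, 'html.parser' is used only to keep the example free of extra dependencies, and whitespace between the tags is assumed to be ignorable.

```python
from bs4 import BeautifulSoup, NavigableString


def merge_strong(soup):
    """Merge every <p> that contains only <strong> tags into a single <strong>."""
    for p in soup.find_all('p'):
        tags = p.find_all(True, recursive=False)  # direct child tags only
        # Need at least two children, and all of them must be <strong>.
        if len(tags) < 2 or any(t.name != 'strong' for t in tags):
            continue
        # Non-whitespace text directly under <p> means mixed content: skip.
        if any(isinstance(c, NavigableString) and c.strip() for c in p.children):
            continue
        merged = soup.new_tag('strong')
        merged.string = ''.join(t.get_text() for t in tags)
        p.clear()          # drop the old children and whitespace
        p.append(merged)


# Hypothetical sample based on the HTML in the question.
doc = """<p>
<strong>this </strong>
<strong>a lin</strong>
<strong>e want </strong>
<strong>join.</strong>
</p>
<p>
2. <strong>but not </strong>
<strong>touch this</strong>
<em>maybe other tags well.</em>
bla bla blah...
</p>"""

soup = BeautifulSoup(doc, 'html.parser')
merge_strong(soup)
print(soup)
```

Building the new tag with new_tag() instead of concatenating strings avoids re-parsing markup and keeps any special characters in the text escaped correctly.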