python - Combine multiple tags with lxml


I have an HTML file that looks like this:

...
<p>
    <strong>this is </strong>
    <strong>a lin</strong>
    <strong>e I want to </strong>
    <strong>join.</strong>
</p>
<p>
    2.
    <strong>but do not </strong>
    <strong>touch this</strong>
    <em>maybe other tags as well.</em>
    bla bla blah...
</p>
...

What I need is this: if all the tags in a 'p' block are 'strong', combine them into one line, i.e.

<p>
    <strong>this is a line I want to join.</strong>
</p>

without touching the other block, since it contains something else.

Any suggestions? I am using lxml.

Update:

So far I have tried:

for p in self.tree.xpath('//body/p'):
    if p.tail is None:  # no text before first element
        children = p.getchildren()
        for child in children:
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            # the loop finished without a break: every child is a <strong> with no tail text
            etree.strip_tags(p, 'strong')

With this code I am able to strip off the strong tags in the part I want, giving:

<p>
    this is a line I want to join.
</p>

So I need a way to put the tag back in...

I was able to do this with bs4 (BeautifulSoup):

from bs4 import BeautifulSoup as bs

html = """<p>
<strong>this is </strong>
<strong>a lin</strong>
<strong>e I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>but do not </strong>
<strong>touch this</strong>
</p>"""

# the lxml parser wraps the fragment in <html><body>
soup = bs(html, 'lxml')
s = ''
# note that this uses only the 0th <p> block (...[0]);
# make the appropriate change in your code
for t in soup.find_all('p')[0].text:
    s = s + t.strip('\n')  # drop the newline characters between the tags
s = '<p><strong>' + s + '</strong></p>'
print(s)
# prints: <p><strong>this is a line I want to join.</strong></p>

Then use replace_with():

p_tag = soup.p  # the first <p> in the document
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)

prints:

<html><body><p><strong>this is a line I want to join.</strong></p>
<p>
<strong>but do not </strong>
<strong>touch this</strong>
</p></body></html>
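The snippet above handles only the 0th p block. A rough sketch of the "appropriate change" might loop over every p and merge only those that contain nothing but strong tags. The helper name is_strong_only and the sample markup are assumptions for illustration, and html.parser is used here, so no html/body wrapper is added to the output.

```python
from bs4 import BeautifulSoup, Tag

# Sample markup modelled on the question (an assumption, not the real file).
html = """<p>
<strong>this is </strong>
<strong>a lin</strong>
<strong>e I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>but do not </strong>
<strong>touch this</strong>
<em>maybe other tags as well.</em>
bla bla blah...
</p>"""

soup = BeautifulSoup(html, 'html.parser')


def is_strong_only(p):
    """True if <p> directly holds two or more <strong> tags and otherwise only whitespace.

    (Hypothetical helper, not part of bs4.)
    """
    strong_count = 0
    for child in p.children:
        if isinstance(child, Tag):
            if child.name != 'strong':
                return False
            strong_count += 1
        elif child.strip():
            # a non-blank string directly under <p>
            return False
    return strong_count >= 2


for p in soup.find_all('p'):
    if is_strong_only(p):
        merged = soup.new_tag('strong')
        # join the text of all children, dropping the newlines between them
        merged.string = p.get_text().replace('\n', '')
        p.clear()
        p.append(merged)

result = str(soup)
print(result)
```

Building the replacement with new_tag() avoids parsing a second HTML string, and any markup nested inside the strong tags is flattened to plain text by get_text().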
