linux - How to extract single-/multiline regex-matching items from an unpredictably formatted file and put each one on a single line in the output file?


I have a very large file that looks like this:

<a>text</a>text blah   <b>data1</b>abc<b>data2</b>       <b>data3</b>blahblah     <c>text</c>   <d>text</d> <x>blahblah<b>data4       data5           data6</b>       <b>data7 </x> 

That is, the formatting is unpredictable. I need to extract each <b>...</b> item (it might contain multiline text!) and put every one of them on a single separate line. At the same time, I need to replace runs of newlines and spaces with a single space.

Desired output:

<b>data1</b>
<b>data2</b>
<b>data3</b>
<b>data4 data5 data6</b>

All I've found is a two-step way:

gawk '{if ($0 != "") { printf "%s", gensub(/\s+/, " ", "g", gensub(/\s+$/, "", "g", $0)) } }' path/to/input.txt > path/to/single-line.txt   

and

grep -Pzo '(?s)<b>.*?</b>' path/to/single-line.txt > path/to/output.txt

But I don't like it! Having to convert a multi-GB text file into a single line... does not seem nice. Is it possible to solve such a problem in a single pass, "on the fly"?

Assuming the document is well-formed, i.e. every <b> opening tag has a matching </b> closing tag, this may be what you need:

sed 's@<[/]\?b>@\n&\n@g' path/to/input.txt |
awk 'BEGIN {buf=""}
   /<b>/ {y=1; buf=""}
   /<\/b>/ {y=0; print buf"</b>"}
   y {buf = buf$0}
' | tr -s ' '

Output:

<b>data1</b>
<b>data2</b>
<b>data3</b>
<b>data4 data5 data6</b>

Explanation:

We first use sed 's@<[/]\?b>@\n&\n@g' to move each <b> and </b> onto its own line.
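To see what this first stage hands to awk, here is a minimal sketch that runs the same sed expression on a small made-up sample. The function name split_b_tags and the sample text are assumptions for illustration only. GNU sed is assumed, as in the command above, since \? and \n in the replacement are GNU extensions.

```shell
# split_b_tags is a hypothetical helper name used only for this illustration.
# GNU sed is assumed: \? and \n in the replacement are GNU extensions.
split_b_tags() {
  sed 's@<[/]\?b>@\n&\n@g' "$@"
}

# Made-up two-line sample, not the input from the question.
printf '%s\n' 'x<b>data1</b>abc<b>data4' '   data5</b>' | split_b_tags
```

Each tag ends up on a line of its own, with the text between tags on the lines in between. A tag at the very end of an input line leaves an empty line behind it, which the awk stage simply skips.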

Then we implement a simple parser in awk:

  • BEGIN {buf=""} : initialize the buffer
  • /<b>/ {y=1; buf=""} : when <b> is found, enable capturing (y=1) and empty the buffer
  • /<\/b>/ {y=0; print buf"</b>"} : when </b> is found, disable capturing and print the buffer contents along with the closing tag
  • y {buf = buf$0} : if the capturing flag is true, append the current line to the buffer

Finally, we pass the output through tr -s ' ' to squeeze runs of multiple spaces into a single space.
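One caveat: the rule y {buf = buf$0} concatenates captured lines with nothing between them, so multi-line content keeps its word boundaries only when the continuation lines carry leading or trailing whitespace. The following is a hedged variant, not the command above: it joins lines with a space and does the whitespace squeezing inside awk, so tr is not needed. The name extract_b_joined is hypothetical, and GNU sed is still assumed.

```shell
# extract_b_joined is a hypothetical name for this variant of the pipeline.
extract_b_joined() {
  sed 's@<[/]\?b>@\n&\n@g' "$@" |
    awk '
      # Opening tag: start capturing with an empty buffer.
      /<b>/ { y = 1; buf = ""; next }
      # Closing tag: squeeze blanks, trim both ends, print the item.
      /<\/b>/ {
        if (y) {
          gsub(/[ \t]+/, " ", buf)
          sub(/^ /, "", buf)
          sub(/ $/, "", buf)
          print "<b>" buf "</b>"
        }
        y = 0
        next
      }
      # Inside a <b> element: join captured lines with a single space.
      y { buf = buf " " $0 }
    '
}
```

Unlike the original pipeline, this version also trims whitespace directly inside the tags and treats tabs as blanks, which may or may not be what you want.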

If you want it as a one-liner:

sed 's@<[/]\?b>@\n&\n@g' in.txt | awk 'BEGIN{b=""} /<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' | tr -s ' '

Or save it as a shell script (extract_b.sh):

#!/usr/bin/sh
sed 's@<[/]\?b>@\n&\n@g' "$1" | awk 'BEGIN{b=""} /<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' | tr -s ' '

And use it like this:

extract_b.sh path/to/input.txt > /path/to/output.txt 

I also tested mawk, which was faster (27 MB/s vs. 17 MB/s in my tests), so you may prefer it for a multi-GB file.
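If you want the script to pick mawk automatically, one possible sketch is below. The function name extract_b_fast and the AWK variable are assumptions for illustration. The awk program itself is the one-liner from above, and no speed-up is guaranteed on your data.

```shell
# Prefer mawk when it is installed, otherwise fall back to the default awk.
if command -v mawk >/dev/null 2>&1; then
  AWK=mawk
else
  AWK=awk
fi

# extract_b_fast is a hypothetical name; the awk program is unchanged.
extract_b_fast() {
  sed 's@<[/]\?b>@\n&\n@g' "$@" |
    "$AWK" 'BEGIN{b=""} /<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' |
    tr -s ' '
}
```

It is called the same way as the script, for example extract_b_fast path/to/input.txt > path/to/output.txt.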

