linux - How to extract single-/multiline regex-matching items from an unpredictably formatted file and put each one in a single line into output file?
I have a very large file that looks like this:
<a>text</a>text blah <b>data1</b>abc<b>data2</b> <b>data3</b>blahblah <c>text</c> <d>text</d> <x>blahblah<b>data4 data5 data6</b> <b>data7 </x>
That is, the formatting is unpredictable. I need to extract each <b>...</b> item (it might contain multiline text!) and put every one of them on a single separate line. At the same time, I need to replace runs of newlines and spaces with a single space.
Desired output:
<b>data1</b>
<b>data2</b>
<b>data3</b>
<b>data4 data5 data6</b>
All I've found is this two-step approach:
gawk '{if ($0 != "") { printf "%s", gensub(/\s+/, " ", "g", gensub(/\s+$/, "", "g", $0)) } }' path/to/input.txt > path/to/single-line.txt
and
grep -Pzo '(?s)<b>.*?</b>' path/to/single-line.txt > path/to/output.txt
But I don't like it! Having to convert a multi-GB text file into a single line does not seem nice. Is it possible to solve such a problem in a single pass, “on the fly”?
Assuming the document is well-formed, i.e. every <b> opening tag has a matching </b> closing tag, this may be what you need:
sed 's@<[/]\?b>@\n&\n@g' path/to/input.txt | awk 'BEGIN {buf=""} /<b>/ {y=1; buf=""} /<\/b>/ {y=0; print buf"</b>"} y {buf = buf$0} ' | tr -s ' '
Output:
<b>data1</b>
<b>data2</b>
<b>data3</b>
<b>data4 data5 data6</b>
Explanation:
We first use sed 's@<[/]\?b>@\n&\n@g' to move each <b> and </b> onto its own line.
Then we implement a simple parser in awk:
    BEGIN {buf=""} : initialize the buffer
    /<b>/ {y=1; buf=""} : when <b> is found, enable capturing (y=1) and empty the buffer
    /<\/b>/ {y=0; print buf"</b>"} : when </b> is found, disable capturing and print the buffer contents along with the closing tag
    y {buf = buf$0} : if the capturing flag is true, append the current line to the buffer
Finally, we pass the output through tr -s ' ' to squeeze runs of spaces into a single space.
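One thing to watch in the capture step: y {buf = buf$0} appends lines with no separator, so an item whose lines carry no leading or trailing spaces would come out glued together (data4data5data6). The following is a minimal sketch of a variant that joins captured lines with a space and normalizes whitespace inside awk, so tr is not needed. It is not the pipeline above; extract_b_spaced is a hypothetical helper name, it trims spaces just inside the tags (a design choice of this sketch), and it assumes GNU sed, since \? in the pattern and \n in the replacement are GNU extensions.

```shell
#!/bin/sh
# extract_b_spaced FILE  (hypothetical helper, a sketch under the assumptions above)
# Prints each well-formed <b>...</b> item of FILE on its own line,
# with runs of spaces, tabs and newlines collapsed to one space.
extract_b_spaced() {
    # Step 1: put every <b> and </b> on a line of its own (GNU sed).
    sed 's@<[/]\?b>@\n&\n@g' "$1" |
    # Step 2: collect the lines between an opening and a closing tag.
    awk '
        /<\/b>/ {
            if (y) {
                gsub(/[ \t]+/, " ", buf)   # collapse runs of blanks
                sub(/^ /, "", buf)         # drop the leading join space
                sub(/ $/, "", buf)         # drop a trailing space
                print "<b>" buf "</b>"
            }
            y = 0
            next
        }
        /<b>/ { y = 1; buf = ""; next }    # opening tag: start a fresh buffer
        y     { buf = buf " " $0 }         # inside an item: join with a space
    '
}
```

Usage would look like extract_b_spaced path/to/input.txt > path/to/output.txt. An unclosed <b> at the end of the file is silently dropped, as in the original pipeline.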
If you want it as a one-liner:
sed 's@<[/]\?b>@\n&\n@g' in.txt | awk 'BEGIN{b=""} /<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' | tr -s ' '
Or save it as a shell script (extract_b.sh):
#!/usr/bin/sh
sed 's@<[/]\?b>@\n&\n@g' "$1" | awk 'BEGIN{b=""} /<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' | tr -s ' '
and use it like this:
extract_b.sh path/to/input.txt > /path/to/output.txt
I also tested it with mawk, which was faster (27 MB/s vs. 17 MB/s in my tests), so you may prefer it for a multi-GB file.
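Throughput figures like these depend on the machine and the data, so it may be worth repeating the comparison on your own file. The sketch below runs the same one-liner with each awk implementation it finds on PATH and reports elapsed wall-clock time. bench_awk is a hypothetical helper name, the list of implementations is an assumption, and date +%s only gives whole seconds, which is coarse but usually enough for a multi-GB input.

```shell
#!/bin/sh
# bench_awk FILE  (hypothetical helper): time the extraction pipeline
# once per awk implementation available on this system.
bench_awk() {
    for impl in gawk mawk awk; do
        # Skip implementations that are not installed.
        command -v "$impl" >/dev/null 2>&1 || continue
        start=$(date +%s)
        sed 's@<[/]\?b>@\n&\n@g' "$1" |
            "$impl" '/<b>/{y=1;b=""} /<\/b>/{y=0;print b"</b>"} y{b=b$0}' |
            tr -s ' ' > /dev/null
        end=$(date +%s)
        echo "$impl: $((end - start)) s"
    done
}
```

Call it as bench_awk path/to/input.txt. Running it twice helps separate disk-cache effects from real differences between the implementations.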