Adam and I were having an offline discussion, and some testing shows that awk outperforms sed on this input, at roughly 0.41s versus 0.68s of real time:<div><br></div><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; "><div>
[sean@bob tmp]$ W=/usr/share/dict/words</div><div>[sean@bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output end; head -1000 $W) > infile</div><div>[sean@bob tmp]$ wc -l infile</div><div>481831 infile</div>
<div>[sean@bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null</div><div><br></div><div>real 0m0.411s</div><div>user 0m0.393s</div><div>sys 0m0.016s</div><div>[sean@bob tmp]$ time sed -n '/output start/,/output end/p' < infile > /dev/null</div>
<div><br></div><div>real 0m0.678s</div><div>user 0m0.631s</div><div>sys 0m0.029s</div></span><div><br></div><div>I ran it a bunch more times and the results were similar. YMMV, benchmarks are lies, etc.</div><div>
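A timing comparison only means something if both commands do the same work, so a quick sanity check is to confirm that the awk and sed range forms emit identical lines. This is a minimal sketch on a tiny made-up input; the temp file and the variable names (infile, awk_out, sed_out) are illustrative assumptions, not part of the original test.<br>

```shell
# Build a small stand-in for the benchmark input
# (the original run used /usr/share/dict/words instead).
infile=$(mktemp)
{
  printf 'junk %s\n' 1 2 3
  echo 'output start'
  printf 'word %s\n' 1 2 3 4 5
  echo 'output end'
  printf 'junk %s\n' 4 5 6
} > "$infile"

# Same range expression in both tools; both include the marker lines.
awk_out=$(awk '/output start/,/output end/' "$infile")
sed_out=$(sed -n '/output start/,/output end/p' "$infile")

if [ "$awk_out" = "$sed_out" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi

rm -f "$infile"
```

Once the outputs agree, each command can be wrapped in `time` against the larger file as above.<br>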
<br></div><div>Sean</div><br><div class="gmail_quote">On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux <span dir="ltr"><<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
I may have misinterpreted the question before. If you want the "output<br>
start" and "output end" marker lines in the output (which I guess your<br>
grep pipeline would do), then Adam's sed script will do that. Mine,<br>
using the "d" commands, will output only the data in between. The<br>
shortest awk script to do the same would be:<br>
<br>
awk '/output start/{s=1};s==1;/output end/{s=0};'<br>
<br>
or<br>
<br>
awk '/output end/{s=0};s==1;/output start/{s=1};'<br>
<br>
The first is a simplification of Adam's, which outputs the output marker<br>
lines, while the second, using the same statements in the opposite<br>
order, suppresses the markers. Of perl, awk and sed, I suspect sed is<br>
the most lightweight, and probably the quickest, unless perl can<br>
outperform sed on larger files. awk has a reputation for being pretty<br>
slow. I tend to favour sed unless awk or perl makes the job a lot easier.<br>
<br>
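A quick way to see the difference between the two orderings is to run both against a small sample shaped like John's file. The sample text and the variable names below are made up for illustration; only the two awk programs come from the message above.<br>

```shell
# Hypothetical sample input mirroring the structure in the original question.
sample='garbage
output start
good data 1
good data 2
output end
garbage'

# Flag is set before the print test and cleared after it:
# both marker lines are printed.
with_markers=$(printf '%s\n' "$sample" |
  awk '/output start/{s=1};s==1;/output end/{s=0};')

# Flag is cleared before the print test and set after it:
# both marker lines are dropped.
without_markers=$(printf '%s\n' "$sample" |
  awk '/output end/{s=0};s==1;/output start/{s=1};')

printf 'with markers:\n%s\n\nwithout markers:\n%s\n' \
  "$with_markers" "$without_markers"
```

The only thing that changes between the two is where the bare `s==1` pattern (which prints the line when true) sits relative to the assignments.<br>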
Gilles<br>
<div class="im"><br>
On 11/10/2010 11:13 AM, Adam Thompson wrote:<br>
> The AWK version is functionally identical, and not very much shorter, or<br>
> any more elegant:<br>
><br>
> awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'<br>
><br>
> (the perl version can generally be made that small, too.)<br>
><br>
><br>
><br>
> I would instead suggest sed(1), since this is precisely what it’s<br>
> designed for:<br>
><br>
> sed -n '/output start/,/output end/p' < infile<br>
><br>
><br>
><br>
> -Adam<br>
><br>
><br>
><br>
><br>
><br>
> *From:* <a href="mailto:roundtable-bounces@muug.mb.ca">roundtable-bounces@muug.mb.ca</a><br>
> [mailto:<a href="mailto:roundtable-bounces@muug.mb.ca">roundtable-bounces@muug.mb.ca</a>] *On Behalf Of *Sean Walberg<br>
> *Sent:* Wednesday, November 10, 2010 10:56<br>
> *To:* Continuation of Round Table discussion<br>
> *Subject:* Re: [RndTbl] Command line challenge: trim garbage from start<br>
> and end of a file.<br>
><br>
><br>
><br>
> OTTOMH:<br>
><br>
><br>
><br>
> perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output<br>
> start/); $state = 2 if ($state == 1 and /output end/) ; print if<br>
> ($state == 1)' < infile > outfile<br>
><br>
> I'll bet there's a shorter AWK version though.<br>
><br>
><br>
><br>
> Sean<br>
><br>
><br>
><br>
> On Wed, Nov 10, 2010 at 10:51 AM, John Lange <<a href="mailto:john@johnlange.ca">john@johnlange.ca</a><br>
</div><div class="im">> <mailto:<a href="mailto:john@johnlange.ca">john@johnlange.ca</a>>> wrote:<br>
><br>
> I have files with the following structure:<br>
><br>
> garbage<br>
> garbage<br>
> garbage<br>
> output start<br>
> .. good data<br>
> .. good data<br>
> .. good data<br>
> .. good data<br>
> output end<br>
> garbage<br>
> garbage<br>
> garbage<br>
><br>
> How can I extract the good data from the file trimming the garbage<br>
> from the beginning and end?<br>
><br>
> The following works just fine but it's dirty because I don't like the<br>
> fact that I have to pick an arbitrarily large number for the "before"<br>
> and "after" values.<br>
><br>
> grep -A 999999 "output start" <infile> | grep -B 999999 "output end" ><br>
> newfile<br>
><br>
> Can anyone come up with something more elegant?<br>
><br>
> --<br>
> John Lange<br>
</div>> <a href="http://www.johnlange.ca" target="_blank">www.johnlange.ca</a> <<a href="http://www.johnlange.ca" target="_blank">http://www.johnlange.ca</a>><br>
<div class="im"><br>
--<br>
Gilles R. Detillieux E-mail: <<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>><br>
Spinal Cord Research Centre WWW: <a href="http://www.scrc.umanitoba.ca/" target="_blank">http://www.scrc.umanitoba.ca/</a><br>
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)<br>
_______________________________________________<br>
</div><div><div></div><div class="h5">Roundtable mailing list<br>
<a href="mailto:Roundtable@muug.mb.ca">Roundtable@muug.mb.ca</a><br>
<a href="http://www.muug.mb.ca/mailman/listinfo/roundtable" target="_blank">http://www.muug.mb.ca/mailman/listinfo/roundtable</a><br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Sean Walberg <<a href="mailto:sean@ertw.com">sean@ertw.com</a>> <a href="http://ertw.com/">http://ertw.com/</a><br>
</div>