<div>I used whatever was on my Fedora 13 box:</div><div><br></div><div>[sean@bob ~]$ awk --version</div><div>GNU Awk 3.1.8</div><div>[sean@bob ~]$ sed --version</div><div>GNU sed version 4.2.1</div><div><br></div><div>The difference gets much bigger if you use a more complex regexp.</div>
<div><br></div><div><div>[sean@bob tmp]$ time awk '/.*output.*start.*/,/.*output.*end.*/' < infile > /dev/null</div><div><br></div><div>real 0m0.450s</div><div>user 0m0.393s</div><div>sys 0m0.010s</div>
<div>[sean@bob tmp]$ time sed -n '/.*output.*start.*/,/.*output.*end.*/p' < infile > /dev/null</div><div><br></div><div>real 0m1.726s</div><div>user 0m1.495s</div><div>sys 0m0.017s</div></div><div>
<br></div><div>Awk didn't bat an eye. Strangely, although the leading and trailing .*'s are completely superfluous, they seem to throw sed for a loop, even if the middle .* is replaced with a space.</div>
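<div><br></div><div>To check that the outer .*'s really are redundant, here is a minimal sketch that runs sed with and without them on a tiny made-up sample file and compares the output. The file contents and the variable names tmp, trimmed and padded are my own, not part of the original test. It only checks that both forms select the same lines; it says nothing about timing.</div>

```shell
# Build a small sample file (hypothetical contents, not the original infile).
tmp=$(mktemp)
printf '%s\n' garbage 'output start' 'good data 1' 'good data 2' 'output end' garbage > "$tmp"

# Range without the leading/trailing .* on each pattern.
trimmed=$(sed -n '/output.*start/,/output.*end/p' "$tmp")

# Same range with the redundant leading/trailing .* added back.
padded=$(sed -n '/.*output.*start.*/,/.*output.*end.*/p' "$tmp")

# An unanchored regex already matches anywhere in the line,
# so both commands should print the same lines.
if [ "$trimmed" = "$padded" ]; then
    echo "same output"
else
    echo "different output"
fi

rm -f "$tmp"
```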
<div><br></div><div>Sean</div><br><div class="gmail_quote">On Wed, Nov 10, 2010 at 12:37 PM, Gilles Detillieux <span dir="ltr"><<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">Interesting! Which version of awk did you test? I have to admit I<br>
haven't looked into awk performance in quite some time. My early<br>
experience, on older Unix systems (pre-Linux), confirmed what I had read<br>
about awk being pretty slow. But I seem to recall that even on older<br>
Linux systems, gawk wasn't exactly speedy then either. I imagine the<br>
GNU awk developers must have remedied that since, though, if that is<br>
indeed what you were testing.<br>
<br>
A search online for discussions of awk performance turned up one from 2002<br>
suggesting gawk was much faster than nawk, and another from this past<br>
August that suggested the opposite. Perhaps the developers of the two<br>
have been leap-frogging each other with optimizations to their code?<br>
<div class="im"><br>
On 11/10/2010 11:56 AM, Sean Walberg wrote:<br>
> Adam and I were having an offline discussion, and some testing shows<br>
> that AWK outperforms SED by a slight margin:<br>
><br>
> [sean@bob tmp]$ W=/usr/share/dict/words<br>
> [sean@bob tmp]$ (tail -1000 $W; echo output start; cat $W; echo output<br>
> end; head -1000 $W) > infile<br>
> [sean@bob tmp]$ wc -l infile<br>
> 481831 infile<br>
> [sean@bob tmp]$ time awk '/output start/,/output end/' < infile > /dev/null<br>
><br>
> real 0m0.411s<br>
> user 0m0.393s<br>
> sys 0m0.016s<br>
> [sean@bob tmp]$ time sed -n '/output start/,/output end/p' < infile ><br>
> /dev/null<br>
><br>
> real 0m0.678s<br>
> user 0m0.631s<br>
> sys 0m0.029s<br>
><br>
> I ran it a bunch more times and the results were similar. YMMV,<br>
> benchmarks are lies, etc.<br>
><br>
> Sean<br>
><br>
> On Wed, Nov 10, 2010 at 11:32 AM, Gilles Detillieux<br>
</div><div><div></div><div class="h5">> <<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>> wrote:<br>
><br>
> I may have misinterpreted the question before. If you want the "output<br>
> start" and "output end" marker lines in the output (which I guess your<br>
> grep pipeline would do), then Adam's sed script will do that. Mine,<br>
> using the "d" commands, will output only the data in between. The<br>
> shortest awk script to do the same would be:<br>
><br>
> awk '/output start/{s=1};s==1;/output end/{s=0};'<br>
><br>
> or<br>
><br>
> awk '/output end/{s=0};s==1;/output start/{s=1};'<br>
><br>
> The first is a simplification of Adam's, which outputs the output marker<br>
> lines, while the second, using the same statements in the opposite<br>
> order, suppresses the markers. Of perl, awk and sed, I suspect sed is<br>
> the most lightweight, and probably the quickest, unless perl can<br>
> outperform sed on larger files. awk has a reputation for being pretty<br>
> slow. I tend to favour sed unless awk or perl makes the job a lot<br>
> easier.<br>
><br>
> Gilles<br>
><br>
> On 11/10/2010 11:13 AM, Adam Thompson wrote:<br>
> > The AWK version is functionally identical, and not very much<br>
> shorter, or<br>
> > any more elegant:<br>
> ><br>
> > awk '/output start/ {s=1};{if (s==1) print $0};/output end/ {s=0}'<br>
> ><br>
> > (the perl version can generally be made that small, too.)<br>
> ><br>
> ><br>
> ><br>
> > I would instead suggest sed(1), since this is precisely what it’s<br>
> > designed for:<br>
> ><br>
> > sed -n '/output start/,/output end/p' < infile<br>
> ><br>
> ><br>
> ><br>
> > -Adam<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > *From:* <a href="mailto:roundtable-bounces@muug.mb.ca">roundtable-bounces@muug.mb.ca</a><br>
> > [mailto:<a href="mailto:roundtable-bounces@muug.mb.ca">roundtable-bounces@muug.mb.ca</a>] *On Behalf Of* Sean Walberg<br>
> > *Sent:* Wednesday, November 10, 2010 10:56<br>
> > *To:* Continuation of Round Table discussion<br>
> > *Subject:* Re: [RndTbl] Command line challenge: trim garbage from<br>
> start<br>
> > and end of a file.<br>
> ><br>
> ><br>
> ><br>
> > OTTOMH:<br>
> ><br>
> ><br>
> ><br>
> > perl -n -e 'BEGIN {$state = 0} $state = 1 if ($state == 0 and /output<br>
> > start/); $state = 2 if ($state == 1 and /output end/) ; print if<br>
> > ($state == 1)' < infile > outfile<br>
> ><br>
> > I'll bet there's a shorter AWK version though.<br>
> ><br>
> ><br>
> ><br>
> > Sean<br>
> ><br>
> ><br>
> ><br>
> > On Wed, Nov 10, 2010 at 10:51 AM, John Lange <<a href="mailto:john@johnlange.ca">john@johnlange.ca</a>><br>
</div></div><div class="im">> > wrote:<br>
> ><br>
> > I have files with the following structure:<br>
> ><br>
> > garbage<br>
> > garbage<br>
> > garbage<br>
> > output start<br>
> > .. good data<br>
> > .. good data<br>
> > .. good data<br>
> > .. good data<br>
> > output end<br>
> > garbage<br>
> > garbage<br>
> > garbage<br>
> ><br>
> > How can I extract the good data from the file trimming the garbage<br>
> > from the beginning and end?<br>
> ><br>
> > The following works just fine but it's dirty because I don't like the<br>
> > fact that I have to pick an arbitrarily large number for the "before"<br>
> > and "after" values.<br>
> ><br>
> > grep -A 999999 "output start" <infile> | grep -B 999999 "output<br>
> end" ><br>
> > newfile<br>
> ><br>
> > Can anyone come up with something more elegant?<br>
> ><br>
> > --<br>
> > John Lange<br>
</div>> > <a href="http://www.johnlange.ca" target="_blank">www.johnlange.ca</a><br>
<div class="im">><br>
> --<br>
> Gilles R. Detillieux E-mail: <<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>><br>
</div>
<div class="im">> Spinal Cord Research Centre WWW: <a href="http://www.scrc.umanitoba.ca/" target="_blank">http://www.scrc.umanitoba.ca/</a><br>
> Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)<br>
> _______________________________________________<br>
> Roundtable mailing list<br>
</div>> <a href="mailto:Roundtable@muug.mb.ca">Roundtable@muug.mb.ca</a><br>
<div class="im">> <a href="http://www.muug.mb.ca/mailman/listinfo/roundtable" target="_blank">http://www.muug.mb.ca/mailman/listinfo/roundtable</a><br>
><br>
><br>
><br>
><br>
> --<br>
</div>> Sean Walberg <<a href="mailto:sean@ertw.com">sean@ertw.com</a>> <a href="http://ertw.com/" target="_blank">http://ertw.com/</a><br>
><br>
><br>
> ------------------------------------------------------------------------<br>
<div><div></div><div class="h5">><br>
<br>
--<br>
Gilles R. Detillieux E-mail: <<a href="mailto:grdetil@scrc.umanitoba.ca">grdetil@scrc.umanitoba.ca</a>><br>
Spinal Cord Research Centre WWW: <a href="http://www.scrc.umanitoba.ca/" target="_blank">http://www.scrc.umanitoba.ca/</a><br>
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 0J9 (Canada)<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Sean Walberg <<a href="mailto:sean@ertw.com">sean@ertw.com</a>> <a href="http://ertw.com/">http://ertw.com/</a><br>