Have I been using grep wrong this whole time?
At some point in our lives we may stop and ask ourselves - are we doing the right thing? I've asked myself that question numerous times, most recently - am I using grep wrong?
Let's start from the beginning - what is grep?
grep is a command-line pattern-searching tool on Linux: it goes through the files you give it, searches for the pattern you have provided, and prints the matching lines. I use it quite a lot, though only in simple ways, nothing fancy. Here is its description from the man pages[1]:
DESCRIPTION
grep searches for PATTERNS in each FILE. PATTERNS is one or more
patterns separated by newline characters, and grep prints each
line that matches a pattern. Typically PATTERNS should be quoted
when grep is used in a shell command.
A FILE of “-” stands for standard input. If no FILE is given,
recursive searches examine the working directory, and
nonrecursive searches read standard input.
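To make the description a bit more concrete, here is a minimal sketch of the three things it mentions: searching a named file, using "-" for standard input, and giving several patterns separated by a newline. The temporary directory, the file name sample.log and its contents are made up for illustration.

```shell
# Create a small throwaway file to search (name and contents are hypothetical).
tmpdir=$(mktemp -d)
printf 'alpha\nE0111 first error\nbeta\nE0111 second error\n' > "$tmpdir/sample.log"

# Search a named file; the pattern is quoted, as the man page advises.
grep "E0111" "$tmpdir/sample.log"

# A FILE of "-" stands for standard input, so grep reads what the redirect feeds it.
matches_from_stdin=$(grep "E0111" - < "$tmpdir/sample.log")

# Two patterns separated by a newline: a line matching either one is printed.
multi=$(grep "alpha
beta" "$tmpdir/sample.log")
printf '%s\n' "$multi"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```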
Okay, that was straightforward. Now for the reason I'm questioning my decisions - I often use grep in combination with cat. What is cat, then?
cat is, besides being a domestic species of small carnivorous mammal[2], also a popular command-line tool on Linux, used primarily (at least, this is how I use it) for printing the contents of a file. Here is the short description from the man pages[3]:
DESCRIPTION
Concatenate FILE(s) to standard output.
With no FILE, or when FILE is -, read standard input.
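As a small illustration of that description, the sketch below concatenates two files and then uses "-" to slot standard input in between them. The file names a.txt and b.txt and their contents are invented for the example.

```shell
# Two tiny throwaway files (names and contents are hypothetical).
tmpdir=$(mktemp -d)
printf 'one\n' > "$tmpdir/a.txt"
printf 'two\n' > "$tmpdir/b.txt"

# Concatenate both files to standard output, captured here in a variable.
combined=$(cat "$tmpdir/a.txt" "$tmpdir/b.txt")

# When FILE is "-", cat reads standard input at that position.
mixed=$(printf 'middle\n' | cat "$tmpdir/a.txt" - "$tmpdir/b.txt")

printf '%s\n' "$combined"
printf '%s\n' "$mixed"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```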
How do I use them together, you may ask?
I cat a file and pipe it to grep to search for some pattern. Something like the snippet below:
$ cat some-file.txt | grep <pattern>
And what is wrong with the above command? Well, apart from spinning up an additional process and typing a bit more, nothing is wrong with it - at least I think so.
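For reference, here is a minimal sketch showing that the pipeline, a direct call to grep, and a shell redirect all find the same lines in a single file. The file contents and the pattern "keep" are made up; only some-file.txt echoes the snippet above.

```shell
# A small throwaway file (contents are hypothetical).
tmpdir=$(mktemp -d)
printf 'keep this line\nskip\nkeep that line\n' > "$tmpdir/some-file.txt"

# Two processes: cat reads the file, grep reads the pipe.
via_cat=$(cat "$tmpdir/some-file.txt" | grep "keep")

# One process: grep opens the file itself.
direct=$(grep "keep" "$tmpdir/some-file.txt")

# One process: the shell opens the file and hands it to grep as standard input.
via_redirect=$(grep "keep" < "$tmpdir/some-file.txt")

# All three variables hold the same two matching lines.
printf '%s\n' "$via_cat"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```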
This got me thinking - why don't I just use grep on its own? Would that be the faster way to search? As it turns out, this question of mine came up in this reddit post[4], and the person who started the thread was quite annoyed that people were doing it wrong - the cat | grep way instead of grep alone. The blog post was not available, so I needed to consult the Wayback Machine to get to the source article[5]. Okay, it seems that I've been doing it wrong this whole time. Never mind that, though - maybe I can actually do some testing and find out for sure whether I was doing it wrong.
Let's find out. Below is the output of the first test I ran on my CentOS 7 machine:
[user@host]$ du -sh file.log
1.4G file.log
[user@host]$ time cat kubelet.log | grep "E0111" > cat_grep.log
real 0m24.291s
user 0m3.778s
sys 0m7.941s
[user@host]$ time grep "E0111" kubelet.log > grep.log
real 0m22.256s
user 0m0.676s
sys 0m10.507s
[user@host]$ wc -l cat_grep.log
288355 cat_grep.log
[user@host]$ wc -l grep.log
288355 grep.log
The first command shows the size of the log file. As you can see, the file is a big one, taking up 1.4G on the machine. The second command measures the running time of the process[6].
In the first part, I'm running cat, piping it to grep and writing everything into the cat_grep.log file. Why? Because I want to see the number of lines that grep found, for the sake of comparison.
The second part is almost the same as the previous one, but instead of running cat, I'm running grep directly and writing to grep.log. The last command, wc -l, just outputs the number of lines in a file[7].
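Counting lines only shows that both runs found the same number of matches. A slightly stricter check, sketched here on a tiny made-up file rather than the real log, is to compare the two output files byte for byte with cmp; grep -c can also count matching lines without the extra wc step. The sample.log name and its contents are assumptions for the example.

```shell
# A tiny stand-in for the real log (name and contents are hypothetical).
tmpdir=$(mktemp -d)
printf 'E0111 a\nI0111 b\nE0111 c\n' > "$tmpdir/sample.log"

# Produce the two result files the same way as in the test above.
cat "$tmpdir/sample.log" | grep "E0111" > "$tmpdir/cat_grep.log"
grep "E0111" "$tmpdir/sample.log" > "$tmpdir/grep.log"

# cmp -s prints nothing and exits 0 only when the files are byte-identical.
if cmp -s "$tmpdir/cat_grep.log" "$tmpdir/grep.log"; then
    same=yes
else
    same=no
fi
echo "outputs identical: $same"

# grep -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c "E0111" "$tmpdir/sample.log")
echo "matching lines: $count"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```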
The problem with this first test is that it might not be an appropriate one: it writes the matching lines into a separate file, so the timing can vary from run to run depending on disk I/O. I ran it several times and got different numbers each time; sometimes cat | grep was faster, and other times grep alone was.
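Since single runs vary this much, one option is to repeat each variant a few times and look at the spread rather than at one number. Below is a rough sketch of that idea, not the method I used for the results in this post: time_runs is a hypothetical helper, the sample file is a tiny stand-in, and date +%s only gives whole seconds, which is coarse but enough for runs that take tens of seconds on a big log.

```shell
# Hypothetical helper: run a command several times, printing whole seconds per run.
time_runs() {
    label=$1
    runs=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@"
        end=$(date +%s)
        echo "$label run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Tiny stand-in for the real log; point "log" at a large file for meaningful numbers.
tmpdir=$(mktemp -d)
log="$tmpdir/sample.log"
printf 'E0111 error\nI0111 info\n' > "$log"

# Each variant runs three times; sh -c is needed because a pipeline is not a single command.
time_runs "cat | grep" 3 sh -c 'cat "$1" | grep "E0111" > "$2"' sh "$log" "$tmpdir/cat_grep.log"
time_runs "grep" 3 sh -c 'grep "E0111" "$1" > "$2"' sh "$log" "$tmpdir/grep.log"

# Remove the throwaway directory again.
rm -r "$tmpdir"
```

Alternating the order of the two variants between rounds is one way to reduce the chance that whichever command runs second benefits from file data the first one already pulled into memory.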
However, if we leave out the writing to disk and just pipe the output into the wc command, the numbers are a bit different:
[user@host]$ time cat kubelet.log | grep "E0111" | wc -l
288355
real 0m10.072s
user 0m2.320s
sys 0m6.277s
[user@host]$ time grep "E0111" kubelet.log | wc -l
288355
real 0m15.221s
user 0m1.224s
sys 0m7.518s
Each time I ran this test, the time command showed a shorter run time for the cat | grep command. That was really interesting to me, especially because I had expected grep alone to be faster. Maybe the reason is that I piped everything into the wc command for easier output. Okay, let's run it one last time, but this time without the last pipe:
[user@host]$ time cat kubelet.log | grep "E0111"
...
...
real 6m37.486s
user 0m43.496s
sys 0m18.369s
[user@host]$ time grep "E0111" kubelet.log
...
...
real 6m58.121s
user 0m44.362s
sys 0m23.814s
The last test shows that the cat | grep option is faster; however, I understand that a lot more is going on below the surface when we run each of the commands above. As to why the cat | grep option is faster? I cannot give a proper answer now, because I don't know. I might explore this in some other post(s).
For now, I'm going to keep using my cat | grep pattern, and maybe use grep alone from time to time when I actually get bored with typing. I'm totally okay with that, because I feel that in this case there is no right or wrong! :)