Today I found a nice problem to solve on Stack Overflow. There was a guy asking why the Unix utility grep didn’t parse a file on his Linux machine.

Long story short, grep was somehow «recognizing» the file as binary because it contained four NUL values.

It had \000 in lines 16426, 16428, 16430 and 16432. I revealed those lines by using sed to replace every NUL with a visible marker (note that -i edits the file in place, so it is safer to run this on a copy):

sed -i 's/\x0/!!!/g' min14.pdb
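
If you would rather not modify the file, the NUL bytes can also be located read-only. The following is a minimal sketch, not what was used in the original investigation: it builds a small hypothetical sample file (a stand-in for min14.pdb) and assumes the byte \001 does not already occur in the data, because it maps NUL to \001 so that grep can match it.

```shell
# Hypothetical sample file standing in for min14.pdb: line 3 contains a NUL byte.
sample=$(mktemp)
printf 'ATOM 1\nATOM 2\nbad\000line\nWAT 4\n' > "$sample"

# Count NUL bytes: delete every byte except NUL, then count what is left.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')

# List the numbers of the lines that contain a NUL.
# Assumption: \001 does not occur in the file, so it can stand in for NUL.
nul_lines=$(tr '\000' '\001' < "$sample" | grep -a -n "$(printf '\001')" | cut -d: -f1)

echo "NUL bytes: $nul_count"
echo "Lines with NUL: $nul_lines"
```

On the real file, replace "$sample" with min14.pdb and drop the two lines that create the sample.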

I searched for a quick way to get around those NUL values and found a post on Unix Stack Exchange that says:

If there is a NUL character anywhere in the file, grep will consider it as a binary file.

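Which is not true, at least not for the whole file.

What grep does (GNU grep on Linux, in this case) is read the file and print matching lines as text until it runs into the first NUL character. From that point on it treats the input as binary and stops printing lines; if there are later matches, it only reports that the binary file matches.

Taking the file min14.pdb as an example, we can do: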

mortiz@florida:~$ cat min14.pdb |grep ATOM
ATOM  16174 CG   ARG  1068     -21.558  -1.010  13.789  1.00  0.00           C
ATOM  16175 HG2  ARG  1068     -22.612  -0.898  13.536  1.00  0.00           H
ATOM  16174 CG   ARG  1068     -21.558  -1.010  13.789  1.00  0.00           C
ATOM  16175 HG2  ARG  1068     -22.612  -0.898  13.536  1.00  0.00           H
Binary file (standard input) matches

The user tried matching directly for the string ‘WAT’:

mortiz@florida:~$ cat min14.pdb |grep WAT
Binary file (standard input) matches

There was no output, just the binary-file message. This happens because the first match for «WAT» is at line 132285 and the first NUL value appears at line 16426, so by the time grep reached the match it had already stopped printing lines. It does not mean that a NUL value anywhere in the file makes grep treat the whole file as binary from the start: as the ATOM example shows, matches before the first NUL are still printed.
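
The comparison above (line of the first match versus line of the first NUL) can be checked with a few commands. This is only a sketch on a hypothetical sample file; the helper name first_match is made up for the example, -m 1 is assumed to be available (it is in GNU and BSD grep), and \001 is assumed not to occur in the data.

```shell
# Hypothetical sample: ATOM matches come before the NUL, WAT only after it.
sample=$(mktemp)
printf 'ATOM 1\nATOM 2\nbad\000line\nWAT 4\n' > "$sample"

# Number of the first line containing a NUL (NUL is mapped to \001 so grep can match it).
first_nul=$(tr '\000' '\001' < "$sample" | grep -a -n -m 1 "$(printf '\001')" | cut -d: -f1)

# Number of the first line matching a pattern; -a forces text mode.
# first_match is a hypothetical helper defined only for this example.
first_match() { grep -a -n -m 1 -- "$1" "$sample" | cut -d: -f1; }

first_atom=$(first_match ATOM)
first_wat=$(first_match WAT)

echo "first NUL: line $first_nul"
echo "first ATOM: line $first_atom, first WAT: line $first_wat"
```

A pattern whose first match comes after the first NUL, like WAT here, is the case where plain grep only prints the binary-file message.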

For this kind of situation, a practical approach is to pass the file through «strings» first, which keeps only the printable text and drops the characters that make grep behave this way. The alternative «grep -a», which tells grep to treat the file as text, is also valid:

mortiz@florida:~$ strings min14.pdb |grep WAT
ATOM  49143 H1   WAT  7339      80.726  87.371 -21.278  1.00  0.00           H
ATOM  49144 H2   WAT  7339      79.817  86.479 -22.069  1.00  0.00           H
ATOM  49145 O    WAT  7340      82.862  80.075 -23.368  1.00  0.00           O
ATOM  49146 H1   WAT  7340      83.597  79.624 -23.851  1.00  0.00           H
ATOM  49147 H2   WAT  7340      83.311  80.136 -22.472  1.00  0.00           H
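
For comparison, here is a small sketch of two alternatives to strings, run on a hypothetical sample file rather than on min14.pdb. Deleting the NUL bytes with tr is not something the original post used; it is shown here as an assumption about what may be convenient, since it leaves every line otherwise untouched. If the file also contains other non-text bytes, grep may still need -a after the tr step.

```shell
# Hypothetical sample file: a NUL on line 3, the WAT match on line 4.
sample=$(mktemp)
printf 'ATOM 1\nATOM 2\nbad\000line\nWAT  4 H1\n' > "$sample"

# Option 1: -a makes grep process the file as if it were text.
with_a=$(grep -a WAT "$sample")

# Option 2: strip the NUL bytes before grep ever sees them.
with_tr=$(tr -d '\000' < "$sample" | grep WAT)

echo "grep -a:    $with_a"
echo "tr | grep:  $with_tr"
```

Both options print the matching line itself instead of the binary-file message.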