Today I found a nice problem to solve on Stack Overflow. A user was asking why the Unix utility grep would not search a file properly on his Linux machine.
Long story short, grep was somehow «recognizing» the file as binary because it contained four NUL bytes.
It had \000 in lines 16426, 16428, 16430 and 16432. I revealed those lines using sed, replacing every NUL with a visible marker (note that -i edits the file in place):
sed -i 's/\x0/!!!/g' min14.pdb
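If you would rather not modify the file, the lines carrying NUL bytes can also be located in a read-only way. What follows is only a minimal sketch, not what was run on the original file: it builds a small hypothetical sample (the temporary file and its contents are made up for illustration) and uses tr to turn each NUL into byte 0x01 in the stream, so that grep -n can report the line numbers.

```shell
# Hypothetical sample: four lines, with NUL bytes on lines 2 and 3.
sample=$(mktemp)
printf 'ATOM 1\nbad\000line\nalso\000bad\nWAT 4\n' > "$sample"

# Count the NUL bytes: delete everything that is not NUL, count what is left.
nul_count=$(tr -cd '\000' < "$sample" | wc -c | tr -d ' ')

# Map NUL to byte 0x01 in the stream only, then ask grep for line numbers.
# Assumption: the file does not already contain 0x01 bytes.
marker=$(printf '\001')
nul_lines=$(tr '\000' '\001' < "$sample" | grep -n "$marker" | cut -d: -f1 | paste -s -d ' ' -)

echo "NUL bytes: $nul_count"
echo "lines with NUL: $nul_lines"
rm -f "$sample"
```

The file on disk is never changed here; only the copy flowing through the pipe is altered.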
I searched for a quick way to get around those NUL values and found this post on Unix Stack Exchange that says:
«If there is a NUL character anywhere in the file, grep will consider it as a binary file.»
Which is not entirely true.
grep does not reject the file up front just because a NUL is somewhere in it. In this case it read through the file printing matching lines as usual, and stopped printing them once the first NUL character appeared.
Taking the file min14.pdb as an example, we can run:
mortiz@florida:~$ cat min14.pdb |grep ATOM
ATOM 16174 CG ARG 1068 -21.558 -1.010 13.789 1.00 0.00 C
ATOM 16175 HG2 ARG 1068 -22.612 -0.898 13.536 1.00 0.00 H
ATOM 16174 CG ARG 1068 -21.558 -1.010 13.789 1.00 0.00 C
ATOM 16175 HG2 ARG 1068 -22.612 -0.898 13.536 1.00 0.00 H
Binary file (standard input) matches
The user tried directly matching for the string ‘WAT’:
mortiz@florida:~$ cat min14.pdb |grep WAT
Binary file (standard input) matches
There was no output, just the binary-file message. This happens because the first match for «WAT» is at line 132285 while the first NUL value is at line 16426, so grep had already stopped printing matches long before it reached one. It does not mean that grep treats the whole file as binary simply because it contains a NUL value.
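The reasoning above boils down to comparing two line numbers: where the first NUL sits and where the first match sits. As a rough sketch under assumptions (the sample file and its layout are hypothetical, and the 0x01 marker byte is assumed not to occur in the data), the comparison could look like this:

```shell
# Hypothetical layout: a NUL on line 2, the first WAT record on line 4.
sample=$(mktemp)
printf 'ATOM 1\nbad\000line\nATOM 3\nWAT 4\n' > "$sample"

# Line number of the first NUL (NUL is mapped to 0x01 in the stream only).
first_nul=$(tr '\000' '\001' < "$sample" | grep -n "$(printf '\001')" | head -n 1 | cut -d: -f1)

# Line number of the first match; -a forces text mode so NULs do not interfere.
first_match=$(grep -a -n 'WAT' "$sample" | head -n 1 | cut -d: -f1)

if [ -n "$first_nul" ] && [ -n "$first_match" ] && [ "$first_nul" -lt "$first_match" ]; then
  verdict="first match comes after the first NUL"
else
  verdict="first match is not preceded by a NUL"
fi

echo "first NUL: line $first_nul, first match: line $first_match"
echo "$verdict"
rm -f "$sample"
```

For min14.pdb the same comparison would be 16426 against 132285, which is why no WAT line was ever printed.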
In this kind of situation, a good approach is to run the file through «strings» first, which strips out the characters that make grep behave this way. Of course, the alternative «grep -a» is also valid:
mortiz@florida:~$ strings min14.pdb |grep WAT
ATOM 49143 H1 WAT 7339 80.726 87.371 -21.278 1.00 0.00 H
ATOM 49144 H2 WAT 7339 79.817 86.479 -22.069 1.00 0.00 H
ATOM 49145 O WAT 7340 82.862 80.075 -23.368 1.00 0.00 O
ATOM 49146 H1 WAT 7340 83.597 79.624 -23.851 1.00 0.00 H
ATOM 49147 H2 WAT 7340 83.311 80.136 -22.472 1.00 0.00 H
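For comparison, here is a small sketch of two other ways to get the same result without strings, again on a hypothetical sample file rather than on min14.pdb: forcing text mode with grep -a, and deleting the NUL bytes with tr before grep ever sees them.

```shell
# Hypothetical sample: a NUL on line 2, a WAT record on line 3.
sample=$(mktemp)
printf 'ATOM 1\nbad\000line\nWAT 3\n' > "$sample"

# Option 1: tell grep to treat the input as text.
with_a=$(grep -a 'WAT' "$sample")

# Option 2: delete the NUL bytes in the stream, then grep as usual.
with_tr=$(tr -d '\000' < "$sample" | grep 'WAT')

echo "grep -a : $with_a"
echo "tr -d   : $with_tr"
rm -f "$sample"
```

One thing to keep in mind when choosing: by default strings only prints runs of at least four printable characters, so very short lines can vanish from its output, while grep -a and tr -d leave the rest of each line untouched.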