
Some UNIX tools, such as cut, by default expect the fields of a tabular file to be separated by single tabs. But not AWK!

Given a tabular file (awk_test.txt) with the structure:

A<TAB>B<TAB>C<TAB>D<TAB>E
A<SPACE>B<SPACE>C<SPACE>D<SPACE>E
A<TAB><TAB>B<TAB><TAB>C<TAB><TAB>D<TAB><TAB>E
A<SPACE><SPACE>B<SPACE><SPACE>C<SPACE><SPACE>D<SPACE><SPACE>E
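To try this yourself, here is a minimal sketch that recreates the file. It assumes a POSIX shell whose printf expands \t to a real tab; the file name is the one used in this post.

```shell
# Recreate awk_test.txt: printf turns \t into a tab and \n into a newline.
{
  printf 'A\tB\tC\tD\tE\n'          # single tabs
  printf 'A B C D E\n'              # single spaces
  printf 'A\t\tB\t\tC\t\tD\t\tE\n'  # double tabs
  printf 'A  B  C  D  E\n'          # double spaces
} > awk_test.txt
```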

If you run an AWK command to count the number of fields:

awk '{print NF}' awk_test.txt

AWK reports 5 fields for each of these lines!

That is because, by default, AWK does not split on single spaces or single tabs; it splits on runs of whitespace (spaces and tabs) and ignores leading and trailing whitespace. So several consecutive whitespace characters are treated as a single separator!
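The difference from cut is easiest to see on a line with a doubled tab. The following is a small sketch, not part of the original test file; the variable names are only for illustration.

```shell
# One line with a doubled tab between A and B.
line=$(printf 'A\t\tB')

# cut treats every tab as a delimiter, so field 2 is the empty string between the two tabs.
cut_field=$(printf '%s\n' "$line" | cut -f2)

# AWK's default splitting collapses the run of tabs, so field 2 is B.
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_field" "$awk_field"
```

Here cut prints an empty second field, while AWK prints B.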

The safe way to use AWK on tab-separated files is to explicitly specify the field separator (FS) and the output field separator (OFS):

awk 'BEGIN{FS="\t"; OFS="\t"} {print NF}' awk_test.txt
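As a quick check, the sketch below pipes the four example lines straight into the same command instead of reading awk_test.txt, so it can be run on its own. The variable name counts is only for illustration.

```shell
# Feed the four example lines to AWK with a tab-only field separator.
counts=$(printf 'A\tB\tC\tD\tE\nA B C D E\nA\t\tB\t\tC\t\tD\t\tE\nA  B  C  D  E\n' |
  awk 'BEGIN{FS="\t"; OFS="\t"} {print NF}')

# One count per input line.
printf '%s\n' "$counts"
```

With FS set to a tab, the counts become 5, 1, 9 and 1: the space-separated lines are a single field each, and every doubled tab now encloses an empty field.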

Beware that setting the field separator to a single space (FS=" ") does not make AWK split on single spaces; it re-engages the default separator scheme (i.e. whitespace sequences). So don’t use space separators!
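The sketch below illustrates this on a line with doubled spaces. The second command is an addition not covered in this post: if you really do need single-space splitting, a bracket expression such as FS="[ ]" is treated as a regular expression matching exactly one space. The variable names are only for illustration.

```shell
# Two spaces between every letter.
line='A  B  C  D  E'

# FS=" " is the default scheme: runs of whitespace collapse into one separator.
default_nf=$(printf '%s\n' "$line" | awk 'BEGIN{FS=" "} {print NF}')

# FS="[ ]" is a regular expression that matches exactly one space.
single_nf=$(printf '%s\n' "$line" | awk 'BEGIN{FS="[ ]"} {print NF}')

printf 'default: %s\nsingle space: %s\n' "$default_nf" "$single_nf"
```

The first count is 5, the second is 9, because each doubled space now encloses an empty field.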

I am a software developer and data analyst for the McCarthy Group at the Wellcome Centre for Human Genetics and OCDEM.
