Pulling back the frontiers of science part 1: Fixed width data files

To do science these days you need to be able to deal with very long lists of numbers. Luckily, computers are very good at doing that for us – most of the time.

However, being human we like to screw things up for reasons like ‘it looks nicer’. Which brings me to the topic of this post: fixed-width data files.

Fixed-width data files are generally beautifully presented blocks of text with lines that are always 80, 120 or some other number of characters wide. Nice to look at, but quite possibly one of the reasons that science is so damn hard!

Trying to extract data from these beasts can be excruciating, because you cannot tell how much of the whitespace between values is padding and how much belongs to a field. You need to painstakingly work out how many characters each data field can occupy, or become very good at regular expressions. Worse, if your example file is missing a data column, you have no real way of telling.
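To make the pain concrete, here is a minimal Python sketch. It assumes a made-up layout of three fields, each 8 characters wide; the widths, the sample values and the name parse_fixed_width are all hypothetical, not taken from any real dataset. The point is that the reader has to know the widths in advance, and that the lazy alternative of splitting on whitespace quietly loses the position of a missing value.

```python
# Hypothetical layout: three right-aligned fields, each 8 characters wide.
FIELD_WIDTHS = (8, 8, 8)

def parse_fixed_width(line, widths=FIELD_WIDTHS):
    """Slice a line at known column widths; blank fields become None."""
    fields = []
    start = 0
    for width in widths:
        chunk = line[start:start + width].strip()
        fields.append(float(chunk) if chunk else None)
        start += width
    return fields

# Build two sample lines so the padding is guaranteed to match the layout.
complete = "{:>8}{:>8}{:>8}".format("1.5", "22.0", "300.0")
missing = "{:>8}{:>8}{:>8}".format("1.5", "", "300.0")

print(parse_fixed_width(complete))  # works, but only because we knew the widths
print(parse_fixed_width(missing))   # the blank middle field shows up as None
print(missing.split())              # naive split: two values, no clue which column is gone
```

If the widths are not documented, the slicing approach is unavailable and you are left guessing from the data itself.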

On the other hand, if you standardise the bits between the data [for example, always put two spaces or a comma between data fields], life becomes much simpler. Two commas in a row? Missing data. Four spaces in a row? Missing data. Simple. A given data value can be 3 characters or 35 characters long; it doesn’t matter!
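For comparison, here is a rough sketch of reading comma-separated data with Python's standard csv module. The sample text and the name parse_delimited are invented for illustration, and it assumes every field is either numeric or empty. No widths are needed, and two commas in a row simply produce an empty field in the right position.

```python
import csv
import io

def parse_delimited(text):
    """Parse comma-separated numeric rows; empty fields become None."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([float(field) if field.strip() else None for field in record])
    return rows

# Hypothetical sample: the second row is missing its middle value.
sample = "1.5,22,300\n4.25,,6\n"
rows = parse_delimited(sample)
print(rows)
```

The values here are 1, 2 and 4 characters long, and the parser does not care.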

Sure – your file looks messy, but it is orders of magnitude easier to interpret, display, plot, make into beautiful graphics that get your message across, et cetera.

So, if you’re thinking of producing datasets: please think of all the people who will need to process them at some later stage and avoid the temptation to make fixed-width output files.

That’s it for pulling back the frontiers of science today ;-)
