Copyright (c) 2014-2020 UPVI, LLC

Saving Data -- Text Files

Last Updated: Wednesday, 24 September 2014
Published: Wednesday, 24 September 2014
Written by Jason D. Sommerville

The First Entry

Welcome to Article 1 of the UPVI Blog! I intend to use this space on a bi-weekly basis to gush forth on a variety of issues in the field of data acquisition, analysis and control that I have come across over the years. I'm not going to claim that anything written here will be revolutionary, but it is my hope that a struggling developer or student may wander across this site and that their work will be a little better for it.

Given that I am launching my company with the release of version 1.0 of the LVHDF5 toolkit, I thought it fitting to spend the first few articles discussing data formats for saving scientific data. While I would love to regale you, dear reader, with the wonders of HDF5, I thought it best to start as simply as possible. That leads us straight away to...

Storing data in an ASCII text file

Pros and cons of ASCII

Yes, ASCII text. It's old. It's trusty. Everything can read it, even Excel, even Notepad. You can give your data file to a complete stranger and, at the very least, they'll be able to figure out how to look at the numbers inside of it. It's been working for over 50 years, and, if I had to guess, it will be working for another 500, barring a return to the Dark Ages. It always works--maybe not well, but it always works. That is ASCII's only advantage, but it is a huge advantage that should not be lightly disregarded.

What about the downsides? There are many.

  • Format is only loosely specified and may require further documentation
  • Difficult to represent higher-rank data
  • The storage of attributes may be confusing
  • File size is large
  • Grouping of datasets typically relies on the file system, leading to data fragmentation

However, before I expound upon the downsides, let's look a little more closely at the format.

What is ASCII, really?

I suspect that most readers are aware of what we mean by an ASCII text file in general. (If not, I direct you to Wikipedia.) When we talk about storing data in a text file, we are almost always talking about a separated value text file. This means that the bulk of the file contains rows of data which are separated into columns by a delimiter, usually a tab, one or more spaces, or a comma. In each column, the ASCII string representation of the value is stored, e.g. the string "3.14" or "7.297E-3". So, the bulk of the file looks something like this:

1.251, 2.545, 1.456, 2.342E-3
4.234, 1.974, 5.629, 1.497
5.145, 12.54, 86.32, 0.832
...
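
To make that concrete, here is a minimal sketch of reading such a file. Python is assumed here purely for illustration (any language with string splitting will do), and mydata.csv is just a placeholder name.

import csv

rows = []
with open("mydata.csv", newline="") as f:
    for record in csv.reader(f):          # split each line on the comma delimiter
        rows.append([float(field) for field in record])
# rows is now one list per line, each holding that row's values as floats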

Frequently these files are called comma-separated value files, or CSV files for short. "CSV" is then used as the file name extension, e.g. mydata.csv, often without regard to the actual delimiter the file uses. Therefore you may find a tab-separated value file which is still called a CSV file and has a .csv extension.

There is one additional format that is worth mentioning, which is the fixed-field format text file. This format is typically generated by older programs, particularly those written in FORTRAN. In this format, there are rows and columns of data, as with CSV files. However, rather than having a delimiter, each column is allotted a fixed number of characters, often 10. Generally, unused characters in a column are filled with spaces, and programs ensure that they leave at least one space between each column by never using the full column width.
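
As a sketch of how fixed-field output is typically produced (Python assumed; the 10-character width matches the convention mentioned above):

values = [1.251, 2.545, 1.456, 2.342e-3]
# right-justify each value in a 10-character field with 4 significant digits;
# since the digits never fill the field, at least one space separates columns
line = "".join(f"{value:10.4G}" for value in values)
print(line)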

For my own work, I nearly always use tab-delimited files. I find they are easier to read in a basic text editor than comma-separated files, and they can be read easily by all of the programs and languages I regularly use.

Data of rank one or two

If I'm storing a 2-D array, the row/column format of this file works out great. It also works great if I'm storing several 1-D arrays of related data, say data sampled at the same rate from multiple devices. In this second, more common case, it's critically important to write column headers on your data. I can't stress this enough. While it seems simple and straightforward to do, I can't begin to tell you the number of data files I've opened up only to be confronted with a mass of numbers and no way of identifying what they mean. A good header should label each column and specify the units of the data, if applicable. For example:

Time (s)	Input Voltage (V)	Output Voltage (V)	Temperature (C)	Salinity (ppm)
0.5	1.251	2.545	1.456	2.342E-3
1.0	4.234	1.974	75.62	1.497
1.5	5.145	12.54	76.32	0.832
2.0	1.548	15.45	77.32	1.01E-2

Here I'm using tab-delimited data. You'll note that the columns don't line up perfectly in raw text, but it's good enough to get the idea. When imported into Excel, the data will be rendered properly into the appropriate columns.
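
Writing such a file is equally simple. A minimal sketch (Python assumed; mydata.txt is a placeholder name) that produces the example above:

header = ["Time (s)", "Input Voltage (V)", "Output Voltage (V)",
          "Temperature (C)", "Salinity (ppm)"]
rows = [
    [0.5, 1.251, 2.545, 1.456, 2.342e-3],
    [1.0, 4.234, 1.974, 75.62, 1.497],
]
with open("mydata.txt", "w") as f:
    f.write("\t".join(header) + "\n")          # one tab-separated header line
    for row in rows:
        f.write("\t".join(str(v) for v in row) + "\n")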

Most high-level programming languages include functions for reading in such a data file. Some are able to identify the column headers and give names to the various channels in whatever format is native to the program. At the very least, most languages have a way of instructing the read function to skip over one or more header lines.
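
For instance, NumPy's genfromtxt (one common choice; nothing about the format requires any particular library) can do either:

import numpy as np

# names=True turns the header line into field names (NumPy strips the
# invalid characters, so "Time (s)" becomes "Time_s"); skip_header=1
# would instead simply discard the header line.
data = np.genfromtxt("mydata.txt", delimiter="\t", names=True)
print(data["Time_s"])                          # the Time column as a 1-D array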

Data of higher rank

It should be immediately obvious that this format becomes a problem for data sets of rank higher than two. Rows and columns give you exactly two dimensions, so a rank-three array must either be flattened into two dimensions or split across multiple blocks or files, and either choice requires extra documentation before anyone else can reconstruct the original shape.
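
A sketch of the usual workarounds, assuming Python/NumPy and a rank-three array:

import numpy as np

a = np.arange(24).reshape(4, 3, 2)   # a rank-3 array has no natural rows and columns

# Workaround 1: flatten two dimensions together; the original shape must be
# documented somewhere (a comment line, the file name, ...) or it is lost.
np.savetxt("flattened.txt", a.reshape(4, 6), delimiter="\t")

# Workaround 2: write one file per 2-D slice; now the grouping of the slices
# across files has to be documented instead.
for i, plane in enumerate(a):
    np.savetxt(f"slice_{i}.txt", plane, delimiter="\t")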

File size and precision

Because values are stored as their ASCII string representations, the precision of the data is tied directly to the number of characters used to represent each value. Pi, stored as "3.14" in the example above, required 4 bytes and carries only 3 decimal digits of precision. The fine structure constant, stored as "7.297E-3", required 8 bytes to store, as much as a double-precision float in binary, but gave us only 4 decimal digits of precision.
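
The trade-off is easy to check directly (Python assumed):

import struct

text = f"{3.141592653589793:.17g}"   # 17 significant digits round-trip a double exactly
print(text, len(text))               # 18 characters of text, before any delimiter
print(struct.calcsize("d"))          # 8 bytes for the same value in binary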