Benchmarking CSV_IO and BINARY_IO

Because the procedures in the CSV_IO module process the data in fairly complex ways before arranging them into the CSV file, they are expected to be slower than "raw" data writing.

Below are comparisons of the writing speed of the HEDTOOLS CSV_IO module, BINARY_IO data writing, and basic "raw" plain text file output.

Writing speed

The benchmarking program was compiled with the Intel Fortran compiler 17.4 on Linux and writes a data matrix of kind 8 (double precision) real values.

As the timings below show, CSV_IO can be quite slow with very large data files. However, the CSV_IO module provides high-level procedures that are very easy to use, and the complexity of the calling code does not seem to increase much as the data become more complex. In contrast, the amount of raw Fortran code would increase significantly if many variables of different types are to be saved.

TIMER: Writing data matrix 100 x 500 to CSV took 0.100000000000000s
TIMER: Writing data matrix 100 x 500 to BIN took ZERO TIME
TIMER: Writing data matrix 100 x 500 to TXT took 1.600000000000000E-002s

TIMER: Writing data matrix 1000 x 500 to CSV took 1.05600000000000s
TIMER: Writing data matrix 1000 x 500 to BIN took 1.200000000000001E-002s
TIMER: Writing data matrix 1000 x 500 to TXT took 0.160000000000000s

TIMER: Writing data matrix 10000 x 500 to CSV took 10.5520000000000s
TIMER: Writing data matrix 10000 x 500 to BIN took 4.400000000000048E-002s
TIMER: Writing data matrix 10000 x 500 to TXT took 1.46400000000000s

Unlike CSV and plain text output, the binary data write procedures (see the BINARY_IO module) are very fast. However, the binary format is not human-readable. CSV, in contrast, is both human-readable and easy to import into spreadsheets and statistical packages.

For reasonably large data, the writing speed is probably not an issue because saving the data takes only a few seconds anyway. If very large volumes of data are to be written, the CSV_IO module may not be optimal.
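
The relative speeds can be read off the TIMER output with a little arithmetic. The sketch below is a Python illustration only (it is not part of the Fortran benchmark); it copies the reported timings into a table and computes how many times slower one format is than another. The names TIMINGS, slowdown and values_per_second are made up for this example. With these numbers, CSV comes out roughly 6 to 7 times slower than plain text and roughly 90 to 240 times slower than binary.

```python
# Timings copied from the TIMER output above (seconds).
# None marks the "ZERO TIME" entry, which was below the timer resolution.
TIMINGS = {
    100:   {"CSV": 0.100,  "BIN": None,  "TXT": 0.016},
    1000:  {"CSV": 1.056,  "BIN": 0.012, "TXT": 0.160},
    10000: {"CSV": 10.552, "BIN": 0.044, "TXT": 1.464},
}
COLS = 500  # every benchmark matrix has 500 columns


def slowdown(rows, fmt, baseline):
    """Return how many times slower `fmt` is than `baseline`, or None if unmeasurable."""
    t, base = TIMINGS[rows][fmt], TIMINGS[rows][baseline]
    if t is None or base is None:
        return None
    return t / base


def values_per_second(rows, fmt):
    """Return the number of matrix elements written per second, or None if unmeasurable."""
    t = TIMINGS[rows][fmt]
    if t is None:
        return None
    return rows * COLS / t


for rows in sorted(TIMINGS):
    vs_txt = slowdown(rows, "CSV", "TXT")
    vs_bin = slowdown(rows, "CSV", "BIN")
    rate = values_per_second(rows, "CSV")
    vs_bin_text = "n/a" if vs_bin is None else f"{vs_bin:.0f}x"
    print(f"{rows:>6} x {COLS}: CSV is {vs_txt:.1f}x slower than TXT, "
          f"{vs_bin_text} slower than BIN, {rate:,.0f} values/s")
```

The CSV time grows about tenfold for each tenfold increase in the number of rows, so the cost per value stays roughly constant.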

Data file size

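Raw data file size: In the raw uncompressed form, binary data occupy less space than CSV: about 30% less for double precision (kind 8) and about 65% less for single precision (kind 4), according to the tables below.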

Compression efficiency: If the data files are compressed (gzip with default options was used for testing), CSV can take nearly the same disk space as the binary data, or even less. Binary data, on the other hand, hardly compress at all. This is expected for random test data, because truly random data cannot be compressed. Real non-random data are expected to have a better compression ratio. CSV data contain many repeated text separators (quotes, commas), so better compression is expected.
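
The compression behaviour described above can be reproduced on a small scale. The following Python sketch is an illustration under stated assumptions, not the benchmark itself: it fills a 100 x 500 matrix with uniform random numbers, builds a raw 8-byte binary image and a plain CSV text with 8 decimals per value (the number format that CSV_IO actually uses may differ), and compares gzip ratios. The helper name gzip_ratio is hypothetical.

```python
import gzip
import random
from array import array

random.seed(42)  # fixed seed so the sketch is repeatable
ROWS, COLS = 100, 500

# Uniform random doubles standing in for the benchmark matrix (an assumption).
matrix = [[random.random() for _ in range(COLS)] for _ in range(ROWS)]

# Raw binary image: 8 bytes per value, no separators.
binary = array("d", (x for row in matrix for x in row)).tobytes()

# Plain CSV text, 8 decimals per value; CSV_IO's real format may differ.
csv_text = "\n".join(
    ",".join(f"{x:.8f}" for x in row) for row in matrix
).encode("ascii")


def gzip_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (level 6, the gzip CLI default)."""
    return len(gzip.compress(payload, compresslevel=6)) / len(payload)


bin_ratio = gzip_ratio(binary)
csv_ratio = gzip_ratio(csv_text)

print(f"binary: {len(binary)} bytes, gzip ratio {bin_ratio:.2f}")
print(f"CSV:    {len(csv_text)} bytes, gzip ratio {csv_ratio:.2f}")
```

The random mantissa bytes of the binary image leave gzip almost nothing to exploit, whereas the CSV text uses only a dozen distinct characters, so each character can be coded in far fewer than 8 bits.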

Single precision (kind 4)

100 x 500:

Format   File size   Gzipped size
Binary   196K        176K
CSV      557K        190K
Text     587K        234K

1000 x 500:

Format   File size   Gzipped size
Binary   2.0M        1.8M
CSV      5.5M        1.9M
Text     5.8M        2.3M

10000 x 500:

Format   File size   Gzipped size
Binary   20M         18M
CSV      55M         19M
Text     58M         23M

Double precision (kind 8)

100 x 500:

Format   File size   Gzipped size
Binary   391K        369K
CSV      557K        190K
Text     587K        234K

1000 x 500:

Format   File size   Gzipped size
Binary   3.9M        3.6M
CSV      5.5M        1.9M
Text     5.8M        2.3M

10000 x 500:

Format   File size   Gzipped size
Binary   39M         36M
CSV      55M         19M
Text     58M         23M

Conclusions

  • Use CSV_IO and BINARY_IO for saving relatively complex data sets that would otherwise require a large amount of complicated raw Fortran code.
  • Use CSV_IO for two-dimensional data (e.g. statistical tables with "variables" versus "cases") that will later be imported into a stats package or browsed with a spreadsheet.
  • Use CSV_IO if the data file should be human-readable with any basic text editor.
  • Use CSV_IO if the (small) size of the data file is important and the data can be compressed (e.g. by a post-processing script).
  • Use CSV_IO if the data should be transferred from the computational platform via ssh: use the ssh compression option (-C) in such a case.
  • Use BINARY_IO to save a multi-dimensional (3D and more) data matrix.
  • Use BINARY_IO if full numeric precision is crucial. CSV_IO currently truncates high-precision real data (double, kind 8) to the default precision; this may be acceptable for normal intermediate outputs but can sometimes create a problem.
  • Use BINARY_IO if writing speed is a strong priority and there is no need for a human to browse the raw data (e.g. to substitute a potentially unlimited disk file for limited random access memory if the data array is really huge).
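
The precision point in the list above can be illustrated with a round trip. The Python sketch below is an assumption-laden illustration, not a description of how CSV_IO formats numbers: it stores a double-precision value once as its 8 raw bytes and once as text with about 7 significant digits (roughly what single precision can hold), then reads both back.

```python
import struct

x = 1.0 / 3.0  # a double-precision value with no short decimal representation

# Binary round trip: the 8 bytes of the double are stored and read back unchanged.
from_binary = struct.unpack("<d", struct.pack("<d", x))[0]

# Text round trip with about 7 significant digits, roughly single (kind 4) precision.
# This format is an assumption for illustration; CSV_IO's actual format may differ.
from_text = float(f"{x:.7g}")

print("binary round trip exact:", from_binary == x)
print("text round trip exact:  ", from_text == x)
print("text round trip error:  ", abs(from_text - x))
```

The binary copy is bit-for-bit identical, while the text copy differs from the original in the eighth significant digit. Whether that matters depends on how the saved data are used later.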

Links

Code

The source code of the benchmark is in csv_benchmark.f90.