Tabular data format

1/15/2024

Modern biological science generates a vast amount of data, the analysis of which presents a major challenge to researchers. Data are commonly represented in tables stored as plain text files and require line-by-line parsing for analysis, which is time consuming and error prone. Furthermore, there is no simple means of indexing these files so that rows containing particular values can be quickly found. We introduce a new data format and software library called wormtable, which provides efficient access to tabular data in Python. Wormtable stores data in a compact binary format, provides random access to rows, and enables sophisticated indexing on columns within these tables. Files written in existing formats can be easily converted to wormtable format, and we provide conversion utilities for the VCF and GTF formats. Wormtable’s simple API allows users to process large tables orders of magnitude more quickly than is possible when parsing text. Furthermore, the indexing facilities provide efficient access to subsets of the data along with providing useful methods of summarising columns. Since third-party libraries or custom code are no longer needed to parse complex plain text formats, analysis code can also be substantially simpler as well as being uniform across different data formats. These benefits of reduced code complexity and greatly increased performance allow users much greater freedom to explore their data.

Despite the ever increasing volumes of data being processed in bioinformatics, the methods used are almost entirely based on plain text files. Data is usually encoded in lines of text, with each row consisting of a series of tab-delimited values. These files are easy to view and interpret and can be processed on any platform with the minimum of library dependencies. Using compression, text files can be quite compact, and specialised indexing methods are available to retrieve specific rows, for example rows which intersect with a given genomic interval.

It is not sufficient, however, to simply store and retrieve data. This is the major flaw in using text files as a data format: before we can perform calculations, we must first parse the encoded information into native machine values. This is a computationally expensive process, and compression (if it is used) adds substantial overhead. As a result, simple calculations over a large dataset may take many hours to complete.

Another problem with tables stored as text files is that it is difficult to index the information in particular columns. This means that many operations on a table require a complete scan through the file. The only viable means of working with a subset of the data, therefore, is to create another file consisting of the subset of interest. This is inflexible and error prone, and multiplies already significant storage requirements.

The obvious solution to these problems is to load tabular data into a relational database. Databases store values in binary form so that parsing is not required, and support efficient retrieval and indexing. However, there are many problems with this approach. Relational databases are complex systems, each supporting different features and SQL dialects. It is not a straightforward task to design a schema for a particular dataset, particularly not if portability across different databases is required. Similarly, accessing data requires a knowledge of SQL. In the most common case, a database server must be maintained and so storage and user permissions must be carefully managed. All of these aspects require significant expertise, which is why so many researchers and programs continue to use textual data formats.

A relational database server is far more than we require in the majority of cases. Data files are usually written once and not subsequently expected to change. Thus, storing this information in a relational database with its sophisticated concurrency control is entirely unnecessary. Centralised storage of datasets creates an unnecessary administration burden, as does an extra layer of user management.
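To make the contrast between parsing text and reading binary values concrete, here is a minimal sketch using only the Python standard library. It is not wormtable's file format or API. The three-column table, the fixed-width record layout and all function names are assumptions made purely for illustration. It shows the same rows stored as tab-delimited text and as packed binary records, random access to a row by byte offset, and a sorted index over one column that avoids a full scan.

```python
import bisect
import struct

# Hypothetical three-column table: chromosome id, position, quality score.
ROWS = [(1, 1500, 30.5), (1, 900, 12.0), (2, 400, 50.25), (2, 2500, 8.0)]

# Text form: one tab-delimited line per row.
text_table = "".join(f"{c}\t{p}\t{q}\n" for c, p, q in ROWS)


def parse_text(table):
    """Every read must split each line and convert each field."""
    rows = []
    for line in table.splitlines():
        chrom, pos, qual = line.split("\t")
        rows.append((int(chrom), int(pos), float(qual)))
    return rows


# Binary form: fixed-width records (an assumed layout, not wormtable's).
# "<BId" = little-endian uint8, uint32, float64 with no padding.
RECORD = struct.Struct("<BId")
binary_table = b"".join(RECORD.pack(*row) for row in ROWS)


def read_row(table, k):
    """Random access: row k starts at byte offset k * RECORD.size."""
    return RECORD.unpack_from(table, k * RECORD.size)


def build_index(table, column):
    """Sorted list of (value, row_number) pairs for one column."""
    n = len(table) // RECORD.size
    return sorted((read_row(table, k)[column], k) for k in range(n))


def rows_in_range(table, index, low, high):
    """Rows whose indexed value lies in [low, high), via binary search."""
    start = bisect.bisect_left(index, (low,))
    stop = bisect.bisect_left(index, (high,))
    return [read_row(table, k) for _, k in index[start:stop]]
```

For example, `rows_in_range(binary_table, build_index(binary_table, 1), 500, 2000)` returns the rows with positions between 500 and 2000 in sorted order without touching the other rows, whereas the text form would need `parse_text` to scan and convert every line first. A real format would also have to handle variable-length and missing values, which this sketch ignores.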
