---
title: "Introduction to the TSVIO Package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to the TSVIO Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Background

The TSVIO package provides a fast but simple interface for read-only access to (subsets of) potentially very large (many gigabytes) data matrices stored in plain text files.

## Data File Format

Data files are required to be plain text files containing lines with tab-separated data columns. Each line is separated into logical columns (fields) by tab characters.

The first line must contain *unique* labels for each data column. The first line may contain one fewer field than the remaining lines. Such files are often produced by R. Alternatively, the first line may contain the same number of fields as the remaining lines, in which case the first field on that line is ignored. Such files are often produced by anything other than R.

Every line (row) after the first must contain the same number of fields. The first field of each line must be a *unique* row label. (Row and column labels are treated separately and can have labels in common.)

`tsvio` assumes that the data file is static and does not change during an R session.

## Index File

Before data can be read from a data file, an index file containing the starting position of the data line for each row label must be generated. The index file can be generated explicitly by calling `tsvGenIndex`:

```{r, results='hide', message=FALSE, warning=FALSE, eval=FALSE}
tsvGenIndex(filename, indexfile)
```

Because the data file is assumed to be static, an index file, once created, does not change during an R session either.

The index file must be regenerated by the user whenever the data file changes. The `tsvio` package cannot detect that the data file has changed.
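
Because the package cannot detect a changed data file, a caller may want a guard of its own. The following is a minimal sketch, not part of `tsvio`: the helper name `index_is_stale` is hypothetical, and it assumes that file modification times are an acceptable proxy for "the data file has changed".

```r
# Hypothetical helper (not part of tsvio): report whether the index file is
# missing or older than the data file, judged by file modification times.
index_is_stale <- function(filename, indexfile) {
  if (!file.exists(filename)) {
    stop("data file does not exist: ", filename)
  }
  if (!file.exists(indexfile)) {
    return(TRUE)
  }
  file.mtime(indexfile) < file.mtime(filename)
}

# Self-contained demonstration: temporary files stand in for a real data
# file and its index.
datafile  <- tempfile(fileext = ".tsv")
indexfile <- tempfile(fileext = ".idx")
writeLines(c("c1\tc2", "r1\t1\t2"), datafile)

missing_index <- index_is_stale(datafile, indexfile)  # no index written yet

writeLines("placeholder", indexfile)
now <- Sys.time()
Sys.setFileTime(datafile,  now - 120)  # data file last changed two minutes ago
Sys.setFileTime(indexfile, now - 60)   # index written afterwards
fresh_index <- index_is_stale(datafile, indexfile)

Sys.setFileTime(datafile, now)         # simulate a later change to the data
stale_index <- index_is_stale(datafile, indexfile)
```

In a real session the check could be followed by `if (index_is_stale(filename, indexfile)) tsvGenIndex(filename, indexfile)`. Modification times can mislead (for example after files are copied or restored from a backup), so this is a convenience rather than a guarantee.
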
Using an outdated index file can produce erroneous results or a run-time error.

The data access functions described below can generate the index file automatically on first access. Depending on file permissions, this may allow the user to simply remove the index file whenever the data file is modified. A new index file will be generated on the next access (which will thus be slower than normal).

## Matrix Data Access

The function `tsvGetData` is used to read data as a matrix:

```{r, results='hide', message=FALSE, warning=FALSE, eval=FALSE}
tsvGetData(filename, indexfile, rowpatterns, colpatterns, dtype="", findany=TRUE)
```

`rowpatterns` is either `NULL` or a vector of row labels. If `NULL`, data from all lines in the file is returned. Otherwise, only data from rows matching an entry in `rowpatterns` is returned. Only exact matches are supported.

Similarly, `colpatterns` specifies which columns to return data for. Thus, the entire data matrix can be returned by specifying `NULL` for both `rowpatterns` and `colpatterns`.

The return value is always a data matrix with two dimensions. If `rowpatterns` or `colpatterns` contains a single element, the corresponding axis of the returned matrix is not 'dropped'. The standard R function `drop` can be used to delete any dimensions of length one if desired.

By default, if `rowpatterns` or `colpatterns` is not `NULL`, any specified labels not in the data file will be silently ignored and not included in the result. However, if there are no matching rows or no matching columns, `tsvGetData` will throw an error. Setting the optional parameter `findany` to `FALSE` will cause `tsvGetData` to throw an error if any specified label is not in the data file.

Rows and columns in the returned matrix will occur in the same order as they appear in `rowpatterns` and `colpatterns` respectively. Duplicate entries in `rowpatterns` or `colpatterns` will never match any label (and always result in an error if `findany` is `FALSE`).

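
The selection rules above can be tried on a small file. The sketch below writes a 3 x 3 matrix with base R in the layout R produces by default (a header with one fewer field), and then, only if a package named `tsvio` is installed, calls `tsvGetData` with the signature shown above. The file names and labels are made up for illustration, and the index file is assumed to be created automatically on first access.

```r
# Write a small tab-separated file in the layout that R produces by default:
# the header line has one fewer field than the data lines.
m <- matrix(1:9, nrow = 3,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))
datafile  <- tempfile(fileext = ".tsv")
indexfile <- paste0(datafile, ".idx")  # illustrative name for the index file
write.table(m, datafile, sep = "\t", quote = FALSE)

header_fields <- strsplit(readLines(datafile, n = 1), "\t", fixed = TRUE)[[1]]

# The calls below assume the package is installed under the name 'tsvio'.
if (requireNamespace("tsvio", quietly = TRUE)) {
  # Whole matrix: NULL selects every row and every column.
  all_data <- tsvio::tsvGetData(datafile, indexfile, NULL, NULL)

  # Two rows and one column; the result is still a two-dimensional matrix.
  sub <- tsvio::tsvGetData(datafile, indexfile, c("r3", "r1"), "c2")

  # With findany = FALSE the unknown label "nope" should raise an error
  # instead of being silently skipped; try() keeps the script running.
  strict <- try(
    tsvio::tsvGetData(datafile, indexfile, c("r1", "nope"), NULL,
                      findany = FALSE),
    silent = TRUE
  )
}

# drop() removes dimensions of length one, shown here on a plain R matrix.
one_col   <- m[, "c2", drop = FALSE]  # 3 x 1 matrix
as_vector <- drop(one_col)            # named vector of length 3
```

Applying `drop` to a single-column result of `tsvGetData` works the same way as on `one_col` here, since both are ordinary two-dimensional matrices.
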
### Matrix Data Type

The returned matrix will have the same mode as the `dtype` parameter, which can be a string, a numeric, or an integer. Only the type of the parameter matters; its value is ignored.

Returning a numeric or integer matrix can be much faster than returning a character matrix and then converting it. However, it requires all data elements in the data file to conform to that type; otherwise `tsvGetData` will throw an error.

## Row Data Access

The function `tsvGetLines` returns a subset of the lines in the data file as a string vector:

```{r, results='hide', message=FALSE, warning=FALSE, eval=FALSE}
tsvGetLines(filename, indexfile, patterns, findany=TRUE)
```

The string vector returned by `tsvGetLines` consists of the entire first line in the data file, followed by the entirety of every line whose row label occurs in `patterns`.

Unlike with `tsvGetData`, `patterns` cannot be `NULL`, and matching lines are ordered by their order in the data file, not by the order of their labels in `patterns`.

If `findany` is `TRUE`, labels in `patterns` that do not occur in the data file are ignored. If no labels match, an error is thrown.

If `findany` is `FALSE`, an error is thrown if any label in `patterns` has no matching row.
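
Because `tsvGetLines` returns raw lines, the caller has to split them into fields. As a minimal sketch, the hypothetical helper `lines_to_matrix` below (not part of `tsvio`) turns a string vector of the shape described above (header line first, then the matching data lines) into a character matrix, handling both header layouts. The sample vector is hand-written to stand in for a `tsvGetLines` result.

```r
# Hypothetical helper (not part of tsvio): split header-plus-data lines into
# a character matrix with row and column labels.
# Caveat: strsplit() drops a trailing empty field, so data lines that end in
# a tab would need extra handling.
lines_to_matrix <- function(lines) {
  stopifnot(length(lines) >= 2)
  fields <- strsplit(lines, "\t", fixed = TRUE)
  header <- fields[[1]]
  rows   <- fields[-1]
  width  <- length(rows[[1]])
  # If the header has as many fields as the data lines, its first field is
  # ignored, as described for the data file format.
  if (length(header) == width) {
    header <- header[-1]
  }
  stopifnot(length(header) == width - 1,
            all(lengths(rows) == width))
  body <- do.call(rbind, rows)
  out  <- body[, -1, drop = FALSE]
  dimnames(out) <- list(body[, 1], header)
  out
}

# Hand-written stand-in for the value of
# tsvGetLines(filename, indexfile, c("r3", "r1")): the header line first,
# then the matching lines in file order.
lines  <- c("c1\tc2\tc3", "r1\t1\t4\t7", "r3\t3\t6\t9")
parsed <- lines_to_matrix(lines)
```

In a real session `lines` would come from `tsvGetLines(filename, indexfile, patterns)`. When only a matrix is needed, calling `tsvGetData` directly avoids this parsing step.
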