Title: | Simple Utilities for Tab-Separated-Value (TSV) Files |
---|---|
Description: | Utilities for rapidly loading specified rows and/or columns of data from large tab-separated value (tsv) files (large: e.g. 1 GB file of 10000 x 10000 matrix). 'tsvio' is an R wrapper to 'C' code that creates an index file for the rows of the tsv file, and uses that index file to collect rows and/or columns from the tsv file without reading the whole file into memory. |
Authors: | Bradley M Broom [aut], Mary A Rohrdanz [ctb, cre], Chris Wakefield [ctb], James Melott [ctb], Paul Hsieh [ctb], MD Anderson Cancer Center [cph] |
Maintainer: | Mary A Rohrdanz <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.6 |
Built: | 2024-11-07 03:56:21 UTC |
Source: | https://github.com/MD-Anderson-Bioinformatics/tsvio |
Simple Utilities for Tab Separated Value (TSV) Files
Utilities for indexing and rapidly loading (subsets of) data from (large) tab-separated value (TSV) files. Each TSV file must have a unique row label in the first column of each line and unique column labels in the first line of the file. Files may be formatted in either spreadsheet/Unix format (same number of fields on each line) or R format (one fewer field on the first line only). All entries of the data matrix in a file are expected to have the same data type. (The row and column labels are always expected to be strings.)
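The two accepted header layouts can be produced with base R's `write.table`. The sketch below is only an illustration and does not call 'tsvio'; the variable names and the helper `count_fields` are made up for this example.

```r
# Small example matrix with unique row and column labels.
df <- data.frame(C1 = c(1, 2), C2 = c(3, 4))
rownames(df) <- c("R1", "R2")

# R format: the header line has one fewer field than the data lines.
r_file <- tempfile("rformat")
write.table(df, file = r_file, sep = "\t", quote = FALSE,
            row.names = TRUE, col.names = TRUE)

# Spreadsheet format: col.names = NA adds an empty leading header field,
# so every line has the same number of fields.
ss_file <- tempfile("spreadsheet")
write.table(df, file = ss_file, sep = "\t", quote = FALSE,
            row.names = TRUE, col.names = NA)

# Hypothetical helper: count tab-separated fields on each line of a file.
count_fields <- function(path) {
  lengths(strsplit(readLines(path), "\t", fixed = TRUE))
}
r_counts <- count_fields(r_file)    # header is one field shorter
ss_counts <- count_fields(ss_file)  # all lines have equal field counts
```

The examples on this page use the R format, which is what `write.table` emits when `row.names = TRUE` and `col.names = TRUE`.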
tsvGenIndex, tsvGetLines, tsvGetData
This function reads a TSV file and produces an index to the start of each row. The TSV file is required to have a header line and at least one data line. The header line may contain either the same number or one fewer columns than the data lines, which must all contain the same number of columns. The first column of each data line will be indexed.
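To make the idea of a row index concrete, here is a minimal base R sketch that maps each row label to the byte offset of its line and then seeks directly to one row. It is an assumption-laden illustration of the general technique, not the index format or algorithm used by the package's 'C' code; `build_offsets` and `read_row` are hypothetical helpers, and the sketch reads the whole file to build the offsets, which a real indexer would avoid.

```r
# Write a small TSV file in binary mode so every line ends in a single "\n".
datafile <- tempfile("data")
con <- file(datafile, "wb")
writeLines(c("C1\tC2", "R1\tFoo\tBar", "R2\tBoing\tBoing", "R3\tThe\tEnd"), con)
close(con)

# Hypothetical helper: map each row label to the byte offset of its line.
build_offsets <- function(path) {
  lines <- readLines(path)
  # Each line occupies its byte length plus one newline byte.
  starts <- cumsum(c(0, nchar(lines, type = "bytes") + 1))[seq_along(lines)]
  labels <- sub("\t.*$", "", lines)
  # Drop the header line; only data rows are indexed.
  setNames(starts[-1], labels[-1])
}

# Hypothetical helper: seek to a stored offset and read exactly one line.
read_row <- function(path, offset) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = offset)
  readLines(con, n = 1)
}

offsets <- build_offsets(datafile)
row <- read_row(datafile, offsets[["R3"]])
```

Because the stored offsets are byte positions, any later change to the data file would invalidate them, which is consistent with the requirement that the data file must not change after indexing.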
tsvGenIndex(filename, indexfile)
filename |
The name (and path) of the file(s) containing the data to index. |
indexfile |
The name (and path) of the file(s) to which the index will be written. There must be exactly one index file for every filename. |
NULL. This function generates an index file.
tsvGetLines, tsvGetData
datafile <- tempfile("data"); df <- data.frame(C1 = c("Foo", "Boing", "The"), C2 = c("Bar", "Boing", "End")); rownames(df) <- c("R1", "R2", "R3"); write.table(df, file = datafile, sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE); indexfile <- tempfile("index"); tsvGenIndex(datafile, indexfile)
This function reads the data in the rows and columns that match the given patterns from a TSV file, with the assistance of a pre-computed index file to the start of each row.
tsvGetData( filename, indexfile, rowpatterns, colpatterns, dtype = "", findany = TRUE )
filename |
The name (and path) of the file containing the data to read. |
indexfile |
The name (and path) of the index file previously written by tsvGenIndex for this data file. |
rowpatterns |
A vector of strings to match against the index entries. Only lines with keys that exactly match at least one pattern string are returned. If rowpatterns is NULL, data from all rows is returned. |
colpatterns |
A vector of strings to match against the column headers in the first row. |
dtype |
A prototype element that specifies by example the type of matrix to return. Only the type of the parameter is used; its value is ignored. Accepted types are string (default), numeric (float), and integer. |
findany |
If FALSE, all patterns must be matched. If TRUE (default), at least one pattern must match. |
The index file must have been created by tsvGenIndex and the data file must not have changed since the index file was created.
A matrix containing one row for each matched line and one column for each matched column.
tsvGenIndex, tsvGetLines
datafile <- tempfile("data"); df <- data.frame(C1 = c("Foo", "Boing", "The"), C2 = c("Bar", "Boing", "End")); rownames(df) <- c("R1", "R2", "R3"); write.table(df, file = datafile, sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE); indexfile <- tempfile("index"); tsvGenIndex(datafile, indexfile); tsvGetData(datafile, indexfile, c("R1", "R3"), c("C2"))
This function reads lines that match the given patterns from a TSV file with the assistance of a pre-computed index file to the start of each row.
tsvGetLines(filename, indexfile, patterns, findany = TRUE)
filename |
The name (and path) of the file containing the data to read. |
indexfile |
The name (and path) of the index file previously written by tsvGenIndex for this data file. |
patterns |
A vector of strings to match against the index entries. Only lines with keys that exactly match at least one pattern string are returned. |
findany |
If FALSE, all patterns must be matched. If TRUE (default), at least one pattern must match. |
The index file must have been created by tsvGenIndex and the data file must not have changed since the index file was created.
A string vector whose first element is the first line from the data file. Subsequent elements of the vector are lines from the data file whose labels match an entry in patterns.
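Since the returned vector holds raw tab-separated lines with the header first, it can be parsed with base R's `read.table`. The sketch below does not call 'tsvio'; it assumes the vector `tsv_lines` is what tsvGetLines would return for the example file on this page with patterns `c("R1", "R3")`.

```r
# Assumed return value: the header line followed by the matching data lines.
tsv_lines <- c("C1\tC2", "R1\tFoo\tBar", "R3\tThe\tEnd")

# The header has one fewer field than the data lines (R format), so
# read.table uses the first column as row names.
parsed <- read.table(text = tsv_lines, sep = "\t", header = TRUE,
                     quote = "", stringsAsFactors = FALSE)

# Look up a single cell by row and column label.
cell <- parsed["R3", "C2"]
```

For a file in spreadsheet format, where the header has the same number of fields as the data lines, passing `row.names = 1` to `read.table` would be needed to get the same result.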
tsvGenIndex, tsvGetData
datafile <- tempfile("data"); df <- data.frame(C1 = c("Foo", "Boing", "The"), C2 = c("Bar", "Boing", "End")); rownames(df) <- c("R1", "R2", "R3"); write.table(df, file = datafile, sep = "\t", quote = FALSE, row.names = TRUE, col.names = TRUE); indexfile <- tempfile("index"); tsvGenIndex(datafile, indexfile); tsvGetLines(datafile, indexfile, c("R1", "R3"))