Genomedata

Hoffman MM, Buske OJ, Noble WS. 2010. The Genomedata format for storing large-scale functional genomics data. Bioinformatics, 26(11):1458-1459; doi:10.1093/bioinformatics/btq164

Genomedata is a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. A reference implementation, written in Python with C components, is available here under the GNU General Public License.

Installation

To install Genomedata, you must have HDF5 and Python 3.9 (or later, as required since release 1.7.2) installed on your system. Genomedata can then be installed using the following command on your Linux/Unix-based system*:

pip install genomedata
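
After installation, a quick check from Python can confirm that the interpreter and the installed package are what you expect. The following is a minimal sketch using only the standard library. The helper name `check_genomedata_install` is hypothetical, and its default minimum Python version follows the 1.7.2 release note further down this page; adjust it if you install an older release.

```python
import sys
from importlib import metadata


def check_genomedata_install(min_python=(3, 9)):
    """Return (python_ok, version); version is None if genomedata is absent.

    min_python is an assumption taken from the 1.7.2 release note.
    """
    # Compare only (major, minor) of the running interpreter.
    python_ok = sys.version_info[:2] >= tuple(min_python)
    try:
        # Reads the version recorded by pip for the installed distribution.
        version = metadata.version("genomedata")
    except metadata.PackageNotFoundError:
        version = None
    return python_ok, version


if __name__ == "__main__":
    python_ok, version = check_genomedata_install()
    print("Python version OK:", python_ok)
    print("genomedata version:", version or "not installed")
```

If the version prints as "not installed", the `pip install` step most likely ran against a different Python environment than the one you are checking from.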

For more detailed instructions on how to install Genomedata, see the documentation linked below.

* We have only tested this software on Linux and Mac systems. We would love to extend our support to other systems in the future, and we would gladly accept any contributions toward this end. Specifically, we have successfully installed Genomedata on the following platforms:

RHEL5/i686,x86_64
RHEL4/x86_64
Debian 4.0 (etch)/x86_64
Mac OS X/intel

Documentation

Genomedata is briefly described in the Bioinformatics application note cited and linked at the top of this page.

The application's documentation is available in three formats:

Source code

Assemblies

Reference assemblies can be downloaded from the National Center for Biotechnology Information (NCBI) FTP site, which hosts the current human reference genome assembly, GRCh38 (hg38).

In the latest release:

1.7.2:
* required Python is now >=3.9
* fixed consistency in array shape output when track indexing on bigWig files

1.7.1:
* fix array dimensionality consistency for summary statistics on bigWig files
* add debug representation for chromosomes for bigWig files

1.7.0:
* adapted existing python interface to open bigWig files

1.6.0:
* required Python is now >=3.7
* genomedata-load-data: changed to a python script with a c-extension

1.5.0:
* genomedata-load-data: fix bad error message when loading process fails
* genomedata-load-seq: add chromosome name mapping based on assembly reports

1.4.4:
* fixed pkg-config output encoding when finding HDF5 directories

1.4.3:
* fixed genomedata script entry points for Python 3

1.4.2:
* added compatibility for Python 3
* genomedata-load-seq: adjacent AGP entries are merged into a single supercontig
* Use pkg-config during setup to determine paths to HDF5 directories
* Removed forked-path dependency, added Path.py

1.4.1:
* genomedata-hard-mask: fix verbosity line not outputting to stderr
* genomedata-load-data: fix hdf5 group leak

1.4.0:
* genomedata-close-data: chunk metadata now truncates telomeres and trims large
  gaps between supercontigs
* genomedata-load-data: new option for masking data with --maskfile
* genomedata-hardmask: new command added to filter out track regions
* hardmask_data: new python interface to filter out track regions
* Genome: add ability to open archives for writing
* genomedata-load-seq: AGP are now correctly loaded regardless of filename and
  may be concatenated together
* genomedata-load-seq: fix assertion failure on argument parsing when loading
  fasta sequence (thanks to Kate Cook)
* genomedata-load: fix agp files not being recognized from this entry point
* docs: clarified that agp files cannot be combined
* docs: warned users that globs must be quoted to be parsed by genomedata-load

1.3.6:
* `sizes` command added to `genomedata-info` (Jay Hesselberth)
* Updated installation instructions for installing with PyTables 3.1.1
* toward python3 compatibility (Jay Hesselberth)
  - genomedata now requires python 2.7+
  - moved from `optparse` to `argparse` throughout
  - package-wide `__version__` lets modules report true version number
  - __future__ imports added to all modules and python3 `print()`
    functions

1.3.5:
* Removed platform specific builds from distribution

1.3.4:

* fixed bug related to updated PyTables
* compile works with HDF5 setups even when they were built
  --with-default-api-version=v16
* doc fixes
* fixed DeprecationWarnings associated with PyTables 3.0
* updated dependency to PyTables >= 3.0

1.3.3:

* genomedata-query: new command that prints data from a Genomedata archive for your
  non-Python scripting needs (thanks to Max Libbrecht)
* genomedata-histogram: new command that prints histograms from a Genomedata archive
  (combination of a new module by Max Libbrecht and an old module by Michael Hoffman)
* genomedata-info: add "contigs" subcommand (thanks to Max Libbrecht)
* genomedata-info: friendlier error when unsupported command name used
* genomedata-load-data: friendlier errors when invalid BED3+1/bedGraph data supplied
* genomedata-load-seq: always makes chromosome and supercontig
  coordinates with unsigned 32-bit integers instead of system int
* genomedata-load-data: more detailed error message when initial file open fails
* genomedata-load-data: bugfix
* now compile with -Wextra
* doc fixes

1.3.2:

* API: now allow array of tracks. For example: chromosome[245:270, array([7, 5])]
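
As an illustration of the track selectors these API notes describe (a name, an index, or a list or array of either), here is a minimal sketch of how such a selector could be resolved to column indices. The helper `resolve_track_columns` and the track names are hypothetical and are not taken from Genomedata's own code.

```python
def resolve_track_columns(tracknames, selector):
    """Map a track selector to a list of column indices.

    tracknames: ordered list of track names (hypothetical example data).
    selector: a track name, an integer index, or an iterable of either.
    """
    def resolve_one(item):
        if isinstance(item, str):
            # list.index raises ValueError for an unknown track name.
            return tracknames.index(item)
        index = int(item)
        # This sketch rejects negative and out-of-range indices.
        if not 0 <= index < len(tracknames):
            raise IndexError(f"track index out of range: {index}")
        return index

    if isinstance(selector, (str, int)):
        return [resolve_one(selector)]
    return [resolve_one(item) for item in selector]


tracknames = ["data1", "data2", "data3"]
print(resolve_track_columns(tracknames, ["data1", "data3"]))  # [0, 2]
print(resolve_track_columns(tracknames, [2, 0]))              # [2, 0]
```

The order of the selector is preserved, so the resulting columns come back in the order the tracks were requested.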

1.3.1:

* API: now allow lists of tracks when directly accessing chromosome data, for example:
  chromosome[245:270, ["data1", "data3"]] or chromosome[245:270, [7, 5]]
* genomedata-load-seq: add --assembly option which supports AGP files,
  to avoid loading seq while still dealing with assembly gaps
  properly
* genomedata-load: now supports --assembly and --sizes options
* genomedata-load-assembly: alias for genomedata-load-seq.
  genomedata-load-seq will be deprecated in the future
* genomedata-load-data: now support DOS-style line endings ("\r\n")
* genomedata-load: print genomedata-load-data error code on failure
* genomedata-load-data: print more informative messages when ignoring data
* genomedata-load: all diagnostic messages now go to stderr
* genomedata-load: some diagnostics now include timestamp so we can
  see where performance bottlenecks are
* genomedata-load: more descriptive error messages
* genomedata-load-seq: print more descriptive error message when
  attempting to load sequence from a non-FASTA file
* genomedata-load: fixed issue 10: now compiles on gcc 4.6.2
* docs: add links to source code
* docs: genomedata-load: sequence "option" is mandatory. In a future
  version, we should change this to an argument to reflect this.
* test: add tests for DOS-style line-endings

1.3.0:

* genomedata supercontigs are no longer guaranteed to have seq data
* add --sizes option to genomedata-load-seq, to avoid loading seq
* Genome.add_track_continuous() has a significant performance
  improvement. This also means that genomedata-open-data will run much
  faster, as well as genomedata-load-data on fresh tracks
* fix bug where genomedata-load-seq didn't work
* fix bug where directory genomedata archive didn't work with only one chromosome

1.2.3:

* allow use with PyTables >=2.2
* new command: genomedata-info: "genomedata-info tracknames ARCHIVE"
  prints the tracknames for ARCHIVE
* Genome.format_version will now return 0 when files are missing a
  genomedata_format_version attribute
* Genome.__init__: future-proof against later file format versions by throwing an error
* tests: add regression tests, lots of changes
* docs: add man pages

1.2.2:

* genomedata-load: will now support track filenames with "=" in the names
* genomedata-load: now supports UNIX glob wildcards as arguments to -s
* genomedata-load-data: allow other delimiters besides space for
  variableStep and fixedStep, allow wiggle_0 track specification
* genomedata-load-data, genomedata-load: remove unused --chunk-size option
* genomedata-close-data: fix bug where chunk_starts, chunk_ends not
  written for supercontigs with zero present data
* installation: move from path.py to forked-path
* docs: fixed small errors
* various: removed exclamation marks from error messages. It's not *that* exciting.
* some portability improvements
* tests: improve unit test interface

1.2.1:

* Fixed an installation bug where HDF5 installations later in
  LIBRARY_PATH might override those specified first, leading to
  linking errors during build.

Example scripts

genomedata_random_access.py : Given genomic positions on stdin, prints the corresponding data values for a set of tracks in a Genomedata collection.
genomedata_offline_random_access.py : Similar to genomedata_random_access.py, except the full set of input positions is first read, sorted, and then Genomedata is scanned for these locations. Close to constant-time performance in the number of input positions.
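
The read, sort, then scan approach described for the offline script can be sketched as follows. This is an illustration of the idea only, not the script's actual code: the function `offline_lookup` and the `fetch` callable are hypothetical, with `fetch` standing in for a read from a Genomedata archive.

```python
def offline_lookup(positions, fetch):
    """Look up values for all positions, visiting storage in sorted order.

    positions: list of (chromosome_name, position) tuples in input order.
    fetch: callable (chromosome_name, position) -> value (hypothetical
        stand-in for reading from an archive).
    Returns the values in the original input order.
    """
    # Sort input indices by genomic coordinate, keeping each index so the
    # results can be written back in the caller's original order.
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    results = [None] * len(positions)
    for i in order:
        chrom, pos = positions[i]
        results[i] = fetch(chrom, pos)
    return results


# Usage with an in-memory dict standing in for the archive.
store = {("chr1", 10): 0.5, ("chr1", 300): 1.5, ("chr2", 50): 2.5}
visited = []


def fetch(chrom, pos):
    visited.append((chrom, pos))  # record the order of storage accesses
    return store[(chrom, pos)]


positions = [("chr2", 50), ("chr1", 300), ("chr1", 10)]
print(offline_lookup(positions, fetch))  # [2.5, 1.5, 0.5]
print(visited)  # [('chr1', 10), ('chr1', 300), ('chr2', 50)]
```

Because accesses happen in coordinate order, the storage is traversed in a single forward pass per chromosome, while the caller still receives values in the order the positions were supplied.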

Support

There is a moderated genomedata-announce mailing list that you can subscribe to for information on new releases of Genomedata.

There is also a genomedata-users mailing list for general discussion and questions about the use of the Genomedata system.

If you want to report a bug or request a feature, please do so using the Genomedata issue tracker.

For other support with Genomedata, or to provide feedback, please e-mail Michael. We are interested in all comments regarding the package, its ease of installation, and its documentation.
