htsget

Module version: v0.0.0-...-768b56e
Published: Sep 24, 2018 License: Apache-2.0

README

htsget on GCS

This repository contains an implementation of the htsget protocol that provides access to reads data stored in Google Cloud Storage buckets.

Currently, only BAM is supported and the BAM index file must be colocated with the BAM file (that is, sample.bam and sample.bam.bai must be in the same GCS bucket).

CRAM support will be added in the very near future.

Note that this is an Early Access Preview release. If you use this software for production workloads, please be careful and help us by reporting problems using the issue tracker.

Quick start using Docker

A Docker image containing the compiled htsget-server binary is available at gcr.io/genomics-tools/htsget.

If you already have Docker installed, you can start the server and make it available to the host by running:

$ docker run -d -P gcr.io/genomics-tools/htsget

To determine which host port the server has been mapped to, use the docker port command. By default, the server can only access public data sources (see below for more information on secured access).

Quick start using AppEngine

Google AppEngine provides a secure and automatically scalable way to run the htsget server. To get started, clone the htsget source code and deploy the application in the appengine directory.

$ export GOPATH=$PWD
$ go get github.com/googlegenomics/htsget/appengine
$ gcloud app deploy src/github.com/googlegenomics/htsget/appengine

Once the deployment completes, you can make authorized htsget requests to https://your-project-id.appspot.com/.

The appengine application will serve requests to any GCS bucket by default. This behavior can be modified using the settings found in the app.yaml file.

Building the server

To build the server, you will need the Go toolchain (version 1.8 or later).

Once Go is installed, you can build the server by running:

$ go get github.com/googlegenomics/htsget/htsget-server

This will produce a binary in $GOPATH/bin called htsget-server.

Usage

You can use htsget-server in one of two modes:

  • Insecure mode for public resources. When used in this way, the htsget-server does not use TLS and does not authenticate the requests it makes to Google Cloud Storage. This is useful if you want to access public data via the htsget protocol on a machine that is already running htslib-based tools (like samtools).

  • Secure mode with authentication. In this mode, the server requires a TLS certificate and key to be passed as command line flags. It will then listen on all interfaces and accept requests secured via TLS. Each request must contain an OAuth2 Bearer Access Token which will be used to fetch data from GCS.

Required file layout

In either mode, read requests identify the bucket and object (file) to read. As an example, /reads/testing/123.bam will cause the server to try to access the GCS bucket 'testing' and read two objects: 123.bam and 123.bam.bai. The index file MUST be in the same bucket and have the .bai suffix.

Running the server

Insecure mode

$ bin/htsget-server --port=1234 &
$ samtools flagstat http://localhost:1234/reads/public-bucket/test.bam

This will use htsget to retrieve data from 'test.bam' stored in the GCS bucket 'public-bucket'.

Secure mode

$ bin/htsget-server --secure=true --port=443 --https_cert=server.crt --https_key=server.key &
$ export CURL_CA_BUNDLE=server.crt
$ export HTS_AUTH_LOCATION=/path/to/my-oauth2-token
$ samtools flagstat https://localhost/reads/private-bucket/test.bam

The files server.crt and server.key can be generated using the generate_cert tool that comes with Go, or using openssl.
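
For local testing, a self-signed pair can also be produced with a few lines of Go using only the standard library. The sketch below is an assumption-laden example rather than a replacement for generate_cert: selfSignedCert is a hypothetical helper, the ECDSA P-256 key type and the 30-day validity are arbitrary choices, and whether your deployment accepts an EC key should be checked before relying on it.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"time"
)

// selfSignedCert returns a PEM-encoded self-signed certificate and EC private
// key for the given DNS name. Hypothetical helper, for local testing only.
func selfSignedCert(host string, validDays int) (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	// Random 128-bit serial number.
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return nil, nil, err
	}
	now := time.Now()
	template := x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: host},
		DNSNames:              []string{host},
		NotBefore:             now,
		NotAfter:              now.AddDate(0, 0, validDays),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true, // lets the same file act as the trusted CA bundle
	}
	// Using the template as its own parent makes the certificate self-signed.
	der, err := x509.CreateCertificate(rand.Reader, &template, &template, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

func main() {
	certPEM, keyPEM, err := selfSignedCert("localhost", 30)
	if err != nil {
		fmt.Fprintln(os.Stderr, "generating certificate:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("server.crt", certPEM, 0644); err != nil {
		fmt.Fprintln(os.Stderr, "writing server.crt:", err)
		os.Exit(1)
	}
	if err := os.WriteFile("server.key", keyPEM, 0600); err != nil {
		fmt.Fprintln(os.Stderr, "writing server.key:", err)
		os.Exit(1)
	}
	fmt.Println("wrote server.crt and server.key")
}
```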

Note that you will require versions of samtools and htslib that support the environment variables used above (CURL_CA_BUNDLE and HTS_AUTH_LOCATION). This support was added in October of 2017.
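
If you are writing your own client instead of using samtools, the authenticated request can be sketched in Go as follows. This is a minimal illustration under stated assumptions: newReadsRequest is a hypothetical helper, the token value is a placeholder, and the request is only constructed here, not sent. Actually sending it would also require an http.Client that trusts the server certificate.

```go
package main

import (
	"fmt"
	"net/http"
)

// newReadsRequest builds a GET request for /reads/<bucket>/<object> on the
// given server and attaches the OAuth2 access token as a Bearer credential.
// Hypothetical helper for illustration only.
func newReadsRequest(baseURL, bucket, object, token string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/reads/"+bucket+"/"+object, nil)
	if err != nil {
		return nil, err
	}
	// The server forwards this token when it fetches data from GCS.
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	// "example-token" is a placeholder; a real token would be read from a
	// secure location such as the file named by HTS_AUTH_LOCATION.
	req, err := newReadsRequest("https://localhost", "private-bucket", "test.bam", "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("Authorization:", req.Header.Get("Authorization"))
}
```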

Bucket Whitelist

In both secure and insecure mode, the list of buckets from which the server is allowed to read can be restricted by passing a comma-separated list of buckets via the --buckets flag. If the --buckets flag is not specified, there is no restriction on the buckets from which the server can read.

Known Issues

  • The server isn't very efficient at limiting what reads are returned. This is an area we are actively working to improve (see issue #7).

  • Filters on fields are ignored. The server does not implement any filtering beyond read range and reference name filters. We do not currently plan to add support for this. If this is important to you, please file an issue and let us know.
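
Since reference name and range are the only filters applied, a client request that narrows the result might be built as sketched below. The query parameter names referenceName, start, and end come from the htsget specification, where start is 0-based inclusive and end is exclusive. The helper readsURL, the host, the bucket, and the reference name chr1 are hypothetical values chosen for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// readsURL builds an htsget reads URL restricted to one reference sequence and
// a 0-based, half-open [start, end) range.
// Hypothetical helper for illustration only.
func readsURL(baseURL, bucket, object, referenceName string, start, end uint32) string {
	q := url.Values{}
	q.Set("referenceName", referenceName)
	q.Set("start", strconv.FormatUint(uint64(start), 10))
	q.Set("end", strconv.FormatUint(uint64(end), 10))
	// Encode sorts parameters by key, so the output order is deterministic.
	return baseURL + "/reads/" + bucket + "/" + object + "?" + q.Encode()
}

func main() {
	fmt.Println(readsURL("http://localhost:1234", "public-bucket", "test.bam", "chr1", 1000, 2000))
	// prints: http://localhost:1234/reads/public-bucket/test.bam?end=2000&referenceName=chr1&start=1000
}
```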

Directories

  • Package api implements the htsget readset retrieval API.

  • This binary provides an htsget client that supports Google authentication.

  • This binary provides an htsget server that backs onto resources in GCS.

  • internal/analytics: Package analytics provides functions for sending data to Google Analytics.

  • internal/bam: Package bam provides support for parsing BAM files.

  • internal/bcf: Package bcf contains support for parsing BCF files.

  • internal/bgzf: Package bgzf provides support for parsing BGZF files.

  • internal/binary: Package binary provides support for operating on binary data.

  • internal/csi: Package csi contains support for processing the information in a CSI file (http://samtools.github.io/hts-specs/CSIv1.pdf).

  • internal/genomics: Package genomics contains definitions related to Genomic data.

  • internal/sam: Package sam provides support for parsing SAM files.
