bamprovider

package
v0.0.0-...-d966d87 Latest Latest
Warning

This package is not in the latest version of its module.

Published: Aug 18, 2020 License: Apache-2.0 Imports: 21 Imported by: 8

Documentation

Overview

Package bamprovider provides utilities for scanning a BAM/PAM file in parallel.

Provider is an interface for reading a BAM or PAM file in parallel.

PairIterator is implemented on top of Provider to combine read pairs (R1+R2).

Example (Shardedread)

Example of reading a BAM file in parallel.

package main

import (
	"runtime"
	"sync"

	gbam "github.com/grailbio/bio/encoding/bam"
	"github.com/grailbio/bio/encoding/bamprovider"
	"github.com/grailbio/testutil"
)

func main() {
	path := testutil.GetFilePath("//go/src/grail.com/bio/encoding/bam/testdata/170614_WGS_LOD_Pre_Library_B3_27961B_05.merged.10000.bam")
	provider := bamprovider.NewProvider(path)
	shards, err := provider.GenerateShards(bamprovider.GenerateShardsOpts{})
	if err != nil {
		panic(err)
	}
	shardCh := gbam.NewShardChannel(shards)

	wg := sync.WaitGroup{}
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for shard := range shardCh {
				iter := provider.NewIterator(shard)
				for iter.Scan() {
					// use iter.Record
				}
				if err := iter.Close(); err != nil {
					panic(err)
				}
			}
		}()
	}
	wg.Wait()
	if err := provider.Close(); err != nil {
		panic(err)
	}
}
Output:
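
The worker-pool pattern in the example above can be tried without a BAM file. The following is a minimal sketch using only the standard library. The shard type and the newShardChannel helper are hypothetical stand-ins, and the sketch assumes that gbam.NewShardChannel behaves like a pre-filled, closed channel that workers can range over.

```go
package main

import (
	"fmt"
	"sync"
)

// shard is a hypothetical stand-in for gbam.Shard: a half-open range
// [start, limit) of record indices.
type shard struct{ start, limit int }

// newShardChannel returns a closed channel pre-filled with all shards, so
// that any number of workers can range over it until it is drained.
// This mirrors the assumed behavior of gbam.NewShardChannel.
func newShardChannel(shards []shard) <-chan shard {
	ch := make(chan shard, len(shards))
	for _, s := range shards {
		ch <- s
	}
	close(ch)
	return ch
}

// countRecords sums the sizes of all shards using nWorkers goroutines.
func countRecords(shards []shard, nWorkers int) int {
	ch := newShardChannel(shards)
	var (
		mu    sync.Mutex
		total int
		wg    sync.WaitGroup
	)
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			local := 0
			for s := range ch {
				local += s.limit - s.start
			}
			// Merge the per-worker count under a lock.
			mu.Lock()
			total += local
			mu.Unlock()
		}()
	}
	wg.Wait()
	return total
}

func main() {
	shards := []shard{{0, 100}, {100, 250}, {250, 300}}
	fmt.Println(countRecords(shards, 4)) // prints 300
}
```

Each worker keeps a local count and merges it once, which keeps lock contention low regardless of how many shards a worker happens to pick up.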

Constants

const (
	// DefaultBytesPerShard is the default value for GenerateShardsOpts.BytesPerShard
	DefaultBytesPerShard = int64(128 << 20)
	// DefaultMinBasesPerShard is the default value for GenerateShardsOpts.MinBasesPerShard
	DefaultMinBasesPerShard = 10000
)
const DefaultMaxPairSpan = 1000

Variables

This section is empty.

Functions

func FinishBoundedPairIterators

func FinishBoundedPairIterators(iters []*BoundedPairIterator) error

FinishBoundedPairIterators should be called after reading all pairs, or on error-exit. If any iterators are still open, this assumes error-exit occurred and just closes them. Otherwise, it returns an error iff there are some unpaired reads.

func FinishPairIterators

func FinishPairIterators(iters []*PairIterator) error

FinishPairIterators should be called after reading all pairs. It returns an error if there are some unpaired reads.

func RefByName

func RefByName(h *sam.Header, refName string) *sam.Reference

RefByName finds a sam.Reference with the given name. It returns nil if a reference is not found.
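
The nil-on-miss contract can be illustrated with a small sketch. The reference type below is a hypothetical stand-in for sam.Reference, and the linear scan is an assumption made for illustration, not a description of how the package performs the lookup.

```go
package main

import "fmt"

// reference is a hypothetical, simplified stand-in for sam.Reference.
type reference struct {
	name   string
	length int
}

// refByName returns the first reference with the given name, or nil if no
// reference matches. Callers must check for nil before dereferencing.
func refByName(refs []*reference, name string) *reference {
	for _, r := range refs {
		if r.name == name {
			return r
		}
	}
	return nil
}

func main() {
	refs := []*reference{{"chr1", 248956422}, {"chr2", 242193529}}
	if r := refByName(refs, "chr2"); r != nil {
		fmt.Println(r.name, r.length)
	}
	fmt.Println(refByName(refs, "chrUn") == nil)
}
```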

Types

type BAMProvider

type BAMProvider struct {
	// Path of the *.bam file. Must be nonempty.
	Path string
	// Index is the pathname of the *.bam.bai file. If "", it defaults to Path + ".bai".
	Index string
	// contains filtered or unexported fields
}

BAMProvider implements Provider for BAM files. Both the BAM and the index filenames are allowed to be S3 URLs, in which case the data will be read from S3. Otherwise the data will be read from the local filesystem.

func (*BAMProvider) Close

func (b *BAMProvider) Close() error

Close implements the Provider interface.

func (*BAMProvider) FileInfo

func (b *BAMProvider) FileInfo() (FileInfo, error)

FileInfo implements the Provider interface.

func (*BAMProvider) GenerateShards

func (b *BAMProvider) GenerateShards(opts GenerateShardsOpts) ([]gbam.Shard, error)

GenerateShards implements the Provider interface.

func (*BAMProvider) GetFileShards

func (b *BAMProvider) GetFileShards() ([]gbam.Shard, error)

GetFileShards implements the Provider interface.

func (*BAMProvider) GetHeader

func (b *BAMProvider) GetHeader() (*sam.Header, error)

GetHeader implements the Provider interface.

func (*BAMProvider) NewIterator

func (b *BAMProvider) NewIterator(shard gbam.Shard) Iterator

NewIterator implements the Provider interface.

type BoundedPairIterator

type BoundedPairIterator struct {
	// contains filtered or unexported fields
}

BoundedPairIterator is a deterministic alternative to PairIterator, usable in settings where distant mates (and read-pairs with unmapped read(s)) can be ignored. ("Distant mates" are defined in the BoundedPairIteratorOpts documentation.)

func NewBoundedPairIterators

func NewBoundedPairIterators(provider Provider, opts BoundedPairIteratorOpts) (iters []*BoundedPairIterator, err error)

NewBoundedPairIterators returns a slice of BoundedPairIterators covering the provider's BAM/PAM.

func (*BoundedPairIterator) Record

func (bpi *BoundedPairIterator) Record() Pair

Record returns the current pair, or an error.

REQUIRES: Scan() has been called and its last call returned true.

func (*BoundedPairIterator) Scan

func (bpi *BoundedPairIterator) Scan() bool

Scan reads the next record. It returns true if a record has been read, and false on end of data stream.

func (*BoundedPairIterator) Shard

func (bpi *BoundedPairIterator) Shard() gbam.Shard

Shard returns the shard covered by this iterator. Note that this shard's StartSeq and EndSeq values are not guaranteed to be zero.

type BoundedPairIteratorOpts

type BoundedPairIteratorOpts struct {
	DuplicateShardCrossers bool
	MaxPairSpan            int
	TargetParallelism      int
}

BoundedPairIteratorOpts controls the behavior of NewBoundedPairIterators().

  • There are two modes:
      • In the default mode, each valid read-pair is returned by exactly one iterator, dependent on min(r1Start, r2Start). All read-pairs returned by iterator 0 start before all read-pairs returned by iterator 1, etc.
      • In the DuplicateShardCrossers=true mode, read-pairs which span multiple shards are returned by all of those shard-iterators. This facilitates computation of position-based stats (e.g. bio-pileup).
  • For the purpose of these iterators, a read-pair has distant mates if either (i) the reads are mapped to different chromosomes, or (ii) max(r1End, r2End) - min(r1Start, r2Start) is greater than MaxPairSpan. If MaxPairSpan is zero, DefaultMaxPairSpan is used. No read-pairs with distant mates are returned.
  • The function attempts to return an iterator-slice of length TargetParallelism. If TargetParallelism is zero, runtime.NumCPU() is used. Occasionally, the returned slice will have a different length: e.g. if we're working with a BAM with only one read-pair, there's no point in subdividing it.
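
The distant-mates rule stated above translates directly into a predicate. The following is a minimal sketch: the read type is a hypothetical simplification (a reference ID plus start and end positions), and only the two conditions and the MaxPairSpan default come from the text.

```go
package main

import "fmt"

// defaultMaxPairSpan mirrors the documented DefaultMaxPairSpan constant.
const defaultMaxPairSpan = 1000

// read is a hypothetical, simplified alignment record.
type read struct {
	refID      int
	start, end int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}

// hasDistantMates reports whether the pair maps to different references or
// spans more than maxPairSpan bases. A zero maxPairSpan selects the default.
func hasDistantMates(r1, r2 read, maxPairSpan int) bool {
	if maxPairSpan == 0 {
		maxPairSpan = defaultMaxPairSpan
	}
	if r1.refID != r2.refID {
		return true
	}
	span := maxInt(r1.end, r2.end) - minInt(r1.start, r2.start)
	return span > maxPairSpan
}

func main() {
	near := hasDistantMates(read{0, 100, 250}, read{0, 300, 450}, 0)
	far := hasDistantMates(read{0, 100, 250}, read{0, 1200, 1350}, 0)
	cross := hasDistantMates(read{0, 100, 250}, read{1, 120, 270}, 0)
	fmt.Println(near, far, cross) // prints: false true true
}
```

In the default mode the same min(r1Start, r2Start) value used in the span calculation also decides which iterator returns the pair.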

type FileInfo

type FileInfo struct {
	// ModTime is the last-modified time of the file. For PAM, it represents the
	// modtime of the *.index file of the first shard in the directory.
	ModTime time.Time
	// Size is the size of the file, in bytes.  For PAM, it represents the size of
	// the *.index file of the first shard in the directory.
	Size int64
}

FileInfo stores metadata of BAM or PAM.

type FileType

type FileType int

FileType represents the type of a BAM-like file.

const (
	// Unknown is a sentinel.
	Unknown FileType = iota
	// BAM file
	BAM
	// PAM file
	PAM
)

func GuessFileType

func GuessFileType(path string) FileType

GuessFileType returns the file type from the pathname and/or contents. Returns Unknown on error.

func ParseFileType

func ParseFileType(name string) FileType

ParseFileType parses the file type string. "bam" returns bamprovider.BAM, for example. On error, it returns Unknown.

Example
package main

import (
	"fmt"

	"github.com/grailbio/bio/encoding/bamprovider"
)

func main() {
	fmt.Printf("%d\n", bamprovider.ParseFileType("bam"))
	fmt.Printf("%d\n", bamprovider.ParseFileType("pam"))
	fmt.Printf("%d\n", bamprovider.ParseFileType("invalid"))
}
Output:

1
2
0

type GenerateShardsOpts

type GenerateShardsOpts struct {
	// Strategy defines sharding strategy.
	Strategy ShardingStrategy

	Padding int
	// IncludeUnmapped causes GenerateShards() to produce shards for the
	// unmapped && mate-unmapped reads.
	IncludeUnmapped bool

	// SplitUnmappedCoords allows GenerateShards() to split unmapped
	// reads into multiple shards. Setting this flag true will cause shard
	// size to be more even, but the caller must be able to handle split
	// unmapped reads.
	SplitUnmappedCoords bool

	// SplitMappedCoords allows GenerateShards() to split mapped reads of
	// the same <refid, alignment position> into multiple shards. Setting
	// this flag true will cause shard size to be more even, but the caller
	// must be able to handle split reads.
	SplitMappedCoords bool

	// AlwaysSplitMappedAndUnmappedCoords causes GenerateShards() to always split
	// shards at the boundary of mapped and unmapped reads.
	AlwaysSplitMappedAndUnmappedCoords bool

	// BytesPerShard is the target shard size, in bytes. This is consulted only in
	// ByteBased sharding strategy.
	BytesPerShard int64

	// NumShards is the target shard count. It is consulted by the ByteBased sharding
	// strategy, and is ignored if BytesPerShard is set.
	NumShards int

	// MinBasesPerShard defines the minimum number of bases in each shard. This is
	// consulted only in ByteBased sharding strategy.
	MinBasesPerShard int
}

GenerateShardsOpts defines behavior of Provider.GenerateShards.
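
The interaction between BytesPerShard, NumShards, and DefaultBytesPerShard can be sketched as a precedence rule. The helper below is hypothetical. It assumes that an explicit BytesPerShard wins, that NumShards is otherwise turned into a per-shard byte target by dividing the file size, and that the default applies when neither is set. The actual ByteBased implementation may resolve these options differently.

```go
package main

import "fmt"

// defaultBytesPerShard mirrors the documented DefaultBytesPerShard (128 MiB).
const defaultBytesPerShard = int64(128 << 20)

// shardOpts is a hypothetical subset of GenerateShardsOpts.
type shardOpts struct {
	bytesPerShard int64
	numShards     int
}

// targetBytesPerShard resolves the byte target for one shard under the
// assumed precedence: bytesPerShard, then numShards, then the default.
func targetBytesPerShard(fileSize int64, o shardOpts) int64 {
	if o.bytesPerShard > 0 {
		return o.bytesPerShard
	}
	if o.numShards > 0 {
		n := int64(o.numShards)
		return (fileSize + n - 1) / n // round up so n shards cover the file
	}
	return defaultBytesPerShard
}

func main() {
	const fileSize = int64(1 << 30) // 1 GiB
	fmt.Println(targetBytesPerShard(fileSize, shardOpts{}))                                 // 134217728
	fmt.Println(targetBytesPerShard(fileSize, shardOpts{numShards: 16}))                    // 67108864
	fmt.Println(targetBytesPerShard(fileSize, shardOpts{bytesPerShard: 1 << 20, numShards: 16})) // 1048576
}
```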

type Iterator

type Iterator interface {
	// Scan reports whether any records remain in the iterator, and if so,
	// advances the iterator to the next record. If the iterator
	// reaches the end of its range, Scan() returns false.  If an error
	// occurs, Scan() returns false and the error can be retrieved by
	// calling Err().
	//
	// Scan and Record always yield records in the ascending coordinate
	// (refid,position) order.
	//
	// REQUIRES: Close has not been called.
	Scan() bool

	// Record returns the current record in the iterator. This must be
	// called only after a call to Scan() returns true.
	//
	// REQUIRES: Close has not been called.
	Record() *sam.Record

	// Err returns the error encountered during iteration, or nil if no error
	// occurred.  An io.EOF error will be translated to nil.
	Err() error

	// Close must be called exactly once. It returns the value of Err().
	Close() error
}

Iterator iterates over sam.Records in a particular genomic range, in coordinate order. Thread compatible.
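
The Scan/Record/Err/Close contract follows the same shape as bufio.Scanner: loop while Scan returns true, then check the error once at the end. The sketch below uses a hypothetical sliceIterator that yields ints instead of *sam.Record and can simulate a read failure. It is an illustration of the calling convention, not the package's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// sliceIterator is a hypothetical iterator with the same method shape as
// Iterator, yielding ints from a slice.
type sliceIterator struct {
	recs   []int
	next   int // index of the record the next Scan call will expose
	failAt int // index at which to simulate a read error; -1 disables
	err    error
}

// Scan advances to the next record. It returns false at end of data or on
// error; the two cases are distinguished by Err.
func (it *sliceIterator) Scan() bool {
	if it.err != nil || it.next >= len(it.recs) {
		return false
	}
	if it.next == it.failAt {
		it.err = errors.New("simulated read error")
		return false
	}
	it.next++
	return true
}

// Record returns the record exposed by the last successful Scan.
func (it *sliceIterator) Record() int { return it.recs[it.next-1] }

// Err returns the error that stopped iteration, if any.
func (it *sliceIterator) Err() error { return it.err }

// Close returns the value of Err, as the Iterator contract describes.
func (it *sliceIterator) Close() error { return it.err }

// sum drains the iterator and returns the total and the error from Close.
func sum(it *sliceIterator) (int, error) {
	total := 0
	for it.Scan() {
		total += it.Record()
	}
	return total, it.Close()
}

func main() {
	total, err := sum(&sliceIterator{recs: []int{1, 2, 3}, failAt: -1})
	fmt.Println(total, err) // 6 <nil>
	total, err = sum(&sliceIterator{recs: []int{1, 2, 3}, failAt: 2})
	fmt.Println(total, err) // 3 simulated read error
}
```

Because Scan returns false both at end of data and on failure, a caller that skips the Close (or Err) check would silently treat a truncated read as complete.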

func NewErrorIterator

func NewErrorIterator(err error) Iterator

NewErrorIterator creates an Iterator that yields no record and returns "err" in Err and Close.

func NewRefIterator

func NewRefIterator(p Provider, refName string, start, limit int) Iterator

NewRefIterator creates an iterator for the half-open range [refName:start, refName:limit). Start and limit are both zero-based. The iterator will yield reads whose start positions are in the given range.
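
The half-open, zero-based convention means a read starting exactly at start is included and a read starting exactly at limit is not. A minimal sketch of that filter, using hypothetical helper names and plain integer positions in place of sam.Record values:

```go
package main

import "fmt"

// startsInRange reports whether a zero-based start position lies in the
// half-open interval [start, limit).
func startsInRange(pos, start, limit int) bool {
	return pos >= start && pos < limit
}

// filterByStart keeps the positions that fall in [start, limit).
func filterByStart(positions []int, start, limit int) []int {
	var out []int
	for _, p := range positions {
		if startsInRange(p, start, limit) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	positions := []int{99, 100, 150, 199, 200}
	fmt.Println(filterByStart(positions, 100, 200)) // [100 150 199]
}
```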

type MissingMateError

type MissingMateError struct {
	Message string
}

MissingMateError is a specific error that can be used when one or more mates are missing.

func (MissingMateError) Error

func (mme MissingMateError) Error() string
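
Because MissingMateError is a concrete struct type with a value receiver, callers can detect it with errors.As. The sketch below defines a local missingMateError with the same shape so that it runs without the package. Whether the package returns this error wrapped or unwrapped is not stated in the documentation, so the wrapping shown here is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// missingMateError is a local stand-in with the same shape as
// bamprovider.MissingMateError.
type missingMateError struct {
	Message string
}

func (e missingMateError) Error() string { return e.Message }

// isMissingMate reports whether err, or any error it wraps, is a
// missingMateError.
func isMissingMate(err error) bool {
	var mme missingMateError
	return errors.As(err, &mme)
}

func main() {
	wrapped := fmt.Errorf("finishing iterators: %w", missingMateError{Message: "1 unpaired read"})
	fmt.Println(isMissingMate(wrapped))                  // true
	fmt.Println(isMissingMate(errors.New("io failure"))) // false
}
```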

type PAMProvider

type PAMProvider struct {
	// Path prefix. Must be nonempty.
	Path string
	// Opts is passed to pam.NewReader.
	Opts pam.ReadOpts
	// contains filtered or unexported fields
}

PAMProvider reads PAM files. The path can be an S3 URL, in which case the data will be read from S3. Otherwise the data will be read from the local filesystem.

func (*PAMProvider) Close

func (p *PAMProvider) Close() error

Close implements the Provider interface.

func (*PAMProvider) FileInfo

func (p *PAMProvider) FileInfo() (FileInfo, error)

FileInfo implements the Provider interface.

func (*PAMProvider) GenerateShards

func (p *PAMProvider) GenerateShards(opts GenerateShardsOpts) ([]gbam.Shard, error)

GenerateShards implements the Provider interface.

func (*PAMProvider) GetFileShards

func (p *PAMProvider) GetFileShards() ([]gbam.Shard, error)

GetFileShards implements the Provider interface.

func (*PAMProvider) GetHeader

func (p *PAMProvider) GetHeader() (*sam.Header, error)

GetHeader implements the Provider interface.

func (*PAMProvider) NewIterator

func (p *PAMProvider) NewIterator(shard gbam.Shard) Iterator

NewIterator implements the Provider interface.

type Pair

type Pair = gbam.Pair

Pair encapsulates a pair of SAM records for a pair of reads, and whether any error was encountered in retrieving them.

type PairIterator

type PairIterator struct {
	// contains filtered or unexported fields
}

PairIterator reads matched pairs of records from a BAM or PAM file. Use NewPairIterators to create an iterator.

func NewPairIterators

func NewPairIterators(provider Provider, includeUnmapped bool) ([]*PairIterator, error)

NewPairIterators creates a set of PairIterators. A PairIterator yields pairs of records in the BAM or PAM data corresponding to primary alignments for paired reads. Records will not be included if they represent secondary or supplementary alignments (based on SAM flags). Pairs that have both reads unmapped will not be included unless includeUnmapped is true.

The pairs in the BAM file will be randomly sharded across the PairIterators created by this function. Pairs are returned in an unspecified order, even within one PairIterator. (Use BoundedPairIterator instead if you want deterministic behavior and do not need to process distant mates.)

Each PairIterator is thread-compatible. It is recommended to create one goroutine for each iterator.

Example

Example_pairiterators is an example of NewPairIterators.

package main

import (
	"sync"

	"github.com/grailbio/bio/encoding/bamprovider"
	"github.com/grailbio/testutil"
)

func main() {
	bamPath := testutil.GetFilePath("//go/src/grail.com/bio/encoding/bam/testdata/170614_WGS_LOD_Pre_Library_B3_27961B_05.merged.10000.bam")
	provider := bamprovider.NewProvider(bamPath)
	iters, err := bamprovider.NewPairIterators(provider, true)
	if err != nil {
		panic(err)
	}

	wg := sync.WaitGroup{}
	for _, iter := range iters {
		wg.Add(1)
		go func(iter *bamprovider.PairIterator) {
			defer wg.Done()
			for iter.Scan() {
				p := iter.Record()
				if p.Err != nil {
					panic(p.Err)
				}
				// use p.R1 and p.R2
			}
		}(iter)
	}
	wg.Wait()
	if err := bamprovider.FinishPairIterators(iters); err != nil {
		panic(err)
	}
}
Output:

func (*PairIterator) Record

func (l *PairIterator) Record() Pair

Record returns the current pair, or an error.

REQUIRES: Scan() has been called and its last call returned true.

func (*PairIterator) Scan

func (l *PairIterator) Scan() bool

Scan reads the next record. It returns true if a record has been read, and false on end of data stream.

type Provider

type Provider interface {
	// FileInfo returns metadata of the underlying file(s).
	//
	// TODO(saito) consider merging GetHeader into FileInfo.
	FileInfo() (FileInfo, error)

	// GetHeader returns the header for the provided BAM data.  The caller
	// must not modify the returned header object.
	//
	// REQUIRES: Close has not been called.
	GetHeader() (*sam.Header, error)

	// GenerateShards prepares for parallel reading of genomic data.
	//
	// The Shards split the BAM data from the given provider into
	// contiguous, non-overlapping genomic intervals. A SAM record is
	// associated with a shard if its alignment start position is within the
	// given padding distance of the shard. This means reads near shard
	// boundaries may be associated with more than one shard.
	//
	// Use NewIterator to read records in a shard.
	//
	// REQUIRES: Close has not been called.
	GenerateShards(opts GenerateShardsOpts) ([]gbam.Shard, error)

	// GetFileShards describes how records are split into files. For BAM, this
	// function just returns a UniversalShard since all records are in one
	// file. For PAM, this function returns one entry per fileshard.
	//
	// REQUIRES: Close has not been called.
	GetFileShards() ([]gbam.Shard, error)

	// NewIterator returns an iterator over records contained in the shard.  The
	// "shard" parameter is usually produced by GenerateShards, but the caller may
	// also manually construct it.
	//
	// REQUIRES: Close has not been called.
	NewIterator(shard gbam.Shard) Iterator

	// Close must be called exactly once. It returns any error encountered
	// by the provider, or any iterator created by the provider.
	//
	// REQUIRES: All the iterators created by NewIterator have been closed.
	Close() error
}

Provider allows reading a BAM or PAM file in parallel. Thread safe.
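
The GenerateShards comment notes that, with padding, a read near a shard boundary may be associated with more than one shard. The sketch below illustrates that effect with a hypothetical single-reference shard type. It assumes that padding extends each shard's range by the same amount on both sides, which is one plausible reading of "within the given padding distance" and not a confirmed detail of gbam.Shard.

```go
package main

import "fmt"

// shard is a hypothetical half-open range [start, limit) on one reference.
type shard struct{ start, limit int }

// shardsFor returns the indices of all shards whose padded range
// [start-padding, limit+padding) contains pos.
func shardsFor(pos int, shards []shard, padding int) []int {
	var out []int
	for i, s := range shards {
		if pos >= s.start-padding && pos < s.limit+padding {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	shards := []shard{{0, 1000}, {1000, 2000}, {2000, 3000}}
	fmt.Println(shardsFor(500, shards, 50)) // [0]
	fmt.Println(shardsFor(990, shards, 50)) // [0 1]
	fmt.Println(shardsFor(990, shards, 0))  // [0]
}
```

A caller that aggregates per-read statistics across shards therefore needs either zero padding or a rule for deduplicating reads that appear in two shards.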

func NewFakeProvider

func NewFakeProvider(header *sam.Header, recs []*sam.Record) Provider

NewFakeProvider creates a provider that returns "header" in response to a GetHeader() call, and "recs" in response to GenerateShards+NewIterator calls.

func NewProvider

func NewProvider(path string, optList ...ProviderOpts) Provider

NewProvider creates a Provider object that can handle the BAM or PAM file at "path". The file type is autodetected from the path.

type ProviderOpts

type ProviderOpts struct {
	// Index specifies the name of the BAM index file. This field is meaningful
	// only for BAM files. If Index=="", it defaults to path + ".bai".
	Index string

	// DropFields causes the listed fields not to be filled in sam.Record. This
	// option is recognized only by the PAM reader.
	DropFields []gbam.FieldType
}

ProviderOpts defines options for NewProvider.
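
The documented default for Index (path + ".bai" when the field is empty) amounts to the following rule. The indexPath helper is hypothetical and exists only to make the defaulting explicit.

```go
package main

import "fmt"

// indexPath returns the BAM index location: the explicit index if one is
// given, otherwise the BAM path with ".bai" appended.
func indexPath(path, index string) string {
	if index == "" {
		return path + ".bai"
	}
	return index
}

func main() {
	fmt.Println(indexPath("sample.bam", ""))                  // sample.bam.bai
	fmt.Println(indexPath("sample.bam", "custom/sample.bai")) // custom/sample.bai
}
```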

type ShardingStrategy

type ShardingStrategy int

ShardingStrategy defines algorithms used by Provider.GenerateShards.

const (
	// Automatic picks some good strategy. In practice, it means ByteBased for
	// PAM, PositionBased for BAM.
	Automatic ShardingStrategy = iota
	// ByteBased strategy partitions the file so that each shard has roughly equal
	// number of bytes.
	ByteBased
	// PositionBased strategy partitions the file so that each shard covers a
	// genomic coordinate range [<startref,startpos>, <limitref,limitpos>) with
	// uniform width, i.e., the value of (limitpos - startpos) is uniform across
	// shards.
	PositionBased
)
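
To make the PositionBased description concrete, the sketch below cuts a single reference of a given length into consecutive half-open ranges of equal width. The posRange type and uniformRanges helper are hypothetical, and letting the final range be shorter when the length is not a multiple of the width is an assumption; the package's handling of reference ends and of multiple references is not described here.

```go
package main

import "fmt"

// posRange is a hypothetical half-open coordinate range [start, limit)
// on a single reference.
type posRange struct{ start, limit int }

// uniformRanges splits [0, refLen) into consecutive ranges of the given
// width. The last range is truncated at refLen.
func uniformRanges(refLen, width int) []posRange {
	if width <= 0 {
		return nil
	}
	var out []posRange
	for start := 0; start < refLen; start += width {
		limit := start + width
		if limit > refLen {
			limit = refLen
		}
		out = append(out, posRange{start, limit})
	}
	return out
}

func main() {
	fmt.Println(uniformRanges(2500, 1000)) // [{0 1000} {1000 2000} {2000 2500}]
}
```

Uniform coordinate width does not imply uniform work: regions with deep coverage hold more records than sparse ones, which is the imbalance the ByteBased strategy is meant to avoid.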
