interval

package
v0.0.0-...-9ba24aa Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 1, 2024 License: Apache-2.0 Imports: 14 Imported by: 0

Documentation

Overview

Package interval implements interval-union operations in a manner optimized

for sets of genomic coordinates represented by BED files.
(Note the 'union'.  Overlapping intervals are merged, not tracked
separately; it is currently necessary to use another package when that is not
the desired behavior.)
It assumes every position fits in a PosType, which is currently defined as
int32 since that's what BAM files are limited to.

Index

Constants

View Source
const PosTypeMax = math.MaxInt32

PosTypeMax is the maximum value that can be represented by a PosType.

Variables

This section is empty.

Functions

This section is empty.

Types

type BEDUnion

type BEDUnion struct {

	// RefNames maps reference IDs to names, when idMap is initialized.
	RefNames []string
	// contains filtered or unexported fields
}

BEDUnion represents the set of positions covered by the intervals in a BED file, in a manner that supports efficient iteration and intersection queries. (It does not keep track of interval names, etc.; just the final position set.)

func NewBEDUnion

func NewBEDUnion(reader io.Reader, opts NewBEDOpts) (bedUnion BEDUnion, err error)

NewBEDUnion loads just the intervals from a sorted (by first coordinate) interval-BED, merging touching/overlapping intervals and eliminating empty ones in the process. A BEDUnion is returned.

func NewBEDUnionFromEntries

func NewBEDUnionFromEntries(entries []Entry, opts NewBEDOpts) (bedUnion BEDUnion, err error)

NewBEDUnionFromEntries initializes a BEDUnion from a sorted []Entry. This ignores opts.OneBasedInput, since start0 is defined to be zero-based.

func NewBEDUnionFromPath

func NewBEDUnionFromPath(path string, opts NewBEDOpts) (bedUnion BEDUnion, err error)

NewBEDUnionFromPath is a wrapper for NewBEDUnion that takes a path instead of an io.Reader.

func (*BEDUnion) Clone

func (u *BEDUnion) Clone() (bedUnion BEDUnion)

Clone returns a new BEDUnion which shares the interval set, but has its own search state.

func (*BEDUnion) ContainsByID

func (u *BEDUnion) ContainsByID(refID int, pos PosType) bool

ContainsByID checks whether the (0-based) interval [pos, pos+1) is contained within the BEDUnion, where reference is specified by sam.Header ID.

func (*BEDUnion) ContainsByName

func (u *BEDUnion) ContainsByName(refName string, pos PosType) bool

ContainsByName checks whether the (0-based) interval [pos, pos+1) is contained within the BEDUnion, where reference is specified by name.

func (*BEDUnion) EndpointsByID

func (u *BEDUnion) EndpointsByID(refID int) []PosType

EndpointsByID returns the sorted interval-endpoint slice for the reference with the given ID. This must be treated as read-only; it may alias an internal data structure.

func (*BEDUnion) EndpointsByName

func (u *BEDUnion) EndpointsByName(refName string) []PosType

EndpointsByName has the same functionality as EndpointsByID, without requiring a *sam.Header at BEDUnion construction time.

func (*BEDUnion) IntersectionByID

func (u *BEDUnion) IntersectionByID(refID int, startPos, limitPos PosType) []PosType

IntersectionByID is similar to OverlapByID, except that the intersection between [startPos, limitPos) and the BEDUnion is returned (so the first and/or the last interval may be truncated). The return value must be treated as read-only.

func (*BEDUnion) Intersects

func (u *BEDUnion) Intersects(startRefID int, startPos PosType, limitRefID int, limitPos PosType) bool

Intersects checks whether the given contiguous possibly-multi-reference region intersects the interval set. References must be specified by ID. It panics if limitRefID:limitPos isn't after startRefID:startPos.

func (*BEDUnion) IntersectsByID

func (u *BEDUnion) IntersectsByID(refID int, startPos PosType, limitPos PosType) bool

IntersectsByID checks whether the given nonempty single-reference [startPos, limitPos) interval intersects the BEDUnion, where the reference is specified by ID.

func (*BEDUnion) OverlapByID

func (u *BEDUnion) OverlapByID(refID int, startPos, limitPos PosType) []PosType

OverlapByID is a low-level function which returns a []PosType describing all the BED intervals overlapping [startPos, limitPos) on the given reference, where the reference is specified by ID. For example, assuming positive startPos, the BED interval [0, startPos) would not be returned, while [0, startPos + 1) would. IMPORTANT: the return value must be treated as read-only; it may alias an internal data structure. The return slice is arranged such that the value at index 2n is the start of overlapping interval n, and the value at index (2n+1) is the end of that overlapping interval.

func (*BEDUnion) RefNameSet

func (u *BEDUnion) RefNameSet() map[string]bool

RefNameSet returns the set of reference names. This works even when the BEDUnion was constructed without a *sam.Header.

func (*BEDUnion) Subset

func (u *BEDUnion) Subset(startRefID int, startPos PosType, limitRefID int, limitPos PosType) (bedUnion BEDUnion)

Subset returns a new BEDUnion which describes the intersection between the original interval set and the given (possibly multi-reference) interval. The original BEDUnion must support ID-based lookup. Parameters are handled in the same way as Intersects().

type EndpointIndex

type EndpointIndex uint32

EndpointIndex is intended to represent the result of SearchPosTypes(endpoints, pos+1). NOTE THE "+1"! This is necessary to get SearchPosTypes to line up with our usual left-closed right-open intervals.

func ExpsearchPosType

func ExpsearchPosType(a []PosType, x PosType, idx EndpointIndex) EndpointIndex

ExpsearchPosType performs "exponential search" (https://en.wikipedia.org/wiki/Exponential_search ), checking a[idx], then a[idx + 1], then a[idx + 3], then a[idx + 7], etc., and finishing with binary search once it's either found an element larger than the target or has hit the end of the slice. It's usually a better choice than SearchPosTypes when iterating. (However, an inlined simple linear search may be better in practice. Can benchmark later if it matters.)

func NewEndpointIndex

func NewEndpointIndex(pos PosType, endpoints []PosType) EndpointIndex

NewEndpointIndex returns an EndpointIndex initialized to SearchPosTypes(endpoints, pos+1).

func SearchPosTypes

func SearchPosTypes(a []PosType, x PosType) EndpointIndex

SearchPosTypes returns the index of x in a[], or the position where x would be inserted if x isn't in a (this could be len(a)). It's exactly the same as sort.SearchInts(), except for PosType.

func (EndpointIndex) Begin

func (ei EndpointIndex) Begin() EndpointIndex

Begin returns:

  • the index for the beginning of the current interval, if we're inside an interval
  • otherwise, the index for the beginning of the next interval

func (EndpointIndex) Contained

func (ei EndpointIndex) Contained() bool

Contained returns whether we're inside an interval.

func (EndpointIndex) Finished

func (ei EndpointIndex) Finished(endpoints []PosType) bool

Finished returns whether we're past all the intervals.

func (*EndpointIndex) Update

func (ei *EndpointIndex) Update(newPos PosType, endpoints []PosType)

Update updates the EndpointIndex to refer to newPos, which cannot be smaller than the previous position referred to by this EndpointIndex. It is substantially faster than NewEndpointIndex when the position is increasing slowly.

type Entry

type Entry struct {
	RefName string
	Start0  PosType
	End     PosType
}

Entry represents a single interval, with 0-based coordinates.

func ParseRegionString

func ParseRegionString(region string) (result Entry, err error)

ParseRegionString parses a region string of one of the forms

[contig ID]:[1-based first pos]-[last pos]
[contig ID]:[1-based pos]
[contig ID]

returning a contig ID and 0-based interval boundaries. The interval [0, PosTypeMax - 1] is returned if there is no positional restriction.

type NewBEDOpts

type NewBEDOpts struct {
	// SAMHeader enables ID-based lookup.  (This is more convenient than
	// string-based lookup when using gbam.Shard.)
	SAMHeader *sam.Header
	// Invert causes the complement of the interval-union to be returned.  The
	// complement extends down to position -1 at the beginning of each
	// reference, and currently 2^31 - 2 inclusive at the end.  If SAMHeader is
	// provided, any reference mentioned in the SAMHeader but entirely absent
	// from the BED will be fully included.  Otherwise, only the references
	// mentioned in the BED file are included.  (A single empty interval
	// qualifies as a "mention" for the latter purpose.)
	Invert bool
	// OneBasedInput interprets the BED interval boundaries as one-based [start,
	// end] instead of the usual zero-based [start, end).
	OneBasedInput bool
}

NewBEDOpts defines behavior of this package's BED-loading function(s).

type PosType

type PosType int32

PosType is the type used to represent interval coordinates. int32 should be wide enough for some time to come, since that's what BAM is limited to.

(This, and PosTypeMax, should move to a more central package once an appropriate one exists. And then, when generics finally become part of the language *crosses fingers*, we can allow some applications to redefine this as uint32 or a 64-bit type as appropriate.)

type UnionScanner

type UnionScanner struct {
	// contains filtered or unexported fields
}

UnionScanner supports iteration over an interval-union. Invariants:

endpointIdx == SearchPosTypes(endpoints, pos+1)
Pos is either contained in an interval, or is PosTypeMax

func NewUnionScanner

func NewUnionScanner(endpoints []PosType) UnionScanner

NewUnionScanner returns a UnionScanner initialized to the first interval.

func (*UnionScanner) Pos

func (us *UnionScanner) Pos() PosType

Pos returns the next position to be iterated over, or PosTypeMax if there aren't any.

func (*UnionScanner) Scan

func (us *UnionScanner) Scan(start *PosType, end *PosType, limit PosType) bool

Scan is written so that the following loop can be used to iterate over all within-interval positions up to (and not including) limit:

for us.Scan(&start, &end, limit) {
  for pos := start; pos < end; pos++ {
    // ...do stuff with pos...
  }
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL