snp

package
v0.0.0-...-d966d87

This package is not in the latest version of its module.
Published: Aug 18, 2020 License: Apache-2.0 Imports: 29 Imported by: 1

Documentation

Overview

Copyright 2020 Grail Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Constants

const (
	FieldCounts = 1 << iota
	FieldPerReadA
	FieldPerReadC
	FieldPerReadG
	FieldPerReadT
	FieldPerReadAny = FieldPerReadA | FieldPerReadC | FieldPerReadG | FieldPerReadT
)
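The Field* constants form a bitset, so a caller can test for individual fields with a bitwise AND. The sketch below redeclares the constants locally so that it runs on its own; `hasPerRead` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Local mirror of the flag constants listed above; real code would use the
// ones exported by the snp package.
const (
	FieldCounts = 1 << iota
	FieldPerReadA
	FieldPerReadC
	FieldPerReadG
	FieldPerReadT
	FieldPerReadAny = FieldPerReadA | FieldPerReadC | FieldPerReadG | FieldPerReadT
)

// hasPerRead is a hypothetical helper: it reports whether any per-read
// feature array is flagged as present.
func hasPerRead(fieldsPresent uint32) bool {
	return fieldsPresent&FieldPerReadAny != 0
}

func main() {
	fields := uint32(FieldCounts | FieldPerReadG)
	fmt.Println(fields&FieldCounts != 0) // counts are flagged
	fmt.Println(hasPerRead(fields))      // at least one per-read array is flagged
	fmt.Println(FieldPerReadAny)         // numeric value of the combined mask
}
```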
const FaEncoding = fasta.Seq8

FaEncoding is the fasta in-memory encoding expected by snp.Pileup(). (Seq8 is actually worse than both ASCII and Base5 for this SNP-pileup, but it simplifies future extension to indels.)

const PosTypeMax = pileup.PosTypeMax

PosTypeMax is the maximum value that can be represented by a PosType.

Variables

var DefaultOpts = Opts{
	Clip:        0,
	FlagExclude: 0xf00,
	Mapq:        60,
	MaxReadLen:  500,
	MaxReadSpan: 511,
	MinBagDepth: 0,
	MinBaseQual: 0,
	Parallelism: 0,
	PerStrand:   false,
	RemoveSq:    false,
	Stitch:      false,
}
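The default FlagExclude value 0xf00 is the union of four SAM flag bits: secondary alignment (0x100), failed quality checks (0x200), duplicate (0x400), and supplementary alignment (0x800). The sketch below shows the usual exclude-mask test. That the package applies the mask in exactly this way is an assumption, and `excluded` is a hypothetical helper.

```go
package main

import "fmt"

// SAM flag bits, as defined in the SAM specification.
const (
	flagSecondary     = 0x100
	flagQCFail        = 0x200
	flagDuplicate     = 0x400
	flagSupplementary = 0x800
)

// excluded is a hypothetical helper showing the conventional exclude-mask
// test: a read is dropped when it has any bit of the mask set.
func excluded(readFlags, flagExclude int) bool {
	return readFlags&flagExclude != 0
}

func main() {
	const flagExclude = 0xf00 // same value as DefaultOpts.FlagExclude

	// The mask is exactly the union of the four bits above.
	fmt.Println(flagExclude == flagSecondary|flagQCFail|flagDuplicate|flagSupplementary)

	// 0x63: paired, proper pair, mate on reverse strand, first in pair.
	fmt.Println(excluded(0x63, flagExclude))
	// 0x463: the same read, additionally marked as a duplicate.
	fmt.Println(excluded(0x463, flagExclude))
}
```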

Functions

func ChrId

func ChrId(chr string) (int, error)

ChrId returns the index for the given chromosome string in bio/pileup format.

func ConvertPileupRowsToBasestrandRio

func ConvertPileupRowsToBasestrandRio(ctx context.Context, tmpFiles []*os.File, mainPath string, refNames []string) (err error)

func ConvertPileupRowsToBasestrandTSV

func ConvertPileupRowsToBasestrandTSV(ctx context.Context, tmpFiles []*os.File, mainPath string, colBitset int, bgzip bool, parallelism int, refNames []string, refSeqs []string) (err error)

func ConvertPileupRowsToTSV

func ConvertPileupRowsToTSV(ctx context.Context, tmpFiles []*os.File, mainPath string, colBitset int, bgzip bool, parallelism int, refNames []string, refSeqs []string) (err error)

func MarshalPileupRow

func MarshalPileupRow(scratch []byte, p interface{}) ([]byte, error)

Serialized format:

[0..4): fieldsPresent
[4..8): refID
[8..12): pos
[12..16): depth
if counts present, stored in next 40 bytes
if perRead[pileup.baseA] present, length stored in next 4 bytes, then
  values stored in next 6*n bytes
if perRead[pileup.baseC] present... etc.

This is essentially the simplest format that can support the variable-length per-read feature arrays that are needed. It is not difficult to decrease the nominal size of these records by (i) using varints instead of uint32s, and (ii) making fieldsPresent indicate which counts[][] values are nonzero and only storing those; but I wouldn't expect that to be worth the additional complexity since all uses of this marshal function are bundled with the "zstd 1" transformer anyway. (Instead, all the 'extra' complexity in this function concerns (i) avoiding extra allocations and (ii) avoiding a ridiculous number of spurious bounds-checks, in ways that make sense for a wide variety of other serialization functions.)

In the future, we may need to add indel support.
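To make the layout above concrete, here is a minimal sketch that writes and reads the 16-byte fixed header and computes the record size implied by the layout. It is not the package's implementation: the `header` type and the functions `putHeader`, `getHeader`, and `recordLen` are hypothetical, and little-endian byte order is an assumption, since the description above does not state one.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// header mirrors the four fixed uint32 fields at the start of a record.
type header struct {
	FieldsPresent, RefID, Pos, Depth uint32
}

const headerLen = 16

// putHeader encodes h into a new 16-byte slice.
// Little-endian byte order is an assumption of this sketch.
func putHeader(h header) []byte {
	buf := make([]byte, headerLen)
	binary.LittleEndian.PutUint32(buf[0:4], h.FieldsPresent)
	binary.LittleEndian.PutUint32(buf[4:8], h.RefID)
	binary.LittleEndian.PutUint32(buf[8:12], h.Pos)
	binary.LittleEndian.PutUint32(buf[12:16], h.Depth)
	return buf
}

// getHeader decodes the fixed header, rejecting short input.
func getHeader(buf []byte) (header, error) {
	if len(buf) < headerLen {
		return header{}, fmt.Errorf("short record: %d bytes", len(buf))
	}
	return header{
		FieldsPresent: binary.LittleEndian.Uint32(buf[0:4]),
		RefID:         binary.LittleEndian.Uint32(buf[4:8]),
		Pos:           binary.LittleEndian.Uint32(buf[8:12]),
		Depth:         binary.LittleEndian.Uint32(buf[12:16]),
	}, nil
}

// recordLen returns the size implied by the layout: 16 header bytes, 40
// bytes of counts if present, and for each present per-read array a 4-byte
// length followed by 6 bytes per entry.
func recordLen(countsPresent bool, perReadLens []int) int {
	n := headerLen
	if countsPresent {
		n += 40
	}
	for _, l := range perReadLens {
		n += 4 + 6*l
	}
	return n
}

func main() {
	in := header{FieldsPresent: 1, RefID: 2, Pos: 12345, Depth: 30}
	out, err := getHeader(putHeader(in))
	fmt.Println(out == in, err)
	// Counts plus two per-read arrays holding 3 and 0 entries.
	fmt.Println(recordLen(true, []int{3, 0}))
}
```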

func Pileup

func Pileup(ctx context.Context, xampath, fapath, format, outPrefix string, rawOpts *Opts, fa fasta.Fasta) (err error)

func ReadBaseStrandTsvIntoChannel

func ReadBaseStrandTsvIntoChannel(reader *tsv.Reader, c chan []BaseStrandTsvRow, bufferLen int, fileName string, wg *sync.WaitGroup)

ReadBaseStrandTsvIntoChannel reads a basestrand.tsv file from the given tsv.Reader into the given channel.

func WriteBaseStrandToTSV

func WriteBaseStrandToTSV(piles []BaseStrandPile, refNames []string, w io.Writer) (err error)

WriteBaseStrandToTSV writes a []BaseStrandPile as a TSV.

func WriteBaseStrandTsv

func WriteBaseStrandTsv(rows []BaseStrandTsvRow, writer io.Writer) error

WriteBaseStrandTsv writes a basestrand.tsv file to the given writer.

func WriteBaseStrandsRio

func WriteBaseStrandsRio(piles []BaseStrandPile, refNames []string, out io.Writer) error

WriteBaseStrandsRio writes the given BaseStrand-pileup entries to the given writer, using recordio.

func WriteBaseStrandsRioAsTSV

func WriteBaseStrandsRioAsTSV(ctx context.Context, path string, w io.Writer) error

WriteBaseStrandsRioAsTSV converts the given recordio pileup to TSV.

Types

type BaseStrandPile

type BaseStrandPile struct {
	RefID  uint32
	Pos    uint32
	Counts [pileup.NBase][2]uint32
}

BaseStrandPile represents a single pileup entry with a count for every (base, strand) tuple.

  • Pos is zero-based; it is necessary to add 1 when converting to most text formats (but not BED).
  • In Counts[][], base is the major dimension, with pileup.BaseA=0, C=1, G=2, T=3. Strand is the minor dimension, with strandFwd=0 and strandRev=1. TODO(cchang): strandFwd=0, strandRev=1 is inconsistent with bio/pileup's internal representation (which has None=0). We have enough other code at this point with Fwd=0, Rev=1 that it's probably time to change bio/pileup's representation to match that.
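The two conventions above can be sketched as follows. The struct is a local mirror of BaseStrandPile so that the example runs on its own; `nBase = 4` stands in for pileup.NBase and is an assumption, and the helper names are hypothetical.

```go
package main

import "fmt"

const nBase = 4 // assumed value of pileup.NBase: A=0, C=1, G=2, T=3

const (
	baseA = iota
	baseC
	baseG
	baseT
)

const (
	strandFwd = 0
	strandRev = 1
)

// baseStrandPile is a local mirror of snp.BaseStrandPile.
type baseStrandPile struct {
	RefID  uint32
	Pos    uint32
	Counts [nBase][2]uint32
}

// textPos returns the 1-based position used by most text formats.
func textPos(p baseStrandPile) uint64 { return uint64(p.Pos) + 1 }

// bedInterval returns the half-open, zero-based interval BED uses, so the
// start needs no adjustment.
func bedInterval(p baseStrandPile) (start, end uint64) {
	return uint64(p.Pos), uint64(p.Pos) + 1
}

// depth sums the counts over every (base, strand) pair.
func depth(p baseStrandPile) uint32 {
	var total uint32
	for b := 0; b < nBase; b++ {
		total += p.Counts[b][strandFwd] + p.Counts[b][strandRev]
	}
	return total
}

func main() {
	p := baseStrandPile{RefID: 0, Pos: 99}
	p.Counts[baseA][strandFwd] = 7 // base is the major index
	p.Counts[baseG][strandRev] = 2 // strand is the minor index
	fmt.Println(textPos(p))
	fmt.Println(bedInterval(p))
	fmt.Println(depth(p))
}
```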

func ReadBaseStrandsRio

func ReadBaseStrandsRio(rs io.ReadSeeker) (piles []BaseStrandPile, refNames []string, err error)

ReadBaseStrandsRio reads BaseStrand piles from a recordio file written by WriteBaseStrandsRio.

type BaseStrandTsvRow

type BaseStrandTsvRow struct {
	Chr  string `tsv:"#CHROM"` // Chromosome
	Pos  int64  `tsv:"POS"`    // Position in chromosome
	Ref  string `tsv:"REF"`    // Reference base
	FwdA int64  `tsv:"A+"`     // A count on the forward strand
	RevA int64  `tsv:"A-"`     // A count on the reverse strand
	FwdC int64  `tsv:"C+"`     // C count on the forward strand
	RevC int64  `tsv:"C-"`     // C count on the reverse strand
	FwdG int64  `tsv:"G+"`     // G count on the forward strand
	RevG int64  `tsv:"G-"`     // G count on the reverse strand
	FwdT int64  `tsv:"T+"`     // T count on the forward strand
	RevT int64  `tsv:"T-"`     // T count on the reverse strand
}

BaseStrandTsvRow represents a single row of a basestrand.tsv file.

func ReadBaseStrandTsv

func ReadBaseStrandTsv(r io.Reader) ([]BaseStrandTsvRow, error)

ReadBaseStrandTsv reads a basestrand.tsv file from the given io.Reader.

func ReadSingleStrandBaseStrandTsv

func ReadSingleStrandBaseStrandTsv(forward, reverse io.Reader) ([]BaseStrandTsvRow, error)

ReadSingleStrandBaseStrandTsv reads strand-specific strand.<fwd/rev>.snp.tsv files from the given forward and reverse io.Readers.

type BaseStrandUnmarshaller

type BaseStrandUnmarshaller struct {
	// contains filtered or unexported fields
}

BaseStrandUnmarshaller is used to allocate memory in large blocks during unmarshalling, to prevent contention with other goroutines.

func (*BaseStrandUnmarshaller) UnmarshalBaseStrand

func (b *BaseStrandUnmarshaller) UnmarshalBaseStrand(in []byte) (out interface{}, err error)

type Opts

type Opts struct {
	// Commandline options.
	BedPath      string
	Region       string
	BamIndexPath string
	Clip         int
	Cols         string
	FlagExclude  int
	Mapq         int
	MaxReadLen   int
	MaxReadSpan  int
	MinBagDepth  int
	MinBaseQual  int
	Parallelism  int
	PerStrand    bool
	RemoveSq     bool
	Stitch       bool
	TempDir      string
}

type PerReadFeatures

type PerReadFeatures struct {
	// Dist5p is the 0-based distance of the current base from its 5' end.  (Note
	// that Dist3p := Fraglen - 1 - Dist5p, so we don't need to store it
	// separately.)
	Dist5p uint16

	Fraglen uint16
	Qual    byte
	Strand  byte
}
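The relation in the field comment can be written out as a small helper. The struct below is a local mirror of PerReadFeatures, and `dist3p` is a hypothetical function; it assumes Dist5p < Fraglen, since the unsigned subtraction would wrap around otherwise.

```go
package main

import "fmt"

// perReadFeatures is a local mirror of snp.PerReadFeatures.
type perReadFeatures struct {
	Dist5p  uint16
	Fraglen uint16
	Qual    byte
	Strand  byte
}

// dist3p derives the 0-based distance from the 3' end.
// It assumes f.Dist5p < f.Fraglen; uint16 arithmetic wraps otherwise.
func dist3p(f perReadFeatures) uint16 {
	return f.Fraglen - 1 - f.Dist5p
}

func main() {
	first := perReadFeatures{Dist5p: 0, Fraglen: 150}
	last := perReadFeatures{Dist5p: 149, Fraglen: 150}
	fmt.Println(dist3p(first)) // base at the 5' end
	fmt.Println(dist3p(last))  // base at the 3' end
}
```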

type PileupPayload

type PileupPayload struct {
	Depth   uint32
	Counts  [pileup.NBaseEnum][2]uint32
	PerRead [pileup.NBase][]PerReadFeatures
}

PileupPayload is a container for all types of pileup data which may be associated with a single position. It does not store the position itself, or a tag indicating which parts of the container are used.

Depth and count values are of type uint32 instead of int to reduce cache footprint.

type PileupRow

type PileupRow struct {
	FieldsPresent uint32 // field... flags
	RefID         uint32
	Pos           uint32
	Payload       PileupPayload
}

PileupRow contains all pileup data associated with a single position, along with the position itself and the set of PileupPayload fields used.

The main loop splits the genome into shards, and generates lightly compressed (zstd level 1) per-shard PileupRow recordio files. Then, the per-shard files are read in sequence and converted to the final requested output format. This is a bit inefficient, but we can easily afford it.

type PosType

type PosType = pileup.PosType

PosType is the integer type used to represent genomic positions.
