Documentation ¶
Overview ¶
Command bio-mark-duplicates marks or removes duplicates from .bam files. This tool is meant to replicate the behavior of Picard MarkDuplicates in a more scalable and efficient way. The intention is for bio-mark-duplicates to scale well on machines with more than 16 cores.

Duplicate Marking Concepts:

At the conceptual level, this tool considers two reads A and B duplicates (isDuplicate(A, B)) if their 1) reference, 2) unclipped 5' position, and 3) read direction (orientation) are ALL identical. Two pairs P1 and P2 are considered duplicates of each other if isDuplicate(P1.leftRead, P2.leftRead) and isDuplicate(P1.rightRead, P2.rightRead). Left vs. right is determined by the unclipped 5' position of each read in the pair.

Mapped pairs vs. mapped-unmapped pairs:

For some read pairs, both reads will be mapped (mapped pairs). For other read pairs, only one of the reads will be mapped (mapped-unmapped pairs). A mapped pair can be a duplicate of another mapped pair, but a mapped pair P1 may NOT be a duplicate of a mapped-unmapped pair P2, because one read of P2 has no alignment position and thus cannot be equal to one of the mapped reads of P1. However, the mapped read of a mapped-unmapped pair can be considered a duplicate of one read of a mapped pair; we call such a read a "mate-unmapped read". In the example below, P1 is not a duplicate of P2, but P2.left is a duplicate of P1.left:

  P1: left(chr1, 1020, F)   right(chr1, 1040, R)
  P2: left(chr1, 1020, F)   right(chr1,    0, ?)

After identifying the duplicates, this tool selects a primary pair or read for each set of duplicates. The primary is the duplicate with the highest score, where the score is based on the sum of a read's base qualities. To break ties, higher priority is given to reads that appear earlier in the bam input. In choosing a primary, pairs are given priority over mate-unmapped reads: if a mate-unmapped read is found to be a duplicate of one read in a mapped pair, the pair is always the primary and the mate-unmapped read is always the duplicate. If two mate-unmapped reads are duplicates of each other, but not duplicates of a mapped pair, then the primary is chosen from the two mate-unmapped reads. After identifying the primary and the duplicates, this tool can be configured either to mark each duplicate read with the duplicate flag (1024) or to remove the duplicate reads.

Tagging:

If the caller specifies the "tag-duplicates" parameter, the tool attaches the auxiliary tags DI, DL, DS, and DT to the output.

DI is the duplicate index of a duplicate set. All pairs in a duplicate set, including the primary, have the same value for DI; the value is the file index of the left-most read of the primary duplicate pair. DI is not set on mate-unmapped reads.

DL is the number of library (LB, a.k.a. PCR) duplicate pairs in the duplicate set, including the primary. This is equal to the DS value minus the number of "SQ" duplicate pairs in the duplicate set.

DS is the number of pairs in the duplicate set. DS is not set on mate-unmapped reads, and it does not count mate-unmapped duplicates.

DT is set on duplicate pairs (not the primary) and on mate-unmapped reads. It is set to "SQ" for optical duplicates and "LB" for all other duplicates.
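To summarize the duplicate criteria described above, here is a minimal Go sketch. The readKey struct and its field names are hypothetical and are not part of this package's API; the sketch only restates the three criteria and the pair rule.

// Illustrative sketch only: readKey and its fields are hypothetical,
// not part of this package's API.
type readKey struct {
    RefID      int  // reference the read is aligned to
    Unclipped5 int  // unclipped 5' position
    Reverse    bool // read direction (orientation)
}

// isDuplicate reports whether two reads are duplicates under the
// conceptual definition above: all three fields must be identical.
func isDuplicate(a, b readKey) bool {
    return a.RefID == b.RefID && a.Unclipped5 == b.Unclipped5 && a.Reverse == b.Reverse
}

// pairsAreDuplicates applies the pair rule: both left reads and both
// right reads must be duplicates of each other.
func pairsAreDuplicates(p1Left, p1Right, p2Left, p2Right readKey) bool {
    return isDuplicate(p1Left, p2Left) && isDuplicate(p1Right, p2Right)
}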
Implementation:

The implementation splits the input bam file into non-overlapping shards and processes each of those shards in a separate worker, possibly in parallel with the other shards. Each worker is responsible for taking the reads in one shard and outputting the same reads in the same order (except for marking or removing duplicates).

Matching up pairs:

To determine which pairs are duplicates, a worker must read both reads in a pair to determine each read's 5' position and orientation. Each read record contains a pointer to its mate, but it does not contain the mate's orientation or unclipped position, which are in the mate's flags and cigar, respectively. Matching up both reads in a pair is somewhat tricky. For most of the reads that live in a particular shard, the read's mate will appear within that shard, and a worker can easily match up those pairs. However, some pairs will span from inside the shard to outside the shard. Most of these read pairs will have their reads relatively close together, so a worker reads an additional pair-padding distance on either end of the shard when matching up pairs, but even that is not enough. Sometimes a pair spans from inside the shard to beyond the pair-padding; we call these "distant-pairs". To deal with distant-pairs, this tool pre-scans the entire bam input to identify the distant-pairs and save them to memory, so that the workers can look up the distant pair reads without needing to seek, decompress, and read them from the bam file. Note that the distant-pairs alone allow all pairs to be resolved, so the pair-padding is a memory optimization that avoids storing mates that live in the pair-padding in memory for the duration of the entire run. Pair-padding should be chosen so that most mates fall within the pair-padding.

Matching up duplicates:

Duplicates are matched up using their 5' positions, and since a 5' position can differ from the alignment's start position, each shard needs some fuzziness at its boundaries to ensure that all potential duplicates are compared against each other. For example:

        shard1                  shard2
|---------------------|-------------------------|
                  |cccc|--------------|    read1 (with clipping)
                  5    S              E    5', Begin, End
                  |c|-----------------|    read2 (with clipping)
                  5 S                 E    5', Begin, End

            clip-pad            shard2            clip-pad
          |-----------|-------------------------|-----------|

In this example, read1 and read2 are duplicates according to their 5' positions, but the two reads reside in different shards based on their alignment Begin positions. To handle these overlap cases, each shard has "clip-padding" on each side. When searching for duplicates in a shard, the worker considers each read that is either in the clip-padding or in the shard. So in the example, read1 is in shard2 and read2 is in shard2's clip-padding, but the shard2 worker still compares the two reads with each other, decides they are duplicates, and marks one as a duplicate based on their scores. Note that the worker for shard1 also compares read1 and read2, also decides they are duplicates, and marks one based on their scores. Since the scoring is deterministic, shard1 and shard2 agree on which read to mark as the duplicate.

Clip-padding and pair-padding serve different purposes. Clip-padding is required for correctness and must exceed the largest clip distance in the input file. Pair-padding is a memory optimization. The complete shard diagram looks like this:

 shard-pad  clip-pad            shard1            clip-pad   shard-pad
|---------|-----------|-------------------------|-----------|---------|
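The unclipped 5' position that drives both the duplicate criteria and the clip-padding requirement can be derived from a record's CIGAR. The sketch below is not this package's internal code; it just shows one way to compute the value with the github.com/biogo/hts/sam record type used elsewhere in this package.

import "github.com/biogo/hts/sam"

// unclipped5Prime returns the unclipped 5' position of a read: the
// alignment start minus leading clips for a forward read, or the
// alignment end plus trailing clips for a reverse read. Sketch only.
func unclipped5Prime(r *sam.Record) int {
    if r.Flags&sam.Reverse == 0 {
        clip := 0
        for _, op := range r.Cigar {
            if t := op.Type(); t == sam.CigarSoftClipped || t == sam.CigarHardClipped {
                clip += op.Len()
                continue
            }
            break
        }
        return r.Pos - clip
    }
    clip := 0
    for i := len(r.Cigar) - 1; i >= 0; i-- {
        if t := r.Cigar[i].Type(); t == sam.CigarSoftClipped || t == sam.CigarHardClipped {
            clip += r.Cigar[i].Len()
            continue
        }
        break
    }
    return r.End() + clip
}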
Output ordering:

As the workers complete each shard, they output the shard's marked reads to an output queue that preserves the original order of the shards. A writer reads shards from the queue in order and writes the records to the bam file output. The output queue has a maximum size and, when full, only allows a worker to insert the shard that the writer currently needs. This ensures that the queue does not grow too large when a worker takes a long time to process a particular shard.
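A minimal sketch of such an order-preserving output stage is shown below. It is illustrative only: the doneShard type and channel layout are assumptions for the example, and the tool's actual queue additionally enforces the maximum-size rule described above.

// doneShard is a hypothetical container for a finished shard.
type doneShard struct {
    index   int           // position of the shard in the original shard order
    records []*sam.Record // marked records for this shard
}

// writeInOrder drains completed shards and emits them strictly in shard
// order, buffering any shard that arrives before its predecessors.
func writeInOrder(done <-chan doneShard, write func([]*sam.Record)) {
    pending := map[int][]*sam.Record{}
    next := 0 // index of the shard the writer needs next
    for d := range done {
        pending[d.index] = d.records
        for recs, ok := pending[next]; ok; recs, ok = pending[next] {
            write(recs)
            delete(pending, next)
            next++
        }
    }
}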
Copyright 2019 Grail Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Index ¶
- Constants
- func ChoosePrimary(entries []DuplicateEntry) int
- func GetLibrary(readGroupLibrary map[string]string, record *sam.Record) string
- func NewAux(name string, val interface{}) sam.Aux
- func NewRecord(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewRecordAux(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewRecordSeq(name string, ref *sam.Reference, pos int, flags sam.Flags, matePos int, ...) *sam.Record
- func NewTestOutput(dir string, index int, format string) string
- func ReadRecords(t *testing.T, path string) []*sam.Record
- func RunTestCases(t *testing.T, header *sam.Header, cases []TestCase)
- func SetupAndMark(ctx context.Context, provider bamprovider.Provider, opts *Opts) error
- type BagProcessor
- type BagProcessorFactory
- type DuplicateEntry
- type IndexedPair
- type IndexedSingle
- type IntermediateDuplicateSet
- type MarkDuplicates
- type Metrics
- type MetricsCollection
- type OpticalDetector
- type Opts
- type Orientation
- type PhysicalLocation
- type TestCase
- type TestRecord
- type TileOpticalDetector
Constants ¶
const (
    // IlluminaReadName5Fields is the number of columns in a 5-field read name.
    IlluminaReadName5Fields = 5

    // IlluminaReadName5FieldsTileField is the 0-based field number that
    // contains the tileName for 5-field read names.
    IlluminaReadName5FieldsTileField = 2

    // IlluminaReadName7Fields is the number of columns in a 7-field read name.
    IlluminaReadName7Fields = 7

    // IlluminaReadName7FieldsTileField is the 0-based field number that
    // contains the tileName for 7-field read names.
    IlluminaReadName7FieldsTileField = 4

    // IlluminaReadName8Fields is the number of columns in an 8-field read name.
    IlluminaReadName8Fields = 8

    // IlluminaReadName8FieldsTileField is the 0-based field number that
    // contains the tileName for 8-field read names.
    IlluminaReadName8FieldsTileField = 4
)
Variables ¶
This section is empty.
Functions ¶
func ChoosePrimary ¶
func ChoosePrimary(entries []DuplicateEntry) int
func GetLibrary ¶
GetLibrary returns the library for the given record's read group. If the library is not defined in readGroupLibrary, returns "Unknown Library".
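For example, assuming the map is keyed by read-group name and rec is some *sam.Record whose RG tag is "rg1" (both assumptions for illustration):

readGroupLibrary := map[string]string{"rg1": "libraryA"}
lib := GetLibrary(readGroupLibrary, rec) // "libraryA"; "Unknown Library" if the read group has no entry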
func NewRecordAux ¶
func NewRecordSeq ¶
func NewTestOutput ¶
NewTestOutput returns a different filename for each of the output formats.
func ReadRecords ¶
ReadRecords reads the records from path and returns them as a slice, in order.
func SetupAndMark ¶
SetupAndMark does some minimal setup, validating opts and setting up the provider, and then runs mark().
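A calling sketch might look like the following. The bamprovider.NewProvider call and the Format value are assumptions about that package and the tool's accepted formats (consult the bamprovider package and the command-line flags for the exact API and values); only SetupAndMark, Opts, and the Opts field names come from this package. The snippet also assumes the context and log imports.

// Sketch of invoking duplicate marking programmatically.
ctx := context.Background()
provider := bamprovider.NewProvider("/path/to/input.bam") // assumed constructor; check bamprovider
opts := &Opts{
    BamFile:    "/path/to/input.bam",
    OutputPath: "/path/to/marked.bam",
    Format:     "bam", // assumed value; see the tool's format flag
    TagDups:    true,  // attach DI/DL/DS/DT auxiliary tags
}
if err := SetupAndMark(ctx, provider, opts); err != nil {
    log.Fatal(err)
}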
Types ¶
type BagProcessor ¶
type BagProcessor func([]*IntermediateDuplicateSet) []*IntermediateDuplicateSet
BagProcessor takes the set of bags from a particular shard, and returns the same set of reads, but the reads may now be bagged differently. The new set of bags may contain more bags or fewer bags than the original set of bags.
type BagProcessorFactory ¶
type BagProcessorFactory interface {
Create() BagProcessor
}
BagProcessorFactory creates BagProcessors. Mark-duplicates creates one BagProcessor per goroutine.
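For illustration, a hypothetical no-op factory could look like this; a real processor would re-bag the reads in some way rather than returning them unchanged.

// passThroughFactory is a hypothetical BagProcessorFactory whose
// processors return the bags unchanged.
type passThroughFactory struct{}

func (passThroughFactory) Create() BagProcessor {
    return func(bags []*IntermediateDuplicateSet) []*IntermediateDuplicateSet {
        // A real processor could split, merge, or otherwise re-bag here.
        return bags
    }
}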
type DuplicateEntry ¶
type IndexedPair ¶
type IndexedPair struct {
    Left  IndexedSingle
    Right IndexedSingle
}
func (IndexedPair) BaseQScore ¶
func (p IndexedPair) BaseQScore() int
func (IndexedPair) FileIdx ¶
func (p IndexedPair) FileIdx() uint64
func (IndexedPair) Name ¶
func (p IndexedPair) Name() string
type IndexedSingle ¶
func (IndexedSingle) BaseQScore ¶
func (s IndexedSingle) BaseQScore() int
func (IndexedSingle) FileIdx ¶
func (s IndexedSingle) FileIdx() uint64
func (IndexedSingle) Name ¶
func (s IndexedSingle) Name() string
type IntermediateDuplicateSet ¶
type IntermediateDuplicateSet struct {
    Pairs     []DuplicateEntry
    Singles   []DuplicateEntry
    Corrected map[string]string // Maps read name to corrected UMI pair: "GAC+GAG"
}
type MarkDuplicates ¶
type MarkDuplicates struct {
    Provider bamprovider.Provider
    Opts     *Opts
    // contains filtered or unexported fields
}
MarkDuplicates implements duplicate marking.
func (*MarkDuplicates) Mark ¶
func (m *MarkDuplicates) Mark(shards []bam.Shard) (*MetricsCollection, error)
Mark marks the duplicates and returns the metrics, along with an error if one was encountered.
type Metrics ¶
type Metrics struct {
    // UnpairedReads is the number of mapped reads examined which did
    // not have a mapped mate pair, either because the read is
    // unpaired, or the read is paired to an unmapped mate.
    UnpairedReads int

    // ReadPairsExamined is the number of mapped read pairs
    // examined. (Primary, non-supplemental).
    ReadPairsExamined int

    // SecondarySupplementary is the number of reads that were either
    // secondary or supplementary.
    SecondarySupplementary int

    // UnmappedReads is the total number of unmapped reads
    // examined. (Primary, non-supplemental).
    UnmappedReads int

    // UnpairedDups is the number of fragments that were marked as duplicates.
    UnpairedDups int

    // ReadPairDups is the number of read pairs that were marked as duplicates.
    ReadPairDups int

    // ReadPairOpticalDups is the number of read pair duplicates that
    // were caused by optical duplication. Value is always <
    // READ_PAIR_DUPLICATES, which counts all duplicates regardless of
    // source.
    ReadPairOpticalDups int
}
Metrics contains metrics from mark duplicates.
type MetricsCollection ¶
type MetricsCollection struct {
    // OpticalDistance stores the number of duplicate read pairs that
    // have the given Euclidean distance.
    OpticalDistance [][]int64

    // LibraryMetrics contains per-library metrics.
    LibraryMetrics map[string]*Metrics

    // High coverage intervals and read counts.
    HighCoverageIntervals []coverageInterval

    // contains filtered or unexported fields
}
MetricsCollection contains metrics computed by Mark.
func (*MetricsCollection) AddDistance ¶
func (mc *MetricsCollection) AddDistance(bagSize, distance int)
AddDistance increments the histogram counter for the given bagsize and distance.
func (*MetricsCollection) AddHighCovInterval ¶
func (mc *MetricsCollection) AddHighCovInterval(interval coverageInterval)
func (*MetricsCollection) Get ¶
func (mc *MetricsCollection) Get(library string) *Metrics
Get returns Metrics for the given library. If there is no Metrics for library yet, create one and return it.
func (*MetricsCollection) Merge ¶
func (mc *MetricsCollection) Merge(other *MetricsCollection)
Merge per-library and optical distance metrics from other into mc.
type OpticalDetector ¶
type OpticalDetector interface {
    // GetRecordProcessor returns a RecordProcessor that sees every
    // record in the bam input before any calls to Detect happen. The
    // OpticalDetector can use this to calculate statistics that
    // influence optical detection.
    GetRecordProcessor() bampair.RecordProcessor

    // RecordProcessorsDone should return maximum X and Y co-ordinates and
    // be called after the RecordProcessors have seen all the input records.
    RecordProcessorsDone() (int, int)

    // Detect identifies the optical duplicates in pairs and returns
    // their names in a slice. readGroupLibrary maps readGroup to
    // library name. pairs contains all the readpairs in the bag, and
    // bestIndex is an index into pairs that points to the bag's
    // primary readpair.
    Detect(readGroupLibrary map[string]string, pairs []DuplicateEntry, bestIndex int) []string
}
OpticalDetector is a general interface for optical duplicate detection.
type Opts ¶
type Opts struct {
    // Commandline options.
    BamFile                  string
    IndexFile                string
    MetricsFile              string
    HighCoverageIntervalFile string
    TileSizeFile             string
    Format                   string
    CoverageMax              int
    ShardSize                int
    MinBases                 int
    Padding                  int
    DiskMateShards           int
    ScratchDir               string
    Parallelism              int
    QueueLength              int
    ClearExisting            bool
    RemoveDups               bool
    TagDups                  bool
    IntDI                    bool
    UseUmis                  bool
    UmiFile                  string
    ScavengeUmis             int
    EmitUnmodifiedFields     bool
    SeparateSingletons       bool
    OutputPath               string
    StrandSpecific           bool
    OpticalHistogram         string
    OpticalHistogramMax      int
    Seed                     int64

    // Data and operators derived from commandline options.
    BagProcessorFactories []BagProcessorFactory
    OpticalDetector       OpticalDetector
    KnownUmis             []byte
}
Opts for mark-duplicates.
type Orientation ¶
type Orientation uint8
func GetR1R2Orientation ¶
func GetR1R2Orientation(p *IndexedPair) Orientation
GetR1R2Orientation returns an orientation byte containing orientations for both R1 and R2.
type PhysicalLocation ¶
type PhysicalLocation struct {
    Lane       int
    Surface    int
    Swath      int
    Section    int
    TileNumber int
    TileName   int
    X          int
    Y          int
}
PhysicalLocation describes a read's physical location on the flow cell. Lane, Surface, Swath, Section, and TileNumber together specify which flowcell tile the read was found in. TileName is the 4 or 5 digit representation of the tile, e.g. 1203 means surface 1, swath 2, and tile 3; 12304 means surface 1, swath 2, section 3, and tile 4. X and Y describe the X and Y coordinates of the well within the tile.
func ParseLocation ¶
func ParseLocation(qname string) PhysicalLocation
ParseLocation returns a physical location given an Illumina style read name. The read name must have 5, 7, or 8 fields separated by ':'. When there are 5 or 7 fields, the last three fields are tileName, X and Y. When there are 8 fields, the last four fields are tileName, X, Y, and UMI.
The tileName must be formatted as a 4 or 5 digit Illumina tileName. For a description of 4 digit tile numbers, see Appendix B, section Tile Numbering in
http://support.illumina.com.cn/content/dam/illumina-support/documents/documentation/system_documentation/hiseqx/hiseq-x-system-guide-15050091-e.pdf
For a description of 5 digit tile numbers, see Appendix C, section Tile Numbering in
https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/nextseq/nextseq-550-system-guide-15069765-05.pdf
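For example, a 7-field read name might be parsed like this. The read name below is made up for illustration; per the constants above, the tileName is the 0-based field 4, and the last three fields are tileName, X, and Y. Which of the remaining PhysicalLocation fields are populated depends on the read-name format.

loc := ParseLocation("machine1:42:FLOWCELL:3:1203:10831:24610")
// TileName 1203 is the 4-digit form: surface 1, swath 2, tile 3.
fmt.Println(loc.TileName, loc.Surface, loc.Swath, loc.X, loc.Y)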
type TestCase ¶
type TestCase struct {
    TRecords []TestRecord
    Opts     Opts
}
type TestRecord ¶
type TileOpticalDetector ¶
type TileOpticalDetector struct {
OpticalDistance int
}
TileOpticalDetector detects optical duplicates within a tile. For two reads to be optical duplicates, their tile, lane, surface, library, and read orientations must be identical.
func (*TileOpticalDetector) Detect ¶
func (t *TileOpticalDetector) Detect(readGroupLibrary map[string]string, duplicates []DuplicateEntry, bestIndex int) []string
Detect implements OpticalDetector.
func (*TileOpticalDetector) GetRecordProcessor ¶
func (t *TileOpticalDetector) GetRecordProcessor() bampair.RecordProcessor
GetRecordProcessor implements OpticalDetector.
func (*TileOpticalDetector) RecordProcessorsDone ¶
func (t *TileOpticalDetector) RecordProcessorsDone() (int, int)
RecordProcessorsDone implements OpticalDetector.
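To enable tile-based optical detection, the detector can be assigned to Opts.OpticalDetector. The distance value below is only an example, not a recommended default; choose it for your sequencing platform.

opts := &Opts{
    // ... other options ...
    OpticalDetector: &TileOpticalDetector{
        OpticalDistance: 2500, // example pixel distance only
    },
}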