Documentation ¶
Index ¶
- Constants
- Variables
- func CloseProximity(geneDB *GeneDB, fi FusionInfo, maxProximityDistance, maxProximityGenes int) bool
- func DiscardAbundantPartners(candidatesPtr *[]Candidate, maxGenePartners int)
- func FilterByMinSpan(hasUMI bool, minSpan int, candidatesPtr *[]Candidate, minReadSupport int)
- func FilterDuplicates(candidatesPtr *[]Candidate, hasUMI bool)
- func IsLowComplexity(seq string, lowComplexityFrac float64) bool
- func LinkedByLowComplexSubstring(frag Fragment, fi FusionInfo, lowComplexityFraction float64) bool
- func MaybeRemoveUMI(name, r1Seq, r2Seq string, opts Opts) (string, string, string)
- func ParseTranscriptomeKey(seqName string) (ensemblID, gene, chrom string, start, end, index int, err error)
- func RemoveLowComplexityReads(r1Seq, r2Seq string, stats *Stats, opts Opts) (newR1Seq, newR2Seq string)
- func SortGenePair(geneDB *GeneDB, g1, g2 GeneID, order GenePairOrder) (GeneID, GeneID)
- type Candidate
- type CrossReadPosRange
- type Fragment
- type FusionInfo
- type GeneDB
- func (m *GeneDB) GeneIDRange() (GeneID, GeneID)
- func (m *GeneDB) GeneInfo(id GeneID) *GeneInfo
- func (m *GeneDB) GeneInfoByName(name string) *GeneInfo
- func (m *GeneDB) IsFusionPair(g1, g2 GeneID) bool
- func (m *GeneDB) PrepopulateGeneInfo(genes []GeneInfo)
- func (m *GeneDB) PrepopulateGenes(names []string)
- func (m *GeneDB) ReadFusionEvents(ctx context.Context, path string)
- func (m *GeneDB) ReadTranscriptome(ctx context.Context, fastaPath string, filter bool)
- type GeneID
- type GeneInfo
- type GenePairOrder
- type Kmer
- type Opts
- type Pos
- type PosRange
- type ReadType
- type Stats
- type Stitcher
Constants ¶
const ReproduceBug = true
ReproduceBug introduces extra logic to reproduce suspicious behaviours of //bio/rna/fusion/ code.
Variables ¶
var DefaultOpts = Opts{
	UMIInRead: false,
	UMIInName: false,
	KmerLength: 19,
	MaxGap: 9,
	MaxHomology: 15,
	MinSpan: 25,
	LowComplexityFraction: 0.9,
	MaxGenesPerKmer: 5,
	MaxGeneCandidatesPerFragment: 5,
	MaxProximityDistance: 100000,
	MaxProximityGenes: 5,
	MaxGenePartners: 5,
	MinReadSupport: 2,
}
DefaultOpts holds the default values for Opts.
Functions ¶
func CloseProximity ¶
func CloseProximity(geneDB *GeneDB, fi FusionInfo, maxProximityDistance, maxProximityGenes int) bool
CloseProximity returns true if the two genes in a candidate are deemed to be in close proximity: they are on the same chromosome and are either within maxProximityDistance bases of each other or within maxProximityGenes genes of each other.
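The proximity rule above can be sketched as follows. This is a minimal illustration, not the package's implementation: geneLoc is a hypothetical stand-in for the Chrom, Start, End and Index fields of GeneInfo, and the strict "<" comparisons and the way the gap between two genes is measured are assumptions.

```go
package main

import "fmt"

// geneLoc is a hypothetical stand-in for the location fields of GeneInfo.
type geneLoc struct {
	Chrom      string
	Start, End int
	Index      int // rank of the gene within Chrom
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// closeProximity reports whether two genes on the same chromosome are near
// each other, either by base distance or by gene rank. The exact thresholds
// semantics ("<" rather than "<=") are an assumption of this sketch.
func closeProximity(a, b geneLoc, maxDist, maxGenes int) bool {
	if a.Chrom != b.Chrom {
		return false
	}
	// Gap between the two intervals; zero when they overlap.
	gap := 0
	switch {
	case a.End < b.Start:
		gap = b.Start - a.End
	case b.End < a.Start:
		gap = a.Start - b.End
	}
	return gap < maxDist || abs(a.Index-b.Index) < maxGenes
}

func main() {
	g1 := geneLoc{"chr11", 56346039, 56346998, 1051}
	g2 := geneLoc{"chr11", 56400000, 56401000, 1053}
	g3 := geneLoc{"chr2", 56400000, 56401000, 10}
	fmt.Println(closeProximity(g1, g2, 100000, 5)) // same chromosome, nearby
	fmt.Println(closeProximity(g1, g3, 100000, 5)) // different chromosomes
}
```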
func DiscardAbundantPartners ¶
DiscardAbundantPartners discards calls where one of the partners is involved in too many events, as bounded by maxGenePartners.
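One way to picture this filter is sketched below. The pair type, the use of gene names instead of GeneID, and the "more than maxPartners distinct partners" rule are all assumptions made for illustration; the real function works in place on a []Candidate.

```go
package main

import "fmt"

// pair is a hypothetical, simplified fusion call between two genes.
type pair struct{ G1, G2 string }

// discardAbundantPartners drops every pair in which either gene has more
// than maxPartners distinct partners across the whole input.
func discardAbundantPartners(pairs []pair, maxPartners int) []pair {
	partners := map[string]map[string]bool{}
	add := func(a, b string) {
		if partners[a] == nil {
			partners[a] = map[string]bool{}
		}
		partners[a][b] = true
	}
	for _, p := range pairs {
		add(p.G1, p.G2)
		add(p.G2, p.G1)
	}
	var kept []pair
	for _, p := range pairs {
		if len(partners[p.G1]) <= maxPartners && len(partners[p.G2]) <= maxPartners {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	calls := []pair{{"A", "B"}, {"A", "C"}, {"A", "D"}, {"E", "F"}}
	// Gene A has three partners, so with a cap of 2 only E/F survives.
	fmt.Println(discardAbundantPartners(calls, 2))
}
```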
func FilterByMinSpan ¶
FilterByMinSpan filters out candidates that are not covered by at least minSpan bases of either G1 or G2. It also performs UMI collapsing and returns all valid fragment indices. More specifically, this function scans through all candidate gene pairs and considers a pair valid if at least minReadSupport unique fragments support the fusion.
func FilterDuplicates ¶
FilterDuplicates filters out duplicate reads that call the same event.
func IsLowComplexity ¶
IsLowComplexity returns true if the input DNA is a low-complexity sequence, i.e., any two bases together account for more than lowComplexityFrac of the total sequence length.
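The check described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the package's code: it counts every byte as a base, sums the two largest counts, and treats an empty sequence as not low-complexity.

```go
package main

import (
	"fmt"
	"sort"
)

// isLowComplexity reports whether the two most frequent bases together make
// up more than frac of the sequence. Handling of the empty sequence is an
// assumption of this sketch.
func isLowComplexity(seq string, frac float64) bool {
	if len(seq) == 0 {
		return false
	}
	counts := map[byte]int{}
	for i := 0; i < len(seq); i++ {
		counts[seq[i]]++
	}
	freq := make([]int, 0, len(counts))
	for _, c := range counts {
		freq = append(freq, c)
	}
	// Largest counts first.
	sort.Sort(sort.Reverse(sort.IntSlice(freq)))
	top := freq[0]
	if len(freq) > 1 {
		top += freq[1]
	}
	return float64(top) > frac*float64(len(seq))
}

func main() {
	fmt.Println(isLowComplexity("AAAAAAAAAT", 0.9)) // two bases cover the whole read
	fmt.Println(isLowComplexity("ACGTACGTAC", 0.9)) // four bases, evenly mixed
}
```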
func LinkedByLowComplexSubstring ¶
func LinkedByLowComplexSubstring(frag Fragment, fi FusionInfo, lowComplexityFraction float64) bool
LinkedByLowComplexSubstring returns true if the substring used to infer the candidate gene pair is of low complexity.
func MaybeRemoveUMI ¶
MaybeRemoveUMI removes a UMI from the sequences and adds it to the name part, if the options prescribe such operations. It returns <new name, new r1 seq, new r2 seq>.
func ParseTranscriptomeKey ¶
func ParseTranscriptomeKey(seqName string) (ensemblID, gene, chrom string, start, end, index int, err error)
ParseTranscriptomeKey parses a transcriptome fasta key.
Transcriptome key example: "ENST00000279783.3|OR8K1|chr11:56346039-56346998:1051|960"
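A rough sketch of how such a key could be split is shown below. The field layout is inferred from the single example above and is an assumption: the third colon-separated field (1051) is treated as the index, and the trailing pipe-separated field (960) is ignored. The keyFields type and parseKey function are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// keyFields is a hypothetical container for the parsed parts of a key.
type keyFields struct {
	EnsemblID, Gene, Chrom string
	Start, End, Index      int
}

// parseKey splits "ID|GENE|CHROM:START-END:INDEX|..." into its parts.
func parseKey(key string) (keyFields, error) {
	var f keyFields
	parts := strings.Split(key, "|")
	if len(parts) < 3 {
		return f, fmt.Errorf("malformed key %q", key)
	}
	f.EnsemblID, f.Gene = parts[0], parts[1]
	loc := strings.Split(parts[2], ":")
	if len(loc) != 3 {
		return f, fmt.Errorf("malformed location %q", parts[2])
	}
	f.Chrom = loc[0]
	rng := strings.Split(loc[1], "-")
	if len(rng) != 2 {
		return f, fmt.Errorf("malformed range %q", loc[1])
	}
	var err error
	if f.Start, err = strconv.Atoi(rng[0]); err != nil {
		return f, err
	}
	if f.End, err = strconv.Atoi(rng[1]); err != nil {
		return f, err
	}
	if f.Index, err = strconv.Atoi(loc[2]); err != nil {
		return f, err
	}
	return f, nil
}

func main() {
	f, err := parseKey("ENST00000279783.3|OR8K1|chr11:56346039-56346998:1051|960")
	fmt.Println(f, err)
}
```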
func RemoveLowComplexityReads ¶
func RemoveLowComplexityReads(r1Seq, r2Seq string, stats *Stats, opts Opts) (newR1Seq, newR2Seq string)
RemoveLowComplexityReads checks whether r1Seq or r2Seq is low complexity, i.e., whether the two most frequent nucleotide types dominate the sequence. If so, it replaces that sequence with an empty string.
func SortGenePair ¶
func SortGenePair(geneDB *GeneDB, g1, g2 GeneID, order GenePairOrder) (GeneID, GeneID)
Types ¶
type Candidate ¶
type Candidate struct {
	Frag    Fragment
	Fusions []FusionInfo
}
Candidate is a combination of a fragment and possible fusions detected for the fragment.
type CrossReadPosRange ¶
type CrossReadPosRange struct{ Start, End Pos }
CrossReadPosRange is the same as PosRange, except the range may cross a R1/R2 boundary.
func (CrossReadPosRange) Equal ¶
func (r CrossReadPosRange) Equal(other CrossReadPosRange) bool
Equal checks if the two ranges are identical.
type Fragment ¶
type Fragment struct {
	// Fragment name. It's a copy of the R1 name from the fastq.
	//
	// Example: E00469:245:HHK5TCCXY:1:1101:19634:35080:GTATCT+AGCAAT
	Name string
	// R1Seq is the sequence from the R1 fastq. When the R1 and R2 sequences are
	// found to have overlapping regions, they are stitched together, and R1Seq
	// will store the combined sequence and R2Seq will be empty.
	R1Seq string
	// R2Seq is the sequence from the R2 fastq. It is nonempty only when the
	// stitcher fails to stitch the R1 and R2 sequences.
	R2Seq string
	// contains filtered or unexported fields
}
Fragment is a union of two (unpaired) reads (R1 & R2). Created by the Stitcher.
func (*Fragment) HammingDistance ¶
HammingDistance computes the Hamming distance between sequences. If the sequences aren't of the same length, it returns infiniteHammingDistance.
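For reference, the Hamming distance with a length-mismatch sentinel can be sketched as below. The method's real signature is not shown on this page, so this version simply compares two plain strings; infiniteDistance is a hypothetical stand-in for the package's unexported infiniteHammingDistance, whose actual value is not documented here.

```go
package main

import (
	"fmt"
	"math"
)

// infiniteDistance is a hypothetical sentinel for "not comparable".
const infiniteDistance = math.MaxInt32

// hammingDistance counts the positions at which a and b differ, or returns
// infiniteDistance when the lengths differ.
func hammingDistance(a, b string) int {
	if len(a) != len(b) {
		return infiniteDistance
	}
	d := 0
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			d++
		}
	}
	return d
}

func main() {
	fmt.Println(hammingDistance("ACGT", "ACCT")) // one mismatch
	fmt.Println(hammingDistance("ACGT", "ACG") == infiniteDistance)
}
```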
func (*Fragment) SubSeq ¶
func (r *Fragment) SubSeq(p CrossReadPosRange) string
SubSeq extracts part of the RNA sequence. The arg may cross the R1/R2 boundary, in which case this function returns a suffix of R1 plus a prefix of R2.
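The boundary-crossing behavior can be illustrated with the sketch below. It assumes the position encoding described under type Pos (R2 positions are offset by a constant); r2Offset is a hypothetical value, since the package's r2PosOffset is unexported and not shown here, and subSeq takes the two reads as plain strings rather than a Fragment.

```go
package main

import "fmt"

// Pos mirrors the package's position type.
type Pos int64

// r2Offset is a hypothetical offset marking positions that fall in R2.
const r2Offset Pos = 1 << 32

// subSeq returns the bases in the half-open range [start, end). When the
// range starts in R1 and ends in R2, the result is a suffix of R1 followed
// by a prefix of R2.
func subSeq(r1, r2 string, start, end Pos) string {
	switch {
	case end < r2Offset: // entirely within R1
		return r1[start:end]
	case start >= r2Offset: // entirely within R2
		return r2[start-r2Offset : end-r2Offset]
	default: // crosses the R1/R2 boundary
		return r1[start:] + r2[:end-r2Offset]
	}
}

func main() {
	r1, r2 := "ACGT", "TTGG"
	fmt.Println(subSeq(r1, r2, 1, 3))                   // inside R1
	fmt.Println(subSeq(r1, r2, 2, r2Offset+2))          // suffix of R1 + prefix of R2
	fmt.Println(subSeq(r1, r2, r2Offset+1, r2Offset+3)) // inside R2
}
```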
type FusionInfo ¶
type FusionInfo struct {
	// G1ID is the ID of the gene 1. The gene info can be looked up by calling GeneDB.GeneInfo().
	G1ID GeneID
	// G2ID is the ID of the gene 2. The gene info can be looked up by calling GeneDB.GeneInfo().
	G2ID GeneID
	// G1Span is the total length of the gene1 that intersects with the fragment.
	G1Span int
	// G2Span is the total length of the gene2 that intersects with the fragment.
	G2Span int
	// JointSpan is the total length covered by either G1 or G2.
	//
	// REQUIRES: JointSpan >= max(G1Span, G2Span)
	JointSpan int
	// FusionOrder=true iff g1 is aligned before g2.
	FusionOrder bool
	// G1Range is the [min,max) range of the fragment covered by G1.
	G1Range CrossReadPosRange
	// G2Range is the [min,max) range of the fragment covered by G2.
	G2Range CrossReadPosRange
}
FusionInfo represents a fusion event between two genes.
func DetectFusion ¶
func DetectFusion(geneDB *GeneDB, frag Fragment, stats *Stats, opts Opts) []FusionInfo
DetectFusion is the top-level entry point. It determines whether the given fragment is a fusion of two genes and returns the list of candidate fusion events. If no event is found, it returns an empty slice.
type GeneDB ¶
type GeneDB struct {
// contains filtered or unexported fields
}
GeneDB is a singleton object that stores transcriptomes, kmers generated from the transcripts, and candidate fusion event pairs. Thread compatible.
func (*GeneDB) GeneIDRange ¶
GeneIDRange returns the range of gene IDs registered in this object. The low end is closed, high end is open. For example, the return value of (1, 95) means this DB holds 94 genes, IDs from 1 to 94. You can use GeneInfo() to get the information about the gene.
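Because the returned range is half-open, a loop over all gene IDs uses "<" on the upper bound. The sketch below shows the pattern with fakeDB, a hypothetical stand-in that only mimics GeneIDRange; constructing a real GeneDB is not covered on this page.

```go
package main

import "fmt"

// GeneID mirrors the package's dense gene identifier type.
type GeneID int32

// fakeDB is a hypothetical stand-in for GeneDB; names[i] has ID i+1.
type fakeDB struct{ names []string }

// GeneIDRange returns the half-open range [low, high) of registered IDs.
func (db *fakeDB) GeneIDRange() (GeneID, GeneID) {
	return 1, GeneID(len(db.names) + 1)
}

// name looks up the gene name for a valid ID.
func (db *fakeDB) name(id GeneID) string { return db.names[id-1] }

func main() {
	db := &fakeDB{names: []string{"MAPK10", "OR8K1", "ETV1"}}
	low, high := db.GeneIDRange()
	// The high end is exclusive, so the loop visits IDs low..high-1.
	for id := low; id < high; id++ {
		fmt.Println(id, db.name(id))
	}
}
```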
func (*GeneDB) GeneInfo ¶
GeneInfo gets the GeneInfo given an ID. It always returns a non-nil info.
REQUIRES: ID is valid.
func (*GeneDB) GeneInfoByName ¶
GeneInfoByName gets the GeneInfo given a gene name. It returns nil if the gene is not registered.
func (*GeneDB) IsFusionPair ¶
IsFusionPair checks if the given pair of genes appears in the Cosmic TSV file added by ReadFusionEvents.
REQUIRES: IDs are valid.
func (*GeneDB) PrepopulateGeneInfo ¶
PrepopulateGeneInfo fills gene info in batch. NOT FOR GENERAL USE. It is used to populate the gene DB from a recordio dump.
func (*GeneDB) PrepopulateGenes ¶
PrepopulateGenes assigns gene IDs to genes. NOT FOR GENERAL USE. It is used only to change the gene name <-> gene ID assignments to reproduce the behavior of the C++ code.
func (*GeneDB) ReadFusionEvents ¶
ReadFusionEvents reads from a Cosmic TSV file the names of gene pairs that form fusions. The first column of each line must be of form "gene1/gene2", for example "ACSL3/ETV1".
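Extracting the gene pair from one line of such a file might look like the sketch below. Only the first-column format comes from the description above; parseFusionPair is a hypothetical helper, and header handling and the meaning of the remaining columns are not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFusionPair reads the first tab-separated column of a line and splits
// it into two gene names at the "/" separator.
func parseFusionPair(line string) (g1, g2 string, err error) {
	col := strings.SplitN(line, "\t", 2)[0]
	genes := strings.Split(col, "/")
	if len(genes) != 2 || genes[0] == "" || genes[1] == "" {
		return "", "", fmt.Errorf("malformed gene pair %q", col)
	}
	return genes[0], genes[1], nil
}

func main() {
	g1, g2, err := parseFusionPair("ACSL3/ETV1\tother\tcolumns")
	fmt.Println(g1, g2, err)
}
```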
type GeneID ¶
type GeneID int32
GeneID is a dense sequence number (1, 2, 3, ...) assigned to a gene (e.g., "MAPK10"). IDs are valid only within one process invocation.
Caution: this type must be signed. kmer_index uses negative geneids to indicate outlined slices.
type GeneInfo ¶
type GeneInfo struct {
	// ID is a dense sequence number (1, 2, ...). It is valid only during the current run.
	ID GeneID
	// EnsemblID is parsed from the transcriptome FASTA key. E.g., "ENST00000279783.3"
	EnsemblID string
	// Gene is parsed from the transcriptome FASTA key. E.g., "OR8K1"
	Gene string
	// Chrom is parsed from the transcriptome FASTA key. E.g., "chr11"
	Chrom string
	// Start is parsed from the transcriptome FASTA key. E.g., 56346039
	Start int
	// End is parsed from the transcriptome FASTA key. E.g., 56346998
	End int
	// Index is the rank of this gene in Chrom. Gene with the smallest Start in
	// the given Chrom will have Index of zero.
	Index int
	// FusionEvent is true if this gene appears in the cosmic TSV file added via
	// ReadFusionEvents.
	FusionEvent bool
}
GeneInfo stores the info about a gene. It is parsed out from the transcriptome fasta key.
Transcriptome key example: "ENST00000279783.3|OR8K1|chr11:56346039-56346998:1051|960"
type GenePairOrder ¶
type GenePairOrder int
GenePairOrder defines the order in which a gene pair is printed (A/B or B/A). Possible values are CosmicOrder and AlphabeticalOrder.
const (
	// Output the genes in the order that's listed in the cosmic DB. This format
	// requires that either <g1,g2> or <g2,g1> be listed in the cosmic DB.
	CosmicOrder GenePairOrder = iota
	// Output the genes in the alphabetical order.
	AlphabeticalOrder
)
type Opts ¶
type Opts struct {
	UMIInRead bool
	UMIInName bool
	// LowComplexityFraction determines whether a fragment (or read) should be
	// dropped because it contains too many repetitions of the same base types. If
	// LowComplexityFraction of bases in a fragment are such repetitions, it is
	// dropped without further analysis.
	LowComplexityFraction float64
	// KmerLength is the length of kmer used to match DNA sequences.
	KmerLength int
	Denovo bool
	// MaxGap specifies the max gap allowed between
	// two consecutive ranges (this is to tolerate sequence errors).
	MaxGap int
	// MaxHomology is the max overlap allowed b/w genes in a fusion
	MaxHomology int
	// Min base evidence for a gene in the fusion.
	MinSpan int
	// MaxGenesPerKmer is the max number of genes a kmer may belong to, default 5.
	// Used to be --cap flag in the C++ code.
	MaxGenesPerKmer int
	// MaxGeneCandidatesPerFragment caps the number of genes considered for each
	// fragment. Doing so will reduce the number of pairs that will be processed
	// for every read.
	//
	// TODO(saito,xyang) report read through events but differentiating this from
	// the overlapping genes
	MaxGeneCandidatesPerFragment int
	// MaxProximityDistance is the distance cutoff below which a candidate will be
	// rejected as a readthrough event
	MaxProximityDistance int
	// MaxProximityGenes is the number of genes separating a gene pair (if on the
	// same chromosome) below which they will be flagged as read-through events.
	MaxProximityGenes int
	// MaxGenePartners is the maximum number of partners a gene can have. This is
	// used in the filtering stage.
	MaxGenePartners int
	// Minimum number of supporting reads required
	// to consider a fusion
	MinReadSupport int
}
type Pos ¶
type Pos int64
Pos is the position in a read or a fragment (i.e., paired reads).
When Pos refers to the position in a read, it is simply the zero-based index within the read sequence.
When Pos refers to the position in a fragment, it can be either position in R1 or in R2. A position in R2 has a value >= r2PosOffset. To get the actual index within the R2 sequence, subtract r2PosOffset from the value.
type PosRange ¶
type PosRange struct{ Start, End Pos }
PosRange is a half-open range [start, end).
INVARIANT: Start and End never cross a R1/R2 boundary.
type ReadType ¶
type ReadType uint8
ReadType defines the read type (R1 or R2) for a paired fragment.
const (
	// R1 means the read is either a raw read from the R1 fastq file, or it is a
	// result of stitching R1 and R2.
	R1 ReadType = iota
	// R2 means the read is from the R2 fastq file. Note: when the read pair could
	// be stitched, the combined result will be stored in Fragment.R1Seq and
	// R2Seq will be empty.
	R2
)
type Stats ¶
type Stats struct {
	// LowComplexityReads2 is the # of readpairs where both reads are
	// found to have low complexity.
	LowComplexityReads2 int
	// LowComplexityReads1 is the # of readpairs where one of the reads is found
	// to have low complexity.
	LowComplexityReads1 int
	// LowComplexityReadsStitched is the # of readpairs that were successfully
	// stitched, but then found to have low complexity.
	LowComplexityReadsStitched int
	// Stitched is the # of reads successfully stitched
	Stitched int
	// RawGenes is the total genes found during kmer lookup.
	RawGenes int
	// Genes is the total genes found during kmer lookup, after
	// Opts.MaxGeneCandidatesPerFragment cutoff.
	Genes int
	// Fragments counts the total number of fragments processed.
	Fragments int
	// FragmentsWithMatchingGenes[k] (0<=k<4) counts the total # of fragments that
	// are found to have k genes matching one of its kmers. The last element in
	// this array counts all the fragments with >=4 matching genes.
	FragmentsWithMatchingGenes [5]int
	// RawRanges is the total # of ranges covered by any gene.
	RawRanges int
	// Ranges is the total # of ranges covered by any gene, after
	// Opts.MaxGeneCandidatesPerFragment cutoff.
	Ranges int
}
Stats represents high-level statistics during the run of the stage1 of AF4.
type Stitcher ¶
type Stitcher struct {
// contains filtered or unexported fields
}
Stitcher stitches two reads (R1 and R2) and produces a Fragment. It can be used to stitch multiple read pairs. Thread compatible.
func NewStitcher ¶
NewStitcher creates a new stitcher. kmerLength and lowComplexityFraction should be copied from the counterparts in Opts.
func (*Stitcher) FreeFragment ¶
FreeFragment puts the fragment in a freepool. The caller must not retain any reference to the fragment after the call. Future calls to Stitch will reuse fragments from the freepool.
Source Files ¶
Directories ¶
Path | Synopsis |
---|---|
 | This is the main package for parsegencode. |
 | Package parsegencode contains the required methods for parsing a gencode GTF annotation and printing out the transcripts to an output file. |