fai

package
v0.9.2
Warning

This package is not in the latest version of its module.

Published: Nov 10, 2023 License: MIT Imports: 9 Imported by: 4

README

fai


Package fai implements FASTA sequence file index handling, including creating and reading the index and random access to sequences.

The code for the fai data structure was copied and adapted from [1].

I wrote the code for creating and reading fai files, as well as the test code.

The code for random access to subsequences was copied from [2], but I extended it considerably.

Reference:

[1]. https://github.com/biogo/biogo/blob/master/io/seqio/fai/fai.go

[2]. https://github.com/brentp/faidx/blob/master/faidx.go

General Usage

import "github.com/shenwei356/bio/seqio/fai"

file := "seq.fa"
faidx, err := fai.New(file)
checkErr(err)
defer func() {
    checkErr(faidx.Close())
}()

// whole sequence
seq, err := faidx.Seq("cel-mir-2")
checkErr(err)

// single base
s, err := faidx.Base("cel-let-7", 1)
checkErr(err)

// subsequence; start and end are both 1-based
subseq, err := faidx.SubSeq("cel-mir-2", 15, 19)
checkErr(err)

Extended SubSeq

SubSeq also accepts negative positions, which count from the end of the sequence.

This is a custom locating strategy. Start and end are both 1-based. To better understand the strategy, see the examples below:

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Examples:

// last 12 bases
seq, err := faidx.SubSeq("cel-mir-2", -12, -1)
checkErr(err)

Advanced Usage

Function fai.New(file string) is a wrapper that simplifies creating and reading the FASTA index. Let's see what happens inside:

func New(file string) (*Faidx, error) {
        fileFai := file + ".fai"
        var index Index
        if _, err := os.Stat(fileFai); os.IsNotExist(err) {
                index, err = Create(file, fileFai)
                if err != nil {
                        return nil, err
                }
        } else {
                index, err = Read(fileFai)
                if err != nil {
                        return nil, err
                }
        }

        return NewWithIndex(file, index)
}

By default, the sequence ID is used as the key in the FASTA index file. Inside the package, a regular expression extracts the sequence ID from the full header line. The default value is ^([^\s]+)\s?, i.e. the leading run of non-whitespace characters of the header. So you can simply call fai.Create(fileSeq, fileFai string) to create the .fai file.

If you want to use the full header instead of the sequence ID (the leading non-whitespace characters of the header), use fai.CreateWithIDRegexp(fileSeq, fileFai string, idRegexp string) to create the index. Here, idRegexp should be ^(.+)$. For convenience, you can use another function, CreateWithFullHead.
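
To make the difference between the two patterns concrete, here is a small standalone sketch that uses only the standard `regexp` package. `extractKey` is a hypothetical helper and the header text is made up; the sketch shows what the two regular expressions capture, not how the package applies them internally.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// The default pattern quoted above: leading non-whitespace characters.
	defaultIDRe = regexp.MustCompile(`^([^\s]+)\s?`)
	// The full-header pattern quoted above.
	fullHeadRe = regexp.MustCompile(`^(.+)$`)
)

// extractKey is a hypothetical helper: it returns the first capture
// group of re when applied to a header line (without the leading '>').
func extractKey(re *regexp.Regexp, head string) (string, bool) {
	m := re.FindStringSubmatch(head)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	head := "seq1 example description" // made-up header
	id, _ := extractKey(defaultIDRe, head)
	full, _ := extractKey(fullHeadRe, head)
	fmt.Printf("default key:     %q\n", id)
	fmt.Printf("full-header key: %q\n", full)
}
```

Whichever pattern was used to build the index decides which key you must later pass to Seq, Base and SubSeq.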

More Advanced Usages

Note that, by default, the whole file is mapped into shared memory, which is fine for files smaller than your RAM. For very large files, you should disable the mapping; file seeking is then used instead.

// change the global variable
fai.MapWholeFile = false

// then do other things

Documentation

Documentation on godoc.

Index

Constants

This section is empty.

Variables

var ErrSeqNotExists = fmt.Errorf("sequence not exists")

ErrSeqNotExists means that the sequence does not exist.

var IDRegexp = regexp.MustCompile(defaultIDRegexp)

IDRegexp is the regular expression used for parsing the record ID.

var MapWholeFile = true

MapWholeFile is a global flag that decides whether the whole file is mapped into memory.

Functions

func SubLocation

func SubLocation(length, start, end int) (int, int, bool)

SubLocation implements the custom sublocation strategy. The start and end arguments, as well as the returned start and end, are all 1-based.

 1-based index    1 2 3 4 5 6 7 8 9 10
negative index    0-9-8-7-6-5-4-3-2-1
           seq    A C G T N a c g t n
           1:1    A
           2:4      C G T
         -4:-2                c g t
         -4:-1                c g t n
         -1:-1                      n
          2:-2      C G T N a c g t
          1:-1    A C G T N a c g t n
          1:12    A C G T N a c g t n
        -12:-1    A C G T N a c g t n

Types

type Faidx

type Faidx struct {
	Index Index
	// contains filtered or unexported fields
}

Faidx is

func New

func New(fileSeq string) (*Faidx, error)

New tries to get a Faidx from a FASTA file.

func NewWithCustomExt

func NewWithCustomExt(fileSeq, fileFai string) (*Faidx, error)

NewWithCustomExt tries to get a Faidx from a FASTA file, with the path of the .fai file specified explicitly.

func NewWithIndex

func NewWithIndex(file string, index Index) (*Faidx, error)

NewWithIndex returns a Faidx from a file and an Index that has already been read. Useful when using a custom IDRegexp.

func (*Faidx) Base

func (f *Faidx) Base(chr string, pos int) (byte, error)

Base returns the base at position pos. pos is 1-based.

func (*Faidx) Close

func (f *Faidx) Close() error

Close closes the readers.

func (*Faidx) Seq

func (f *Faidx) Seq(chr string) ([]byte, error)

Seq returns the sequence of chr.

func (*Faidx) SeqNotCleaned

func (f *Faidx) SeqNotCleaned(chr string) ([]byte, error)

SeqNotCleaned returns the sequence without removing "\r" and "\n".

func (*Faidx) SubSeq

func (f *Faidx) SubSeq(chr string, start int, end int) ([]byte, error)

SubSeq returns the subsequence of chr from start to end. start and end are 1-based.

func (*Faidx) SubSeqNotCleaned

func (f *Faidx) SubSeqNotCleaned(chr string, start int, end int) ([]byte, error)

SubSeqNotCleaned returns the subsequence of chr from start to end. start and end are 1-based. "\r" and "\n" are not removed.

type Index

type Index map[string]Record

Index is the FASTA index, keyed by sequence name.

func Create

func Create(fileSeq, fileFai string) (Index, error)

Create creates the .fai file for a FASTA file.

func CreateWithFullHead

func CreateWithFullHead(fileSeq, fileFai string) (Index, error)

CreateWithFullHead uses full head instead of just sequence ID

func CreateWithIDRegexp

func CreateWithIDRegexp(fileSeq, fileFai string, idRegexp string) (Index, error)

CreateWithIDRegexp uses custom regular expression to get sequence ID

func Read

func Read(fileFai string) (Index, error)

Read reads the index from a .fai file.

type Record

type Record struct {
	Name         string
	Length       int
	Start        int64
	BasesPerLine int
	BytesPerLine int
}

Record is a FASTA index record.
