gostr

package module
v0.0.0-...-1708c0c Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 21, 2021 License: GPL-3.0 Imports: 8 Imported by: 0

README

gostr - String algorithms examples in Go

TravisCI Status codecov Coverage Status Codacy Badge

This is just a small repository for Go implementations of basic string algorithms for educational purposes.

Documentation

Index

Constants

View Source
const (
	// Sentinel is a unique byte (zero) used in several algorithms
	// to be different and smaller than all other letters.
	Sentinel = '\x00'
	// SentinelSymbol is a rune you can use to display the sentinel
	// when you translate byte slices into strings.
	SentinelSymbol = rune('𝕊')
)
View Source
const (
	M ApproxEdit = iota // Match/mismatch operations
	I            = iota // Insertion operations
	D            = iota // Deletion operations
)

Approximative matching edit operations.

Variables

This section is empty.

Functions

func Bmh

func Bmh(x, p string, callback func(int))

Bmh runs the O(nm) worst-case but expected sub-linear time Boyer-Moore-Horspool algorithm. This version uses a table of size 256 to map bytes to jumps, exploiting that we get bytes out of indexing strings.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func BmhWithAlphabet

func BmhWithAlphabet(x, p string, callback func(int))

BmhWithAlphabet runs the O(nm) worst-case but expected sub-linear time Boyer-Moore-Horspool algorithm. This version maps the input strings before search, so we know their alphabet size, and can create a jump table of the apprporiate size.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func BmhWithMap

func BmhWithMap(x, p string, callback func(int))

BmhWithMap runs the O(nm) worst-case but expected sub-linear time Boyer-Moore-Horspool algorithm. It uses a map for the jump table, and can theoretically handle more characters than bytes (but doesn't, since indexing into strings gives us bytes). It demonstrates the performance hit you get from using a map rather than an array as in Bmh.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func BorderSearch

func BorderSearch(x, p string, callback func(int))

BorderSearch runs the O(n+m) time algorithm based on building a border array and reporting when its value matches m.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func Borderarray

func Borderarray(x string) []int

Borderarray computes the border array over the string x. The border array ba will at index i have the length of the longest proper border of the string x[:i+1], i.e. the longest non-empty string that is both a prefix and a suffix of x[:i+1].

func Bwt

func Bwt(x []byte, sa []int32) []byte

Bwt gives you the Burrows-Wheeler transform of a string, computed using the suffix array for the string. The string should have a sentinel

func CountEdits

func CountEdits(x, p string, pos int, cigar string) (int, error)

CountEdits counts the number of edits in the local alignment between x and p specified by pos and cigar

func ExtractAlignment

func ExtractAlignment(x, p string, pos int, cigar string) (subx, subp string, err error)

ExtractAlignment extracts a pairwise alignment from the reference, x, the read, p, the position and the edits cigar.

func FMIndexApproxFromTables

func FMIndexApproxFromTables(tbls *FMIndexTables) func(p string, edits int, cb func(i int, cigar string))

FMIndexApproxFromTables return a search function from the preprocessed tables.

func FMIndexApproxPreprocess

func FMIndexApproxPreprocess(x string) func(p string, edits int, cb func(i int, cigar string))

FMIndexApproxPreprocess preprocesses the string x and returns a function that you can use to efficiently search in x.

func FMIndexExactFromTables

func FMIndexExactFromTables(tbls *FMIndexTables) func(p string, cb func(i int))

FMIndexExactFromTables returns a search function based on the preprocessed tables

func FMIndexExactPreprocess

func FMIndexExactPreprocess(x string) func(p string, cb func(i int))

FMIndexExactPreprocess preprocesses the string x and returns a function that you can use to efficiently search in x.

func Kmp

func Kmp(x, p string, callback func(int))

Kmp runs the O(n+m) time Knuth-Morris-Prat algorithm.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func Naive

func Naive(x, p string, callback func(int))

Naive runs the naive (duh) O(nm) times search algorithm.

Parameters:

  • x: the string we search in.
  • p: the string we search for
  • callback: a function called for each occurrence

func OpsToCigar

func OpsToCigar(ops EditOps) string

OpsToCigar turns a list of ops into a cigar.

func ReverseBwt

func ReverseBwt(bwt []byte) []byte

ReverseBwt reconstructs the original string from the bwt string.

func ReverseString

func ReverseString(s string) string

ReverseString returns a new string that is the reverse of the input string.

func Sais

func Sais(x string) (sa []int32)

Sais builds a suffix array from the string x

func SaisWithAlphabet

func SaisWithAlphabet(x string, alpha *Alphabet) ([]int32, error)

SaisWithAlphabet builds a suffix array from the string x, first mapping it to bytes using the alphabet alpha.

func Skew

func Skew(x string) []int32

Skew builds the suffix array of a string using the skew algorithm.

func SkewWithAlphabet

func SkewWithAlphabet(x string, alpha *Alphabet) ([]int32, error)

SkewWithAlphabet builds the suffix array of a string, first mapping it to a byte slice using the alphabet alpha.

func StSaConstruction

func StSaConstruction(x string) []int32

StSaConstruction constructs a suffix array from a suffix tree.

func StrictBorderarray

func StrictBorderarray(x string) []int

StrictBorderarray computes the strict border array over the string x. This is almost the same as the border array, but ba[i] will be the longest proper border of the string x[:i+1] such that x[ba[i]] != x[i].

Types

type Alphabet

type Alphabet struct {
	// contains filtered or unexported fields
}

Alphabet handles mapping from strings to smaller alphabets of bytes.

func MapString

func MapString(x string) ([]byte, *Alphabet)

MapString creates an alphabet from the input and then maps the string through it, returning both resulting byte slice and alphabet.

func MapStringWithSentinel

func MapStringWithSentinel(x string) ([]byte, *Alphabet)

MapStringWithSentinel creates an alphabet from the input and then maps the string through it, returning both resulting byte slice and alphabet. Unlike MapString, MapStringWithSentinel will add a terminal zero byte to the byte slice it returns.

func NewAlphabet

func NewAlphabet(ref string) *Alphabet

NewAlphabet creates an alphabet consisting of the bytes in ref only.

func (*Alphabet) Contains

func (alpha *Alphabet) Contains(a byte) bool

Contains checks if a is contained in the alphabet

func (*Alphabet) GobDecode

func (alpha *Alphabet) GobDecode(b []byte) (err error)

GobDecode implements the encoder interface for serialising an alphabet to a stream of bytes

func (Alphabet) GobEncode

func (alpha Alphabet) GobEncode() (res []byte, err error)

GobEncode implements the encoder interface for serialising an alphabet to a stream of bytes

func (*Alphabet) MapToBytes

func (alpha *Alphabet) MapToBytes(x string) ([]byte, error)

MapToBytes translates a string into a byte slice, mapping characters according to the alphabet

func (*Alphabet) MapToBytesWithSentinel

func (alpha *Alphabet) MapToBytesWithSentinel(x string) ([]byte, error)

MapToBytesWithSentinel translates a string into a byte slice, mapping characters according to the alphabet. The resulting byte slice has a terminal zero, acting as a sentinel.

func (*Alphabet) MapToInts

func (alpha *Alphabet) MapToInts(x string) ([]int32, error)

MapToInts translates a string into an integer slice, mapping characters according to the alphabet. It only differs from MapToBytes in the type of the output, but is used in algorithms where we need to operate on integers rather than the smaller bytes.

func (*Alphabet) MapToIntsWithSentinel

func (alpha *Alphabet) MapToIntsWithSentinel(x string) ([]int32, error)

MapToIntsWithSentinel translates a string into an integer slice, mapping characters according to the alphabet. The resulting int slice has a terminal zero, acting as a sentinel. It only differs from MapToBytes in the type of the output, but is used in algorithms where we need to operate on integers rather than the smaller bytes.

func (*Alphabet) RevmapBytes

func (alpha *Alphabet) RevmapBytes(x []byte) string

RevmapBytes maps a byte slice back into a string according to the alphabet that was used to translate the string into bytes in the first place.

func (*Alphabet) RevmapBytesStripSentinel

func (alpha *Alphabet) RevmapBytesStripSentinel(x []byte) string

RevmapBytesStripSentinel maps a byte slice back into a string according to the alphabet that was used to translate the string into bytes in the first place. RevmapBytesStripSentinel will strip the last character in the input from the output, getting rid of a sentinel that (hopefully) was added when the byte slice was created.

func (*Alphabet) Size

func (alpha *Alphabet) Size() int

Size gives the number of letters in the alphabet. It will always be at least one, since all alphabets contain the sentinel character (zero).

type AlphabetLookupError

type AlphabetLookupError struct {
	// contains filtered or unexported fields
}

AlphabetLookupError are errors that occur if you look up a character that is not in the alphabet.

func (*AlphabetLookupError) Error

func (err *AlphabetLookupError) Error() string

Error implements the interface for errors.

type ApproxEdit

type ApproxEdit int

ApproxEdit is a type for edit operations

type CTab

type CTab struct {
	CumSum []int
}

CTab Structur holding the C-table for BWT search. This is a map from letters in the alphabet to the cumulative sum of how often we see letters in the BWT

func NewCTab

func NewCTab(bwt []byte, asize int) *CTab

NewCTab builds the c-table from a string.

func (*CTab) Rank

func (ctab *CTab) Rank(a byte) int

Rank How many times does the BWT hold a letter smaller than a? Undefined behaviour if a isn't in the table.

type EdgeLabel

type EdgeLabel []byte

EdgeLabel is the representation of a string along an edge. It is a byte slice, so it takes up constant space, holding only a pointer, length and capacity, while the underlying string is shared between the edges in the tree.

func (EdgeLabel) Revmap

func (el EdgeLabel) Revmap(alpha *Alphabet) string

Revmap maps an EdgeLabel back to the underlying string. EdgeLabels are represented as slices of bytes, mapped from a string through an alphabet, and Revmap uses the alphabet to get the string back.

type EditOps

type EditOps = []ApproxEdit

EditOps is a type for a sequence of edit operations

func CigarToOps

func CigarToOps(cigar string) (EditOps, error)

CigarToOps turns a cigar string into the list of edit ops.

type FMIndexTables

type FMIndexTables struct {
	Alpha *Alphabet
	Sa    []int32
	Ctab  *CTab
	Otab  *OTab

	// for approx matching
	Rotab *OTab
}

FMIndexTables contains the preprocessed tables used for FM-index searching

func BuildFMIndexApproxTables

func BuildFMIndexApproxTables(x string) *FMIndexTables

BuildFMIndexExactTables builds the preprocessing tables for exact FM-index searching.

func BuildFMIndexExactTables

func BuildFMIndexExactTables(x string) *FMIndexTables

BuildFMIndexExactTables builds the preprocessing tables for exact FM-index searching.

type InnerNode

type InnerNode struct {
	SharedNode
	SuffixLink *InnerNode
	Children   []STNode
}

InnerNode contains the additional properties that only inner nodes have.

type InvalidCigar

type InvalidCigar struct {
	// contains filtered or unexported fields
}

InvalidCigar are errors when you use a cigar that isn't in the right format

func NewInvalidCigar

func NewInvalidCigar(x string) *InvalidCigar

NewInvalidCigar creates an InvalidCigar error

func (*InvalidCigar) Error

func (err *InvalidCigar) Error() string

Error implements the interface for errors.

func (*InvalidCigar) Is

func (err *InvalidCigar) Is(other error) bool

Is implements the Is interface for errors.

type LeafNode

type LeafNode struct {
	SharedNode
	Index int
}

LeafNode contains the additional properties that only leaves have.

type OTab

type OTab struct {
	// contains filtered or unexported fields
}

OTab Holds the o-table (rank table) from a BWT string

func NewOTab

func NewOTab(bwt []byte, asize int) *OTab

NewOTab builds the o-table from a string. It uses the suffix array to get the BWT and a c-table to handle the alphabet.

func (*OTab) GobDecode

func (otab *OTab) GobDecode(b []byte) (err error)

GobDecode implements the decoder interface for serialising to a stream of bytes

func (OTab) GobEncode

func (otab OTab) GobEncode() (res []byte, err error)

GobEncode implements the encoder interface for serialising to a stream of bytes

func (*OTab) Rank

func (otab *OTab) Rank(a byte, i int) int

Rank How many times do we see letter a before index i in the BWT string?

type STNode

type STNode struct {
	NodeType STNodeType
	// contains filtered or unexported fields
}

STNode wraps either a leaf or an inner node. Use the node type determine which, before you access it as a node.

func (STNode) Inner

func (n STNode) Inner() *InnerNode

Inner casts a node to a leaf. You should only do this if the NodeType is Inner.

func (STNode) IsNil

func (n STNode) IsNil() bool

IsNil returns true if the node represents a nil pointer. If it does, you cannot cast it to any other node type.

func (STNode) Leaf

func (n STNode) Leaf() *LeafNode

Leaf casts a node to a leaf. You should only do this if the NodeType is Leaf.

func (STNode) LeafIndices

func (n STNode) LeafIndices(fn func(int))

LeafIndices maps fn over all the leaf indices in the subtree rooted at n.

func (STNode) PathLabel

func (n STNode) PathLabel(alpha *Alphabet) string

PathLabel returns the string from the root down to a node. Do not call it with a NIL node, that is an error the function will crash.

func (STNode) Shared

func (n STNode) Shared() *SharedNode

Shared returns a pointer to the shared part of leaves and inner nodes. It is an error to access a node if it IsNil(), but otherwise, you can get the shared part of both leaves and inner nodes.

func (STNode) ToDot

func (n STNode) ToDot(alpha *Alphabet, w io.Writer)

ToDot writes the subtree starting at n to w.

Parameters:

  • alpha: The alphabet that was used to map the original string into the byte representation stored in the tree. You can get it from the suffix tree.
  • w: the output stream to write the dot representation to.

type STNodeType

type STNodeType int

STNodeType is a tag for identifying when we have leaves and when we have inner nodes.

const (
	// Leaf means that the node is a leaf
	Leaf STNodeType = iota
	// Inner means that the node is an inner node
	Inner STNodeType = iota
)

type SharedNode

type SharedNode struct {
	EdgeLabel
	Parent *InnerNode
}

SharedNode contains the attributes that both leaves and inner nodes have.

type SuffixTree

type SuffixTree struct {
	Alpha  *Alphabet
	String []byte
	Root   STNode
}

SuffixTree is the representation of a suffix tree.

func McCreight

func McCreight(x string) *SuffixTree

McCreight constructs a suffix tree using McCreight's algorithm.

func NaiveST

func NaiveST(x string) *SuffixTree

NaiveST is the naive O(n²) construction algorithm.

func (*SuffixTree) ComputeSuffixAndLcpArray

func (st *SuffixTree) ComputeSuffixAndLcpArray() (sa, lcp []int32)

ComputeSuffixAndLcpArray constructs a suffix array and longest common prefix array from a suffix tree.

func (*SuffixTree) Search

func (st *SuffixTree) Search(p string, visitor func(int))

Search maps visitor through all the leaves in the subtree found by a search.

func (*SuffixTree) ToDot

func (st *SuffixTree) ToDot(w io.Writer)

ToDot writes a dot representation of the tree to the output writer w.

type Trie

type Trie struct {
	// The parent of the node, unless this node is the root
	Parent *Trie
	// The suffix link of this node
	Suffix *Trie
	// Outlist for Aho-Corasick
	Outlist *Trie

	// The string label of this node. We assume that
	// only one string can be mapped to the same node.
	Label int
	// The children of this node
	Children [256]*Trie // If you map the alphabet, you could save some space...
}

Trie is both a trie and a node in a trie.

func BuildTrie

func BuildTrie(strings []string) *Trie

BuildTrie builds a new trie from a sequence of strings.

func (*Trie) Contains

func (t *Trie) Contains(p string) bool

Contains check if the trie contains the string p

func (*Trie) FindNode

func (t *Trie) FindNode(p string) *Trie

FindNode finds the node at the end of the string, or returns nil if there isn't one.

func (*Trie) IsRoot

func (t *Trie) IsRoot() bool

IsRoot returns true if and only if this trie is the root node.

func (*Trie) SetSuffixAndOutput

func (t *Trie) SetSuffixAndOutput()

SetSuffixAndOutput sets the suffix link and output for Aho-Corasick

func (*Trie) ToDot

func (t *Trie) ToDot(w io.Writer)

ToDot writes a trie structure to the writer in Dot format.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL