tokenizers

package
v0.2.0 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 30, 2023 License: BSD-2-Clause Imports: 0 Imported by: 0

Documentation

Overview

Package tokenizers is an interim solution while developing `gotokenizers` (https://github.com/nlpodyssey/gotokenizers). APIs and implementations may be subject to frequent refactoring.

Functions

func GetStrings

func GetStrings(tokens []StringOffsetsPair) []string

GetStrings returns a sequence of string values from the given slice of StringOffsetsPair.
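
As an illustration of the behaviour described above, the following is a minimal sketch rather than the package's own code. The types are redeclared locally so the example compiles on its own, and `getStrings` is a hypothetical stand-in that assumes the result is simply the String field of each pair, in input order.

```go
package main

import "fmt"

// Local mirrors of the documented types, so the sketch is self-contained.
type OffsetsType struct {
	Start int
	End   int
}

type StringOffsetsPair struct {
	String  string
	Offsets OffsetsType
}

// getStrings is a hypothetical stand-in for GetStrings. Assumption: it
// collects the String field of every pair and preserves the input order.
func getStrings(tokens []StringOffsetsPair) []string {
	result := make([]string, len(tokens))
	for i, t := range tokens {
		result[i] = t.String
	}
	return result
}

func main() {
	tokens := []StringOffsetsPair{
		{String: "Hello", Offsets: OffsetsType{Start: 0, End: 5}},
		{String: "world", Offsets: OffsetsType{Start: 6, End: 11}},
	}
	fmt.Println(getStrings(tokens)) // prints [Hello world]
}
```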

Types

type OffsetsType

type OffsetsType struct {
	Start int
	End   int
}

OffsetsType represents a (start, end) offsets pair. It usually represents a lower inclusive index position, and an upper exclusive position.
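
The inclusive-start, exclusive-end convention matches Go's own slice expressions, so a span can be cut out of the original text directly. The sketch below assumes the offsets are byte indices, which the documentation does not state; `span` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// Local mirror of the documented type, so the sketch is self-contained.
type OffsetsType struct {
	Start int
	End   int
}

// span returns the part of text covered by o, treating Start as inclusive
// and End as exclusive. Assumption: offsets are byte indices into text.
func span(text string, o OffsetsType) string {
	return text[o.Start:o.End]
}

func main() {
	text := "Hello, world"
	o := OffsetsType{Start: 7, End: 12}

	fmt.Println(span(text, o))   // prints world
	fmt.Println(o.End - o.Start) // prints 5, the length of the span
}
```

With a half-open range, the span length is End minus Start, and two adjacent spans share a boundary value without overlapping.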

func GetOffsets

func GetOffsets(tokens []StringOffsetsPair) []OffsetsType

GetOffsets returns a sequence of offsets values from the given slice of StringOffsetsPair.
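
One possible use of the extracted offsets is marking token boundaries in the original text. The sketch below is illustrative only: `getOffsets` is a hypothetical stand-in assumed to collect the Offsets field of each pair in order, and `bracket` is an invented helper that assumes sorted, non-overlapping byte offsets.

```go
package main

import (
	"fmt"
	"strings"
)

// Local mirrors of the documented types, so the sketch is self-contained.
type OffsetsType struct {
	Start int
	End   int
}

type StringOffsetsPair struct {
	String  string
	Offsets OffsetsType
}

// getOffsets is a hypothetical stand-in for GetOffsets. Assumption: it
// collects the Offsets field of every pair and preserves the input order.
func getOffsets(tokens []StringOffsetsPair) []OffsetsType {
	result := make([]OffsetsType, len(tokens))
	for i, t := range tokens {
		result[i] = t.Offsets
	}
	return result
}

// bracket wraps every span of text in square brackets and copies the text
// between spans unchanged. Assumption: offsets are sorted, non-overlapping
// byte indices into text.
func bracket(text string, offsets []OffsetsType) string {
	var b strings.Builder
	prev := 0
	for _, o := range offsets {
		b.WriteString(text[prev:o.Start])
		b.WriteString("[" + text[o.Start:o.End] + "]")
		prev = o.End
	}
	b.WriteString(text[prev:])
	return b.String()
}

func main() {
	text := "Hello, world"
	tokens := []StringOffsetsPair{
		{String: "Hello", Offsets: OffsetsType{Start: 0, End: 5}},
		{String: ",", Offsets: OffsetsType{Start: 5, End: 6}},
		{String: "world", Offsets: OffsetsType{Start: 7, End: 12}},
	}
	fmt.Println(bracket(text, getOffsets(tokens))) // prints [Hello][,] [world]
}
```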

type StringOffsetsPair

type StringOffsetsPair struct {
	String  string
	Offsets OffsetsType
}

StringOffsetsPair represents a string value paired with offsets bounds. It usually represents a token string and its offsets positions in the original string.

type Tokenizer

type Tokenizer interface {
	Tokenize(text string) []StringOffsetsPair
}

Tokenizer is implemented by any value that has the Tokenize method.
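
To show what satisfying the interface might look like, here is a minimal sketch of a whitespace-splitting implementation. `whitespaceTokenizer` is a hypothetical type invented for this example and is not one of the package's tokenizers; the types are redeclared locally, and offsets are assumed to be byte indices with an exclusive end.

```go
package main

import (
	"fmt"
	"unicode"
)

// Local mirrors of the documented types, so the sketch is self-contained.
type OffsetsType struct {
	Start int
	End   int
}

type StringOffsetsPair struct {
	String  string
	Offsets OffsetsType
}

type Tokenizer interface {
	Tokenize(text string) []StringOffsetsPair
}

// whitespaceTokenizer is a hypothetical implementation that splits text on
// Unicode whitespace and records byte offsets for each token.
type whitespaceTokenizer struct{}

func (whitespaceTokenizer) Tokenize(text string) []StringOffsetsPair {
	var tokens []StringOffsetsPair
	start := -1 // byte index where the current token began, or -1 if none
	for i, r := range text {
		if unicode.IsSpace(r) {
			if start >= 0 {
				tokens = append(tokens, StringOffsetsPair{
					String:  text[start:i],
					Offsets: OffsetsType{Start: start, End: i},
				})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 { // flush a token that runs to the end of the text
		tokens = append(tokens, StringOffsetsPair{
			String:  text[start:],
			Offsets: OffsetsType{Start: start, End: len(text)},
		})
	}
	return tokens
}

// Compile-time check that the type satisfies the interface.
var _ Tokenizer = whitespaceTokenizer{}

func main() {
	var tok Tokenizer = whitespaceTokenizer{}
	for _, t := range tok.Tokenize("to be  or") {
		fmt.Printf("%q [%d, %d)\n", t.String, t.Offsets.Start, t.Offsets.End)
	}
	// prints:
	// "to" [0, 2)
	// "be" [3, 5)
	// "or" [7, 9)
}
```

Because the interface has a single method, any type with a matching Tokenize method can be passed wherever a Tokenizer is expected, without declaring the relationship explicitly.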

Directories

Path Synopsis
Package basetokenizer provides an implementation of a very simple tokenizer that splits on whitespace (and similar characters) and punctuation symbols.
internal/sentencepiece
Package sentencepiece implements the SentencePiece encoder (Kudo and Richardson, 2018).
