textsplitter

package

v0.4.1 Published: Nov 14, 2023 License: MIT Imports: 10 Imported by: 0

Documentation ¶

Overview ¶

Package textsplitter provides tools for splitting long texts into smaller chunks based on configurable rules and parameters. Smaller chunks are easier to process when interacting with language models or other text-processing tools.

The main components of this package are:

- TextSplitter interface: a common interface for splitting texts into smaller chunks.
- RecursiveCharacter: a text splitter that recursively splits texts by different characters (separators) combined with chunk size and overlap settings.
- Helper functions: utility functions for creating documents out of split texts and rejoining them if necessary.

Using the TextSplitter interface, developers can implement custom splitting strategies for their specific use cases and requirements.
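As a minimal sketch of a custom strategy, the example below declares a local copy of the TextSplitter interface (so that it compiles without this module) and implements it with a hypothetical LineSplitter type that is not part of the package. It emits one chunk per non-blank line.

```go
package main

import (
	"fmt"
	"strings"
)

// TextSplitter is a local copy of the interface documented in this package,
// declared here so the sketch is self-contained.
type TextSplitter interface {
	SplitText(string) ([]string, error)
}

// LineSplitter is a hypothetical custom strategy (not part of the package):
// it produces one chunk per non-blank line.
type LineSplitter struct{}

// SplitText returns every non-blank line of text with surrounding
// whitespace trimmed. It never returns an error.
func (LineSplitter) SplitText(text string) ([]string, error) {
	var chunks []string
	for _, line := range strings.Split(text, "\n") {
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			chunks = append(chunks, trimmed)
		}
	}
	return chunks, nil
}

func main() {
	// Any value with a matching SplitText method satisfies the interface.
	var s TextSplitter = LineSplitter{}
	chunks, err := s.SplitText("first line\n\n  second line  \nthird")
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%q\n", c)
	}
}
```

Because the type has a SplitText method with the documented signature, it should be usable anywhere the package accepts a TextSplitter, such as CreateDocuments or SplitDocuments.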

Index ¶

Constants
Variables
func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)
func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)
type RecursiveCharacter
- func NewRecursiveCharacter() RecursiveCharacter
- func (s RecursiveCharacter) SplitText(text string) ([]string, error)
type SentenceSplitter
- func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter
type Split
type TextSplitter
type TokenSplitter
- func NewTokenSplitter() TokenSplitter
- func (s TokenSplitter) SplitText(text string) ([]string, error)

Constants ¶

const CHUNKING_REGEX = "[^,.;。]+[,.;。]?"
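To see what this pattern matches, the sketch below applies it with the standard library's regexp package. Each match is a run of characters that are not `,`, `.`, `;` or `。`, optionally followed by one of those punctuation marks. How the package itself applies the pattern is not shown on this page, so the use of FindAllString and the helper name splitClauses are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// chunkingRegex repeats the value of the CHUNKING_REGEX constant.
const chunkingRegex = "[^,.;。]+[,.;。]?"

// splitClauses is a hypothetical helper that returns every
// non-overlapping match of the pattern in text, in order.
func splitClauses(text string) []string {
	return regexp.MustCompile(chunkingRegex).FindAllString(text, -1)
}

func main() {
	for _, part := range splitClauses("Hello, world. 你好。Done") {
		fmt.Printf("%q\n", part)
	}
	// Output:
	// "Hello,"
	// " world."
	// " 你好。"
	// "Done"
}
```

Note that leading whitespace stays attached to each match, and a punctuation mark with no preceding text is skipped because the first character class requires at least one character.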

const DEFUALT_PARAGRAPH_SEP = "\n\n"

const TokenEncoding = "cl100k_base"

Variables ¶

var AllowedSpecial = []string{"all"}

var DisallowedSpecial = []string{"all"}

var ErrMismatchMetadatasAndText = errors.New("number of texts and metadatas does not match")

ErrMismatchMetadatasAndText is returned when the number of texts and metadatas given to CreateDocuments does not match. The function will not error if the length of the metadatas slice is zero.

Functions ¶

func CreateDocuments ¶

func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)

CreateDocuments creates documents from texts and metadatas with a text splitter. If the length of metadatas is zero, the resulting documents will contain no metadata. Otherwise the number of texts and the number of metadatas must match.
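The contract described above can be sketched as follows. This is not the package's implementation: Document is a hypothetical stand-in for schema.Document (its field names are assumptions), and createDocuments, errMismatch and wordSplitter are illustrative names. The sketch only shows the length check and how each chunk might inherit the metadata of the text it came from.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TextSplitter is a local copy of the documented interface.
type TextSplitter interface {
	SplitText(string) ([]string, error)
}

// Document is a hypothetical stand-in for schema.Document.
// The field names are assumptions made for this sketch.
type Document struct {
	Content  string
	Metadata map[string]any
}

// errMismatch plays the role of ErrMismatchMetadatasAndText.
var errMismatch = errors.New("number of texts and metadatas does not match")

// createDocuments sketches the documented rules: an empty metadatas slice
// is allowed, otherwise its length must equal len(texts).
func createDocuments(s TextSplitter, texts []string, metadatas []map[string]any) ([]Document, error) {
	if len(metadatas) != 0 && len(metadatas) != len(texts) {
		return nil, errMismatch
	}
	var docs []Document
	for i, text := range texts {
		chunks, err := s.SplitText(text)
		if err != nil {
			return nil, err
		}
		for _, chunk := range chunks {
			doc := Document{Content: chunk}
			if len(metadatas) != 0 {
				// Every chunk of texts[i] shares the same metadata map.
				doc.Metadata = metadatas[i]
			}
			docs = append(docs, doc)
		}
	}
	return docs, nil
}

// wordSplitter is a toy splitter used only to drive the example.
type wordSplitter struct{}

func (wordSplitter) SplitText(text string) ([]string, error) {
	return strings.Fields(text), nil
}

func main() {
	texts := []string{"a b", "c"}

	docs, err := createDocuments(wordSplitter{}, texts, []map[string]any{{"src": "one"}, {"src": "two"}})
	fmt.Println(len(docs), err)

	// One metadata entry for two texts violates the contract.
	_, err = createDocuments(wordSplitter{}, texts, []map[string]any{{"src": "one"}})
	fmt.Println(err)
}
```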

func SplitDocuments ¶

func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)

SplitDocuments splits documents using a TextSplitter.

Types ¶

type RecursiveCharacter ¶

type RecursiveCharacter struct {
	Separators   []string
	ChunkSize    int
	ChunkOverlap int
}

RecursiveCharacter is a text splitter that will split texts recursively by different characters.

func NewRecursiveCharacter ¶

func NewRecursiveCharacter() RecursiveCharacter

NewRecursiveCharacter creates a new recursive character splitter with default values. By default the separators used are "\n\n", "\n", " " and "". The chunk size is set to 4000 and chunk overlap is set to 200.
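To illustrate why the default separators are ordered from coarse to fine, here is a deliberately simplified sketch of recursive splitting. It is an assumption-laden illustration, not the package's algorithm: it ignores ChunkOverlap, measures size in runes, drops empty pieces, and the function names splitRecursive and runeLen are hypothetical. Text is first split on the coarsest separator, adjacent pieces are merged while they fit in the chunk size, and any piece that is still too large is split again with the next separator.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeLen measures chunk size in runes rather than bytes.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

// splitRecursive is a simplified, overlap-free sketch of recursive splitting.
// Separators are tried in order; "" means "split into single runes".
func splitRecursive(text string, seps []string, chunkSize int) []string {
	if runeLen(text) <= chunkSize || len(seps) == 0 {
		return []string{text}
	}
	sep, rest := seps[0], seps[1:]

	var pieces []string
	if sep == "" {
		for _, r := range text {
			pieces = append(pieces, string(r))
		}
	} else {
		pieces = strings.Split(text, sep)
	}

	var chunks []string
	current := ""
	for _, p := range pieces {
		// Try to extend the chunk being built with the next piece.
		candidate := p
		if current != "" {
			candidate = current + sep + p
		}
		if runeLen(candidate) <= chunkSize {
			current = candidate
			continue
		}
		// The piece does not fit: emit what has been collected so far.
		if current != "" {
			chunks = append(chunks, current)
			current = ""
		}
		if runeLen(p) <= chunkSize {
			current = p
		} else {
			// Still too large: fall back to the finer separators.
			chunks = append(chunks, splitRecursive(p, rest, chunkSize)...)
		}
	}
	if current != "" {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	// The documented default separators, with a tiny chunk size for display.
	seps := []string{"\n\n", "\n", " ", ""}
	for _, c := range splitRecursive("aaa bbb\n\nccc ddd eee", seps, 7) {
		fmt.Printf("%q\n", c)
	}
	// Output:
	// "aaa bbb"
	// "ccc ddd"
	// "eee"
}
```

The final empty separator acts as a last resort: a run of text with no paragraph breaks, newlines or spaces is still cut into pieces no longer than the chunk size.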

func (RecursiveCharacter) SplitText ¶

func (s RecursiveCharacter) SplitText(text string) ([]string, error)

SplitText splits a text into multiple chunks.

type SentenceSplitter ¶

type SentenceSplitter struct {
	ChunkSize              int
	ChunkOverlap           int
	Separator              string
	ParagraphSeparator     string
	SecondaryChunkingRegex string

	Tokenizer *tiktoken.Tiktoken
	// contains filtered or unexported fields
}

SentenceSplitter splits text into chunks with a preference for complete sentences.

func NewSentenceSplitter ¶

func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter

NewSentenceSplitter creates a new SentenceSplitter instance.

func (*SentenceSplitter) ChunkingTokenizerFn ¶

func (s *SentenceSplitter) ChunkingTokenizerFn() func(string) []string

func (*SentenceSplitter) SplitByChar ¶

func (s *SentenceSplitter) SplitByChar() func(string) []string

SplitByChar splits text by character.

func (*SentenceSplitter) SplitByRegex ¶

func (s *SentenceSplitter) SplitByRegex(regex string) func(string) []string

SplitByRegex splits text by regex.

func (*SentenceSplitter) SplitBySep ¶

func (s *SentenceSplitter) SplitBySep(sep string, keepSep bool) func(string) []string

SplitBySep splits text by separator.

func (*SentenceSplitter) SplitText ¶

func (s *SentenceSplitter) SplitText(text string) []string

SplitText splits text into chunks.

func (*SentenceSplitter) SplitTextMetadataAware ¶

func (s *SentenceSplitter) SplitTextMetadataAware(text, metadataStr string) []string

SplitTextMetadataAware splits text with metadata into chunks.

func (*SentenceSplitter) TokenEncode ¶

func (s *SentenceSplitter) TokenEncode(text string) []int

type Split ¶

type Split struct {
	Text       string
	IsSentence bool
}

Split represents a text split.

type TextSplitter ¶

type TextSplitter interface {
	SplitText(string) ([]string, error)
}

TextSplitter is the standard interface for splitting texts.

type TokenSplitter ¶

type TokenSplitter struct {
	ChunkSize         int
	ChunkOverlap      int
	ModelName         string
	EncodingName      string
	AllowedSpecial    []string
	DisallowedSpecial []string
}

TokenSplitter is a text splitter that will split texts by tokens.

func NewTokenSplitter ¶

func NewTokenSplitter() TokenSplitter

func (TokenSplitter) SplitText ¶

func (s TokenSplitter) SplitText(text string) ([]string, error)

SplitText splits a text into multiple chunks.
