textsplitter

package
v0.1.15
Warning

This package is not in the latest version of its module.

Published: Oct 30, 2024 License: MIT Imports: 9 Imported by: 0

Documentation

Overview

Package textsplitter provides tools for splitting long texts into smaller chunks based on configurable rules and parameters. It aims to help in processing these chunks more efficiently when interacting with language models or other text-processing tools.

The main components of this package are:

- TextSplitter interface: a common interface for splitting texts into smaller chunks.
- RecursiveCharacter: a text splitter that recursively splits texts by different characters (separators) combined with chunk size and overlap settings.
- Helper functions: utility functions for creating documents out of split texts and rejoining them if necessary.

Using the TextSplitter interface, developers can implement custom splitting strategies for their specific use cases and requirements.
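
As an illustration of a custom strategy, the following is a minimal sketch of a type that satisfies the interface. The import path of this package is not shown here, so the sketch declares a local TextSplitter with the same method set as a stand-in; LineSplitter is a hypothetical name and not part of the package.

```go
package main

import (
	"fmt"
	"strings"
)

// TextSplitter mirrors the interface documented by this package so that the
// example compiles on its own; real code would use the package's own type.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// LineSplitter is a hypothetical custom splitter: one chunk per non-blank line.
type LineSplitter struct{}

// SplitText returns every non-blank line of text, trimmed of surrounding whitespace.
func (LineSplitter) SplitText(text string) ([]string, error) {
	var chunks []string
	for _, line := range strings.Split(text, "\n") {
		if trimmed := strings.TrimSpace(line); trimmed != "" {
			chunks = append(chunks, trimmed)
		}
	}
	return chunks, nil
}

func main() {
	var sp TextSplitter = LineSplitter{}
	chunks, err := sp.SplitText("first line\n\n  second line  \n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", chunks) // ["first line" "second line"]
}
```

Because Go interfaces are satisfied implicitly, a type like this should be usable wherever a TextSplitter is accepted, for example as the first argument of CreateDocuments.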

Variables

var ErrMismatchMetadatasAndText = errors.New("number of texts and metadatas does not match")

ErrMismatchMetadatasAndText is returned when the number of texts and metadatas given to CreateDocuments does not match. No error is returned if the metadatas slice is empty.

Functions

func CreateDocuments

func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)

CreateDocuments creates documents from texts and metadatas with a text splitter. If the length of the metadatas is zero, the result documents will contain no metadata. Otherwise, the numbers of texts and metadatas must match.

func SplitDocuments

func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)

SplitDocuments splits documents using the given TextSplitter.

Types

type MarkdownTextSplitter

type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	// SecondSplitter splits paragraphs
	SecondSplitter   TextSplitter
	CodeBlocks       bool
	ReferenceLinks   bool
	HeadingHierarchy bool
	JoinTableRows    bool
}

MarkdownTextSplitter is a text splitter for Markdown that splits documents by their headers.

If your original document is HTML, sanitize it and convert it to Markdown first, then split the result.

func NewMarkdownTextSplitter

func NewMarkdownTextSplitter(opts ...Option) *MarkdownTextSplitter

NewMarkdownTextSplitter creates a new Markdown text splitter.

func (MarkdownTextSplitter) SplitText

func (sp MarkdownTextSplitter) SplitText(text string) ([]string, error)

SplitText splits a text into multiple chunks.

type Option

type Option func(*Options)

Option is a function that can be used to set options for a text splitter.
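
Option follows the functional options pattern. The sketch below shows how such options are typically applied; it redeclares a reduced Options and two of the documented option functions locally so that it compiles on its own. applyOptions is a hypothetical helper, and the starting values 4000 and 200 are borrowed from the NewRecursiveCharacter documentation rather than confirmed as the output of DefaultOptions.

```go
package main

import "fmt"

// Options is a reduced local copy of the documented struct, for illustration only.
type Options struct {
	ChunkSize    int
	ChunkOverlap int
}

// Option has the same shape as the documented type.
type Option func(*Options)

// WithChunkSize returns an Option that overrides ChunkSize.
func WithChunkSize(chunkSize int) Option {
	return func(o *Options) { o.ChunkSize = chunkSize }
}

// WithChunkOverlap returns an Option that overrides ChunkOverlap.
func WithChunkOverlap(chunkOverlap int) Option {
	return func(o *Options) { o.ChunkOverlap = chunkOverlap }
}

// applyOptions is a hypothetical helper: it starts from assumed defaults and
// lets each Option mutate the struct in the order given.
func applyOptions(opts ...Option) Options {
	o := Options{ChunkSize: 4000, ChunkOverlap: 200}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	fmt.Printf("%+v\n", applyOptions())                   // {ChunkSize:4000 ChunkOverlap:200}
	fmt.Printf("%+v\n", applyOptions(WithChunkSize(512))) // {ChunkSize:512 ChunkOverlap:200}
}
```

Options that are not passed keep their starting values, which is why constructors such as NewRecursiveCharacter can be called with no arguments.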

func WithAllowedSpecial

func WithAllowedSpecial(allowedSpecial []string) Option

WithAllowedSpecial sets the allowed special tokens for a text splitter.

func WithChunkOverlap

func WithChunkOverlap(chunkOverlap int) Option

WithChunkOverlap sets the chunk overlap for a text splitter.

func WithChunkSize

func WithChunkSize(chunkSize int) Option

WithChunkSize sets the chunk size for a text splitter.

func WithCodeBlocks

func WithCodeBlocks(renderCode bool) Option

WithCodeBlocks sets whether indented and fenced code blocks should be included in the output.

func WithDisallowedSpecial

func WithDisallowedSpecial(disallowedSpecial []string) Option

WithDisallowedSpecial sets the disallowed special tokens for a text splitter.

func WithEncodingName

func WithEncodingName(encodingName string) Option

WithEncodingName sets the encoding name for a text splitter.

func WithHeadingHierarchy

func WithHeadingHierarchy(trackHeadingHierarchy bool) Option

WithHeadingHierarchy sets whether the hierarchy of headings in a document should be persisted in the resulting chunks. When set to true, each chunk is prepended with a list of all parent headings in the hierarchy up to that point. This allows similarity search to return more relevant chunks. Defaults to false if not specified.
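
To make the idea of prepending parent headings concrete, here is a minimal sketch written independently of this package. It assumes that headings use the ATX form ("# Title") and that headings and paragraphs are separated by blank lines; chunksWithHeadings and headingLevel are hypothetical helpers, and the exact output format of the real splitter may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// heading records one ATX heading and its level (1 to 6).
type heading struct {
	level int
	text  string
}

// headingLevel returns the ATX heading level of line, or 0 if it is not a heading.
func headingLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 || n == len(line) || line[n] != ' ' {
		return 0
	}
	return n
}

// chunksWithHeadings emits one chunk per paragraph, prefixed with the
// headings that are currently open above it.
func chunksWithHeadings(markdown string) []string {
	var stack []heading
	var chunks []string
	for _, block := range strings.Split(markdown, "\n\n") {
		block = strings.TrimSpace(block)
		if block == "" {
			continue
		}
		if level := headingLevel(block); level > 0 {
			// Drop headings at the same or a deeper level; they are no longer parents.
			for len(stack) > 0 && stack[len(stack)-1].level >= level {
				stack = stack[:len(stack)-1]
			}
			stack = append(stack, heading{level: level, text: block})
			continue
		}
		var b strings.Builder
		for _, h := range stack {
			b.WriteString(h.text)
			b.WriteString("\n")
		}
		b.WriteString(block)
		chunks = append(chunks, b.String())
	}
	return chunks
}

func main() {
	doc := "# Guide\n\n## Install\n\nRun the installer.\n\n## Usage\n\nCall the API."
	for _, c := range chunksWithHeadings(doc) {
		fmt.Printf("%q\n", c)
	}
	// "# Guide\n## Install\nRun the installer."
	// "# Guide\n## Usage\nCall the API."
}
```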

func WithJoinTableRows

func WithJoinTableRows(join bool) Option

WithJoinTableRows sets whether table rows should be joined into larger chunks. When set to true, table rows are joined until the chunk size is reached. When set to false (the default), tables are split by row, so that each row is in a separate chunk.

func WithKeepSeparator

func WithKeepSeparator(keepSeparator bool) Option

WithKeepSeparator sets whether separators are kept in the resulting split text. When set to true, the separators are included in the resulting split text; when set to false, they are not. Defaults to false if not specified.

func WithLenFunc

func WithLenFunc(lenFunc func(string) int) Option

WithLenFunc sets the function that a text splitter uses to measure the length of a text.

func WithModelName

func WithModelName(modelName string) Option

WithModelName sets the model name for a text splitter.

func WithReferenceLinks

func WithReferenceLinks(referenceLinks bool) Option

WithReferenceLinks sets whether reference links (i.e. `[text][label]`) should be patched with the url and title from their definition. Note that by default reference definitions are dropped from the output.

Caution: this also affects how other inline elements are rendered, e.g. all emphasis will use `*` even when another character (e.g. `_`) was used in the input.

func WithSecondSplitter

func WithSecondSplitter(secondSplitter TextSplitter) Option

WithSecondSplitter sets the second splitter for a text splitter.

func WithSeparators

func WithSeparators(separators []string) Option

WithSeparators sets the separators for a text splitter.

type Options

type Options struct {
	ChunkSize            int
	ChunkOverlap         int
	Separators           []string
	KeepSeparator        bool
	LenFunc              func(string) int
	ModelName            string
	EncodingName         string
	AllowedSpecial       []string
	DisallowedSpecial    []string
	SecondSplitter       TextSplitter
	CodeBlocks           bool
	ReferenceLinks       bool
	KeepHeadingHierarchy bool // Persist hierarchy of markdown headers in each chunk
	JoinTableRows        bool
}

Options is a struct that contains options for a text splitter.

func DefaultOptions

func DefaultOptions() Options

DefaultOptions returns the default options for all text splitters.

type RecursiveCharacter

type RecursiveCharacter struct {
	Separators    []string
	ChunkSize     int
	ChunkOverlap  int
	LenFunc       func(string) int
	KeepSeparator bool
}

RecursiveCharacter is a text splitter that will split texts recursively by different characters.

func NewRecursiveCharacter

func NewRecursiveCharacter(opts ...Option) RecursiveCharacter

NewRecursiveCharacter creates a new recursive character splitter with default values. By default, the separators used are "\n\n", "\n", " " and "". The chunk size is set to 4000 and chunk overlap is set to 200.
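
The recursive strategy can be sketched in a few lines. The following is a simplified illustration and not the package's implementation: splitRecursive is a hypothetical function, it measures length in bytes, it drops the separator it split on when a piece is emitted alone, and it omits chunk overlap entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRecursive tries the first separator, merges neighbouring pieces while
// they still fit in chunkSize bytes, and recurses with the remaining
// separators on any piece that is still too long.
func splitRecursive(text string, separators []string, chunkSize int) []string {
	if text == "" {
		return nil
	}
	if len(text) <= chunkSize || len(separators) == 0 {
		return []string{text} // fits, or cannot be split any further
	}
	sep, rest := separators[0], separators[1:]

	var chunks []string
	current := ""
	flush := func() {
		if current != "" {
			chunks = append(chunks, current)
			current = ""
		}
	}
	// An empty separator makes strings.Split cut after every UTF-8 sequence.
	for _, part := range strings.Split(text, sep) {
		if part == "" {
			continue
		}
		if len(part) > chunkSize {
			flush()
			chunks = append(chunks, splitRecursive(part, rest, chunkSize)...)
			continue
		}
		candidate := part
		if current != "" {
			candidate = current + sep + part
		}
		if len(candidate) <= chunkSize {
			current = candidate
		} else {
			flush()
			current = part
		}
	}
	flush()
	return chunks
}

func main() {
	separators := []string{"\n\n", "\n", " ", ""} // the documented defaults
	text := "alpha beta\n\ngamma delta epsilon"
	fmt.Printf("%q\n", splitRecursive(text, separators, 12))
	// ["alpha beta" "gamma delta" "epsilon"]
}
```

Coarse separators are tried first so that paragraphs and lines stay together whenever they fit, and finer separators are used only for pieces that are still too large.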

func (RecursiveCharacter) SplitText

func (s RecursiveCharacter) SplitText(text string) ([]string, error)

SplitText splits a text into multiple chunks.

type TextSplitter

type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

TextSplitter is the standard interface for splitting texts.

type TokenSplitter

type TokenSplitter struct {
	ChunkSize         int
	ChunkOverlap      int
	ModelName         string
	EncodingName      string
	AllowedSpecial    []string
	DisallowedSpecial []string
}

TokenSplitter is a text splitter that will split texts by tokens.

func NewTokenSplitter

func NewTokenSplitter(opts ...Option) TokenSplitter

func (TokenSplitter) SplitText

func (s TokenSplitter) SplitText(text string) ([]string, error)

SplitText splits a text into multiple chunks.
