Documentation ¶
Overview ¶
Package textsplitter provides tools for splitting long texts into smaller chunks according to configurable rules and parameters, so that the chunks can be processed more efficiently by language models or other text-processing tools.
The main components of this package are:
- TextSplitter interface: a common interface for splitting texts into smaller chunks.
- RecursiveCharacter: a text splitter that recursively splits texts by different characters (separators) combined with chunk size and overlap settings.
- Helper functions: utility functions for creating documents out of split texts and rejoining them if necessary.
Using the TextSplitter interface, developers can implement custom splitting strategies for their specific use cases and requirements.
Index ¶
- Variables
- func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)
- func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)
- type MarkdownTextSplitter
- type Option
- func WithAllowedSpecial(allowedSpecial []string) Option
- func WithChunkOverlap(chunkOverlap int) Option
- func WithChunkSize(chunkSize int) Option
- func WithCodeBlocks(renderCode bool) Option
- func WithDisallowedSpecial(disallowedSpecial []string) Option
- func WithEncodingName(encodingName string) Option
- func WithHeadingHierarchy(trackHeadingHierarchy bool) Option
- func WithKeepSeparator(keepSeparator bool) Option
- func WithLenFunc(lenFunc func(string) int) Option
- func WithModelName(modelName string) Option
- func WithReferenceLinks(referenceLinks bool) Option
- func WithSecondSplitter(secondSplitter TextSplitter) Option
- func WithSeparators(separators []string) Option
- type Options
- type RecursiveCharacter
- type TextSplitter
- type TokenSplitter
Constants ¶
This section is empty.
Variables ¶
var ErrMismatchMetadatasAndText = errors.New("number of texts and metadatas does not match")
ErrMismatchMetadatasAndText is returned when the number of texts and metadatas given to CreateDocuments does not match. The function will not error if the length of the metadatas slice is zero.
Functions ¶
func CreateDocuments ¶
func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)
CreateDocuments creates documents from texts and metadatas with a text splitter. If the length of the metadatas is zero, the result documents will contain no metadata. Otherwise, the numbers of texts and metadatas must match.
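The length rule described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `Document` is a local stand-in for schema.Document (its fields are assumptions), and `createDocuments`, `splitLines`, and `errMismatch` are hypothetical names. The sketch also assumes every chunk of a text shares that text's metadata map.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Document is a local stand-in for schema.Document; the field names are assumptions.
type Document struct {
	PageContent string
	Metadata    map[string]any
}

// errMismatch plays the role of ErrMismatchMetadatasAndText in this sketch.
var errMismatch = errors.New("number of texts and metadatas does not match")

// createDocuments is a hypothetical re-implementation of the documented contract:
// an empty metadatas slice is accepted, otherwise the lengths must match.
func createDocuments(split func(string) ([]string, error), texts []string, metadatas []map[string]any) ([]Document, error) {
	if len(metadatas) != 0 && len(metadatas) != len(texts) {
		return nil, errMismatch
	}
	var docs []Document
	for i, text := range texts {
		chunks, err := split(text)
		if err != nil {
			return nil, err
		}
		for _, chunk := range chunks {
			doc := Document{PageContent: chunk}
			if len(metadatas) != 0 {
				doc.Metadata = metadatas[i] // assumption: chunks share the source text's metadata
			}
			docs = append(docs, doc)
		}
	}
	return docs, nil
}

// splitLines is a trivial splitting function used only for the demonstration.
func splitLines(text string) ([]string, error) { return strings.Split(text, "\n"), nil }

func main() {
	docs, err := createDocuments(splitLines, []string{"a\nb", "c"},
		[]map[string]any{{"src": "one"}, {"src": "two"}})
	fmt.Println(len(docs), err)

	_, err = createDocuments(splitLines, []string{"a", "b"}, []map[string]any{{"src": "one"}})
	fmt.Println(errors.Is(err, errMismatch))
}
```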
func SplitDocuments ¶
func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)
SplitDocuments splits documents using a text splitter.
Types ¶
type MarkdownTextSplitter ¶
type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	// SecondSplitter splits paragraphs
	SecondSplitter   TextSplitter
	CodeBlocks       bool
	ReferenceLinks   bool
	HeadingHierarchy bool
}
MarkdownTextSplitter is a Markdown header text splitter.
If your source document is HTML, sanitize it and convert it to Markdown first, then split it.
func NewMarkdownTextSplitter ¶
func NewMarkdownTextSplitter(opts ...Option) *MarkdownTextSplitter
NewMarkdownTextSplitter creates a new Markdown text splitter.
type Option ¶
type Option func(*Options)
Option is a function that can be used to set options for a text splitter.
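The functional-options pattern behind `Option` can be sketched in isolation. The types below are local copies with only two fields, and `applyOptions` plus the starting values are hypothetical, chosen for illustration rather than taken from the package.

```go
package main

import "fmt"

// Options is a reduced local copy of the package's Options struct.
type Options struct {
	ChunkSize    int
	ChunkOverlap int
}

// Option mutates an Options value in place.
type Option func(*Options)

// WithChunkSize returns an Option that overrides the chunk size.
func WithChunkSize(chunkSize int) Option {
	return func(o *Options) { o.ChunkSize = chunkSize }
}

// WithChunkOverlap returns an Option that overrides the chunk overlap.
func WithChunkOverlap(chunkOverlap int) Option {
	return func(o *Options) { o.ChunkOverlap = chunkOverlap }
}

// applyOptions is a hypothetical helper: it starts from the given defaults
// and applies each option in order, so later options win.
func applyOptions(defaults Options, opts ...Option) Options {
	for _, opt := range opts {
		opt(&defaults)
	}
	return defaults
}

func main() {
	// The starting values here are illustrative only.
	o := applyOptions(Options{ChunkSize: 4000, ChunkOverlap: 200}, WithChunkSize(1000))
	fmt.Println(o.ChunkSize, o.ChunkOverlap) // prints: 1000 200
}
```

Constructors that accept `opts ...Option` can follow the same shape: build the defaults, apply the options, then copy the resulting fields into the splitter.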
func WithAllowedSpecial ¶
WithAllowedSpecial sets the allowed special tokens for a text splitter.
func WithChunkOverlap ¶
WithChunkOverlap sets the chunk overlap for a text splitter.
func WithChunkSize ¶
WithChunkSize sets the chunk size for a text splitter.
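To make the interaction between chunk size and chunk overlap concrete, here is a minimal sliding-window sketch. It is not how the package's splitters work internally (they split on separators or tokens); `windows` is a hypothetical helper that measures length in runes.

```go
package main

import "fmt"

// windows cuts text into pieces of at most size runes, where each piece
// repeats the last overlap runes of the previous one.
func windows(text string, size, overlap int) []string {
	if size <= 0 || overlap < 0 || overlap >= size {
		return nil // invalid settings: the window would never advance
	}
	runes := []rune(text)
	step := size - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + size
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break // the final window already reaches the end of the text
		}
	}
	return chunks
}

func main() {
	fmt.Printf("%q\n", windows("abcdefghij", 4, 1)) // prints: ["abcd" "defg" "ghij"]
}
```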
func WithCodeBlocks ¶ added in v0.1.4
WithCodeBlocks sets whether indented and fenced code blocks should be included in the output.
func WithDisallowedSpecial ¶
WithDisallowedSpecial sets the disallowed special tokens for a text splitter.
func WithEncodingName ¶
WithEncodingName sets the encoding name for a text splitter.
func WithHeadingHierarchy ¶ added in v0.1.12
WithHeadingHierarchy sets whether the hierarchy of headings in a document is preserved in the resulting chunks. When set to true, each chunk is prepended with a list of all its parent headings in the hierarchy up to that point. This allows similarity search to return more relevant chunks. Defaults to false if not specified.
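The idea of prepending parent headings can be illustrated with a small stand-alone sketch. It does not reproduce the package's Markdown parsing or its exact output format: `chunksWithHeadings` and `headingLevel` are hypothetical helpers that handle only ATX headings (`#` to `######`) and treat every non-heading line as its own chunk.

```go
package main

import (
	"fmt"
	"strings"
)

type heading struct {
	level int
	text  string
}

// headingLevel returns 1..6 for an ATX heading line and 0 for anything else.
func headingLevel(line string) int {
	n := 0
	for n < len(line) && line[n] == '#' {
		n++
	}
	if n == 0 || n > 6 || n == len(line) || line[n] != ' ' {
		return 0
	}
	return n
}

// chunksWithHeadings emits one chunk per non-heading line, prefixed with
// the headings that are in scope at that point.
func chunksWithHeadings(markdown string) []string {
	var stack []heading
	var chunks []string
	for _, line := range strings.Split(markdown, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if level := headingLevel(line); level > 0 {
			// A new heading closes every heading at the same or a deeper level.
			for len(stack) > 0 && stack[len(stack)-1].level >= level {
				stack = stack[:len(stack)-1]
			}
			stack = append(stack, heading{level, line})
			continue
		}
		parts := make([]string, 0, len(stack)+1)
		for _, h := range stack {
			parts = append(parts, h.text)
		}
		parts = append(parts, line)
		chunks = append(chunks, strings.Join(parts, "\n"))
	}
	return chunks
}

func main() {
	doc := "# Guide\n## Install\nRun the installer.\n## Usage\nCall the API."
	for _, c := range chunksWithHeadings(doc) {
		fmt.Printf("%q\n", c)
	}
}
```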
func WithKeepSeparator ¶ added in v0.1.10
WithKeepSeparator sets whether separators are kept in the resulting split text. When set to true, the separators are included in the split text; when set to false, they are dropped. Defaults to false if not specified.
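The difference between the two settings can be shown with a short sketch. `splitOn` is a hypothetical helper, and attaching the kept separator to the start of the following piece is an assumption made for the example; the package may attach it differently.

```go
package main

import (
	"fmt"
	"strings"
)

// splitOn splits text on a non-empty separator. With keep set, each piece
// after the first carries the separator that preceded it.
func splitOn(text, sep string, keep bool) []string {
	parts := strings.Split(text, sep)
	if !keep {
		return parts
	}
	out := make([]string, 0, len(parts))
	for i, p := range parts {
		if i > 0 {
			p = sep + p // assumption: the separator leads the following piece
		}
		out = append(out, p)
	}
	return out
}

func main() {
	fmt.Printf("%q\n", splitOn("a.b.c", ".", false)) // prints: ["a" "b" "c"]
	fmt.Printf("%q\n", splitOn("a.b.c", ".", true))  // prints: ["a" ".b" ".c"]
}
```

With the separators kept, concatenating the pieces reproduces the original text exactly, which matters when chunks are later rejoined.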
func WithLenFunc ¶ added in v0.1.5
WithLenFunc sets the length function (LenFunc) that a text splitter uses to measure text.
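Any function of type `func(string) int` fits the LenFunc field. The three candidates below are illustrative, with hypothetical names, and show how the choice changes the measured length of the same string.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// byteLen counts bytes, so multi-byte characters weigh more than one unit.
func byteLen(s string) int { return len(s) }

// runeLen counts Unicode code points.
func runeLen(s string) int { return utf8.RuneCountInString(s) }

// wordLen counts whitespace-separated words, a rough stand-in for a token count.
func wordLen(s string) int { return len(strings.Fields(s)) }

func main() {
	s := "h\u00e9llo w\u00f6rld"
	fmt.Println(byteLen(s), runeLen(s), wordLen(s)) // prints: 13 11 2
}
```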
func WithModelName ¶
WithModelName sets the model name for a text splitter.
func WithReferenceLinks ¶ added in v0.1.4
WithReferenceLinks sets whether reference links (i.e. `[text][label]`) should be patched with the url and title from their definition. Note that by default reference definitions are dropped from the output.
Caution: this also affects how other inline elements are rendered, e.g. all emphasis will use `*` even when another character (e.g. `_`) was used in the input.
func WithSecondSplitter ¶
func WithSecondSplitter(secondSplitter TextSplitter) Option
WithSecondSplitter sets the second splitter for a text splitter.
func WithSeparators ¶
WithSeparators sets the separators for a text splitter.
type Options ¶
type Options struct {
	ChunkSize            int
	ChunkOverlap         int
	Separators           []string
	KeepSeparator        bool
	LenFunc              func(string) int
	ModelName            string
	EncodingName         string
	AllowedSpecial       []string
	DisallowedSpecial    []string
	SecondSplitter       TextSplitter
	CodeBlocks           bool
	ReferenceLinks       bool
	KeepHeadingHierarchy bool // Persist hierarchy of markdown headers in each chunk
}
Options is a struct that contains options for a text splitter.
func DefaultOptions ¶
func DefaultOptions() Options
DefaultOptions returns the default options for all text splitters.
type RecursiveCharacter ¶
type RecursiveCharacter struct {
	Separators    []string
	ChunkSize     int
	ChunkOverlap  int
	LenFunc       func(string) int
	KeepSeparator bool
}
RecursiveCharacter is a text splitter that will split texts recursively by different characters.
func NewRecursiveCharacter ¶
func NewRecursiveCharacter(opts ...Option) RecursiveCharacter
NewRecursiveCharacter creates a new recursive character splitter with default values. By default, the separators used are "\n\n", "\n", " " and "". The chunk size is set to 4000 and chunk overlap is set to 200.
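The recursive strategy can be sketched with the default separator list above. This is a simplified illustration, not the package's algorithm: `splitRecursive` is a hypothetical function that measures length in runes, drops the separators, and neither merges small neighbouring pieces back together nor applies any overlap.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitRecursive splits text on the first separator and recurses with the
// remaining separators on every piece that is still longer than chunkSize.
func splitRecursive(text string, separators []string, chunkSize int) []string {
	if utf8.RuneCountInString(text) <= chunkSize || len(separators) == 0 {
		if text == "" {
			return nil
		}
		return []string{text}
	}
	var out []string
	// If the separator does not occur, Split returns the text unchanged and
	// the recursion simply moves on to the next, finer separator.
	for _, piece := range strings.Split(text, separators[0]) {
		out = append(out, splitRecursive(piece, separators[1:], chunkSize)...)
	}
	return out
}

func main() {
	seps := []string{"\n\n", "\n", " ", ""}
	fmt.Printf("%q\n", splitRecursive("aaa bbb\n\nccc ddd eee", seps, 7))
	// prints: ["aaa bbb" "ccc" "ddd" "eee"]
}
```

The first paragraph fits within seven runes and is kept whole, while the second is too long and falls through to the space separator. The empty-string separator at the end of the list acts as a last resort that splits a piece into individual characters.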
type TextSplitter ¶
TextSplitter is the standard interface for splitting texts.
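A custom splitting strategy might look like the sketch below. The interface definition is not shown in this section, so the single-method shape used here (`SplitText(text string) ([]string, error)`) is an assumption declared locally, and `lineSplitter` is a hypothetical implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// TextSplitter is declared locally; the method set is an assumption about
// the package's interface, not a copy of it.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// lineSplitter is a hypothetical strategy that yields one chunk per
// non-blank line, with surrounding whitespace trimmed.
type lineSplitter struct{}

func (lineSplitter) SplitText(text string) ([]string, error) {
	var chunks []string
	for _, line := range strings.Split(text, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			chunks = append(chunks, line)
		}
	}
	return chunks, nil
}

// Compile-time check that lineSplitter satisfies the assumed interface.
var _ TextSplitter = lineSplitter{}

func main() {
	chunks, err := lineSplitter{}.SplitText("first line\n\n  second line  \n")
	fmt.Printf("%q %v\n", chunks, err) // prints: ["first line" "second line"] <nil>
}
```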
type TokenSplitter ¶
type TokenSplitter struct {
	ChunkSize         int
	ChunkOverlap      int
	ModelName         string
	EncodingName      string
	AllowedSpecial    []string
	DisallowedSpecial []string
}
TokenSplitter is a text splitter that will split texts by tokens.
func NewTokenSplitter ¶
func NewTokenSplitter(opts ...Option) TokenSplitter