Documentation ¶
Overview ¶
Package textsplitter provides tools for splitting long texts into smaller chunks according to configurable rules and parameters. Smaller chunks are easier to process when working with language models or other text-processing tools.
The main components of this package are:
- TextSplitter interface: a common interface for splitting texts into smaller chunks.
- RecursiveCharacter: a text splitter that recursively splits texts by different characters (separators) combined with chunk size and overlap settings.
- Helper functions: utility functions for creating documents out of split texts and rejoining them if necessary.
Using the TextSplitter interface, developers can implement custom splitting strategies for their specific use cases and requirements.
Index ¶
- Constants
- Variables
- func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)
- func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)
- type RecursiveCharacter
- type SentenceSplitter
- func (s *SentenceSplitter) ChunkingTokenizerFn() func(string) []string
- func (s *SentenceSplitter) SplitByChar() func(string) []string
- func (s *SentenceSplitter) SplitByRegex(regex string) func(string) []string
- func (s *SentenceSplitter) SplitBySep(sep string, keepSep bool) func(string) []string
- func (s *SentenceSplitter) SplitText(text string) []string
- func (s *SentenceSplitter) SplitTextMetadataAware(text, metadataStr string) []string
- func (s *SentenceSplitter) TokenEncode(text string) []int
- type Split
- type TextSplitter
- type TokenSplitter
Constants ¶
const CHUNKING_REGEX = "[^,.;。]+[,.;。]?"
const DEFUALT_PARAGRAPH_SEP = "\n\n"
const TokenEncoding = "cl100k_base"
Variables ¶
var AllowedSpecial = []string{"all"}
var DisallowedSpecial = []string{"all"}
var ErrMismatchMetadatasAndText = errors.New("number of texts and metadatas does not match")
ErrMismatchMetadatasAndText is returned when the number of texts and metadatas given to CreateDocuments does not match. The function will not error if the length of the metadatas slice is zero.
Functions ¶
func CreateDocuments ¶
func CreateDocuments(textSplitter TextSplitter, texts []string, metadatas []map[string]any) ([]schema.Document, error)
CreateDocuments creates documents from texts and metadatas with a text splitter. If the length of metadatas is zero, the resulting documents contain no metadata. Otherwise the number of texts and the number of metadatas must match.
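The length rule described above can be sketched as follows. This is a minimal, self-contained illustration and not the package's implementation: Document is a hypothetical stand-in for schema.Document, the split argument stands in for a TextSplitter, and createDocuments and errMismatch are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Document is a hypothetical stand-in for schema.Document.
type Document struct {
	PageContent string
	Metadata    map[string]any
}

// errMismatch plays the role of ErrMismatchMetadatasAndText in this sketch.
var errMismatch = errors.New("number of texts and metadatas does not match")

// createDocuments applies the documented rule: an empty metadatas slice is
// accepted, otherwise len(metadatas) must equal len(texts).
// split is a hypothetical function standing in for a TextSplitter.
func createDocuments(split func(string) []string, texts []string, metadatas []map[string]any) ([]Document, error) {
	if len(metadatas) != 0 && len(metadatas) != len(texts) {
		return nil, errMismatch
	}
	var docs []Document
	for i, text := range texts {
		var md map[string]any // stays nil when no metadata was supplied
		if len(metadatas) != 0 {
			md = metadatas[i]
		}
		for _, chunk := range split(text) {
			docs = append(docs, Document{PageContent: chunk, Metadata: md})
		}
	}
	return docs, nil
}

func main() {
	split := func(s string) []string { return strings.Fields(s) }

	docs, err := createDocuments(split, []string{"one two", "three"}, nil)
	fmt.Println(len(docs), err)

	_, err = createDocuments(split, []string{"one", "two"}, []map[string]any{{"source": "a"}})
	fmt.Println(err)
}
```

Whether the real function shares or copies each metadata map across the chunks of one text is not stated on this page; the sketch shares it for brevity.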
func SplitDocuments ¶
func SplitDocuments(textSplitter TextSplitter, documents []schema.Document) ([]schema.Document, error)
SplitDocuments splits documents using a text splitter.
Types ¶
type RecursiveCharacter ¶
RecursiveCharacter is a text splitter that will split texts recursively by different characters.
func NewRecursiveCharacter ¶
func NewRecursiveCharacter() RecursiveCharacter
NewRecursiveCharacter creates a new recursive character splitter with default values. By default the separators used are "\n\n", "\n", " " and "". The chunk size is set to 4000 and chunk overlap is set to 200.
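The general idea of recursive splitting can be sketched with the default separators listed above. This is a simplified illustration under stated assumptions, not the package's code: splitRecursive is a hypothetical helper, length is measured in runes, and chunk overlap is left out entirely.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// splitRecursive splits text on the first separator, greedily merges pieces
// that fit into chunkSize, and recurses with the remaining separators on any
// piece that is still too long. Overlap is intentionally omitted.
func splitRecursive(text string, seps []string, chunkSize int) []string {
	if utf8.RuneCountInString(text) <= chunkSize || len(seps) == 0 {
		return []string{text}
	}
	sep, rest := seps[0], seps[1:]

	var chunks []string
	current := ""
	flush := func() {
		if current != "" {
			chunks = append(chunks, current)
			current = ""
		}
	}

	// strings.Split with an empty separator splits after each UTF-8 sequence.
	for _, piece := range strings.Split(text, sep) {
		if piece == "" {
			continue
		}
		if utf8.RuneCountInString(piece) > chunkSize {
			flush()
			chunks = append(chunks, splitRecursive(piece, rest, chunkSize)...)
			continue
		}
		candidate := piece
		if current != "" {
			candidate = current + sep + piece
		}
		if utf8.RuneCountInString(candidate) > chunkSize {
			flush()
			current = piece
		} else {
			current = candidate
		}
	}
	flush()
	return chunks
}

func main() {
	seps := []string{"\n\n", "\n", " ", ""}
	fmt.Printf("%q\n", splitRecursive("aaa bbb ccc\n\nddd eee", seps, 7))
	// prints ["aaa bbb" "ccc" "ddd eee"]
	fmt.Printf("%q\n", splitRecursive("abcdefghij", seps, 4))
	// prints ["abcd" "efgh" "ij"]
}
```

The final "" separator acts as a fallback: a word longer than the chunk size is cut into fixed-size runs of characters.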
type SentenceSplitter ¶
type SentenceSplitter struct {
	ChunkSize              int
	ChunkOverlap           int
	Separator              string
	ParagraphSeparator     string
	SecondaryChunkingRegex string
	Tokenizer              *tiktoken.Tiktoken
	// contains filtered or unexported fields
}
SentenceSplitter splits text into chunks with a preference for complete sentences.
func NewSentenceSplitter ¶
func NewSentenceSplitter(chunkSize, chunkOverlap int) *SentenceSplitter
NewSentenceSplitter creates a new SentenceSplitter instance.
func (*SentenceSplitter) ChunkingTokenizerFn ¶
func (s *SentenceSplitter) ChunkingTokenizerFn() func(string) []string
func (*SentenceSplitter) SplitByChar ¶
func (s *SentenceSplitter) SplitByChar() func(string) []string
SplitByChar splits text by character.
func (*SentenceSplitter) SplitByRegex ¶
func (s *SentenceSplitter) SplitByRegex(regex string) func(string) []string
SplitByRegex splits text by regex.
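To show what a regex-based split can look like, the sketch below applies the CHUNKING_REGEX pattern listed under Constants using only the standard regexp package. It assumes that splitting means collecting every match of the pattern; splitByRegex here is a hypothetical free function, not the method above, and the real method may behave differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// chunkingRegex repeats the value of CHUNKING_REGEX shown on this page.
const chunkingRegex = "[^,.;。]+[,.;。]?"

// splitByRegex returns a function that collects all non-overlapping matches
// of pattern. Assumption: matches, not the gaps between them, are the pieces.
func splitByRegex(pattern string) func(string) []string {
	re := regexp.MustCompile(pattern)
	return func(text string) []string {
		return re.FindAllString(text, -1)
	}
}

func main() {
	split := splitByRegex(chunkingRegex)
	fmt.Printf("%q\n", split("Hello, world. Foo;bar"))
	// prints ["Hello," " world." " Foo;" "bar"]
}
```

Each piece keeps its trailing punctuation mark, so concatenating the pieces reproduces the input for text like the example.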
func (*SentenceSplitter) SplitBySep ¶
func (s *SentenceSplitter) SplitBySep(sep string, keepSep bool) func(string) []string
SplitBySep splits text by separator.
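The effect of a keepSep flag can be illustrated with a small standalone function. This sketch is an assumption about the behavior, not taken from the package source: splitBySep is a hypothetical free function, and it attaches the separator to the front of each following piece, which is only one possible convention.

```go
package main

import (
	"fmt"
	"strings"
)

// splitBySep returns a function that splits text on sep and drops empty
// pieces. With keepSep set, every piece after the first is prefixed with sep
// (an assumed convention for this sketch).
func splitBySep(sep string, keepSep bool) func(string) []string {
	return func(text string) []string {
		var out []string
		for i, part := range strings.Split(text, sep) {
			if keepSep && i > 0 {
				part = sep + part
			}
			if part != "" {
				out = append(out, part)
			}
		}
		return out
	}
}

func main() {
	text := "first paragraph\n\nsecond paragraph"
	fmt.Printf("%q\n", splitBySep("\n\n", false)(text))
	fmt.Printf("%q\n", splitBySep("\n\n", true)(text))
}
```

Keeping the separator lets the pieces be concatenated back into the original text, which matters when chunks are later merged.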
func (*SentenceSplitter) SplitText ¶
func (s *SentenceSplitter) SplitText(text string) []string
SplitText splits text into chunks.
func (*SentenceSplitter) SplitTextMetadataAware ¶
func (s *SentenceSplitter) SplitTextMetadataAware(text, metadataStr string) []string
SplitTextMetadataAware splits text with metadata into chunks.
func (*SentenceSplitter) TokenEncode ¶
func (s *SentenceSplitter) TokenEncode(text string) []int
type TextSplitter ¶
TextSplitter is the standard interface for splitting texts.
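A custom splitting strategy might look like the sketch below. The interface body is not shown on this page, so the single-method shape declared here is an assumption made for illustration, and LineSplitter is a hypothetical type; check the package source for the actual method set before relying on it.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TextSplitter is a local, assumed shape of the interface: one method that
// turns a text into chunks. It is declared here only to keep the sketch
// self-contained.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// LineSplitter is a hypothetical custom strategy: one chunk per non-empty line.
type LineSplitter struct{}

// SplitText trims each line and skips blank ones. It reports an error for
// input that contains no text at all.
func (LineSplitter) SplitText(text string) ([]string, error) {
	if strings.TrimSpace(text) == "" {
		return nil, errors.New("empty text")
	}
	var chunks []string
	for _, line := range strings.Split(text, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			chunks = append(chunks, line)
		}
	}
	return chunks, nil
}

func main() {
	var splitter TextSplitter = LineSplitter{}
	chunks, err := splitter.SplitText("alpha\n\n  beta  \ngamma")
	fmt.Printf("%q %v\n", chunks, err)
}
```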
type TokenSplitter ¶
type TokenSplitter struct {
	ChunkSize         int
	ChunkOverlap      int
	ModelName         string
	EncodingName      string
	AllowedSpecial    []string
	DisallowedSpecial []string
}
TokenSplitter is a text splitter that will split texts by tokens.
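How ChunkSize and ChunkOverlap typically interact when splitting by tokens can be sketched as a sliding window over token IDs. This is a generic illustration, not the package's algorithm: windowTokens is a hypothetical helper, and plain integers stand in for the IDs a real tokenizer would produce.

```go
package main

import "fmt"

// windowTokens cuts tokens into windows of at most chunkSize entries, where
// each window starts chunkSize-overlap positions after the previous one.
// It returns nil for parameters that cannot make progress.
func windowTokens(tokens []int, chunkSize, overlap int) [][]int {
	if chunkSize <= 0 || overlap < 0 || overlap >= chunkSize {
		return nil
	}
	step := chunkSize - overlap
	var windows [][]int
	for start := 0; start < len(tokens); start += step {
		end := start + chunkSize
		if end > len(tokens) {
			end = len(tokens)
		}
		windows = append(windows, tokens[start:end])
		if end == len(tokens) {
			break // the last window already reaches the end of the input
		}
	}
	return windows
}

func main() {
	tokens := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	fmt.Println(windowTokens(tokens, 4, 1))
	// prints [[0 1 2 3] [3 4 5 6] [6 7 8 9]]
}
```

In a real token splitter each window would be decoded back into text, so consecutive chunks share the overlapping tokens at their boundary.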
func NewTokenSplitter ¶
func NewTokenSplitter() TokenSplitter