text

package
v0.27.3-beta Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Sep 12, 2024 License: MIT Imports: 11 Imported by: 0

README

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Text component https://github.com/instill-ai/instill-core"
---

The Text component is an operator component that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:

- [Chunk Text](#chunk-text)



## Release Stage

`Alpha`



## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/definition.json).





## Supported Tasks

### Chunk Text

Chunk text with different strategies


| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked |
| Strategy (required) | `strategy` | object | Chunking strategy |



| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text |
| Text Chunks | `text-chunks` | array[object] | Text chunks after splitting |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks |


### Chunking Strategy
There are three strategies available for chunking text in Text Component:
- 1. Token
- 2. Recursive
- 3. Markdown

#### Token
Language models have a token limit. You should not exceed the token limit. When you split your text into chunks it is therefore a good idea to count the number of tokens. There are many tokenizers. When you count tokens in your text you should use the same tokenizer as used in the language model.

| **Parameter**        | **Type**         | **Description**                                                                                                                                                                                              |
|----------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`         | integer          | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`      | integer          | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`         | string           | The name of the model used for tokenization                                                                                                                                                                  |
| `allowed-special`    | array of strings | A list of special tokens that are allowed within chunks                                                                                                                                                      |
| `disallowed-special` | array of strings | A list of special tokens that should not appear within chunks                                                                                                                                                |

#### Recursive
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

| **Parameter**      | **Type**         | **Description**                                                                                                                                                                                              |
|--------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`       | integer          | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`    | integer          | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`       | string           | The name of the model used for tokenization                                                                                                                                                                  |
| `separators`       | array of strings | A list of strings representing the separators used to split the text                                                                                                                                         |
| `keep-separator`   | boolean          | A flag indicating whether to keep the separator characters at the beginning or end of chunks                                                                                                                 |


#### Markdown
This text splitter is specially designed for Markdown format.

| **Parameter**      | **Type** | **Description**                                                                                                                                                                                              |
|--------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`       | integer  | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`    | integer  | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`       | string   | The name of the model used for tokenization                                                                                                                                                                  |
| `code-blocks`      | boolean  | A flag indicating whether code blocks should be treated as a single unit                                                                                                                                     |

### Text Chunks in Output
| **Parameter**    | **Type** | **Description**                                              |
|------------------|----------|--------------------------------------------------------------|
| `test`           | string   | The text chunk                                               |
| `start-position` | integer  | The starting position of the text chunk in the original text |
| `end-position`   | integer  | The ending position of the text chunk in the original text   |





Documentation

Index

Constants

View Source
const (
	ListStarters = "-*+"
)

Document Implementation

Variables

This section is empty.

Functions

func Init

func Init(bc base.Component) *component

Init initializes the operator

Types

type ChunkPositionCalculator

type ChunkPositionCalculator interface {
	// contains filtered or unexported methods
}

type ChunkTextInput

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}

type ChunkTextOutput

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

type Content

type Content struct {
	Type      string
	PlainText string
	Table     Table
	// All lists in the content with all levels in order
	Lists              []List
	BlockStartPosition int
	BlockEndPosition   int
}

type ContentChunk

type ContentChunk struct {
	Chunk                string
	ContentStartPosition int
	ContentEndPosition   int
}
type Header struct {
	Level int
	Text  string
	Size  int
}

type List

type List struct {
	// HeaderText is the text before the list starts
	HeaderText        string
	PreviousLevelList *List
	Text              string
	StartPosition     int
	EndPosition       int
	NextLevelLists    []List
	NextList          *List
	PreviousList      *List
	// contains filtered or unexported fields
}

List includes bullet points and numbered lists

type MarkdownDocument

type MarkdownDocument struct {
	Headers  []Header
	Contents []Content
}

type MarkdownTextSplitter

type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	RawText      string
}

func NewMarkdownTextSplitter

func NewMarkdownTextSplitter(chunkSize, chunkOverlap int, rawText string) *MarkdownTextSplitter

func (*MarkdownTextSplitter) SplitText

func (sp *MarkdownTextSplitter) SplitText() ([]ContentChunk, error)

func (*MarkdownTextSplitter) Validate

func (sp *MarkdownTextSplitter) Validate() error

type PositionCalculator

type PositionCalculator struct{}

type Setting

type Setting struct {
	ChunkMethod       string   `json:"chunk-method,omitempty"`
	ChunkSize         int      `json:"chunk-size,omitempty"`
	ChunkOverlap      int      `json:"chunk-overlap,omitempty"`
	ModelName         string   `json:"model-name,omitempty"`
	AllowedSpecial    []string `json:"allowed-special,omitempty"`
	DisallowedSpecial []string `json:"disallowed-special,omitempty"`
	Separators        []string `json:"separators,omitempty"`
	KeepSeparator     bool     `json:"keep-separator,omitempty"`
	CodeBlocks        bool     `json:"code-blocks,omitempty"`
}

func (*Setting) SetDefault

func (s *Setting) SetDefault()

type Strategy

type Strategy struct {
	Setting Setting `json:"setting"`
}

type Table

type Table struct {
	HeaderText     string
	TableSeparator string
	HeaderRow      string
	Rows           []string
}

type TextChunk

type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type TextSplitter

type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL