package text

v0.28.0-beta
Published: Sep 24, 2024 License: MIT Imports: 11 Imported by: 0

README

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Text component https://github.com/instill-ai/instill-core"
---

The Text component is an operator component that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:
- [Chunk Text](#chunk-text)

## Release Stage

`Alpha`

## Configuration

The component definition and tasks are defined in the [definition.json](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/definition.json) and [tasks.json](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/tasks.json) files respectively.



## Supported Tasks

### Chunk Text

Chunk text with different strategies.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked |
| [Strategy](#chunk-text-strategy) (required) | `strategy` | object | Chunking strategy |

<details>
<summary> Input Objects in Chunk Text</summary>

<h4 id="chunk-text-strategy">Strategy</h4>

Chunking strategy

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| [Setting](#chunk-text-setting) | `setting` | object | Chunk Setting  |
</details>

<details>
<summary>The <code>setting</code> Object </summary>

<h4 id="chunk-text-setting">Setting</h4>

`setting` must fulfill one of the following schemas:

<h5 id="chunk-text-token"><code>Token</code></h5>

Language models have a token limit that you should not exceed, so when you split your text into chunks it is a good idea to count tokens. Many tokenizers exist; when counting tokens in your text, use the same tokenizer as the target language model.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Allowed Special Tokens | `allowed-special` | array |  A list of special tokens that are allowed within chunks.  |
| Chunk Method | `chunk-method` | string |  Must be `"Token"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Disallowed Special Tokens | `disallowed-special` | array |  A list of special tokens that should not appear within chunks.  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
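
As a rough illustration of the advice above, the sketch below counts tokens with the `github.com/pkoukk/tiktoken-go` library. Whether this is the exact tokenizer behind the component is an assumption, and the model name is only an example.

```go
package main

import (
	"fmt"

	tiktoken "github.com/pkoukk/tiktoken-go"
)

func main() {
	// Assumption: tiktoken-go provides the tokenizer family used by
	// OpenAI models such as gpt-3.5-turbo (cl100k_base).
	enc, err := tiktoken.EncodingForModel("gpt-3.5-turbo")
	if err != nil {
		panic(err)
	}

	text := "Language models have a token limit, so count tokens before chunking."
	// Encode also takes allowed/disallowed special-token lists, mirroring
	// the allowed-special / disallowed-special fields above.
	tokens := enc.Encode(text, nil, nil)
	fmt.Printf("token count: %d\n", len(tokens))
}
```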

<h5 id="chunk-text-recursive"><code>Recursive</code></h5>

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those are generally the most semantically related pieces of text.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string |  Must be `"Recursive"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Keep Separator | `keep-separator` | boolean |  A flag indicating whether to keep the separator characters at the beginning or end of chunks  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
| Separators | `separators` | array |  A list of strings representing the separators used to split the text.  |
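
As a minimal sketch of this strategy (not the component's actual implementation), the snippet below tries the separators in order and re-splits any piece that is still too large. It measures size in characters rather than tokens and omits the chunk-overlap and merge steps.

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveSplit is a simplified, character-based illustration: split on the
// first separator, then re-split any piece that is still larger than chunkSize
// using the remaining separators. Overlap handling and merging are omitted.
func recursiveSplit(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize || len(separators) == 0 {
		return []string{text}
	}
	var chunks []string
	for _, piece := range strings.Split(text, separators[0]) {
		if len(piece) <= chunkSize {
			chunks = append(chunks, piece)
			continue
		}
		chunks = append(chunks, recursiveSplit(piece, separators[1:], chunkSize)...)
	}
	return chunks
}

func main() {
	text := "First paragraph.\n\nSecond paragraph that is quite a bit longer than the first."
	for _, c := range recursiveSplit(text, []string{"\n\n", "\n", " ", ""}, 40) {
		fmt.Printf("%q\n", c)
	}
}
```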

<h5 id="chunk-text-markdown"><code>Markdown</code></h5>

This text splitter is specially designed for Markdown format.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string |  Must be `"Markdown"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Code Blocks | `code-blocks` | boolean |  A flag indicating whether code blocks should be treated as a single unit  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
</details>

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text |
| [Text Chunks](#chunk-text-text-chunks) | `text-chunks` | array[object] | Text chunks after splitting |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks |

<details>
<summary> Output Objects in Chunk Text</summary>

<h4 id="chunk-text-text-chunks">Text Chunks</h4>

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| End Position | `end-position` | integer | The ending position of the chunk in the original text |
| Start Position | `start-position` | integer | The starting position of the chunk in the original text |
| Text | `text` | string | Text chunk after splitting |
| Token Count | `token-count` | integer | Count of tokens in a chunk |
</details>

Documentation

Index

Constants

const (
	ListStarters = "-*+"
)

Document Implementation

Variables

This section is empty.

Functions

func Init

func Init(bc base.Component) *component

Init initializes the operator

Types

type ChunkPositionCalculator

type ChunkPositionCalculator interface {
	// contains filtered or unexported methods
}

type ChunkTextInput

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}
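
A minimal sketch of decoding a Chunk Text payload into this struct. The JSON keys follow the struct tags above; the Strategy and Setting types are the ones documented below, redeclared here with only a few fields to keep the example self-contained, and the field values are illustrative.

package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the Setting and Strategy types documented below,
// included only to keep this example self-contained.
type Setting struct {
	ChunkMethod  string `json:"chunk-method,omitempty"`
	ChunkSize    int    `json:"chunk-size,omitempty"`
	ChunkOverlap int    `json:"chunk-overlap,omitempty"`
	ModelName    string `json:"model-name,omitempty"`
}

type Strategy struct {
	Setting Setting `json:"setting"`
}

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}

func main() {
	// Illustrative payload following the Chunk Text input schema above.
	payload := []byte(`{
		"text": "Some long document text to be chunked.",
		"strategy": {
			"setting": {
				"chunk-method": "Token",
				"chunk-size": 512,
				"chunk-overlap": 64,
				"model-name": "gpt-3.5-turbo"
			}
		}
	}`)

	var in ChunkTextInput
	if err := json.Unmarshal(payload, &in); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", in)
}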

type ChunkTextOutput

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

type Content

type Content struct {
	Type      string
	PlainText string
	Table     Table
	// All lists in the content with all levels in order
	Lists              []List
	BlockStartPosition int
	BlockEndPosition   int
}

type ContentChunk

type ContentChunk struct {
	Chunk                string
	ContentStartPosition int
	ContentEndPosition   int
}

type Header

type Header struct {
	Level int
	Text  string
	Size  int
}

type List

type List struct {
	// HeaderText is the text before the list starts
	HeaderText        string
	PreviousLevelList *List
	Text              string
	StartPosition     int
	EndPosition       int
	NextLevelLists    []List
	NextList          *List
	PreviousList      *List
	// contains filtered or unexported fields
}

List includes bullet points and numbered lists

type MarkdownDocument

type MarkdownDocument struct {
	Headers  []Header
	Contents []Content
}

type MarkdownTextSplitter

type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	RawText      string
}

func NewMarkdownTextSplitter

func NewMarkdownTextSplitter(chunkSize, chunkOverlap int, rawText string) *MarkdownTextSplitter

func (*MarkdownTextSplitter) SplitText

func (sp *MarkdownTextSplitter) SplitText() ([]ContentChunk, error)

func (*MarkdownTextSplitter) Validate

func (sp *MarkdownTextSplitter) Validate() error
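
A usage sketch based only on the signatures above. The import path is assumed from the repository links in the README, calling Validate before SplitText is an assumption, and the Markdown content and sizes are illustrative.

package main

import (
	"fmt"

	// Import path assumed from the repository links in the README above.
	text "github.com/instill-ai/component/operator/text/v0"
)

func main() {
	md := "# Title\n\nFirst paragraph.\n\n- item one\n- item two\n"

	// Chunk size and overlap are expressed in tokens in the Chunk Text task.
	sp := text.NewMarkdownTextSplitter(512, 64, md)
	if err := sp.Validate(); err != nil {
		panic(err)
	}

	chunks, err := sp.SplitText()
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("[%d:%d] %q\n", c.ContentStartPosition, c.ContentEndPosition, c.Chunk)
	}
}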

type PositionCalculator

type PositionCalculator struct{}

type Setting

type Setting struct {
	ChunkMethod       string   `json:"chunk-method,omitempty"`
	ChunkSize         int      `json:"chunk-size,omitempty"`
	ChunkOverlap      int      `json:"chunk-overlap,omitempty"`
	ModelName         string   `json:"model-name,omitempty"`
	AllowedSpecial    []string `json:"allowed-special,omitempty"`
	DisallowedSpecial []string `json:"disallowed-special,omitempty"`
	Separators        []string `json:"separators,omitempty"`
	KeepSeparator     bool     `json:"keep-separator,omitempty"`
	CodeBlocks        bool     `json:"code-blocks,omitempty"`
}

func (*Setting) SetDefault

func (s *Setting) SetDefault()

type Strategy

type Strategy struct {
	Setting Setting `json:"setting"`
}

type Table

type Table struct {
	HeaderText     string
	TableSeparator string
	HeaderRow      string
	Rows           []string
}

type TextChunk

type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type TextSplitter

type TextSplitter interface {
	SplitText(text string) ([]string, error)
}
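
To show the shape of the contract, here is a hypothetical splitter that satisfies the interface by splitting on blank lines; it is not part of this package.

package main

import (
	"fmt"
	"strings"
)

// TextSplitter mirrors the interface above.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// paragraphSplitter is a hypothetical implementation that cuts on blank lines.
type paragraphSplitter struct{}

func (paragraphSplitter) SplitText(text string) ([]string, error) {
	return strings.Split(text, "\n\n"), nil
}

func main() {
	var s TextSplitter = paragraphSplitter{}
	chunks, _ := s.SplitText("First paragraph.\n\nSecond paragraph.")
	fmt.Println(len(chunks), chunks)
}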
