package text

v0.28.0-beta
Published: Sep 24, 2024 License: MIT Imports: 11 Imported by: 0

README

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Text component https://github.com/instill-ai/instill-core"
---

The Text component is an operator component that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:
- [Chunk Text](#chunk-text)

## Release Stage

`Alpha`

## Configuration

The component definition and tasks are defined in the [definition.json](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/definition.json) and [tasks.json](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/tasks.json) files respectively.



## Supported Tasks

### Chunk Text

Chunk text with different strategies.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked |
| [Strategy](#chunk-text-strategy) (required) | `strategy` | object | Chunking strategy |

<details>
<summary> Input Objects in Chunk Text</summary>

<h4 id="chunk-text-strategy">Strategy</h4>

Chunking strategy

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| [Setting](#chunk-text-setting) | `setting` | object | Chunk Setting  |
</details>

<details>
<summary>The <code>setting</code> Object </summary>

<h4 id="chunk-text-setting">Setting</h4>

`setting` must fulfill one of the following schemas:

<h5 id="chunk-text-token"><code>Token</code></h5>

Language models have a token limit that you should not exceed, so when you split your text into chunks it is a good idea to count tokens. Many tokenizers exist; when counting tokens in your text, use the same tokenizer as the target language model.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Allowed Special Tokens | `allowed-special` | array |  A list of special tokens that are allowed within chunks.  |
| Chunk Method | `chunk-method` | string |  Must be `"Token"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Disallowed Special Tokens | `disallowed-special` | array |  A list of special tokens that should not appear within chunks.  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
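
As a rough illustration of the advice above, the sketch below counts tokens with the `github.com/pkoukk/tiktoken-go` library. Whether this is the exact tokenizer behind the component is an assumption, and the model name is only an example.

```go
package main

import (
	"fmt"

	tiktoken "github.com/pkoukk/tiktoken-go"
)

func main() {
	// Assumption: tiktoken-go provides the tokenizer family used by
	// OpenAI models such as gpt-3.5-turbo (cl100k_base).
	enc, err := tiktoken.EncodingForModel("gpt-3.5-turbo")
	if err != nil {
		panic(err)
	}

	text := "Language models have a token limit, so count tokens before chunking."
	// Encode also takes allowed/disallowed special-token lists, mirroring
	// the allowed-special / disallowed-special fields above.
	tokens := enc.Encode(text, nil, nil)
	fmt.Printf("token count: %d\n", len(tokens))
}
```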

<h5 id="chunk-text-recursive"><code>Recursive</code></h5>

This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those are generally the most semantically related pieces of text.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string |  Must be `"Recursive"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Keep Separator | `keep-separator` | boolean |  A flag indicating whether to keep the separator characters at the beginning or end of chunks  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
| Separators | `separators` | array |  A list of strings representing the separators used to split the text.  |
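
As a minimal sketch of this strategy (not the component's actual implementation), the snippet below tries the separators in order and re-splits any piece that is still too large. It measures size in characters rather than tokens and omits the chunk-overlap and merge steps.

```go
package main

import (
	"fmt"
	"strings"
)

// recursiveSplit is a simplified, character-based illustration: split on the
// first separator, then re-split any piece that is still larger than chunkSize
// using the remaining separators. Overlap handling and merging are omitted.
func recursiveSplit(text string, separators []string, chunkSize int) []string {
	if len(text) <= chunkSize || len(separators) == 0 {
		return []string{text}
	}
	var chunks []string
	for _, piece := range strings.Split(text, separators[0]) {
		if len(piece) <= chunkSize {
			chunks = append(chunks, piece)
			continue
		}
		chunks = append(chunks, recursiveSplit(piece, separators[1:], chunkSize)...)
	}
	return chunks
}

func main() {
	text := "First paragraph.\n\nSecond paragraph that is quite a bit longer than the first."
	for _, c := range recursiveSplit(text, []string{"\n\n", "\n", " ", ""}, 40) {
		fmt.Printf("%q\n", c)
	}
}
```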

<h5 id="chunk-text-markdown"><code>Markdown</code></h5>

This text splitter is specially designed for Markdown format.

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| Chunk Method | `chunk-method` | string |  Must be `"Markdown"`   |
| Chunk Overlap | `chunk-overlap` | integer |  Determines the number of tokens that overlap between consecutive chunks  |
| Chunk Size | `chunk-size` | integer |  Specifies the maximum size of each chunk in terms of the number of tokens  |
| Code Blocks | `code-blocks` | boolean |  A flag indicating whether code blocks should be treated as a single unit  |
| Model | `model-name` | string |  The name of the model used for tokenization.  <br/><details><summary><strong>Enum values</strong></summary><ul><li>`gpt-4`</li><li>`gpt-3.5-turbo`</li><li>`text-davinci-003`</li><li>`text-davinci-002`</li><li>`text-davinci-001`</li><li>`text-curie-001`</li><li>`text-babbage-001`</li><li>`text-ada-001`</li><li>`davinci`</li><li>`curie`</li><li>`babbage`</li><li>`ada`</li><li>`code-davinci-002`</li><li>`code-davinci-001`</li><li>`code-cushman-002`</li><li>`code-cushman-001`</li><li>`davinci-codex`</li><li>`cushman-codex`</li><li>`text-davinci-edit-001`</li><li>`code-davinci-edit-001`</li><li>`text-embedding-ada-002`</li><li>`text-similarity-davinci-001`</li><li>`text-similarity-curie-001`</li><li>`text-similarity-babbage-001`</li><li>`text-similarity-ada-001`</li><li>`text-search-davinci-doc-001`</li><li>`text-search-curie-doc-001`</li><li>`text-search-babbage-doc-001`</li><li>`text-search-ada-doc-001`</li><li>`code-search-babbage-code-001`</li><li>`code-search-ada-code-001`</li><li>`gpt2`</li></ul></details>  |
</details>

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text |
| [Text Chunks](#chunk-text-text-chunks) | `text-chunks` | array[object] | Text chunks after splitting |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks |

<details>
<summary> Output Objects in Chunk Text</summary>

<h4 id="chunk-text-text-chunks">Text Chunks</h4>

| Field | Field ID | Type | Note |
| :--- | :--- | :--- | :--- |
| End Position | `end-position` | integer | The ending position of the chunk in the original text |
| Start Position | `start-position` | integer | The starting position of the chunk in the original text |
| Text | `text` | string | Text chunk after splitting |
| Token Count | `token-count` | integer | Count of tokens in a chunk |
</details>

Documentation

Index

Constants

const (
	ListStarters = "-*+"
)

Document Implementation

Variables

This section is empty.

Functions

func Init

func Init(bc base.Component) *component

Init initializes the operator

Types

type ChunkPositionCalculator

type ChunkPositionCalculator interface {
	// contains filtered or unexported methods
}

type ChunkTextInput

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}
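
A minimal sketch of decoding a Chunk Text payload into this struct. The JSON keys follow the struct tags above; the Strategy and Setting types are the ones documented below, redeclared here with only a few fields to keep the example self-contained, and the field values are illustrative.

package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the Setting and Strategy types documented below,
// included only to keep this example self-contained.
type Setting struct {
	ChunkMethod  string `json:"chunk-method,omitempty"`
	ChunkSize    int    `json:"chunk-size,omitempty"`
	ChunkOverlap int    `json:"chunk-overlap,omitempty"`
	ModelName    string `json:"model-name,omitempty"`
}

type Strategy struct {
	Setting Setting `json:"setting"`
}

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}

func main() {
	// Illustrative payload following the Chunk Text input schema above.
	payload := []byte(`{
		"text": "Some long document text to be chunked.",
		"strategy": {
			"setting": {
				"chunk-method": "Token",
				"chunk-size": 512,
				"chunk-overlap": 64,
				"model-name": "gpt-3.5-turbo"
			}
		}
	}`)

	var in ChunkTextInput
	if err := json.Unmarshal(payload, &in); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", in)
}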

type ChunkTextOutput

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

type Content

type Content struct {
	Type      string
	PlainText string
	Table     Table
	// All lists in the content with all levels in order
	Lists              []List
	BlockStartPosition int
	BlockEndPosition   int
}

type ContentChunk

type ContentChunk struct {
	Chunk                string
	ContentStartPosition int
	ContentEndPosition   int
}

type Header

type Header struct {
	Level int
	Text  string
	Size  int
}

type List

type List struct {
	// HeaderText is the text before the list starts
	HeaderText        string
	PreviousLevelList *List
	Text              string
	StartPosition     int
	EndPosition       int
	NextLevelLists    []List
	NextList          *List
	PreviousList      *List
	// contains filtered or unexported fields
}

List includes bullet points and numbered lists

type MarkdownDocument

type MarkdownDocument struct {
	Headers  []Header
	Contents []Content
}

type MarkdownTextSplitter

type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	RawText      string
}

func NewMarkdownTextSplitter

func NewMarkdownTextSplitter(chunkSize, chunkOverlap int, rawText string) *MarkdownTextSplitter

func (*MarkdownTextSplitter) SplitText

func (sp *MarkdownTextSplitter) SplitText() ([]ContentChunk, error)

func (*MarkdownTextSplitter) Validate

func (sp *MarkdownTextSplitter) Validate() error
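
A usage sketch based only on the signatures above. The import path is assumed from the repository links in the README, calling Validate before SplitText is an assumption, and the Markdown content and sizes are illustrative.

package main

import (
	"fmt"

	// Import path assumed from the repository links in the README above.
	text "github.com/instill-ai/component/operator/text/v0"
)

func main() {
	md := "# Title\n\nFirst paragraph.\n\n- item one\n- item two\n"

	// Chunk size and overlap are expressed in tokens in the Chunk Text task.
	sp := text.NewMarkdownTextSplitter(512, 64, md)
	if err := sp.Validate(); err != nil {
		panic(err)
	}

	chunks, err := sp.SplitText()
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("[%d:%d] %q\n", c.ContentStartPosition, c.ContentEndPosition, c.Chunk)
	}
}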

type PositionCalculator

type PositionCalculator struct{}

type Setting

type Setting struct {
	ChunkMethod       string   `json:"chunk-method,omitempty"`
	ChunkSize         int      `json:"chunk-size,omitempty"`
	ChunkOverlap      int      `json:"chunk-overlap,omitempty"`
	ModelName         string   `json:"model-name,omitempty"`
	AllowedSpecial    []string `json:"allowed-special,omitempty"`
	DisallowedSpecial []string `json:"disallowed-special,omitempty"`
	Separators        []string `json:"separators,omitempty"`
	KeepSeparator     bool     `json:"keep-separator,omitempty"`
	CodeBlocks        bool     `json:"code-blocks,omitempty"`
}

func (*Setting) SetDefault

func (s *Setting) SetDefault()

type Strategy

type Strategy struct {
	Setting Setting `json:"setting"`
}

type Table

type Table struct {
	HeaderText     string
	TableSeparator string
	HeaderRow      string
	Rows           []string
}

type TextChunk

type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type TextSplitter

type TextSplitter interface {
	SplitText(text string) ([]string, error)
}
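
To show the shape of the contract, here is a hypothetical splitter that satisfies the interface by splitting on blank lines; it is not part of this package.

package main

import (
	"fmt"
	"strings"
)

// TextSplitter mirrors the interface above.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// paragraphSplitter is a hypothetical implementation that cuts on blank lines.
type paragraphSplitter struct{}

func (paragraphSplitter) SplitText(text string) ([]string, error) {
	return strings.Split(text, "\n\n"), nil
}

func main() {
	var s TextSplitter = paragraphSplitter{}
	chunks, _ := s.SplitText("First paragraph.\n\nSecond paragraph.")
	fmt.Println(len(chunks), chunks)
}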
