text

package

Version: v0.27.3-beta (latest) | Published: Sep 12, 2024 | License: MIT | Imports: 11 | Imported by: 0


README ¶

---
title: "Text"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Text component (https://github.com/instill-ai/instill-core)"
---

The Text component is an operator component that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:

- [Chunk Text](#chunk-text)



## Release Stage

`Alpha`



## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/definition.json).





## Supported Tasks

### Chunk Text

Splits text into chunks using one of several strategies.


| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked |
| Strategy (required) | `strategy` | object | Chunking strategy |



| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text |
| Text Chunks | `text-chunks` | array[object] | Text chunks after splitting |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks |

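As an illustration of the input shape, the following minimal sketch builds a JSON payload from local copies of the `ChunkTextInput`, `Strategy`, and `Setting` structs listed in the package documentation. The `buildInput` helper and the `"Token"` value for `chunk-method` are assumptions made for this example, and the `task` field from the table above is not part of these structs, so it is omitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Setting mirrors a subset of the package's Setting struct.
type Setting struct {
	ChunkMethod  string `json:"chunk-method,omitempty"`
	ChunkSize    int    `json:"chunk-size,omitempty"`
	ChunkOverlap int    `json:"chunk-overlap,omitempty"`
	ModelName    string `json:"model-name,omitempty"`
}

// Strategy and ChunkTextInput mirror the package's structs of the same name.
type Strategy struct {
	Setting Setting `json:"setting"`
}

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}

// buildInput is a hypothetical helper that encodes a chunking request as JSON.
func buildInput(text, method string, size, overlap int) (string, error) {
	in := ChunkTextInput{
		Text: text,
		Strategy: Strategy{Setting: Setting{
			ChunkMethod:  method,
			ChunkSize:    size,
			ChunkOverlap: overlap,
		}},
	}
	b, err := json.Marshal(in)
	return string(b), err
}

func main() {
	// "Token" as a chunk-method value is an assumption, not confirmed by the docs.
	s, err := buildInput("Hello world", "Token", 512, 50)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Because of the `omitempty` tags, settings left at their zero value (here `model-name`) do not appear in the encoded payload.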

### Chunking Strategy
The Text component offers three strategies for chunking text:
- Token
- Recursive
- Markdown

#### Token
Language models have a token limit that input must not exceed. When splitting text into chunks, it is therefore a good idea to count tokens. Many tokenizers exist, so count tokens with the same tokenizer that the target language model uses.

| **Parameter**        | **Type**         | **Description**                                                                                                                                                                                              |
|----------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`         | integer          | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`      | integer          | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`         | string           | The name of the model used for tokenization                                                                                                                                                                  |
| `allowed-special`    | array of strings | A list of special tokens that are allowed within chunks                                                                                                                                                      |
| `disallowed-special` | array of strings | A list of special tokens that should not appear within chunks                                                                                                                                                |

#### Recursive
This is the recommended splitter for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.

| **Parameter**      | **Type**         | **Description**                                                                                                                                                                                              |
|--------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`       | integer          | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`    | integer          | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`       | string           | The name of the model used for tokenization                                                                                                                                                                  |
| `separators`       | array of strings | A list of strings representing the separators used to split the text                                                                                                                                         |
| `keep-separator`   | boolean          | A flag indicating whether to keep the separator characters at the beginning or end of chunks                                                                                                                 |


#### Markdown
This text splitter is designed specifically for Markdown-formatted text.

| **Parameter**      | **Type** | **Description**                                                                                                                                                                                              |
|--------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size`       | integer  | Specifies the maximum size of each chunk in terms of the number of tokens                                                                                                                                    |
| `chunk-overlap`    | integer  | Determines the number of tokens that overlap between consecutive chunks                                                                                                                                      |
| `model-name`       | string   | The name of the model used for tokenization                                                                                                                                                                  |
| `code-blocks`      | boolean  | A flag indicating whether code blocks should be treated as a single unit                                                                                                                                     |

### Text Chunks in Output
| **Parameter**    | **Type** | **Description**                                              |
|------------------|----------|--------------------------------------------------------------|
| `text`           | string   | The text chunk                                               |
| `start-position` | integer  | The starting position of the text chunk in the original text |
| `end-position`   | integer  | The ending position of the text chunk in the original text   |
| `token-count`    | integer  | The number of tokens in the text chunk                       |

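For reading the task output, the following sketch decodes a JSON payload into local copies of the `ChunkTextOutput` and `TextChunk` structs from the package documentation. The `decodeOutput` helper is hypothetical, and the payload values are invented purely to show the field names; they are not real component output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TextChunk and ChunkTextOutput mirror the package's structs of the same name.
type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

// decodeOutput is a hypothetical helper that parses a chunking result.
func decodeOutput(data []byte) (ChunkTextOutput, error) {
	var out ChunkTextOutput
	err := json.Unmarshal(data, &out)
	return out, err
}

func main() {
	// Illustrative payload only; the numbers are made up for this example.
	raw := []byte(`{
		"chunk-num": 2,
		"token-count": 4,
		"chunks-token-count": 4,
		"text-chunks": [
			{"text": "first chunk", "start-position": 0, "end-position": 10, "token-count": 2},
			{"text": "second chunk", "start-position": 12, "end-position": 23, "token-count": 2}
		]
	}`)
	out, err := decodeOutput(raw)
	if err != nil {
		panic(err)
	}
	for _, c := range out.TextChunks {
		fmt.Printf("%q [%d, %d] tokens=%d\n", c.Text, c.StartPosition, c.EndPosition, c.TokenCount)
	}
}
```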
Documentation ¶

Constants ¶

const (
	ListStarters = "-*+"
)

Document Implementation

Functions ¶

func Init ¶

func Init(bc base.Component) *component

Init initializes the operator

Types ¶

type ChunkPositionCalculator ¶

type ChunkPositionCalculator interface {
	// contains filtered or unexported methods
}

type ChunkTextInput ¶

type ChunkTextInput struct {
	Text     string   `json:"text"`
	Strategy Strategy `json:"strategy"`
}

type ChunkTextOutput ¶

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

type Content ¶

type Content struct {
	Type      string
	PlainText string
	Table     Table
	// All lists in the content with all levels in order
	Lists              []List
	BlockStartPosition int
	BlockEndPosition   int
}

type ContentChunk ¶

type ContentChunk struct {
	Chunk                string
	ContentStartPosition int
	ContentEndPosition   int
}

type Header ¶

type Header struct {
	Level int
	Text  string
	Size  int
}

type List ¶

type List struct {
	// HeaderText is the text before the list starts
	HeaderText        string
	PreviousLevelList *List
	Text              string
	StartPosition     int
	EndPosition       int
	NextLevelLists    []List
	NextList          *List
	PreviousList      *List
	// contains filtered or unexported fields
}

List includes bullet points and numbered lists

type MarkdownDocument ¶

type MarkdownDocument struct {
	Headers  []Header
	Contents []Content
}

type MarkdownTextSplitter ¶

type MarkdownTextSplitter struct {
	ChunkSize    int
	ChunkOverlap int
	RawText      string
}

func NewMarkdownTextSplitter ¶

func NewMarkdownTextSplitter(chunkSize, chunkOverlap int, rawText string) *MarkdownTextSplitter

func (*MarkdownTextSplitter) SplitText ¶

func (sp *MarkdownTextSplitter) SplitText() ([]ContentChunk, error)

func (*MarkdownTextSplitter) Validate ¶

func (sp *MarkdownTextSplitter) Validate() error

type PositionCalculator ¶

type PositionCalculator struct{}

type Setting ¶

type Setting struct {
	ChunkMethod       string   `json:"chunk-method,omitempty"`
	ChunkSize         int      `json:"chunk-size,omitempty"`
	ChunkOverlap      int      `json:"chunk-overlap,omitempty"`
	ModelName         string   `json:"model-name,omitempty"`
	AllowedSpecial    []string `json:"allowed-special,omitempty"`
	DisallowedSpecial []string `json:"disallowed-special,omitempty"`
	Separators        []string `json:"separators,omitempty"`
	KeepSeparator     bool     `json:"keep-separator,omitempty"`
	CodeBlocks        bool     `json:"code-blocks,omitempty"`
}

func (*Setting) SetDefault ¶

func (s *Setting) SetDefault()

type Strategy ¶

type Strategy struct {
	Setting Setting `json:"setting"`
}

type Table ¶

type Table struct {
	HeaderText     string
	TableSeparator string
	HeaderRow      string
	Rows           []string
}

type TextChunk ¶

type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type TextSplitter ¶

type TextSplitter interface {
	SplitText(text string) ([]string, error)
}
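
Any type with a matching `SplitText` method satisfies the `TextSplitter` interface. The sketch below redeclares the interface locally and implements it with a hypothetical `paragraphSplitter` that splits on blank lines; it is an illustration of the interface contract, not one of the package's own splitters.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// TextSplitter is a local copy of the package's interface.
type TextSplitter interface {
	SplitText(text string) ([]string, error)
}

// paragraphSplitter is a hypothetical implementation that splits on blank lines.
type paragraphSplitter struct{}

// SplitText returns the non-empty paragraphs of text, trimmed of surrounding whitespace.
func (paragraphSplitter) SplitText(text string) ([]string, error) {
	if strings.TrimSpace(text) == "" {
		return nil, errors.New("text is empty")
	}
	var chunks []string
	for _, p := range strings.Split(text, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			chunks = append(chunks, p)
		}
	}
	return chunks, nil
}

func main() {
	var sp TextSplitter = paragraphSplitter{}
	chunks, err := sp.SplitText("first paragraph\n\nsecond paragraph")
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Println(c)
	}
}
```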
