---
title: "Text"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Text component https://github.com/instill-ai/instill-core"
---
The Text component is an operator component that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:
- [Chunk Text](#chunk-text)
## Release Stage
`Alpha`
## Configuration
The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/operator/text/v0/config/definition.json).
## Supported Tasks
### Chunk Text
Split text into chunks using one of several strategies.
| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CHUNK_TEXT` |
| Text (required) | `text` | string | Text to be chunked |
| Strategy (required) | `strategy` | object | Chunking strategy |
| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token-count` | integer | Total count of tokens in the original input text |
| Text Chunks | `text-chunks` | array[object] | Text chunks after splitting |
| Number of Text Chunks | `chunk-num` | integer | Total number of output text chunks |
| Token Count Chunks | `chunks-token-count` | integer | Total count of tokens in the output text chunks |
### Chunking Strategy
The Text component offers three strategies for chunking text:
- Token
- Recursive
- Markdown
#### Token
Language models have a token limit that the input must not exceed, so it is a good idea to count tokens when splitting text into chunks. Many tokenizers exist; count tokens with the same tokenizer that the target language model uses.
| **Parameter** | **Type** | **Description** |
|----------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `model-name` | string | The name of the model used for tokenization |
| `allowed-special` | array of strings | A list of special tokens that are allowed within chunks |
| `disallowed-special` | array of strings | A list of special tokens that should not appear within chunks |
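
The interplay between `chunk-size` and `chunk-overlap` can be pictured as a sliding window over the token sequence. The following is a minimal sketch, not the component's implementation: it assumes whitespace-separated words stand in for model tokens (the real component tokenizes according to `model-name`), and `chunkTokens` is a hypothetical helper introduced only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkTokens slides a window of chunkSize tokens over tokens, advancing by
// chunkSize-chunkOverlap each step so that consecutive chunks share
// chunkOverlap tokens. Hypothetical helper, not part of the Text component.
func chunkTokens(tokens []string, chunkSize, chunkOverlap int) [][]string {
	if chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize {
		return nil // invalid settings: the window would never advance
	}
	step := chunkSize - chunkOverlap
	var chunks [][]string
	for start := 0; start < len(tokens); start += step {
		end := start + chunkSize
		if end > len(tokens) {
			end = len(tokens)
		}
		chunks = append(chunks, tokens[start:end])
		if end == len(tokens) {
			break // the last window already reached the end of the input
		}
	}
	return chunks
}

func main() {
	// Assumption: whitespace-separated words stand in for model tokens.
	tokens := strings.Fields("one two three four five six seven")
	for i, c := range chunkTokens(tokens, 4, 1) {
		fmt.Printf("chunk %d: %s\n", i, strings.Join(c, " "))
	}
}
```

With a chunk size of 4 and an overlap of 1, the seven words above yield two chunks, and the word `four` appears in both.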
#### Recursive
This text splitter is recommended for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.
| **Parameter** | **Type** | **Description** |
|--------------------|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `model-name` | string | The name of the model used for tokenization |
| `separators` | array of strings | A list of strings representing the separators used to split the text |
| `keep-separator` | boolean | A flag indicating whether to keep the separator characters at the beginning or end of chunks |
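
The "split on separators in order" idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the component's code: chunk size is measured in characters rather than tokens, separators are dropped between chunks (as if `keep-separator` were false), overlap is ignored, and `splitRecursive` and `hardCut` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// hardCut slices text into pieces of at most chunkSize runes.
func hardCut(text string, chunkSize int) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); i += chunkSize {
		end := i + chunkSize
		if end > len(runes) {
			end = len(runes)
		}
		out = append(out, string(runes[i:end]))
	}
	return out
}

// splitRecursive splits text on the first separator, merges neighbouring
// pieces while they still fit into chunkSize, and recurses with the remaining
// separators on any piece that is still too large. An empty separator (or an
// exhausted list) falls back to a hard cut. Sizes are counted in runes.
func splitRecursive(text string, separators []string, chunkSize int) []string {
	if chunkSize <= 0 || text == "" {
		return nil
	}
	if utf8.RuneCountInString(text) <= chunkSize {
		return []string{text}
	}
	if len(separators) == 0 || separators[0] == "" {
		return hardCut(text, chunkSize)
	}
	sep, rest := separators[0], separators[1:]
	var chunks []string
	current := ""
	flush := func() {
		if current != "" {
			chunks = append(chunks, current)
			current = ""
		}
	}
	for _, part := range strings.Split(text, sep) {
		switch {
		case part == "":
			continue
		case utf8.RuneCountInString(part) > chunkSize:
			flush()
			chunks = append(chunks, splitRecursive(part, rest, chunkSize)...)
		case current == "":
			current = part
		case utf8.RuneCountInString(current+sep+part) <= chunkSize:
			current += sep + part
		default:
			flush()
			current = part
		}
	}
	flush()
	return chunks
}

func main() {
	text := "alpha beta\n\ngamma delta epsilon"
	for _, c := range splitRecursive(text, []string{"\n\n", "\n", " ", ""}, 12) {
		fmt.Printf("%q\n", c)
	}
}
```

In this example the first paragraph fits within 12 characters and stays whole, while the second paragraph is too long and is split further on spaces.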
#### Markdown
This text splitter is designed specifically for Markdown-formatted text.
| **Parameter** | **Type** | **Description** |
|--------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `chunk-size` | integer | Specifies the maximum size of each chunk in terms of the number of tokens |
| `chunk-overlap` | integer | Determines the number of tokens that overlap between consecutive chunks |
| `model-name` | string | The name of the model used for tokenization |
| `code-blocks` | boolean | A flag indicating whether code blocks should be treated as a single unit |
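
To illustrate what treating code blocks "as a single unit" could mean, here is a rough sketch that splits Markdown at blank lines but optionally keeps fenced code blocks intact. It is an assumption-laden illustration, not the component's algorithm: `splitMarkdownBlocks` is a hypothetical helper, and chunk size, overlap, and tokenization are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the Markdown code fence marker, built indirectly so that this
// example can itself sit inside a fenced block.
var fence = strings.Repeat("`", 3)

// splitMarkdownBlocks splits md into blocks at blank lines. When
// keepCodeBlocks is true, blank lines inside a fenced code block do not end
// the block, so each code block comes out as a single unit.
func splitMarkdownBlocks(md string, keepCodeBlocks bool) []string {
	var blocks, current []string
	inFence := false
	flush := func() {
		if len(current) > 0 {
			blocks = append(blocks, strings.Join(current, "\n"))
			current = nil
		}
	}
	for _, line := range strings.Split(md, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, fence) {
			inFence = !inFence // entering or leaving a fenced code block
		}
		if trimmed == "" && !(keepCodeBlocks && inFence) {
			flush()
			continue
		}
		current = append(current, line)
	}
	flush()
	return blocks
}

func main() {
	md := "# Title\n\n" + fence + "go\na := 1\n\nb := 2\n" + fence + "\n\nTail"
	fmt.Println(len(splitMarkdownBlocks(md, true)), "blocks with code blocks kept whole")
	fmt.Println(len(splitMarkdownBlocks(md, false)), "blocks otherwise")
}
```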
### Text Chunks in Output
| **Parameter** | **Type** | **Description** |
|------------------|----------|--------------------------------------------------------------|
| `text`           | string   | The text chunk                                               |
| `start-position` | integer | The starting position of the text chunk in the original text |
| `end-position` | integer | The ending position of the text chunk in the original text |
| `token-count`    | integer  | The number of tokens in the text chunk                       |

    type ChunkTextOutput struct {
        ChunkNum         int         `json:"chunk-num"`
        TextChunks       []TextChunk `json:"text-chunks"`
        TokenCount       int         `json:"token-count"`
        ChunksTokenCount int         `json:"chunks-token-count"`
    }

    type Content struct {
        Type      string
        PlainText string
        Table     Table
        // All lists in the content with all levels in order
        Lists              []List
        BlockStartPosition int
        BlockEndPosition   int
    }

    type List struct {
        // HeaderText is the text before the list starts
        HeaderText        string
        PreviousLevelList *List
        Text              string
        StartPosition     int
        EndPosition       int
        NextLevelLists    []List
        NextList          *List
        PreviousList      *List
        // contains filtered or unexported fields
    }

    type TextChunk struct {
        Text          string `json:"text"`
        StartPosition int    `json:"start-position"`
        EndPosition   int    `json:"end-position"`
        TokenCount    int    `json:"token-count"`
    }
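
As a usage sketch, the JSON output of `TASK_CHUNK_TEXT` can be decoded into the `ChunkTextOutput` and `TextChunk` structs shown above with the standard `encoding/json` package. The sample payload and the `decodeChunkOutput` helper are made up for illustration; real values depend on the input text, the strategy, and the tokenizer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TextChunk and ChunkTextOutput mirror the struct definitions in this page.
type TextChunk struct {
	Text          string `json:"text"`
	StartPosition int    `json:"start-position"`
	EndPosition   int    `json:"end-position"`
	TokenCount    int    `json:"token-count"`
}

type ChunkTextOutput struct {
	ChunkNum         int         `json:"chunk-num"`
	TextChunks       []TextChunk `json:"text-chunks"`
	TokenCount       int         `json:"token-count"`
	ChunksTokenCount int         `json:"chunks-token-count"`
}

// sampleOutput is an invented payload; positions and counts are illustrative.
const sampleOutput = `{
  "chunk-num": 2,
  "text-chunks": [
    {"text": "Hello world", "start-position": 0, "end-position": 10, "token-count": 2},
    {"text": "world again", "start-position": 6, "end-position": 16, "token-count": 2}
  ],
  "token-count": 3,
  "chunks-token-count": 4
}`

// decodeChunkOutput is a hypothetical helper that parses the task output.
func decodeChunkOutput(data []byte) (ChunkTextOutput, error) {
	var out ChunkTextOutput
	err := json.Unmarshal(data, &out)
	return out, err
}

func main() {
	out, err := decodeChunkOutput([]byte(sampleOutput))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for i, c := range out.TextChunks {
		fmt.Printf("chunk %d [%d, %d]: %q (%d tokens)\n",
			i, c.StartPosition, c.EndPosition, c.Text, c.TokenCount)
	}
}
```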