html-text-chunker
Description
When HTML is used as rich text, an API you want to call may accept only a limited number of characters per request.
This project therefore tries to split the text into fragments (chunks) that are small enough to be used with such APIs.
This project tries to fulfill the following requirements:
- every chunk contains less than CHUNK_SIZE characters
- all chunks can be modified and reassembled into valid HTML
- the linguistic context of sentences should not be destroyed if possible
- should work with HTML strings
- should also work with non-HTML strings
- (gracefully ignore broken HTML)
Future considerations:
- extract alt attributes separately if requested
Commands
Run benchmarks:
go test -bench=. -benchtime=10s ./chunk/
Run tests:
go test -v ./chunk/
Integrate this project into your project:
go get -t github.com/Staffbase/html-text-chunker
package main

import (
	"fmt"
	"strings"

	richText "github.com/Staffbase/html-text-chunker/chunk"
)

const CHUNK_SIZE = 1337

func foo(text string) {
	chunker := richText.NewChunkedRichText(text, CHUNK_SIZE, false)
	chunker.MakeChunks()

	// do some meaningful stuff
	for idx, part := range chunker.TextParts {
		// each part.Text has a maximum length of CHUNK_SIZE
		// you can modify part.Text and replace the DOM node
		fmt.Printf("Processing text part %d\n", idx)
		part.Text = bar(part.Text)
	}

	newText := chunker.Finish()
	fmt.Print(newText)
}

func bar(input string) string {
	if len(input) > CHUNK_SIZE {
		panic("The string is too long!")
	}
	return strings.ToUpper(input)
}
After this run, newText will contain the given HTML markup with the parts modified by bar().
Update the dependency:
go get -u=patch github.com/Staffbase/html-text-chunker