parser

package
v0.3.4
Published: Jan 6, 2025 License: Apache-2.0 Imports: 5 Imported by: 8

Documentation

Constants

const (
	MetaKeySource = "_source"
)

Variables

This section is empty.

Functions

func GetImplSpecificOptions

func GetImplSpecificOptions[T any](base *T, opts ...Option) *T

GetImplSpecificOptions lets a Parser author extract their own custom options from the unified Option type. T is the type of the implementation-specific options struct. Call this function inside the Parser implementation's Parse method. It is recommended to pass a base T as the first argument, in which the Parser author can set default values for the implementation-specific options.
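The round trip between wrapping and extracting typed options can be sketched with local stand-ins. This is a minimal sketch, not the package's source: the field layout of Option is an assumption (the real fields are unexported), and csvOptions, WithDelimiter and resolve are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Option is a local stand-in for the package's unified Option type.
// The real type keeps its fields unexported; this layout is an assumption.
type Option struct {
	implSpecificOptFn any
}

// WrapImplSpecificOptFn stores a typed option function inside Option.
func WrapImplSpecificOptFn[T any](optFn func(*T)) Option {
	return Option{implSpecificOptFn: optFn}
}

// GetImplSpecificOptions applies every option whose function targets *T
// to base and skips options meant for other implementations.
func GetImplSpecificOptions[T any](base *T, opts ...Option) *T {
	if base == nil {
		base = new(T)
	}
	for _, opt := range opts {
		if fn, ok := opt.implSpecificOptFn.(func(*T)); ok {
			fn(base)
		}
	}
	return base
}

// csvOptions is a hypothetical options struct for a CSV parser.
type csvOptions struct {
	delimiter rune
	header    bool
}

// WithDelimiter is a hypothetical option function for csvOptions.
func WithDelimiter(d rune) Option {
	return WrapImplSpecificOptFn(func(o *csvOptions) { o.delimiter = d })
}

// resolve shows what a Parse method might do first: start from defaults,
// then let the caller's options override them.
func resolve(opts ...Option) *csvOptions {
	return GetImplSpecificOptions(&csvOptions{delimiter: ',', header: true}, opts...)
}

func main() {
	o := resolve(WithDelimiter(';'))
	fmt.Printf("%q %v\n", o.delimiter, o.header) // ';' true
}
```

Because the extraction step type-asserts on func(*T), options that belong to a different implementation are skipped, which is what allows one Option slice to carry options for several parsers.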

Types

type ExtParser

type ExtParser struct {
	// contains filtered or unexported fields
}

ExtParser is a parser that uses the file extension to determine which parser to use. You can register your own parsers per extension through ExtParserConfig.Parsers. The default fallback parser is TextParser. Note:

When parsing, the matching parser is looked up with filepath.Ext(uri), so callers need to:
 	① pass the URI with parser.WithURI on every Parse call
 	② make sure filepath.Ext can extract the expected extension from that URI

eg:

pdf, err := os.Open("./testdata/test.pdf") // check err before parsing
// extParser is an *ExtParser returned by NewExtParser
docs, err := extParser.Parse(ctx, pdf, parser.WithURI("./testdata/test.pdf"))
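The extension lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than ExtParser's actual code: pickParser is a hypothetical helper, and plain strings stand in for Parser values. It shows why the URI has to yield the expected extension through filepath.Ext.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pickParser returns the parser name registered for the extension of uri,
// or fallback when no parser is registered for that extension.
// Names stand in for Parser values in this sketch.
func pickParser(parsers map[string]string, fallback, uri string) string {
	if name, ok := parsers[filepath.Ext(uri)]; ok {
		return name
	}
	return fallback
}

func main() {
	parsers := map[string]string{".pdf": "pdf", ".md": "markdown"}
	uris := []string{
		"./testdata/test.pdf",
		"notes.md",
		"https://example.com/report.pdf?download=1", // Ext is ".pdf?download=1"
		"README", // Ext is ""
	}
	for _, uri := range uris {
		fmt.Printf("%-45s -> %s\n", uri, pickParser(parsers, "text", uri))
	}
}
```

A URI that carries a query string or has no suffix does not match a registered extension in this sketch, so it ends up at the fallback; passing a clean path to WithURI avoids that.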

func NewExtParser

func NewExtParser(ctx context.Context, conf *ExtParserConfig) (*ExtParser, error)

NewExtParser creates a new ExtParser.

func (*ExtParser) GetParsers

func (p *ExtParser) GetParsers() map[string]Parser

GetParsers returns a copy of the registered parser map. It is safe to modify the returned map.
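Returning a copy is what makes the result safe to modify. A minimal sketch of that idea, assuming a shallow copy of the map; registry and snapshot are hypothetical names, and strings stand in for Parser values:

```go
package main

import "fmt"

// registry is a hypothetical holder of extension-to-parser mappings.
type registry struct {
	parsers map[string]string
}

// snapshot returns a shallow copy, so callers can edit the result freely
// without touching the registry's own map.
func (r *registry) snapshot() map[string]string {
	out := make(map[string]string, len(r.parsers))
	for ext, p := range r.parsers {
		out[ext] = p
	}
	return out
}

func main() {
	r := &registry{parsers: map[string]string{".md": "markdown"}}
	m := r.snapshot()
	m[".pdf"] = "pdf"
	delete(m, ".md")
	fmt.Println(len(r.parsers), len(m)) // 1 1
}
```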

func (*ExtParser) Parse

func (p *ExtParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)

Parse parses the given reader and returns a list of documents.

type ExtParserConfig

type ExtParserConfig struct {
	// ext -> parser.
	// eg: map[string]Parser{
	// 	".pdf": &PDFParser{},
	// 	".md": &MarkdownParser{},
	// }
	Parsers map[string]Parser

	// Fallback parser to use when no other parser is found.
	// Default is TextParser if not set.
	FallbackParser Parser
}

ExtParserConfig defines the configuration for the ExtParser.

type Option

type Option struct {
	// contains filtered or unexported fields
}

Option defines a call option for the Parser component and is part of the component interface signature. Each Parser implementation can define its own options struct and option functions within its own package, then wrap those implementation-specific option functions into this type before passing them to Parse.

func WithExtraMeta

func WithExtraMeta(meta map[string]any) Option

WithExtraMeta specifies extra metadata to merge into each parsed document.

func WithURI

func WithURI(uri string) Option

WithURI specifies the URI of the document. ExtParser uses it to select the parser.

func WrapImplSpecificOptFn

func WrapImplSpecificOptFn[T any](optFn func(*T)) Option

WrapImplSpecificOptFn wraps an implementation-specific option function into the Option type. T is the type of the implementation-specific options struct. Parser implementations are required to use this function to convert their own option functions into the unified Option type. For example, if the Parser implementation defines its own options struct:

type customOptions struct {
	conf string
}

then it needs to provide an option function such as:

func WithConf(conf string) Option {
	return WrapImplSpecificOptFn(func(o *customOptions) {
		o.conf = conf
	})
}

type Options

type Options struct {
	// uri of source.
	URI string

	// extra metadata that is merged into each document.
	ExtraMeta map[string]any
}

func GetCommonOptions

func GetCommonOptions(base *Options, opts ...Option) *Options

GetCommonOptions extracts the common parser Options from the Option list. An optional base Options can supply default values.
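How a base with defaults and a list of options combine can be sketched with local stand-ins. The Option layout below (a single apply function) is an assumption, since the real Option keeps its fields unexported; only the names Options, WithURI, WithExtraMeta and GetCommonOptions come from the documentation.

```go
package main

import "fmt"

// Options mirrors the documented common options.
type Options struct {
	URI       string
	ExtraMeta map[string]any
}

// Option is a local stand-in; the real Option keeps its fields unexported.
type Option struct {
	apply func(*Options)
}

// WithURI sets the document URI.
func WithURI(uri string) Option {
	return Option{apply: func(o *Options) { o.URI = uri }}
}

// WithExtraMeta sets the extra metadata.
func WithExtraMeta(meta map[string]any) Option {
	return Option{apply: func(o *Options) { o.ExtraMeta = meta }}
}

// GetCommonOptions applies opts in order on top of base (or a zero Options).
func GetCommonOptions(base *Options, opts ...Option) *Options {
	if base == nil {
		base = &Options{}
	}
	for _, opt := range opts {
		if opt.apply != nil {
			opt.apply(base)
		}
	}
	return base
}

func main() {
	defaults := &Options{URI: "unknown"}
	o := GetCommonOptions(defaults, WithExtraMeta(map[string]any{"lang": "en"}))
	fmt.Println(o.URI, o.ExtraMeta["lang"]) // unknown en

	o = GetCommonOptions(nil, WithURI("a.md"), WithURI("b.md"))
	fmt.Println(o.URI) // b.md
}
```

In this sketch later options win over earlier ones and over the base, which is the usual behavior of the functional-options pattern.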

type Parser

type Parser interface {
	Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)
}

Parser is a document parser. It parses the content of a reader into a list of documents.
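A custom implementation only needs the single Parse method. The sketch below keeps everything local so it runs on its own: Document stands in for schema.Document, the interface omits the variadic options for brevity, and LineParser and parseLines are hypothetical names.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"io"
	"strings"
)

// Document is a local stand-in for schema.Document with only the field used here.
type Document struct {
	Content string
}

// Parser mirrors the documented interface, minus the variadic options.
type Parser interface {
	Parse(ctx context.Context, reader io.Reader) ([]*Document, error)
}

// LineParser is a hypothetical parser that emits one document per non-empty line.
type LineParser struct{}

// Parse scans the reader line by line and stops early if ctx is cancelled.
func (LineParser) Parse(ctx context.Context, reader io.Reader) ([]*Document, error) {
	var docs []*Document
	sc := bufio.NewScanner(reader)
	for sc.Scan() {
		if err := ctx.Err(); err != nil {
			return nil, err
		}
		if line := strings.TrimSpace(sc.Text()); line != "" {
			docs = append(docs, &Document{Content: line})
		}
	}
	return docs, sc.Err()
}

// parseLines is a small helper for trying the parser on a string.
func parseLines(s string) ([]*Document, error) {
	var p Parser = LineParser{}
	return p.Parse(context.Background(), strings.NewReader(s))
}

func main() {
	docs, err := parseLines("alpha\n\nbeta\n")
	if err != nil {
		panic(err)
	}
	for _, d := range docs {
		fmt.Println(d.Content)
	}
}
```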

type TextParser

type TextParser struct{}

TextParser is a simple parser that reads the text from a reader and returns a single document. eg:

docs, err := TextParser{}.Parse(ctx, strings.NewReader("hello world"))
fmt.Println(docs[0].Content) // hello world

func (TextParser) Parse

func (dp TextParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)

Parse reads the text from a reader and returns a single document.
