documentloaders

package
v0.1.101 Latest Latest
Published: Sep 12, 2024 License: MIT Imports: 17 Imported by: 2

Documentation

Overview

Package documentloaders includes a standard interface for loading documents from a source and implementations of this interface.

Constants

This section is empty.

Variables

var ErrMissingAudioSource = errors.New("assemblyai: missing audio source")

ErrMissingAudioSource is returned when neither an audio URL nor a reader has been set using WithAudioURL or WithAudioReader.
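Because ErrMissingAudioSource is an exported sentinel error, callers can presumably detect it with errors.Is. The sketch below uses only the standard library: errMissingAudioSource is a local stand-in for the package variable, and load is a hypothetical function that fails in the documented way. It is not the loader's own code.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingAudioSource is a local stand-in for documentloaders.ErrMissingAudioSource.
var errMissingAudioSource = errors.New("assemblyai: missing audio source")

// load is a hypothetical function that fails when no audio source is
// configured. It wraps the sentinel to show that errors.Is still matches.
func load(hasSource bool) error {
	if !hasSource {
		return fmt.Errorf("load transcript: %w", errMissingAudioSource)
	}
	return nil
}

// isMissingSource reports whether err is, or wraps, the sentinel error.
func isMissingSource(err error) bool {
	return errors.Is(err, errMissingAudioSource)
}

func main() {
	fmt.Println(isMissingSource(load(false))) // true
	fmt.Println(isMissingSource(load(true)))  // false
}
```

Comparing with errors.Is rather than == keeps the check working if the error is wrapped on its way up the call stack.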

Functions

This section is empty.

Types

type AssemblyAIAudioTranscriptLoader added in v0.1.13

type AssemblyAIAudioTranscriptLoader struct {
	// contains filtered or unexported fields
}

AssemblyAIAudioTranscriptLoader transcribes an audio file using AssemblyAI and loads the transcript.

Audio files can be specified using either a URL or a reader.

For a list of the supported audio and video formats, see the FAQ.

func NewAssemblyAIAudioTranscript added in v0.1.13

func NewAssemblyAIAudioTranscript(apiKey string, opts ...AssemblyAIOption) *AssemblyAIAudioTranscriptLoader

NewAssemblyAIAudioTranscript returns a new AssemblyAIAudioTranscriptLoader instance.

func (*AssemblyAIAudioTranscriptLoader) Load added in v0.1.13

Load loads an audio file, transcribes it using AssemblyAI, and returns the transcript as a document.

func (*AssemblyAIAudioTranscriptLoader) LoadAndSplit added in v0.1.13

LoadAndSplit transcribes the audio data and splits it into multiple documents using a text splitter.

type AssemblyAIOption added in v0.1.13

type AssemblyAIOption func(loader *AssemblyAIAudioTranscriptLoader)

AssemblyAIOption is an option for the AssemblyAI loader.
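AssemblyAIOption has the shape of the functional options pattern: each option is a function that mutates the loader, and the constructor applies them in order. The following is a minimal sketch of that pattern with local stand-in types. The loader struct, its fields, and the default format "text" are assumptions for illustration, since the real loader's fields are unexported.

```go
package main

import "fmt"

// loader is a hypothetical stand-in for the real loader and its unexported fields.
type loader struct {
	apiKey   string
	audioURL string
	format   string
}

// option mirrors the shape of AssemblyAIOption: a function that mutates the loader.
type option func(l *loader)

// withAudioURL returns an option that sets the audio URL.
func withAudioURL(url string) option {
	return func(l *loader) { l.audioURL = url }
}

// withFormat returns an option that sets the transcript format.
func withFormat(format string) option {
	return func(l *loader) { l.format = format }
}

// newLoader sets defaults first, then applies the options in order,
// so a later option overrides an earlier one.
func newLoader(apiKey string, opts ...option) *loader {
	l := &loader{apiKey: apiKey, format: "text"} // assumed default
	for _, opt := range opts {
		opt(l)
	}
	return l
}

// String summarizes the configuration without printing the API key.
func (l *loader) String() string {
	return fmt.Sprintf("url=%q format=%q", l.audioURL, l.format)
}

func main() {
	l := newLoader("my-key",
		withAudioURL("https://example.com/audio.mp3"),
		withFormat("sentences"),
	)
	fmt.Println(l)
}
```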

func WithAudioReader added in v0.1.13

func WithAudioReader(r io.Reader) AssemblyAIOption

WithAudioReader configures the loader to transcribe audio read from an io.Reader, such as a local audio file.

func WithAudioURL added in v0.1.13

func WithAudioURL(url string) AssemblyAIOption

WithAudioURL configures the loader to transcribe an audio file from a URL. The URL needs to be accessible from AssemblyAI's servers.

func WithTranscriptFormat added in v0.1.13

func WithTranscriptFormat(format TranscriptFormat) AssemblyAIOption

WithTranscriptFormat configures the format of the document page content.

func WithTranscriptParams added in v0.1.13

func WithTranscriptParams(params *assemblyai.TranscriptOptionalParams) AssemblyAIOption

WithTranscriptParams configures the optional parameters for the transcription.

type CSV

type CSV struct {
	// contains filtered or unexported fields
}

CSV represents a CSV document loader.

func NewCSV

func NewCSV(r io.Reader, columns ...string) CSV

NewCSV creates a new csv loader with an io.Reader and optional column names for filtering.

func (CSV) Load

func (c CSV) Load(_ context.Context) ([]schema.Document, error)

Load reads from the io.Reader and returns a single document with the data.

func (CSV) LoadAndSplit

func (c CSV) LoadAndSplit(ctx context.Context, splitter textsplitter.TextSplitter) ([]schema.Document, error)

LoadAndSplit reads text data from the io.Reader and splits it into multiple documents using a text splitter.

type HTML

type HTML struct {
	// contains filtered or unexported fields
}

HTML loads, parses, and sanitizes HTML content from an io.Reader.

func NewHTML

func NewHTML(r io.Reader) HTML

NewHTML creates a new html loader with an io.Reader.

func (HTML) Load

func (h HTML) Load(_ context.Context) ([]schema.Document, error)

Load reads from the io.Reader and returns a single document with the data.

func (HTML) LoadAndSplit

func (h HTML) LoadAndSplit(ctx context.Context, splitter textsplitter.TextSplitter) ([]schema.Document, error)

LoadAndSplit reads text data from the io.Reader and splits it into multiple documents using a text splitter.

type Loader

type Loader interface {
	// Load loads from a source and returns documents.
	Load(ctx context.Context) ([]schema.Document, error)
	// LoadAndSplit loads from a source and splits the documents using a text splitter.
	LoadAndSplit(ctx context.Context, splitter textsplitter.TextSplitter) ([]schema.Document, error)
}

Loader is the interface for loading and splitting documents from a source.

type NotionDirectoryLoader added in v0.1.8

type NotionDirectoryLoader struct {
	// contains filtered or unexported fields
}

NotionDirectoryLoader is a document loader that reads content from pages within a Notion Database.

func NewNotionDirectory added in v0.1.8

func NewNotionDirectory(filePath string, encoding ...string) *NotionDirectoryLoader

NewNotionDirectory creates a new NotionDirectoryLoader with the given file path and encoding.

func (*NotionDirectoryLoader) Load added in v0.1.8

func (n *NotionDirectoryLoader) Load() ([]schema.Document, error)

Load retrieves data from a Notion directory and returns a list of schema.Document objects.

type PDF

type PDF struct {
	// contains filtered or unexported fields
}

PDF loads text data from an io.ReaderAt.

func NewPDF

func NewPDF(r io.ReaderAt, size int64, opts ...PDFOptions) PDF

NewPDF creates a new PDF loader with an io.ReaderAt and the size of the PDF data in bytes.
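Unlike the other constructors, NewPDF takes an io.ReaderAt plus a size instead of an io.Reader. The sketch below shows two standard-library ways to obtain such a pair, from an in-memory byte slice and from a file on disk. The helper names readerAtFromBytes and readerAtFromFile are hypothetical, and the sketch stops short of calling NewPDF itself.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// readerAtFromBytes returns an io.ReaderAt over in-memory data and its size in bytes.
func readerAtFromBytes(data []byte) (io.ReaderAt, int64) {
	r := bytes.NewReader(data) // *bytes.Reader implements io.ReaderAt
	return r, r.Size()
}

// readerAtFromFile opens path and returns the file, which implements
// io.ReaderAt, together with its size. The caller must close the file.
func readerAtFromFile(path string) (*os.File, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, 0, fmt.Errorf("open: %w", err)
	}
	info, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, 0, fmt.Errorf("stat: %w", err)
	}
	return f, info.Size(), nil
}

func main() {
	r, size := readerAtFromBytes([]byte("%PDF-1.7\n"))
	header := make([]byte, 4)
	if _, err := r.ReadAt(header, 0); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(header), size) // %PDF 9
}
```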

func (PDF) Load

func (p PDF) Load(_ context.Context) ([]schema.Document, error)

Load reads the PDF data from the io.ReaderAt and returns the documents, each with metadata recording its page number and the total number of pages in the PDF.

func (PDF) LoadAndSplit

func (p PDF) LoadAndSplit(ctx context.Context, splitter textsplitter.TextSplitter) ([]schema.Document, error)

LoadAndSplit reads PDF data from the io.ReaderAt and splits it into multiple documents using a text splitter.

type PDFOptions

type PDFOptions func(pdf *PDF)

PDFOptions are options for the PDF loader.

func WithPassword

func WithPassword(password string) PDFOptions

WithPassword sets the password for the PDF.

type Text

type Text struct {
	// contains filtered or unexported fields
}

Text loads text data from an io.Reader.

func NewText

func NewText(r io.Reader) Text

NewText creates a new text loader with an io.Reader.

func (Text) Load

func (l Text) Load(_ context.Context) ([]schema.Document, error)

Load reads from the io.Reader and returns a single document with the data.

func (Text) LoadAndSplit

func (l Text) LoadAndSplit(ctx context.Context, splitter textsplitter.TextSplitter) ([]schema.Document, error)

LoadAndSplit reads text data from the io.Reader and splits it into multiple documents using a text splitter.

type TranscriptFormat added in v0.1.13

type TranscriptFormat int

TranscriptFormat represents the format of the document page content.

const (
	// Single document with full transcript text.
	TranscriptFormatText TranscriptFormat = iota

	// Multiple documents with each sentence as page content.
	TranscriptFormatSentences

	// Multiple documents with each paragraph as page content.
	TranscriptFormatParagraphs

	// Single document with SRT formatted subtitles as page content.
	TranscriptFormatSubtitlesSRT

	// Single document with VTT formatted subtitles as page content.
	TranscriptFormatSubtitlesVTT
)
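The constants above form an iota enumeration, so their values run from 0 to 4 in declaration order. The sketch below mirrors that declaration with a local stand-in type and adds two hypothetical helpers: a String method, and a singleDocument method that encodes the single versus multiple document distinction stated in the comments. Neither helper is part of the package.

```go
package main

import "fmt"

// transcriptFormat is a local stand-in for TranscriptFormat.
type transcriptFormat int

const (
	formatText transcriptFormat = iota // 0
	formatSentences                    // 1
	formatParagraphs                   // 2
	formatSubtitlesSRT                 // 3
	formatSubtitlesVTT                 // 4
)

// String returns a short name for the format (hypothetical helper).
func (f transcriptFormat) String() string {
	names := [...]string{"text", "sentences", "paragraphs", "srt", "vtt"}
	if f < 0 || int(f) >= len(names) {
		return fmt.Sprintf("transcriptFormat(%d)", int(f))
	}
	return names[f]
}

// singleDocument reports whether the format is documented as producing a
// single document rather than one document per sentence or paragraph.
func (f transcriptFormat) singleDocument() bool {
	switch f {
	case formatText, formatSubtitlesSRT, formatSubtitlesVTT:
		return true
	default:
		return false
	}
}

func main() {
	for f := formatText; f <= formatSubtitlesVTT; f++ {
		fmt.Printf("%d %s single=%t\n", int(f), f, f.singleDocument())
	}
}
```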
