webscraper

package
v1.1.5
Published: Feb 17, 2025 | License: MIT | Imports: 12 | Imported by: 0

Documentation

Overview

Package webscraper provides the Webpage Scraper Tool, a utility within the Atomic Agents ecosystem that scrapes web content and converts it to markdown. It also extracts page metadata and cleans up the content for better readability.

Constants

const (
	DefaultUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
	DefaultAccept    = "text/html,application/xhtml+xml,application/xml;"
)
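
As a rough illustration of how these defaults might be used, the sketch below applies both constants as headers on an outgoing request. The helper name newScrapeRequest is hypothetical and not part of this package; how the package itself sets headers is not shown in this documentation.

```go
package main

import (
	"fmt"
	"net/http"
)

// Values copied from the constants documented above.
const (
	DefaultUserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
	DefaultAccept    = "text/html,application/xhtml+xml,application/xml;"
)

// newScrapeRequest is a hypothetical helper (not part of the package).
// It builds a GET request that carries the default headers.
func newScrapeRequest(rawURL string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", DefaultUserAgent)
	req.Header.Set("Accept", DefaultAccept)
	return req, nil
}

func main() {
	req, err := newScrapeRequest("https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("Accept:", req.Header.Get("Accept"))
}
```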

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	tools.Config
	// contains filtered or unexported fields
}

type Input

type Input struct {
	// URL of the webpage to scrape.
	URL string `json:"url,omitempty" jsonschema:"title=url,description=URL of the webpage to scrape." validate:"required,url"`
	// IncludeLinks reports whether hyperlinks are preserved in the markdown output.
	IncludeLinks bool `` /* 130-byte string literal not displayed */
}

Input is the input schema for the WebpageScraperTool.

func NewInput

func NewInput(link string, includeLinks bool) *Input
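
The URL field carries a `validate:"required,url"` tag. The sketch below approximates that rule with the standard library only, to show what kind of input is likely to be rejected. The Input type here is a local mirror with struct tags omitted, the body of NewInput is an assumption, and checkURL is a hypothetical helper. The validation library the package actually uses may apply different rules.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// Input mirrors the documented fields; struct tags are omitted in this sketch.
type Input struct {
	URL          string
	IncludeLinks bool
}

// NewInput matches the documented signature; the body is an assumption.
func NewInput(link string, includeLinks bool) *Input {
	return &Input{URL: link, IncludeLinks: includeLinks}
}

// checkURL is a hypothetical, stdlib-only approximation of "required,url".
func checkURL(in *Input) error {
	if in == nil || in.URL == "" {
		return errors.New("url is required")
	}
	u, err := url.ParseRequestURI(in.URL)
	if err != nil {
		return fmt.Errorf("invalid url: %w", err)
	}
	// Reject relative paths, which ParseRequestURI accepts.
	if u.Scheme == "" || u.Host == "" {
		return errors.New("url must include a scheme and a host")
	}
	return nil
}

func main() {
	fmt.Println(checkURL(NewInput("https://example.com/post", true)))
	fmt.Println(checkURL(NewInput("", false)))
}
```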

type Metadata

type Metadata struct {
	// Title is the title of the webpage.
	Title string `json:"url,omitempty" jsonschema:"title=title,description=The title of the webpage."`
	// Author is the author of the webpage content.
	Author string `json:"author,omitempty" jsonschema:"title=author,description=The Author of the webpage."`
	// Description is the meta description of the webpage.
	Description string `json:"description,omitempty" jsonschema:"title=description,description=The meta description of the webpage."`
	// Keywords is the meta keywords of the webpage.
	Keywords string `json:"keywords,omitempty" jsonschema:"title=keywords,description=The meta keywords of the webpage."`
	// SiteName is the name of the website.
	SiteName string `json:"sitename,omitempty" jsonschema:"title=sitename,description=The name of the website."`
	// Domain is the domain name of the website.
	Domain string `json:"domain,omitempty" jsonschema:"title=domain,description=The domain name of the website."`
}

Metadata is the schema for webpage metadata.

type Option

type Option func(*Config)

func WithHttpClient

func WithHttpClient(clt *http.Client) Option

func WithMaxContentLength

func WithMaxContentLength(l int64) Option

func WithTimeout

func WithTimeout(timeout int) Option

func WithUserAgent

func WithUserAgent(ua string) Option
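
The With* functions follow the functional options pattern: each returns an Option that mutates a Config. The sketch below shows how such options are typically defined and applied. The unexported Config fields, the default values, the newConfig helper, and the interpretation of the WithTimeout argument as seconds are all assumptions, since the real fields and units are not shown in this documentation.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Config stands in for the package's Config. These unexported fields are
// assumptions; the real ones are hidden in the documentation.
type Config struct {
	client           *http.Client
	userAgent        string
	timeout          time.Duration
	maxContentLength int64
}

// Option matches the documented type.
type Option func(*Config)

func WithHttpClient(clt *http.Client) Option {
	return func(c *Config) { c.client = clt }
}

func WithMaxContentLength(l int64) Option {
	return func(c *Config) { c.maxContentLength = l }
}

// WithTimeout treats the int as seconds; the unit is an assumption.
func WithTimeout(timeout int) Option {
	return func(c *Config) { c.timeout = time.Duration(timeout) * time.Second }
}

func WithUserAgent(ua string) Option {
	return func(c *Config) { c.userAgent = ua }
}

// newConfig is a hypothetical helper: it starts from illustrative defaults
// and applies the options in order, so later options override earlier ones.
func newConfig(opts ...Option) *Config {
	cfg := &Config{
		client:           http.DefaultClient,
		userAgent:        "example-agent/1.0",
		timeout:          30 * time.Second,
		maxContentLength: 1 << 20,
	}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	cfg := newConfig(WithTimeout(10), WithUserAgent("my-bot/0.1"))
	fmt.Println(cfg.timeout, cfg.userAgent, cfg.maxContentLength)
}
```

Options that are not passed leave the corresponding default untouched, which is the main appeal of this pattern over a long constructor signature.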

type Output

type Output struct {
	// Content is the scraped content in markdown format.
	Content string `json:"content,omitempty" jsonschema:"title=content,description=The scraped content in markdown format."`
	// Metadata is metadata about the scraped webpage.
	Metadata *Metadata `json:"metadata,omitempty" jsonschema:"title=metadata,description=Metadata about the webpage."`
}

Output is the schema for the output of the WebpageScraperTool.

func NewOutput

func NewOutput(content string, metadata *Metadata) *Output
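
To show how the JSON struct tags shape the serialized result, the sketch below mirrors Output and Metadata locally, keeping the json tags exactly as documented and dropping the jsonschema tags for brevity. The body of NewOutput and the encode helper are assumptions. Note that, per the documented tag, the Title field is encoded under the key "url", and fields left empty are omitted because of omitempty.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata mirrors the documented json tags (jsonschema tags dropped).
type Metadata struct {
	Title       string `json:"url,omitempty"`
	Author      string `json:"author,omitempty"`
	Description string `json:"description,omitempty"`
	Keywords    string `json:"keywords,omitempty"`
	SiteName    string `json:"sitename,omitempty"`
	Domain      string `json:"domain,omitempty"`
}

// Output mirrors the documented json tags (jsonschema tags dropped).
type Output struct {
	Content  string    `json:"content,omitempty"`
	Metadata *Metadata `json:"metadata,omitempty"`
}

// NewOutput matches the documented signature; the body is an assumption.
func NewOutput(content string, metadata *Metadata) *Output {
	return &Output{Content: content, Metadata: metadata}
}

// encode is a hypothetical helper that serializes an Output to JSON.
func encode(out *Output) (string, error) {
	b, err := json.Marshal(out)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out := NewOutput("# Hello", &Metadata{Title: "Example", Domain: "example.com"})
	s, err := encode(out)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```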

type Webscraper

type Webscraper struct {
	Config
}

func New added in v1.0.1

func New(opts ...Option) *Webscraper

func (*Webscraper) Run

func (t *Webscraper) Run(ctx context.Context, input *Input, output *Output) error

func (*Webscraper) RunAnonymous added in v1.0.8

func (t *Webscraper) RunAnonymous(ctx context.Context, input any) (any, error)

RunAnonymous runs the tool with untyped input and output, for use in tool orchestration.
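
The two methods differ in calling convention: Run fills an Output that the caller allocates, while RunAnonymous accepts and returns untyped values. The sketch below illustrates both with a stand-in type that performs no network I/O. fakeScraper, scrapeOnce, and scrapeAnonymous are hypothetical, the local Input and Output types are trimmed mirrors, and the type assertion inside RunAnonymous is an assumption about how such a wrapper could work, not the package's actual implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Trimmed local mirrors of the documented types.
type Input struct {
	URL          string
	IncludeLinks bool
}

type Output struct {
	Content string
}

// fakeScraper is a hypothetical stand-in for Webscraper; it does no I/O.
type fakeScraper struct{}

// Run writes its result into the caller-provided output.
func (t *fakeScraper) Run(ctx context.Context, input *Input, output *Output) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	if input == nil || output == nil {
		return errors.New("input and output must be non-nil")
	}
	output.Content = "# Scraped from " + input.URL
	return nil
}

// RunAnonymous asserts the input type, delegates to Run, and returns *Output.
func (t *fakeScraper) RunAnonymous(ctx context.Context, input any) (any, error) {
	in, ok := input.(*Input)
	if !ok {
		return nil, fmt.Errorf("unexpected input type %T", input)
	}
	out := new(Output)
	if err := t.Run(ctx, in, out); err != nil {
		return nil, err
	}
	return out, nil
}

// scrapeOnce shows the typed call with a bounded context.
func scrapeOnce(link string) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	var out Output
	if err := (&fakeScraper{}).Run(ctx, &Input{URL: link}, &out); err != nil {
		return "", err
	}
	return out.Content, nil
}

// scrapeAnonymous shows the untyped call an orchestrator might make.
func scrapeAnonymous(input any) (any, error) {
	return (&fakeScraper{}).RunAnonymous(context.Background(), input)
}

func main() {
	content, err := scrapeOnce("https://example.com")
	fmt.Println(content, err)
	_, err = scrapeAnonymous("not an *Input")
	fmt.Println(err)
}
```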
