scraper

package
v0.2.0
Warning

This package is not in the latest version of its module.

Published: Jun 21, 2024 | License: MIT | Imports: 27 | Imported by: 0

Documentation


Constants

const (
	// PageExtension is the file extension that downloaded pages get.
	PageExtension = ".html"
	// PageDirIndex is the file name of the index file for every dir.
	PageDirIndex = "index" + PageExtension
)
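As a rough illustration of how these two constants could be applied, the sketch below maps a URL path to a local file name. The helper `localPageName` is hypothetical and not part of the package, and the rules it applies (directory-like paths become an index file, extensionless pages get the page extension) are assumptions; the package's actual mapping may differ.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Local copies of the package constants, repeated so the sketch compiles on its own.
const (
	PageExtension = ".html"
	PageDirIndex  = "index" + PageExtension
)

// localPageName is a hypothetical helper (not part of the scraper package).
// Assumption: paths ending in "/" are stored as an index file, and paths
// without a file extension get PageExtension appended.
func localPageName(urlPath string) string {
	if urlPath == "" || strings.HasSuffix(urlPath, "/") {
		return path.Join(urlPath, PageDirIndex)
	}
	if path.Ext(urlPath) == "" {
		return urlPath + PageExtension
	}
	return urlPath
}

func main() {
	for _, p := range []string{"/", "/docs/", "/about", "/img/logo.png"} {
		fmt.Printf("%-14s -> %s\n", p, localPageName(p))
	}
}
```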

Variables

This section is empty.

Functions

func Headers added in v0.2.0

func Headers(headers []string) http.Header
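The page does not document the expected format of the input strings. The sketch below shows one plausible reading, assuming each entry has the form `Name: value`. The function `parseHeaders` is a hypothetical stand-in written for illustration, not the package's implementation, and skipping malformed entries is likewise an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// parseHeaders is a hypothetical stand-in for Headers.
// Assumption: each entry looks like "Name: value". Entries without a
// colon or with an empty name are skipped.
func parseHeaders(headers []string) http.Header {
	h := http.Header{}
	for _, entry := range headers {
		// Split on the first colon only, so values may contain colons.
		name, value, found := strings.Cut(entry, ":")
		name = strings.TrimSpace(name)
		if !found || name == "" {
			continue
		}
		h.Add(name, strings.TrimSpace(value))
	}
	return h
}

func main() {
	h := parseHeaders([]string{"Accept-Language: en", "X-Token: abc:123", "malformed"})
	fmt.Println(h.Get("Accept-Language"))
	fmt.Println(h.Get("X-Token"))
	fmt.Println(len(h))
}
```

A result of this shape can be assigned to the `Header` field of `Config`, which has the same `http.Header` type.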

func ServeDirectory added in v0.2.0

func ServeDirectory(ctx context.Context, path string, port int16, logger *log.Logger) error

Types

type Config added in v0.1.1

type Config struct {
	URL      string
	Includes []string
	Excludes []string

	ImageQuality uint // image quality from 0 to 100%, 0 to disable reencoding
	MaxDepth     uint // download depth, 0 for unlimited
	Timeout      uint // time limit in seconds to process each http request

	OutputDirectory string
	Username        string
	Password        string

	Cookies   []Cookie
	Header    http.Header
	Proxy     string
	UserAgent string
}

Config contains the scraper configuration.

type Cookie struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`

	Expires *time.Time `json:"expires,omitempty"`
}

Cookie represents a cookie. It copies parts of the http.Cookie struct but changes the JSON marshaling to exclude empty fields.
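The effect of the `omitempty` tags can be checked with a small standalone program. The sketch below repeats the struct definition locally so it compiles without the package; the `encode` helper is hypothetical and exists only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Cookie is a local copy of the struct shown above.
type Cookie struct {
	Name  string `json:"name"`
	Value string `json:"value,omitempty"`

	Expires *time.Time `json:"expires,omitempty"`
}

// encode is a hypothetical helper that marshals a cookie to a JSON string.
func encode(c Cookie) string {
	b, err := json.Marshal(c)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	expires := time.Date(2030, time.January, 1, 0, 0, 0, 0, time.UTC)

	// Empty Value and nil Expires are left out of the output.
	fmt.Println(encode(Cookie{Name: "session"}))
	// prints {"name":"session"}

	fmt.Println(encode(Cookie{Name: "session", Value: "abc", Expires: &expires}))
	// prints {"name":"session","value":"abc","expires":"2030-01-01T00:00:00Z"}
}
```

The `name` key has no `omitempty` tag, so it is always written, even when the name is an empty string.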

type Scraper

type Scraper struct {
	URL *url.URL // contains the main URL to parse, will be modified in case of a redirect
	// contains filtered or unexported fields
}

Scraper contains all scraping data.

func New

func New(logger *log.Logger, cfg Config) (*Scraper, error)

New creates a new Scraper instance.

func (*Scraper) Cookies added in v0.2.0

func (s *Scraper) Cookies() []Cookie

Cookies returns the current cookies.

func (*Scraper) Start

func (s *Scraper) Start(ctx context.Context) error

Start starts the scraping.
