scrapezillaapp

Version: v0.0.4
Note: this package is not in the latest version of its module.
Published: Jan 15, 2025 | License: MIT | Imports: 17 | Imported by: 0

Documentation

Constants

const (
	DefaultConcurrency = 1
	DefaultProvider    = "memory"
)

Variables

This section is empty.

Functions

func DisableImages

func DisableImages() func(*jsOptions)

func Headfull

func Headfull() func(*jsOptions)

Headfull returns an option that runs the browser in headful (non-headless) mode. Pass it as a parameter to WithJS.

func WithCache

func WithCache(cacheType, cachePath string) func(*Config) error

WithCache sets the cache type and path of the app.

func WithConcurrency

func WithConcurrency(concurrency int) func(*Config) error

WithConcurrency sets the concurrency of the app.

func WithExitOnInactivity

func WithExitOnInactivity(duration time.Duration) func(*Config) error

WithExitOnInactivity sets the duration after which the app will exit if there are no more jobs to run.

func WithInitJob

func WithInitJob(job scrapezilla.IJob) func(*Config) error

WithInitJob sets the initial job of the app.

func WithJS

func WithJS(opts ...func(*jsOptions)) func(*Config) error

WithJS configures the app to render pages with JavaScript.

func WithProvider

func WithProvider(provider scrapezilla.JobProvider) func(*Config) error

WithProvider sets the provider of the app.

func WithProxies

func WithProxies(proxies []string) func(*Config) error

WithProxies sets the proxies of the app.

func WithStealth

func WithStealth(browser string) func(*Config) error

Types

type Config

type Config struct {
	// Concurrency is the number of concurrent scrapers to run.
	// If not set, it defaults to 1.
	Concurrency int `validate:"required,gte=1"`

	// CacheType is the type of cache to use for storing scraped data.
	// If left empty, no caching is used.
	// Otherwise it must be one of file or leveldb.
	CacheType string `validate:"omitempty,oneof=file leveldb"`
	// CachePath is the path to the cache file or directory.
	// It is required to be a valid path if CacheType is set.
	CachePath string `validate:"required_with=CacheType"`

	// UseJS is whether to use JavaScript to render the page.
	UseJS bool `validate:"omitempty"`
	// UseStealth is whether to use stealth mode to scrape the page.
	// Stealth mode uses a special HTTP client to scrape the page.
	UseStealth bool `validate:"omitempty"`
	// StealthBrowser is the browser to use for stealth mode.
	StealthBrowser string `validate:"omitempty"`
	// JSOpts are the options for the JavaScript renderer.
	JSOpts jsOptions

	// Provider is the job provider to use.
	// If not set, the memory provider will be used.
	Provider scrapezilla.JobProvider

	// Writers are the writers to use for writing the results.
	// At least one writer must be provided.
	Writers []scrapezilla.ResultWriter `validate:"required,gt=0"`
	// InitJob is the job to initialize the app with.
	InitJob scrapezilla.IJob
	// ExitOnInactivityDuration is the duration after which the app exits
	// if there are no more jobs to run.
	ExitOnInactivityDuration time.Duration
	// Proxies are the proxies to use for the app.
	Proxies []string
}
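
The validate tags on Config resemble the tag syntax of the go-playground/validator library; assuming that reading, the rules they express can be hand-checked as in the sketch below. The config type here is a trimmed local stand-in (Writers is reduced to a count), and validate is a hypothetical helper, not part of the package.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a local stand-in with only the tagged fields discussed above.
type config struct {
	Concurrency int
	CacheType   string
	CachePath   string
	Writers     int // number of writers; the real field is a slice
}

// validate hand-checks the rules the struct tags appear to express.
func validate(c config) error {
	if c.Concurrency < 1 { // required,gte=1
		return errors.New("concurrency must be at least 1")
	}
	switch c.CacheType { // omitempty,oneof=file leveldb
	case "", "file", "leveldb":
	default:
		return fmt.Errorf("unsupported cache type %q", c.CacheType)
	}
	if c.CacheType != "" && c.CachePath == "" { // required_with=CacheType
		return errors.New("cache path is required when a cache type is set")
	}
	if c.Writers < 1 { // required,gt=0
		return errors.New("at least one writer is required")
	}
	return nil
}

func main() {
	fmt.Println(validate(config{Concurrency: 1, Writers: 1}))
	fmt.Println(validate(config{Concurrency: 1, CacheType: "redis", Writers: 1}))
}
```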

func NewConfig

func NewConfig(writers []scrapezilla.ResultWriter, options ...func(*Config) error) (*Config, error)

NewConfig creates a new config with default values.
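
The signatures above follow the functional options pattern: each With... function returns a func(*Config) error, and the constructor applies them in order on top of the defaults. A minimal self-contained sketch of that pattern follows. The types are local stand-ins that reuse the documented names for readability; the Option alias is hypothetical, the writers parameter is omitted, and rejecting a bad concurrency inside the option is an assumption rather than the package's known behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a trimmed local stand-in for the documented struct.
type Config struct {
	Concurrency int
	CacheType   string
	CachePath   string
}

// Option is a hypothetical alias for the func(*Config) error shape.
type Option func(*Config) error

// WithConcurrency sets the number of concurrent scrapers.
func WithConcurrency(n int) Option {
	return func(c *Config) error {
		if n < 1 {
			return errors.New("concurrency must be at least 1")
		}
		c.Concurrency = n
		return nil
	}
}

// WithCache sets the cache type and path.
func WithCache(cacheType, cachePath string) Option {
	return func(c *Config) error {
		c.CacheType, c.CachePath = cacheType, cachePath
		return nil
	}
}

// NewConfig starts from the defaults and applies each option in order,
// stopping at the first option that returns an error.
func NewConfig(options ...Option) (*Config, error) {
	cfg := &Config{Concurrency: 1} // mirrors DefaultConcurrency
	for _, opt := range options {
		if err := opt(cfg); err != nil {
			return nil, err
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := NewConfig(WithConcurrency(4), WithCache("file", "/tmp/cache"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", *cfg)
}
```

Because options run in order, a later option overrides an earlier one that touches the same field, and callers who pass no options get the defaults.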

type ScrapezillaApp added in v0.0.4

type ScrapezillaApp struct {
	// contains filtered or unexported fields
}

func NewScrapezillaApp added in v0.0.3

func NewScrapezillaApp(cfg *Config) (*ScrapezillaApp, error)

NewScrapezillaApp creates a new ScrapezillaApp.

func (*ScrapezillaApp) Close added in v0.0.4

func (app *ScrapezillaApp) Close() error

Close closes the app.

func (*ScrapezillaApp) Start added in v0.0.4

func (app *ScrapezillaApp) Start(ctx context.Context, seedJobs ...scrapezilla.IJob) error

Start starts the app.
