fetcher

Version: v0.1.2
Warning

This package is not in the latest version of its module.

Published: Dec 29, 2020 License: MIT Imports: 12 Imported by: 0

README

crawler

Documentation

Constants

This section is empty.

Variables

var DefaultOptions = &Options{
	client:          http.DefaultClient,
	limitDuration:   5 * time.Second,
	timeoutDuration: 1 * time.Minute,
	burst:           1,
}

Default options used by a `Fetcher` instance.

Functions

func NewStoreWrappedFetchable added in v0.1.1

func NewStoreWrappedFetchable(fetchable Fetchable, store Store) *storeWrappedFetchable

func ReaderToStringFetchable

func ReaderToStringFetchable(ctx context.Context, reader io.Reader, channelBufferSize int) <-chan Fetchable

Types

type FetchI

type FetchI interface {
	Fetch(Fetchable) error
}

Interface implemented by types that can fetch a `Fetchable`.

type Fetchable

type Fetchable interface {
	// Unique identifier for this fetchable item. This is useful in logging.
	Id() string

	// Url that this is trying to fetch. It can also be determined from
	// Request. Keeping it here to avoid repetition in the codebase.
	Url() string

	// Build a request.
	Request() (*http.Request, error)

	// Callback to handle the http response body corresponding to the request.
	// This can be used for example, to store data into the store, or to parse
	// the results in some way
	HandleResponseBody([]byte) error
}

Interface that defines what can be `fetched`.
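
As an illustration of implementing this interface, here is a minimal sketch of a custom `Fetchable`. The `Fetchable` interface is copied locally so the example compiles on its own; `JSONTitleFetchable`, its fields, and the example URL are hypothetical and not part of this package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Fetchable mirrors the interface documented above so that this sketch
// compiles without importing the package.
type Fetchable interface {
	Id() string
	Url() string
	Request() (*http.Request, error)
	HandleResponseBody([]byte) error
}

// JSONTitleFetchable is a hypothetical Fetchable that decodes a JSON
// response body and keeps its "title" field.
type JSONTitleFetchable struct {
	ID     string
	Target string
	Title  string
}

func (j *JSONTitleFetchable) Id() string  { return j.ID }
func (j *JSONTitleFetchable) Url() string { return j.Target }

// Request builds a GET request that asks for JSON.
func (j *JSONTitleFetchable) Request() (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, j.Target, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/json")
	return req, nil
}

// HandleResponseBody parses the body and stores the title on the receiver.
func (j *JSONTitleFetchable) HandleResponseBody(body []byte) error {
	var payload struct {
		Title string `json:"title"`
	}
	if err := json.Unmarshal(body, &payload); err != nil {
		return fmt.Errorf("decode %s: %w", j.ID, err)
	}
	j.Title = payload.Title
	return nil
}

// Compile-time check that the type satisfies the interface.
var _ Fetchable = (*JSONTitleFetchable)(nil)

func main() {
	item := &JSONTitleFetchable{ID: "example-1", Target: "https://example.com/item.json"}
	// Simulate the callback a fetcher would make with the response body.
	if err := item.HandleResponseBody([]byte(`{"title":"hello"}`)); err != nil {
		panic(err)
	}
	fmt.Println(item.Id(), item.Title) // prints "example-1 hello"
}
```

A pointer receiver is used here so that `HandleResponseBody` can record the parsed result on the value it was called on.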

type Fetcher

type Fetcher struct {
	// contains filtered or unexported fields
}

Fetcher is the struct used to download `Fetchable` items.

func NewFetcher

func NewFetcher() *Fetcher

Returns a new `Fetcher` instance.

func NewFetcherWithOptions

func NewFetcherWithOptions(options *Options) *Fetcher

Returns a `Fetcher` with the specified options. Any field of `options` left at its zero value takes its value from `DefaultOptions`, so a caller only needs to set the options that differ from the defaults.

func (*Fetcher) Fetch

func (f *Fetcher) Fetch(furl Fetchable) error

Performs the actual fetch of a given `Fetchable`. The steps it follows are:

  1. Build the request by calling `Request()`
  2. Validate the request by calling `Validate()`
  3. Wait until the rate limit allows the domain to be crawled, or until `options.timeoutDuration` is exceeded
  4. Make the HTTP request with the supplied client, then call `HandleResponseBody()` on the response body

func (*Fetcher) FetchConcurrentlyWait added in v0.1.2

func (f *Fetcher) FetchConcurrentlyWait(urlChannel <-chan Fetchable, concurrency int)

Starts `concurrency` goroutines to fetch content from `urlChannel` in parallel. The goroutines end when the `urlChannel` is closed. This method waits until all the launched goroutines are complete.

Note: Make sure you call `close()` on `urlChannel`; otherwise this method never returns.

type Options

type Options struct {
	// contains filtered or unexported fields
}

type PebbleStore

type PebbleStore struct {
	// contains filtered or unexported fields
}

func NewPebbleStore

func NewPebbleStore(dirname string) (*PebbleStore, error)

func (*PebbleStore) Close

func (s *PebbleStore) Close()

func (*PebbleStore) Get

func (*PebbleStore) Set

func (s *PebbleStore) Set(key string, body []byte) error

type Store

type Store interface {
	Get(key string) (*crawled_url.CrawledUrl, io.Closer, error)
	Set(key string, body []byte) error
}

type StoreBackedFetcher

type StoreBackedFetcher struct {
	// contains filtered or unexported fields
}

func NewStoreBackedFetcher

func NewStoreBackedFetcher(store Store, fetcher *Fetcher, minInterval time.Duration) *StoreBackedFetcher

func (*StoreBackedFetcher) Fetch

func (sbf *StoreBackedFetcher) Fetch(furl StoringFetchable) error

func (*StoreBackedFetcher) FetchConcurrentlyWait added in v0.1.2

func (sbf *StoreBackedFetcher) FetchConcurrentlyWait(urlChannel <-chan StoringFetchable, concurrency int)

Starts `concurrency` goroutines to fetch content from `urlChannel` in parallel. The goroutines end when the `urlChannel` is closed. This method waits until all the launched goroutines are complete.

Note: Make sure you call `close()` on `urlChannel`; otherwise this method never returns.

type StoringFetchable

type StoringFetchable interface {
	Fetchable

	// Force a fetch from the URL, instead of getting from the store
	ForceFetch() bool
}

type StringFetchable

type StringFetchable string

Wraps a `string` (holding a URL) as a `Fetchable`.

func (StringFetchable) HandleResponseBody

func (sf StringFetchable) HandleResponseBody(body []byte) error

Always returns nil.

func (StringFetchable) Id

func (sf StringFetchable) Id() string

Returns the string as the Id.

func (StringFetchable) Request

func (sf StringFetchable) Request() (*http.Request, error)

Returns a GET request to the Url.

func (StringFetchable) Url

func (sf StringFetchable) Url() string

Returns the string as the Url.

Directories

Path Synopsis
cmd
