fetch

package
v0.2.2 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 8, 2023 License: AGPL-3.0 Imports: 14 Imported by: 0

Documentation

Overview

Package fetch provides a stateful interface for routinely fetching resources from the web. Fetchers synchronously make web requests to get the latest version of a resource and preserve the state of the last request they made. They ensure that connections are closed and that resource use is minimized. For example, an RSS or Atom feed may need to be refreshed periodically, but to save bandwidth we want to respect the ETag and Last-Modified headers as well as Cache-Control. By creating a fetcher, we can repeatedly fetch the resource while minimizing bandwidth and being a good netizen.

Right now the fetcher can handle HTTP and HTTPS requests, but future implementations may also include authenticated fetchers. There are currently two types of fetchers: the FeedFetcher and the HTMLFetcher. The former is designed to fetch and parse RSS and Atom feeds, while the latter is designed to fetch HTML content.

Basic Usage:

fetcher := fetch.NewFeedFetcher("https://www.example.com/rss")
feed, err := fetcher.Fetch(ctx)

For more on RSS hacking and bandwidth minimization see: https://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers
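The conditional GET technique described in the overview can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation: `conditionalFetcher`, its `fetch` method, and `newTestServer` are hypothetical names, and the local test server stands in for a real feed.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// conditionalFetcher is a hypothetical stand-in that remembers the validators
// from the previous response; it is not the package's FeedFetcher.
type conditionalFetcher struct {
	url      string
	etag     string
	modified string
}

// fetch issues a GET and sends If-None-Match / If-Modified-Since when
// validators are known. It returns the body and true when new content
// arrived, or nil and false when the server answered 304 Not Modified.
func (f *conditionalFetcher) fetch(client *http.Client) ([]byte, bool, error) {
	req, err := http.NewRequest(http.MethodGet, f.url, nil)
	if err != nil {
		return nil, false, err
	}
	if f.etag != "" {
		req.Header.Set("If-None-Match", f.etag)
	}
	if f.modified != "" {
		req.Header.Set("If-Modified-Since", f.modified)
	}

	rep, err := client.Do(req)
	if err != nil {
		return nil, false, err
	}
	defer rep.Body.Close() // always release the connection

	if rep.StatusCode == http.StatusNotModified {
		return nil, false, nil
	}
	if rep.StatusCode != http.StatusOK {
		return nil, false, fmt.Errorf("unexpected status: %s", rep.Status)
	}

	body, err := io.ReadAll(rep.Body)
	if err != nil {
		return nil, false, err
	}

	// Preserve the validators for the next request.
	f.etag = rep.Header.Get("ETag")
	f.modified = rep.Header.Get("Last-Modified")
	return body, true, nil
}

// newTestServer returns a local server that honors If-None-Match for a fixed ETag.
func newTestServer() *httptest.Server {
	const etag = `"v1"`
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", etag)
		fmt.Fprint(w, "<rss></rss>")
	}))
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	f := &conditionalFetcher{url: srv.URL}
	for i := 0; i < 2; i++ {
		body, changed, err := f.fetch(srv.Client())
		fmt.Println(len(body), changed, err)
	}
}
```

The second request carries the stored ETag, so the server can reply without a body and the caller can skip parsing entirely.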

Index

Constants

const (
	HeaderUserAgent       = "User-Agent"
	HeaderAccept          = "Accept"
	HeaderAcceptLang      = "Accept-Language"
	HeaderAcceptEncode    = "Accept-Encoding"
	HeaderCacheControl    = "Cache-Control"
	HeaderReferer         = "Referer"
	HeaderIfNoneMatch     = "If-None-Match"
	HeaderIfModifiedSince = "If-Modified-Since"
	HeaderRFC3229         = "A-IM"
	HeaderETag            = "ETag"
	HeaderLastModified    = "Last-Modified"
	HeaderContentType     = "Content-Type"
	HeaderContentEncoding = "Content-Encoding"
)

Canonical names of headers used by the fetch package

Variables

This section is empty.

Functions

func SetClient

func SetClient(c *http.Client)

SetClient allows you to specify an alternative http.Client to the default one used by all http based Fetchers in this package. Use this function to change the timeouts of the client or to set a test client.
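As a rough sketch of the kind of client one might pass to SetClient, the following builds an http.Client with an overall timeout and a bounded transport. The helper name `newClient` and the specific limits are assumptions chosen for illustration; the call to fetch.SetClient appears only in a comment so that the block compiles without the package.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newClient is a hypothetical helper that builds an http.Client with a hard
// overall timeout and a transport with bounded idle connections.
func newClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout, // covers connect, redirects, and reading the body
		Transport: &http.Transport{
			MaxIdleConns:        10,
			IdleConnTimeout:     30 * time.Second,
			TLSHandshakeTimeout: 5 * time.Second,
		},
	}
}

func main() {
	client := newClient(15 * time.Second)

	// With the fetch package imported, the client would be installed with:
	//   fetch.SetClient(client)
	fmt.Println(client.Timeout)
}
```

In tests, the same hook could receive the client returned by an httptest.Server so that no request leaves the machine.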

Types

type FeedFetcher

type FeedFetcher struct {
	// contains filtered or unexported fields
}

FeedFetcher provides an interface for anything that can get RSS data and provide it in a sequential fashion (i.e. without concurrency). The fetcher is the building block for larger subscription routines that periodically use the fetcher to retrieve data. FeedFetchers should therefore be treated as things that will only run inside of a single thread, whereas Subscription objects are things that may run concurrently.

func NewFeedFetcher

func NewFeedFetcher(url string) *FeedFetcher

NewFeedFetcher creates a new HTTP fetcher that can fetch RSS feeds from the specified URL.

func (*FeedFetcher) ETag

func (f *FeedFetcher) ETag() string

func (*FeedFetcher) Fetch

func (f *FeedFetcher) Fetch(ctx context.Context) (feed *gofeed.Feed, err error)

The FeedFetcher uses GET requests to retrieve data with a Baleen-specific HTTP client. We avoid using gofeed.ParseURL because it is very simple and doesn't respect rate limits or ETags, both of which are necessary for Baleen to run in continuous operation.

func (*FeedFetcher) Modified

func (f *FeedFetcher) Modified() string

type Fetcher

type Fetcher interface {
	Fetch(context.Context) (any, error)
}

Fetcher is an interface for statefully making periodic requests to a resource.
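To illustrate how a Fetcher might be driven periodically from a single goroutine, here is a small sketch. The `Fetcher` interface is redeclared locally so the block compiles on its own, and `counter`, `poll`, and `demo` are hypothetical names that are not part of the package.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Fetcher mirrors the documented interface so the sketch compiles on its own.
type Fetcher interface {
	Fetch(context.Context) (any, error)
}

// counter is a hypothetical stateful Fetcher that counts its calls.
type counter struct{ n int }

func (c *counter) Fetch(ctx context.Context) (any, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	c.n++
	return c.n, nil
}

// poll calls f sequentially, waiting one interval between calls, until it has
// collected limit results, a fetch fails, or ctx is cancelled.
func poll(ctx context.Context, f Fetcher, interval time.Duration, limit int) ([]any, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	results := make([]any, 0, limit)
	for i := 0; i < limit; i++ {
		if i > 0 {
			// Wait for the next tick, but stop early if the context ends.
			select {
			case <-ctx.Done():
				return results, ctx.Err()
			case <-ticker.C:
			}
		}
		item, err := f.Fetch(ctx)
		if err != nil {
			return results, err
		}
		results = append(results, item)
	}
	return results, nil
}

// demo runs three polls one millisecond apart against the counter.
func demo() ([]any, error) {
	return poll(context.Background(), &counter{}, time.Millisecond, 3)
}

func main() {
	results, err := demo()
	fmt.Println(results, err)
}
```

Because the loop owns the fetcher and never shares it, the fetcher's stored state needs no locking; any concurrency would live in the code that runs several such loops side by side.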

type HTML added in v0.2.2

type HTML struct {
	// contains filtered or unexported fields
}

HTML is an in-memory materialized view of an HTML document fetched by the HTMLFetcher. It has helper methods to decode and parse the contents of the response, particularly if that response is compressed or uses a character encoding other than UTF-8.

func (*HTML) Description added in v0.2.2

func (h *HTML) Description() string

func (*HTML) Extract added in v0.2.2

func (h *HTML) Extract() (_ []byte, err error)

Extract handles compression and content encoding from the response.
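The idea behind handling content encoding can be sketched with the standard library. This is an assumption-laden illustration rather than what Extract actually does: `decodeBody` and `gzipBytes` are hypothetical helpers, only gzip and identity encodings are handled, and character-set conversion is left out.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// decodeBody is a hypothetical helper that undoes the Content-Encoding of a
// response body. Only "gzip" and the identity encoding are handled here.
func decodeBody(contentEncoding string, body []byte) ([]byte, error) {
	switch contentEncoding {
	case "", "identity":
		return body, nil
	case "gzip":
		zr, err := gzip.NewReader(bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		return io.ReadAll(zr)
	default:
		return nil, fmt.Errorf("unsupported content encoding %q", contentEncoding)
	}
}

// gzipBytes compresses data so the example has something to decode.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	compressed, err := gzipBytes([]byte("<html><title>hello</title></html>"))
	if err != nil {
		panic(err)
	}
	out, err := decodeBody("gzip", compressed)
	fmt.Println(string(out), err)
}
```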

func (*HTML) Title added in v0.2.2

func (h *HTML) Title() string

type HTMLFetcher

type HTMLFetcher struct {
	// contains filtered or unexported fields
}

HTMLFetcher fetches the full HTML associated with a feed item.

func NewHTMLFetcher

func NewHTMLFetcher(url string) *HTMLFetcher

NewHTMLFetcher creates a new HTML fetcher that can fetch the full HTML from the specified URL.

func (*HTMLFetcher) Fetch

func (f *HTMLFetcher) Fetch(ctx context.Context) (html *HTML, err error)

The HTMLFetcher uses GET requests with a Baleen-specific HTTP client to retrieve the HTML containing the full text of feed articles. TODO: return an HTML file instead of simply raw bytes (including document data).

type HTTPError

type HTTPError struct {
	Code   int
	Status string
}

HTTPError contains status information from the response and can be returned as an error. This type of error is returned from the Fetcher when the server replies successfully but without a 200 status. The suggested use of this error is in a type switch, e.g. something like: switch he := err.(type) {case fetch.HTTPError: ... default: ...}
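The suggested type switch could look roughly like the sketch below. To keep the block self-contained, `HTTPError` is redeclared locally with the documented fields and methods; the format of its Error string, the `classify` helper, and the action names it returns are assumptions made for illustration.

```go
package main

import "fmt"

// HTTPError is a local stand-in that mirrors the documented fields and
// methods; real code would use fetch.HTTPError instead.
type HTTPError struct {
	Code   int
	Status string
}

// The message format below is an assumption, not the package's actual output.
func (e HTTPError) Error() string     { return fmt.Sprintf("[%d] %s", e.Code, e.Status) }
func (e HTTPError) NotModified() bool { return e.Code == 304 }
func (e HTTPError) NotFound() bool    { return e.Code == 404 }
func (e HTTPError) Forbidden() bool   { return e.Code == 403 }

// classify is a hypothetical helper that decides what a polling loop should
// do with the error returned by a fetch.
func classify(err error) string {
	switch he := err.(type) {
	case nil:
		return "process" // new content arrived
	case HTTPError:
		switch {
		case he.NotModified():
			return "skip" // nothing changed since the last fetch
		case he.NotFound(), he.Forbidden():
			return "disable" // the resource is gone or off limits
		default:
			return "retry"
		}
	default:
		return "retry" // network failure or other non-HTTP error
	}
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify(HTTPError{Code: 304, Status: "Not Modified"}))
	fmt.Println(classify(HTTPError{Code: 404, Status: "Not Found"}))
}
```

A plain type switch does not see through wrapped errors; if callers wrap the error with fmt.Errorf and %w, errors.As is the usual alternative.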

func (HTTPError) Error

func (e HTTPError) Error() string

Error implements the error interface and returns a string representation of the error.

func (HTTPError) Forbidden

func (e HTTPError) Forbidden() bool

Forbidden returns true if the error is an HTTP 403

func (HTTPError) NotFound

func (e HTTPError) NotFound() bool

NotFound returns true if the error is an HTTP 404

func (HTTPError) NotModified

func (e HTTPError) NotModified() bool

NotModified returns true if the error is an HTTP 304
