Documentation ¶
Overview ¶
Package fetch provides a stateful interface for routinely fetching resources from the web. Fetchers synchronously make web requests to get the latest version of a resource and preserve the state of the last request they made. They ensure that connections are closed and that resource use is minimized. For example, an RSS or Atom feed may need to be refreshed periodically, but to save bandwidth we want to respect the ETag and Last-Modified headers as well as Cache-Control. By creating a fetcher, we can repeatedly fetch the resource while minimizing bandwidth and being a good netizen.
Right now the fetcher can handle HTTP and HTTPS requests, but future implementations may also include authenticated fetchers. There are currently two types of fetchers: the FeedFetcher and the HTMLFetcher. The former is designed to fetch and parse RSS and Atom feeds, while the latter is designed to fetch HTML content.
Basic Usage:
fetcher := fetch.NewFeedFetcher("https://www.example.com/rss")
feed, err := fetcher.Fetch(ctx)
For more on RSS hacking and bandwidth minimization see: https://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers
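The conditional GET technique described in the overview can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the package's implementation: the conditionalClient type and the demoConditionalGet helper are hypothetical names, and the local test server only honors If-None-Match.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// conditionalClient remembers the validators from the previous response,
// which is the state a fetcher has to preserve between requests.
// It is a hypothetical stand-in, not the package's FeedFetcher.
type conditionalClient struct {
	url      string
	etag     string
	modified string
}

// get issues a GET request, adding conditional headers when validators
// are known, and returns the HTTP status code.
func (c *conditionalClient) get() (int, error) {
	req, err := http.NewRequest(http.MethodGet, c.url, nil)
	if err != nil {
		return 0, err
	}
	if c.etag != "" {
		req.Header.Set("If-None-Match", c.etag)
	}
	if c.modified != "" {
		req.Header.Set("If-Modified-Since", c.modified)
	}

	rep, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer rep.Body.Close() // always release the connection

	// Only a full response carries fresh validators worth remembering.
	if rep.StatusCode == http.StatusOK {
		c.etag = rep.Header.Get("ETag")
		c.modified = rep.Header.Get("Last-Modified")
	}
	return rep.StatusCode, nil
}

// demoConditionalGet runs two requests against a local test server that
// honors If-None-Match and returns both status codes.
func demoConditionalGet() (first, second int, err error) {
	const tag = `"v1"`
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("If-None-Match") == tag {
			w.WriteHeader(http.StatusNotModified)
			return
		}
		w.Header().Set("ETag", tag)
		fmt.Fprint(w, "<rss></rss>")
	}))
	defer srv.Close()

	c := &conditionalClient{url: srv.URL}
	if first, err = c.get(); err != nil {
		return first, second, err
	}
	second, err = c.get()
	return first, second, err
}

func main() {
	first, second, err := demoConditionalGet()
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second)
}
```

The second request sends the stored ETag back to the server, so an unchanged resource costs only a header exchange rather than a full download.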
Index ¶
Constants ¶
const (
	HeaderUserAgent       = "User-Agent"
	HeaderAccept          = "Accept"
	HeaderAcceptLang      = "Accept-Language"
	HeaderAcceptEncode    = "Accept-Encoding"
	HeaderCacheControl    = "Cache-Control"
	HeaderReferer         = "Referer"
	HeaderIfNoneMatch     = "If-None-Match"
	HeaderIfModifiedSince = "If-Modified-Since"
	HeaderRFC3229         = "A-IM"
	HeaderETag            = "ETag"
	HeaderLastModified    = "Last-Modified"
	HeaderContentType     = "Content-Type"
	HeaderContentEncoding = "Content-Encoding"
)
Canonical names of headers used by the fetch package
Variables ¶
This section is empty.
Functions ¶
Types ¶
type FeedFetcher ¶
type FeedFetcher struct {
// contains filtered or unexported fields
}
FeedFetcher provides an interface for anything that can get RSS data and provide it in a sequential fashion (i.e. without concurrency). The fetcher is the building block for larger subscription routines that periodically use the fetcher to retrieve data. FeedFetchers should therefore be treated as things that only run inside a single thread, whereas Subscription objects are things that may run concurrently.
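One way to picture the split between a sequential fetcher and a concurrent subscription is a polling loop that owns the fetcher in a single goroutine and hands results to other goroutines over a channel. The sketch below is an assumption about how such a routine could look, not the package's Subscription type: fetchFunc, poll, and runPoll are hypothetical names, and fetchFunc merely stands in for a fetcher's Fetch method.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fetchFunc is a hypothetical stand-in for a fetcher's Fetch method.
type fetchFunc func(ctx context.Context) (string, error)

// poll calls fetch once per interval from a single goroutine until ctx is
// done, sending each result to out. Only this goroutine touches the fetcher.
func poll(ctx context.Context, interval time.Duration, fetch fetchFunc, out chan<- string) {
	defer close(out)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			item, err := fetch(ctx)
			if err != nil {
				continue // a real routine would log or back off here
			}
			select {
			case out <- item:
			case <-ctx.Done():
				return
			}
		}
	}
}

// runPoll starts the polling goroutine and collects the first n results.
func runPoll(n int) []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	calls := 0 // unsynchronized on purpose: only poll's goroutine writes it
	fetch := func(ctx context.Context) (string, error) {
		calls++
		return fmt.Sprintf("fetch #%d", calls), nil
	}

	out := make(chan string)
	go poll(ctx, time.Millisecond, fetch, out)

	var results []string
	for item := range out {
		results = append(results, item)
		if len(results) == n {
			cancel() // stop the polling goroutine
			break
		}
	}
	return results
}

func main() {
	for _, item := range runPoll(3) {
		fmt.Println(item)
	}
}
```

Because the fetcher's state is only ever touched by the polling goroutine, it needs no locking; concurrency lives entirely on the consumer side of the channel.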
func NewFeedFetcher ¶
func NewFeedFetcher(url string) *FeedFetcher
NewFeedFetcher creates a new HTTP fetcher that can fetch RSS feeds from the specified URL.
func (*FeedFetcher) ETag ¶
func (f *FeedFetcher) ETag() string
func (*FeedFetcher) Fetch ¶
The FeedFetcher uses GET requests to retrieve data with a Baleen-specific HTTP client. We avoid using gofeed.ParseURL because it is very simple and does not respect rate limits or ETags, both of which Baleen must honor to run in continuous operation.
func (*FeedFetcher) Modified ¶
func (f *FeedFetcher) Modified() string
type HTML ¶ added in v0.2.2
type HTML struct {
// contains filtered or unexported fields
}
HTML is an in-memory materialized view of an HTML document fetched by the HTMLFetcher. It has helper methods to decode and parse the contents of the response, particularly if that response is compressed or uses a character encoding other than UTF-8.
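To illustrate the kind of decoding a compressed response needs, here is a minimal sketch that decompresses a body based on its Content-Encoding value. It is not the HTML type's actual helper: decodeBody, gzipBytes, and roundTrip are hypothetical names, only gzip is handled, and character set conversion is left out because it requires a package outside the standard library.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// decodeBody returns the decompressed body for a Content-Encoding value.
// Only "gzip" and unencoded bodies are handled in this sketch.
func decodeBody(raw []byte, contentEncoding string) ([]byte, error) {
	switch contentEncoding {
	case "", "identity":
		return raw, nil
	case "gzip":
		zr, err := gzip.NewReader(bytes.NewReader(raw))
		if err != nil {
			return nil, err
		}
		defer zr.Close()
		return io.ReadAll(zr)
	default:
		return nil, fmt.Errorf("unsupported content encoding %q", contentEncoding)
	}
}

// gzipBytes compresses data; it exists only to build input for the demo.
func gzipBytes(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// roundTrip compresses s and then decodes it as a gzip-encoded body.
func roundTrip(s string) (string, error) {
	compressed, err := gzipBytes([]byte(s))
	if err != nil {
		return "", err
	}
	decoded, err := decodeBody(compressed, "gzip")
	if err != nil {
		return "", err
	}
	return string(decoded), nil
}

func main() {
	doc, err := roundTrip("<html><head><title>Example</title></head></html>")
	if err != nil {
		panic(err)
	}
	fmt.Println(doc)
}
```

Go's HTTP transport decompresses gzip transparently only when it added the Accept-Encoding header itself; a client that sets that header explicitly has to decode the body by hand, roughly as shown here.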
func (*HTML) Description ¶ added in v0.2.2
type HTMLFetcher ¶
type HTMLFetcher struct {
// contains filtered or unexported fields
}
HTMLFetcher is an interface for fetching the full HTML associated with a feed item.
func NewHTMLFetcher ¶
func NewHTMLFetcher(url string) *HTMLFetcher
NewHTMLFetcher creates a new HTML fetcher that can fetch the full HTML from the specified URL.
func (*HTMLFetcher) Fetch ¶
func (f *HTMLFetcher) Fetch(ctx context.Context) (html *HTML, err error)
The HTMLFetcher uses GET requests with a Baleen-specific HTTP client to retrieve the HTML containing the full text of the articles in a feed. TODO: return an HTML file instead of simply raw bytes (including document data).
type HTTPError ¶
HTTPError contains status information from the request and can be returned as an error. The Fetcher returns this type of error when the server replies successfully but with a status other than 200. The suggested use of this error is in a type switch, e.g. something like: switch he := err.(type) { case fetch.HTTPError: ... default: ... }
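The type switch suggested above can be sketched as follows. The HTTPError defined here is a local stand-in so the example compiles on its own: its Status and Code fields and the classify helper are assumptions for illustration, and the real fetch.HTTPError may be laid out differently.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// HTTPError is a local stand-in for fetch.HTTPError. The Status and Code
// fields are assumed for this sketch.
type HTTPError struct {
	Status string
	Code   int
}

// Error implements the error interface.
func (e HTTPError) Error() string {
	return fmt.Sprintf("[%d] %s", e.Code, e.Status)
}

// NotModified reports whether the error represents an HTTP 304.
func (e HTTPError) NotModified() bool {
	return e.Code == http.StatusNotModified
}

// classify decides what a caller should do with the error from a fetch.
func classify(err error) string {
	switch he := err.(type) {
	case nil:
		return "ok"
	case HTTPError:
		if he.NotModified() {
			return "unchanged" // nothing new to parse, not a failure
		}
		return "http error: " + he.Error()
	default:
		return "failed: " + err.Error()
	}
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify(HTTPError{Status: "304 Not Modified", Code: 304}))
	fmt.Println(classify(HTTPError{Status: "500 Internal Server Error", Code: 500}))
	fmt.Println(classify(errors.New("connection refused")))
}
```

Treating a 304 as a distinct, non-fatal outcome lets a polling routine skip parsing without logging a failure on every unchanged fetch.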
func (HTTPError) Error ¶
Error implements the error interface and returns a string representation of the error.
func (HTTPError) NotModified ¶
NotModified returns true if the error is an HTTP 304 (Not Modified).