Documentation ¶
Index ¶
- func CompressSpace(s string) string
- func Contains(container *html.Node, n *html.Node) bool
- func DescribeNode(n *html.Node) string
- func DumpTree(n *html.Node, depth int)
- func GetAttr(n *html.Node, attr string) string
- func GetTextContent(n *html.Node) string
- func ParseTime(s string) (time.Time, error)
- func RenderText(n *html.Node) string
- func ServerMain(dbFile string, configfunc ConfigureFunc)
- func StripComments(n *html.Node)
- type ConfigureFunc
- type DBStore
- type DiscoverFunc
- func BuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) (DiscoverFunc, error)
- func BuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) (DiscoverFunc, error)
- func BuildRSSDiscover(scraperName string, feeds []string) (DiscoverFunc, error)
- func MustBuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) DiscoverFunc
- func MustBuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) DiscoverFunc
- func MustBuildRSSDiscover(scraperName string, feeds []string) DiscoverFunc
- type PressRelease
- type ScrapeFunc
- type Scraper
- type Store
- type TestStore
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CompressSpace ¶
func CompressSpace(s string) string
CompressSpace reduces all whitespace sequences (spaces, tabs, newlines, etc.) in a string to a single space. Leading and trailing whitespace is trimmed. This has the effect of collapsing a multiline string onto one line.
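A minimal sketch of the documented behaviour (an assumption, not necessarily the package's actual implementation; uses the standard strings package):

// compressSpace collapses whitespace runs to single spaces and trims
// the ends: strings.Fields splits on any run of whitespace, and Join
// reassembles the pieces with single spaces.
func compressSpace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}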
func Contains ¶
func Contains(container *html.Node, n *html.Node) bool
func DescribeNode ¶
func DescribeNode(n *html.Node) string
DescribeNode generates a debug string describing the node. Returns a string of the form "<element#id.class>" (i.e. like a CSS selector).
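A sketch of how such a string might be assembled (an assumption, not the package's actual code; uses strings and golang.org/x/net/html):

// describeNode builds a CSS-selector-ish debug string for an element
// node, e.g. <div#main.article.lead>.
func describeNode(n *html.Node) string {
	if n.Type != html.ElementNode {
		return "<non-element>"
	}
	desc := n.Data // element name, e.g. "div"
	for _, a := range n.Attr {
		switch a.Key {
		case "id":
			desc += "#" + a.Val
		case "class":
			for _, cls := range strings.Fields(a.Val) {
				desc += "." + cls
			}
		}
	}
	return "<" + desc + ">"
}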
func DumpTree ¶
func DumpTree(n *html.Node, depth int)
func GetAttr ¶
func GetAttr(n *html.Node, attr string) string
GetAttr retrieves the value of an attribute on a node. Returns an empty string if the attribute doesn't exist.
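The typical lookup is a linear scan of the node's attribute list; a sketch (uses golang.org/x/net/html):

// getAttr scans the node's Attr slice and returns the first matching
// value, or "" if the attribute isn't present.
func getAttr(n *html.Node, attr string) string {
	for _, a := range n.Attr {
		if a.Key == attr {
			return a.Val
		}
	}
	return ""
}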
func GetTextContent ¶
func GetTextContent(n *html.Node) string
GetTextContent recursively fetches the text content of a node.
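A sketch of the usual recursion (an assumption, not necessarily the actual implementation; uses strings and golang.org/x/net/html):

// getTextContent concatenates the Data of every text node in the
// subtree rooted at n.
func getTextContent(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		sb.WriteString(getTextContent(c))
	}
	return sb.String()
}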
func ParseTime ¶
func ParseTime(s string) (time.Time, error)
func RenderText ¶
func RenderText(n *html.Node) string
RenderText returns the node's text, using whitespace and line breaks to make it readable.
func ServerMain ¶
func ServerMain(dbFile string, configfunc ConfigureFunc)
ServerMain is the entry point for running the server. It handles command-line flags and the rest of the setup. The idea is that you can easily write a new server with a different set of scrapers: the real main() is just a small stub which instantiates the scrapers, then passes control to ServerMain. See ukpr/main.go for an example.
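For illustration, a hypothetical stub in the spirit of ukpr/main.go. The exact ConfigureFunc signature isn't shown on this page, so the shape of configure below (returning the scrapers to run) is an assumption, as are the feed URL and selectors:

func main() {
	// configure is hypothetical: ConfigureFunc's real signature is
	// defined by this package, and is assumed here to supply the
	// scrapers the server should run.
	configure := func() ([]*Scraper, error) {
		return []*Scraper{
			{
				Name:     "example",
				Discover: MustBuildRSSDiscover("example", []string{"http://example.com/feed.rss"}),
				Scrape:   MustBuildGenericScrape("example", "h1.title", "div.release-body", "", "p.date"),
			},
		}, nil
	}
	ServerMain("press.db", configure)
}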
func StripComments ¶
func StripComments(n *html.Node)
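No doc comment is given; judging by the name and signature, it presumably removes comment nodes from the tree. A sketch of that behaviour (an assumption, not the package's code; uses golang.org/x/net/html):

// stripComments walks the subtree and removes any comment nodes.
func stripComments(n *html.Node) {
	var next *html.Node
	for c := n.FirstChild; c != nil; c = next {
		next = c.NextSibling // grab before possible removal
		if c.Type == html.CommentNode {
			n.RemoveChild(c)
		} else {
			stripComments(c)
		}
	}
}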
Types ¶
type ConfigureFunc ¶
type DBStore ¶
type DBStore struct {
// contains filtered or unexported fields
}
DBStore manages an archive of recent press releases in an SQLite database. It also implements eventsource.Repository, so that press releases can be streamed out as server-sent events. It can stash press releases from multiple sources.
func NewDBStore ¶
func (*DBStore) Replay ¶
func (store *DBStore) Replay(channel, lastEventId string) (out chan eventsource.Event)
Replay handles Last-Event-ID catchups. Note: channel contains the source name (e.g. 'tesco').
func (*DBStore) Stash ¶
func (store *DBStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)
Stash adds a press release to the store.
func (*DBStore) WhichAreNew ¶
func (store *DBStore) WhichAreNew(incoming []*PressRelease) []*PressRelease
WhichAreNew returns the incoming press releases with the ones already in the store culled out.
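Together with Stash, this supports the usual polling pattern; a sketch (error handling kept minimal; store is a DBStore or any other Store, and incoming is whatever a DiscoverFunc returned):

// Cull press releases we've already seen, then stash the fresh ones.
for _, pr := range store.WhichAreNew(incoming) {
	if _, err := store.Stash(pr); err != nil {
		log.Printf("stash %s: %v", pr.Permalink, err)
	}
}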
type DiscoverFunc ¶
type DiscoverFunc func() ([]*PressRelease, error)
DiscoverFunc is for fetching a list of 'current' press releases (via RSS feed, by scraping an index page, or whatever). The results are passed back as PressRelease structs. At the very least, the Permalink field must be set to the URL of the press release, but there's no reason Discover() can't fill out all the fields if the data is available (e.g. some RSS feeds have everything required). For incomplete PressReleases, the framework will fetch the HTML from the Permalink URL and invoke Scrape() to complete the data.
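As a sketch, a hand-rolled DiscoverFunc that returns minimal PressReleases, leaving the framework to fetch each page and Scrape() the rest (the URLs and source name are placeholders):

// discoverFromList is a hypothetical DiscoverFunc: only Permalink (and
// Source) are filled in; Scrape() completes the remaining fields later.
func discoverFromList() ([]*PressRelease, error) {
	urls := []string{
		"http://example.com/press/1",
		"http://example.com/press/2",
	}
	prs := make([]*PressRelease, 0, len(urls))
	for _, u := range urls {
		prs = append(prs, &PressRelease{Source: "example", Permalink: u})
	}
	return prs, nil
}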
func BuildGenericDiscover ¶
func BuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) (DiscoverFunc, error)
BuildGenericDiscover returns a DiscoverFunc which fetches a page and extracts matching links. TODO: pageUrl should be an array
func BuildPaginatedGenericDiscover ¶
func BuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) (DiscoverFunc, error)
BuildPaginatedGenericDiscover returns a DiscoverFunc which fetches links and steps through multiple pages.
func BuildRSSDiscover ¶
func BuildRSSDiscover(scraperName string, feeds []string) (DiscoverFunc, error)
BuildRSSDiscover returns a DiscoverFunc which grabs links from RSS feeds.
func MustBuildGenericDiscover ¶
func MustBuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) DiscoverFunc
TODO: kill this once a proper config parser is in place
func MustBuildPaginatedGenericDiscover ¶
func MustBuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) DiscoverFunc
TODO: kill this once a proper config parser is in place
func MustBuildRSSDiscover ¶
func MustBuildRSSDiscover(scraperName string, feeds []string) DiscoverFunc
TODO: kill this once a proper config parser is in place
type PressRelease ¶
type PressRelease struct {
	Title     string    `json:"title"`
	Source    string    `json:"source"`
	Permalink string    `json:"permalink"`
	PubDate   time.Time `json:"published"`
	Content   string    `json:"text"`
	Type      string    `json:"type"`
}
PressRelease is the data we're scraping and storing. TODO: support multiple urls
type ScrapeFunc ¶
type ScrapeFunc func(pr *PressRelease, doc *html.Node) error
ScrapeFunc is for scraping a single press release from HTML.
func BuildGenericScrape ¶
func BuildGenericScrape(source, title, content, cruft, pubDate string) (ScrapeFunc, error)
BuildGenericScrape builds a function which scrapes a press release from raw HTML, based on a set of CSS selector strings.
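A hedged usage sketch (the selector strings are invented for an imaginary site; the comments follow the parameter names in the signature):

scrape, err := BuildGenericScrape(
	"example",           // source name
	"h1.title",          // title selector
	"div.release-body",  // content selector
	"div.share, script", // cruft to strip out of the content
	"p.date",            // pubDate selector
)
if err != nil {
	log.Fatal(err)
}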
func MustBuildGenericScrape ¶
func MustBuildGenericScrape(source, title, content, cruft, pubDate string) ScrapeFunc
TODO: kill this once a proper config parser is in place
type Scraper ¶
type Scraper struct {
	Name     string
	Discover DiscoverFunc
	Scrape   ScrapeFunc
}
Scraper lets you pick-and-mix various discover and scrape functions.
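A sketch of composing one, pairing a hand-rolled discover (like discoverFromList above) with a generic scrape (selectors invented):

scraper := &Scraper{
	Name:     "example",
	Discover: discoverFromList, // hand-rolled; see DiscoverFunc above
	Scrape:   MustBuildGenericScrape("example", "h1.title", "div.release-body", "", "p.date"),
}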
type Store ¶
type Store interface {
	WhichAreNew(incoming []*PressRelease) []*PressRelease
	Stash(pr *PressRelease) (*pressReleaseEvent, error)
	Replay(channel, lastEventId string) chan eventsource.Event
}
type TestStore ¶
type TestStore struct {
// contains filtered or unexported fields
}
func NewTestStore ¶
func (*TestStore) Replay ¶
func (store *TestStore) Replay(channel, lastEventId string) chan eventsource.Event
func (*TestStore) Stash ¶
func (store *TestStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)
func (*TestStore) WhichAreNew ¶
func (store *TestStore) WhichAreNew(incoming []*PressRelease) []*PressRelease