Documentation ¶
Index ¶
- func CompressSpace(s string) string
- func Contains(container *html.Node, n *html.Node) bool
- func DescribeNode(n *html.Node) string
- func DumpTree(n *html.Node, depth int)
- func GetAttr(n *html.Node, attr string) string
- func GetTextContent(n *html.Node) string
- func ParseTime(s string) (time.Time, error)
- func RenderText(n *html.Node) string
- func ServerMain(dbFile string, configfunc ConfigureFunc)
- func StripComments(n *html.Node)
- type ConfigureFunc
- type DBStore
- type DiscoverFunc
- func BuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) (DiscoverFunc, error)
- func BuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) (DiscoverFunc, error)
- func BuildRSSDiscover(scraperName string, feeds []string) (DiscoverFunc, error)
- func MustBuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) DiscoverFunc
- func MustBuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) DiscoverFunc
- func MustBuildRSSDiscover(scraperName string, feeds []string) DiscoverFunc
- type PressRelease
- type ScrapeFunc
- type Scraper
- type Store
- type TestStore
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CompressSpace ¶
func CompressSpace(s string) string
CompressSpace reduces all whitespace sequences (spaces, tabs, newlines, etc.) in a string to a single space. Leading and trailing whitespace is trimmed. This has the effect of collapsing a multiline string onto one line.
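A minimal sketch of the documented behaviour (an assumption, not necessarily the package's actual implementation; uses the standard strings package):

// compressSpace collapses whitespace runs to single spaces and trims
// the ends: strings.Fields splits on any run of whitespace, and Join
// reassembles the pieces with single spaces.
func compressSpace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}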
func Contains ¶
func Contains(container *html.Node, n *html.Node) bool
func DescribeNode ¶
func DescribeNode(n *html.Node) string
DescribeNode generates a debug string describing the node. Returns a string of the form "<element#id.class>" (i.e. like a CSS selector).
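A sketch of how such a string might be assembled (an assumption, not the package's actual code; uses strings and golang.org/x/net/html):

// describeNode builds a CSS-selector-ish debug string for an element
// node, e.g. <div#main.article.lead>.
func describeNode(n *html.Node) string {
	if n.Type != html.ElementNode {
		return "<non-element>"
	}
	desc := n.Data // element name, e.g. "div"
	for _, a := range n.Attr {
		switch a.Key {
		case "id":
			desc += "#" + a.Val
		case "class":
			for _, cls := range strings.Fields(a.Val) {
				desc += "." + cls
			}
		}
	}
	return "<" + desc + ">"
}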
func DumpTree ¶
func DumpTree(n *html.Node, depth int)
func GetAttr ¶
func GetAttr(n *html.Node, attr string) string
GetAttr retrieves the value of an attribute on a node. Returns an empty string if the attribute doesn't exist.
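The typical lookup is a linear scan of the node's attribute list; a sketch (uses golang.org/x/net/html):

// getAttr scans the node's Attr slice and returns the first matching
// value, or "" if the attribute isn't present.
func getAttr(n *html.Node, attr string) string {
	for _, a := range n.Attr {
		if a.Key == attr {
			return a.Val
		}
	}
	return ""
}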
func GetTextContent ¶
func GetTextContent(n *html.Node) string
GetTextContent recursively fetches the text content of a node.
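A sketch of the usual recursion (an assumption, not necessarily the actual implementation; uses strings and golang.org/x/net/html):

// getTextContent concatenates the Data of every text node in the
// subtree rooted at n.
func getTextContent(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		sb.WriteString(getTextContent(c))
	}
	return sb.String()
}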
func ParseTime ¶
func ParseTime(s string) (time.Time, error)
func RenderText ¶
func RenderText(n *html.Node) string
RenderText returns the node's text, using whitespace and line breaks to make it readable.
func ServerMain ¶
func ServerMain(dbFile string, configfunc ConfigureFunc)
ServerMain is the entry point for running the server. It handles command-line flags and the rest of the setup. The idea is that you can easily write a new server with a different set of scrapers: the real main() is just a small stub which instantiates the scrapers, then passes control to ServerMain. See ukpr/main.go for an example.
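For illustration, a hypothetical stub in the spirit of ukpr/main.go. The exact ConfigureFunc signature isn't shown on this page, so the shape of configure below (returning the scrapers to run) is an assumption, as are the feed URL and selectors:

func main() {
	// configure is hypothetical: ConfigureFunc's real signature is
	// defined by this package, and is assumed here to supply the
	// scrapers the server should run.
	configure := func() ([]*Scraper, error) {
		return []*Scraper{
			{
				Name:     "example",
				Discover: MustBuildRSSDiscover("example", []string{"http://example.com/feed.rss"}),
				Scrape:   MustBuildGenericScrape("example", "h1.title", "div.release-body", "", "p.date"),
			},
		}, nil
	}
	ServerMain("press.db", configure)
}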
func StripComments ¶
func StripComments(n *html.Node)
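No doc comment is given; judging by the name and signature, it presumably removes comment nodes from the tree. A sketch of that behaviour (an assumption, not the package's code; uses golang.org/x/net/html):

// stripComments walks the subtree and removes any comment nodes.
func stripComments(n *html.Node) {
	var next *html.Node
	for c := n.FirstChild; c != nil; c = next {
		next = c.NextSibling // grab before possible removal
		if c.Type == html.CommentNode {
			n.RemoveChild(c)
		} else {
			stripComments(c)
		}
	}
}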
Types ¶
type ConfigureFunc ¶
type DBStore ¶
type DBStore struct {
// contains filtered or unexported fields
}
DBStore manages an archive of recent press releases in an SQLite database. It also implements eventsource.Repository, so that press releases can be streamed out as server-sent events. It can stash press releases from multiple sources.
func NewDBStore ¶
func (*DBStore) Replay ¶
func (store *DBStore) Replay(channel, lastEventId string) (out chan eventsource.Event)
Replay handles Last-Event-ID catchups. Note: channel contains the source name (e.g. 'tesco').
func (*DBStore) Stash ¶
func (store *DBStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)
Stash adds a press release to the store.
func (*DBStore) WhichAreNew ¶
func (store *DBStore) WhichAreNew(incoming []*PressRelease) []*PressRelease
WhichAreNew returns the incoming press releases with the ones already in the store culled out.
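Together with Stash, this supports the usual polling pattern; a sketch (error handling kept minimal; store is a DBStore or any other Store, and incoming is whatever a DiscoverFunc returned):

// Cull press releases we've already seen, then stash the fresh ones.
for _, pr := range store.WhichAreNew(incoming) {
	if _, err := store.Stash(pr); err != nil {
		log.Printf("stash %s: %v", pr.Permalink, err)
	}
}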
type DiscoverFunc ¶
type DiscoverFunc func() ([]*PressRelease, error)
DiscoverFunc is for fetching a list of 'current' press releases (via RSS feed, by scraping an index page, or whatever). The results are passed back as PressRelease structs. At the very least, the Permalink field must be set to the URL of the press release, but there's no reason Discover() can't fill out all the fields if the data is available (e.g. some RSS feeds have everything required). For incomplete PressReleases, the framework will fetch the HTML from the Permalink URL and invoke Scrape() to complete the data.
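As a sketch, a hand-rolled DiscoverFunc that returns minimal PressReleases, leaving the framework to fetch each page and Scrape() the rest (the URLs and source name are placeholders):

// discoverFromList is a hypothetical DiscoverFunc: only Permalink (and
// Source) are filled in; Scrape() completes the remaining fields later.
func discoverFromList() ([]*PressRelease, error) {
	urls := []string{
		"http://example.com/press/1",
		"http://example.com/press/2",
	}
	prs := make([]*PressRelease, 0, len(urls))
	for _, u := range urls {
		prs = append(prs, &PressRelease{Source: "example", Permalink: u})
	}
	return prs, nil
}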
func BuildGenericDiscover ¶
func BuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) (DiscoverFunc, error)
BuildGenericDiscover returns a DiscoverFunc which fetches a page and extracts matching links. TODO: pageUrl should be an array
func BuildPaginatedGenericDiscover ¶
func BuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) (DiscoverFunc, error)
BuildPaginatedGenericDiscover returns a DiscoverFunc which fetches links and steps through multiple pages.
func BuildRSSDiscover ¶
func BuildRSSDiscover(scraperName string, feeds []string) (DiscoverFunc, error)
BuildRSSDiscover returns a DiscoverFunc which grabs links from RSS feeds.
func MustBuildGenericDiscover ¶
func MustBuildGenericDiscover(scraperName, pageUrl, linkSelector string, allowHostChange bool) DiscoverFunc
TODO: kill this once a proper config parser is in place
func MustBuildPaginatedGenericDiscover ¶
func MustBuildPaginatedGenericDiscover(scraperName, startUrl, nextPageSelector, linkSelector string) DiscoverFunc
TODO: kill this once a proper config parser is in place
func MustBuildRSSDiscover ¶
func MustBuildRSSDiscover(scraperName string, feeds []string) DiscoverFunc
TODO: kill this once a proper config parser is in place
type PressRelease ¶
type PressRelease struct {
	Title     string    `json:"title"`
	Source    string    `json:"source"`
	Permalink string    `json:"permalink"`
	PubDate   time.Time `json:"published"`
	Content   string    `json:"text"`
	Type      string    `json:"type"`
}
PressRelease is the data we're scraping and storing. TODO: support multiple urls
type ScrapeFunc ¶
type ScrapeFunc func(pr *PressRelease, doc *html.Node) error
ScrapeFunc is for scraping a single press release from HTML.
func BuildGenericScrape ¶
func BuildGenericScrape(source, title, content, cruft, pubDate string) (ScrapeFunc, error)
BuildGenericScrape builds a function which scrapes a press release from raw HTML, based on a set of CSS selector strings.
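A hedged usage sketch (the selector strings are invented for an imaginary site; the comments follow the parameter names in the signature):

scrape, err := BuildGenericScrape(
	"example",           // source name
	"h1.title",          // title selector
	"div.release-body",  // content selector
	"div.share, script", // cruft to strip out of the content
	"p.date",            // pubDate selector
)
if err != nil {
	log.Fatal(err)
}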
func MustBuildGenericScrape ¶
func MustBuildGenericScrape(source, title, content, cruft, pubDate string) ScrapeFunc
TODO: kill this once a proper config parser is in place
type Scraper ¶
type Scraper struct {
	Name     string
	Discover DiscoverFunc
	Scrape   ScrapeFunc
}
Scraper lets you pick-and-mix various discover and scrape functions.
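A sketch of composing one, pairing a hand-rolled discover (like discoverFromList above) with a generic scrape (selectors invented):

scraper := &Scraper{
	Name:     "example",
	Discover: discoverFromList, // hand-rolled; see DiscoverFunc above
	Scrape:   MustBuildGenericScrape("example", "h1.title", "div.release-body", "", "p.date"),
}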
type Store ¶
type Store interface {
	WhichAreNew(incoming []*PressRelease) []*PressRelease
	Stash(pr *PressRelease) (*pressReleaseEvent, error)
	Replay(channel, lastEventId string) chan eventsource.Event
}
type TestStore ¶
type TestStore struct {
// contains filtered or unexported fields
}
func NewTestStore ¶
func (*TestStore) Replay ¶
func (store *TestStore) Replay(channel, lastEventId string) chan eventsource.Event
func (*TestStore) Stash ¶
func (store *TestStore) Stash(pr *PressRelease) (*pressReleaseEvent, error)
func (*TestStore) WhichAreNew ¶
func (store *TestStore) WhichAreNew(incoming []*PressRelease) []*PressRelease