Documentation ¶
Index ¶
- Variables
- func APICrawlHandler(c *Context) http.Handler
- func APIQueueHandler(c *Context) http.Handler
- func APIQueuesHandler(c *Context) http.Handler
- func APIRoutes(m *mux.Router, c *Context)
- func APISearchHandler(c *Context) http.Handler
- func APISitesHandler(c *Context) http.Handler
- func Contents(resp *http.Response) []byte
- func Crawl(url string, c *Context, q *Queue) error
- func ExtractLinks(doc *goquery.Document) []string
- func ExtractText(doc *goquery.Document) string
- func ExtractTitle(doc *goquery.Document) string
- func Get(req *http.Request) (*http.Response, error)
- func IndexPage(c *Context, q *Queue, url, site string) error
- func Links(doc *goquery.Document, q *Queue, site string)
- func MustGet(req *http.Request) (*http.Response, error)
- func Normalise(word string) string
- func ProcessPages(c *Context, q *Queue, site string, delay int64)
- func ProcessURL(link, site string) (string, error)
- func Request(url string) *http.Request
- func RootURL(link string) (string, error)
- func Stopper(word string) bool
- type Config
- type Context
- type Document
- type Index
- type Indexes
- type Queue
- type QueueList
- type Queues
- type Response
- type Result
- type Results
Variables ¶
var (
	// UserAgent is passed on each HTTP request to identify the crawler.
	UserAgent = "Miru/1.0 (+http://www.miru.nylar.io)"
	// UnwantedTags are stripped from all HTML documents.
	UnwantedTags = "style, script, link, iframe, frame, embed"
	// ErrUnreachableURL for when the error doesn't return 200 OK.
	ErrUnreachableURL = errors.New("Url did not return a 200 OK response.")
	// ErrInvalidURL for when not a valid URL.
	ErrInvalidURL = errors.New("Url was invalid.")
	// Delay is time in between each crawl
	Delay int64 = 5
)
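The UserAgent value is sent with each HTTP request. As a minimal sketch of what that can look like with the standard library alone, the hypothetical helper newRequest below builds a GET request and sets the header. The helper name and the example URL are assumptions for illustration; this is not the package's own Request function.

```go
package main

import (
	"fmt"
	"net/http"
)

// userAgent mirrors the UserAgent variable documented above.
const userAgent = "Miru/1.0 (+http://www.miru.nylar.io)"

// newRequest is a hypothetical helper (not part of the package): it builds
// a GET request and identifies the crawler through the User-Agent header.
func newRequest(rawurl string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, rawurl, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", userAgent)
	return req, nil
}

func main() {
	req, err := newRequest("http://example.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// No network traffic happens here; the request is only constructed.
	fmt.Println(req.Header.Get("User-Agent"))
}
```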
var DefaultConfig = `
[database]
host = "localhost:28015"
name = "miru"
[tables]
index = "indexes"
document = "documents"
[api]
port = "8036"
`
DefaultConfig is used if no config.toml file is found. It sets the config to acceptable defaults.
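The fallback described above can be sketched with the standard library. The helper configText is hypothetical and only returns the raw TOML text; parsing it is left out. One assumption to note: this sketch falls back to the defaults on any read error, not only when the file is missing.

```go
package main

import (
	"fmt"
	"os"
)

// defaultConfig mirrors the DefaultConfig variable documented above.
const defaultConfig = `
[database]
host = "localhost:28015"
name = "miru"
[tables]
index = "indexes"
document = "documents"
[api]
port = "8036"
`

// configText is a hypothetical helper (not part of the package): it returns
// the contents of the file at path, or the defaults if the file cannot be read.
func configText(path string) string {
	data, err := os.ReadFile(path)
	if err != nil {
		return defaultConfig
	}
	return string(data)
}

func main() {
	// Prints the file if it exists in the working directory, otherwise the defaults.
	fmt.Print(configText("config.toml"))
}
```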
Functions ¶
func APICrawlHandler ¶
APICrawlHandler (GET) accepts a URL to be crawled. The crawl runs recursively in the background.
func APIQueueHandler ¶
APIQueueHandler (GET) returns a single queue.
func APIQueuesHandler ¶
APIQueuesHandler (GET) returns a list of active queues.
func APIRoutes ¶
APIRoutes configures the routes for the API. Cross-origin resource sharing is applied to each route so that it can be reached by external requests.
func APISearchHandler ¶
APISearchHandler (GET) allows one to search the datastore. It accepts one parameter, 'q', which is a URL-encoded string.
func APISitesHandler ¶
APISitesHandler (GET) returns a list of sites.
func ExtractLinks ¶
ExtractLinks returns all internal links from a page.
func ExtractText ¶
ExtractText returns all p tags in a page.
func ExtractTitle ¶
ExtractTitle looks for either a title tag or an h1 tag and sets that as the title.
func ProcessPages ¶
ProcessPages processes all queue items and indexes them.
func ProcessURL ¶
ProcessURL determines whether a URL is to be enqueued or not.
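The rules ProcessURL applies are not spelled out here. As an illustration only, the sketch below shows one plausible policy under stated assumptions: resolve the link against the site, accept only http(s) URLs on the same host, and drop the fragment. The function resolveInternal and the error errInvalidURL are hypothetical stand-ins, not the package's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// errInvalidURL is a local stand-in for the package's ErrInvalidURL.
var errInvalidURL = errors.New("url was invalid")

// resolveInternal is a hypothetical sketch, not Miru's ProcessURL. It
// resolves link against site and returns the absolute URL only if it is an
// http(s) URL on the same host as site.
func resolveInternal(link, site string) (string, error) {
	base, err := url.Parse(site)
	if err != nil || base.Host == "" {
		return "", errInvalidURL
	}
	ref, err := url.Parse(link)
	if err != nil {
		return "", errInvalidURL
	}
	abs := base.ResolveReference(ref)
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", errInvalidURL
	}
	if abs.Host != base.Host {
		return "", errInvalidURL // external link: do not enqueue
	}
	abs.Fragment = "" // the same page with a different anchor is one URL
	return abs.String(), nil
}

func main() {
	for _, link := range []string{"/about#team", "http://other.org/", "mailto:a@b.c"} {
		u, err := resolveInternal(link, "http://example.com")
		fmt.Printf("%q -> %q, %v\n", link, u, err)
	}
}
```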
Types ¶
type Config ¶
type Config struct {
	Database database
	Tables   tables
	Api      api
}
Config holds configuration information regarding the database and the port on which to serve.
func LoadConfig ¶
LoadConfig loads configuration data into the Config struct.
type Context ¶
Context holds database, configuration and queue data.
func NewContext ¶
func NewContext() *Context
NewContext instantiates a new context and initialises a queue.
func (*Context) InitQueues ¶
func (c *Context) InitQueues()
InitQueues initialises a new queue list.
func (*Context) LoadConfig ¶
LoadConfig reads a given file from the filesystem; if the file is not found, it uses the default config.
type Document ¶
type Document struct {
	DocID   string `gorethink:"id" json:"document_id"`
	Url     string `gorethink:"url" json:"url"`
	Site    string `gorethink:"site" json:"site"`
	Title   string `gorethink:"title" json:"title"`
	Content string `gorethink:"content" json:"content"`
}
Document stores data about a page.
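The json struct tags determine how a Document appears in API responses. The sketch below copies the struct as documented and marshals a sample value with encoding/json. The helper encode and the sample field values are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Document is copied from the type documented above.
type Document struct {
	DocID   string `gorethink:"id" json:"document_id"`
	Url     string `gorethink:"url" json:"url"`
	Site    string `gorethink:"site" json:"site"`
	Title   string `gorethink:"title" json:"title"`
	Content string `gorethink:"content" json:"content"`
}

// encode is a hypothetical helper that renders a Document as JSON text.
func encode(d Document) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	// Sample values are illustrative only.
	out, err := encode(Document{
		DocID:   "1",
		Url:     "http://example.com/",
		Site:    "example.com",
		Title:   "Example",
		Content: "Hello",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Note that DocID is serialised under the key document_id, matching the DocID field of Index.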
func NewDocument ¶
NewDocument creates a new document instance.
type Index ¶
type Index struct {
	IndexID string `gorethink:"id" json:"index_id"`
	DocID   string `gorethink:"doc_id" json:"document_id"`
	Word    string `gorethink:"word" json:"word"`
	Count   int64  `gorethink:"count" json:"count"`
}
Index stores data on a given word in a document.
type Indexes ¶
type Indexes []*Index
Indexes is a slice of Index that holds all the words in a document.
func RemoveDuplicates ¶
RemoveDuplicates counts the number of duplicates and then keeps only the unique values.
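The counting-then-deduplicating idea can be sketched as follows. This is an assumption about the behaviour, not the package's code: the function dedupe is hypothetical, it uses a trimmed-down Index with only the Word and Count fields, and it treats every input entry as one occurrence of its word.

```go
package main

import "fmt"

// Index is a trimmed-down stand-in for the documented type.
type Index struct {
	Word  string
	Count int64
}

// dedupe is a hypothetical sketch: it keeps one entry per distinct word, in
// order of first appearance, and sets Count to the number of occurrences.
func dedupe(in []*Index) []*Index {
	seen := make(map[string]*Index)
	var out []*Index
	for _, idx := range in {
		if first, ok := seen[idx.Word]; ok {
			first.Count++ // duplicate: bump the count on the kept entry
			continue
		}
		entry := &Index{Word: idx.Word, Count: 1}
		seen[idx.Word] = entry
		out = append(out, entry)
	}
	return out
}

func main() {
	words := []*Index{{Word: "go"}, {Word: "crawl"}, {Word: "go"}, {Word: "go"}}
	for _, idx := range dedupe(words) {
		fmt.Println(idx.Word, idx.Count)
	}
}
```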
type Queue ¶
type Queue struct {
	Manager map[string]bool `json:"manager"`
	Items   []string        `json:"items"`
	Name    string          `json:"name"`
	Status  string          `json:"status"`
	sync.Mutex
}
Queue holds data regarding a queue.
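Because Queue embeds sync.Mutex, its fields can be guarded while several goroutines add URLs. The sketch below is a guess at how the fields might fit together, not the package's logic: it assumes Manager records which URLs have already been seen, and the newQueue and enqueue helpers are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Queue is copied from the type documented above.
type Queue struct {
	Manager map[string]bool `json:"manager"`
	Items   []string        `json:"items"`
	Name    string          `json:"name"`
	Status  string          `json:"status"`
	sync.Mutex
}

// newQueue is a hypothetical constructor; the map must be initialised
// before it is written to.
func newQueue(name string) *Queue {
	return &Queue{Manager: make(map[string]bool), Name: name}
}

// enqueue is a hypothetical method. Assumption: Manager acts as a "seen"
// set, so each URL is queued at most once. It reports whether url was added.
func (q *Queue) enqueue(url string) bool {
	q.Lock()
	defer q.Unlock()
	if q.Manager[url] {
		return false
	}
	q.Manager[url] = true
	q.Items = append(q.Items, url)
	return true
}

func main() {
	q := newQueue("example.com")
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			q.enqueue("http://example.com/")
		}()
	}
	wg.Wait()
	fmt.Println(len(q.Items)) // 1: the same URL is stored only once
}
```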
type QueueList ¶
type QueueList []queueList
QueueList is a sortable interface for keeping queue items in order.
type Results ¶
type Results struct {
	Speed   float64  `json:"speed"`
	Count   int64    `json:"count"`
	Results []Result `json:"results"`
}
Results holds all of the results, the time taken to perform the query and the number of results.
func (*Results) ParseQuery ¶
ParseQuery splits words into a list of individual words.
func (*Results) RenderCount ¶
RenderCount formats the number of results.
func (*Results) RenderCountHTML ¶
RenderCountHTML formats the number of results and escapes for use in templates.
func (*Results) RenderSpeed ¶
RenderSpeed formats the speed of the query.
func (*Results) RenderSpeedHTML ¶
RenderSpeedHTML formats the speed of the query and escapes for use in templates.
Directories ¶
Path | Synopsis |
---|---|
Godeps | |
_workspace/src/code.google.com/p/goprotobuf/proto | Package proto converts data structures to and from the wire format of protocol buffers. |
_workspace/src/github.com/PuerkitoBio/goquery | Package goquery implements features similar to jQuery, including the chainable syntax, to manipulate and query an HTML document. |
_workspace/src/github.com/andybalholm/cascadia | The cascadia package is an implementation of CSS selectors. |
_workspace/src/github.com/dancannon/gorethink | Go driver for RethinkDB. Current version: v0.6.2 (RethinkDB v1.16). For more in depth information on how to use RethinkDB check out the API docs at http://rethinkdb.com/api |
_workspace/src/github.com/gorilla/context | Package context stores values shared during a request lifetime. |
_workspace/src/github.com/gorilla/mux | Package gorilla/mux implements a request router and dispatcher. |
_workspace/src/github.com/rs/cors | Package cors is net/http handler to handle CORS related requests as defined by http://www.w3.org/TR/cors/ You can configure it by passing an option struct to cors.New: c := cors.New(cors.Options{ AllowedOrigins: []string{"foo.com"}, AllowedMethods: []string{"GET", "POST", "DELETE"}, AllowCredentials: true, }) Then insert the handler in the chain: handler = c.Handler(handler) See Options documentation for more options. |
_workspace/src/github.com/satori/go.uuid | Package uuid provides implementation of Universally Unique Identifier (UUID). |
_workspace/src/github.com/stretchr/testify/assert | A set of comprehensive testing tools for use with the normal Go testing system. |
_workspace/src/golang.org/x/net/html | Package html implements an HTML5-compliant tokenizer and parser. |
_workspace/src/golang.org/x/net/html/atom | Package atom provides integer codes (also known as atoms) for a fixed set of frequently occurring HTML strings: tag names and attribute keys such as "p" and "id". |
_workspace/src/golang.org/x/net/html/charset | Package charset provides common text encodings for HTML documents. |
cmd | |