Documentation
Overview
Package swan implements the Goose HTML Content / Article Extractor algorithm.
Currently, swan will try to extract the following content types:
Comics: if something looks like a web comic, it will be extracted as just an image. This is a WIP.
Everything else: it will look for article text and try to extract any header image that goes with it.
Constants
const (
	// Version of the library
	Version = "1.0"
)
Variables
This section is empty.
Functions
Types
type Article
type Article struct {
	// Final URL after all redirects
	URL string

	// Newline-separated and cleaned content
	CleanedText string

	// Node from which CleanedText was created. Call .Html() on this to get
	// printable HTML.
	TopNode *goquery.Selection

	// A header image to use for the article. Nil if no image could be
	// detected.
	Img *Image

	// All metadata associated with the original document
	Meta struct {
		Authors     []string
		Canonical   string
		Description string
		Domain      string
		Favicon     string
		Keywords    string
		Links       []string
		Lang        string
		OpenGraph   map[string]string
		PublishDate string
		Tags        []string
		Title       string
	}

	// Full document backing this article
	Doc *goquery.Document
	// contains filtered or unexported fields
}
Article is a fully extracted and cleaned document.
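TopNode and Img are both optional and should be nil-checked before use. The sketch below is a minimal illustration, not part of the package: describeArticle is a hypothetical helper that only reads the exported fields documented above, and it assumes fmt and strings are imported.

// describeArticle is a hypothetical helper that prints the parts of an
// already extracted Article, guarding the fields that may be nil.
func describeArticle(a *Article) {
	fmt.Println("URL:", a.URL)
	fmt.Println("Title:", a.Meta.Title)
	fmt.Println("Authors:", strings.Join(a.Meta.Authors, ", "))

	if a.Img != nil {
		// A header image was detected for this article
		fmt.Printf("Header image: %+v\n", a.Img)
	}

	if a.TopNode != nil {
		// TopNode backs CleanedText; .Html() returns printable HTML
		html, err := a.TopNode.Html()
		if err == nil {
			fmt.Println("Article HTML:", strings.TrimSpace(html))
		}
	}
}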
func FromDoc
FromDoc does its best to extract an article from a single document.
Pass in the URL the document came from so that images can be resolved correctly.
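If the page has already been parsed with goquery (for example straight from an HTTP response body), FromDoc avoids re-parsing. The sketch below is an assumption-laden illustration: it assumes FromDoc takes the source URL and a *goquery.Document, mirroring FromHTML below, and that fmt, net/http, and github.com/PuerkitoBio/goquery are imported.

// Sketch: extract from a document that was already parsed with goquery.
// The FromDoc(url, doc) parameter order used here is an assumption.
resp, err := http.Get("http://example.com/article/1")
if err != nil {
	panic(err)
}
defer resp.Body.Close()

doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	panic(err)
}

// Pass the original URL so relative image links can be resolved
a, err := FromDoc("http://example.com/article/1", doc)
if err != nil {
	panic(err)
}

fmt.Println("Title:", a.Meta.Title)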
func FromHTML
FromHTML does its best to extract an article from a single HTML page.
Pass in the URL the document came from so that images can be resolved correctly.
Example
htmlIn := `<html>
	<head>
		<title>Example Title</title>
		<meta property="og:site_name" content="Example Name"/>
	</head>
	<body>
		<p>some article body with a bunch of text in it</p>
	</body>
</html>`

a, err := FromHTML("http://example.com/article/1", []byte(htmlIn))
if err != nil {
	panic(err)
}

if a.TopNode == nil {
	panic("no article could be extracted, " +
		"but a.Doc and a.Meta are still cleaned " +
		"and can be messed with")
}

// Get the document title
fmt.Printf("Title: %s\n", a.Meta.Title)

// Hit any open graph tags
fmt.Printf("Site Name: %s\n", a.Meta.OpenGraph["site_name"])

// Print out any cleaned-up HTML that was found
html, _ := a.TopNode.Html()
fmt.Printf("HTML: %s\n", strings.TrimSpace(html))

// Print out any cleaned-up text that was found
fmt.Printf("Plain: %s\n", a.CleanedText)
Output:

Title: Example Title
Site Name: Example Name
HTML: <p>some article body with a bunch of text in it</p>
Plain: some article body with a bunch of text in it