goose

package module

Version: v0.0.0-...-acf6fdd | Published: May 23, 2019 | License: Apache-2.0 | Imports: 23 | Imported by: 0

Repository

github.com/jaytaylor/GoOse

README ¶

GoOse

HTML Content / Article Extractor in Golang

Description

This is a Golang port of "Goose", originally licensed to Gravity.com under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership.

The Golang port was written by Antonio Linari.

Gravity.com licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

INSTALL

go get github.com/jaytaylor/GoOse

HOW TO USE IT

package main

import (
	"fmt"
	"log"

	// The package name is goose, although the import path ends in GoOse.
	goose "github.com/jaytaylor/GoOse"
)

func main() {
	g := goose.New()
	article, err := g.ExtractFromURL("http://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html")
	if err != nil {
		// Stop here: article is not usable when extraction fails.
		log.Fatal(err)
	}
	fmt.Println("title", article.Title)
	fmt.Println("description", article.MetaDescription)
	fmt.Println("keywords", article.MetaKeywords)
	fmt.Println("content", article.CleanedText)
	fmt.Println("url", article.FinalURL)
	fmt.Println("top image", article.TopImage)
}

Development - Getting started

This application is written in Go. Please refer to the guides at https://golang.org to get started.

This project includes a Makefile that lets you test and build the project with simple commands. To see all available options:

make help

Before committing code, please check that it passes all tests:

make deps
make qa

TODO

better organize code
improve "xpath" like queries
add other image extraction techniques (imagemagick)

THANKS TO

@Martin Angers for goquery
@Fatih Arslan for set
GoLang team for the amazing language and net/html

Documentation ¶

Index ¶

func NormaliseCharset(characterSet string) string
func OpenGraphResolver(doc *goquery.Document) string
func ReadLinesOfFile(filename string) []string
func UTF8encode(raw string, sourceCharset string) string
func WebPageResolver(article *Article) string
type Article
- func (article *Article) ToString() string
type Cleaner
- func NewCleaner(config Configuration) Cleaner
- func (c *Cleaner) Clean(docToClean *goquery.Document) *goquery.Document
type Configuration
- func GetDefaultConfiguration(args ...string) Configuration
type ContentExtractor
- func NewExtractor(config Configuration) ContentExtractor
- func (extr *ContentExtractor) CalculateBestNode(document *goquery.Document) *goquery.Selection
- func (extr *ContentExtractor) GetCanonicalLink(document *goquery.Document) string
- func (extr *ContentExtractor) GetCleanTextAndLinks(topNode *goquery.Selection, lang string) (string, []string)
- func (extr *ContentExtractor) GetDomain(canonicalLink string) string
- func (extr *ContentExtractor) GetFavicon(document *goquery.Document) string
- func (extr *ContentExtractor) GetMetaAuthor(document *goquery.Document) string
- func (extr *ContentExtractor) GetMetaContent(document *goquery.Document, metaName string) string
- func (extr *ContentExtractor) GetMetaContentLocation(document *goquery.Document) string
- func (extr *ContentExtractor) GetMetaContentWithSelector(document *goquery.Document, selector string) string
- func (extr *ContentExtractor) GetMetaContents(document *goquery.Document, metaNames set.Interface) map[string]string
- func (extr *ContentExtractor) GetMetaDescription(document *goquery.Document) string
- func (extr *ContentExtractor) GetMetaKeywords(document *goquery.Document) string
- func (extr *ContentExtractor) GetMetaLanguage(document *goquery.Document) string
- func (extr *ContentExtractor) GetPublishDate(document *goquery.Document) *time.Time
- func (extr *ContentExtractor) GetTags(document *goquery.Document) set.Interface
- func (extr *ContentExtractor) GetTitle(document *goquery.Document) string
- func (extr *ContentExtractor) PostCleanup(targetNode *goquery.Selection) *goquery.Selection
type Crawler
- func NewCrawler(config Configuration, url string, RawHTML string) Crawler
- func (c Crawler) Crawl() (*Article, error)
- func (c Crawler) GetCharset(document *goquery.Document) string
- func (c Crawler) GetContentType(document *goquery.Document) string
- func (c *Crawler) Preprocess() (*goquery.Document, error)
- func (c *Crawler) SetCharset(cs string)
type Goose
- func New(args ...string) Goose
- func (g Goose) ExtractFromRawHTML(url string, RawHTML string) (*Article, error)
- func (g Goose) ExtractFromURL(url string) (*Article, error)
type Parser
- func NewParser() *Parser
type StopWords
- func NewStopwords() StopWords
- func (stop StopWords) SimpleLanguageDetector(text string) string
type VideoExtractor
- func NewVideoExtractor() VideoExtractor
- func (ve *VideoExtractor) GetVideos(doc *goquery.Document) set.Interface

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func NormaliseCharset ¶

func NormaliseCharset(characterSet string) string

NormaliseCharset overrides/fixes charset names to something we can parse. It fixes common misspellings and uses a canonical name for equivalent encodings. @see https://encoding.spec.whatwg.org#names-and-labels
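
The idea of mapping equivalent labels to one canonical name can be sketched as follows. This is a minimal illustration and not the package's implementation: the function name normaliseLabel and the alias table are assumptions, and the table holds only a handful of labels from the WHATWG list linked above.

```go
package main

import (
	"fmt"
	"strings"
)

// charsetAliases is a small, hypothetical alias table. The package's real
// table is not shown in this documentation.
var charsetAliases = map[string]string{
	"utf8":       "utf-8",
	"utf-8":      "utf-8",
	"latin1":     "windows-1252",
	"iso-8859-1": "windows-1252",
	"iso8859-1":  "windows-1252",
	"ascii":      "windows-1252",
	"us-ascii":   "windows-1252",
}

// normaliseLabel lower-cases and trims a charset label, then maps known
// aliases to one canonical name. Unknown labels are returned cleaned but
// otherwise unchanged.
func normaliseLabel(label string) string {
	cleaned := strings.ToLower(strings.TrimSpace(label))
	if canonical, ok := charsetAliases[cleaned]; ok {
		return canonical
	}
	return cleaned
}

func main() {
	for _, label := range []string{"UTF8", " Latin1 ", "ISO-8859-1", "koi8-r"} {
		fmt.Printf("%q -> %q\n", label, normaliseLabel(label))
	}
}
```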

func OpenGraphResolver ¶

func OpenGraphResolver(doc *goquery.Document) string

OpenGraphResolver returns OpenGraph properties

func ReadLinesOfFile ¶

func ReadLinesOfFile(filename string) []string

ReadLinesOfFile returns the lines from a file as a slice of strings

func UTF8encode ¶

func UTF8encode(raw string, sourceCharset string) string

UTF8encode converts a string from the source character set to UTF-8, skipping invalid byte sequences. @see http://stackoverflow.com/questions/32512500/ignore-illegal-bytes-when-decoding-text-with-go
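
As a rough illustration of "convert, and skip what cannot be decoded", the sketch below uses only the standard library. It is not the package's implementation: toUTF8 is a hypothetical name, and it understands just two cases, ISO-8859-1 input and input that is already meant to be UTF-8.

```go
package main

import (
	"fmt"
	"strings"
)

// toUTF8 is a hypothetical stand-in for illustration only. For "iso-8859-1"
// each byte maps to the Unicode code point of the same value. Any other
// charset is treated as UTF-8, and invalid byte sequences are dropped.
func toUTF8(raw string, sourceCharset string) string {
	if strings.EqualFold(sourceCharset, "iso-8859-1") {
		runes := make([]rune, 0, len(raw))
		for i := 0; i < len(raw); i++ {
			runes = append(runes, rune(raw[i]))
		}
		return string(runes)
	}
	// ToValidUTF8 replaces each run of invalid bytes with the given
	// replacement, here the empty string.
	return strings.ToValidUTF8(raw, "")
}

func main() {
	// 0xE9 is "é" in ISO-8859-1 but an invalid lone byte in UTF-8.
	fmt.Println(toUTF8("caf\xe9", "iso-8859-1"))
	fmt.Println(toUTF8("caf\xe9 au lait", "utf-8"))
}
```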

func WebPageResolver ¶

func WebPageResolver(article *Article) string

WebPageResolver fetches the main image from the HTML page

Types ¶

type Article ¶

type Article struct {
	Title           string             `json:"title,omitempty"`
	CleanedText     string             `json:"content,omitempty"`
	MetaDescription string             `json:"description,omitempty"`
	MetaLang        string             `json:"lang,omitempty"`
	MetaFavicon     string             `json:"favicon,omitempty"`
	MetaKeywords    string             `json:"keywords,omitempty"`
	CanonicalLink   string             `json:"canonicalurl,omitempty"`
	Domain          string             `json:"domain,omitempty"`
	TopNode         *goquery.Selection `json:"-"`
	TopImage        string             `json:"image,omitempty"`
	Tags            *set.Set           `json:"tags,omitempty"`
	Movies          *set.Set           `json:"movies,omitempty"`
	FinalURL        string             `json:"url,omitempty"`
	LinkHash        string             `json:"linkhash,omitempty"`
	RawHTML         string             `json:"rawhtml,omitempty"`
	Doc             *goquery.Document  `json:"-"`
	Links           []string           `json:"links,omitempty"`
	PublishDate     *time.Time         `json:"publishdate,omitempty"`
	AdditionalData  map[string]string  `json:"additionaldata,omitempty"`
	Delta           int64              `json:"delta,omitempty"`
}

Article is a collection of properties extracted from the HTML body
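
The struct tags above determine how an Article serialises to JSON: empty fields are omitted, and fields tagged "-" (TopNode, Doc) never appear. The sketch below shows that behaviour on a trimmed stand-in type. articleSummary and encode are hypothetical names; only the field names and tags are copied from the definition above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// articleSummary mirrors a few Article fields and their JSON tags. It is a
// trimmed stand-in for this example, not the real type.
type articleSummary struct {
	Title       string      `json:"title,omitempty"`
	CleanedText string      `json:"content,omitempty"`
	TopImage    string      `json:"image,omitempty"`
	FinalURL    string      `json:"url,omitempty"`
	Links       []string    `json:"links,omitempty"`
	Doc         interface{} `json:"-"` // never serialised
}

// encode marshals the summary to a JSON string.
func encode(a articleSummary) (string, error) {
	b, err := json.Marshal(a)
	return string(b), err
}

func main() {
	out, err := encode(articleSummary{
		Title:    "Example",
		FinalURL: "http://example.com/a",
		Doc:      "ignored",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	// Only title and url appear: the other fields are empty or tagged "-".
	fmt.Println(out)
}
```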

func (*Article) ToString ¶

func (article *Article) ToString() string

ToString is a simple method that just shows the title. TODO: add more fields and pretty print

type Cleaner ¶

type Cleaner struct {
	// contains filtered or unexported fields
}

Cleaner removes menus, ads, sidebars, etc. and leaves the main content

func NewCleaner ¶

func NewCleaner(config Configuration) Cleaner

NewCleaner returns a new instance of a Cleaner

func (*Cleaner) Clean ¶

func (c *Cleaner) Clean(docToClean *goquery.Document) *goquery.Document

Clean removes HTML elements around the main content and prepares the document for parsing

type Configuration ¶

type Configuration struct {
	LocalStoragePath        string `json:"localStoragePath"` //not used in this version
	ImagesMinBytes          int    `json:"imagesMinBytes"`   //not used in this version
	TargetLanguage          string `json:"targetLanguage"`
	ImageMagickConvertPath  string `json:"imageMagickConvertPath"`  //not used in this version
	ImageMagickIdentifyPath string `json:"imageMagickIdentifyPath"` //not used in this version
	BrowserUserAgent        string `json:"browserUserAgent"`
	Debug                   bool   `json:"debug"`
	ExtractPublishDate      bool   `json:"extractPublishDate"`
	AdditionalDataExtractor bool   `json:"additionalDataExtractor"`
	EnableImageFetching     bool   `json:"enableImageFetching"`
	UseMetaLanguage         bool   `json:"useMetaLanguage"`
	// contains filtered or unexported fields
}

Configuration is a wrapper for various config options

func GetDefaultConfiguration ¶

func GetDefaultConfiguration(args ...string) Configuration

GetDefaultConfiguration returns safe default configuration options

type ContentExtractor ¶

type ContentExtractor struct {
	// contains filtered or unexported fields
}

ContentExtractor can parse the HTML and fetch various properties

func NewExtractor ¶

func NewExtractor(config Configuration) ContentExtractor

NewExtractor returns a configured HTML parser

func (*ContentExtractor) CalculateBestNode ¶

func (extr *ContentExtractor) CalculateBestNode(document *goquery.Document) *goquery.Selection

CalculateBestNode looks for the HTML node most likely to contain the main content. It starts by looking for clusters of paragraphs. Each cluster is scored on the number of stopwords and the number of consecutive paragraphs, which together should form the block of text that the node surrounds. The score also depends on how high up the page the paragraphs are: comments are usually at the bottom and should get a lower score.
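
The scoring idea described above can be sketched on plain strings instead of HTML nodes. This is not the package's algorithm: the stopword list, the "at least two stopwords" threshold, the bonus of 5 and the penalty of 3 are all arbitrary values chosen for the example, and scoreParagraphs is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a tiny hypothetical English list; the package ships its own.
var stopwords = map[string]bool{
	"the": true, "a": true, "of": true, "and": true, "to": true,
	"in": true, "is": true, "it": true, "that": true,
}

// countStopwords returns how many words of text are in the stopword list.
func countStopwords(text string) int {
	n := 0
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if stopwords[strings.Trim(w, ".,;:!?\"'")] {
			n++
		}
	}
	return n
}

// scoreParagraphs scores each paragraph by its stopword count, adds a bonus
// when the previous paragraph also looked like prose (a cluster), and
// subtracts a penalty in the bottom quarter of the page, where comments
// usually sit. Thresholds and weights are arbitrary for this sketch.
func scoreParagraphs(paragraphs []string) []int {
	scores := make([]int, len(paragraphs))
	prevWasProse := false
	for i, p := range paragraphs {
		sw := countStopwords(p)
		isProse := sw >= 2
		score := sw
		if isProse && prevWasProse {
			score += 5 // consecutive prose paragraphs form a cluster
		}
		if i >= len(paragraphs)*3/4 {
			score -= 3 // low on the page
		}
		scores[i] = score
		prevWasProse = isProse
	}
	return scores
}

func main() {
	page := []string{
		"Home | News | Sport",
		"The maker of the board is in Italy.",
		"It is open and free to use.",
		"Nice post!",
	}
	for i, s := range scoreParagraphs(page) {
		fmt.Printf("%2d  %s\n", s, page[i])
	}
}
```

In this sketch the third paragraph scores highest because it follows another prose paragraph, while the navigation line and the trailing comment score lowest.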

func (*ContentExtractor) GetCanonicalLink ¶

func (extr *ContentExtractor) GetCanonicalLink(document *goquery.Document) string

GetCanonicalLink returns the meta canonical link set in the source

func (*ContentExtractor) GetCleanTextAndLinks ¶

func (extr *ContentExtractor) GetCleanTextAndLinks(topNode *goquery.Selection, lang string) (string, []string)

GetCleanTextAndLinks parses the main HTML node for text and links

func (*ContentExtractor) GetDomain ¶

func (extr *ContentExtractor) GetDomain(canonicalLink string) string

GetDomain extracts the domain from a link
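
For illustration, extracting a domain from a link with the standard library might look like the sketch below. domainOf is a hypothetical helper, not the method above, and whether the package strips the port or handles unparsable links the same way is an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// domainOf returns the host part of link without any port, or "" when the
// link cannot be parsed.
func domainOf(link string) string {
	u, err := url.Parse(link)
	if err != nil {
		return ""
	}
	return u.Hostname()
}

func main() {
	fmt.Println(domainOf("http://edition.cnn.com/2012/07/08/opinion/banzi-ted-open-source/index.html"))
	fmt.Println(domainOf("https://example.com:8443/path"))
}
```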

func (*ContentExtractor) GetFavicon ¶

func (extr *ContentExtractor) GetFavicon(document *goquery.Document) string

GetFavicon returns the favicon set in the source, if the article has one

func (*ContentExtractor) GetMetaAuthor ¶

func (extr *ContentExtractor) GetMetaAuthor(document *goquery.Document) string

GetMetaAuthor returns the meta author set in the source, if the article has one

func (*ContentExtractor) GetMetaContent ¶

func (extr *ContentExtractor) GetMetaContent(document *goquery.Document, metaName string) string

GetMetaContent returns the content attribute of meta tag with the given property name

func (*ContentExtractor) GetMetaContentLocation ¶

func (extr *ContentExtractor) GetMetaContentLocation(document *goquery.Document) string

GetMetaContentLocation returns the meta content location set in the source, if the article has one

func (*ContentExtractor) GetMetaContentWithSelector ¶

func (extr *ContentExtractor) GetMetaContentWithSelector(document *goquery.Document, selector string) string

GetMetaContentWithSelector returns the content attribute of meta tag matching the selector

func (*ContentExtractor) GetMetaContents ¶

func (extr *ContentExtractor) GetMetaContents(document *goquery.Document, metaNames set.Interface) map[string]string

GetMetaContents returns all the meta tags as name->content pairs

func (*ContentExtractor) GetMetaDescription ¶

func (extr *ContentExtractor) GetMetaDescription(document *goquery.Document) string

GetMetaDescription returns the meta description set in the source, if the article has one

func (*ContentExtractor) GetMetaKeywords ¶

func (extr *ContentExtractor) GetMetaKeywords(document *goquery.Document) string

GetMetaKeywords returns the meta keywords set in the source, if the article has them

func (*ContentExtractor) GetMetaLanguage ¶

func (extr *ContentExtractor) GetMetaLanguage(document *goquery.Document) string

GetMetaLanguage returns the meta language set in the source, if the article has one

func (*ContentExtractor) GetPublishDate ¶

func (extr *ContentExtractor) GetPublishDate(document *goquery.Document) *time.Time

GetPublishDate returns the publication date, if one can be located.

func (*ContentExtractor) GetTags ¶

func (extr *ContentExtractor) GetTags(document *goquery.Document) set.Interface

GetTags returns the tags set in the source, if the article has them

func (*ContentExtractor) GetTitle ¶

func (extr *ContentExtractor) GetTitle(document *goquery.Document) string

GetTitle returns the title set in the source, if the article has one

func (*ContentExtractor) PostCleanup ¶

func (extr *ContentExtractor) PostCleanup(targetNode *goquery.Selection) *goquery.Selection

PostCleanup removes any divs that look like non-content, clusters of links, or paragraphs with no gusto

type Crawler ¶

type Crawler struct {
	RawHTML string
	Charset string
	// contains filtered or unexported fields
}

Crawler can fetch the target HTML page

func NewCrawler ¶

func NewCrawler(config Configuration, url string, RawHTML string) Crawler

NewCrawler returns a crawler object initialised with the URL and the [optional] raw HTML body

func (Crawler) Crawl ¶

func (c Crawler) Crawl() (*Article, error)

Crawl fetches the HTML body and returns an Article

func (Crawler) GetCharset ¶

func (c Crawler) GetCharset(document *goquery.Document) string

GetCharset returns a normalised charset string extracted from the meta tags

func (Crawler) GetContentType ¶

func (c Crawler) GetContentType(document *goquery.Document) string

GetContentType returns the Content-Type string extracted from the meta tags

func (*Crawler) Preprocess ¶

func (c *Crawler) Preprocess() (*goquery.Document, error)

Preprocess fetches the HTML page if needed, converts it to UTF-8 and applies some text normalisation to guarantee better results when extracting the content

func (*Crawler) SetCharset ¶

func (c *Crawler) SetCharset(cs string)

SetCharset can be used to force a charset (e.g. when read from the HTTP headers) rather than relying on the detection from the HTML meta tags
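
One way to obtain a charset from the HTTP headers is to parse the Content-Type value with the standard library, as in this sketch. charsetFromContentType is a hypothetical helper and is not part of the package.

```go
package main

import (
	"fmt"
	"mime"
)

// charsetFromContentType returns the charset parameter of a Content-Type
// header value, or "" if there is none or the value cannot be parsed.
func charsetFromContentType(header string) string {
	_, params, err := mime.ParseMediaType(header)
	if err != nil {
		return ""
	}
	return params["charset"]
}

func main() {
	fmt.Println(charsetFromContentType("text/html; charset=ISO-8859-1"))
}
```

A non-empty result could then be passed to SetCharset on a Crawler before calling Crawl, so that the header value takes precedence over detection from the meta tags.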

type Goose ¶

type Goose struct {
	// contains filtered or unexported fields
}

Goose is the main entry point of the program

func New ¶

func New(args ...string) Goose

New returns a new instance of the article extractor

func (Goose) ExtractFromRawHTML ¶

func (g Goose) ExtractFromRawHTML(url string, RawHTML string) (*Article, error)

ExtractFromRawHTML returns an article object from the raw HTML content

func (Goose) ExtractFromURL ¶

func (g Goose) ExtractFromURL(url string) (*Article, error)

ExtractFromURL follows the URL, fetches the HTML page and returns an article object

type Parser ¶

type Parser struct{}

Parser is an HTML parser specialised in extraction of main content and other properties

func NewParser ¶

func NewParser() *Parser

NewParser returns an HTML parser

type StopWords ¶

type StopWords struct {
	// contains filtered or unexported fields
}

StopWords implements a simple language detector

func NewStopwords ¶

func NewStopwords() StopWords

NewStopwords returns an instance of a stop words detector

func (StopWords) SimpleLanguageDetector ¶

func (stop StopWords) SimpleLanguageDetector(text string) string

SimpleLanguageDetector returns the language code for the text, based on its stop words
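
Stopword-based detection can be sketched as counting, for each language, how many of its stopwords occur in the text and picking the language with the most hits. The code below is an illustration under that assumption, not the package's implementation: detectLanguage is a hypothetical name and the two word lists are tiny samples.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwordLists holds tiny hypothetical lists for two languages. Real lists
// are much longer. A slice keeps the order fixed, so ties go to the language
// listed first.
var stopwordLists = []struct {
	lang  string
	words []string
}{
	{"en", []string{"the", "and", "of", "is", "to", "in"}},
	{"it", []string{"il", "la", "di", "che", "e", "un"}},
}

// detectLanguage returns the code of the language whose stopwords occur most
// often in text, or "" when no stopword matches at all.
func detectLanguage(text string) string {
	words := strings.Fields(strings.ToLower(text))
	best, bestCount := "", 0
	for _, entry := range stopwordLists {
		set := make(map[string]bool, len(entry.words))
		for _, w := range entry.words {
			set[w] = true
		}
		count := 0
		for _, w := range words {
			if set[w] {
				count++
			}
		}
		if count > bestCount {
			best, bestCount = entry.lang, count
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("The cat is in the garden"))
	fmt.Println(detectLanguage("Il gatto e la casa di Marco"))
}
```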

type VideoExtractor ¶

type VideoExtractor struct {
	// contains filtered or unexported fields
}

VideoExtractor can extract the main video from an HTML page

func NewVideoExtractor ¶

func NewVideoExtractor() VideoExtractor

NewVideoExtractor returns a new instance of an HTML video extractor

func (*VideoExtractor) GetVideos ¶

func (ve *VideoExtractor) GetVideos(doc *goquery.Document) set.Interface

GetVideos returns the video tags embedded in the article
