sandblast

package module
v0.0.0-...-43f8fb9 Latest
Warning

This package is not in the latest version of its module.

Published: Dec 20, 2015 License: BSD-3-Clause Imports: 11 Imported by: 14

README

Library that uses Readability-like heuristics to extract text from an HTML document.

Example:

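import "golang.org/x/net/html"
…
node, err := html.Parse(bytes.NewReader(raw_html))
if err != nil {
	log.Fatal("Parsing error: ", err)
}
// Extract takes a Flags argument and also returns an error; 0 sets no flags.
title, text, err := sandblast.Extract(node, 0)
if err != nil {
	log.Fatal("Extraction error: ", err)
}
fmt.Printf("Title: %s\n%s", title, text)
…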

See also example/extract.go, a command line utility to extract text from a URL.

Documentation

Overview

Library that uses Readability-like heuristics to extract text from an HTML document.

Constants

const (
	KeepMenus  = Flags(1 << iota) // Not implemented
	KeepLinks                     // Keeps link destinations for links embedded inside text blocks
	KeepImages                    // Not implemented
	MarkTitles                    // Not implemented
)
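
The constants above are bit flags, so they can be combined with | and tested with &. The following is a minimal sketch of that pattern. It redeclares Flags and the four constants locally so it runs without importing the package, and has is a hypothetical helper that is not part of sandblast.

```go
package main

import "fmt"

// Flags mirrors the documented type. It is redeclared here only so the
// sketch is self-contained; real code would use sandblast.Flags.
type Flags int

const (
	KeepMenus  = Flags(1 << iota) // 1
	KeepLinks                     // 2
	KeepImages                    // 4
	MarkTitles                    // 8
)

// has reports whether every bit in want is set in f.
// Hypothetical helper, not part of the sandblast API.
func has(f, want Flags) bool {
	return f&want == want
}

func main() {
	// Combine two flags with bitwise OR.
	flags := KeepLinks | MarkTitles

	fmt.Println(int(flags))             // numeric value of the combined flags
	fmt.Println(has(flags, KeepLinks))  // is KeepLinks set?
	fmt.Println(has(flags, KeepImages)) // is KeepImages set?
}
```

Per the comments in the constant block, only KeepLinks is implemented in this version, so passing the other flags to Extract would presumably have no effect.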

Variables

This section is empty.

Functions

func DecodedBody

func DecodedBody(resp *http.Response) (content []byte, encoding string, err error)

DecodedBody returns the decoded body of resp as a byte slice, along with the name of the encoding it detected.

func Extract

func Extract(node *html.Node, flags Flags) (title, text string, err error)

func ExtractEx

func ExtractEx(node *html.Node, flags Flags) (title, text string, simplified, flattened, cleaned *element, err error)

func FetchURL

func FetchURL(url string) (body []byte, status int, encoding string, err error)
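
FetchURL is undocumented on this page, so its exact behaviour is not known from the source. As a rough illustration of the result shape only (body, status, encoding, err), here is a standard-library sketch. The names fetch and charsetOf are hypothetical, and the approach is an assumption: it reads the charset from the Content-Type header, does no content sniffing, and returns the body undecoded. It is not the library's implementation.

```go
package main

import (
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/http/httptest"
	"strings"
)

// charsetOf returns the lowercased charset parameter of a Content-Type
// header value, or "" if the header is empty, malformed, or has no charset.
func charsetOf(contentType string) string {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return ""
	}
	return strings.ToLower(params["charset"])
}

// fetch is a hypothetical stand-in with the same result shape as FetchURL.
// Assumption: the encoding is taken from the response header only, and the
// body bytes are returned as received.
func fetch(url string) (body []byte, status int, encoding string, err error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, 0, "", err
	}
	defer resp.Body.Close()

	body, err = io.ReadAll(resp.Body)
	if err != nil {
		return nil, resp.StatusCode, "", err
	}
	return body, resp.StatusCode, charsetOf(resp.Header.Get("Content-Type")), nil
}

func main() {
	// A local test server keeps the sketch runnable without network access.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=ISO-8859-1")
		fmt.Fprint(w, "<html><title>Hi</title></html>")
	}))
	defer srv.Close()

	body, status, encoding, err := fetch(srv.URL)
	if err != nil {
		fmt.Println("fetch error:", err)
		return
	}
	fmt.Println(status, encoding, len(body))
}
```

A caller would normally check both err and the status code before handing the body to html.Parse, since a non-200 response can still carry a body.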

Types

type Flags

type Flags int
