Documentation ¶
Functions ¶
func GetRawTextFromHTML ¶
GetRawTextFromHTML extracts text from an HTML document without retaining any particular formatting information.
Limitation: GetRawTextFromHTML is only a minimal, naive HTML-to-text extractor. It ignores HTML formatting directives and the rules that govern whitespace collapsing; it only strips HTML markup to expose the meaningful text for scraping or searching.
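To illustrate what "minimal and naive" extraction can look like, here is a short sketch using only the standard library. It is not this package's implementation: rawText is a hypothetical name, and the choices it makes (a space in place of every tag, entity decoding, whitespace collapsing) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// rawText is a hypothetical, deliberately naive tag stripper: it drops
// everything between '<' and '>', decodes character entities and collapses
// runs of whitespace. It does not special-case <script>, <style> or comments.
func rawText(doc string) string {
	var b strings.Builder
	inTag := false
	for _, r := range doc {
		switch {
		case r == '<':
			inTag = true
			b.WriteByte(' ') // keep words from adjacent elements apart
		case r == '>' && inTag:
			inTag = false
		case !inTag:
			b.WriteRune(r)
		}
	}
	// strings.Fields splits on any whitespace, so joining collapses spacing.
	return strings.Join(strings.Fields(html.UnescapeString(b.String())), " ")
}

func main() {
	fmt.Println(rawText("<p>Tom &amp; <b>Jerry</b></p>\n<p>The end</p>"))
	// prints "Tom & Jerry The end"
}
```

Like any extractor of this kind, the sketch loses all structure: paragraphs, emphasis and line breaks all end up as single spaces.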
Types ¶
type Scanner ¶
type Scanner struct {
	// AllowedTags is the white-list of allowed tags. For each tag, allowed
	// attributes can be expressed as a pattern:
	//  - **         : all attributes (except data-xxx and onxxx events that
	//                 should be explicitly allowed) are allowed.
	//  - *          : all attributes (except data-xxx and onxxx events that
	//                 should be explicitly allowed) are allowed if their value
	//                 is a space-separated list of names (letters, numbers, _
	//                 or -).
	//  - a=**       : attribute 'a' whatever its value is.
	//  - a or a=*   : attribute 'a' whose value is a space-separated list of
	//                 names (letters, numbers, _ or -).
	//  - a=_MIME    : attribute 'a' whose value is a mimetype specification
	//                 (names separated by '/' like 'text/css')
	//  - a=key      : attribute 'a' whose value is 'key'.
	//  - a=__URL    : attribute 'a' whose value is an 'allowed' URL.
	//                 An allowed URL is parsable with a scheme matching
	//                 SafeSchemes.
	//                 'Anonymous' hosts (no recorded domain name) are not
	//                 accepted.
	//                 Absolute URL for style-sheets are not accepted.
	//                 Absolute URL with target=_blank but without rel="noopener"
	//                 URL's query are not accepted except if _? suffix
	//                 is added.
	//  - a=__REL_URL: like __URL but only relative URL are accepted.
	//
	// Several patterns can be listed for a given attribute's name, knowing
	// that patterns are checked against in their declaration order (first
	// matching will pass/first non-matching check will fail).
	// LIMITATION: Be extra-careful when using catch-all patterns. For instance
	// {http-equiv=refresh, '*'} as a result will allow any http-equiv to be
	// accepted, so catch-all patterns are actually quite tedious to use.
	// TODO: As of now, it is a "good enough" approach but probably needs further
	// polishing/rework to make something acceptable out of this.
	AllowedTags map[atom.Atom][]string

	// AllowedURLSchemes is the white-list of allowed schemes in URL.
	// "*" allows any schemes.
	AllowedURLSchemes []string

	// AllowAbsoluteURLinCSS, when set to true, accepts using external URL.
	// (by default only relative URL or local URL are considered). Queries are
	// not accepted.
	AllowAbsoluteURLinCSS bool

	// AllowedCSSProperties is the white-list of accepted CSS properties.
	// "*" allows any property, "!xxx" fails immediately for property xxx even
	// if property xxx is allowed afterwards.
	AllowedCSSProperties []string

	// AllowedCSSFunctions is the white-list of accepted CSS functions.
	// "*" allows any functions, "!xxx" fails immediately for function xxx even
	// if function xxx is allowed afterwards.
	AllowedCSSFunctions []string

	// AllowedCSSAtKeywords is the white-list of accepted at-keywords.
	// "*" allows any keywords, "!xxx" fails immediately for keyword xxx even
	// if keyword xxx is allowed afterwards.
	AllowedCSSAtKeywords []string
}
Scanner represents an HTML/CSS scanner that looks for possible security risks. This scanner is only intended to check existing untrusted HTML/CSS in EPUB files; it does not properly handle all injection cases, notably obfuscated strings that rely, for example, on unusual character encodings.
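The catch-all limitation noted in the AllowedTags comments is easier to see with a small model of the pattern check. The sketch below is not the package's code: attrAllowed is a hypothetical function, it covers only the **, *, a, a=*, a=** and a=key patterns, and it assumes that patterns are tried in declaration order with the first match accepting the attribute.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nameList matches a space-separated list of names (letters, numbers, _ or -).
var nameList = regexp.MustCompile(`^[A-Za-z0-9_-]+( [A-Za-z0-9_-]+)*$`)

// attrAllowed is a hypothetical checker for a subset of the patterns.
// Assumption: the first pattern that matches accepts the attribute.
func attrAllowed(patterns []string, name, value string) bool {
	// data-xxx attributes and onxxx events are never covered by catch-alls.
	// The prefix test is crude and only meant for illustration.
	sensitive := strings.HasPrefix(name, "data-") || strings.HasPrefix(name, "on")
	for _, p := range patterns {
		switch p {
		case "**":
			if !sensitive {
				return true
			}
		case "*":
			if !sensitive && nameList.MatchString(value) {
				return true
			}
		default:
			attr, want, hasValue := strings.Cut(p, "=")
			if attr != name {
				continue // pattern targets another attribute
			}
			switch {
			case hasValue && want == "**":
				return true
			case !hasValue || want == "*":
				if nameList.MatchString(value) {
					return true
				}
			case want == value:
				return true
			}
		}
	}
	return false
}

func main() {
	meta := []string{"http-equiv=refresh", "*"}
	fmt.Println(attrAllowed(meta, "http-equiv", "refresh"))      // true
	fmt.Println(attrAllowed(meta, "http-equiv", "content-type")) // true, caught by "*"
	fmt.Println(attrAllowed(meta, "onclick", "run"))             // false
	fmt.Println(attrAllowed([]string{"class", "id"}, "class", "a b")) // true
}
```

The second call shows the problem: the specific http-equiv=refresh pattern does not restrict anything once "*" follows it, because any other http-equiv value that looks like a name is accepted by the catch-all.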
func NewMinimalScanner ¶
NewMinimalScanner creates a new scanner that allows only minimal HTML features, with no CSS or JS. The attributes in globalAttr are added to the list of allowed attributes of every atom.
func NewPermissiveScanner ¶
func NewPermissiveScanner() *Scanner
NewPermissiveScanner creates a new scanner that allows any attribute or CSS property, but not fetching external resources/URLs.
func NewScannerWithStyle ¶
NewScannerWithStyle creates a new scanner that extends NewStrictScanner to allow use of common CSS properties (but not CSS functions or external CSS resources). If globalAttr are provided, these attributes will be allowed for each atom in addition to "style" and "class" attributes that are accepted by default.
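The CSS white-lists of Scanner (AllowedCSSProperties, AllowedCSSFunctions, AllowedCSSAtKeywords) share the same "*" and "!xxx" conventions. As a rough sketch of how such a list could be evaluated, the hypothetical cssAllowed function below assumes that entries are walked in declaration order, so a "!xxx" entry only takes precedence over the entries that come after it. This is an interpretation of the field comments, not the package's actual code.

```go
package main

import "fmt"

// cssAllowed is a hypothetical helper: it walks the white-list in
// declaration order, so "!name" rejects name before a later "*" or "name"
// entry can accept it. Names absent from the list are rejected.
func cssAllowed(whitelist []string, name string) bool {
	for _, entry := range whitelist {
		switch entry {
		case "!" + name:
			return false
		case "*", name:
			return true
		}
	}
	return false
}

func main() {
	props := []string{"!behavior", "color", "*"}
	fmt.Println(cssAllowed(props, "color"))              // true
	fmt.Println(cssAllowed(props, "behavior"))           // false
	fmt.Println(cssAllowed(props, "margin"))             // true, caught by "*"
	fmt.Println(cssAllowed([]string{"color"}, "margin")) // false
}
```

Under this reading, placing "!xxx" entries first is what makes a trailing "*" safe to use.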