Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var ParseHTML model.ParseFunc = func(reader io.ReadCloser, options ...model.ParseOption) *model.ParseOutput {
	defer reader.Close()

	// Apply the functional options to a fresh parse configuration.
	c := &model.ParseConfig{}
	for _, option := range options {
		option(c)
	}

	if c.Verbose {
		fmt.Println("--> parsing HTML...")
	}

	var err error
	var contents *htmlContents
	var parseFn parseFunc = parseHTML

	// Fall back to the default tag weights unless custom ones were provided.
	var tagWeights model.TagWeights
	if len(c.TagWeights) == 0 {
		tagWeights = defaultTagWeights
	} else {
		tagWeights = c.TagWeights
	}

	// In full-site mode, crawl the site starting from the source URL;
	// otherwise parse only the given reader.
	if c.FullSite && c.Source != "" {
		var crawler *webCrawler
		crawler, err = newWebCrawler(parseFn, tagWeights, c.Source, c.Verbose)
		if err != nil {
			return &model.ParseOutput{Err: err}
		}
		contents = crawler.run(reader)
	} else {
		contents = parseFn(reader, tagWeights, nil)
	}
	if err != nil {
		return &model.ParseOutput{Err: err}
	}

	if len(contents.lines) == 0 {
		return &model.ParseOutput{}
	}

	tags, title := tagifyHTML(contents, tagWeights, c.Verbose, c.NoStopWords, c.ContentOnly)

	return &model.ParseOutput{Tags: tags, DocTitle: title, DocHash: contents.hash()}
}
ParseHTML receives raw HTML markup text (e.g. fetched from the Web) and returns the plain text plus a list of prioritised tags, ranked by the importance of the HTML tags that wrap each sentence.
Example:
<h1>A story about foo</h1> <p>Foo was a good guy but had quite poor time management skills, and therefore he had issues with shipping all his tasks. Though foo had heaps of other amazing skills, which gained him a fortune.</p>
Result:
foo: 2 + 1 = 3, story: 2, management: 1 + 1 = 2, skills: 1 + 1 = 2.
Each occurrence contributes the weight of the tag that wraps it, so a word inside the <h1> heading scores more than the same word inside a <p> paragraph.
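The arithmetic can be reproduced with a small weighted-count sketch. The weights below (2 for <h1>, 1 for <p>) are inferred from the numbers above rather than documented defaults, and the real parser does more (sentence splitting, stop-word filtering):

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Assumed per-tag weights, inferred from the example result above.
	weights := map[string]int{"h1": 2, "p": 1}

	parts := []struct{ tag, text string }{
		{"h1", "a story about foo"},
		{"p", "foo was a good guy but had quite poor time management skills"},
	}

	// Each word scores the weight of the tag that wraps it.
	scores := map[string]int{}
	for _, part := range parts {
		for _, word := range strings.Fields(part.text) {
			scores[word] += weights[part.tag]
		}
	}

	fmt.Println(scores["foo"])    // 3: 2 from the <h1> plus 1 from the <p>
	fmt.Println(scores["skills"]) // 1 here; the full example text has it twice
}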
The resulting ParseOutput carries a slice of tags (Tags), the title of the page (DocTitle) and a version of the document based on the hashed contents (DocHash).
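A minimal end-to-end sketch of calling ParseHTML follows. The import paths are illustrative, and the option is written as a plain closure over *model.ParseConfig, since no named option constructors appear on this page:

package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/zoomio/tagify/model"               // illustrative path
	html "github.com/zoomio/tagify/processor/html" // illustrative path
)

func main() {
	doc := "<h1>A story about foo</h1><p>Foo was a good guy but had quite poor time management skills.</p>"

	// ParseHTML takes an io.ReadCloser, so wrap the in-memory string.
	reader := io.NopCloser(strings.NewReader(doc))

	// Options mutate the ParseConfig; a closure avoids guessing at
	// named option constructors.
	noStopWords := func(c *model.ParseConfig) { c.NoStopWords = true }

	out := html.ParseHTML(reader, noStopWords)
	if out.Err != nil {
		fmt.Println("parse error:", out.Err)
		return
	}

	fmt.Println("title:", out.DocTitle)
	fmt.Println("hash:", out.DocHash)
	for _, tag := range out.Tags {
		fmt.Printf("%+v\n", tag)
	}
}

The same closure pattern works for any other ParseConfig field shown above, e.g. setting Source and FullSite to enable the crawling branch.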
Functions ¶
This section is empty.
Types ¶
This section is empty.