webscraper

v1.0.1
Published: Oct 14, 2020 | License: MIT | Imports: 4 | Imported by: 1

README

Web-scraper library for Golang

web-scraper is a small library for parsing and scraping HTML. It is built on top of golang.org/x/net/html.

Installation

A Go version with module support is required. Run the following command in the working directory that contains the go.mod file:

go get github.com/genjik/web-scraper

Documentation

The Element type wraps a pointer to an html.Node. Every function and method in the API returns HTML elements as Element values.

type Element struct {
    node *html.Node
}

GetRootElement accepts HTML from any type that satisfies io.Reader. It returns an Element that contains a pointer to the <html> node.

func GetRootElement(r io.Reader) (Element, error)

Retrieve raw text from element

func (e Element) GetText() string

Search for child elements

func (e Element) FindOne(tag string, recursive bool, attrs ...string) Element
func (e Element) FindAll(tag string, recursive bool, limit int, attrs ...string) []Element

Search for parent elements

func (e Element) FindParent(tag string, attrs ...string) Element
func (e Element) FindParents(tag string, limit int, attrs ...string) []Element

Search for sibling elements

func (e Element) FindPrevSibling(tag string, attrs ...string) Element
func (e Element) FindNextSibling(tag string, attrs ...string) Element
func (e Element) FindPrevSiblings(tag string, limit int, attrs ...string) []Element
func (e Element) FindNextSiblings(tag string, limit int, attrs ...string) []Element

Get an element

func (e Element) Parent() Element // Returns parent element
func (e Element) FirstChild() Element // Not supported yet
func (e Element) PrevSibling() Element // Not supported yet
func (e Element) NextSibling() Element // Not supported yet

Parameters:
tag string: The tag name of the element, e.g. html, head, body, div, span, h1.

attrs ...string: The attributes the method searches for, e.g. "class", "className". Any number of arguments can be passed, or the parameter can be omitted entirely.

recursive bool: false tells the method to look only at the direct children of the current element. true tells the method to keep descending through child elements until it reaches the end of the HTML tree.

limit int: Limits the number of elements in the result. -1 means no limit.
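
The example on this page passes attrs as a name followed by a value ("id", "special"). The sketch below shows one way such a flat list could be checked against an element's attributes. It is an illustration under assumptions, not the library's implementation: matchAttrs and the map-based attribute store are hypothetical, and because the behavior for an odd number of arguments is not documented here, the sketch simply ignores a trailing unpaired name.

```go
package main

import "fmt"

// matchAttrs reports whether every name/value pair in attrs is present in have.
// Hypothetical helper for illustration; it is not part of the web-scraper API.
func matchAttrs(have map[string]string, attrs ...string) bool {
	// Step through attrs two at a time: attrs[i] is the name, attrs[i+1] the value.
	for i := 0; i+1 < len(attrs); i += 2 {
		v, ok := have[attrs[i]]
		if !ok || v != attrs[i+1] {
			return false
		}
	}
	return true
}

func main() {
	el := map[string]string{"id": "red", "class": "box"}

	fmt.Println(matchAttrs(el))                              // true: no attrs given
	fmt.Println(matchAttrs(el, "class", "box"))              // true: one pair matches
	fmt.Println(matchAttrs(el, "class", "box", "id", "red")) // true: both pairs match
	fmt.Println(matchAttrs(el, "id", "green"))               // false: value differs
}
```

Omitting attrs matches on the tag alone, which corresponds to the "can be omitted" case described above.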

Example

package main

import (
    "strings"
    "github.com/genjik/web-scraper"
    "fmt"
)

func main() {
    r := strings.NewReader(`
        <html>
            <head></head>
            <body>
                <div id="red" class="box">
                    <div id="special">Special Message</div> 
                </div>

                <div id="green" class="box">
                    <div>
                        <div class="list-item" id="l1">List#1</div>
                        <div class="list-item" id="l2">List#2</div>
                        <div class="list-item" id="l3">List#3</div>
                        <div class="list-item" id="l4">List#4</div>
                        <div class="list-item" id="l5">List#5</div>
                    </div>
                </div>
            </body>
        </html>
    `)

    root, err := webscraper.GetRootElement(r)
    if err != nil {
        fmt.Println("failed to parse HTML:", err)
        return
    }

    el := root.FindOne("div", true, "id", "special").GetText()
    fmt.Println(el) // Special Message

    elements := root.FindAll("div", true, -1, "class", "list-item") 
    for _, element := range elements {
        fmt.Println(element.GetText()) // List#1-5
    }
}

Documentation

Types

type Element

type Element struct {
	// contains filtered or unexported fields
}

func GetRootElement

func GetRootElement(r io.Reader) (Element, error)

Takes an io.Reader that contains HTML, parses it, and returns the first node found whose type is html.ElementNode.

func (Element) FindAll

func (e Element) FindAll(tag string, recursive bool, limit int, attrs ...string) []Element

Traverses the child elements of the current element and appends to the []Element every child that satisfies tag and attrs. If recursive == true, it also searches the children of those children, and so on. If limit == -1, there is no limit. If limit == n, it returns at most n elements.
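
The interaction of recursive and limit can be sketched with a small stand-in tree. This is a minimal illustration, not the library's code: node and findAll are hypothetical, attribute matching is left out, and the depth-first, document-order traversal is an assumption.

```go
package main

import "fmt"

// node is a hypothetical, minimal stand-in for an HTML element.
type node struct {
	tag      string
	id       string
	children []*node
}

// findAll collects descendants of n whose tag matches, in document order.
// recursive=false inspects direct children only; a negative limit means no limit.
func findAll(n *node, tag string, recursive bool, limit int) []*node {
	var out []*node
	var walk func(cur *node)
	walk = func(cur *node) {
		for _, c := range cur.children {
			// Stop as soon as the requested number of matches is reached.
			if limit >= 0 && len(out) >= limit {
				return
			}
			if c.tag == tag {
				out = append(out, c)
			}
			if recursive {
				walk(c)
			}
		}
	}
	walk(n)
	return out
}

func main() {
	body := &node{tag: "body", children: []*node{
		{tag: "div", id: "red", children: []*node{{tag: "div", id: "special"}}},
		{tag: "div", id: "green", children: []*node{{tag: "p", id: "note"}}},
	}}

	fmt.Println(len(findAll(body, "div", false, -1))) // 2: direct children only
	fmt.Println(len(findAll(body, "div", true, -1)))  // 3: includes the nested div
	fmt.Println(len(findAll(body, "div", true, 2)))   // 2: capped by limit
}
```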

func (Element) FindNextSibling

func (e Element) FindNextSibling(tag string, attrs ...string) Element

Traverses the sibling elements AFTER the current element and returns the first one that satisfies the search parameters. Otherwise, returns nil.

func (Element) FindNextSiblings

func (e Element) FindNextSiblings(tag string, limit int, attrs ...string) []Element

Traverses the sibling elements AFTER the current element and returns a []Element containing the elements that satisfy the search parameters. Otherwise, returns nil.

func (Element) FindOne

func (e Element) FindOne(tag string, recursive bool, attrs ...string) Element

Traverses the child elements of the current element and returns the first child found that satisfies tag and attrs. If it doesn't find any element, it returns nil.

func (Element) FindParent

func (e Element) FindParent(tag string, attrs ...string) Element

Recursively traverses the parent elements of the current element until it finds an element that satisfies tag and attrs.

func (Element) FindParents

func (e Element) FindParents(tag string, limit int, attrs ...string) []Element

Recursively traverses all parent elements of the current element and returns a []Element containing the elements that satisfy tag and attrs. If limit == -1, there is no limit. If limit == n, it returns at most n elements.
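
A rough sketch of an upward search with a limit follows. It uses a loop over parent links rather than recursion, which visits the same ancestors. The node type and findParents function are hypothetical stand-ins, attribute matching is omitted, and returning the nearest ancestor first is an assumption.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML element with a parent link.
type node struct {
	tag    string
	parent *node
}

// findParents walks upward from n and collects ancestors whose tag matches,
// nearest first. A negative limit means no limit.
func findParents(n *node, tag string, limit int) []*node {
	var out []*node
	for p := n.parent; p != nil; p = p.parent {
		if limit >= 0 && len(out) >= limit {
			break
		}
		if p.tag == tag {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	html := &node{tag: "html"}
	body := &node{tag: "body", parent: html}
	outer := &node{tag: "div", parent: body}
	inner := &node{tag: "div", parent: outer}
	item := &node{tag: "div", parent: inner}

	fmt.Println(len(findParents(item, "div", -1)))  // 2: inner and outer
	fmt.Println(len(findParents(item, "div", 1)))   // 1: inner only
	fmt.Println(len(findParents(item, "span", -1))) // 0: no span ancestor
}
```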

func (Element) FindPrevSibling

func (e Element) FindPrevSibling(tag string, attrs ...string) Element

Traverses the sibling elements BEFORE the current element and returns the first one that satisfies the search parameters. Otherwise, returns nil.

func (Element) FindPrevSiblings

func (e Element) FindPrevSiblings(tag string, limit int, attrs ...string) []Element

Traverses the sibling elements BEFORE the current element and returns a []Element containing the elements that satisfy the search parameters. Otherwise, returns nil.
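
The difference between searching BEFORE and AFTER the current element can be sketched with a doubly linked list of siblings. Everything here is hypothetical (node, link, findSiblings), attribute matching is omitted, and the result order (nearest sibling first in either direction) is an assumption rather than documented behavior.

```go
package main

import "fmt"

// node is a hypothetical stand-in for an HTML element with sibling links.
type node struct {
	tag, id    string
	prev, next *node
}

// link connects the given nodes left to right as siblings.
func link(nodes ...*node) {
	for i := 1; i < len(nodes); i++ {
		nodes[i-1].next = nodes[i]
		nodes[i].prev = nodes[i-1]
	}
}

// findSiblings collects matching siblings of n in one direction, nearest first.
// forward=true looks after n, forward=false looks before n.
// A negative limit means no limit.
func findSiblings(n *node, tag string, forward bool, limit int) []*node {
	step := func(c *node) *node {
		if forward {
			return c.next
		}
		return c.prev
	}
	var out []*node
	for s := step(n); s != nil; s = step(s) {
		if limit >= 0 && len(out) >= limit {
			break
		}
		if s.tag == tag {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	l1 := &node{tag: "div", id: "l1"}
	l2 := &node{tag: "div", id: "l2"}
	l3 := &node{tag: "div", id: "l3"}
	l4 := &node{tag: "div", id: "l4"}
	link(l1, l2, l3, l4)

	for _, s := range findSiblings(l2, "div", true, -1) {
		fmt.Println("after l2:", s.id) // l3, then l4
	}
	for _, s := range findSiblings(l2, "div", false, -1) {
		fmt.Println("before l2:", s.id) // l1
	}
}
```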

func (Element) GetText

func (e Element) GetText() string

func (Element) Parent added in v1.0.1

func (e Element) Parent() Element

Returns the first parent node that is an ElementNode. Otherwise, returns nil.
