goSpider

v1.7.5 | Published: Aug 17, 2024 | License: Apache-2.0

README

goSpider Navigation Library

This Go library provides functions to navigate websites and retrieve information using chromedp. It supports basic actions like fetching HTML content, clicking buttons, filling forms, and handling alerts, as well as more complex interactions such as waiting for dynamically loaded content.

Installation

To use this library, install it along with its chromedp dependency:

go get github.com/chromedp/chromedp
go get github.com/DanielFillol/goSpider

Usage

Importing the Library First, import the library in your Go project:

import "DanielFillol/goSpider"

Example Usage

Here's an example of how to use the library. If you need a more complete example, please take a look at this project.

package main

import (
	"fmt"
	"github.com/DanielFillol/goSpider"
	"golang.org/x/net/html"
	"log"
	"time"
)

func main() {
	users := []goSpider.Request{
		{SearchString: "1017927-35.2023.8.26.0008"},
		{SearchString: "0002396-75.2013.8.26.0201"},
		{SearchString: "1551285-50.2021.8.26.0477"},
		{SearchString: "0015386-82.2013.8.26.0562"},
		{SearchString: "0007324-95.2015.8.26.0590"},
		{SearchString: "1545639-85.2023.8.26.0090"},
		{SearchString: "1557599-09.2021.8.26.0090"},
		{SearchString: "1045142-72.2021.8.26.0002"},
		{SearchString: "0208591-43.2009.8.26.0004"},
		{SearchString: "1024511-70.2022.8.26.0003"},
	}

	numberOfWorkers := 1
	duration := 0 * time.Millisecond

	results, err := goSpider.ParallelRequests(users, numberOfWorkers, duration, Crawler)
	if err != nil {
		log.Println("Expected %d results, but got %d, List results: %v", len(users), 0, len(results))
	}

	log.Println("Finish Parallel Requests!")

	fmt.Println(len(results))
}

func Crawler(d string) (*html.Node, error) {
	url := "https://esaj.tjsp.jus.br/cpopg/open.do"
	nav := goSpider.NewNavigator("", true)
	defer nav.Close()

	err := nav.OpenURL(url)
	if err != nil {
		log.Printf("OpenURL error: %v", err)
		return nil, err
	}

	err = nav.CheckRadioButton("#interna_NUMPROC > div > fieldset > label:nth-child(5)")
	if err != nil {
		log.Printf("CheckRadioButton error: %v", err)
		return nil, err
	}

	err = nav.FillField("#nuProcessoAntigoFormatado", d)
	if err != nil {
		log.Printf("filling field error: %v", err)
		return nil, err
	}

	err = nav.ClickButton("#botaoConsultarProcessos")
	if err != nil {
		log.Printf("ClickButton error: %v", err)
		return nil, err
	}

	err = nav.WaitForElement("#tabelaUltimasMovimentacoes > tr:nth-child(1) > td.dataMovimentacao", 15*time.Second)
	if err != nil {
		log.Printf("WaitForElement error: %v", err)
		return nil, err
	}

	pageSource, err := nav.GetPageSource()
	if err != nil {
		log.Printf("GetPageSource error: %v", err)
		return nil, err
	}

	return pageSource, nil
}


Functions Overview

  • NewNavigator(profilePath string, headless bool) *Navigator Creates a new instance of the Navigator struct, initializing a new ChromeDP context and logger. profilePath: the path to the Chrome profile defined by the user; can be passed as an empty string. headless: if false, the Chrome UI will be shown.
nav := goSpider.NewNavigator("", true)
  • Close() Closes the Navigator instance and releases resources.
nav.Close()
  • OpenNewTab(url string) error Opens a new browser tab with the specified URL.
err := nav.OpenNewTab("https://www.example.com")
  • OpenURL(url string) error Opens the specified URL in the current browser context.
err := nav.OpenURL("https://www.example.com")
  • GetCurrentURL() (string, error) Returns the current URL of the browser.
currentURL, err := nav.GetCurrentURL()
  • Login(url, username, password, usernameSelector, passwordSelector, loginButtonSelector string, messageFailedSuccess string) error Logs into a website using the provided credentials and selectors.
err := nav.Login("https://www.example.com/login", "username", "password", "#username", "#password", "#login-button", "#login-message-fail")
  • CaptureScreenshot(nameFile string) error Captures a screenshot of the current browser window and saves it under the given file name.
err := nav.CaptureScreenshot("screenshot")
  • GetElement(selector string) (string, error) Retrieves the text content of an element specified by the selector.
text, err := nav.GetElement("#elementID")
  • WaitForElement(selector string, timeout time.Duration) error Waits for an element specified by the selector to be visible within the given timeout.
err := nav.WaitForElement("#elementID", 5*time.Second)
  • ClickButton(selector string) error Clicks a button specified by the selector.
err := nav.ClickButton("#buttonID")
  • ClickElement(selector string) error Clicks an element specified by the selector.
err := nav.ClickElement("#elementID")
  • CheckRadioButton(selector string) error Selects a radio button specified by the selector.
err := nav.CheckRadioButton("#radioButtonID")
  • UncheckRadioButton(selector string) error Unchecks a checkbox specified by the selector.
err := nav.UncheckRadioButton("#checkboxID")
  • FillField(selector string, value string) error Fills a field specified by the selector with the provided value.
err := nav.FillField("#fieldID", "value")
  • ExtractTableData(selector string) ([]map[int]map[string]interface{}, error) Extracts data from a table specified by the selector.
tableData, err := nav.ExtractTableData("#tableID")
  • ExtractDivText(parentSelectors ...string) (map[string]string, error) Extracts text content from divs specified by the parent selectors.
textData, err := nav.ExtractDivText("#parent1", "#parent2")
  • FetchHTML(url string) (string, error) Fetches the HTML content of the specified URL.
htmlContent, err := nav.FetchHTML("https://www.example.com")
  • ExtractLinks() ([]string, error) Extracts all links from the current page.
links, err := nav.ExtractLinks()
  • FillForm(formSelector string, data map[string]string) error Fills out a form specified by the selector with the provided data and submits it.
formData := map[string]string{
    "username": "myUsername",
    "password": "myPassword",
}
err := nav.FillForm("#loginForm", formData)
  • HandleAlert() error Handles JavaScript alerts by accepting them.
err := nav.HandleAlert()
  • SelectDropdown(selector, value string) error Selects an option in a dropdown specified by the selector and value.
err := nav.SelectDropdown("#dropdownID", "optionValue")
  • FindNodes(node *html.Node, nodeExpression string) ([]*html.Node, error) Extracts the nodes matching the given expression from the parsed page.
nodeData, err := goSpider.FindNodes(pageSource, "#parent1")
  • ExtractText(node *html.Node, nodeExpression string, Dirt string) (string, error) Extracts text content from the nodes matching the given expression.
textData, err := goSpider.ExtractText(pageSource, "#parent1", "\n")
  • ExtractTable(pageSource *html.Node, tableRowsExpression string) ([]*html.Node, error) Extracts data from a table specified by the selector.
tableData, err := goSpider.ExtractTable(pageSource, "#tableID")
  • ParallelRequests(requests []Request, numberOfWorkers int, duration time.Duration, crawlerFunc func(string) (*html.Node, error)) ([]PageSource, error) Performs web scraping tasks concurrently with a specified number of workers and a delay between requests. The crawlerFunc parameter allows for flexibility in defining the web scraping logic. Parameters: requests: A slice of Request structures containing the data needed for each request. numberOfWorkers: The number of concurrent workers to process the requests. duration: The delay duration between each request to avoid overwhelming the target server. crawlerFunc: A user-defined function that takes a process number as input and returns the parsed page as an *html.Node and an error.
	users := []goSpider.Request{
		{SearchString: "1017927-35.2023.8.26.0008"},
		{SearchString: "0002396-75.2013.8.26.0201"},
		{SearchString: "1551285-50.2021.8.26.0477"},
		{SearchString: "0015386-82.2013.8.26.0562"},
		{SearchString: "0007324-95.2015.8.26.0590"},
		{SearchString: "1545639-85.2023.8.26.0090"},
		{SearchString: "1557599-09.2021.8.26.0090"},
		{SearchString: "1045142-72.2021.8.26.0002"},
		{SearchString: "0208591-43.2009.8.26.0004"},
		{SearchString: "1024511-70.2022.8.26.0003"},
	}

	numberOfWorkers := 1
	duration := 0 * time.Millisecond

	results, err := goSpider.ParallelRequests(users, numberOfWorkers, duration, Crawler)
  • EvaluateParallelRequests(previousResults []PageSource, crawlerFunc func(string) (*html.Node, error), evaluate func([]PageSource) ([]Request, []PageSource)) ([]PageSource, error) EvaluateParallelRequests iterates over a set of previous results, evaluates them using the provided evaluation function, and handles re-crawling of problematic sources until all sources are valid or no further progress can be made. Parameters:
    • previousResults: A slice of PageSource objects containing the initial crawl results.
    • crawlerFunc: A function that takes a string (URL or identifier) and returns a parsed HTML node and an error.
    • evaluate: A function that takes a slice of PageSource objects and returns two slices:
      1. A slice of Request objects for sources that need to be re-crawled.
      2. A slice of valid PageSource objects.
    Returns:
    • A slice of valid PageSource objects after all problematic sources have been re-crawled and evaluated.
    • An error if there is a failure in the crawling process.
    Example usage:
 results, err := EvaluateParallelRequests(resultsFirst, Crawler, Eval)

	func Eval(previousResults []PageSource) ([]Request, []PageSource) {
		var newRequests []Request
		var validResults []PageSource

		for _, result := range previousResults {
			_, err := extractDataCover(result.Page, "")
			if err != nil {
				newRequests = append(newRequests, Request{SearchString: result.Request})
			} else {
				validResults = append(validResults, result)
			}
		}

		return newRequests, validResults
	}
  • LoginWithGoogle(email, password string) error Performs the Google login on https://accounts.google.com. The email and password are required for login, and the 2FA code is entered at a prompt.
err := nav.LoginWithGoogle("your_login", "your_password")

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func AskForString added in v1.4.0

func AskForString(prompt string) string

AskForString prompts the user to enter a string and returns the trimmed input.
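
Example (a sketch; this could be used, for instance, to read a 2FA code from the terminal):

code := goSpider.AskForString("Enter the 2FA code: ")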

func ExtractTable added in v1.2.0

func ExtractTable(pageSource *html.Node, tableRowsExpression string) ([]*html.Node, error)

ExtractTable extracts data from a table specified by the selector. Example:

tableData, err := goSpider.ExtractTable(pageSource, "#tableID")

func ExtractText added in v1.2.0

func ExtractText(node *html.Node, nodeExpression string, Dirt string) (string, error)

ExtractText extracts text content from nodes specified by the parent selectors. Example:

textData, err := goSpider.ExtractText(pageSource,"#parent1", "\n")

func FindNodes added in v1.2.0

func FindNodes(node *html.Node, nodeExpression string) ([]*html.Node, error)

FindNodes extracts the nodes matching the given expression from the parsed page. Example:

nodeData, err := goSpider.FindNodes(pageSource, "#parent1")

func PrintHtml added in v1.7.5

func PrintHtml(pageSource *html.Node) (string, error)
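
A possible call (the behavior of returning the node's HTML as a string is inferred from the signature, not documented above):

htmlString, err := goSpider.PrintHtml(pageSource)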

Types

type Navigator struct {
	Ctx     context.Context
	Cancel  context.CancelFunc
	Logger  *log.Logger
	Timeout time.Duration
	Cookies []*network.Cookie
}

Navigator is a struct that holds the context for the ChromeDP session and a logger.

func NewNavigator

func NewNavigator(profilePath string, headless bool) *Navigator

NewNavigator creates a new Navigator instance.

Parameters:

  • profilePath: the path to chrome profile defined by the user; can be passed as an empty string
  • headless: if false will show chrome UI

Example:

nav := goSpider.NewNavigator("/Users/USER_NAME/Library/Application Support/Google/Chrome/Profile 2", true, initialCookies)

NewNavigator creates a new Navigator instance with enhanced logging for troubleshooting authentication issues.

func (nav *Navigator) CaptureScreenshot(nameFile string) error

CaptureScreenshot captures a screenshot of the current browser window. Example:

err := nav.CaptureScreenshot("img")
func (nav *Navigator) CheckRadioButton(selector string) error

CheckRadioButton selects a radio button specified by the selector. Example:

err := nav.CheckRadioButton("#radioButtonID")
func (nav *Navigator) ClickButton(selector string) error

ClickButton clicks a button specified by the selector. Example:

err := nav.ClickButton("#buttonID")
func (nav *Navigator) ClickElement(selector string) error

ClickElement clicks an element specified by the selector. Example:

err := nav.ClickElement("#elementID")
func (nav *Navigator) Close()

Close closes the Navigator instance and releases resources. Example:

nav.Close()
func (nav *Navigator) Datepicker(date, calendarButtonSelector, calendarButtonGoBack, calendarButtonsTableXpath, calendarButtonTR string) error

Datepicker handles date-picker elements on websites: it receives a date, calculates how far back it needs to navigate in the picker, and finally selects the day.

  • date: a string in the format "dd/mm/yyyy"
  • calendarButtonSelector: the CSS selector of the date-picker
  • calendarButtonGoBack: the CSS selector of the go-back button
  • calendarButtonsTableXpath: the XPath of the days table, for example: "//*[@id="ui-datepicker-div"]/table/tbody/tr"
  • calendarButtonTR: the selector of the days table row, for example: "//*[@id="ui-datepicker-div"]/table/tbody/tr"
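
Example (a sketch only; the date-picker trigger and go-back selectors are hypothetical, and the table XPath follows the jQuery UI example above):

err := nav.Datepicker(
	"15/08/2024",
	"#datepicker",
	"#calendarPrev",
	`//*[@id="ui-datepicker-div"]/table/tbody/tr`,
	`//*[@id="ui-datepicker-div"]/table/tbody/tr`,
)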
func (nav *Navigator) EvaluateScript(script string) (interface{}, error)

EvaluateScript executes a JavaScript script and returns the result
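
Example (a sketch; evaluates a simple expression and returns its value as an interface{}):

result, err := nav.EvaluateScript("document.title")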

func (nav *Navigator) ExecuteScript(script string) error

ExecuteScript runs the specified JavaScript on the current page. script: the JavaScript code to execute. Returns an error, if any.
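
Example (a sketch; scrolls to the bottom of the page):

err := nav.ExecuteScript("window.scrollTo(0, document.body.scrollHeight)")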

func (nav *Navigator) ExtractLinks() ([]string, error)

ExtractLinks extracts all links from the current page. Example:

links, err := nav.ExtractLinks()
func (nav *Navigator) FillField(selector string, value string) error

FillField fills a field specified by the selector with the provided value. Example:

err := nav.FillField("#fieldID", "value")
func (nav *Navigator) FillForm(selector string, data map[string]string) error

FillForm fills out a form specified by the selector with the provided data and submits it. Example:

formData := map[string]string{
    "username": "myUsername",
    "password": "myPassword",
}
err := nav.FillForm("#loginForm", formData)
func (nav *Navigator) GetCurrentURL() (string, error)

GetCurrentURL returns the current URL of the browser. Example:

currentURL, err := nav.GetCurrentURL()
func (nav *Navigator) GetElement(selector string) (string, error)

GetElement retrieves the text content of an element specified by the selector. Example:

text, err := nav.GetElement("#elementID")
func (nav *Navigator) GetElementAttribute(selector, attribute string) (string, error)

GetElementAttribute retrieves the value of a specified attribute from an element identified by a CSS selector. Parameters: - selector: The CSS selector of the element. - attribute: The name of the attribute to retrieve the value of. Returns: - The value of the specified attribute. - An error if the attribute value could not be retrieved.
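
Example (a sketch; the selector and attribute are hypothetical):

href, err := nav.GetElementAttribute("#link", "href")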

func (nav *Navigator) GetPageSource() (*html.Node, error)

GetPageSource captures the full HTML of the current page. Returns the parsed page as an *html.Node and an error, if any. Example:

pageSource, err := nav.GetPageSource()
func (nav *Navigator) HandleAlert() error

HandleAlert handles JavaScript alerts by accepting them. Example:

err := nav.HandleAlert()
func (nav *Navigator) Login(url, username, password, usernameSelector, passwordSelector, loginButtonSelector string, messageFailedSuccess string) error

Login logs into a website using the provided credentials and selectors. Example:

err := nav.Login("https://www.example.com/login", "username", "password", "#username", "#password", "#login-button", "#login-message-fail")
func (nav *Navigator) LoginAccountsGoogle(email, password string) error

LoginAccountsGoogle performs the Google login on https://accounts.google.com using the given email and password.
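
Example (a sketch; the credentials are placeholders):

err := nav.LoginAccountsGoogle("your_email@gmail.com", "your_password")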

func (nav *Navigator) LoginWithGoogle(url string) error

LoginWithGoogle performs the Google login on the given URL
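
Example (a sketch; assumes the URL points to a login page that offers Google sign-in):

err := nav.LoginWithGoogle("https://www.example.com/login")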

func (nav *Navigator) MakeElementVisible(selector string) error

MakeElementVisible resets an element's display style so that the element specified by the selector becomes visible.
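
Example (a sketch; the selector is hypothetical):

err := nav.MakeElementVisible("#hiddenField")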

func (nav *Navigator) OpenURL(url string) error

OpenURL opens the specified URL in the current browser context. Example:

err := nav.OpenURL("https://www.example.com")
func (nav *Navigator) ReloadPage(retryCount int) error

ReloadPage reloads the current page with retry logic. retryCount: the number of times to retry reloading the page in case of failure. Returns an error, if any.
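
Example (a sketch; retries up to three times):

err := nav.ReloadPage(3)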

func (nav *Navigator) SaveImageBase64(selector, outputPath, prefixClean string) (string, error)

SaveImageBase64 extracts the base64 image data from the given selector and saves it to a file.

Parameters:

  • selector: the CSS selector of the CAPTCHA image element
  • outputPath: the file path to save the image
  • prefixClean: the prefix to clear from the source, if any

Example:

err := nav.SaveImageBase64("#imagemCaptcha", "captcha.png", "data:image/png;base64,")
func (nav *Navigator) SelectDropdown(selector, value string) error

SelectDropdown selects an option in a dropdown specified by the selector and value. Example:

err := nav.SelectDropdown("#dropdownID", "optionValue")
func (nav *Navigator) SetTimeOut(timeOut time.Duration)

SetTimeOut sets a timeout for all the waiting functions in the package. The default timeout of the Navigator is 300 ms.
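
Example (a sketch; raises the timeout to 10 seconds):

nav.SetTimeOut(10 * time.Second)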

func (nav *Navigator) SwitchToDefaultContent() error

SwitchToDefaultContent switches the context back to the main content from an iframe context.

func (nav *Navigator) SwitchToFrame(selector string) error

SwitchToFrame switches the context to the specified iframe.
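
Example (a sketch; the iframe selector is hypothetical, and SwitchToDefaultContent returns to the main page afterwards):

err := nav.SwitchToFrame("#checkoutIframe")
// ... interact with elements inside the iframe ...
err = nav.SwitchToDefaultContent()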

func (nav *Navigator) UncheckRadioButton(selector string) error

UncheckRadioButton unchecks a checkbox specified by the selector. Example:

err := nav.UncheckRadioButton("#checkboxID")
func (nav *Navigator) UnsafeClickButton(selector string) error

UnsafeClickButton clicks a button specified by the selector. Unsafe because this method does not use the wait-for-element feature. Example:

err := nav.UnsafeClickButton("#buttonID")
func (nav *Navigator) UnsafeFillField(selector string, value string) error

UnsafeFillField fills a field specified by the selector with the provided value. Unsafe because this method does not use the wait-for-element feature. Example:

err := nav.UnsafeFillField("#fieldID", "value")
func (nav *Navigator) WaitForElement(selector string, timeout time.Duration) error

WaitForElement waits for an element specified by the selector to be visible within the given timeout. Example:

err := nav.WaitForElement("#elementID", 5*time.Second)
func (nav *Navigator) WaitPageLoad() (string, error)

WaitPageLoad waits for the current page to fully load by checking the document.readyState property. It will retry until the page is fully loaded or the timeout of one minute is reached. Returns the page readyState as a string and an error, if any.
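
Example (a sketch):

state, err := nav.WaitPageLoad()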

type PageSource added in v1.2.0

type PageSource struct {
	Page    *html.Node
	Request string
	Error   error
}

PageSource structure to hold the HTML data

func EvaluateParallelRequests added in v1.3.0

func EvaluateParallelRequests(previousResults []PageSource, crawlerFunc func(string) (*html.Node, error), evaluate func([]PageSource) ([]Request, []PageSource)) ([]PageSource, error)

EvaluateParallelRequests iterates over a set of previous results, evaluates them using the provided evaluation function, and handles re-crawling of problematic sources until all sources are valid or no further progress can be made.

Parameters: - previousResults: A slice of PageSource objects containing the initial crawl results. - crawlerFunc: A function that takes a string (URL or identifier) and returns a parsed HTML node and an error. - evaluate: A function that takes a slice of PageSource objects and returns two slices:

  1. A slice of Request objects for sources that need to be re-crawled.
  2. A slice of valid PageSource objects.

Returns: - A slice of valid PageSource objects after all problematic sources have been re-crawled and evaluated. - An error if there is a failure in the crawling process.

Example usage:

results, err := EvaluateParallelRequests(resultsFirst, Crawler, Eval)

func Eval(previousResults []PageSource) ([]Request, []PageSource) {
	var newRequests []Request
	var validResults []PageSource

	for _, result := range previousResults {
		_, err := extractDataCover(result.Page, "")
		if err != nil {
			newRequests = append(newRequests, Request{SearchString: result.Request})
		} else {
			validResults = append(validResults, result)
		}
	}

	return newRequests, validResults
}

func ParallelRequests added in v1.1.0

func ParallelRequests(requests []Request, numberOfWorkers int, delay time.Duration, crawlerFunc func(string) (*html.Node, error)) ([]PageSource, error)

ParallelRequests performs web scraping tasks concurrently with a specified number of workers and a delay between requests. The crawlerFunc parameter allows for flexibility in defining the web scraping logic.

Parameters: - requests: A slice of Request structures containing the data needed for each request. - numberOfWorkers: The number of concurrent workers to process the requests. - delay: The delay duration between each request to avoid overwhelming the target server. - crawlerFunc: A user-defined function that takes a process number as input and returns the html as *html.Node, and an error.

Returns: - A slice of PageSource structures containing the results of the web scraping tasks. - An error if any occurred during the requests.

Example Usage:

results, err := ParallelRequests(requests, numberOfWorkers, delay, crawlerFunc)

func RemovePageSource added in v1.3.0

func RemovePageSource(slice []PageSource, s int) []PageSource

RemovePageSource removes the element at index `s` from a slice of `PageSource` objects. It returns the modified slice without the element at index `s`.
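
Example (a sketch; drops the result at index 2 from a hypothetical []PageSource named results):

results = goSpider.RemovePageSource(results, 2)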

type Request added in v1.3.0

type Request struct {
	SearchString string
}

Request structure to hold user data

func RemoveRequest added in v1.3.0

func RemoveRequest(slice []Request, s int) []Request

RemoveRequest removes the element at index `s` from a slice of `Request` objects. It returns the modified slice without the element at index `s`.
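
Example (a sketch; drops the first request from a hypothetical []Request named requests):

requests = goSpider.RemoveRequest(requests, 0)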

Directories

Path: htmlquery
Synopsis: Package htmlquery provides functions to extract data from HTML documents using XPath expressions.
