web

package
v0.8.3 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 8, 2024 License: MIT Imports: 8 Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func ScrapeWebData

func ScrapeWebData(uri []string, depth int) ([]byte, error)

ScrapeWebData initiates the scraping process for the given list of URIs. It returns the collected data, a CollectedData value holding the sections scraped from each URI, encoded as JSON in a byte slice, together with an error if any occurred during the scraping process.

Parameters:

  • uri: []string - list of URLs to scrape
  • depth: int - how deep to follow links into subpages when scraping

Returns:

  • []byte - JSON representation of the collected data
  • error - any error that occurred during the scraping process

Example usage:

go func() {
	// The package is named web, so with a default import the call is web.ScrapeWebData.
	res, err := web.ScrapeWebData([]string{"https://en.wikipedia.org/wiki/Maize"}, 5)
	if err != nil {
		logrus.WithError(err).Error("Error collecting data")
		return
	}
	logrus.WithField("result", string(res)).Info("Scraping completed")
}()
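
Because ScrapeWebData returns JSON bytes rather than a struct, a caller will usually decode the result before using it. The following is a minimal sketch, assuming the JSON matches the CollectedData type documented under Types. The types are redeclared locally so the example compiles on its own, and decodeCollected and the sample payload are hypothetical, not part of the package:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the package's documented types, redeclared here so the
// sketch is self-contained.
type Section struct {
	Title      string   `json:"title"`
	Paragraphs []string `json:"paragraphs"`
	Images     []string `json:"images"`
}

type CollectedData struct {
	Sections []Section `json:"sections"`
	Pages    []string  `json:"pages"`
}

// decodeCollected is a hypothetical helper: it unmarshals bytes shaped like
// the ScrapeWebData result into a CollectedData value.
func decodeCollected(raw []byte) (CollectedData, error) {
	var data CollectedData
	if err := json.Unmarshal(raw, &data); err != nil {
		return CollectedData{}, fmt.Errorf("decoding scrape result: %w", err)
	}
	return data, nil
}

func main() {
	// Hand-written sample payload standing in for a real scrape result.
	raw := []byte(`{"sections":[{"title":"Maize","paragraphs":["A cereal grain."],"images":[]}],"pages":["https://en.wikipedia.org/wiki/Maize"]}`)

	data, err := decodeCollected(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range data.Sections {
		fmt.Printf("%s: %d paragraph(s)\n", s.Title, len(s.Paragraphs))
	}
}
```

In real use, raw would be the first return value of ScrapeWebData, checked for a non-nil error before decoding.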

Types

type CollectedData

type CollectedData struct {
	Sections []Section `json:"sections"` // Sections is a collection of webpage sections that have been scraped.
	Pages    []string  `json:"pages"`
}

CollectedData represents the aggregated result of the scraping process. It contains Sections, a slice of Section structs each representing a distinct part of a scraped webpage, and Pages, a slice of strings.

type Section

type Section struct {
	Title      string   `json:"title"`      // Title is the heading text of the section.
	Paragraphs []string `json:"paragraphs"` // Paragraphs contains all the text content of the section.
	Images     []string `json:"images"`     // Images holds the section's images, possibly as base64 strings (unconfirmed).
}

Section represents a distinct part of a scraped webpage, typically defined by a heading. It contains Title, the heading of the section; Paragraphs, a slice of strings containing the text content found within that section; and Images, a slice of strings for the section's images.
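
The struct tags on these types determine the JSON field names. The sketch below shows the encoded shape using locally redeclared copies of the two types and made-up values; encodeExample is a hypothetical helper, not part of the package. Go's encoding/json writes a nil slice as null, so a consumer should not assume that a field such as images is always an array:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local mirrors of the documented types, redeclared so the sketch is
// self-contained.
type Section struct {
	Title      string   `json:"title"`
	Paragraphs []string `json:"paragraphs"`
	Images     []string `json:"images"`
}

type CollectedData struct {
	Sections []Section `json:"sections"`
	Pages    []string  `json:"pages"`
}

// encodeExample is a hypothetical helper that builds a small CollectedData
// value with illustrative contents and marshals it to a JSON string.
func encodeExample() (string, error) {
	data := CollectedData{
		Sections: []Section{
			// Images is left nil on purpose to show how it is encoded.
			{Title: "Intro", Paragraphs: []string{"First paragraph."}},
		},
		Pages: []string{"https://example.com"},
	}
	out, err := json.Marshal(data)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := encodeExample()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
	// Prints:
	// {"sections":[{"title":"Intro","paragraphs":["First paragraph."],"images":null}],"pages":["https://example.com"]}
}
```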
