crawler

package

v0.9.4 | Published: Jun 21, 2024 | License: Apache-2.0 | Imports: 41 | Imported by: 0

Repository

github.com/pzaino/thecrowler

Documentation ¶

Overview ¶

Package crawler implements the crawling logic of the application. It's responsible for crawling a website and extracting information from it.

Index ¶

func ApplyPostProcessingStep(ctx *processContext, step *rs.PostProcessingStep, data *[]byte)
func ApplyRule(ctx *processContext, rule *rs.ScrapingRule, webPage *selenium.WebDriver) map[string]interface{}
func ApplyRulesGroup(ctx *processContext, ruleGroup *rs.RuleGroup, url string, ...) (map[string]interface{}, error)
func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)
func CrawlWebsite(args CrawlerPars, sel SeleniumInstance, ...)
func DefaultActionConfig(url string) cfg.SourceConfig
func DefaultCrawlingConfig(url string) cfg.SourceConfig
func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)
func IsValidURIProtocol(u string) bool
func IsValidURL(u string) bool
func NewProcessContext(args CrawlerPars) *processContext
func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)
func ProcessHtmlToJson(htmlData string) (string, error)
func QuitSelenium(wd *selenium.WebDriver)
func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance, ...)
func StartCrawler(cf cfg.Config)
func StopSelenium(sel *selenium.Service) error
func StrIsHTML(s string) bool
func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)
type BlockedCookie
type Cookie
type CrawlerPars
type CrawlerStatus
type LinkItem
type LogMessage
type LogParams
type LogResponseInfo
type LogResponseTiming
type LogSecurityDetails
type MetaTag
type PageDetails
type PageInfo
type PerformanceLog
type PerformanceLogEntry
type ScrapedItem
type ScraperRuleEngine
type Screenshot
- func TakeScreenshot(wd *selenium.WebDriver, filename string, maxHeight int) (Screenshot, error)
type SeleniumInstance
type WebObjectDetails

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func ApplyPostProcessingStep ¶ added in v0.9.4

func ApplyPostProcessingStep(ctx *processContext, step *rs.PostProcessingStep, data *[]byte)

ApplyPostProcessingStep applies the provided post-processing step to the provided data.
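
The `data *[]byte` parameter suggests that a step rewrites the scraped bytes in place instead of returning a new value. The sketch below illustrates that pattern only. The `postStep` type, its fields, and the "trim" and "replace" step kinds are hypothetical stand-ins, not the actual `rs.PostProcessingStep` definition.

```go
package main

import (
	"bytes"
	"fmt"
)

// postStep is a hypothetical stand-in for rs.PostProcessingStep.
// The field names and step kinds are assumptions made for this example.
type postStep struct {
	Type   string // "trim" or "replace" (illustrative values only)
	Target string // text to look for when Type is "replace"
	With   string // replacement text when Type is "replace"
}

// applyStep rewrites the slice behind data, so the caller observes the
// result without needing a return value.
func applyStep(step postStep, data *[]byte) {
	switch step.Type {
	case "trim":
		*data = bytes.TrimSpace(*data)
	case "replace":
		*data = bytes.ReplaceAll(*data, []byte(step.Target), []byte(step.With))
	}
}

func main() {
	data := []byte("  price: 10 USD  ")
	applyStep(postStep{Type: "trim"}, &data)
	applyStep(postStep{Type: "replace", Target: "USD", With: "EUR"}, &data)
	fmt.Printf("%q\n", data) // "price: 10 EUR"
}
```

Passing a pointer to the slice matters here because both operations may return a slice with a different length or backing array.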

func ApplyRule ¶ added in v0.9.4

func ApplyRule(ctx *processContext, rule *rs.ScrapingRule, webPage *selenium.WebDriver) map[string]interface{}

ApplyRule applies the provided scraping rule to the provided web page.

func ApplyRulesGroup ¶ added in v0.9.4

func ApplyRulesGroup(ctx *processContext, ruleGroup *rs.RuleGroup, url string, webPage *selenium.WebDriver) (map[string]interface{}, error)

ApplyRulesGroup extracts the data from the provided web page using the provided rule group.

func ConnectSelenium ¶

func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)

ConnectSelenium is responsible for connecting to the Selenium server instance

func CrawlWebsite ¶

func CrawlWebsite(args CrawlerPars, sel SeleniumInstance, releaseSelenium chan<- SeleniumInstance)

CrawlWebsite is responsible for crawling a website. It is the main entry point and is called from main.go when there is a Source to crawl.
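
The `releaseSelenium chan<- SeleniumInstance` parameter hints at a pool in which each crawl borrows a browser instance and hands it back on a channel when it finishes. The following is a minimal sketch of that pattern using only the standard library. The `instance` type and the `crawl` and `runAll` functions are hypothetical, and the real scheduling logic may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// instance is a hypothetical stand-in for SeleniumInstance.
type instance struct{ ID int }

// crawl simulates one crawl job. It reports a result, then returns the
// borrowed instance on the release channel before signalling completion.
func crawl(wg *sync.WaitGroup, source string, inst instance, release chan<- instance, results chan<- string) {
	defer wg.Done()
	defer func() { release <- inst }() // runs before wg.Done (deferred calls are LIFO)
	results <- source + " crawled"
}

// runAll crawls every source while never using more than poolSize instances.
func runAll(sources []string, poolSize int) []string {
	pool := make(chan instance, poolSize)
	for i := 0; i < poolSize; i++ {
		pool <- instance{ID: i}
	}

	results := make(chan string, len(sources))
	var wg sync.WaitGroup
	for _, src := range sources {
		inst := <-pool // blocks until an instance is free
		wg.Add(1)
		go crawl(&wg, src, inst, pool, results)
	}
	wg.Wait()
	close(results)

	var out []string
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	done := runAll([]string{"https://a.example", "https://b.example", "https://c.example"}, 2)
	fmt.Println(len(done), "sources crawled") // 3 sources crawled
}
```

The buffered channel doubles as a semaphore: a third crawl cannot start until one of the first two has released its instance.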

func DefaultActionConfig ¶

func DefaultActionConfig(url string) cfg.SourceConfig

func DefaultCrawlingConfig ¶

func DefaultCrawlingConfig(url string) cfg.SourceConfig

func FuzzURL ¶ added in v0.9.2

func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)

FuzzURL takes a base URL and a CrawlingRule, generating fuzzed URLs based on the rule's parameters.
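
The fields of `rules.CrawlingRule` are not shown on this page, so the sketch below does not reproduce FuzzURL itself. It illustrates one common form of URL fuzzing, substituting a list of values into a query parameter, under the assumption that a rule boils down to a parameter name and candidate values. `fuzzQueryParam` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// fuzzQueryParam returns one URL per value, each with param set to that
// value. The name and signature are assumptions made for this example.
func fuzzQueryParam(baseURL, param string, values []string) ([]string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(values))
	for _, v := range values {
		q := u.Query() // fresh copy of the original query on each iteration
		q.Set(param, v)
		fuzzed := *u // shallow copy so the base URL stays untouched
		fuzzed.RawQuery = q.Encode()
		out = append(out, fuzzed.String())
	}
	return out, nil
}

func main() {
	urls, err := fuzzQueryParam("https://example.com/search?q=go", "page", []string{"1", "2"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, u := range urls {
		fmt.Println(u)
	}
	// https://example.com/search?page=1&q=go
	// https://example.com/search?page=2&q=go
}
```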

func IsValidURIProtocol ¶ added in v0.9.3

func IsValidURIProtocol(u string) bool

func IsValidURL ¶

func IsValidURL(u string) bool

IsValidURL checks if the string is a valid URL.
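
The exact rules IsValidURL applies are not documented here. As a rough illustration of this kind of check, the sketch below accepts a string only if it parses and has an http or https scheme plus a host. `looksLikeURL` is a hypothetical helper, and the accepted schemes are an assumption.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// looksLikeURL reports whether s parses as an absolute http(s) URL with a
// host. This is an illustrative check, not the package's implementation.
func looksLikeURL(s string) bool {
	u, err := url.Parse(strings.TrimSpace(s))
	if err != nil {
		return false
	}
	return (u.Scheme == "http" || u.Scheme == "https") && u.Host != ""
}

func main() {
	fmt.Println(looksLikeURL("https://example.com/a")) // true
	fmt.Println(looksLikeURL("not a url"))             // false
	fmt.Println(looksLikeURL("http://"))               // false
}
```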

func NewProcessContext ¶ added in v0.9.3

func NewProcessContext(args CrawlerPars) *processContext

NewProcessContext creates a new process context

func NewSeleniumService ¶

func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)

NewSeleniumService is responsible for initializing the Selenium driver. The commented-out code could be used to initialize a local Selenium server instead of using only a container-based one. However, I found that the container-based Selenium server is more stable and reliable than the local one. It is also easier to set up and more secure.

func ProcessHtmlToJson ¶ added in v0.9.3

func ProcessHtmlToJson(htmlData string) (string, error)

func QuitSelenium ¶

func QuitSelenium(wd *selenium.WebDriver)

QuitSelenium is responsible for quitting the Selenium server instance

func ReturnSeleniumInstance ¶ added in v0.9.2

func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance, releaseSelenium chan<- SeleniumInstance)

ReturnSeleniumInstance is responsible for returning the Selenium server instance

func StartCrawler ¶

func StartCrawler(cf cfg.Config)

StartCrawler is responsible for initializing the crawler

func StopSelenium ¶

func StopSelenium(sel *selenium.Service) error

StopSelenium stops the Selenium server instance (if local)

func StrIsHTML ¶ added in v0.9.4

func StrIsHTML(s string) bool

StrIsHTML checks if the given string could be HTML by trying to parse it.
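
To show what "trying to parse it" can look like, here is a sketch that uses the standard library's `encoding/xml` decoder in its lenient HTML mode as a stand-in for a real HTML parser. Which parser StrIsHTML actually uses is not stated on this page, and `couldBeHTML` is a hypothetical name. The heuristic treats a string as possible HTML once it yields at least one element start tag.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// couldBeHTML reports whether s contains at least one tag that a lenient
// tokenizer can read. It is an illustrative heuristic, not StrIsHTML itself.
func couldBeHTML(s string) bool {
	d := xml.NewDecoder(strings.NewReader(s))
	d.Strict = false                 // tolerate common HTML sloppiness
	d.AutoClose = xml.HTMLAutoClose  // treat <br>, <img>, ... as self-closing
	d.Entity = xml.HTMLEntity        // accept named entities such as &nbsp;
	for {
		tok, err := d.Token()
		if err != nil {
			return false // reached the end (or an error) without seeing a tag
		}
		if _, ok := tok.(xml.StartElement); ok {
			return true
		}
	}
}

func main() {
	fmt.Println(couldBeHTML("<p>Hello <b>world</b></p>")) // true
	fmt.Println(couldBeHTML("just some text"))            // false
}
```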

func UpdateSourceState ¶ added in v0.9.2

func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)

UpdateSourceState is responsible for updating the state of a Source in the database after crawling it, taking any crawl error into account.

Types ¶

type BlockedCookie ¶ added in v0.9.3

type BlockedCookie struct {
	BlockedReasons []string `json:"blockedReasons"` // The reasons why the cookie was blocked.
	Cookie         Cookie   `json:"cookie"`         // The cookie that was blocked.
	CookieLine     string   `json:"cookieLine"`     // The cookie line.
}

BlockedCookie represents a blocked cookie object

type Cookie ¶

type Cookie struct {
	Domain       string  `json:"domain"`       // The domain of the cookie.
	Expires      float64 `json:"expires"`      // The expiration time of the cookie.
	HTTPOnly     bool    `json:"httpOnly"`     // Whether the cookie is HTTP only.
	Name         string  `json:"name"`         // The name of the cookie.
	Path         string  `json:"path"`         // The path of the cookie.
	Priority     string  `json:"priority"`     // The priority of the cookie.
	SameParty    bool    `json:"sameParty"`    // Whether the cookie is from the same party.
	SameSite     string  `json:"sameSite"`     // The same site attribute of the cookie.
	Secure       bool    `json:"secure"`       // Whether the cookie is secure.
	Session      bool    `json:"session"`      // Whether the cookie is a session cookie.
	Size         int     `json:"size"`         // The size of the cookie.
	SourcePort   int     `json:"sourcePort"`   // The source port of the cookie.
	SourceScheme string  `json:"sourceScheme"` // The source scheme of the cookie.
	Value        string  `json:"value"`        // The value of the cookie.
}

Cookie represents a cookie object

type CrawlerPars ¶ added in v0.9.3

type CrawlerPars struct {
	WG      *sync.WaitGroup
	DB      cdb.Handler
	Src     cdb.Source
	Sel     *chan SeleniumInstance
	SelIdx  int
	RE      *rules.RuleEngine
	Sources *[]cdb.Source
	Index   int
	Status  *CrawlerStatus
}

CrawlerPars is a local type used to pass parameters to the crawler goroutine

type CrawlerStatus ¶ added in v0.9.3

type CrawlerStatus struct {
	PipelineID      uint64
	SourceID        uint64
	Source          string
	TotalPages      int
	TotalLinks      int
	TotalSkipped    int
	TotalDuplicates int
	TotalErrors     int
	TotalScraped    int
	TotalActions    int
	TotalFuzzing    int
	StartTime       time.Time
	EndTime         time.Time
	CurrentDepth    int
	LastWait        float64
	LastDelay       float64
	// Flags values: 0 - Not started yet, 1 - Running, 2 - Completed, 3 - Error
	NetInfoRunning  int // Flag to check if network info is already gathered
	HTTPInfoRunning int // Flag to check if HTTP info is already gathered
	PipelineRunning int // Flag to check if site info is already gathered
	CrawlingRunning int // Flag to check if crawling is still running
}
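
The comment inside CrawlerStatus lists four flag values (0 to 3) stored as plain `int` fields. When reading those flags in logs, it can help to map them to names. The sketch below does that with a hypothetical `runState` type. The package itself declares no such type or constants.

```go
package main

import "fmt"

// runState mirrors the flag values documented in CrawlerStatus.
// The type and constant names are assumptions made for this example.
type runState int

const (
	notStarted runState = iota // 0 - Not started yet
	running                    // 1 - Running
	completed                  // 2 - Completed
	failed                     // 3 - Error
)

// String makes the flag readable when printed or logged.
func (s runState) String() string {
	switch s {
	case notStarted:
		return "not started"
	case running:
		return "running"
	case completed:
		return "completed"
	case failed:
		return "error"
	default:
		return fmt.Sprintf("unknown(%d)", int(s))
	}
}

func main() {
	crawlingRunning := 1 // e.g. the value of CrawlerStatus.CrawlingRunning
	fmt.Println(runState(crawlingRunning)) // running
}
```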

type LinkItem ¶ added in v0.9.3

type LinkItem struct {
	PageURL   string `json:"url"`
	PageLevel int    `json:"level"`
	Link      string `json:"link"`
	ElementID string `json:"element_id"`
}

LinkItem represents a link item collected on a web page

type LogMessage ¶ added in v0.9.3

type LogMessage struct {
	Method string    `json:"method"` // The method of the log message.
	Params LogParams `json:"params"` // The parameters of the log message.
}

LogMessage represents a log message

type LogParams ¶ added in v0.9.3

type LogParams struct {
	ResponseInfo LogResponseInfo `json:"response"`       // The extra information of the response.
	TimeStamp    float64         `json:"timestamp"`      // The timestamp of the log message.
	Type         string          `json:"type,omitempty"` // The type of the log message.
}

LogParams represents the parameters of a log message

type LogResponseInfo ¶ added in v0.9.3

type LogResponseInfo struct {
	BlockedCookies         []BlockedCookie    `json:"blockedCookies,omitempty"`         // The blocked cookies.
	Headers                map[string]string  `json:"headers,omitempty"`                // The headers of the response.
	RequestID              string             `json:"requestId"`                        // The ID of the request.
	ResourceIPAddressSpace string             `json:"resourceIPAddressSpace,omitempty"` // The IP address space of the resource.
	StatusCode             int                `json:"statusCode"`                       // The status code of the response.
	StatusText             string             `json:"statusText"`                       // The status text of the response.
	MimeType               string             `json:"mimeType,omitempty"`               // The MIME type of the response.
	Protocol               string             `json:"protocol,omitempty"`               // The protocol of the response.
	RemoteIPAddress        string             `json:"remoteIPAddress,omitempty"`        // The remote IP address.
	RemotePort             int                `json:"remotePort,omitempty"`             // The remote port.
	ResponseTime           float64            `json:"responseTime,omitempty"`           // The response time.
	SecurityDetails        LogSecurityDetails `json:"securityDetails,omitempty"`        // Security details of the response.
	SecurityState          string             `json:"securityState,omitempty"`          // Security state of the response.
	Timing                 LogResponseTiming  `json:"timing,omitempty"`                 // Timing information.
	URL                    string             `json:"url"`                              // The URL of the response.
}

LogResponseInfo represents additional information about a response in network logs.
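
The JSON tags on LogMessage, LogParams and LogResponseInfo describe how a network log entry is decoded. The sketch below uses trimmed local copies of those structs (same tags, fewer fields) so that it runs without the package. The sample payload is hand-written to match the tags and is not captured browser output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the documented structs, kept to a few fields.
type logResponseInfo struct {
	RequestID  string `json:"requestId"`
	StatusCode int    `json:"statusCode"`
	StatusText string `json:"statusText"`
	URL        string `json:"url"`
}

type logParams struct {
	ResponseInfo logResponseInfo `json:"response"`
	TimeStamp    float64         `json:"timestamp"`
	Type         string          `json:"type,omitempty"`
}

type logMessage struct {
	Method string    `json:"method"`
	Params logParams `json:"params"`
}

// decodeLogMessage is a hypothetical helper that parses one log entry.
func decodeLogMessage(raw string) (logMessage, error) {
	var m logMessage
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func main() {
	// Hand-written sample shaped after the struct tags above.
	raw := `{"method":"Network.responseReceived","params":{"timestamp":1234.5,"type":"Document","response":{"requestId":"1","statusCode":200,"statusText":"OK","url":"https://example.com/"}}}`
	m, err := decodeLogMessage(raw)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(m.Method, m.Params.ResponseInfo.StatusCode, m.Params.ResponseInfo.URL)
	// Network.responseReceived 200 https://example.com/
}
```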

type LogResponseTiming ¶ added in v0.9.3

type LogResponseTiming struct {
	ConnectEnd               float64 `json:"connectEnd"`
	ConnectStart             float64 `json:"connectStart"`
	DNSEnd                   float64 `json:"dnsEnd"`
	DNSStart                 float64 `json:"dnsStart"`
	ReceiveHeadersEnd        float64 `json:"receiveHeadersEnd"`
	RequestTime              float64 `json:"requestTime"`
	SendEnd                  float64 `json:"sendEnd"`
	SendStart                float64 `json:"sendStart"`
	SSLStart                 float64 `json:"sslStart"`
	SSLEnd                   float64 `json:"sslEnd"`
	WorkerStart              float64 `json:"workerStart"`
	WorkerFetchStart         float64 `json:"workerFetchStart"`
	WorkerReady              float64 `json:"workerReady"`
	WorkerRespondWithSettled float64 `json:"workerRespondWithSettled"`
}

LogResponseTiming holds timing information from the network logs.
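
The start/end pairs in this struct lend themselves to computing per-phase durations such as DNS lookup or connection time. The sketch below assumes the fields are offsets in the same unit and that a negative value marks a phase that did not occur, which is how browser network timing is commonly reported. Neither assumption is confirmed by this page, and `phase` is a hypothetical helper.

```go
package main

import "fmt"

// timing is a trimmed local copy of LogResponseTiming.
type timing struct {
	DNSStart, DNSEnd         float64
	ConnectStart, ConnectEnd float64
}

// phase returns end-start when both marks are usable. The second result
// is false when the phase is missing (negative start) or inconsistent.
func phase(start, end float64) (float64, bool) {
	if start < 0 || end < start {
		return 0, false
	}
	return end - start, true
}

func main() {
	// Illustrative values only. The connection was reused, so no connect phase.
	t := timing{DNSStart: 1.5, DNSEnd: 4.0, ConnectStart: -1, ConnectEnd: -1}

	if d, ok := phase(t.DNSStart, t.DNSEnd); ok {
		fmt.Println("dns:", d) // dns: 2.5
	}
	if _, ok := phase(t.ConnectStart, t.ConnectEnd); !ok {
		fmt.Println("connect: not measured")
	}
}
```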

type LogSecurityDetails ¶ added in v0.9.3

type LogSecurityDetails struct {
	CertificateID                     int      `json:"certificateId"`
	CertificateTransparencyCompliance string   `json:"certificateTransparencyCompliance"`
	Cipher                            string   `json:"cipher"`
	EncryptedClientHello              bool     `json:"encryptedClientHello"`
	Issuer                            string   `json:"issuer"`
	KeyExchange                       string   `json:"keyExchange"`
	KeyExchangeGroup                  string   `json:"keyExchangeGroup"`
	Protocol                          string   `json:"protocol"`
	SANList                           []string `json:"sanList"`
	ServerSignatureAlgorithm          int      `json:"serverSignatureAlgorithm"`
	SubjectName                       string   `json:"subjectName"`
	ValidFrom                         float64  `json:"validFrom"`
	ValidTo                           float64  `json:"validTo"`
}

LogSecurityDetails holds detailed security information from the network logs.

type MetaTag ¶

type MetaTag struct {
	Name    string
	Content string
}

MetaTag represents a single meta tag, including its name and content.

type PageDetails ¶ added in v0.9.3

type PageDetails struct {
	URL      string           `json:"URL"`         // The URL of the web page.
	Title    string           `json:"title"`       // The title of the web page.
	PerfInfo []PerformanceLog `json:"performance"` // The performance information of the web page.
	Links    []string         `json:"links"`       // The links found in the web page.
}

PageDetails represents the details of a collected web page

type PageInfo ¶

type PageInfo struct {
	URL string `json:"URL"` // The URL of the web page.

	Title        string                           `json:"title"`         // The title of the web page.
	Summary      string                           `json:"summary"`       // A summary of the web page content.
	BodyText     string                           `json:"body_text"`     // The main body text of the web page.
	HTML         string                           `json:"html"`          // The HTML content of the web page.
	MetaTags     []MetaTag                        `json:"meta_tags"`     // The meta tags of the web page.
	Keywords     map[string]string                `json:"keywords"`      // The keywords of the web page.
	DetectedType string                           `json:"detected_type"` // The detected document type of the web page.
	DetectedLang string                           `json:"detected_lang"` // The detected language of the web page.
	NetInfo      *neti.NetInfo                    `json:"net_info"`      // The network information of the web page.
	HTTPInfo     *httpi.HTTPDetails               `json:"http_info"`     // The HTTP header information of the web page.
	ScrapedData  []ScrapedItem                    `json:"scraped_data"`  // The scraped data from the web page.
	Links        []LinkItem                       `json:"links"`         // The links found in the web page.
	PerfInfo     PerformanceLog                   `json:"performance"`   // The performance information of the web page.
	DetectedTech map[string]detect.DetectedEntity `json:"detected_tech"` // The detected technologies of the web page.
	Config       *cfg.Config                      `json:"config"`        // The configuration of the web page.
	// contains filtered or unexported fields
}

PageInfo represents the information of a web page.

type PerformanceLog ¶ added in v0.9.3

type PerformanceLog struct {
	TCPConnection   float64               `json:"tcp_connection"`     // The time to establish a TCP connection.
	TimeToFirstByte float64               `json:"time_to_first_byte"` // The time to first byte.
	ContentLoad     float64               `json:"content_load"`       // The time to load the content.
	DNSLookup       float64               `json:"dns_lookup"`         // Number of DNS lookups.
	PageLoad        float64               `json:"page_load"`          // The time to load the page.
	LogEntries      []PerformanceLogEntry `json:"log_entries"`        // The log entries of the web page.
}

type PerformanceLogEntry ¶ added in v0.9.3

type PerformanceLogEntry struct {
	Message LogMessage `json:"message"` // The log message.
	Webview string     `json:"webview"` // The webview.
}

PerformanceLogEntry represents a single entry in the performance log

type ScrapedItem ¶ added in v0.9.3

type ScrapedItem map[string]interface{}

type ScraperRuleEngine ¶

type ScraperRuleEngine struct {
	*rs.RuleEngine // generic rule engine
}

ScraperRuleEngine extends RuleEngine from the ruleset package
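
"Extends" here is expressed through Go struct embedding: the embedded `*rs.RuleEngine` promotes its methods to the outer type, which can then add its own. The sketch below shows the mechanism with hypothetical `ruleEngine` and `scraperEngine` types. The `Count` and `Describe` methods are invented for illustration and are not part of the ruleset package.

```go
package main

import "fmt"

// ruleEngine is a hypothetical stand-in for rs.RuleEngine.
type ruleEngine struct {
	rules []string
}

// Count is an illustrative method defined on the embedded type.
func (e *ruleEngine) Count() int { return len(e.rules) }

// scraperEngine embeds *ruleEngine, the same shape as ScraperRuleEngine.
type scraperEngine struct {
	*ruleEngine // generic rule engine
}

// Describe is added by the outer type and calls the promoted Count method.
func (s scraperEngine) Describe() string {
	return fmt.Sprintf("scraper engine with %d rules", s.Count())
}

func main() {
	s := scraperEngine{&ruleEngine{rules: []string{"title", "price"}}}
	fmt.Println(s.Count())    // 2 (promoted from ruleEngine)
	fmt.Println(s.Describe()) // scraper engine with 2 rules
}
```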

type Screenshot ¶

type Screenshot struct {
	IndexID         uint64 `json:"index_id"`
	ScreenshotLink  string `json:"screenshot_link"`
	Height          int    `json:"height"`
	Width           int    `json:"width"`
	ByteSize        int    `json:"byte_size"`
	ThumbnailHeight int    `json:"thumbnail_height"`
	ThumbnailWidth  int    `json:"thumbnail_width"`
	ThumbnailLink   string `json:"thumbnail_link"`
	Format          string `json:"format"`
}

Screenshot represents the metadata of a webpage screenshot

func TakeScreenshot ¶

func TakeScreenshot(wd *selenium.WebDriver, filename string, maxHeight int) (Screenshot, error)

TakeScreenshot is responsible for taking a screenshot of the current page

type SeleniumInstance ¶

type SeleniumInstance struct {
	Service *selenium.Service
	Config  cfg.Selenium
	Mutex   *sync.Mutex
}

SeleniumInstance holds a Selenium service and its configuration

type WebObjectDetails ¶ added in v0.9.3

type WebObjectDetails struct {
	ScrapedData  ScrapedItem                      `json:"scraped_data"`  // The scraped data from the web page.
	Links        []string                         `json:"links"`         // The links found in the web page.
	PerfInfo     PerformanceLog                   `json:"performance"`   // The performance information of the web page.
	DetectedTech map[string]detect.DetectedEntity `json:"detected_tech"` // The detected technologies of the web page.
}

WebObjectDetails represents the details of a web object.
