Documentation ¶
Overview ¶
Package crawler implements the crawling logic of the application. It's responsible for crawling a website and extracting information from it.
Index ¶
- func ApplyPostProcessingStep(ctx *processContext, step *rs.PostProcessingStep, data *[]byte)
- func ApplyRule(ctx *processContext, rule *rs.ScrapingRule, webPage *selenium.WebDriver) map[string]interface{}
- func ApplyRulesGroup(ctx *processContext, ruleGroup *rs.RuleGroup, url string, ...) (map[string]interface{}, error)
- func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)
- func CrawlWebsite(args CrawlerPars, sel SeleniumInstance, ...)
- func DefaultActionConfig(url string) cfg.SourceConfig
- func DefaultCrawlingConfig(url string) cfg.SourceConfig
- func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)
- func IsValidURIProtocol(u string) bool
- func IsValidURL(u string) bool
- func NewProcessContext(args CrawlerPars) *processContext
- func NewSeleniumService(c cfg.Selenium) (*selenium.Service, error)
- func ProcessHtmlToJson(htmlData string) (string, error)
- func QuitSelenium(wd *selenium.WebDriver)
- func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance, ...)
- func StartCrawler(cf cfg.Config)
- func StopSelenium(sel *selenium.Service) error
- func StrIsHTML(s string) bool
- func UpdateSourceState(db cdb.Handler, sourceURL string, crawlError error)
- type BlockedCookie
- type Cookie
- type CrawlerPars
- type CrawlerStatus
- type LinkItem
- type LogMessage
- type LogParams
- type LogResponseInfo
- type LogResponseTiming
- type LogSecurityDetails
- type MetaTag
- type PageDetails
- type PageInfo
- type PerformanceLog
- type PerformanceLogEntry
- type ScrapedItem
- type ScraperRuleEngine
- type Screenshot
- type SeleniumInstance
- type WebObjectDetails
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ApplyPostProcessingStep ¶ added in v0.9.4
func ApplyPostProcessingStep(ctx *processContext, step *rs.PostProcessingStep, data *[]byte)
ApplyPostProcessingStep applies the provided post-processing step to the provided data.
func ApplyRule ¶ added in v0.9.4
func ApplyRule(ctx *processContext, rule *rs.ScrapingRule, webPage *selenium.WebDriver) map[string]interface{}
ApplyRule applies the provided scraping rule to the provided web page.
func ApplyRulesGroup ¶ added in v0.9.4
func ApplyRulesGroup(ctx *processContext, ruleGroup *rs.RuleGroup, url string, webPage *selenium.WebDriver) (map[string]interface{}, error)
ApplyRulesGroup extracts the data from the provided web page using the provided rule group.
func ConnectSelenium ¶
func ConnectSelenium(sel SeleniumInstance, browseType int) (selenium.WebDriver, error)
ConnectSelenium is responsible for connecting to the Selenium server instance.
func CrawlWebsite ¶
func CrawlWebsite(args CrawlerPars, sel SeleniumInstance, releaseSelenium chan<- SeleniumInstance)
CrawlWebsite is responsible for crawling a website. It is the main entry point and is called from main.go when there is a Source to crawl.
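The send-only releaseSelenium channel in the signature suggests that a crawl hands its Selenium instance back when it finishes. The following is a minimal sketch of that hand-back pattern, not the package's implementation: instance, crawl, and runAll are hypothetical stand-ins, and no real browser is involved.

```go
package main

import (
	"fmt"
	"sync"
)

// instance is a hypothetical stand-in for SeleniumInstance.
type instance struct{ ID int }

// crawl is a hypothetical stand-in for CrawlWebsite. It records the source it
// handled and always returns its instance on the send-only release channel.
func crawl(wg *sync.WaitGroup, src string, inst instance, release chan<- instance, visited chan<- string) {
	defer wg.Done()
	defer func() { release <- inst }() // runs before wg.Done, even on early return
	visited <- src
}

// runAll crawls every source using a pool of n reusable instances.
func runAll(sources []string, n int) []string {
	pool := make(chan instance, n) // buffered, so returning an instance never blocks
	for i := 0; i < n; i++ {
		pool <- instance{ID: i}
	}
	visited := make(chan string, len(sources))
	var wg sync.WaitGroup
	for _, src := range sources {
		inst := <-pool // blocks until some crawl releases an instance
		wg.Add(1)
		go crawl(&wg, src, inst, pool, visited)
	}
	wg.Wait()
	close(visited)
	var out []string
	for v := range visited {
		out = append(out, v)
	}
	return out
}

func main() {
	got := runAll([]string{"https://a.example", "https://b.example", "https://c.example"}, 2)
	fmt.Println(len(got), "sources crawled")
}
```

Because the pool channel has capacity for every instance, the deferred send cannot deadlock, and the number of concurrent crawls is bounded by the number of instances.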
func DefaultActionConfig ¶
func DefaultActionConfig(url string) cfg.SourceConfig
func DefaultCrawlingConfig ¶
func DefaultCrawlingConfig(url string) cfg.SourceConfig
func FuzzURL ¶ added in v0.9.2
func FuzzURL(baseURL string, rule rules.CrawlingRule) ([]string, error)
FuzzURL takes a base URL and a CrawlingRule, generating fuzzed URLs based on the rule's parameters.
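The fields of rules.CrawlingRule are not shown on this page, so the exact fuzzing strategy is unknown. As a rough illustration of what generating fuzzed URLs from a base URL can look like, the sketch below replaces one query parameter with each value from a candidate list. The helper fuzzQueryParam and its parameters are hypothetical and are not part of this package.

```go
package main

import (
	"fmt"
	"net/url"
)

// fuzzQueryParam is a hypothetical helper, not the package's FuzzURL. It
// returns one URL per candidate value, with the named query parameter replaced.
func fuzzQueryParam(baseURL, param string, values []string) ([]string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return nil, fmt.Errorf("parsing base URL: %w", err)
	}
	out := make([]string, 0, len(values))
	for _, v := range values {
		q := u.Query() // freshly parsed copy of the original query
		q.Set(param, v)
		fuzzed := *u // shallow copy so the base URL stays untouched
		fuzzed.RawQuery = q.Encode()
		out = append(out, fuzzed.String())
	}
	return out, nil
}

func main() {
	urls, err := fuzzQueryParam("https://example.com/search?q=x&page=1", "q", []string{"admin", "test"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range urls {
		fmt.Println(s)
	}
}
```

Note that url.Values.Encode sorts parameters by key, so the parameter order in the generated URLs may differ from the base URL.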
func IsValidURIProtocol ¶ added in v0.9.3
func NewProcessContext ¶ added in v0.9.3
func NewProcessContext(args CrawlerPars) *processContext
NewProcessContext creates a new process context.
func NewSeleniumService ¶
NewSeleniumService is responsible for initializing the Selenium driver. The commented-out code could be used to initialize a local Selenium server instead of using only a container-based one. However, I found that the container-based Selenium server is more stable and reliable than the local one, and it is also easier to set up and more secure.
func ProcessHtmlToJson ¶ added in v0.9.3
func QuitSelenium ¶
QuitSelenium is responsible for quitting the Selenium server instance.
func ReturnSeleniumInstance ¶ added in v0.9.2
func ReturnSeleniumInstance(wg *sync.WaitGroup, pCtx *processContext, sel *SeleniumInstance, releaseSelenium chan<- SeleniumInstance)
ReturnSeleniumInstance is responsible for returning the Selenium server instance.
func StartCrawler ¶
StartCrawler is responsible for initializing the crawler.
func StopSelenium ¶
StopSelenium stops the Selenium server instance (if local).
Types ¶
type BlockedCookie ¶ added in v0.9.3
type BlockedCookie struct {
	BlockedReasons []string `json:"blockedReasons"` // The reasons why the cookie was blocked.
	Cookie Cookie `json:"cookie"` // The cookie that was blocked.
	CookieLine string `json:"cookieLine"` // The cookie line.
}
BlockedCookie represents a blocked cookie object.
type Cookie ¶ added in v0.9.3
type Cookie struct {
	Domain string `json:"domain"` // The domain of the cookie.
	Expires float64 `json:"expires"` // The expiration time of the cookie.
	HTTPOnly bool `json:"httpOnly"` // Whether the cookie is HTTP only.
	Name string `json:"name"` // The name of the cookie.
	Path string `json:"path"` // The path of the cookie.
	Priority string `json:"priority"` // The priority of the cookie.
	SameParty bool `json:"sameParty"` // Whether the cookie is from the same party.
	SameSite string `json:"sameSite"` // The same site attribute of the cookie.
	Secure bool `json:"secure"` // Whether the cookie is secure.
	Session bool `json:"session"` // Whether the cookie is a session cookie.
	Size int `json:"size"` // The size of the cookie.
	SourcePort int `json:"sourcePort"` // The source port of the cookie.
	SourceScheme string `json:"sourceScheme"` // The source scheme of the cookie.
	Value string `json:"value"` // The value of the cookie.
}
Cookie represents a cookie object.
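The JSON tags on Cookie determine how a cookie record is decoded. As a small illustration, the sketch below unmarshals a made-up payload into a local type that mirrors a subset of the fields listed above. The cookie type, the decodeCookie helper, and the sample data are assumptions for this example only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cookie mirrors a subset of the Cookie fields documented above.
type cookie struct {
	Domain   string  `json:"domain"`
	Expires  float64 `json:"expires"`
	HTTPOnly bool    `json:"httpOnly"`
	Name     string  `json:"name"`
	Secure   bool    `json:"secure"`
	Session  bool    `json:"session"`
	Value    string  `json:"value"`
}

// decodeCookie is a hypothetical helper that parses one JSON cookie record.
func decodeCookie(raw []byte) (cookie, error) {
	var c cookie
	if err := json.Unmarshal(raw, &c); err != nil {
		return cookie{}, fmt.Errorf("decoding cookie: %w", err)
	}
	return c, nil
}

func main() {
	// Invented sample payload; field names follow the JSON tags above.
	raw := []byte(`{"domain":"example.com","expires":-1,"httpOnly":true,` +
		`"name":"sid","secure":true,"session":true,"value":"abc123"}`)
	c, err := decodeCookie(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s=%s (domain %s, httpOnly %t)\n", c.Name, c.Value, c.Domain, c.HTTPOnly)
}
```

Fields absent from the payload keep their zero values, and unknown keys are ignored by encoding/json by default.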
type CrawlerPars ¶ added in v0.9.3
type CrawlerPars struct {
	WG *sync.WaitGroup
	DB cdb.Handler
	Src cdb.Source
	Sel *chan SeleniumInstance
	SelIdx int
	RE *rules.RuleEngine
	Sources *[]cdb.Source
	Index int
	Status *CrawlerStatus
}
CrawlerPars is a local type used to pass parameters to the crawling goroutine.
type CrawlerStatus ¶ added in v0.9.3
type CrawlerStatus struct {
	PipelineID uint64
	SourceID uint64
	Source string
	TotalPages int
	TotalLinks int
	TotalSkipped int
	TotalDuplicates int
	TotalErrors int
	TotalScraped int
	TotalActions int
	TotalFuzzing int
	StartTime time.Time
	EndTime time.Time
	CurrentDepth int
	LastWait float64
	LastDelay float64
	// Flags values: 0 - Not started yet, 1 - Running, 2 - Completed, 3 - Error
	NetInfoRunning int // Flag to check if network info is already gathered
	HTTPInfoRunning int // Flag to check if HTTP info is already gathered
	PipelineRunning int // Flag to check if site info is already gathered
	CrawlingRunning int // Flag to check if crawling is still running
}
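The four flag fields of CrawlerStatus share the value scheme given in the struct comment (0 not started, 1 running, 2 completed, 3 error). The sketch below shows one way a caller might interpret them. The constant names, the stageFlags type, and the settled method are hypothetical and are not defined by this package.

```go
package main

import "fmt"

// Hypothetical names for the flag values documented in CrawlerStatus.
const (
	flagNotStarted = 0
	flagRunning    = 1
	flagCompleted  = 2
	flagError      = 3
)

// flagName returns a human-readable label for a flag value.
func flagName(v int) string {
	switch v {
	case flagNotStarted:
		return "not started"
	case flagRunning:
		return "running"
	case flagCompleted:
		return "completed"
	case flagError:
		return "error"
	default:
		return "unknown"
	}
}

// stageFlags mirrors only the four flag fields of CrawlerStatus.
type stageFlags struct {
	NetInfoRunning  int
	HTTPInfoRunning int
	PipelineRunning int
	CrawlingRunning int
}

// settled reports whether every stage has either completed or failed.
func (s stageFlags) settled() bool {
	for _, v := range []int{s.NetInfoRunning, s.HTTPInfoRunning, s.PipelineRunning, s.CrawlingRunning} {
		if v != flagCompleted && v != flagError {
			return false
		}
	}
	return true
}

func main() {
	s := stageFlags{
		NetInfoRunning:  flagCompleted,
		HTTPInfoRunning: flagError,
		PipelineRunning: flagCompleted,
		CrawlingRunning: flagRunning,
	}
	fmt.Println("crawling:", flagName(s.CrawlingRunning), "settled:", s.settled())
}
```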
type LinkItem ¶ added in v0.9.3
type LinkItem struct {
	PageURL string `json:"url"`
	PageLevel int `json:"level"`
	Link string `json:"link"`
	ElementID string `json:"element_id"`
}
LinkItem represents a link item collected on a web page.
type LogMessage ¶ added in v0.9.3
type LogMessage struct {
	Method string `json:"method"` // The method of the log message.
	Params LogParams `json:"params"` // The parameters of the log message.
}
LogMessage represents a log message.
type LogParams ¶ added in v0.9.3
type LogParams struct {
	ResponseInfo LogResponseInfo `json:"response"` // The extra information of the response.
	TimeStamp float64 `json:"timestamp"` // The timestamp of the log message.
	Type string `json:"type,omitempty"` // The type of the log message.
}
LogParams represents the parameters of a log message.
type LogResponseInfo ¶ added in v0.9.3
type LogResponseInfo struct {
	BlockedCookies []BlockedCookie `json:"blockedCookies,omitempty"` // The blocked cookies.
	Headers map[string]string `json:"headers,omitempty"` // The headers of the response.
	RequestID string `json:"requestId"` // The ID of the request.
	ResourceIPAddressSpace string `json:"resourceIPAddressSpace,omitempty"` // The IP address space of the resource.
	StatusCode int `json:"statusCode"` // The status code of the response.
	StatusText string `json:"statusText"` // The status text of the response.
	MimeType string `json:"mimeType,omitempty"` // The MIME type of the response.
	Protocol string `json:"protocol,omitempty"` // The protocol of the response.
	RemoteIPAddress string `json:"remoteIPAddress,omitempty"` // The remote IP address.
	RemotePort int `json:"remotePort,omitempty"` // The remote port.
	ResponseTime float64 `json:"responseTime,omitempty"` // The response time.
	SecurityDetails LogSecurityDetails `json:"securityDetails,omitempty"` // Security details of the response.
	SecurityState string `json:"securityState,omitempty"` // Security state of the response.
	Timing LogResponseTiming `json:"timing,omitempty"` // Timing information.
	URL string `json:"url"` // The URL of the response.
}
LogResponseInfo represents additional information about a response in network logs.
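LogMessage, LogParams, and LogResponseInfo nest into one another, so a single network log record can be decoded in one step. The sketch below illustrates this with trimmed local copies of the three types and an invented payload shaped after their JSON tags. The lowercase type names and the decodeLogMessage helper are assumptions for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Trimmed local copies of the documented types, keeping only a few fields.
type logResponseInfo struct {
	RequestID  string `json:"requestId"`
	StatusCode int    `json:"statusCode"`
	StatusText string `json:"statusText"`
	MimeType   string `json:"mimeType,omitempty"`
	URL        string `json:"url"`
}

type logParams struct {
	ResponseInfo logResponseInfo `json:"response"`
	TimeStamp    float64         `json:"timestamp"`
	Type         string          `json:"type,omitempty"`
}

type logMessage struct {
	Method string    `json:"method"`
	Params logParams `json:"params"`
}

// decodeLogMessage is a hypothetical helper that parses one log record.
func decodeLogMessage(raw []byte) (logMessage, error) {
	var m logMessage
	if err := json.Unmarshal(raw, &m); err != nil {
		return logMessage{}, fmt.Errorf("decoding log message: %w", err)
	}
	return m, nil
}

func main() {
	// Invented sample record; the layout follows the JSON tags above.
	raw := []byte(`{"method":"Network.responseReceived","params":{` +
		`"response":{"requestId":"1","statusCode":200,"statusText":"OK",` +
		`"mimeType":"text/html","url":"https://example.com/"},` +
		`"timestamp":12.5,"type":"Document"}}`)
	m, err := decodeLogMessage(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	r := m.Params.ResponseInfo
	fmt.Printf("%s %d %s %s\n", m.Method, r.StatusCode, r.StatusText, r.URL)
}
```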
type LogResponseTiming ¶ added in v0.9.3
type LogResponseTiming struct {
	ConnectEnd float64 `json:"connectEnd"`
	ConnectStart float64 `json:"connectStart"`
	DNSEnd float64 `json:"dnsEnd"`
	DNSStart float64 `json:"dnsStart"`
	ReceiveHeadersEnd float64 `json:"receiveHeadersEnd"`
	RequestTime float64 `json:"requestTime"`
	SendEnd float64 `json:"sendEnd"`
	SendStart float64 `json:"sendStart"`
	SSLStart float64 `json:"sslStart"`
	SSLEnd float64 `json:"sslEnd"`
	WorkerStart float64 `json:"workerStart"`
	WorkerFetchStart float64 `json:"workerFetchStart"`
	WorkerReady float64 `json:"workerReady"`
	WorkerRespondWithSettled float64 `json:"workerRespondWithSettled"`
}
LogResponseTiming holds timing information from the network logs.
type LogSecurityDetails ¶ added in v0.9.3
type LogSecurityDetails struct {
	CertificateID int `json:"certificateId"`
	CertificateTransparencyCompliance string `json:"certificateTransparencyCompliance"`
	Cipher string `json:"cipher"`
	EncryptedClientHello bool `json:"encryptedClientHello"`
	Issuer string `json:"issuer"`
	KeyExchange string `json:"keyExchange"`
	KeyExchangeGroup string `json:"keyExchangeGroup"`
	Protocol string `json:"protocol"`
	SANList []string `json:"sanList"`
	ServerSignatureAlgorithm int `json:"serverSignatureAlgorithm"`
	SubjectName string `json:"subjectName"`
	ValidFrom float64 `json:"validFrom"`
	ValidTo float64 `json:"validTo"`
}
LogSecurityDetails holds detailed security information from the network logs.
type PageDetails ¶ added in v0.9.3
type PageDetails struct {
	URL string `json:"URL"` // The URL of the web page.
	Title string `json:"title"` // The title of the web page.
	PerfInfo []PerformanceLog `json:"performance"` // The performance information of the web page.
	Links []string `json:"links"` // The links found in the web page.
}
PageDetails represents the details of a collected web page.
type PageInfo ¶
type PageInfo struct {
	URL string `json:"URL"` // The URL of the web page.
	Title string `json:"title"` // The title of the web page.
	Summary string `json:"summary"` // A summary of the web page content.
	BodyText string `json:"body_text"` // The main body text of the web page.
	HTML string `json:"html"` // The HTML content of the web page.
	MetaTags []MetaTag `json:"meta_tags"` // The meta tags of the web page.
	Keywords map[string]string `json:"keywords"` // The keywords of the web page.
	DetectedType string `json:"detected_type"` // The detected document type of the web page.
	DetectedLang string `json:"detected_lang"` // The detected language of the web page.
	NetInfo *neti.NetInfo `json:"net_info"` // The network information of the web page.
	HTTPInfo *httpi.HTTPDetails `json:"http_info"` // The HTTP header information of the web page.
	ScrapedData []ScrapedItem `json:"scraped_data"` // The scraped data from the web page.
	Links []LinkItem `json:"links"` // The links found in the web page.
	PerfInfo PerformanceLog `json:"performance"` // The performance information of the web page.
	DetectedTech map[string]detect.DetectedEntity `json:"detected_tech"` // The detected technologies of the web page.
	Config *cfg.Config `json:"config"` // The configuration of the web page.
	// contains filtered or unexported fields
}
PageInfo represents the information of a web page.
type PerformanceLog ¶ added in v0.9.3
type PerformanceLog struct {
	TCPConnection float64 `json:"tcp_connection"` // The time to establish a TCP connection.
	TimeToFirstByte float64 `json:"time_to_first_byte"` // The time to first byte.
	ContentLoad float64 `json:"content_load"` // The time to load the content.
	DNSLookup float64 `json:"dns_lookup"` // Number of DNS lookups.
	PageLoad float64 `json:"page_load"` // The time to load the page.
	LogEntries []PerformanceLogEntry `json:"log_entries"` // The log entries of the web page.
}
type PerformanceLogEntry ¶ added in v0.9.3
type PerformanceLogEntry struct {
	Message LogMessage `json:"message"` // The log message.
	Webview string `json:"webview"` // The webview.
}
PerformanceLogEntry represents a single performance log entry.
type ScrapedItem ¶ added in v0.9.3
type ScrapedItem map[string]interface{}
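Since ScrapedItem is a plain map[string]interface{}, it can hold arbitrary scraped fields and serialize directly to JSON. The sketch below shows this with a local scrapedItem type of the same shape. The toJSON helper and the sample keys are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scrapedItem has the same underlying type as the documented ScrapedItem.
type scrapedItem map[string]interface{}

// toJSON is a hypothetical helper that serializes one scraped item.
func toJSON(item scrapedItem) (string, error) {
	b, err := json.Marshal(item)
	if err != nil {
		return "", fmt.Errorf("encoding scraped item: %w", err)
	}
	return string(b), nil
}

func main() {
	// Invented sample data; real keys depend on the scraping rules in use.
	item := scrapedItem{"title": "Example", "price": 9.5}
	s, err := toJSON(item)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

encoding/json writes map keys in sorted order, so the output is stable across runs even though Go map iteration order is not.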
type ScraperRuleEngine ¶
type ScraperRuleEngine struct {
	*rs.RuleEngine // generic rule engine
}
ScraperRuleEngine extends RuleEngine from the ruleset package.
type Screenshot ¶
type Screenshot struct {
	IndexID uint64 `json:"index_id"`
	ScreenshotLink string `json:"screenshot_link"`
	Height int `json:"height"`
	Width int `json:"width"`
	ByteSize int `json:"byte_size"`
	ThumbnailHeight int `json:"thumbnail_height"`
	ThumbnailWidth int `json:"thumbnail_width"`
	ThumbnailLink string `json:"thumbnail_link"`
	Format string `json:"format"`
}
Screenshot represents the metadata of a webpage screenshot.
func TakeScreenshot ¶
TakeScreenshot is responsible for taking a screenshot of the current page.
type SeleniumInstance ¶
SeleniumInstance holds a Selenium service and its configuration.
type WebObjectDetails ¶ added in v0.9.3
type WebObjectDetails struct {
	ScrapedData ScrapedItem `json:"scraped_data"` // The scraped data from the web page.
	Links []string `json:"links"` // The links found in the web page.
	PerfInfo PerformanceLog `json:"performance"` // The performance information of the web page.
	DetectedTech map[string]detect.DetectedEntity `json:"detected_tech"` // The detected technologies of the web page.
}
WebObjectDetails represents the details of a web object.