spider

package
v0.7.0
Published: Feb 20, 2025 License: Apache-2.0 Imports: 8 Imported by: 0

Documentation

Overview

Package spider utilizes a combination of tooling, including Katana, to perform analysis on HTTP server URLs and feed the results into the other subpackages of sleuth for analysis

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MapOptionsToTypesOptions

func MapOptionsToTypesOptions(options *Options) *types.Options

MapOptionsToTypesOptions maps Options parameters to types.Options

Types

type LinkDetails

type LinkDetails struct {
	Link         string   `json:"link" yaml:"link"`
	Status       int      `json:"status" yaml:"status"`
	Technologies []string `json:"technologies" yaml:"technologies"`
}

LinkDetails provides the details of a single link found during a web spider operation

type OnResultCallback

type OnResultCallback func(result output.Result) error

OnResultCallback is a callback function that is called when a result is found
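The callback shape can be sketched with a stand-in Result type (katana's real output.Result carries the full request/response pair); runCrawl below is a hypothetical driver, not part of this package:

```go
package main

import "fmt"

// Result is a stand-in for katana's output.Result.
type Result struct{ URL string }

// OnResultCallback matches the package's signature shape:
// invoked per result, and may return an error.
type OnResultCallback func(result Result) error

// runCrawl is a hypothetical driver that invokes the callback
// once per discovered URL, stopping on the first callback error.
func runCrawl(urls []string, cb OnResultCallback) error {
	for _, u := range urls {
		if err := cb(Result{URL: u}); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var seen []string
	collect := func(r Result) error {
		seen = append(seen, r.URL)
		return nil
	}
	_ = runCrawl([]string{"https://example.com/", "https://example.com/about"}, collect)
	fmt.Println(len(seen)) // prints: 2
}
```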

type Option

type Option func(*Options)

Option is a functional option for the katana crawler
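The functional option pattern behind this type can be sketched in miniature. The struct below is a local stand-in mirroring a few Options fields, and the defaults in NewOptions are assumed for illustration, not the package's actual defaults:

```go
package main

import "fmt"

// Options is a local stand-in for a slice of the package's Options struct.
type Options struct {
	MaxDepth    int
	Concurrency int
	Headless    bool
}

// Option mutates an Options value, matching `type Option func(*Options)`.
type Option func(*Options)

func WithMaxDepth(d int) Option    { return func(o *Options) { o.MaxDepth = d } }
func WithConcurrency(c int) Option { return func(o *Options) { o.Concurrency = c } }
func WithHeadless(h bool) Option   { return func(o *Options) { o.Headless = h } }

// NewOptions applies defaults first, then each override in order.
func NewOptions(opts ...Option) *Options {
	o := &Options{MaxDepth: 3, Concurrency: 10} // assumed defaults
	for _, opt := range opts {
		opt(o)
	}
	return o
}

func main() {
	o := NewOptions(WithMaxDepth(5), WithHeadless(true))
	fmt.Println(o.MaxDepth, o.Concurrency, o.Headless)
	// prints: 5 10 true
}
```

Unset fields keep their defaults, which is why each With* function below takes an explicit value rather than toggling a flag.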

func WithAutomaticFormFill

func WithAutomaticFormFill(automaticFormFill bool) Option

WithAutomaticFormFill enables optional automatic form filling and submission

func WithBodyReadSize

func WithBodyReadSize(bodyReadSize int) Option

WithBodyReadSize sets the maximum size of response body to read

func WithChromeDataDir

func WithChromeDataDir(chromeDataDir string) Option

WithChromeDataDir specifies the --user-data-dir for the chrome binary to preserve sessions

func WithChromeWSUrl

func WithChromeWSUrl(chromeWSUrl string) Option

WithChromeWSUrl specifies the Chrome debugger websocket url for a running Chrome instance to attach to

func WithConcurrency

func WithConcurrency(concurrency int) Option

WithConcurrency sets the number of concurrent crawling goroutines

func WithCrawlDuration

func WithCrawlDuration(crawlDuration time.Duration) Option

WithCrawlDuration sets the crawl duration

func WithCustomHeaders

func WithCustomHeaders(customHeaders []string) Option

WithCustomHeaders sets the custom headers to add to requests

func WithDebug

func WithDebug(debug bool) Option

WithDebug enables debug mode

func WithDelay

func WithDelay(delay int) Option

WithDelay sets the delay between requests

func WithDisableRedirects

func WithDisableRedirects(disableRedirects bool) Option

WithDisableRedirects disables the following of redirects

func WithDisableUpdateCheck

func WithDisableUpdateCheck(disableUpdateCheck bool) Option

WithDisableUpdateCheck disables automatic update check

func WithDisplayOutScope

func WithDisplayOutScope(displayOutScope bool) Option

WithDisplayOutScope sets the display out of scope flag

func WithErrorLogFile

func WithErrorLogFile(errorLogFile string) Option

WithErrorLogFile specifies a file to write the errors of all requests to

func WithExclude

func WithExclude(exclude []string) Option

WithExclude sets the exclude filter to use

func WithExtensionFilter

func WithExtensionFilter(extensionFilter []string) Option

WithExtensionFilter sets the extension filter

func WithExtensionsMatch

func WithExtensionsMatch(extensionsMatch []string) Option

WithExtensionsMatch sets the extensions to match

func WithFieldConfig

func WithFieldConfig(fieldConfig string) Option

WithFieldConfig sets the custom field configuration file

func WithFieldScope

func WithFieldScope(fieldScope string) Option

WithFieldScope sets the field scope for default DNS scope

func WithFields

func WithFields(fields string) Option

WithFields sets the fields to format in output

func WithFilterRegex

func WithFilterRegex(filterRegex []*regexp.Regexp) Option

WithFilterRegex sets the slice of regexes used to filter URLs

func WithFormConfig

func WithFormConfig(formConfig string) Option

WithFormConfig sets the form configuration file

func WithFormExtraction

func WithFormExtraction(formExtraction bool) Option

WithFormExtraction enables extraction of form, input, textarea & select elements

func WithHeadless

func WithHeadless(headless bool) Option

WithHeadless enables headless scraping

func WithHeadlessNoIncognito

func WithHeadlessNoIncognito(headlessNoIncognito bool) Option

WithHeadlessNoIncognito specifies if chrome should be started without incognito mode

func WithHeadlessNoSandbox

func WithHeadlessNoSandbox(headlessNoSandbox bool) Option

WithHeadlessNoSandbox specifies if chrome should be started in --no-sandbox mode

func WithHeadlessOptionalArguments

func WithHeadlessOptionalArguments(headlessOptionalArguments []string) Option

WithHeadlessOptionalArguments specifies optional arguments to pass to Chrome

func WithHealthCheck

func WithHealthCheck(healthCheck bool) Option

WithHealthCheck determines if a self-healthcheck should be performed

func WithIgnoreQueryParams

func WithIgnoreQueryParams(ignoreQueryParams bool) Option

WithIgnoreQueryParams ignores crawling same path with different query-param values

func WithJSON

func WithJSON(json bool) Option

WithJSON enables writing output in JSON format

func WithKnownFiles

func WithKnownFiles(knownFiles string) Option

WithKnownFiles sets the known files to crawl

func WithMatchRegex

func WithMatchRegex(matchRegex []*regexp.Regexp) Option

WithMatchRegex sets the slice of regexes used to match URLs

func WithMaxDepth

func WithMaxDepth(maxDepth int) Option

WithMaxDepth sets the maximum depth to crawl

func WithNoClobber

func WithNoClobber(noClobber bool) Option

WithNoClobber specifies if katana should avoid overwriting existing output files

func WithNoColors

func WithNoColors(noColors bool) Option

WithNoColors disables coloring of response output

func WithNoScope

func WithNoScope(noScope bool) Option

WithNoScope sets the no scope flag

func WithOmitBody

func WithOmitBody(omitBody bool) Option

WithOmitBody omits the response body from the output

func WithOmitRaw

func WithOmitRaw(omitRaw bool) Option

WithOmitRaw omits raw requests/responses from the output

func WithOnResult

func WithOnResult(onResult OnResultCallback) Option

WithOnResult allows callback function on a result

func WithOutOfScope

func WithOutOfScope(outOfScope []string) Option

WithOutOfScope sets the out of scope regexes to use

func WithOutputFile

func WithOutputFile(outputFile string) Option

WithOutputFile sets the output file

func WithOutputFilterCondition

func WithOutputFilterCondition(outputFilterCondition string) Option

WithOutputFilterCondition sets the output filter condition

func WithOutputFilterRegex

func WithOutputFilterRegex(outputFilterRegex []string) Option

WithOutputFilterRegex sets the regexes used to filter output URLs

func WithOutputMatchCondition

func WithOutputMatchCondition(outputMatchCondition string) Option

WithOutputMatchCondition sets the output match condition

func WithOutputMatchRegex

func WithOutputMatchRegex(outputMatchRegex []string) Option

WithOutputMatchRegex sets the regexes used to match output URLs

func WithParallelism

func WithParallelism(parallelism int) Option

WithParallelism sets the number of URL-processing goroutines

func WithPprofServer

func WithPprofServer(pprofServer bool) Option

WithPprofServer enables pprof server

func WithProxy

func WithProxy(proxy string) Option

WithProxy sets the proxy URL

func WithRateLimit

func WithRateLimit(rateLimit int) Option

WithRateLimit sets the rate limit for requests

func WithRateLimitMinute

func WithRateLimitMinute(rateLimitMinute int) Option

WithRateLimitMinute sets the rate limit for requests per minute

func WithResolvers

func WithResolvers(resolvers []string) Option

WithResolvers sets the custom resolvers

func WithResume

func WithResume(resume string) Option

WithResume sets the resume file to use

func WithRetries

func WithRetries(retries int) Option

WithRetries sets the number of retries for requests

func WithScope

func WithScope(scope []string) Option

WithScope sets the scope regexes to use

func WithScrapeJSLuiceResponses

func WithScrapeJSLuiceResponses(scrapeJSLuiceResponses bool) Option

WithScrapeJSLuiceResponses enables scraping of endpoints from javascript using jsluice

func WithScrapeJSResponses

func WithScrapeJSResponses(scrapeJSResponses bool) Option

WithScrapeJSResponses enables scraping of relative endpoints from javascript

func WithShowBrowser

func WithShowBrowser(showBrowser bool) Option

WithShowBrowser specifies whether to show the browser in headless mode

func WithSilent

func WithSilent(silent bool) Option

WithSilent enables silent mode, showing only results in output

func WithStoreFieldDir

func WithStoreFieldDir(storeFieldDir string) Option

WithStoreFieldDir specifies if katana should use a custom directory to store fields

func WithStoreFields

func WithStoreFields(storeFields string) Option

WithStoreFields sets the fields to store in separate per-host files

func WithStoreResponse

func WithStoreResponse(storeResponse bool) Option

WithStoreResponse specifies if katana should store http requests/responses

func WithStoreResponseDir

func WithStoreResponseDir(storeResponseDir string) Option

WithStoreResponseDir specifies if katana should use a custom directory to store http requests/responses

func WithStrategy

func WithStrategy(strategy string) Option

WithStrategy sets the crawling strategy

func WithSystemChromePath

func WithSystemChromePath(systemChromePath string) Option

WithSystemChromePath specifies the chrome binary path for headless crawling

func WithTLSImpersonate

func WithTLSImpersonate(tlsImpersonate bool) Option

WithTLSImpersonate enables experimental TLS ClientHello randomization for the standard crawler

func WithTechDetect

func WithTechDetect(techDetect bool) Option

WithTechDetect enables technology detection

func WithTimeout

func WithTimeout(timeout int) Option

WithTimeout sets the timeout for requests

func WithURLs

func WithURLs(urls []string) Option

WithURLs sets the URLs to crawl

func WithUseInstalledChrome

func WithUseInstalledChrome(useInstalledChrome bool) Option

WithUseInstalledChrome skips chrome installation and uses the local instance

func WithVerbose

func WithVerbose(verbose bool) Option

WithVerbose specifies showing verbose output

func WithVersion

func WithVersion(version bool) Option

WithVersion enables showing of crawler version

func WithXhrExtraction

func WithXhrExtraction(xhrExtraction bool) Option

WithXhrExtraction enables extraction of xhr requests

type Options

type Options struct {
	// URLs contains a list of URLs for crawling
	URLs goflags.StringSlice
	// Resume the scan from the state stored in the resume config file
	Resume string
	// Exclude excludes hosts matching the specified filter ('cdn', 'private-ips', cidr, ip, regex)
	Exclude goflags.StringSlice
	// Scope contains a list of regexes for in-scope URLs
	Scope goflags.StringSlice
	// OutOfScope contains a list of regexes for out-of-scope URLs
	OutOfScope goflags.StringSlice
	// NoScope disables host based default scope
	NoScope bool
	// DisplayOutScope displays out of scope items in results
	DisplayOutScope bool
	// ExtensionsMatch contains extensions to match explicitly
	ExtensionsMatch goflags.StringSlice
	// ExtensionFilter contains additional items for filter list
	ExtensionFilter goflags.StringSlice
	// OutputMatchCondition is the condition to match output
	OutputMatchCondition string
	// OutputFilterCondition is the condition to filter output
	OutputFilterCondition string
	// MaxDepth is the maximum depth to crawl
	MaxDepth int
	// BodyReadSize is the maximum size of response body to read
	BodyReadSize int
	// Timeout is the time to wait for request in seconds
	Timeout int
	// CrawlDuration is the maximum duration to crawl the target for
	CrawlDuration time.Duration
	// Delay is the delay between each crawl request in seconds
	Delay int
	// RateLimit is the maximum number of requests to send per second
	RateLimit int
	// Retries is the number of retries to do for request
	Retries int
	// RateLimitMinute is the maximum number of requests to send per minute
	RateLimitMinute int
	// Concurrency is the number of concurrent crawling goroutines
	Concurrency int
	// Parallelism is the number of URL-processing goroutines
	Parallelism int
	// FormConfig is the path to the form configuration file
	FormConfig string
	// Proxy is the URL for the proxy server
	Proxy string
	// Strategy is the crawling strategy. depth-first or breadth-first
	Strategy string
	// FieldScope is the scope field for default DNS scope
	FieldScope string
	// OutputFile is the file to write output to
	OutputFile string
	// KnownFiles enables crawling of known files like robots.txt, sitemap.xml, etc.
	KnownFiles string
	// Fields is the fields to format in output
	Fields string
	// StoreFields is the fields to store in separate per-host files
	StoreFields string
	// FieldConfig is the path to the custom field configuration file
	FieldConfig string
	// NoColors disables coloring of response output
	NoColors bool
	// JSON enables writing output in JSON format
	JSON bool
	// Silent enables silent mode, showing only results in output
	Silent bool
	// Verbose specifies showing verbose output
	Verbose bool
	// TechDetect enables technology detection
	TechDetect bool
	// Version enables showing of crawler version
	Version bool
	// ScrapeJSResponses enables scraping of relative endpoints from javascript
	ScrapeJSResponses bool
	// ScrapeJSLuiceResponses enables scraping of endpoints from javascript using jsluice
	ScrapeJSLuiceResponses bool
	// CustomHeaders is a list of custom headers to add to requests
	CustomHeaders goflags.StringSlice
	// Headless enables headless scraping
	Headless bool
	// AutomaticFormFill enables optional automatic form filling and submission
	AutomaticFormFill bool
	// FormExtraction enables extraction of form, input, textarea & select elements
	FormExtraction bool
	// UseInstalledChrome skips chrome installation and uses the local instance
	UseInstalledChrome bool
	// ShowBrowser specifies whether to show the browser in headless mode
	ShowBrowser bool
	// HeadlessOptionalArguments specifies optional arguments to pass to Chrome
	HeadlessOptionalArguments goflags.StringSlice
	// HeadlessNoSandbox specifies if chrome should be started in --no-sandbox mode
	HeadlessNoSandbox bool
	// SystemChromePath specifies the chrome binary path for headless crawling
	SystemChromePath string
	// ChromeWSUrl specifies the Chrome debugger websocket url for a running Chrome instance to attach to
	ChromeWSUrl string
	// OnResult allows callback function on a result
	OnResult OnResultCallback
	// StoreResponse specifies if katana should store http requests/responses
	StoreResponse bool
	// StoreResponseDir specifies if katana should use a custom directory to store http requests/responses
	StoreResponseDir string
	// NoClobber specifies if katana should avoid overwriting existing output files
	NoClobber bool
	// StoreFieldDir specifies if katana should use a custom directory to store fields
	StoreFieldDir string
	// OmitRaw omits raw requests/responses from the output
	OmitRaw bool
	// OmitBody omits the response body from the output
	OmitBody bool
	// ChromeDataDir specifies the --user-data-dir for the chrome binary to preserve sessions
	ChromeDataDir string
	// HeadlessNoIncognito specifies if chrome should be started without incognito mode
	HeadlessNoIncognito bool
	// XhrExtraction enables extraction of xhr requests
	XhrExtraction bool
	// HealthCheck determines if a self-healthcheck should be performed
	HealthCheck bool
	// PprofServer enables pprof server
	PprofServer bool
	// ErrorLogFile specifies a file to write with the errors of all requests
	ErrorLogFile string
	// Resolvers contains custom resolvers
	Resolvers goflags.StringSlice
	// OutputMatchRegex contains the regexes used to match output URLs
	OutputMatchRegex goflags.StringSlice
	// OutputFilterRegex contains the regexes used to filter output URLs
	OutputFilterRegex goflags.StringSlice
	// FilterRegex is the slice of regexes used to filter URLs
	FilterRegex []*regexp.Regexp
	// MatchRegex is the slice of regexes used to match URLs
	MatchRegex []*regexp.Regexp
	// DisableUpdateCheck disables automatic update check
	DisableUpdateCheck bool
	// IgnoreQueryParams ignores crawling the same path with different query-param values
	IgnoreQueryParams bool
	// Debug enables debug mode
	Debug bool
	// TLSImpersonate enables experimental TLS ClientHello randomization for the standard crawler
	TLSImpersonate bool
	// DisableRedirects disables the following of redirects
	DisableRedirects bool
}

Options are the functional parameters for katana

func NewOptions

func NewOptions(opts ...Option) *Options

NewOptions creates a new Options struct with default values and allows overrides

type Spider

type Spider struct {
	//	Client  *standard.Crawler
	Options *Options
	Client  *hybrid.Crawler
}

Spider is a wrapper struct for the katana crawler

type WebSpiderReport

type WebSpiderReport struct {
	Targets []string      `json:"targets" yaml:"targets"`
	Links   []LinkDetails `json:"links" yaml:"links"`
	Errors  []string      `json:"errors" yaml:"errors"`
}

A WebSpiderReport represents a holistic report of all the links that were found during a web spider operation, including non-fatal errors that occurred during the operation

func PerformWebSpider

func PerformWebSpider(ctx context.Context, targets []string) WebSpiderReport

PerformWebSpider performs a web spider operation against the provided targets, returning a WebSpiderReport with the results of the spider

