# LazySpider

## What is it?
LazySpider is a...lazy web spider.
- In `lazy` mode, it crawls a target site (and only a target site) and constructs a word cloud based on the nouns it scrapes from page source. It will slowly stop visiting leaves with the same parent, hence the laziness.
- In `msdn` mode, it does purpose-built crawling of a target (assumed to be using MSDN's UI components) for warning boxes in documentation.
- In `generic` mode, it crawls on-domain for a component specified by a jQuery-style selector. You may optionally provide keywords to further filter selected components.
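The keyword filter in `generic` mode can be pictured as a case-insensitive substring match over each selected element's text. The helper below is a sketch of that idea, not LazySpider's actual code; the function name and exact matching semantics are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesKeywords reports whether a selected element's text contains at
// least one of the given keywords, case-insensitively. With no keywords,
// every selected element passes the filter. Hypothetical helper.
func matchesKeywords(text string, keywords []string) bool {
	if len(keywords) == 0 {
		return true
	}
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(strings.TrimSpace(kw))) {
			return true
		}
	}
	return false
}

func main() {
	// A warning box mentioning "deprecated" passes a -keywords deprecated filter.
	fmt.Println(matchesKeywords("Deprecated in v2: use the new endpoint", []string{"deprecated"}))
}
```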
## Why did you build it?
There are some useful security reconnaissance properties to having a word cloud of your target site: it makes a good starting point for things like API fuzzing. Usually, you need to use a pre-compiled wordlist grabbed from the internet or one carefully curated over a long career. LazySpider is intended to help generate "application context" wordlists quickly on a target.
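The core of a wordlist generator is just a frequency map over page text. Real noun extraction needs a part-of-speech tagger; the sketch below substitutes a rough stand-in that tokenizes on letter runs and lowercases, which is an assumption, not LazySpider's actual pipeline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRe matches runs of ASCII letters; everything else is a separator.
var tokenRe = regexp.MustCompile(`[A-Za-z]+`)

// wordCounts builds a simple case-folded frequency map from page text.
// A real implementation would filter to nouns via POS tagging.
func wordCounts(text string) map[string]int {
	counts := make(map[string]int)
	for _, tok := range tokenRe.FindAllString(text, -1) {
		counts[strings.ToLower(tok)]++
	}
	return counts
}

func main() {
	c := wordCounts("Token token endpoint")
	fmt.Println(c["token"], c["endpoint"]) // → 2 1
}
```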
## What makes this different from &lt;web spider x&gt;?
Good question. The first page of Google results is "go write one", and the Colly project has excellent primitives for getting off the ground quickly. The "lazy" part comes from spidering sites with a "flat", content-heavy architecture (think Wikipedia). The first time a content page is seen, it is useful, but subsequent pages have diminishing returns. LazySpider applies a decay function to each parent path when it encounters a leaf node.
## How do I use it?
LazySpider is still in active development and does not have super nice features yet (pipelining, etc.). It is minimally composable with tools like jq by passing the `-json` flag. Current usage is:

```
lazyspider -url https://example.com [-json] [-spiderType (lazy|msdn|generic)] [-selector <jquery selector>] [-keywords <comma-separated keyword list>] [-banwords <comma-separated banword list>]
```
LazySpider is opinionated and will refuse to e.g. crawl redirects from `https://www.example.com` to `https://example.com`. This is by design, to keep it from wandering off-target. Fuzzy matching of domains is on the to-do list.
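That refusal amounts to strict hostname equality, which is easy to sketch with the standard library. The helper below is illustrative, not LazySpider's actual code; fuzzy matching (stripping a leading `www.`, comparing registrable domains) would relax the final comparison.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether a redirect target stays on the exact host the
// crawl started on. Strict string equality is what makes
// www.example.com -> example.com a refusal. Hypothetical helper.
func sameHost(start, target string) (bool, error) {
	s, err := url.Parse(start)
	if err != nil {
		return false, err
	}
	t, err := url.Parse(target)
	if err != nil {
		return false, err
	}
	return s.Hostname() == t.Hostname(), nil
}

func main() {
	// A redirect that drops "www." is treated as off-target.
	ok, _ := sameHost("https://www.example.com/", "https://example.com/login")
	fmt.Println(ok) // → false
}
```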
## Can I contribute?
Sure, fork away! You will need to sign your commits to contribute to upstream.
## To-do list
- Better test coverage
- Configurable backoff behavior
- CLI documentation improvements
- Module-ify word cloud behavior (this is a useful primitive for other projects)
- Make CLI pipelineable
- Thresholding: only return top N% of results or results with more than M hits
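The "more than M hits" half of the thresholding item is straightforward to prototype against a frequency map. This is a hypothetical helper, not yet part of LazySpider; the "top N%" mode would instead sort counts and keep a prefix.

```go
package main

import "fmt"

// threshold keeps only words with more than minHits occurrences — one of
// the two proposed thresholding modes. Hypothetical helper for the to-do
// item, not existing LazySpider code.
func threshold(counts map[string]int, minHits int) map[string]int {
	out := make(map[string]int)
	for w, n := range counts {
		if n > minHits {
			out[w] = n
		}
	}
	return out
}

func main() {
	counts := map[string]int{"api": 12, "login": 3, "the": 200}
	// With minHits = 5, only "api" and "the" survive; a banword list
	// would then be the natural place to drop stopwords like "the".
	fmt.Println(threshold(counts, 5))
}
```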