# LazySpider

## What is it?
LazySpider is a...lazy web spider.
- In `lazy` mode, it crawls a target site (and only a target site) and constructs a word cloud based on the nouns it scrapes from page source. It will slowly stop visiting leaves with the same parent, hence the laziness.
- In `msdn` mode, it does purpose-built crawling of a target (assumed to be using MSDN's UI components) for warning boxes in documentation.
- In `generic` mode, it crawls on-domain for a component specified by a jQuery-style selector. You may optionally provide keywords to further filter selected components.
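The keyword filter in `generic` mode can be pictured as a case-insensitive substring match over each selected element's text. The helper below is a sketch of that idea, not LazySpider's actual code; the function name and exact matching semantics are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesKeywords reports whether a selected element's text contains at
// least one of the given keywords, case-insensitively. With no keywords,
// every selected element passes the filter. Hypothetical helper.
func matchesKeywords(text string, keywords []string) bool {
	if len(keywords) == 0 {
		return true
	}
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(strings.TrimSpace(kw))) {
			return true
		}
	}
	return false
}

func main() {
	// A warning box mentioning "deprecated" passes a -keywords deprecated filter.
	fmt.Println(matchesKeywords("Deprecated in v2: use the new endpoint", []string{"deprecated"}))
}
```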
## Why did you build it?
There are some useful security reconnaissance properties to having a word cloud of your target site: it makes a good starting point for things like API fuzzing. Usually, you need to use a pre-compiled wordlist grabbed from the internet or one carefully curated over a long career. LazySpider is intended to help generate "application context" wordlists quickly on a target.
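The core of a wordlist generator is just a frequency map over page text. Real noun extraction needs a part-of-speech tagger; the sketch below substitutes a rough stand-in that tokenizes on letter runs and lowercases, which is an assumption, not LazySpider's actual pipeline.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tokenRe matches runs of ASCII letters; everything else is a separator.
var tokenRe = regexp.MustCompile(`[A-Za-z]+`)

// wordCounts builds a simple case-folded frequency map from page text.
// A real implementation would filter to nouns via POS tagging.
func wordCounts(text string) map[string]int {
	counts := make(map[string]int)
	for _, tok := range tokenRe.FindAllString(text, -1) {
		counts[strings.ToLower(tok)]++
	}
	return counts
}

func main() {
	c := wordCounts("Token token endpoint")
	fmt.Println(c["token"], c["endpoint"]) // → 2 1
}
```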
## What makes this different from &lt;web spider x&gt;?
Good question. The first page of Google results is "go write one", and the Colly project has excellent primitives for getting off the ground quickly. The "lazy" part comes from spidering sites with a "flat", content-heavy architecture (think Wikipedia). The first time a content page is seen, it is useful, but subsequent pages have diminishing returns. LazySpider applies a decay function to each parent path when it encounters a leaf node.
## How do I use it?
LazySpider is still in active development and does not have super nice features yet (pipelining, etc.). It is minimally composable with tools like jq by passing the `-json` flag. Current usage is:

```
lazyspider -url https://example.com [-json] [-spiderType (lazy|msdn|generic)] [-selector <jquery selector>] [-keywords <comma-separated keyword list>] [-banwords <comma-separated banword list>]
```
LazySpider is opinionated and will refuse to e.g. crawl redirects from `https://www.example.com` to `https://example.com`. This is by design, to keep it from wandering off-target. Fuzzy matching of domains is on the to-do list.
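That refusal amounts to strict hostname equality, which is easy to sketch with the standard library. The helper below is illustrative, not LazySpider's actual code; fuzzy matching (stripping a leading `www.`, comparing registrable domains) would relax the final comparison.

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether a redirect target stays on the exact host the
// crawl started on. Strict string equality is what makes
// www.example.com -> example.com a refusal. Hypothetical helper.
func sameHost(start, target string) (bool, error) {
	s, err := url.Parse(start)
	if err != nil {
		return false, err
	}
	t, err := url.Parse(target)
	if err != nil {
		return false, err
	}
	return s.Hostname() == t.Hostname(), nil
}

func main() {
	// A redirect that drops "www." is treated as off-target.
	ok, _ := sameHost("https://www.example.com/", "https://example.com/login")
	fmt.Println(ok) // → false
}
```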
## Can I contribute?
Sure, fork away! You will need to sign your commits to contribute to upstream.
## To-do list
- Better test coverage
- Configurable backoff behavior
- CLI documentation improvements
- Module-ify word cloud behavior (this is a useful primitive for other projects)
- Make CLI pipelineable
- Thresholding: only return top N% of results or results with more than M hits
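The "more than M hits" half of the thresholding item is straightforward to prototype against a frequency map. This is a hypothetical helper, not yet part of LazySpider; the "top N%" mode would instead sort counts and keep a prefix.

```go
package main

import "fmt"

// threshold keeps only words with more than minHits occurrences — one of
// the two proposed thresholding modes. Hypothetical helper for the to-do
// item, not existing LazySpider code.
func threshold(counts map[string]int, minHits int) map[string]int {
	out := make(map[string]int)
	for w, n := range counts {
		if n > minHits {
			out[w] = n
		}
	}
	return out
}

func main() {
	counts := map[string]int{"api": 12, "login": 3, "the": 200}
	// With minHits = 5, only "api" and "the" survive; a banword list
	// would then be the natural place to drop stopwords like "the".
	fmt.Println(threshold(counts, 5))
}
```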