page-prowler

command module

v0.0.0-...-6d375a0 | Published: Nov 23, 2024 | License: MIT | Imports: 16 | Imported by: 0

Repository

github.com/jonesrussell/page-prowler

Page Prowler

Page Prowler finds and extracts links from websites based on specified search terms. You can run it directly from the command line, or start the Echo web server, which exposes an API. The API uses the Asynq library to manage queued crawl jobs.

Usage

page-prowler [command]

Commands

api: Starts the API server.
matchlinks: Crawls the given website and extracts links that match the provided search terms. Can be run from the command line or via a POST request to /v1/matchlinks on the API server.
clearlinks: Clears the Redis set for a given siteid.
getlinks: Gets the list of links for a given siteid.
worker: Starts the Asynq worker.
help: Displays help about any command.

Building

To build Page Prowler, clone the repository and compile the binary:

git clone https://github.com/jonesrussell/page-prowler.git
cd page-prowler
go build

Alternatively, you can use the provided Makefile to build the project:

make all

This runs the fmt, lint, test, and build targets defined in the Makefile.

Command Line

To search for matching links from the command line, run:

./page-prowler matchlinks --url="https://www.example.com" --searchterms="keyword1,keyword2" --siteid=siteID --maxdepth=1 --debug

Replace "https://www.example.com" with the URL you want to crawl, "keyword1,keyword2" with the search terms you want to look for, siteID with your site ID, and 1 with the maximum depth of the crawl.
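To make the idea of matching links against search terms concrete, here is a minimal Go sketch. It is not taken from the Page Prowler source: the function names splitTerms and matchLinks are hypothetical, and the case-insensitive substring check is only an assumed stand-in for whatever matching the tool actually performs.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTerms turns a comma-separated value such as "keyword1,keyword2"
// into a slice of lowercased, trimmed terms, skipping empty entries.
// (Hypothetical helper, not part of Page Prowler.)
func splitTerms(raw string) []string {
	var terms []string
	for _, t := range strings.Split(raw, ",") {
		t = strings.ToLower(strings.TrimSpace(t))
		if t != "" {
			terms = append(terms, t)
		}
	}
	return terms
}

// matchLinks returns the links that contain at least one term,
// compared case-insensitively. A plain substring check is an
// assumption made for illustration only.
func matchLinks(links, terms []string) []string {
	var matched []string
	for _, link := range links {
		lower := strings.ToLower(link)
		for _, term := range terms {
			if strings.Contains(lower, term) {
				matched = append(matched, link)
				break // one matching term is enough
			}
		}
	}
	return matched
}

func main() {
	terms := splitTerms("keyword1, keyword2")
	links := []string{
		"https://www.example.com/news/keyword1-story",
		"https://www.example.com/about",
		"https://www.example.com/KEYWORD2",
	}
	for _, l := range matchLinks(links, terms) {
		fmt.Println(l)
	}
}
```

Lowercasing both sides keeps the comparison simple; a real crawler might instead match against link text, use stemming, or require several terms to match.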

API

To start the API server, use the following command:

./page-prowler api

Then, you can send a POST request to start a crawl:

curl -X POST -H "Content-Type: application/json" -d '{
 "URL": "https://www.example.com",
 "SearchTerms": "keyword1,keyword2",
 "CrawlSiteID": "siteID",
 "MaxDepth": 3,
 "Debug": true
}' http://localhost:3000/matchlinks

Again, replace "https://www.example.com" with the URL you want to crawl, "keyword1,keyword2" with the search terms you want to look for, siteID with your site ID, and 3 with the maximum depth of the crawl.

Configuration

Page Prowler uses a .env file for configuration. You can specify the Redis host, port, and password in this file. For example:

REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_AUTH=yourpassword

Contributing

Contributions are welcome! Please feel free to submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Directories

Path	Synopsis
cmd
crawler
dbmanager
internal
common
consumer
drug
matcher
mining
prowlredis	Package prowlredis is a generated GoMock package.
stats	Package stats provides a simple way to track and manipulate statistics related to web crawling.
tasks
termmatcher
worker
models
news
utils
