Published: Jan 24, 2021 License: MIT

roedor

A modular web crawler written in Go. The purpose of this module is to crawl websites and extract data using the Python package Markout. All extracted data is then stored in a CSV file for later analysis.

Usage

First, make sure you meet the requirements below:

  • Python >= 3.5
  • Pip (Python module) >= 18.1
  • Go >= 1.13.6

To install this package, run the installing script in ./scripts:

./scripts/installing

NOTE: This package uses an external Python package. If you want to use Roedor in your code, you must also make sure that the Markout package is installed (the installing script should take care of this).

Using the CLI

If you want to use the CLI, all you have to do is call markout_html with the following flags ($GOBIN must be set for this command to work):

--workers: number of workers to run in parallel.

--url: URL to start crawling from.

--tokens: JSON string with tokens to be used (see Markout for details).

--output: filename of the output CSV file (optional).

You may also use the --help flag to list all the flags above with help messages.

Using it in your code

If you want to use this package in your code, you can just import it. But remember that this package requires the external Python package Markout!

Here's an example of use:

// Parse the seed URL to crawl.
link, err := url.Parse("https://gobyexample.com/")
if err != nil {
  panic(err)
}

// Tokens to hand to Markout (see its documentation for the format).
tokens := make(Tokens)
tokens["p"] = "\n{}"

// Number of parallel workers.
numWorkers := 4

c := NewCrawler(
  []*url.URL{
    link,
  },
  numWorkers,
  tokens,
  "./roedor.json",
)

// This will run while links are found
c.Start()

Contributions

Feel free to leave your contribution here; I would really appreciate it! Also, if you have any doubts or trouble using this package, just contact me or open an issue.

Directories

  • cmd
  • pkg
