gachifinder

package module
v0.1.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: May 28, 2021 License: Apache-2.0 Imports: 0 Imported by: 0

README

GachiFinder

This project is an agent for scraping, parsing and writing to some storages.
Firstly, it has been scraped the news on the portal(eg, Naver/Daum) in Korea.

Workflow(Pipeline)

Target page -------------------------|
  |- 1'st Sub -----------------------|
       |- 2'st Sub ------------------|
             |- N'st visitable ------|=> asynchronous scraper(crawler) module
       |- ... -----------------------|                  |
  |- 1'st Sub -----------------------|                  |=> asynchronous emitter(store or relay) module
  |- ... ----------------------------|
Scraper Modules

Actually, this is an extendable parsing handler.
Data crawling do via the Scraper interface.

naver_news_headline
daum_news_headline

Emitter Modules

This is an extendable output module via Emitter interface

elasticsearch

How to build

Preparing dependencies
  • GachiFinder works with go 1.15+(on current build version).
  • In the case of Windows, it couldn't be invoked "make" command,
    So you need to download and install GNUMake for windows.
Run from the source code
Tested Support OS : Linux, MacOSX(darwin), Windows
# If you're on Windows, run "Git Bash" and type the followings.

$ git clone https://github.com/seversky/gachifinder.git
$ cd gachifinder
$ make all # or one of "windows", "darwin" and "linux".

If well done, you can see the binary.

$ cd $GACHIFINDER_FOLDER/cmd/gachifinder/windows
$ ls
gachifinder.exe

You can run it refers to the help options.

$ ./gachifinder.exe -h
Usage:
  C:\Users\...\go\src\github.com\seversky\gachifinder\cmd\gachifinder\windows\gachifinder.exe [OPTIONS]

Options for GachiFinder

Application Options:
  /c, /config_path:  Path To configure (default: ../config/gachifinder.yml)
                     [%CONFIG_PATH%]
  /t, /test          To test for crawling via a scraper only
                     (Without an emitter module)
  /v, /version       Show GachiFinder version and git revision id

Help Options:
  /?                 Show this help message
  /h, /help          Show this help message
Options: gachifinder.yml
global:
  max_used_cores: 0 # if zero(0), all cores used.
  interval: 5 # Crawing interval(unit: min)

  log:
    log_level: debug # one of trace, debug, info, warn[ing], error, fatal or panic
    stdout: true
    format: 'text' # one of "text" or "json"
    # go_time_format: '2006-01-02T15:04:05.999Z07:00' # default=RFC3339, refer to https://golang.org/src/time/format.go
    force_colors: true

    log_path: './log/gachifinder.log'
    max_size: 50 # Max megabytes before log is rotated
    max_age: 7 # Max number of days to retain log files
    max_backups: 3 # Max number of old log files to keep
    compress: true

scraper:
  visit_domains:
    - https://news.naver.com
    - https://news.daum.net
  # allowed_domains:
  #   - https://news.naver.com
  #   - https://news.daum.net
  user_agent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
  max_depth_to_visit: 1

  async: true # Async turns on asynchronous network communication. Use Collector.Wait() to be sure all requests have been finished.

  parallelism: 20 # The number of the maximum allowed concurrent requests of the matching domains.
  delay: 1 # The duration to wait before creating a new request to the matching domains.(unit: sec)
  random_delay: 5 # The extra randomized duration to wait added to delay before creating a new request.(unit: sec)

  consumer_queue_threads: 2 # The number of consumer queue threads
  consumer_queue_max_size: 10 # Max size of consumer queue

emitter:
  elasticsearch:
    hosts:
      - http://elasticsearch:9200
    username: elastic
    password: changeme
Simple test for scraping
# Go into the folder where the built binary of gachifinder is.
$ cd $GACHIFINDER_FOLDER/cmd/gachifinder/windows
$ ./gachifinder.exe -t
INFO[2021-05-28T15:13:41+09:00] Show All Configurations                       config.Emitter.Elasticsearch.Hosts="[http://elasticsearch:9200]" config.Emitter.Elasticsearch.Password=changem
e config.Emitter.Elasticsearch.Username=elastic config.Global.Interval=5 config.Global.Log.Compress=true config.Global.Log.ForceColors=true config.Global.Log.Format=text
config.Global.Log.GoTimeFormat= config.Global.Log.LogLevel=debug config.Global.Log.LogPath=./log/gachifinder.log config.Global.Log.MaxAge=7 config.Global.Log.MaxBackups=3
 config.Global.Log.MaxSize=50 config.Global.Log.Stdout=true config.Global.MaxUsedCores=0 config.Scraper.AllowedDomains="[]" config.Scraper.Async=true config.Scraper.
ConsumerQueueMaxSize=10 config.Scraper.ConsumerQueueThreads=2 config.Scraper.Delay=1 config.Scraper.MaxDepthToVisit=1 config.Scraper.Parallelism=20 config.Scraper.RandomD
elay=5 config.Scraper.UserAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36" config.Scraper.VisitDomains="[https://news.n
aver.com https://news.daum.net]"
INFO[2021-05-28T15:13:41+09:00] Application Initializing                      1-GO Runtime Version=go1.16.3 2-System Arch=amd64 3-GachiFider version=0.1.0 4-GachiFider revisi
on number=c44ad459 5-Number of used CPUs=8
INFO[2021-05-28T14:56:34+09:00] Begin crawling
INFO[2021-05-28T14:56:34+09:00] visiting https://news.daum.net
INFO[2021-05-28T14:56:34+09:00] visiting https://news.naver.com
...
(omit)

Using Docker with Elasticsearch and Kibana for a test environment on Linux.

I suppose to be already installed Docker and Docker-compose therefore I don't handle installing those here.

To run Elasticsearch and Kibana, just go ahead below.

$ cd $GACHIFINDER_FOLDER/docker
$ docker-compose up -d --build

If Elasticsearch account has been changed, you need to type into docker/kibana/kibana.yml

Documentation

Index

Constants

View Source
const (
	JSON = "json"
	TEXT = "text"
)

Types of Config.Global.Log.Format

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	Global struct {
		MaxUsedCores int `yaml:"max_used_cores"`
		Interval     int `yaml:"interval"`
		Log          struct {
			LogLevel     string `yaml:"log_level"`
			Stdout       bool   `yaml:"stdout"`
			Format       string `yaml:"format"`
			ForceColors  bool   `yaml:"force_colors"`
			GoTimeFormat string `yaml:"go_time_format"`
			LogPath      string `yaml:"log_path"`
			MaxSize      int    `yaml:"max_size"`
			MaxAge       int    `yaml:"max_age"`
			MaxBackups   int    `yaml:"max_backups"`
			Compress     bool   `yaml:"compress"`
		} `yaml:"log"`
	} `yaml:"global"`

	Scraper struct {
		VisitDomains         []string `yaml:"visit_domains"`
		AllowedDomains       []string `yaml:"allowed_domains"`
		UserAgent            string   `yaml:"user_agent"`
		MaxDepthToVisit      int      `yaml:"max_depth_to_visit"`
		Async                bool     `yaml:"async"`
		Parallelism          int      `yaml:"parallelism"`
		Delay                int      `yaml:"delay"`
		RandomDelay          int      `yaml:"random_delay"`
		ConsumerQueueThreads int      `yaml:"consumer_queue_threads"`
		ConsumerQueueMaxSize int      `yaml:"consumer_queue_max_size"`
	} `yaml:"scraper"`

	Emitter struct {
		Elasticsearch struct {
			Hosts    []string `yaml:"hosts"`
			Username string   `yaml:"username"`
			Password string   `yaml:"password"`
		}
	}
}

Config is the options of gachifinder.

type Emitter

type Emitter interface {
	// Connect to the Emitter; connect is only called once when the plugin starts.
	Connect() error
	// Close any connections to the Emitter. Close is called once when the output
	// is shutting down. Close will not be called until all writes have finished,
	// and Write() will not be called once Close() has been, so locking is not
	// necessary.
	Close()
	// Write takes in group of points to be written to the Emitter.
	// this is a consumer in a part of a pipeline.
	Write(<-chan GachiData) error
}

Emitter interface to sent or write the data to the targets.

type GachiData

type GachiData struct {
	Timestamp       string
	VisitHost       string
	Creator         string
	Title           string
	Description     string
	URL             string
	ShortCutIconURL string
	ImageURL        string
}

GachiData is contents to collect data by scraper.

Directories

Path Synopsis
cmd

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL