spider

package
v1.2.7

This package is not in the latest version of its module.

Published: Aug 4, 2021 License: MPL-2.0 Imports: 11 Imported by: 0

Documentation

Overview

Crawl master module.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Spider

type Spider struct {
	// contains filtered or unexported fields
}

func NewSpider

func NewSpider(pageinst page_processer.PageProcesser, taskname string) *Spider

Spider is the scheduling module that coordinates all the other modules, such as the downloader, pipeline, and scheduler. The taskname may be an empty string; otherwise it can be used in a Pipeline to record which task crawled a given result.

func (*Spider) AddPipeline

func (this *Spider) AddPipeline(p pipeline.Pipeline) *Spider

func (*Spider) AddRequest

func (this *Spider) AddRequest(req *request.Request) *Spider

AddRequest adds a Request to the scheduler.

func (*Spider) AddRequests

func (this *Spider) AddRequests(reqs []*request.Request) *Spider

func (*Spider) AddUrl

func (this *Spider) AddUrl(url string, respType string) *Spider

func (*Spider) AddUrlEx

func (this *Spider) AddUrlEx(url string, respType string, headerFile string, proxyHost string) *Spider

func (*Spider) AddUrlWithHeaderFile

func (this *Spider) AddUrlWithHeaderFile(url string, respType string, headerFile string) *Spider

func (*Spider) AddUrls

func (this *Spider) AddUrls(urls []string, respType string) *Spider

func (*Spider) AddUrlsEx

func (this *Spider) AddUrlsEx(urls []string, respType string, headerFile string, proxyHost string) *Spider

func (*Spider) AddUrlsWithHeaderFile

func (this *Spider) AddUrlsWithHeaderFile(urls []string, respType string, headerFile string) *Spider
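
Every Add* method above returns *Spider, which suggests that calls are meant to be chained. The following is a minimal sketch of that fluent style, not the package's implementation: miniSpider and newMiniSpider are hypothetical stand-ins that only record their arguments, and the respType value "html" is an assumed example.

```go
package main

import "fmt"

// miniSpider is a hypothetical stand-in for Spider. It only records
// what it is given; it does not download anything.
type miniSpider struct {
	taskname  string
	urls      []string
	threadnum uint
}

// newMiniSpider mirrors the shape of NewSpider, minus the page processor.
func newMiniSpider(taskname string) *miniSpider {
	return &miniSpider{taskname: taskname, threadnum: 1}
}

// AddUrl queues one URL and returns the receiver so calls can be chained.
// respType is accepted only to mirror the documented signature.
func (s *miniSpider) AddUrl(url string, respType string) *miniSpider {
	s.urls = append(s.urls, url)
	return s
}

// AddUrls queues several URLs with the same respType.
func (s *miniSpider) AddUrls(urls []string, respType string) *miniSpider {
	for _, u := range urls {
		s.AddUrl(u, respType)
	}
	return s
}

// SetThreadnum stores the requested number of workers.
func (s *miniSpider) SetThreadnum(n uint) *miniSpider {
	s.threadnum = n
	return s
}

func main() {
	sp := newMiniSpider("demo").
		AddUrl("https://example.com/", "html").
		AddUrls([]string{"https://example.com/a", "https://example.com/b"}, "html").
		SetThreadnum(3)
	fmt.Println(sp.taskname, len(sp.urls), sp.threadnum) // demo 3 3
}
```

Returning the receiver from each setter keeps configuration in a single expression, which is presumably why the setters documented here share that return type.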

func (*Spider) CloseFileLog

func (this *Spider) CloseFileLog() *Spider

CloseFileLog closes the file log.

func (*Spider) CloseStrace

func (this *Spider) CloseStrace() *Spider

CloseStrace closes strace output.

func (*Spider) Get

func (this *Spider) Get(url string, respType string) *page_items.PageItems

Get processes one URL and returns its PageItems.

func (*Spider) GetAll

func (this *Spider) GetAll(urls []string, respType string) []*page_items.PageItems

GetAll processes several URLs and returns a slice of PageItems.

func (*Spider) GetAllByRequest

func (this *Spider) GetAllByRequest(reqs []*request.Request) []*page_items.PageItems

GetAllByRequest processes several requests and returns a slice of PageItems.

func (*Spider) GetByRequest

func (this *Spider) GetByRequest(req *request.Request) *page_items.PageItems

GetByRequest processes one request, which can carry additional settings, and returns its PageItems.

func (*Spider) GetDownloader

func (this *Spider) GetDownloader() downloader.Downloader

func (*Spider) GetExitWhenComplete

func (this *Spider) GetExitWhenComplete() bool

func (*Spider) GetScheduler

func (this *Spider) GetScheduler() scheduler.Scheduler

func (*Spider) GetThreadnum

func (this *Spider) GetThreadnum() uint

func (*Spider) OpenFileLog

func (this *Spider) OpenFileLog(filePath string) *Spider

OpenFileLog initializes the log path and opens the log. Once the log is open, errors and other useful information from the spider are written to the file at filePath. To write a log entry, call mlog.LogInst().LogError("info") or mlog.LogInst().LogInfo("info"). The file log is closed by default. filePath must be an absolute path.

func (*Spider) OpenFileLogDefault

func (this *Spider) OpenFileLogDefault() *Spider

OpenFileLogDefault opens the file log at the default file path, for example "WD/log/log.2014-9-1".

func (*Spider) OpenStrace

func (this *Spider) OpenStrace() *Spider

OpenStrace opens strace, which prints progress information to the screen. Strace is open by default.

func (*Spider) Run

func (this *Spider) Run()

func (*Spider) SetDownloader

func (this *Spider) SetDownloader(d downloader.Downloader) *Spider

func (*Spider) SetExitWhenComplete

func (this *Spider) SetExitWhenComplete(e bool) *Spider

SetExitWhenComplete sets whether the spider exits once every crawl task is done. To keep the spider in memory all the time and add URLs from outside, set it to false.

func (*Spider) SetScheduler

func (this *Spider) SetScheduler(s scheduler.Scheduler) *Spider

func (*Spider) SetSleepTime

func (this *Spider) SetSleepTime(sleeptype string, s uint, e uint) *Spider

SetSleepTime sets the sleep time after each crawl task, in milliseconds. If sleeptype is "fixed", s is the sleep time and e is ignored. If sleeptype is "rand", the sleep time is chosen at random between s and e.

func (*Spider) SetThreadnum

func (this *Spider) SetThreadnum(i uint) *Spider

func (*Spider) Taskname

func (this *Spider) Taskname() string
