Documentation ¶
Overview ¶
The crawler's master module.
Index ¶
- type Spider
- func (this *Spider) AddBlackList(urlStr string) *Spider
- func (this *Spider) AddPipeline(p pipeline.Pipeline) *Spider
- func (this *Spider) AddRequest(req *request.Request) *Spider
- func (this *Spider) AddRequests(reqs []*request.Request) *Spider
- func (this *Spider) AddUrl(url string, respType string) *Spider
- func (this *Spider) AddUrlEx(url string, respType string, headerFile string, proxyHost string) *Spider
- func (this *Spider) AddUrlWithHeaderFile(url string, respType string, headerFile string) *Spider
- func (this *Spider) AddUrls(urls []string, respType string) *Spider
- func (this *Spider) AddUrlsEx(urls []string, respType string, headerFile string, proxyHost string) *Spider
- func (this *Spider) AddUrlsWithHeaderFile(urls []string, respType string, headerFile string) *Spider
- func (this *Spider) AddWhiteList(urlStr string) *Spider
- func (this *Spider) CloseFileLog() *Spider
- func (this *Spider) CloseStrace() *Spider
- func (this *Spider) Get(url string, respType string) *page_items.PageItems
- func (this *Spider) GetAll(urls []string, respType string) []*page_items.PageItems
- func (this *Spider) GetAllByRequest(reqs []*request.Request) []*page_items.PageItems
- func (this *Spider) GetByRequest(req *request.Request) *page_items.PageItems
- func (this *Spider) GetDownloader() downloader.Downloader
- func (this *Spider) GetExitWhenComplete() bool
- func (this *Spider) GetScheduler() scheduler.Scheduler
- func (this *Spider) GetThreadnum() uint
- func (this *Spider) IsUrlAllowded(req *request.Request) bool
- func (this *Spider) OpenFileLog(filePath string) *Spider
- func (this *Spider) OpenFileLogDefault() *Spider
- func (this *Spider) OpenStrace() *Spider
- func (this *Spider) Run()
- func (this *Spider) SetDownloader(d downloader.Downloader) *Spider
- func (this *Spider) SetExitWhenComplete(e bool) *Spider
- func (this *Spider) SetScheduler(s scheduler.Scheduler) *Spider
- func (this *Spider) SetSleepTime(sleeptype string, s uint, e uint) *Spider
- func (this *Spider) SetThreadnum(i uint) *Spider
- func (this *Spider) Taskname() string
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Spider ¶
type Spider struct {
// contains filtered or unexported fields
}
func NewSpider ¶
func NewSpider(pageinst page_processer.PageProcesser, taskname string) *Spider
NewSpider creates a Spider, the module that coordinates all the other modules, such as the downloader, pipeline, and scheduler. The taskname may be an empty string; when set, a Pipeline can use it to record which task produced a crawled result.
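Most of the setters listed in the index return *Spider, which allows calls to be chained. The sketch below illustrates only that chaining style. It uses a hypothetical stand-in type, miniSpider, with made-up fields and a made-up default; it is not the package's Spider and does not download anything.

```go
package main

import "fmt"

// miniSpider is a hypothetical stand-in used only to illustrate the chaining
// style; it is not the package's Spider type.
type miniSpider struct {
	taskname  string
	threadnum uint
	urls      []string
}

// newMiniSpider starts with one worker; that default is an arbitrary choice
// for this sketch.
func newMiniSpider(taskname string) *miniSpider {
	return &miniSpider{taskname: taskname, threadnum: 1}
}

// SetThreadnum returns the receiver so that calls can be chained.
func (s *miniSpider) SetThreadnum(n uint) *miniSpider {
	s.threadnum = n
	return s
}

// AddUrl queues a URL and returns the receiver.
func (s *miniSpider) AddUrl(url string) *miniSpider {
	s.urls = append(s.urls, url)
	return s
}

func main() {
	sp := newMiniSpider("demo").
		SetThreadnum(3).
		AddUrl("http://example.com/")
	fmt.Println(sp.taskname, sp.threadnum, len(sp.urls)) // demo 3 1
}
```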
func (*Spider) AddBlackList ¶
func (*Spider) AddRequest ¶
AddRequest adds a Request to the scheduler.
func (*Spider) AddUrlWithHeaderFile ¶
func (*Spider) AddUrlsWithHeaderFile ¶
func (*Spider) AddWhiteList ¶
func (*Spider) CloseFileLog ¶
CloseFileLog closes the file log.
func (*Spider) CloseStrace ¶
CloseStrace turns off strace output.
func (*Spider) Get ¶
func (this *Spider) Get(url string, respType string) *page_items.PageItems
Get processes one URL and returns its PageItems.
func (*Spider) GetAll ¶
func (this *Spider) GetAll(urls []string, respType string) []*page_items.PageItems
GetAll processes several URLs and returns a slice of PageItems.
func (*Spider) GetAllByRequest ¶
func (this *Spider) GetAllByRequest(reqs []*request.Request) []*page_items.PageItems
GetAllByRequest processes several requests and returns a slice of PageItems.
func (*Spider) GetByRequest ¶
func (this *Spider) GetByRequest(req *request.Request) *page_items.PageItems
GetByRequest processes one request, which can carry additional settings, and returns its PageItems.
func (*Spider) GetDownloader ¶
func (this *Spider) GetDownloader() downloader.Downloader
func (*Spider) GetExitWhenComplete ¶
func (*Spider) GetScheduler ¶
func (*Spider) GetThreadnum ¶
func (*Spider) OpenFileLog ¶
OpenFileLog sets the log path and opens the file log. While the log is open, errors and other useful information from the spider are written to the file at filePath. To log, call mlog.LogInst().LogError("info") or mlog.LogInst().LogInfo("info"). The file log is closed by default. filePath must be an absolute path.
func (*Spider) OpenFileLogDefault ¶
OpenFileLogDefault opens the file log at the default path, for example "WD/log/log.2014-9-1".
func (*Spider) OpenStrace ¶
OpenStrace turns on strace, which prints progress information to the screen. Strace is on by default.
func (*Spider) SetDownloader ¶
func (this *Spider) SetDownloader(d downloader.Downloader) *Spider
func (*Spider) SetExitWhenComplete ¶
SetExitWhenComplete sets whether the spider exits once every crawl task is done. To keep the spider in memory and add URLs from outside, set it to false.
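To make the effect of such a flag concrete, here is a minimal sketch of a crawl loop. It assumes that a true value makes the loop return as soon as the queue is empty, while a false value keeps it polling for URLs added from outside. crawlLoop, its channels, and the polling interval are hypothetical and are not taken from the package source.

```go
package main

import (
	"fmt"
	"time"
)

// crawlLoop is a hypothetical stand-in for a spider's main loop. It counts
// the URLs it receives. When the queue is empty it returns if
// exitWhenComplete is true; otherwise it keeps polling until stop is closed.
func crawlLoop(queue <-chan string, exitWhenComplete bool, stop <-chan struct{}) int {
	processed := 0
	for {
		select {
		case url := <-queue:
			_ = url // a real spider would download and process the page here
			processed++
		case <-stop:
			return processed
		default:
			if exitWhenComplete {
				return processed
			}
			time.Sleep(10 * time.Millisecond) // idle until new URLs arrive
		}
	}
}

func main() {
	// One-shot mode: drain the queue, then exit. A nil stop channel never fires.
	queue := make(chan string, 2)
	queue <- "http://example.com/a"
	queue <- "http://example.com/b"
	fmt.Println(crawlLoop(queue, true, nil)) // 2

	// Keep-alive mode: the loop waits for a URL that arrives later.
	late := make(chan string)
	stop := make(chan struct{})
	result := make(chan int)
	go func() { result <- crawlLoop(late, false, stop) }()
	late <- "http://example.com/late" // blocks until the loop receives it
	close(stop)
	fmt.Println(<-result) // 1
}
```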
func (*Spider) SetSleepTime ¶
SetSleepTime sets the sleep time after each crawl task, in milliseconds. If sleeptype is "fixed", s is the sleep time and e is ignored. If sleeptype is "rand", the sleep time is a random value between s and e.