Documentation
Overview ¶
Package etc_config implements config initialization for one spider.
Index ¶
- func Conf() *goutils.Config
- func ReadHeaderFromFile(headerFile string) http.Header
- func StartConf(configFilePath string) *goutils.Config
- type CollectPipeline
- type CollectPipelinePageItems
- type Downloader
- type Page
- func (self *Page) AddField(key string, value string)
- func (self *Page) AddTargetRequest(req *Request) *Page
- func (self *Page) AddTargetRequests(reqs []*Request) *Page
- func (self *Page) Errormsg() string
- func (self *Page) GetBodyStr() string
- func (self *Page) GetCookies() []*http.Cookie
- func (self *Page) GetHeader() http.Header
- func (self *Page) GetHtmlParser() *goquery.Document
- func (self *Page) GetJson() *simplejson.Json
- func (self *Page) GetPageItems() *PageItems
- func (self *Page) GetRequest() *Request
- func (self *Page) GetSkip() bool
- func (self *Page) GetTargetRequests() []*Request
- func (self *Page) GetUrlTag() string
- func (self *Page) IsSucc() bool
- func (self *Page) ResetHtmlParser() *goquery.Document
- func (self *Page) SetBodyStr(body string) *Page
- func (self *Page) SetCookies(cookies []*http.Cookie)
- func (self *Page) SetHeader(header http.Header)
- func (self *Page) SetHtmlParser(doc *goquery.Document) *Page
- func (self *Page) SetJson(js *simplejson.Json) *Page
- func (self *Page) SetRequest(r *Request) *Page
- func (self *Page) SetSkip(skip bool)
- func (self *Page) SetStatus(isfail bool, errormsg string)
- type PageItems
- type PageProcesser
- type Pipeline
- type Request
- func (self *Request) AddHeaderFile(headerFile string) *Request
- func (self *Request) AddProxyHost(host string) *Request
- func (self *Request) GetBaseUrl() string
- func (self *Request) GetCookies() []*http.Cookie
- func (self *Request) GetHeader() http.Header
- func (self *Request) GetMeta() interface{}
- func (self *Request) GetMethod() string
- func (self *Request) GetPostdata() string
- func (self *Request) GetProxyHost() string
- func (self *Request) GetResponceType() string
- func (self *Request) GetUrl() string
- func (self *Request) GetUrlTag() string
- type ResourceManage
- type Scheduler
- type Spider
- func (self *Spider) AddPipeline(p Pipeline) *Spider
- func (self *Spider) AddRequest(req *Request) *Spider
- func (self *Spider) AddRequests(reqs []*Request) *Spider
- func (self *Spider) Close()
- func (self *Spider) CloseFileLog() *Spider
- func (self *Spider) CloseStrace() *Spider
- func (self *Spider) Get(req *Request) *PageItems
- func (self *Spider) GetAll(reqs []*Request) []*PageItems
- func (self *Spider) GetAllByRequest(reqs []*Request) []*PageItems
- func (self *Spider) GetByRequest(req *Request) *PageItems
- func (self *Spider) GetDownloader() Downloader
- func (self *Spider) GetExitWhenComplete() bool
- func (self *Spider) GetScheduler() Scheduler
- func (self *Spider) OpenFileLog(filePath string) *Spider
- func (self *Spider) OpenFileLogDefault() *Spider
- func (self *Spider) OpenStrace() *Spider
- func (self *Spider) Run()
- func (self *Spider) SetDownloader(d Downloader) *Spider
- func (self *Spider) SetExitWhenComplete(e bool) *Spider
- func (self *Spider) SetScheduler(s Scheduler) *Spider
- func (self *Spider) SetSleepTime(sleeptype string, s uint, e uint) *Spider
- func (self *Spider) Taskname() string
- type SpiderOptions
- type Task
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ReadHeaderFromFile ¶
Types ¶
type CollectPipeline ¶
type CollectPipeline interface {
    Pipeline
    // GetCollected returns the result saved temporarily in the process's memory.
    GetCollected() []*PageItems
}
The CollectPipeline interface collects results temporarily in the process's memory.
type CollectPipelinePageItems ¶
type CollectPipelinePageItems struct {
// contains filtered or unexported fields
}
func NewCollectPipelinePageItems ¶
func NewCollectPipelinePageItems() *CollectPipelinePageItems
func (*CollectPipelinePageItems) GetCollected ¶
func (self *CollectPipelinePageItems) GetCollected() []*PageItems
func (*CollectPipelinePageItems) Process ¶
func (self *CollectPipelinePageItems) Process(items *PageItems, t Task)
type Downloader ¶
The Downloader interface. Implement it by providing the Download function, which must return a pointer to a Page instance holding the result downloaded for the given Request.
type Page ¶
type Page struct {
// contains filtered or unexported fields
}
Page represents a crawled entity.
func (*Page) AddTargetRequest ¶
AddTargetRequest adds one new Request waiting to be crawled.
func (*Page) AddTargetRequests ¶
AddTargetRequests adds new Requests waiting to be crawled.
func (*Page) GetBodyStr ¶
GetBodyStr returns the plain string crawled.
func (*Page) GetCookies ¶
GetCookies returns the cookies of the HTTP response.
func (*Page) GetHtmlParser ¶
GetHtmlParser returns the goquery object bound to the crawl result.
func (*Page) GetPageItems ¶
GetPageItems returns the PageItems object that records the KV pairs parsed in the PageProcesser.
func (*Page) GetRequest ¶
GetRequest returns the request object of this page.
func (*Page) GetTargetRequests ¶
GetTargetRequests returns the target requests that will be put into the Scheduler.
func (*Page) ResetHtmlParser ¶
ResetHtmlParser rebuilds and returns the goquery object bound to the crawl result.
func (*Page) SetBodyStr ¶
SetBodyStr saves the plain crawled string in the Page.
func (*Page) SetCookies ¶
SetCookies saves the cookies of the HTTP response.
func (*Page) SetHtmlParser ¶
SetHtmlParser saves the goquery object bound to the crawl result.
func (*Page) SetRequest ¶
SetRequest saves the request object of this page.
type PageItems ¶
type PageItems struct {
// contains filtered or unexported fields
}
PageItems represents an entity that saves the result parsed by the PageProcesser and output at the end.
func NewPageItems ¶
NewPageItems returns an initialized PageItems object.
func (*PageItems) GetRequest ¶
GetRequest returns the request of the PageItems.
type PageProcesser ¶
type PageProcesser interface {
    Process(p *Page)
    Finish()
}
PageProcesser is the interface for handling a page after it is downloaded; developers must implement it themselves.
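The shape of such an implementation can be sketched self-contained; the `Page` below is a simplified stand-in for the library type (real processers would call p.GetHtmlParser() and goquery instead of raw string scanning), and `TitleProcesser` is a hypothetical example:

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-in for the library's Page: a body string plus a
// KV map playing the role of PageItems.
type Page struct {
	body  string
	items map[string]string
}

func (p *Page) GetBodyStr() string   { return p.body }
func (p *Page) AddField(k, v string) { p.items[k] = v }

// TitleProcesser has the PageProcesser shape: Process handles each
// downloaded page, Finish runs once when the spider is done.
type TitleProcesser struct{}

func (t *TitleProcesser) Process(p *Page) {
	// crude title extraction; a real processer would use goquery
	body := p.GetBodyStr()
	if i := strings.Index(body, "<title>"); i >= 0 {
		if j := strings.Index(body, "</title>"); j > i {
			p.AddField("title", body[i+len("<title>"):j])
		}
	}
}

func (t *TitleProcesser) Finish() {
	fmt.Println("spider finished")
}

func main() {
	p := &Page{body: "<html><title>demo</title></html>", items: map[string]string{}}
	(&TitleProcesser{}).Process(p)
	fmt.Println(p.items["title"]) // demo
}
```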
type Pipeline ¶
type Pipeline interface {
    // Process implements result persistence.
    // The items holds the crawled result.
    // The t holds information about this crawl task.
    Process(items *PageItems, t Task)
}
The Pipeline interface can be implemented to customize how results are persisted. It is the final destination of the scraped data; developers implement it themselves, and the pipeline folder contains examples.
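A custom pipeline is a single method. A self-contained sketch with simplified stand-ins for the library's PageItems and Task types (`ConsolePipeline` and `demoTask` are hypothetical names):

```go
package main

import "fmt"

// Simplified stand-ins for the library's PageItems and Task types.
type PageItems struct{ kv map[string]string }
type Task interface{ Taskname() string }

// ConsolePipeline satisfies the Pipeline shape by printing each
// scraped KV pair; a real pipeline would persist them (file, DB, ...).
type ConsolePipeline struct{}

func (c *ConsolePipeline) Process(items *PageItems, t Task) {
	for k, v := range items.kv {
		fmt.Printf("%s: %s=%s\n", t.Taskname(), k, v)
	}
}

type demoTask struct{}

func (demoTask) Taskname() string { return "demo" }

func main() {
	items := &PageItems{kv: map[string]string{"title": "hello"}}
	(&ConsolePipeline{}).Process(items, demoTask{})
}
```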
type Request ¶
type Request struct {
    Url string
    // Response type: html json jsonp text
    RespType string
    // GET POST
    Method string
    // POST data
    Postdata string
    // name for marking the url and distinguishing different urls in PageProcesser and Pipeline
    Urltag string
    // http header
    Header http.Header
    // http cookies
    Cookies []*http.Cookie
    // proxy host, example: 'localhost:80'
    ProxyHost string
    Meta interface{}
}
Request represents an object waiting to be crawled.
func NewRequest ¶
func (*Request) AddHeaderFile ¶
AddHeaderFile points to a JSON file, e.g. xxx.json:
{
    "User-Agent": "curl/7.19.3 (i386-pc-win32) libcurl/7.19.3 OpenSSL/1.0.0d",
    "Referer": "http://weixin.sogou.com/gzh?openid=oIWsFt6Sb7aZmuI98AU7IXlbjJps",
    "Cookie": ""
}
func (*Request) AddProxyHost ¶
@host: proxy host, e.g. http://localhost:8765/
func (*Request) GetBaseUrl ¶
GetBaseUrl returns the URL path: for http://www.79xs.com/Html/Book/147/147144/Index.html it returns http://www.79xs.com/Html/Book/147/147144/.
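That trimming can be reproduced with plain string slicing; `baseURL` below is an illustrative helper, not the library function:

```go
package main

import (
	"fmt"
	"strings"
)

// baseURL drops everything after the last "/" in the URL,
// mirroring what GetBaseUrl is documented to return.
func baseURL(raw string) string {
	i := strings.LastIndex(raw, "/")
	// don't trim the scheme's "//" when there is no path
	if i <= strings.Index(raw, "//")+1 {
		return raw
	}
	return raw[:i+1]
}

func main() {
	fmt.Println(baseURL("http://www.79xs.com/Html/Book/147/147144/Index.html"))
	// http://www.79xs.com/Html/Book/147/147144/
}
```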
func (*Request) GetCookies ¶
func (*Request) GetPostdata ¶
func (*Request) GetProxyHost ¶
func (*Request) GetResponceType ¶
type ResourceManage ¶
type ResourceManage interface {
    // start the resource manager
    Start()
    // free the resource manager
    Free()
    // add a task to the resource manager
    AddTask(func(*Request), *Request)
    // get the amount of resources in the resource manager
    Has() int
}
ResourceManage is the resource-management interface.
type Spider ¶
type Spider struct {
// contains filtered or unexported fields
}
func NewSpider ¶
func NewSpider(options SpiderOptions) *Spider
2016-01-07: Creates a spider project; everything starts here. First you supply the spider's various options: which downloader, which scheduler, which resource manager, which pipelines, and the page processer. A set of ready-made implementations is also provided for reference and use; look for them in the corresponding folders.
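A hedged sketch of what that wiring might look like; the field names follow the SpiderOptions definition below, while `MyProcesser` and the choice of concrete implementations are hypothetical and must come from your own code and the repository's folders:

```go
// Hypothetical wiring; MyProcesser is your own PageProcesser
// implementation, and the remaining option fields should be filled
// with implementations from the repository's corresponding folders.
sp := spider.NewSpider(spider.SpiderOptions{
	TaskName:        "demo",
	PageProcesser:   &MyProcesser{},
	MaxGoroutineNum: 10,
	// Downloader, Scheduler, ResourceManage, Pipelines: pick
	// implementations from the corresponding folders
})
sp.Run()
```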
func (*Spider) AddPipeline ¶
func (*Spider) AddRequest ¶
AddRequest adds a Request to the Scheduler.
func (*Spider) AddRequests ¶
func (*Spider) CloseFileLog ¶
CloseFileLog closes the file log.
func (*Spider) CloseStrace ¶
CloseStrace closes strace output.
func (*Spider) GetAllByRequest ¶
GetAllByRequest deals with several requests and returns the PageItems slice.
func (*Spider) GetByRequest ¶
GetByRequest deals with one request and returns its PageItems, honoring the request's other settings.
func (*Spider) GetDownloader ¶
func (self *Spider) GetDownloader() Downloader
func (*Spider) GetExitWhenComplete ¶
func (*Spider) GetScheduler ¶
func (*Spider) OpenFileLog ¶
OpenFileLog initializes the log path and opens the log. Once opened, error info and other useful info from the spider is logged to the file at filePath. Log with mlog.LogInst().LogError("info") or mlog.LogInst().LogInfo("info"). The spider's log is closed by default. The filePath must be an absolute path.
func (*Spider) OpenFileLogDefault ¶
OpenFileLogDefault opens the file log with a default file path like "WD/log/log.2014-9-1".
func (*Spider) OpenStrace ¶
OpenStrace enables strace output of progress info on the screen. The spider's strace is open by default.
func (*Spider) SetDownloader ¶
func (self *Spider) SetDownloader(d Downloader) *Spider
func (*Spider) SetExitWhenComplete ¶
SetExitWhenComplete sets whether the spider exits when every crawl task is done. If you want to keep the spider in memory all the time and add URLs from outside, set it to false.
func (*Spider) SetScheduler ¶
func (*Spider) SetSleepTime ¶
SetSleepTime sets the sleep time after each crawl task, in milliseconds. If sleeptype is "fixed", s is the sleep time and e is unused. If sleeptype is "rand", the sleep time is random between s and e.
type SpiderOptions ¶
type SpiderOptions struct {
    // task name
    TaskName string
    // PageProcesser implementation
    PageProcesser PageProcesser
    // Downloader implementation
    Downloader Downloader
    // Scheduler implementation
    Scheduler Scheduler
    // Pipeline implementations; put the pipeline objects directly into this list
    Pipelines []Pipeline
    // ResourceManage implementation
    ResourceManage ResourceManage
    // maximum number of goroutines, used for the goroutine pool
    MaxGoroutineNum uint
}
SpiderOptions holds the spider's configuration options.