page

package
v1.2.7 Latest
This package is not in the latest version of its module.
Published: Aug 4, 2021 License: MPL-2.0 Imports: 7 Imported by: 0

Documentation

Overview

Package page contains the result fetched by the Downloader, as well as the result parsed by the PageProcesser.

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Page

type Page struct {
	// contains filtered or unexported fields
}

Page represents a crawled entity.

func NewPage

func NewPage(req *request.Request) *Page

NewPage returns an initialized Page object.

func (*Page) AddField

func (this *Page) AddField(key string, value string)

AddField saves a key-value string pair to PageItems for later use by the Pipeline.

func (*Page) AddTargetRequest

func (this *Page) AddTargetRequest(url string, respType string) *Page

AddTargetRequest adds one new Request waiting to be crawled.

func (*Page) AddTargetRequestWithHeaderFile

func (this *Page) AddTargetRequestWithHeaderFile(url string, respType string, headerFile string) *Page

AddTargetRequestWithHeaderFile adds one new Request, with a header file, waiting to be crawled.

func (*Page) AddTargetRequestWithParams

func (this *Page) AddTargetRequestWithParams(req *request.Request) *Page

AddTargetRequestWithParams adds one new Request waiting to be crawled. The respType is "html", "json", "jsonp", or "text". The urltag is a name that marks the URL and distinguishes different URLs in the PageProcesser and Pipeline. The method is POST or GET. The postdata is the HTTP body string. The header is the HTTP header. The cookies are the HTTP cookies.
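The allowed respType and method values listed above can be checked before a request is built. The following is a minimal sketch, not part of the page package: checkRespType and checkMethod are hypothetical helper names, and the package itself may validate these values differently or not at all.

```go
package main

import "fmt"

// validRespTypes holds the response types named in the documentation.
var validRespTypes = map[string]bool{
	"html":  true,
	"json":  true,
	"jsonp": true,
	"text":  true,
}

// checkRespType is a hypothetical helper (not provided by the page package).
// It returns an error for any respType other than the four documented values.
func checkRespType(respType string) error {
	if !validRespTypes[respType] {
		return fmt.Errorf("unsupported respType %q", respType)
	}
	return nil
}

// checkMethod is a hypothetical helper that accepts only the two
// documented HTTP methods, POST and GET.
func checkMethod(method string) error {
	if method != "GET" && method != "POST" {
		return fmt.Errorf("unsupported method %q", method)
	}
	return nil
}

func main() {
	for _, rt := range []string{"html", "xml"} {
		if err := checkRespType(rt); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", rt)
	}
	if err := checkMethod("PUT"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Running such checks up front keeps malformed requests from reaching the scheduler, at the cost of duplicating a list that the library may extend in later versions.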

func (*Page) AddTargetRequestWithProxy

func (this *Page) AddTargetRequestWithProxy(url string, respType string, proxyHost string) *Page

AddTargetRequestWithProxy adds one new Request waiting to be crawled.

func (*Page) AddTargetRequests

func (this *Page) AddTargetRequests(urls []string, respType string) *Page

AddTargetRequests adds new Requests waiting to be crawled.
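Because the AddTargetRequest family returns *Page, calls can be chained. The sketch below illustrates that pattern with stand-in types only: page and target are hypothetical local types that mirror the documented signatures, not the real Page and request.Request, and the example.com URLs are placeholders.

```go
package main

import "fmt"

// target is a hypothetical stand-in for request.Request.
type target struct {
	url      string
	respType string
}

// page is a hypothetical stand-in that mirrors the documented
// AddTargetRequest and AddTargetRequests signatures.
type page struct {
	targets []target
}

// AddTargetRequest queues one URL and returns the receiver for chaining.
func (p *page) AddTargetRequest(url string, respType string) *page {
	p.targets = append(p.targets, target{url: url, respType: respType})
	return p
}

// AddTargetRequests queues several URLs that share one respType.
func (p *page) AddTargetRequests(urls []string, respType string) *page {
	for _, u := range urls {
		p.AddTargetRequest(u, respType)
	}
	return p
}

// GetTargetRequests returns everything queued so far.
func (p *page) GetTargetRequests() []target {
	return p.targets
}

func main() {
	p := &page{}
	p.AddTargetRequest("http://example.com/index", "html").
		AddTargetRequests([]string{
			"http://example.com/api/1",
			"http://example.com/api/2",
		}, "json")

	for _, t := range p.GetTargetRequests() {
		fmt.Println(t.respType, t.url)
	}
}
```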

func (*Page) AddTargetRequestsWithParams

func (this *Page) AddTargetRequestsWithParams(reqs []*request.Request) *Page

AddTargetRequestsWithParams adds new Requests waiting to be crawled.

func (*Page) AddTargetRequestsWithProxy

func (this *Page) AddTargetRequestsWithProxy(urls []string, respType string, proxyHost string) *Page

AddTargetRequestsWithProxy adds new Requests waiting to be crawled.

func (*Page) Errormsg

func (this *Page) Errormsg() string

Errormsg returns the download error message.

func (*Page) GetBodyStr

func (this *Page) GetBodyStr() string

GetBodyStr returns the crawled body as a plain string.

func (*Page) GetCookies

func (this *Page) GetCookies() []*http.Cookie

GetCookies returns the cookies of the HTTP response.

func (*Page) GetHeader

func (this *Page) GetHeader() http.Header

GetHeader returns the header of the HTTP response.

func (*Page) GetHtmlParser

func (this *Page) GetHtmlParser() *goquery.Document

GetHtmlParser returns the goquery object bound to the target crawl result.

func (*Page) GetJson

func (this *Page) GetJson() *simplejson.Json

GetJson returns the JSON result.

func (*Page) GetPageItems

func (this *Page) GetPageItems() *page_items.PageItems

GetPageItems returns the PageItems object that records the key-value pairs parsed in the PageProcesser.

func (*Page) GetRequest

func (this *Page) GetRequest() *request.Request

GetRequest returns the request object of this page.

func (*Page) GetSkip

func (this *Page) GetSkip() bool

GetSkip returns the skip label of PageItems.

func (*Page) GetTargetRequests

func (this *Page) GetTargetRequests() []*request.Request

GetTargetRequests returns the target requests that will be put into the Scheduler.

func (*Page) GetUrlTag

func (this *Page) GetUrlTag() string

GetUrlTag returns the name of the URL.
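Since the URL tag exists to tell different kinds of URL apart, a processor will typically branch on it. The following is only a sketch of that idea: dispatch is a hypothetical function, and the tag names "list" and "detail" and the handler names it returns are invented for illustration.

```go
package main

import "fmt"

// dispatch is a hypothetical router that picks a handler name from the
// URL tag, as a PageProcesser might do with the result of GetUrlTag.
func dispatch(urlTag string) string {
	switch urlTag {
	case "list":
		return "parseList"
	case "detail":
		return "parseDetail"
	default:
		// Unknown tags are ignored in this sketch.
		return "ignore"
	}
}

func main() {
	for _, tag := range []string{"list", "detail", ""} {
		fmt.Printf("%q -> %s\n", tag, dispatch(tag))
	}
}
```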

func (*Page) IsSucc

func (this *Page) IsSucc() bool

IsSucc reports whether the download succeeded.

func (*Page) ResetHtmlParser

func (this *Page) ResetHtmlParser() *goquery.Document

ResetHtmlParser returns the goquery object bound to the target crawl result.

func (*Page) SetBodyStr

func (this *Page) SetBodyStr(body string) *Page

SetBodyStr saves the crawled plain string in the Page.

func (*Page) SetCookies

func (this *Page) SetCookies(cookies []*http.Cookie)

SetCookies saves the cookies of the HTTP response.

func (*Page) SetHeader

func (this *Page) SetHeader(header http.Header)

SetHeader saves the header of the HTTP response.

func (*Page) SetHtmlParser

func (this *Page) SetHtmlParser(doc *goquery.Document) *Page

SetHtmlParser saves the goquery object bound to the target crawl result.

func (*Page) SetJson

func (this *Page) SetJson(js *simplejson.Json) *Page

SetJson saves the JSON result.

func (*Page) SetRequest

func (this *Page) SetRequest(r *request.Request) *Page

SetRequest saves the request object of this page.

func (*Page) SetSkip

func (this *Page) SetSkip(skip bool)

SetSkip sets the "skip" label of PageItems. PageItems will not be saved in the Pipeline when skip is set to true.
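The effect of the skip label can be sketched with a small stand-in. Here items is a hypothetical local type that mimics the AddField, SetSkip and GetSkip behavior described on this page, and save is a hypothetical pipeline step; the real PageItems and Pipeline types may be implemented differently.

```go
package main

import "fmt"

// items is a hypothetical stand-in for PageItems: parsed fields plus
// a skip label.
type items struct {
	fields map[string]string
	skip   bool
}

func newItems() *items {
	return &items{fields: map[string]string{}}
}

// AddField stores one key-value string pair.
func (it *items) AddField(key string, value string) { it.fields[key] = value }

// SetSkip sets the skip label.
func (it *items) SetSkip(skip bool) { it.skip = skip }

// GetSkip returns the skip label.
func (it *items) GetSkip() bool { return it.skip }

// save is a hypothetical pipeline step. It copies the fields into sink
// unless the skip label is set, and returns how many fields it saved.
func save(it *items, sink map[string]string) int {
	if it.GetSkip() {
		return 0
	}
	for k, v := range it.fields {
		sink[k] = v
	}
	return len(it.fields)
}

func main() {
	sink := map[string]string{}

	it := newItems()
	it.AddField("title", "hello")
	it.SetSkip(true)
	fmt.Println("saved with skip set:", save(it, sink))

	it.SetSkip(false)
	fmt.Println("saved with skip cleared:", save(it, sink))
}
```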

func (*Page) SetStatus

func (this *Page) SetStatus(isfail bool, errormsg string)

SetStatus saves status information about the download process.
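SetStatus, IsSucc and Errormsg fit together as a write side and a read side for the download outcome. The sketch below shows one plausible reading, under the assumption that IsSucc is simply the negation of the isfail flag passed to SetStatus; status is a hypothetical stand-in type and describe is a hypothetical caller, neither of which belongs to the page package.

```go
package main

import "fmt"

// status is a hypothetical stand-in for the download status held by a Page.
type status struct {
	isfail   bool
	errormsg string
}

// SetStatus records whether the download failed and why.
func (s *status) SetStatus(isfail bool, errormsg string) {
	s.isfail = isfail
	s.errormsg = errormsg
}

// IsSucc reports success. Assumption: it is the negation of isfail.
func (s *status) IsSucc() bool { return !s.isfail }

// Errormsg returns the recorded error message.
func (s *status) Errormsg() string { return s.errormsg }

// describe shows the check-then-read pattern a caller might use
// before touching the page body.
func describe(s *status) string {
	if !s.IsSucc() {
		return "download failed: " + s.Errormsg()
	}
	return "download ok"
}

func main() {
	s := &status{}
	fmt.Println(describe(s))

	s.SetStatus(true, "timeout")
	fmt.Println(describe(s))
}
```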
