surfer

package
v0.0.0-...-4498091
Published: Dec 8, 2017 License: Apache-2.0 Imports: 25 Imported by: 0

Documentation

Overview

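surfer is a high-concurrency web downloader written in Go. It supports the GET, POST, and HEAD methods over http and https, and offers two modes: a fixed UserAgent with cookies saved automatically, or a large pool of random UserAgents with cookies disabled. It closely simulates browser behavior and can handle tasks such as simulated login.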

Constants

const (
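	SurfID             = 0               // identifier of the Surf downloader
	PhomtomJsID        = 1               // identifier of the PhantomJS downloader
	DefaultMethod      = "GET"           // default request method
	DefaultDialTimeout = 2 * time.Minute // default timeout for connecting to the server
	DefaultConnTimeout = 2 * time.Minute // default download timeout
	DefaultTryTimes    = 3               // default maximum number of download attempts
	DefaultRetryPause  = 2 * time.Second // default pause before retrying a download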
)

Variables

This section is empty.

Functions

func AutoToUTF8

func AutoToUTF8(resp *http.Response) error

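When downloading with the surf core, AutoToUTF8 can try to convert the response to UTF-8 automatically. With the phantomjs core no conversion is needed (the output is already UTF-8).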

func BodyBytes

func BodyBytes(resp *http.Response) ([]byte, error)

BodyBytes reads the entire response body.

func DestroyJsFiles

func DestroyJsFiles()

DestroyJsFiles removes Phantomjs's temporary js files.

func Download

func Download(req Request) (resp *http.Response, err error)

func GetWDPath

func GetWDPath() string

GetWDPath returns the working directory path.

func IsDirExists

func IsDirExists(path string) bool

IsDirExists reports whether path is a directory.

func IsFileExists

func IsFileExists(path string) bool

IsFileExists reports whether path is a file.

func UrlEncode

func UrlEncode(urlStr string) (*url.URL, error)

UrlEncode returns a pointer to the encoded url.URL and any parse error.
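
One plausible reading of "encoded" is that the query string is normalized after parsing. The sketch below does that with net/url; encodeURL is a hypothetical stand-in, and the re-encoding step is an assumption rather than the confirmed behavior of UrlEncode.

```go
package main

import (
	"fmt"
	"net/url"
)

// encodeURL parses raw and re-encodes its query parameters.
// Note that Values.Encode sorts parameters by key.
func encodeURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	u.RawQuery = u.Query().Encode()
	return u, nil
}

func main() {
	u, err := encodeURL("http://example.com/search?q=a%20b&lang=go")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.String())
}
```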

func WalkDir

func WalkDir(targpath string, suffixes ...string) (dirlist []string)

WalkDir traverses a directory; suffixes can be specified to filter the results.

Types

type Body

type Body struct {
	io.ReadCloser
	io.Reader
}

Body wraps Response.Body.

func (*Body) Read

func (b *Body) Read(p []byte) (int, error)

type DefaultRequest

type DefaultRequest struct {
	// url (required)
	Url string
	// GET POST POST-M HEAD (defaults to GET)
	Method string
	// http header
	Header http.Header
	// whether to use cookies; set via the Spider's EnableCookie
	EnableCookie bool
	// POST values
	PostData string
	// dial tcp: i/o timeout
	DialTimeout time.Duration
	// WSARecv tcp: i/o timeout
	ConnTimeout time.Duration
	// the max times of download
	TryTimes int
	// how long pause when retry
	RetryPause time.Duration
	// max redirect times
	// when RedirectTimes equals 0, redirects are unlimited
	// when RedirectTimes is less than 0, redirects are not followed
	RedirectTimes int
	// the download ProxyHost
	Proxy string

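	// ID of the downloader to use
	// 0 selects the Surf high-concurrency downloader, which has a full set of controls
	// 1 selects the PhantomJS downloader, which is better at getting past anti-crawler defenses but is slow and low-concurrency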
	DownloaderID int
	// contains filtered or unexported fields
}

DefaultRequest is the default implementation of Request.

func (*DefaultRequest) GetConnTimeout

func (self *DefaultRequest) GetConnTimeout() time.Duration

WSARecv tcp: i/o timeout

func (*DefaultRequest) GetDialTimeout

func (self *DefaultRequest) GetDialTimeout() time.Duration

dial tcp: i/o timeout

func (*DefaultRequest) GetDownloaderID

func (self *DefaultRequest) GetDownloaderID() int

select Surf or PhantomJS

func (*DefaultRequest) GetEnableCookie

func (self *DefaultRequest) GetEnableCookie() bool

enable http cookies

func (*DefaultRequest) GetHeader

func (self *DefaultRequest) GetHeader() http.Header

http header

func (*DefaultRequest) GetMethod

func (self *DefaultRequest) GetMethod() string

GET POST POST-M HEAD

func (*DefaultRequest) GetPostData

func (self *DefaultRequest) GetPostData() string

POST values

func (*DefaultRequest) GetProxy

func (self *DefaultRequest) GetProxy() string

the download ProxyHost

func (*DefaultRequest) GetRedirectTimes

func (self *DefaultRequest) GetRedirectTimes() int

max redirect times

func (*DefaultRequest) GetRetryPause

func (self *DefaultRequest) GetRetryPause() time.Duration

the pause time of retry

func (*DefaultRequest) GetTryTimes

func (self *DefaultRequest) GetTryTimes() int

the max times of download

func (*DefaultRequest) GetUrl

func (self *DefaultRequest) GetUrl() string

url

type Param

type Param struct {
	// contains filtered or unexported fields
}

func NewParam

func NewParam(req Request) (param *Param, err error)

type Phantom

type Phantom struct {
	PhantomjsFile string // full file name of the Phantomjs executable
	TempJsDir     string // directory for temporary js files
	// contains filtered or unexported fields
}

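Phantom is a downloader based on Phantomjs that complements surfer. It is much slower than surfer, but because it simulates a browser it is better at getting past anti-crawler defenses. It supports UserAgent, TryTimes, RetryPause, and custom js.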

func (*Phantom) DestroyJsFiles

func (self *Phantom) DestroyJsFiles()

DestroyJsFiles removes the temporary js files.

func (*Phantom) Download

func (self *Phantom) Download(req Request) (resp *http.Response, err error)

Download implements the surfer downloader interface.

type Request

type Request interface {
	// url
	GetUrl() string
	// GET POST POST-M HEAD
	GetMethod() string
	// POST values
	GetPostData() string
	// http header
	GetHeader() http.Header
	// enable http cookies
	GetEnableCookie() bool
	// dial tcp: i/o timeout
	GetDialTimeout() time.Duration
	// WSARecv tcp: i/o timeout
	GetConnTimeout() time.Duration
	// the max times of download
	GetTryTimes() int
	// the pause time of retry
	GetRetryPause() time.Duration
	// the download ProxyHost
	GetProxy() string
	// max redirect times
	GetRedirectTimes() int
	// select Surf or PhantomJS
	GetDownloaderID() int
}

type Response

type Response struct {
	Cookies []string
	Body    string
}

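A downloader implementation based on Phantomjs that complements surfer. It is much slower than surfer, but because it simulates a browser it is better at getting past anti-crawler defenses. It supports UserAgent, TryTimes, RetryPause, and custom js.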

type Surf

type Surf struct {
	// contains filtered or unexported fields
}

Surf is the default Download implementation.

func (*Surf) Download

func (self *Surf) Download(req Request) (resp *http.Response, err error)

type Surfer

type Surfer interface {
	// GET @param url string, header http.Header, cookies []*http.Cookie
	// HEAD @param url string, header http.Header, cookies []*http.Cookie
	// POST PostForm @param url, referer string, values url.Values, header http.Header, cookies []*http.Cookie
	// POST-M PostMultipart @param url, referer string, values url.Values, header http.Header, cookies []*http.Cookie
	Download(Request) (resp *http.Response, err error)
}

Surfer represents the core of an HTTP web browser for crawlers.

func New

func New() Surfer

func NewPhantom

func NewPhantom(phantomjsFile, tempJsDir string) Surfer

Directories

Path Synopsis
Package agent generates user agent strings for well-known browsers and for custom browsers.
