pholcus_lib

package
v0.0.0-...-71bf9ba Latest
Warning

This package is not in the latest version of its module.

Published: Feb 28, 2020 License: Apache-2.0 Imports: 6 Imported by: 0

README

China News Service (chinanews.com) - Rolling News Section

Description

Crawls only the rolling news section (10 pages in total).

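Code notes

1. Request the rolling news section URL directly (http://www.chinanews.com/scroll-news/news1.html)
2. Locate the pagination navigation on that page
3. Extract the pagination links from it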
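
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not the spider's actual code: the helper name pageLinks and the sample pagination markup are hypothetical, and a Pholcus rule would normally read the links through ctx.GetDom() and goquery rather than a regular expression.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern matches double-quoted href attributes. A regular expression is
// only adequate for this small, well-formed sample; it is not an HTML parser.
var hrefPattern = regexp.MustCompile(`href="([^"]+)"`)

// pageLinks (hypothetical helper) returns the absolute, de-duplicated URLs
// linked from navHTML, resolved against base, in order of first appearance.
func pageLinks(base, navHTML string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	seen := make(map[string]bool)
	var links []string
	for _, m := range hrefPattern.FindAllStringSubmatch(navHTML, -1) {
		ref, err := url.Parse(m[1])
		if err != nil {
			continue // skip malformed links
		}
		abs := baseURL.ResolveReference(ref).String()
		if !seen[abs] {
			seen[abs] = true
			links = append(links, abs)
		}
	}
	return links, nil
}

func main() {
	// Step 1: the start URL given in the README.
	const start = "http://www.chinanews.com/scroll-news/news1.html"
	// Step 2: pagination markup. This fragment is invented for the example;
	// the real page layout is not described in the README.
	const nav = `<div class="pagebox">
<a href="news1.html">1</a>
<a href="/scroll-news/news2.html">2</a>
<a href="news2.html">next</a>
</div>`
	// Step 3: collect the page links.
	links, err := pageLinks(start, nav)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```

Resolving each href against the start URL handles relative and root-relative links in one place, and the seen map keeps a page from being queued twice when the navigation repeats a link (for example a "next" button).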

I have only just started learning, so the code is rough; feedback is very welcome. WeChat: gaoyawei616

Documentation

Index

Constants

This section is empty.

Variables

var FileTest = &Spider{
	Name:        "villagevoice",
	Description: "https://www.villagevoice.com/",

	EnableCookie: false,
	RuleTree: &RuleTree{
		Root: func(ctx *Context) {
			ctx.AddQueue(&request.Request{
				Url:  "https://www.villagevoice.com/",
				Rule: "villagevoice",
			})
		},

		Trunk: map[string]*Rule{
			"villagevoice": {
				ItemFields: []string{
					"标题", // title
					"内容", // content
					"来源", // source URL
					"时间", // crawl time
				},
				ParseFunc: func(ctx *Context) {
					query := ctx.GetDom()

					title := query.Find(".content-h1 h1").Text()
					newList := query.Find("a")
					newList.Each(func(i int, s *goquery.Selection) {
						if url, ok := s.Attr("href"); ok {
							if strings.HasPrefix(url, "//") {
								url = "https:" + url
							} else if strings.HasPrefix(url, "/") {
								url = "https://www.villagevoice.com" + url
							}
							if strings.Contains(url, "#") {
								url = url[:strings.LastIndex(url, "#")]
							}
							if !strings.Contains(url, "villagevoice") {
								return
							}
							ctx.AddQueue(&request.Request{
								Url:  url,
								Rule: "villagevoice",
							})
						}

					})

					content := util.RemoveNilLine(query.Find("#content").Text())
					if len(content) < 100 {
						return
					}
					ctx.Output(map[int]interface{}{
						0: title,
						1: content,
						2: query.Url.String(),
						3: time.Now(),
					})
				},
			},
		},
	},
}
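
The link handling inside ParseFunc above can be read in isolation as a small pure function. The sketch below restates that logic with only the standard library so it can be run and tested without Pholcus; the names normalizeLink and siteRoot are hypothetical and do not appear in the package.

```go
package main

import (
	"fmt"
	"strings"
)

// siteRoot is the scheme and host prepended to root-relative links.
const siteRoot = "https://www.villagevoice.com"

// normalizeLink (hypothetical helper) mirrors the checks in ParseFunc:
// it makes the link absolute, drops any fragment, and reports whether
// the result still points at the target site.
func normalizeLink(href string) (string, bool) {
	switch {
	case strings.HasPrefix(href, "//"):
		// Protocol-relative link: add the scheme.
		href = "https:" + href
	case strings.HasPrefix(href, "/"):
		// Root-relative link: add scheme and host.
		href = siteRoot + href
	}
	// Cut everything from the last "#" onward, as the original code does.
	if i := strings.LastIndex(href, "#"); i >= 0 {
		href = href[:i]
	}
	// Keep only links that mention the site name.
	if !strings.Contains(href, "villagevoice") {
		return "", false
	}
	return href, true
}

func main() {
	samples := []string{
		"//www.villagevoice.com/news",
		"/music#top",
		"https://example.com/",
		"#",
	}
	for _, h := range samples {
		if u, ok := normalizeLink(h); ok {
			fmt.Println("queue:", u)
		} else {
			fmt.Printf("skip:  %q\n", h)
		}
	}
}
```

The substring check is deliberately loose, matching the original: any URL containing "villagevoice" passes, including one on another host. Comparing the parsed host with net/url would be a stricter alternative.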

Functions

This section is empty.

Types

This section is empty.
