pholcus_lib

package

Version: v0.0.0-...-71bf9ba | Published: Feb 28, 2020 | License: Apache-2.0 | Imports: 6 | Imported by: 0


README

Updated to follow JD.com's new page rules

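1. Previously, changing the page parameter in the URL was enough to fetch each page of results. JD has since changed this. [Imgur screenshot] Clicking through to the second page now sets page=3 in the URL, so stepping the page parameter one page at a time no longer returns every product. The content for page=2 is loaded asynchronously once you scroll to the middle of the page.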

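2. The way the total number of result pages for a keyword is exposed has also changed. The value now sits inside a piece of JavaScript code, and the page markup is generated by that script. [Imgur screenshot]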

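3. When storing results, I check for an empty title. JD inserts ads among the products, and an ad has the same HTML structure as a product, so the parsing rules pick it up as an invalid entry that has to be dropped, as the screenshot shows: [Imgur screenshot]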
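Points 2 and 3 can be sketched with the standard library alone. This is a minimal illustration, not the package's code: `parsePageCount`, `dropAds`, and the `item` type are hypothetical names, and the sample script text in `main` is made up; only the `page_count:"N"` fragment comes from the spider's source.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// pageCountRe matches the page_count:"N" fragment embedded in an inline
// script on the search page and captures the digits.
var pageCountRe = regexp.MustCompile(`page_count:"(\d+)"`)

// parsePageCount (hypothetical helper) returns the page count found in the
// script text, or false when the fragment is absent or not a valid int.
func parsePageCount(script string) (int, bool) {
	m := pageCountRe.FindStringSubmatch(script)
	if m == nil {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

// item is an illustrative stand-in for one parsed search result.
type item struct {
	Title string
	Price string
}

// dropAds keeps only the items whose title is non-empty after trimming,
// which is how ad entries are told apart from real products here.
func dropAds(items []item) []item {
	kept := make([]item, 0, len(items))
	for _, it := range items {
		if strings.TrimSpace(it.Title) != "" {
			kept = append(kept, it)
		}
	}
	return kept
}

func main() {
	// Made-up script text; only the page_count fragment matters.
	n, ok := parsePageCount(`var cfg = {page:"1",page_count:"100",psort:"0"}`)
	fmt.Println(n, ok) // prints "100 true"

	items := []item{{"Go book", "59.00"}, {"  ", "1.00"}, {"", ""}}
	fmt.Println(len(dropAds(items))) // prints "1"
}
```

Using a capture group reads the number in one pass instead of running a second regular expression over the first match.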

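The overall flow of this crawler is:

1. Request the URL with page=1 and use a regular expression to find how many result pages the keyword has.
2. Generate all URLs for both loading modes (the directly returned page and the asynchronously loaded one).
3. Analyze the page structure and extract the relevant values.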
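The URL generation step can be sketched as follows. The two query strings are copied from the spider's source; `searchURLs` is a hypothetical helper, and two things are assumptions of this sketch rather than behavior of the package: the keyword is passed through `url.QueryEscape`, and every visible page from 1 through the page count is requested.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// Query strings taken from the spider's request URLs, minus keyword and page.
const (
	directQuery = "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&bs=1&s=1&click=0&page="
	lazyQuery   = "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&bs=1&s=31&scrolling=y&pos=30&page="
)

// searchURLs (hypothetical helper) returns two URLs per visible page n:
// page=2n-1 for the half returned directly and page=2n for the half that
// the site loads asynchronously on scroll.
func searchURLs(keyword string, pageCount int) []string {
	if pageCount <= 0 {
		return nil
	}
	kw := url.QueryEscape(keyword) // assumption: escaping is added in this sketch
	urls := make([]string, 0, 2*pageCount)
	for n := 1; n <= pageCount; n++ {
		urls = append(urls,
			"http://search.jd.com/Search?keyword="+kw+directQuery+strconv.Itoa(2*n-1),
			"http://search.jd.com/s_new.php?keyword="+kw+lazyQuery+strconv.Itoa(2*n),
		)
	}
	return urls
}

func main() {
	// Two visible pages yield four request URLs with page=1, 2, 3, 4.
	for _, u := range searchURLs("手机", 2) {
		fmt.Println(u)
	}
}
```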

This is my first attempt, so please bear with anything that is poorly written or wrong. ^_^

Documentation

Index

Variables

Constants

This section is empty.

Variables

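// JDSpider crawls JD.com search results for the keyword supplied via Keyin.
var JDSpider = &Spider{
	Name:        "京东搜索new",                 // "JD Search new"
	Description: "京东搜索结果 [search.jd.com]", // "JD search results"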

	Keyin:        KEYIN,
	Limit:        LIMIT,
	EnableCookie: false,
	RuleTree: &RuleTree{
		Root: func(ctx *Context) {
			// Entry point: hand off to the "判断页数" (determine page count) rule.
			ctx.Aid(map[string]interface{}{"Rule": "判断页数"}, "判断页数")
		},

		Trunk: map[string]*Rule{

			"判断页数": {
				AidFunc: func(ctx *Context, aid map[string]interface{}) interface{} {
					ctx.AddQueue(
						&request.Request{
							Url:  "http://search.jd.com/Search?keyword=" + ctx.GetKeyin() + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&bs=1&s=1&click=0&page=1",
							Rule: aid["Rule"].(string),
						},
					)
					return nil
				},
				ParseFunc: func(ctx *Context) {
					query := ctx.GetDom()
					// The total page count is embedded in an inline script as
					// page_count:"N"; capture the digits directly.
					re := regexp.MustCompile(`page_count:"(\d+)"`)
					pageCount := 0
					query.Find("script").Each(func(i int, s *goquery.Selection) {
						if m := re.FindStringSubmatch(s.Text()); m != nil {
							pageCount, _ = strconv.Atoi(m[1])
						}
					})
					// Pass the count on to the "生成请求" (generate requests) rule.
					ctx.Aid(map[string]interface{}{"PageCount": pageCount}, "生成请求")
				},
			},

			"生成请求": {

				AidFunc: func(ctx *Context, aid map[string]interface{}) interface{} {

					pageCount := aid["PageCount"].(int)

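					// Each visible page i is split across two requests: page=2i-1 is
					// returned directly and page=2i is the half loaded on scroll.
					// Use <= so the last visible page is not skipped.
					for i := 1; i <= pageCount; i++ {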
						ctx.AddQueue(
							&request.Request{
								Url:  "http://search.jd.com/Search?keyword=" + ctx.GetKeyin() + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&bs=1&s=1&click=0&page=" + strconv.Itoa(i*2-1),
								Rule: "搜索结果",
							},
						)
						ctx.AddQueue(
							&request.Request{
								Url:  "http://search.jd.com/s_new.php?keyword=" + ctx.GetKeyin() + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&bs=1&s=31&scrolling=y&pos=30&page=" + strconv.Itoa(i*2),
								Rule: "搜索结果",
							},
						)
					}
					return nil
				},
			},

			"搜索结果": {

				ItemFields: []string{
					"标题",  // title
					"价格",  // price
					"评论数", // comment count
					"链接",  // link
				},
				ParseFunc: func(ctx *Context) {
					query := ctx.GetDom()

					query.Find(".gl-item").Each(func(i int, s *goquery.Selection) {

						a := s.Find(".p-name.p-name-type-2 > a")
						title := a.Text()

						// Replace any leftover markup in the title with a space,
						// then trim surrounding whitespace.
						re := regexp.MustCompile(`<[\S\s]+?>`)

						title = re.ReplaceAllString(title, " ")
						title = strings.Trim(title, " \t\n")

						price := s.Find(".p-price > strong > i").Text()

						discuss := s.Find(".p-commit > strong > a").Text()

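						// Prepend the scheme; this assumes the href is protocol-relative.
						url, _ := a.Attr("href")
						url = "http:" + url

						// Ads share the product markup but have no title; skip them.
						if title != "" {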
							ctx.Output(map[int]interface{}{
								0: title,
								1: price,
								2: discuss,
								3: url,
							})
						}
					})
				},
			},
		},
	},
}

Functions

This section is empty.

Types

This section is empty.

Source Files

jdSpider.go
