FastBS4

module

v0.0.0-...-70a2eda Latest Latest Go to latest Published: Dec 9, 2024 License: Apache-2.0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/zhangyiming748/FastBS4

README ¶

FastBS4

soup

Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented till now :

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually 将 headers 设置为键值对的映射，这是单独调用 Header() 的替代方法
var Cookies map[string]string // Set cookies as a map of key-value  pairs, an alternative to calling Cookie() individually 将 cookie 设置为键值对的映射，这是单独调用 Cookie() 的替代方法
func Get(string) (string,error) {} // Takes the url as an argument, returns HTML string 将 url 作为参数，返回 HTML 字符串
func GetWithClient(string, *http.Client) {} // Takes the url and a custom HTTP client as arguments, returns HTML string 将 url 和自定义 HTTP 客户端作为参数，返回 HTML 字符串
func Post(string, string, interface{}) (string, error) {} // Takes the url, bodyType, and payload as an argument, returns HTML string 将 url、bodyType 和 Payload 作为参数，返回 HTML 字符串
func PostForm(string, url.Values) {} // Takes the url and body. bodyType is set to "application/x-www-form-urlencoded" 获取 url 和正文。 bodyType 设置为“application/x-www-form-urlencoded”
func Header(string, string) {} // Takes key,value pair to set as headers for the HTTP request made in Get() 将键值对设置为 Get() 中发出的 HTTP 请求的标头
func Cookie(string, string) {} // Takes key, value pair to set as cookies to be sent with the HTTP request in Get() 在 Get() 中将键、值对设置为与 HTTP 请求一起发送的 cookie
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the DOM constructed 将 HTML 字符串作为参数，返回指向构造的 DOM 的指针
func Find([]string) Root {} // Element tag,(attribute key-value pair) as argument, pointer to first occurence returned 元素标签，（属性键值对）作为参数，返回指向第一次出现的指针
func FindAll([]string) []Root {} // Same as Find(), but pointers to all occurrences returned 与 Find() 相同，但返回指向所有出现次数的指针
func FindStrict([]string) Root {} //  Element tag,(attribute key-value pair) as argument, pointer to first occurence returned with exact matching values 元素标签，（属性键值对）作为参数，指向第一次出现的指针，返回精确匹配的值
func FindAllStrict([]string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned 与 FindStrict() 相同，但返回指向所有出现次数的指针
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned 指向 DOM 中返回的元素的下一个同级的指针
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned 指向返回的 DOM 中 Element 的下一个同级元素的指针
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned 指向 DOM 中返回的元素的前一个同级的指针
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned 指向 DOM 中返回的 Element 的上一个同级元素的指针
func Children() []Root {} // Find all direct children of this DOM element 查找此 DOM 元素的所有直接子元素
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values 返回的映射包含元素的所有属性，作为对其各自值的查找
func Text() string {} // Full text inside a non-nested tag returned, first half returned in a nested one 返回非嵌套标签内的全文，前半部分在嵌套标签中返回
func FullText() string {} // Full text inside a nested/non-nested tag returned 返回嵌套/非嵌套标签内的全文
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default 将调试模式设置为 true 或 false；默认为 false
func HTML() {} // HTML returns the HTML code for the specific element //HTML 返回特定元素的 HTML 代码

Root is a struct, containing three fields :

Pointer containing the pointer to the current html node
NodeValue containing the current html node's value, i.e. the tag name for an ElementNode, or the text in case of a TextNode
Error containing an error in a struct if one occurrs, else nil is returned. A detailed text explaination of the error can be accessed using the Error() function. A field Type in this struct of type ErrorType will denote the kind of error that took place, which will consist of either of the following
- ErrUnableToParse
- ErrElementNotFound
- ErrNoNextSibling
- ErrNoPreviousSibling
- ErrNoNextElementSibling
- ErrNoPreviousElementSibling
- ErrCreatingGetRequest
- ErrInGetRequest
- ErrReadingResponse

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

An example code is given below to scrape the "Comics I Enjoy" part (text and its links) from xkcd.

More Examples

package main

import (
	"fmt"
	"github.com/anaskhan96/soup"
	"os"
)

func main() {
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}

Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome, to make it a better web scraper. If you think there should be a particular feature or function included in the package, feel free to open up a new issue or pull request.

Directories ¶

Path	Synopsis
ag
liwenzhou
netease
ph
rss
soup
telegraph
util

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL