extractor

package module

Version: v0.0.0-...-9b24770 (latest) | Published: Jan 20, 2025 | License: MIT | Imports: 10 | Imported by: 2

README

rabbitcrawler

A powerful web scraping tool designed for extracting structured data from websites with configurable rules and multiple execution modes.

Features

Configurable JSON-based scraping rules
Multiple extraction modes:
- Static: Fast HTML parsing without JavaScript execution
- Browser: Full browser emulation with JavaScript support
Concurrent scraping with adjustable worker count
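
The "adjustable worker count" feature can be pictured as a fixed-size worker pool. The sketch below is only an illustration of that general pattern, not the crawler's actual implementation; fetchFunc, result, and scrapeAll are hypothetical names introduced here, and the fetch callback stands in for a real extraction call.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchFunc stands in for a real extraction call. It is a hypothetical
// placeholder and not part of the extractor package.
type fetchFunc func(url string) (string, error)

// result pairs a URL with the outcome of processing it.
type result struct {
	URL   string
	Value string
	Err   error
}

// scrapeAll processes urls with a fixed number of worker goroutines and
// returns one result per URL, in the same order as the input.
func scrapeAll(urls []string, workers int, fetch fetchFunc) []result {
	if workers < 1 {
		workers = 1
	}
	results := make([]result, len(urls))
	jobs := make(chan int)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes to
			// results never touch the same slot twice.
			for i := range jobs {
				v, err := fetch(urls[i])
				results[i] = result{URL: urls[i], Value: v, Err: err}
			}
		}()
	}
	for i := range urls {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	urls := []string{"https://example.com/a", "https://example.com/b"}
	out := scrapeAll(urls, 2, func(u string) (string, error) {
		return "fetched " + u, nil
	})
	for _, r := range out {
		fmt.Println(r.URL, r.Value, r.Err)
	}
}
```

Raising the worker count increases parallelism at the cost of more simultaneous requests to the target site.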

Installation

go install github.com/crawlerclub/extractor/cmd/rabbitextract@latest
go install github.com/crawlerclub/extractor/cmd/rabbitcrawler@latest

Using rabbitextract

rabbitextract is a command-line tool for extracting data from a single webpage using JSON configuration rules.

Command Line Options

-config: Path to the config JSON file (required)
-url: URL to extract data from (optional if provided in config)
-mode: Extraction mode (optional, defaults to "auto")
- auto: Automatically choose between static and browser mode
- static: Fast HTML parsing without JavaScript
- browser: Full browser emulation with JavaScript support
-output: Output file path (optional, defaults to stdout)
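
For readers who want to mirror these options in their own tool, the following sketch parses the same four flags with Go's standard flag package. It is an assumption about how such parsing could look, not the source of rabbitextract; the options struct and parseOptions function are hypothetical.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options holds the four flags listed above. The struct itself is
// hypothetical; only the flag names and defaults come from the README.
type options struct {
	Config string
	URL    string
	Mode   string
	Output string
}

// parseOptions parses args and checks the documented constraints:
// -config is required and -mode must be auto, static, or browser.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("rabbitextract", flag.ContinueOnError)
	fs.StringVar(&o.Config, "config", "", "path to the config JSON file (required)")
	fs.StringVar(&o.URL, "url", "", "URL to extract data from")
	fs.StringVar(&o.Mode, "mode", "auto", "extraction mode: auto, static, or browser")
	fs.StringVar(&o.Output, "output", "", "output file path (empty means stdout)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Config == "" {
		return o, errors.New("-config is required")
	}
	switch o.Mode {
	case "auto", "static", "browser":
	default:
		return o, fmt.Errorf("unknown mode %q", o.Mode)
	}
	return o, nil
}

func main() {
	// A fixed argument list keeps the example deterministic; a real tool
	// would pass os.Args[1:] instead.
	o, err := parseOptions([]string{"-config", "config.json", "-mode", "static"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```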

Example Usage

Create a configuration file config.json:

{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {
          "name": "title",
          "type": "text",
          "selector": ".//h1"
        },
        {
          "name": "content",
          "type": "text",
          "selector": ".//div[@class='content']"
        }
      ]
    }
  ]
}

Run the extractor:

rabbitextract -config config.json -url "https://example.com/page" -output result.json
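
The same configuration can be decoded in Go with encoding/json. The sketch below declares local copies of the ExtractorConfig, Schema, and Field structs as they appear in the documentation further down, so that it compiles without the module; in real code you would import the package instead. parseConfig and the sample constant are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the documented types, reproduced so that the example is
// self-contained. Real code would use the types from the extractor package.
type Field struct {
	Name      string  `json:"name"`
	From      string  `json:"from"`
	Selector  string  `json:"selector"`
	Pattern   string  `json:"pattern"`
	Type      string  `json:"type"`
	Attribute string  `json:"attribute,omitempty"`
	Fields    []Field `json:"fields,omitempty"`
}

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Type       string  `json:"type"`
	Fields     []Field `json:"fields,omitempty"`
}

type ExtractorConfig struct {
	Name       string   `json:"name"`
	Pattern    string   `json:"pattern"`
	ExampleURL string   `json:"example_url"`
	Mode       string   `json:"mode"`
	Schemas    []Schema `json:"schemas"`
}

// sample is a shortened version of the config.json shown above.
const sample = `{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [{
    "name": "articles",
    "entity_type": "article",
    "selector": "//div[@class='article']",
    "fields": [
      {"name": "title", "type": "text", "selector": ".//h1"},
      {"name": "content", "type": "text", "selector": ".//div[@class='content']"}
    ]
  }]
}`

// parseConfig decodes a JSON document into an ExtractorConfig.
func parseConfig(data []byte) (ExtractorConfig, error) {
	var cfg ExtractorConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, fmt.Errorf("parse config: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println(err)
		return
	}
	for _, s := range cfg.Schemas {
		fmt.Printf("schema %s (%s): %d fields\n", s.Name, s.EntityType, len(s.Fields))
	}
}
```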

Supported Field Types

text: Extract the text content of an element
attribute: Extract a specific attribute value from an element
nested: Extract a nested object with multiple fields
list: Extract an array of items
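
As a rough illustration of how the attribute and nested types might be combined, the sketch below builds a nested field and prints it as JSON. The Field struct is a local copy of the documented type. The selectors, the "href" attribute, and the assumption that child selectors are evaluated relative to the parent element are all illustrative guesses, not behavior confirmed by this README.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Field is a local copy of the documented type, reproduced so that the
// example compiles on its own.
type Field struct {
	Name      string  `json:"name"`
	From      string  `json:"from"`
	Selector  string  `json:"selector"`
	Pattern   string  `json:"pattern"`
	Type      string  `json:"type"`
	Attribute string  `json:"attribute,omitempty"`
	Fields    []Field `json:"fields,omitempty"`
}

// authorField describes a hypothetical "author" object with a text child
// and an attribute child. All selectors here are made up for illustration.
func authorField() Field {
	return Field{
		Name:     "author",
		Type:     "nested",
		Selector: ".//div[@class='author']",
		Fields: []Field{
			{Name: "name", Type: "text", Selector: ".//span"},
			{Name: "profile", Type: "attribute", Selector: ".//a", Attribute: "href"},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(authorField(), "", "  ")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(out))
}
```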

Special Fields

_id: Used to generate a unique external_id for each item
_time: Used to set the external_time of each item

Documentation

Index

Constants
type BrowserExtractor
- func NewBrowserExtractor(config ExtractorConfig) *BrowserExtractor
- func (e *BrowserExtractor) Extract(url string) (*ExtractionResult, error)
- func (e *BrowserExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)
type ExtractedItem
type ExtractionError
type ExtractionResult
type Extractor
- func NewExtractor(config ExtractorConfig) Extractor
type ExtractorConfig
type Field
type Schema
type SchemaInfo
type SchemaResult
type StaticExtractor
- func NewStaticExtractor(config ExtractorConfig) *StaticExtractor
- func (e *StaticExtractor) Extract(url string) (*ExtractionResult, error)
- func (e *StaticExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)

Constants

const (
	FromURL     string = "url"
	FromElement string = "element"
)

Types

type BrowserExtractor

type BrowserExtractor struct {
	Config  ExtractorConfig
	Browser *rod.Browser
}

func NewBrowserExtractor

func NewBrowserExtractor(config ExtractorConfig) *BrowserExtractor

func (*BrowserExtractor) Extract

func (e *BrowserExtractor) Extract(url string) (*ExtractionResult, error)

func (*BrowserExtractor) ExtractWithoutCache

func (e *BrowserExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)

type ExtractedItem

type ExtractedItem map[string]interface{}

type ExtractionError

type ExtractionError struct {
	Field   string
	Message string
	URL     string
}

type ExtractionResult

type ExtractionResult struct {
	SchemaResults map[string]SchemaResult
	Errors        []ExtractionError
	FinalURL      string
}
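
A caller will typically walk SchemaResults and Errors after an extraction. The sketch below shows one way to do that with local copies of the documented result types; summarize is a hypothetical helper. Using the schema name "articles" as the map key is an assumption, since the documentation does not say how SchemaResults is keyed.

```go
package main

import (
	"fmt"
	"sort"
)

// Local copies of the documented result types, so the example is
// self-contained. Real code would use the extractor package's types.
type ExtractedItem map[string]interface{}

type ExtractionError struct {
	Field   string
	Message string
	URL     string
}

type SchemaInfo struct {
	Name       string `json:"name"`
	EntityType string `json:"entity_type"`
}

type SchemaResult struct {
	Schema SchemaInfo
	Items  []ExtractedItem
}

type ExtractionResult struct {
	SchemaResults map[string]SchemaResult
	Errors        []ExtractionError
	FinalURL      string
}

// summarize returns one line per schema result (sorted by map key for
// stable output), followed by one line per extraction error.
func summarize(res *ExtractionResult) []string {
	keys := make([]string, 0, len(res.SchemaResults))
	for k := range res.SchemaResults {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var lines []string
	for _, k := range keys {
		sr := res.SchemaResults[k]
		lines = append(lines, fmt.Sprintf("%s (%s): %d items", k, sr.Schema.EntityType, len(sr.Items)))
	}
	for _, e := range res.Errors {
		lines = append(lines, fmt.Sprintf("error in field %q at %s: %s", e.Field, e.URL, e.Message))
	}
	return lines
}

func main() {
	res := &ExtractionResult{
		SchemaResults: map[string]SchemaResult{
			"articles": {
				Schema: SchemaInfo{Name: "articles", EntityType: "article"},
				Items:  []ExtractedItem{{"title": "Hello"}},
			},
		},
		Errors:   []ExtractionError{{Field: "content", Message: "no match", URL: "https://example.com/page"}},
		FinalURL: "https://example.com/page",
	}
	for _, line := range summarize(res) {
		fmt.Println(line)
	}
}
```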

type Extractor

type Extractor interface {
	Extract(url string) (*ExtractionResult, error)
	ExtractWithoutCache(url string) (*ExtractionResult, error)
}
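
Because Extractor is an interface, code that depends on it can be exercised without network access by supplying a stub. The sketch below is one possible way to do so; stubExtractor and finalURLs are hypothetical, and ExtractionResult is trimmed to a single field for brevity. How the real implementations differ between Extract and ExtractWithoutCache is not described here, so the stub simply treats them the same.

```go
package main

import (
	"errors"
	"fmt"
)

// ExtractionResult is a trimmed local copy of the documented type.
type ExtractionResult struct {
	FinalURL string
}

// Extractor is a local copy of the documented interface.
type Extractor interface {
	Extract(url string) (*ExtractionResult, error)
	ExtractWithoutCache(url string) (*ExtractionResult, error)
}

// stubExtractor is a hypothetical in-memory implementation for tests.
type stubExtractor struct {
	calls int
}

// Extract delegates to ExtractWithoutCache; the stub has no cache.
func (s *stubExtractor) Extract(url string) (*ExtractionResult, error) {
	return s.ExtractWithoutCache(url)
}

// ExtractWithoutCache echoes the URL back as FinalURL and counts calls.
func (s *stubExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error) {
	if url == "" {
		return nil, errors.New("empty url")
	}
	s.calls++
	return &ExtractionResult{FinalURL: url}, nil
}

// finalURLs runs e over urls and collects FinalURL from each result,
// stopping at the first error.
func finalURLs(e Extractor, urls []string) ([]string, error) {
	out := make([]string, 0, len(urls))
	for _, u := range urls {
		res, err := e.Extract(u)
		if err != nil {
			return out, fmt.Errorf("extract %q: %w", u, err)
		}
		out = append(out, res.FinalURL)
	}
	return out, nil
}

func main() {
	stub := &stubExtractor{}
	urls, err := finalURLs(stub, []string{"https://example.com/a", "https://example.com/b"})
	fmt.Println(urls, err, stub.calls)
}
```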

func NewExtractor

func NewExtractor(config ExtractorConfig) Extractor

type ExtractorConfig

type ExtractorConfig struct {
	Name       string   `json:"name"`
	Pattern    string   `json:"pattern"`
	ExampleURL string   `json:"example_url"`
	Mode       string   `json:"mode"`
	Schemas    []Schema `json:"schemas"`
}

type Field

type Field struct {
	Name      string  `json:"name"`
	From      string  `json:"from"`
	Selector  string  `json:"selector"`
	Pattern   string  `json:"pattern"`
	Type      string  `json:"type"`
	Attribute string  `json:"attribute,omitempty"`
	Fields    []Field `json:"fields,omitempty"`
}

type Schema

type Schema struct {
	Name       string  `json:"name"`
	EntityType string  `json:"entity_type"`
	Selector   string  `json:"selector"`
	Type       string  `json:"type"`
	Fields     []Field `json:"fields,omitempty"`
}

type SchemaInfo

type SchemaInfo struct {
	Name       string `json:"name"`
	EntityType string `json:"entity_type"`
}

type SchemaResult

type SchemaResult struct {
	Schema SchemaInfo
	Items  []ExtractedItem
}

type StaticExtractor

type StaticExtractor struct {
	Config ExtractorConfig
}

func NewStaticExtractor

func NewStaticExtractor(config ExtractorConfig) *StaticExtractor

func (*StaticExtractor) Extract

func (e *StaticExtractor) Extract(url string) (*ExtractionResult, error)

func (*StaticExtractor) ExtractWithoutCache

func (e *StaticExtractor) ExtractWithoutCache(url string) (*ExtractionResult, error)

Directories

cmd/rabbitcrawler (module)
cmd/rabbitextract (module)