# GoScrapy: Web Scraping Framework in Go
GoScrapy aims to be a powerful open-source web scraping framework written in Go and inspired by Python's Scrapy framework. It offers a user-friendly interface for extracting data from websites, making it an ideal tool for a variety of data collection and analysis tasks.
## Getting Started
Follow these steps to start using GoScrapy:
### 1. Project Initialization
To initialize a project, open your terminal and run the following command:
```sh
go mod init <project_name>
```
Replace `<project_name>` with your desired project name. For example:
```sh
go mod init scrape_this_site
```
### 2. Installation
After initialization, install the GoScrapy CLI Tool using this command:
```sh
go install github.com/tech-engine/goscrapy@latest
```
Note: You will only need to run the above command the very first time.
### 3. Verify Installation
To verify your installation, check your GoScrapy version using the following command:
```sh
goscrapy -v
```
### 4. Create a New Project
Create a new GoScrapy project using the following command:
```sh
goscrapy startproject <project_name>
```
Replace `<project_name>` with your project name. For example:
```sh
goscrapy startproject scrapethissite
```
This command creates a new project directory with all the necessary files to begin working with GoScrapy.
```text
PS D:\My-Projects\go\go-test-scrapy> goscrapy startproject scrapethissite

🚀 GoScrapy generating project files. Please wait!

✔️ scrapethissite\constants.go
✔️ scrapethissite\core.go
✔️ scrapethissite\errors.go
✔️ scrapethissite\job.go
✔️ scrapethissite\output.go
✔️ scrapethissite\spider.go
✔️ scrapethissite\types.go

✨ Congrates. scrapethissite created successfully.
```
## Usage
### Defining a Scraping Task
GoScrapy is built around the following three concepts:
- Job: Describes an input to your spider.
- Output: Represents an output produced by your spider.
- Spider: Contains the main logic of your scraper.
#### Job
A Job represents an input to a GoScrapy spider. In the generated `job.go`, the `Job` struct defines fields such as `id` and `query`. Only the `id` field is compulsory; you can add custom fields to the `Job` struct as required.
```go
// The id field is compulsory in a Job definition. You can add your custom fields to Job.
type Job struct {
	id    string
	query string // your custom field
}

// Add your custom receiver functions below.
func (j *Job) SetQuery(query string) {
	j.query = query
}
```
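For illustration, a small helper inside the generated package could construct and configure a Job like this (a sketch only; the helper name is hypothetical, and how GoScrapy itself expects Jobs to be created may differ):
```go
// Hypothetical helper in the scrapethissite package: builds a Job with a custom query.
// This works here because the unexported id field is accessible within the same package.
func newSearchJob(id, query string) *Job {
	j := &Job{id: id}
	j.SetQuery(query)
	return j
}
```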
#### Output
An Output represents the output produced by your spider (via yield). It encapsulates the records obtained from scraping, any error that occurred, and a reference to the associated Job. The `Output` struct, defined in `output.go`, provides methods to retrieve the records, error information, and other details.
```go
// do not modify this file
type Output struct {
	records []Record
	err     error
	job     *Job
}
```
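The accessor methods live in the same file. As a rough sketch (the method names here are assumptions; check the generated `output.go` for the actual ones), they simply expose the unexported fields:
```go
// Assumed accessor shapes; the generated output.go defines the real ones.
func (o *Output) Records() []Record { return o.records }
func (o *Output) Error() error      { return o.err }
func (o *Output) Job() *Job         { return o.job }
```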
#### Spider
A Spider encapsulates the main logic of your scraper: making requests, parsing responses, and extracting data.
## Example
This example illustrates how to use the GoScrapy framework to scrape data from https://www.scrapethissite.com. It covers the following files:
### spider.go
Define the spider responsible for the scraping logic in your `spider.go` file. The following code snippet sets up the spider:
```go
package scrapethissite

import (
	"context"
	"errors"
	"net/url"

	"github.com/tech-engine/goscrapy/pkg/core"
)

func NewSpider() (*Spider, error) {
	return &Spider{}, nil
}

func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// For each request we must call NewRequest() and never reuse it.
	req := s.NewRequest()

	var headers map[string]string

	// GET is the default request method.
	req.SetUrl("<URL>").
		SetMetaData("JOB", job).
		SetHeaders(headers)

	/* POST
	req.SetUrl(s.baseUrl.String()).
		SetMethod("POST").
		SetMetaData("JOB", job).
		SetHeaders(headers).
		SetBody(<BODY_HERE>)
	*/

	// Call the next parse method.
	s.Request(ctx, req, s.parse)
}

func (s *Spider) parse(ctx context.Context, response core.ResponseReader) {
	// response.Body()
	// response.StatusCode()
	// response.Headers()

	// Check output.go for the fields.
	// s.yield(output)
}
```
The NewSpider function returns a new spider instance.
### types.go
In your `types.go` file, define the `Record` struct that corresponds to the records you're scraping. Here's what the `Record` type looks like:
```go
/*
json and csv struct field tags are required if you want the Record to be exported
or processed by the built-in pipelines.
*/
type Record struct {
	Title string `json:"title" csv:"title"`
}
```
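Independent of GoScrapy itself, here is one way you might fill `Record` values from a scraped page using the goquery library (a sketch assuming you can obtain the response body as an `io.Reader`; the CSS selector is only an example and must match the page you are scraping):
```go
package scrapethissite

import (
	"io"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// extractRecords is an illustrative helper (not part of GoScrapy) that parses an
// HTML body and returns one Record per matched element.
func extractRecords(body io.Reader) ([]Record, error) {
	doc, err := goquery.NewDocumentFromReader(body)
	if err != nil {
		return nil, err
	}

	var records []Record
	// ".team .name" matches team names on scrapethissite.com's hockey page;
	// adjust the selector to the page you are scraping.
	doc.Find(".team .name").Each(func(_ int, sel *goquery.Selection) {
		records = append(records, Record{Title: strings.TrimSpace(sel.Text())})
	})
	return records, nil
}
```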
### main.go
In your `main.go` file, set up and execute your spider using the GoScrapy framework.
For implementation details, you can refer to the sample code here.
## Pipelines
In the GoScrapy framework, pipelines play a pivotal role in managing, transforming, and fine-tuning the scraped data to meet your project's specific needs. Pipelines provide a mechanism for running a sequence of processing steps on the scraped data.
### Built-in Pipelines
GoScrapy currently offers a few built-in pipelines to choose from, designed to facilitate different aspects of data manipulation and organization:
- Export2CSV
- Export2JSON
- Export2GSHEET
- Export2MONGODB
### Incorporating Pipelines into Your Scraping Workflow
To integrate pipelines into your scraping workflow, use the `Pipelines.Add` method.
Here is an example of how you can add a pipeline to your scraping process.
Export to JSON pipeline:
```go
// goScrapy instance
goScrapy.Pipelines.Add(pipelines.Export2JSON[*customProject.Job, []customProject.Record]())
```
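Assuming the other built-in pipelines follow the same generic signature, exporting to CSV would look similar (unverified; check the pipelines package for the exact call):
```go
// Assumed to mirror the Export2JSON call above.
goScrapy.Pipelines.Add(pipelines.Export2CSV[*customProject.Job, []customProject.Record]())
```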
### Incorporating Custom Pipelines
GoScrapy also allows you to define your own custom pipelines. To create a custom pipeline, cd into your GoScrapy project directory and run the command below:
```text
PS D:\My-Projects\go\go-test-scrapy\scrapethissite> goscrapy pipeline export_2_DB

✔️ pipelines\export_2_DB.go

✨ Congrates, export_2_DB created successfully.
```
## Middlewares
GoScrapy also supports built-in and custom middlewares for manipulating outgoing requests.
### Built-in Middlewares
- MultiCookieJarMiddleware - used for maintaining different cookie sessions while scraping.
### Custom Middlewares
Implementing your own middleware is fairly easy in GoScrapy. A custom middleware is a function that wraps the next `http.RoundTripper`, following the signature below:
```go
func CustomMiddleware(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		// your custom middleware code here
		return next.RoundTrip(req)
	})
}
```
### Incorporating Middlewares into Your Scraping Workflow
To integrate middlewares into your scraping workflow, use the `AddMiddlewares` method, a variadic function that accepts an arbitrary number of middlewares.
Here is an example of how you can add middlewares to your scraping process.
MultiCookieJarMiddleware:
```go
// goScrapy instance
goScrapy.AddMiddlewares(
	middlewares.MultiCookieJarMiddleware,
	...
)
```
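Putting the two together, a custom middleware that sets a User-Agent header on every outgoing request might look like this (a sketch based on the signature above; the behavior of `core.MiddlewareFunc` is assumed from that example, and where you place the file is up to your project layout):
```go
package scrapethissite

import (
	"net/http"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// UserAgentMiddleware sets a User-Agent header on every outgoing request and
// then delegates to the next RoundTripper in the chain.
func UserAgentMiddleware(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		req.Header.Set("User-Agent", "goscrapy-example/0.1")
		return next.RoundTrip(req)
	})
}
```
It could then be registered alongside the built-ins, e.g. `goScrapy.AddMiddlewares(middlewares.MultiCookieJarMiddleware, UserAgentMiddleware)`.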
## Note
GoScrapy is still in its early stages of development and thus lacks many features such as HTML parsing, cookie management, etc. More work is under way. Thank you for your patience.
## Roadmap
- Cookie management
- Built-in & custom middlewares support
- HTML element selectors
- Triggers
- Unit tests