fasttld

v0.4.4 Published: Mar 18, 2023 License: BSD-3-Clause Imports: 18 Imported by: 6

Repository

github.com/elliotwutingfeng/go-fasttld

README ¶

go-fasttld

Summary

go-fasttld is a high-performance effective top-level domain (eTLD) extraction module that extracts subcomponents from URLs.

URLs can contain hostnames, IPv4 addresses, or IPv6 addresses. eTLD extraction is based on the Mozilla Public Suffix List. Private domains listed in the Mozilla Public Suffix List, such as 'blogspot.co.uk' and 'sinaapp.com', are also supported.

Spot any bugs? Report them here

Installation

go get github.com/elliotwutingfeng/go-fasttld

Try the CLI

First, build the CLI application.

# `git clone` and `cd` to the go-fasttld repository folder first
make build_cli

Afterwards, try extracting subcomponents from a URL.

# `git clone` and `cd` to the go-fasttld repository folder first
./dist/fasttld extract https://user@a.subdomain.example.a%63.uk:5000/a/b\?id\=42

Try the example code

All of the following examples can be found at examples/demo.go. To run the demo, use the following command:

# `git clone` and `cd` to the go-fasttld repository folder first
make demo

Hostname

// Initialise fasttld extractor
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Extract URL subcomponents
url := "https://user@a.subdomain.example.a%63.uk:5000/a/b?id=42"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

// Display results
fasttld.PrintRes(url, res) // Pretty-prints res.Scheme, res.UserInfo, res.SubDomain etc.

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://	user	a.subdomain	example	a%63.uk	example.a%63.uk	5000	/a/b?id=42	hostname

IPv4 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://127.0.0.1:5000"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://			127.0.0.1		127.0.0.1	5000		ipv4 address

IPv6 Address

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]:5000"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://			aBcD:ef01:2345:6789:aBcD:ef01:2345:6789		aBcD:ef01:2345:6789:aBcD:ef01:2345:6789	5000		ipv6 address
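
The HostType column in the tables above distinguishes three kinds of host. The following is a minimal sketch of such a classification using only the standard library; `classifyHost` is a hypothetical helper written for illustration and is not part of go-fasttld, which may accept or reject inputs differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// classifyHost is a hypothetical helper (not part of go-fasttld).
// It labels a bare host the same way the HostType column does.
func classifyHost(host string) string {
	// IPv6 literals are wrapped in square brackets inside URLs.
	trimmed := strings.TrimSuffix(strings.TrimPrefix(host, "["), "]")
	addr, err := netip.ParseAddr(trimmed)
	switch {
	case err != nil:
		// Anything that is not an IP literal is treated as a hostname here.
		return "hostname"
	case addr.Is4():
		return "ipv4 address"
	default:
		return "ipv6 address"
	}
}

func main() {
	hosts := []string{
		"example.com",
		"127.0.0.1",
		"[aBcD:ef01:2345:6789:aBcD:ef01:2345:6789]",
	}
	for _, h := range hosts {
		fmt.Printf("%s -> %s\n", h, classifyHost(h))
	}
}
```

The sketch does no hostname validation of its own; it only separates IP literals from everything else.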

Internationalised label separators

go-fasttld supports the following internationalised label separators (IETF RFC 3490):

Full Stop	Ideographic Full Stop	Fullwidth Full Stop	Halfwidth Ideographic Full Stop
U+002E `.`	U+3002 `。`	U+FF0E `．`	U+FF61 `｡`

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://		brb\u002ei\u3002am\uff0egoing\uff61to	be	a\uff61fk	be\u3002a\uff61fk			hostname
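
To illustrate what treating all four characters as label separators means, here is a small standard-library sketch. `isLabelSeparator` and `splitLabels` are hypothetical names for this example only; go-fasttld's internal handling may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// isLabelSeparator reports whether r is one of the four dots
// listed in the table above.
func isLabelSeparator(r rune) bool {
	switch r {
	case '\u002e', '\u3002', '\uff0e', '\uff61':
		return true
	}
	return false
}

// splitLabels is a hypothetical helper that splits a host into labels
// on any of the four separators. Empty labels are dropped.
func splitLabels(host string) []string {
	return strings.FieldsFunc(host, isLabelSeparator)
}

func main() {
	host := "brb\u002ei\u3002am\uff0egoing\uff61to\uff0ebe\u3002a\uff61fk"
	fmt.Println(splitLabels(host)) // [brb i am going to be a fk]
}
```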

Public Suffix List options

Specify custom public suffix list file

You can use a custom public suffix list file by setting CacheFilePath in fasttld.SuffixListParams{} to its absolute path.

cacheFilePath := "/absolute/path/to/file.dat"
extractor, err := fasttld.New(fasttld.SuffixListParams{CacheFilePath: cacheFilePath})

Updating the default Public Suffix List cache

Whenever fasttld.New is called without specifying CacheFilePath in fasttld.SuffixListParams{}, the local cache of the default Public Suffix List is updated automatically if it is more than 3 days old. You can also manually update the cache by using Update().

// Automatic update performed if `CacheFilePath` is not specified
// and local cache is more than 3 days old
extractor, _ := fasttld.New(fasttld.SuffixListParams{})

// Manually update local cache
if err := extractor.Update(); err != nil {
    log.Println(err)
}
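
The 3-day rule can be pictured as a simple age check on the cache file. The sketch below assumes the age is judged from the file's modification time, which this README does not state; `isStale`, `cacheNeedsUpdate`, and the example path are hypothetical and not part of go-fasttld.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

const (
	day         = 24 * time.Hour
	maxCacheAge = 3 * day // threshold taken from the paragraph above
)

// referenceTime is a fixed instant used to keep the demo deterministic.
var referenceTime = time.Date(2023, time.March, 18, 0, 0, 0, 0, time.UTC)

// isStale reports whether a cache last modified at modTime is more than
// maxCacheAge old at time now.
func isStale(modTime, now time.Time) bool {
	return now.Sub(modTime) > maxCacheAge
}

// cacheNeedsUpdate is a hypothetical helper. A file that cannot be
// inspected (for example, because it is missing) is treated as stale.
func cacheNeedsUpdate(path string, now time.Time) bool {
	info, err := os.Stat(path)
	if err != nil {
		return true
	}
	return isStale(info.ModTime(), now)
}

func main() {
	fmt.Println(isStale(referenceTime, referenceTime.Add(2*day))) // false
	fmt.Println(isStale(referenceTime, referenceTime.Add(4*day))) // true
	fmt.Println(cacheNeedsUpdate("/nonexistent/public_suffix_list.dat", referenceTime))
}
```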

Private domains

According to the Mozilla.org wiki, the Mozilla Public Suffix List contains private domains like blogspot.com and sinaapp.com.

By default, these private domains are excluded (i.e. IncludePrivateSuffix = false).

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://google.blogspot.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://		google	blogspot	com	blogspot.com			hostname

You can include private domains by setting IncludePrivateSuffix = true.

extractor, _ := fasttld.New(fasttld.SuffixListParams{IncludePrivateSuffix: true})
url := "https://google.blogspot.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://			google	blogspot.com	google.blogspot.com			hostname
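
The Public Suffix List file marks its private section with `// ===BEGIN PRIVATE DOMAINS===`, so excluding private suffixes can be as simple as stopping at that marker. The sketch below shows the idea on a tiny made-up excerpt; `readSuffixes` is a hypothetical function, it ignores wildcard (`*.`) and exception (`!`) rules, and it is not go-fasttld's actual loader.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sample is a made-up excerpt that mimics the layout of the
// Public Suffix List file.
const sample = `// ===BEGIN ICANN DOMAINS===
com
co.uk
// ===END ICANN DOMAINS===
// ===BEGIN PRIVATE DOMAINS===
blogspot.com
sinaapp.com
// ===END PRIVATE DOMAINS===
`

const privateMarker = "// ===BEGIN PRIVATE DOMAINS==="

// readSuffixes returns the suffixes in list. When includePrivate is
// false, reading stops at the start of the private section.
func readSuffixes(list string, includePrivate bool) []string {
	var suffixes []string
	scanner := bufio.NewScanner(strings.NewReader(list))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if !includePrivate && strings.HasPrefix(line, privateMarker) {
			break
		}
		if line == "" || strings.HasPrefix(line, "//") {
			continue // skip blank lines and comments
		}
		suffixes = append(suffixes, line)
	}
	return suffixes
}

func main() {
	fmt.Println(readSuffixes(sample, false)) // [com co.uk]
	fmt.Println(readSuffixes(sample, true))  // [com co.uk blogspot.com sinaapp.com]
}
```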

Extraction options

Ignore Subdomains

You can ignore subdomains by setting IgnoreSubDomains = true. By default, subdomains are extracted.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://maps.google.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url, IgnoreSubDomains: true})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://			google	com	google.com			hostname

Punycode

By default, internationalised URLs are not converted to punycode before extraction.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://hello.世界.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://		hello	世界	com	世界.com			hostname

You can convert internationalised URLs to punycode before extraction by setting ConvertURLToPunyCode = true.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://hello.世界.com"
res, _ := extractor.Extract(fasttld.URLParams{URL: url, ConvertURLToPunyCode: true})

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://		hello	xn--rhqv96g	com	xn--rhqv96g.com			hostname
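
For readers curious how 世界 becomes xn--rhqv96g, the following is a compact sketch of the Punycode encoder described in RFC 3492, applied label by label. It is written for illustration only: it skips the normalisation and validation steps of full IDNA processing, has no overflow checks, and is not the code go-fasttld uses. `encodeLabel` and `toPunycodeHost` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Parameters from RFC 3492, section 5.
const (
	base, tMin, tMax     = 36, 1, 26
	skew, damp           = 38, 700
	initialBias, initialN = 72, 128
)

// digit maps 0..25 to 'a'..'z' and 26..35 to '0'..'9'.
func digit(d int) byte {
	if d < 26 {
		return byte('a' + d)
	}
	return byte('0' + d - 26)
}

// adapt is the bias adaptation function from RFC 3492, section 6.1.
func adapt(delta, numPoints int, first bool) int {
	if first {
		delta /= damp
	} else {
		delta /= 2
	}
	delta += delta / numPoints
	k := 0
	for delta > ((base-tMin)*tMax)/2 {
		delta /= base - tMin
		k += base
	}
	return k + (base-tMin+1)*delta/(delta+skew)
}

// encodeLabel returns the "xn--" form of a label that contains
// non-ASCII runes; ASCII-only labels are returned unchanged.
func encodeLabel(label string) string {
	runes := []rune(label)
	var out []byte
	for _, r := range runes {
		if r < initialN {
			out = append(out, byte(r)) // copy basic code points first
		}
	}
	basic := len(out)
	if basic == len(runes) {
		return label
	}
	if basic > 0 {
		out = append(out, '-')
	}
	n, delta, bias := initialN, 0, initialBias
	for handled := basic; handled < len(runes); {
		m := 0x110000 // one past the largest code point
		for _, r := range runes {
			if int(r) >= n && int(r) < m {
				m = int(r) // smallest code point not yet handled
			}
		}
		delta += (m - n) * (handled + 1)
		n = m
		for _, r := range runes {
			if int(r) < n {
				delta++
			}
			if int(r) != n {
				continue
			}
			q := delta
			for k := base; ; k += base {
				t := k - bias
				if t < tMin {
					t = tMin
				} else if t > tMax {
					t = tMax
				}
				if q < t {
					break
				}
				out = append(out, digit(t+(q-t)%(base-t)))
				q = (q - t) / (base - t)
			}
			out = append(out, digit(q))
			bias = adapt(delta, handled+1, handled == basic)
			delta = 0
			handled++
		}
		delta++
		n++
	}
	return "xn--" + string(out)
}

// toPunycodeHost encodes every "."-separated label of host.
func toPunycodeHost(host string) string {
	labels := strings.Split(host, ".")
	for i, l := range labels {
		labels[i] = encodeLabel(l)
	}
	return strings.Join(labels, ".")
}

func main() {
	fmt.Println(toPunycodeHost("hello.世界.com")) // hello.xn--rhqv96g.com
}
```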

Parsing errors

If the URL is invalid, the second value returned by Extract(), error, will be non-nil. Partially extracted subcomponents can still be retrieved from the first value returned, ExtractResult.

extractor, _ := fasttld.New(fasttld.SuffixListParams{})
url := "https://example!.com" // invalid characters in hostname
color.New().Println("The following line should be an error message")
// Declare res outside the if statement so it is still in scope for PrintRes
res, err := extractor.Extract(fasttld.URLParams{URL: url})
if err != nil {
    color.New(color.FgHiRed, color.Bold).Print("Error: ")
    color.New(color.FgHiWhite).Println(err)
}
fasttld.PrintRes(url, res) // Partially extracted subcomponents can still be retrieved

Scheme	UserInfo	SubDomain	Domain	Suffix	RegisteredDomain	Port	Path	HostType
https://
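
The pattern of returning a partially filled result together with an error can be sketched as follows. This is a simplified illustration under stated assumptions: `partialResult`, `parse`, and the character rules in `validHostRune` are hypothetical and far cruder than go-fasttld's actual validation.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// partialResult is a hypothetical, trimmed-down stand-in for ExtractResult.
type partialResult struct {
	Scheme string
	Host   string
}

var errInvalidHost = errors.New("invalid characters in hostname")

// validHostRune accepts ASCII letters, digits, '-', '.', and any
// non-ASCII rune. This rule is an assumption made for the sketch.
func validHostRune(r rune) bool {
	switch {
	case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9':
		return true
	case r == '-', r == '.', r >= 0x80:
		return true
	}
	return false
}

// parse fills in fields as it goes, so whatever was extracted before a
// failure is still returned alongside the error.
func parse(url string) (partialResult, error) {
	var res partialResult
	rest := url
	if i := strings.Index(url, "://"); i >= 0 {
		res.Scheme, rest = url[:i+3], url[i+3:]
	}
	if i := strings.IndexByte(rest, '/'); i >= 0 {
		rest = rest[:i] // drop the path
	}
	for _, r := range rest {
		if !validHostRune(r) {
			return res, errInvalidHost // Scheme is already set
		}
	}
	res.Host = rest
	return res, nil
}

func main() {
	res, err := parse("https://example!.com")
	fmt.Printf("scheme=%q host=%q err=%v\n", res.Scheme, res.Host, err)
}
```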

Testing

# `git clone` and `cd` to the go-fasttld repository folder first
make tests

# Alternatively, run tests without race detection
# Useful for systems that do not support the -race flag like windows/386
# See https://tip.golang.org/src/cmd/dist/test.go
make tests_without_race

Benchmarks

# `git clone` and `cd` to the go-fasttld repository folder first
make bench

Modules used

Benchmark Name	Source
GoFastTld	go-fasttld (this module)
JPilloraGoTld	github.com/jpillora/go-tld
JoeGuoTldExtract	github.com/joeguo/tldextract
Mjd2021USATldExtract	github.com/mjd2021usa/tldextract

Results

Benchmarks performed on AMD Ryzen 7 5800X, Manjaro Linux.

go-fasttld performs especially well on longer URLs.

#1

https://iupac.org/iupac-announces-the-2021-top-ten-emerging-technologies-in-chemistry/

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	8037906	150.8 ns/op	0 B/op	0 allocs/op	✔
JPilloraGoTld	1675113	716.1 ns/op	224 B/op	2 allocs/op
JoeGuoTldExtract	2204854	515.1 ns/op	272 B/op	5 allocs/op
Mjd2021USATldExtract	1676722	712.0 ns/op	288 B/op	6 allocs/op

#2

https://www.google.com/maps/dir/Parliament+Place,+Parliament+House+Of+Singapore,+Singapore/Parliament+St,+London,+UK/@25.2440033,33.6721455,4z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x31da19a0abd4d71d:0xeda26636dc4ea1dc!2m2!1d103.8504863!2d1.2891543!1m5!1m1!1s0x487604c5aaa7da5b:0xf13a2197d7e7dd26!2m2!1d-0.1260826!2d51.5017061!3e4

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	6381516	181.9 ns/op	0 B/op	0 allocs/op	✔
JPilloraGoTld	431671	2603 ns/op	928 B/op	4 allocs/op
JoeGuoTldExtract	893347	1176 ns/op	1120 B/op	6 allocs/op
Mjd2021USATldExtract	1030250	1165 ns/op	1120 B/op	6 allocs/op

#3

https://a.b.c.d.e.f.g.h.i.j.k.l.m.n.oo.pp.qqq.rrrr.ssssss.tttttttt.uuuuuuuuuuu.vvvvvvvvvvvvvvv.wwwwwwwwwwwwwwwwwwwwww.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy.zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.cc

Benchmark Name	Iterations	ns/op	B/op	allocs/op	Fastest
GoFastTld	833682	1424 ns/op	0 B/op	0 allocs/op	✔
JPilloraGoTld	734790	1640 ns/op	304 B/op	3 allocs/op
JoeGuoTldExtract	695475	1452 ns/op	1040 B/op	5 allocs/op
Mjd2021USATldExtract	330717	3628 ns/op	1904 B/op	8 allocs/op

Implementation details

Why not split on "." and take the last element instead?

Splitting on "." and taking the last element works only for single-label eTLDs like com, not for multi-label ones like oseto.nagasaki.jp.
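
A short standard-library example makes the problem concrete. `lastLabel` is a hypothetical helper that implements the naive approach.

```go
package main

import (
	"fmt"
	"strings"
)

// lastLabel implements the naive approach: split on "." and keep the
// final element.
func lastLabel(host string) string {
	labels := strings.Split(host, ".")
	return labels[len(labels)-1]
}

func main() {
	fmt.Println(lastLabel("example.com"))               // com
	fmt.Println(lastLabel("example.oseto.nagasaki.jp")) // jp
}
```

The second call returns jp, although the eTLD for that host is oseto.nagasaki.jp.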

eTLD tries

Trie

go-fasttld stores eTLDs in compressed tries.

Valid eTLDs from the Mozilla Public Suffix List are inserted into the compressed trie label by label, in reverse order.

Given the following eTLDs
au
nsw.edu.au
com.ac
edu.ac
gov.ac

and the example URL host `example.nsw.edu.au`

The compressed trie will be structured as follows:

START
 ╠═ au 🚩 ✅
 ║  ╚═ edu ✅
 ║     ╚═ nsw 🚩 ✅
 ╚═ ac
    ╠═ com 🚩
    ╠═ edu 🚩
    ╚═ gov 🚩

=== Symbol meanings ===
🚩 : path to this node is a valid eTLD
✅ : path to this node found in example URL host `example.nsw.edu.au`

The URL host subcomponents are parsed from right to left until no more matching nodes can be found. In this example, the path of matching nodes is au -> edu -> nsw. Reversing the nodes gives the extracted eTLD nsw.edu.au.
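
The lookup described above can be sketched with a plain, uncompressed label trie. This is a simplified illustration, not go-fasttld's actual data structure; `node`, `insert`, `longestSuffix`, and `buildExampleTrie` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// node is one label in the trie. isSuffix corresponds to the flag in
// the diagram above: the path to this node is a valid eTLD.
type node struct {
	children map[string]*node
	isSuffix bool
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert adds suffix to the trie with its labels in reverse order,
// so "nsw.edu.au" is stored as au -> edu -> nsw.
func (root *node) insert(suffix string) {
	labels := strings.Split(suffix, ".")
	cur := root
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			next = newNode()
			cur.children[labels[i]] = next
		}
		cur = next
	}
	cur.isSuffix = true
}

// longestSuffix walks the host labels from right to left and returns
// the longest flagged path, or "" if no eTLD matches.
func (root *node) longestSuffix(host string) string {
	labels := strings.Split(host, ".")
	cur, best := root, len(labels)
	for i := len(labels) - 1; i >= 0; i-- {
		next, ok := cur.children[labels[i]]
		if !ok {
			break // no more matching nodes
		}
		cur = next
		if cur.isSuffix {
			best = i
		}
	}
	return strings.Join(labels[best:], ".")
}

// buildExampleTrie loads the five eTLDs used in the example above.
func buildExampleTrie() *node {
	root := newNode()
	for _, s := range []string{"au", "nsw.edu.au", "com.ac", "edu.ac", "gov.ac"} {
		root.insert(s)
	}
	return root
}

func main() {
	root := buildExampleTrie()
	fmt.Println(root.longestSuffix("example.nsw.edu.au")) // nsw.edu.au
	fmt.Println(root.longestSuffix("example.edu.au"))     // au
}
```

In the second lookup, edu matches a node but is not flagged, so the result falls back to the last flagged node, au.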

Acknowledgements

This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs, IPv4 addresses, and IPv6 addresses.

Documentation ¶

Overview ¶

Package fasttld is a high-performance effective top-level domain (eTLD) extraction module implemented with compressed tries.

This module is a port of the Python fasttld module, with additional modifications to support extraction of subcomponents from full URLs, IPv4 addresses, and IPv6 addresses.

Index ¶

func PrintRes(url string, res ExtractResult)
type ExtractResult
type FastTLD
- func New(n SuffixListParams) (*FastTLD, error)
- func (f *FastTLD) Extract(e URLParams) (ExtractResult, error)
- func (f *FastTLD) Update() error
type HostType
type SuffixListParams
type URLParams

Constants ¶

This section is empty.

Variables ¶

This section is empty.

Functions ¶

func PrintRes ¶ added in v0.3.0

func PrintRes(url string, res ExtractResult)

PrintRes pretty-prints URL components from ExtractResult.

Types ¶

type ExtractResult ¶

type ExtractResult struct {
	Scheme, UserInfo, SubDomain, Domain, Suffix, RegisteredDomain, Port, Path string
	HostType                                                                  HostType
}

ExtractResult contains components extracted from URL.

type FastTLD ¶ added in v0.0.2

type FastTLD struct {
	// contains filtered or unexported fields
}

FastTLD provides the Extract() function, which extracts components from URLs using the tldTrie generated from the Public Suffix List file at cacheFilePath.

func New ¶

func New(n SuffixListParams) (*FastTLD, error)

New creates a new *FastTLD using data from a Public Suffix List file.

func (*FastTLD) Extract ¶ added in v0.0.2

func (f *FastTLD) Extract(e URLParams) (ExtractResult, error)

Extract extracts components from a given URL.

func (*FastTLD) Update ¶ added in v0.0.2

func (f *FastTLD) Update() error

Update updates the default Public Suffix List file and rebuilds the suffix trie from the updated file. If the cache file path is not the same as the default cache file path, this is a no-op.

type HostType ¶ added in v0.4.0

type HostType int

HostType indicates whether the parsed URL contains a HostName, an IPv4 address, an IPv6 address, or none of them.

const (
	None HostType = iota
	HostName
	IPv4
	IPv6
)

None, HostName, IPv4 and IPv6 indicate whether the parsed URL contains a HostName, an IPv4 address, an IPv6 address, or none of them.

type SuffixListParams ¶

type SuffixListParams struct {
	CacheFilePath        string
	IncludePrivateSuffix bool
}

SuffixListParams contains parameters for specifying path to Public Suffix List file and whether to extract private suffixes (e.g. blogspot.com).

type URLParams ¶ added in v0.0.2

type URLParams struct {
	URL                  string
	IgnoreSubDomains     bool
	ConvertURLToPunyCode bool
}

URLParams specifies URL to extract components from.

If IgnoreSubDomains = true, do not extract SubDomain.

If ConvertURLToPunyCode = true, convert non-ASCII characters like 世界 to punycode.

Directories ¶

Path	Synopsis
cmd/fasttld
examples