hqgourl

Version: v0.0.0-...-0d9326f (latest) | Published: Feb 12, 2024 | License: MIT | Imports: 11 | Imported by: 5

Repository

github.com/hueristiq/hqgourl

README ¶

hqgourl

A Go (Golang) package for extracting, parsing, and manipulating URLs.

Resources

Features
Installation
Usage
Contributing
Licensing
Credits
- Contributors
- Similar Projects

Features

Flexible URL extraction from text using regular expressions.
Domain parsing into subdomains, root domains, and TLDs.
Extends standard net/url URL parsing with additional fields.

Installation

go get -v -u github.com/hueristiq/hqgourl

Usage

URL Extraction

package main

import (
    "fmt"

    "github.com/hueristiq/hqgourl"
)

func main() {
    extractor := hqgourl.NewURLExtractor()
    text := "Check out this website: https://example.com and send an email to info@example.com."
    
    regex := extractor.CompileRegex()
    matches := regex.FindAllString(text, -1)
    
    fmt.Println("Found URLs:", matches)
}

The URLExtractor allows customization of the URL extraction process through various options. For instance, you can specify whether to include URL schemes and hosts in the extraction and provide custom regex patterns for these components.

Extracting URLs with Specific Schemes
```
extractor := hqgourl.NewURLExtractor(
    hqgourl.URLExtractorWithSchemePattern(`(?:https?|ftp)://`),
)
```
This configuration extracts only URLs that start with the http, https, or ftp scheme.

Extracting URLs with Custom Host Patterns
```
extractor := hqgourl.NewURLExtractor(
    hqgourl.URLExtractorWithHostPattern(`(?:www\.)?example\.com`),
)
```
This setup extracts URLs whose host is www.example.com or example.com.

Note: Because the API is centered around regexp.Regexp, the compiled regex gives you access to many other methods, such as FindAllStringIndex and ReplaceAllString.
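
As a rough illustration of using those other regexp.Regexp methods, the sketch below redacts URLs and reports their byte offsets. It uses only the standard library: urlPattern is a deliberately simple stand-in for the regex returned by extractor.CompileRegex(), and redactURLs and urlOffsets are hypothetical helper names, not part of hqgourl.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern is a simplified stand-in (an assumption for this sketch) for the
// regex that extractor.CompileRegex() would return.
var urlPattern = regexp.MustCompile(`https?://[^\s]+`)

// redactURLs replaces every match with a placeholder.
func redactURLs(text string) string {
	return urlPattern.ReplaceAllString(text, "[link]")
}

// urlOffsets returns the [start, end) byte offsets of every match.
func urlOffsets(text string) [][]int {
	return urlPattern.FindAllStringIndex(text, -1)
}

func main() {
	text := "Docs: https://example.com/docs and http://example.org"

	fmt.Println(redactURLs(text))
	fmt.Println(urlOffsets(text))
}
```

With the real extractor, the same calls would be made on the value returned by CompileRegex().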

Domain Parsing

package main

import (
    "fmt"
    "github.com/hueristiq/hqgourl"
)

func main() {
    dp := hqgourl.NewDomainParser()

    parsedDomain := dp.Parse("subdomain.example.com")

    fmt.Printf("Subdomain: %s, Root Domain: %s, TLD: %s\n", parsedDomain.Sub, parsedDomain.Root, parsedDomain.TopLevel)
}

URL Parsing

package main

import (
    "fmt"
    "github.com/hueristiq/hqgourl"
)

func main() {
    up := hqgourl.NewURLParser()

    parsedURL, err := up.Parse("https://subdomain.example.com:8080/path/file.txt")
    if err != nil {
        fmt.Println("Error parsing URL:", err)

        return
    }

    fmt.Printf("Subdomain: %s\n", parsedURL.Domain.Sub)
    fmt.Printf("Root Domain: %s\n", parsedURL.Domain.Root)
    fmt.Printf("TLD: %s\n", parsedURL.Domain.TopLevel)
    fmt.Printf("Port: %d\n", parsedURL.Port)
    fmt.Printf("File Extension: %s\n", parsedURL.Extension)
}

Set a default scheme:

up := hqgourl.NewURLParser(hqgourl.URLParserWithDefaultScheme("https"))
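
To make the idea of a default scheme concrete, here is a minimal standard-library sketch. It is not the hqgourl implementation: parseWithDefaultScheme is a hypothetical helper, and the check for "://" is a simplifying assumption that would mishandle scheme-only forms such as mailto: addresses.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseWithDefaultScheme is a hypothetical helper: it prepends defaultScheme
// when rawURL has no "scheme://" prefix, then parses with net/url.
func parseWithDefaultScheme(rawURL, defaultScheme string) (*url.URL, error) {
	// Simplifying assumption: any URL that already has a scheme contains "://".
	if !strings.Contains(rawURL, "://") {
		rawURL = defaultScheme + "://" + rawURL
	}

	return url.Parse(rawURL)
}

func main() {
	for _, raw := range []string{"example.com/path", "http://example.com/path"} {
		u, err := parseWithDefaultScheme(raw, "https")
		if err != nil {
			fmt.Println("Error parsing URL:", err)

			continue
		}

		fmt.Println(u.Scheme, u.Host, u.Path)
	}
}
```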

Contributing

Issues and Pull Requests are welcome! Check out the contribution guidelines.

Licensing

This package is distributed under the MIT license.

Credits

Contributors

Thanks to the amazing contributors for keeping this project alive.

Similar Projects

Thanks to these similar open source projects. Check them out; one of them may fit your needs.

DomainParser ◇ urlx ◇ xurls ◇ goware's tldomains ◇ jakewarren's tldomains

Documentation ¶

Index ¶

Constants
Variables
type Domain
- func (d *Domain) String() (domain string)
type DomainInterface
type DomainParser
- func NewDomainParser(opts ...DomainParserOptionsFunc) (dp *DomainParser)
- func (dp *DomainParser) Parse(domain string) (parsedDomain *Domain)
type DomainParserInterface
type DomainParserOptionsFunc
- func DomainParserWithTLDs(TLDs ...string) DomainParserOptionsFunc
type URL
type URLExtractor
- func NewURLExtractor(opts ...URLExtractorOptionsFunc) (extractor *URLExtractor)
- func (e *URLExtractor) CompileRegex() (regex *regexp.Regexp)
type URLExtractorInterface
type URLExtractorOptionsFunc
- func URLExtractorWithHost() URLExtractorOptionsFunc
- func URLExtractorWithHostPattern(pattern string) URLExtractorOptionsFunc
- func URLExtractorWithScheme() URLExtractorOptionsFunc
- func URLExtractorWithSchemePattern(pattern string) URLExtractorOptionsFunc
type URLParser
- func NewURLParser(opts ...URLParserOptionsFunc) (up *URLParser)
- func (up *URLParser) DefaultScheme() (scheme string)
- func (up *URLParser) Parse(rawURL string) (parsedURL *URL, err error)
- func (up *URLParser) WithDefaultScheme(scheme string)
type URLParserInterface
type URLParserOptionsFunc
- func URLParserWithDefaultScheme(scheme string) URLParserOptionsFunc

Constants ¶

const (
	URLExtractorIPv4Pattern         = `` /* 206-byte string literal not displayed */
	URLExtractorNonEmptyIPv6Pattern = `(?:` +

		`(?:[0-9a-fA-F]{1,4}:){7}(?:[0-9a-fA-F]{1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){6}(?:` + URLExtractorIPv4Pattern + `|:[0-9a-fA-F]{1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){5}(?::` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,2}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){4}(?:(?::[0-9a-fA-F]{1,4}){0,1}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,3}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){3}(?:(?::[0-9a-fA-F]{1,4}){0,2}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,4}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){2}(?:(?::[0-9a-fA-F]{1,4}){0,3}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,5}|:)|` +

		`(?:[0-9a-fA-F]{1,4}:){1}(?:(?::[0-9a-fA-F]{1,4}){0,4}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,6}|:)|` +

		`:(?:(?::[0-9a-fA-F]{1,4}){0,5}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,7})` +
		`)`
	URLExtractorIPv6Pattern = `(?:` + URLExtractorNonEmptyIPv6Pattern + `|::)`

	URLExtractorPortPattern         = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`
	URLExtractorPortOptionalPattern = URLExtractorPortPattern + `?`
)
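
When reusing URLExtractorPortPattern on its own, it may help to see how its alternation groups. In the pattern as printed above, the leading colon belongs to the first alternative only, so the five-digit alternatives match without a colon. The sketch below copies the pattern verbatim and probes it with the standard regexp package; firstPortMatch is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// portPattern is copied verbatim from URLExtractorPortPattern above.
const portPattern = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`

var portRegex = regexp.MustCompile(portPattern)

// firstPortMatch returns the leftmost match of the port pattern in s,
// or "" when nothing matches.
func firstPortMatch(s string) string {
	return portRegex.FindString(s)
}

func main() {
	// ":8080"  matches the first alternative in full.
	// ":12345" matches only ":1234", because the first alternative takes at
	//          most four digits and the others cannot start at the colon.
	// "12345"  matches the second alternative, which has no colon.
	for _, s := range []string{":8080", ":12345", "12345"} {
		fmt.Printf("%q -> %q\n", s, firstPortMatch(s))
	}
}
```

Inside the full extraction regex the surrounding context may change what is ultimately captured, so treat this as a probe of the standalone constant only.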

Variables ¶

var (
	// URLExtractorSchemePattern defines a general pattern for matching URL schemes.
	// It matches any scheme that starts with alphabetical characters followed by any combination
	// of alphabets, dots, hyphens, or pluses, and ends with "://". It also matches any scheme
	// from a predefined list that does not require authority (host), ending with ":".
	URLExtractorSchemePattern = `(?:[a-zA-Z][a-zA-Z.\-+]*://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`
	// URLExtractorKnownOfficialSchemePattern defines a pattern for matching officially recognized
	// URL schemes. This includes schemes like "http", "https", "ftp", etc., and is strictly based
	// on the schemes defined in the schemes.Schemes slice, ensuring a match ends with "://".
	URLExtractorKnownOfficialSchemePattern = `(?:` + anyOf(schemes.Schemes...) + `://)`
	// URLExtractorKnownUnofficialSchemePattern defines a pattern for matching unofficial or
	// less commonly used URL schemes. Similar to the official pattern but based on the
	// schemes.SchemesUnofficial slice, it supports schemes that might not be universally recognized
	// but are valid in specific contexts, ending with "://".
	URLExtractorKnownUnofficialSchemePattern = `(?:` + anyOf(schemes.SchemesUnofficial...) + `://)`
	// URLExtractorKnownNoAuthoritySchemePattern defines a pattern for matching schemes that
	// do not require an authority (host) component. This is useful for schemes like "mailto:",
	// "tel:", and others where a host is not applicable, ending with ":".
	URLExtractorKnownNoAuthoritySchemePattern = `(?:` + anyOf(schemes.SchemesNoAuthority...) + `:)`
	// URLExtractorKnownSchemePattern combines the patterns for officially recognized,
	// unofficial, and no-authority-required schemes into one comprehensive pattern. It is
	// case-insensitive (noted by "(?i)") and designed to match a wide range of schemes, accommodating
	// the broadest possible set of URLs.
	URLExtractorKnownSchemePattern = `(?:(?i)(?:` + anyOf(schemes.Schemes...) + `|` + anyOf(schemes.SchemesUnofficial...) + `)://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`
)
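
The scheme patterns above are assembled from scheme lists with an unexported anyOf helper whose source is not shown on this page. The sketch below is one plausible way such a helper could work, using short illustrative scheme lists rather than the real schemes package; both anyOf as written here and buildSchemePattern are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// anyOf is a hypothetical re-implementation: it escapes each option and joins
// them into a non-capturing alternation group. The real helper may differ.
func anyOf(options ...string) string {
	quoted := make([]string, len(options))
	for i, option := range options {
		quoted[i] = regexp.QuoteMeta(option)
	}

	return `(?:` + strings.Join(quoted, `|`) + `)`
}

// buildSchemePattern mirrors the shape of URLExtractorKnownSchemePattern:
// schemes with an authority end in "://", schemes without one end in ":".
func buildSchemePattern(withAuthority, noAuthority []string) *regexp.Regexp {
	return regexp.MustCompile(`(?:(?i)` + anyOf(withAuthority...) + `://|` + anyOf(noAuthority...) + `:)`)
}

func main() {
	// Illustrative lists only; the real lists live in the schemes package.
	re := buildSchemePattern([]string{"http", "https", "ftp"}, []string{"mailto", "tel"})

	for _, s := range []string{"https://example.com", "HTTP://EXAMPLE.COM", "mailto:info@example.com", "example.com"} {
		fmt.Printf("%q -> %q\n", s, re.FindString(s))
	}
}
```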

Functions ¶

This section is empty.

Types ¶

type Domain ¶

type Domain struct {
	Sub      string
	Root     string
	TopLevel string
}

Domain struct represents the structure of a parsed domain name, including its subdomain, root domain, and top-level domain (TLD).

func (*Domain) String ¶

func (d *Domain) String() (domain string)

String assembles the domain components back into a full domain string.
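
A minimal sketch of such reassembly is shown below. It assumes that empty components are skipped and the rest are joined with dots; the actual String method may handle edge cases differently. domainSketch is a hypothetical stand-in for Domain.

```go
package main

import (
	"fmt"
	"strings"
)

// domainSketch is a hypothetical stand-in for hqgourl.Domain.
type domainSketch struct {
	Sub      string
	Root     string
	TopLevel string
}

// String joins the non-empty components with dots (an assumed behavior).
func (d domainSketch) String() string {
	parts := make([]string, 0, 3)

	for _, part := range []string{d.Sub, d.Root, d.TopLevel} {
		if part != "" {
			parts = append(parts, part)
		}
	}

	return strings.Join(parts, ".")
}

func main() {
	fmt.Println(domainSketch{Sub: "subdomain", Root: "example", TopLevel: "com"})
	fmt.Println(domainSketch{Root: "example", TopLevel: "com"})
}
```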

type DomainInterface ¶

type DomainInterface interface {
	String() (domain string)
}

DomainInterface defines a standard interface for any domain representation.

type DomainParser ¶

type DomainParser struct {
	// contains filtered or unexported fields
}

DomainParser encapsulates the logic for parsing full domain strings into their constituent parts: subdomains, root domains, and top-level domains (TLDs). It leverages a suffix array for efficient search and extraction of these components from a full domain string.

func NewDomainParser ¶

func NewDomainParser(opts ...DomainParserOptionsFunc) (dp *DomainParser)

NewDomainParser creates and initializes a DomainParser with a comprehensive list of TLDs, including both standard and pseudo-TLDs. This setup ensures accurate parsing across a wide range of domain names. Additional options can be applied to customize the parser further.

func (*DomainParser) Parse ¶

func (dp *DomainParser) Parse(domain string) (parsedDomain *Domain)

Parse takes a full domain string and splits it into its constituent parts: subdomain, root domain, and TLD. This method efficiently identifies the TLD using a suffix array and separates the remaining parts of the domain accordingly.
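
The following sketch shows one way a suffix array can back TLD lookup, using the standard index/suffixarray package and a tiny illustrative TLD list. It is an approximation under stated assumptions, not the library's code: tldIndex, newTLDIndex, has, and split are hypothetical names, and the longest-suffix-first strategy is assumed.

```go
package main

import (
	"fmt"
	"index/suffixarray"
	"strings"
)

// tldIndex is a hypothetical lookup structure built over a list of TLDs.
type tldIndex struct {
	index *suffixarray.Index
}

func newTLDIndex(tlds ...string) *tldIndex {
	// Wrap every TLD in NUL separators so lookups only match whole entries.
	data := "\x00" + strings.Join(tlds, "\x00") + "\x00"

	return &tldIndex{index: suffixarray.New([]byte(data))}
}

// has reports whether tld is one of the indexed entries.
func (t *tldIndex) has(tld string) bool {
	return len(t.index.Lookup([]byte("\x00"+tld+"\x00"), 1)) > 0
}

// split returns the subdomain, root domain, and TLD of domain. Candidate
// suffixes are tried longest first so that "co.uk" wins over "uk".
func (t *tldIndex) split(domain string) (sub, root, tld string) {
	labels := strings.Split(domain, ".")

	for i := 1; i < len(labels); i++ {
		candidate := strings.Join(labels[i:], ".")
		if t.has(candidate) {
			return strings.Join(labels[:i-1], "."), labels[i-1], candidate
		}
	}

	// No known TLD: treat the whole input as the root.
	return "", domain, ""
}

func main() {
	idx := newTLDIndex("com", "org", "uk", "co.uk")

	for _, d := range []string{"subdomain.example.com", "www.example.co.uk", "localhost"} {
		sub, root, tld := idx.split(d)
		fmt.Printf("%s -> sub=%q root=%q tld=%q\n", d, sub, root, tld)
	}
}
```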

type DomainParserInterface ¶

type DomainParserInterface interface {
	Parse(domain string) (parsedDomain *Domain)
	// contains filtered or unexported methods
}

DomainParserInterface defines a standard interface for any DomainParser representation.

type DomainParserOptionsFunc ¶

type DomainParserOptionsFunc func(*DomainParser)

DomainParserOptionsFunc is a function type designed for configuring a DomainParser instance. It allows for the application of customization options, such as specifying custom TLDs.

func DomainParserWithTLDs ¶

func DomainParserWithTLDs(TLDs ...string) DomainParserOptionsFunc

DomainParserWithTLDs allows for the initialization of the DomainParser with a custom set of TLDs. This is particularly useful for applications requiring parsing of non-standard or niche TLDs.
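
The options described here follow Go's functional options pattern. The sketch below shows the pattern in isolation with hypothetical names (parserSketch, parserOption, withTLDs). It assumes a custom list replaces the defaults; whether DomainParserWithTLDs replaces or extends the built-in list is not stated on this page.

```go
package main

import "fmt"

// parserSketch is a hypothetical stand-in for DomainParser.
type parserSketch struct {
	tlds []string
}

// parserOption plays the role of DomainParserOptionsFunc.
type parserOption func(*parserSketch)

// withTLDs returns an option that sets a custom TLD list.
// Assumption: the custom list replaces the defaults.
func withTLDs(tlds ...string) parserOption {
	return func(p *parserSketch) {
		p.tlds = tlds
	}
}

// newParserSketch applies defaults first, then each option in order.
func newParserSketch(opts ...parserOption) *parserSketch {
	p := &parserSketch{tlds: []string{"com", "org", "net"}}

	for _, opt := range opts {
		opt(p)
	}

	return p
}

func main() {
	fmt.Println(newParserSketch().tlds)
	fmt.Println(newParserSketch(withTLDs("internal", "lan")).tlds)
}
```

Because each option is just a function, new settings can be added later without changing the constructor's signature.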

type URL ¶

type URL struct {
	*url.URL // Embedding the standard URL struct for base functionalities.

	Domain    *Domain
	Port      int    // Port number used in the URL.
	Extension string // File extension derived from the URL path.
}

URL extends the standard net/url URL struct with additional fields. Alongside the embedded standard URL components, it exposes the parsed Domain (subdomain, root domain, and TLD), the port number, and the file extension derived from the URL path.
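
To illustrate how embedding *url.URL keeps the standard methods available while adding fields, here is a standard-library sketch. urlSketch and parseSketch are hypothetical, the Domain field is omitted, and whether the real Extension field keeps the leading dot is not confirmed here (path.Ext does keep it).

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
)

// urlSketch is a hypothetical, reduced version of the URL struct above.
type urlSketch struct {
	*url.URL // Promotes methods such as Hostname().

	Port      int    // Shadows the promoted Port() method of url.URL.
	Extension string // As returned by path.Ext, including the leading dot.
}

// parseSketch parses rawURL and fills in the extra fields.
func parseSketch(rawURL string) (*urlSketch, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}

	parsed := &urlSketch{URL: u, Extension: path.Ext(u.Path)}

	// u.Port() returns "" when no port is present; Port then stays 0.
	if p := u.Port(); p != "" {
		parsed.Port, err = strconv.Atoi(p)
		if err != nil {
			return nil, err
		}
	}

	return parsed, nil
}

func main() {
	parsed, err := parseSketch("https://subdomain.example.com:8080/path/file.txt")
	if err != nil {
		fmt.Println("Error parsing URL:", err)

		return
	}

	fmt.Println(parsed.Hostname(), parsed.Port, parsed.Extension)
}
```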

type URLExtractor ¶

type URLExtractor struct {
	// contains filtered or unexported fields
}

URLExtractor is a struct that configures the URL extraction process. It allows specifying whether to include URL schemes and hosts in the extraction and supports custom regex patterns for these components.

func NewURLExtractor ¶

func NewURLExtractor(opts ...URLExtractorOptionsFunc) (extractor *URLExtractor)

NewURLExtractor creates a new URLExtractor instance with optional configuration. It applies the provided options to the extractor, allowing for customized behavior.

func (*URLExtractor) CompileRegex ¶

func (e *URLExtractor) CompileRegex() (regex *regexp.Regexp)

CompileRegex compiles a regex pattern based on the URLExtractor configuration. It dynamically constructs a regex pattern to accurately capture URLs from text, supporting various URL formats and components. The method ensures the regex captures the longest possible match for a URL, enhancing the accuracy of the extraction process.
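
For background on "longest possible match": Go's regexp package defaults to leftmost-first matching and offers the Longest method to switch a compiled regex to leftmost-longest matching. The sketch below contrasts the two with a toy pattern. Whether CompileRegex calls Longest internally is not stated on this page, and matchBoth is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchBoth returns the first match under the default leftmost-first
// semantics and under leftmost-longest semantics.
func matchBoth(pattern, text string) (first, longest string) {
	re := regexp.MustCompile(pattern)
	first = re.FindString(text)

	// Longest makes future searches prefer the leftmost-longest match.
	re.Longest()
	longest = re.FindString(text)

	return first, longest
}

func main() {
	first, longest := matchBoth(`example\.com|example\.com/docs`, "see example.com/docs")

	fmt.Printf("leftmost-first:   %q\n", first)
	fmt.Printf("leftmost-longest: %q\n", longest)
}
```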

type URLExtractorInterface ¶

type URLExtractorInterface interface {
	CompileRegex() (regex *regexp.Regexp)
}

URLExtractorInterface defines the interface that URLExtractor implements: a single CompileRegex method that returns the compiled extraction regex.

type URLExtractorOptionsFunc ¶

type URLExtractorOptionsFunc func(*URLExtractor)

URLExtractorOptionsFunc defines a function type for configuring URLExtractor instances. This approach allows for flexible and fluent configuration of the extractor.

func URLExtractorWithHost ¶

func URLExtractorWithHost() URLExtractorOptionsFunc

URLExtractorWithHost returns an option function to include hosts in the URLs to be extracted. This can be used to ensure that only URLs with specified host components are captured.

func URLExtractorWithHostPattern ¶

func URLExtractorWithHostPattern(pattern string) URLExtractorOptionsFunc

URLExtractorWithHostPattern returns an option function to specify a custom regex pattern for matching URL hosts. This is useful for targeting specific domain names or IP address formats.

func URLExtractorWithScheme ¶

func URLExtractorWithScheme() URLExtractorOptionsFunc

URLExtractorWithScheme returns an option function to include URL schemes in the extraction process.

func URLExtractorWithSchemePattern ¶

func URLExtractorWithSchemePattern(pattern string) URLExtractorOptionsFunc

URLExtractorWithSchemePattern returns an option function to specify a custom regex pattern for matching URL schemes. This allows for fine-tuned control over which schemes are considered valid.

type URLParser ¶

type URLParser struct {
	// contains filtered or unexported fields
}

URLParser encapsulates the logic for parsing URLs with additional domain-specific information. It enhances the standard URL parsing with the extraction of subdomain, root domain, and TLD. It also handles the addition of a default scheme if one is not present in the input URL.

func NewURLParser ¶

func NewURLParser(opts ...URLParserOptionsFunc) (up *URLParser)

NewURLParser creates a new URLParser with the given options. It initializes a DomainParser for parsing domain details and applies any additional configuration options.

func (*URLParser) DefaultScheme ¶

func (up *URLParser) DefaultScheme() (scheme string)

DefaultScheme returns the currently set default scheme of the URLParser.

func (*URLParser) Parse ¶

func (up *URLParser) Parse(rawURL string) (parsedURL *URL, err error)

Parse takes a raw URL string and parses it into a URL struct. It adds domain-specific details like subdomain, root domain, and TLD to the parsed URL. The method also ensures a default scheme is set if the URL does not specify one.

func (*URLParser) WithDefaultScheme ¶

func (up *URLParser) WithDefaultScheme(scheme string)

WithDefaultScheme allows setting a default scheme for the URLParser. This default scheme is used if the input URL doesn't specify a scheme.

type URLParserInterface ¶

type URLParserInterface interface {
	WithDefaultScheme(scheme string)

	DefaultScheme() (scheme string)

	Parse(rawURL string) (parsedURL *URL, err error)
}

URLParserInterface defines the interface for URL parsing functionality.

type URLParserOptionsFunc ¶

type URLParserOptionsFunc func(*URLParser)

URLParserOptionsFunc defines a function type for configuring a URLParser.

func URLParserWithDefaultScheme ¶

func URLParserWithDefaultScheme(scheme string) URLParserOptionsFunc

URLParserWithDefaultScheme returns a URLParserOptionsFunc to set a default scheme. This is useful when parsing URLs that may not have a scheme included.

Directories ¶

Path	Synopsis
generate
schemes
tlds
unicodes
schemes
tlds
unicodes