Documentation ¶
Constants ¶
const (
	URLExtractorIPv4Pattern = `` /* 206-byte string literal not displayed */

	URLExtractorNonEmptyIPv6Pattern = `(?:` +
		`(?:[0-9a-fA-F]{1,4}:){7}(?:[0-9a-fA-F]{1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){6}(?:` + URLExtractorIPv4Pattern + `|:[0-9a-fA-F]{1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){5}(?::` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,2}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){4}(?:(?::[0-9a-fA-F]{1,4}){0,1}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,3}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){3}(?:(?::[0-9a-fA-F]{1,4}){0,2}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){2}(?:(?::[0-9a-fA-F]{1,4}){0,3}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,5}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){1}(?:(?::[0-9a-fA-F]{1,4}){0,4}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,6}|:)|` +
		`:(?:(?::[0-9a-fA-F]{1,4}){0,5}:` + URLExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,7})` +
		`)`

	URLExtractorIPv6Pattern = `(?:` + URLExtractorNonEmptyIPv6Pattern + `|::)`

	URLExtractorPortPattern         = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`
	URLExtractorPortOptionalPattern = URLExtractorPortPattern + `?`
)
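One way to see what the port pattern accepts is to anchor it and try a few inputs. The sketch below copies URLExtractorPortPattern exactly as printed above; the helper name matchesPort is hypothetical and not part of the package. As printed, the leading colon belongs only to the first alternative, so the five-digit alternatives match without it.

```go
package main

import (
	"fmt"
	"regexp"
)

// portPattern is copied verbatim from URLExtractorPortPattern as printed above.
const portPattern = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`

// portRe anchors the pattern so that the whole input has to match.
var portRe = regexp.MustCompile(`^` + portPattern + `$`)

// matchesPort (hypothetical helper) reports whether s as a whole matches
// the port pattern.
func matchesPort(s string) bool {
	return portRe.MatchString(s)
}

func main() {
	for _, s := range []string{":80", ":8080", "12345", ":12345", "70000"} {
		fmt.Printf("%-8q %v\n", s, matchesPort(s))
	}
}
```

With these inputs, ":80", ":8080" and "12345" match, while ":12345" and "70000" do not.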
Variables ¶
var (
	// URLExtractorSchemePattern defines a general pattern for matching URL schemes.
	// It matches any scheme that starts with alphabetical characters followed by any combination
	// of alphabets, dots, hyphens, or pluses, and ends with "://". It also matches any scheme
	// from a predefined list that does not require authority (host), ending with ":".
	URLExtractorSchemePattern = `(?:[a-zA-Z][a-zA-Z.\-+]*://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`

	// URLExtractorKnownOfficialSchemePattern defines a pattern for matching officially recognized
	// URL schemes. This includes schemes like "http", "https", "ftp", etc., and is strictly based
	// on the schemes defined in the schemes.Schemes slice, ensuring a match ends with "://".
	URLExtractorKnownOfficialSchemePattern = `(?:` + anyOf(schemes.Schemes...) + `://)`

	// URLExtractorKnownUnofficialSchemePattern defines a pattern for matching unofficial or
	// less commonly used URL schemes. Similar to the official pattern but based on the
	// schemes.SchemesUnofficial slice, it supports schemes that might not be universally recognized
	// but are valid in specific contexts, ending with "://".
	URLExtractorKnownUnofficialSchemePattern = `(?:` + anyOf(schemes.SchemesUnofficial...) + `://)`

	// URLExtractorKnownNoAuthoritySchemePattern defines a pattern for matching schemes that
	// do not require an authority (host) component. This is useful for schemes like "mailto:",
	// "tel:", and others where a host is not applicable, ending with ":".
	URLExtractorKnownNoAuthoritySchemePattern = `(?:` + anyOf(schemes.SchemesNoAuthority...) + `:)`

	// URLExtractorKnownSchemePattern combines the patterns for officially recognized,
	// unofficial, and no-authority-required schemes into one comprehensive pattern. It is
	// case-insensitive (noted by "(?i)") and designed to match a wide range of schemes, accommodating
	// the broadest possible set of URLs.
	URLExtractorKnownSchemePattern = `(?:(?i)(?:` + anyOf(schemes.Schemes...) + `|` + anyOf(schemes.SchemesUnofficial...) + `)://|` + anyOf(schemes.SchemesNoAuthority...) + `:)`
)
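The scheme patterns are assembled from an unexported anyOf helper and the scheme lists, neither of which is shown on this page. The sketch below is a rough approximation of URLExtractorSchemePattern under two assumptions: that anyOf builds a quoted alternation group, and that the no-authority list contains "mailto" and "tel" (the two examples named in the comments). Both the helper body and the list are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// anyOf is a hypothetical stand-in for the package's unexported helper.
// It is assumed to join the literal strings into one alternation group.
func anyOf(strs ...string) string {
	quoted := make([]string, len(strs))
	for i, s := range strs {
		quoted[i] = regexp.QuoteMeta(s)
	}
	return `(?:` + strings.Join(quoted, `|`) + `)`
}

// noAuthority is an assumed, shortened substitute for schemes.SchemesNoAuthority.
var noAuthority = []string{"mailto", "tel"}

// schemeRe mirrors the shape of URLExtractorSchemePattern, anchored at the
// start of the input.
var schemeRe = regexp.MustCompile(
	`^(?:[a-zA-Z][a-zA-Z.\-+]*://|` + anyOf(noAuthority...) + `:)`,
)

// hasScheme reports whether s begins with a scheme the sketch recognizes.
func hasScheme(s string) bool {
	return schemeRe.MatchString(s)
}

func main() {
	for _, s := range []string{
		"https://example.com",
		"git+ssh://host/repo",
		"mailto:user@example.com",
		"foo:bar",
		"example.com",
	} {
		fmt.Println(s, hasScheme(s))
	}
}
```

In this sketch an arbitrary scheme is accepted only when it is followed by "://", while a bare ":" is accepted only for schemes in the no-authority list, which is why "foo:bar" is rejected.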
Functions ¶
This section is empty.
Types ¶
type Domain ¶
Domain struct represents the structure of a parsed domain name, including its subdomain, root domain, and top-level domain (TLD).
type DomainInterface ¶
type DomainInterface interface {
String() (domain string)
}
DomainInterface defines a standard interface for any domain representation.
type DomainParser ¶
type DomainParser struct {
// contains filtered or unexported fields
}
DomainParser encapsulates the logic for parsing full domain strings into their constituent parts: subdomains, root domains, and top-level domains (TLDs). It leverages a suffix array for efficient search and extraction of these components from a full domain string.
func NewDomainParser ¶
func NewDomainParser(opts ...DomainParserOptionsFunc) (dp *DomainParser)
NewDomainParser creates and initializes a DomainParser with a comprehensive list of TLDs, including both standard and pseudo-TLDs. This setup ensures accurate parsing across a wide range of domain names. Additional options can be applied to customize the parser further.
func (*DomainParser) Parse ¶
func (dp *DomainParser) Parse(domain string) (parsedDomain *Domain)
Parse takes a full domain string and splits it into its constituent parts: subdomain, root domain, and TLD. This method efficiently identifies the TLD using a suffix array and separates the remaining parts of the domain accordingly.
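The splitting step can be illustrated without the package itself. The sketch below is a simplified stand-in: it uses a plain map lookup over candidate suffixes rather than the suffix array mentioned above, and the parts type with its Sub, Root, and TLD fields is hypothetical, since the fields of Domain are not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
)

// parts is a hypothetical result type; the real Domain fields may differ.
type parts struct {
	Sub, Root, TLD string
}

// split finds the longest known TLD that is a proper suffix of domain and
// divides the remaining labels into root domain and subdomain.
func split(domain string, tlds map[string]bool) parts {
	labels := strings.Split(domain, ".")
	// Start with the longest candidate suffix so "co.uk" wins over "uk".
	for i := 1; i < len(labels); i++ {
		suffix := strings.Join(labels[i:], ".")
		if tlds[suffix] {
			return parts{
				Sub:  strings.Join(labels[:i-1], "."),
				Root: labels[i-1],
				TLD:  suffix,
			}
		}
	}
	// No known TLD: treat the whole input as the root.
	return parts{Root: domain}
}

func main() {
	tlds := map[string]bool{"com": true, "uk": true, "co.uk": true}
	fmt.Printf("%+v\n", split("www.example.co.uk", tlds))
	fmt.Printf("%+v\n", split("example.com", tlds))
	fmt.Printf("%+v\n", split("localhost", tlds))
}
```

How the real parser handles inputs without a recognized TLD is not stated here; returning the whole string as the root is only this sketch's choice.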
type DomainParserInterface ¶
type DomainParserInterface interface {
	Parse(domain string) (parsedDomain *Domain)
	// contains filtered or unexported methods
}
DomainParserInterface defines a standard interface for any DomainParser representation.
type DomainParserOptionsFunc ¶
type DomainParserOptionsFunc func(*DomainParser)
DomainParserOptionsFunc is a function type designed for configuring a DomainParser instance. It allows for the application of customization options, such as specifying custom TLDs.
func DomainParserWithTLDs ¶
func DomainParserWithTLDs(TLDs ...string) DomainParserOptionsFunc
DomainParserWithTLDs allows for the initialization of the DomainParser with a custom set of TLDs. This is particularly useful for applications requiring parsing of non-standard or niche TLDs.
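The constructor and option types on this page follow Go's functional options pattern. The sketch below shows the pattern in isolation with hypothetical names (parser, option, withTLDs, newParser); it assumes that a TLD option replaces the default list, which this page does not confirm.

```go
package main

import "fmt"

// parser is a hypothetical stand-in for DomainParser.
type parser struct {
	tlds []string
}

// option mirrors the shape of DomainParserOptionsFunc: a function that
// mutates the value under construction.
type option func(*parser)

// withTLDs returns an option that swaps in a custom TLD list.
// Replacing (rather than extending) the defaults is an assumption.
func withTLDs(tlds ...string) option {
	return func(p *parser) {
		p.tlds = tlds
	}
}

// newParser sets defaults first, then applies each option in order.
func newParser(opts ...option) *parser {
	p := &parser{tlds: []string{"com", "org"}}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

func main() {
	fmt.Println(newParser().tlds)
	fmt.Println(newParser(withTLDs("internal", "lan")).tlds)
}
```

Because options are applied after the defaults, callers that pass no options get a usable value, and later options override earlier ones.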
type URL ¶
type URL struct {
	*url.URL // Embedding the standard URL struct for base functionalities.

	Domain *Domain

	Port      int    // Port number used in the URL.
	Extension string // File extension derived from the URL path.
}
URL embeds the standard net/url URL struct, so all standard URL components and methods remain available, and adds three fields: the parsed Domain (subdomain, root domain, and top-level domain), the numeric Port, and the file Extension derived from the URL path.
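Embedding a *url.URL and deriving extra fields can be sketched with the standard library alone. The richURL type and parseRich function below are hypothetical; in particular, using path.Ext (which keeps the leading dot) is an assumption about how an extension might be derived, not a statement about this package.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strconv"
)

// richURL is a hypothetical stand-in for the URL struct described above.
type richURL struct {
	*url.URL // promoted fields and methods: Scheme, Host, Hostname(), ...

	Port      int
	Extension string
}

// parseRich parses rawURL and fills the derived fields.
func parseRich(rawURL string) (*richURL, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}

	r := &richURL{URL: u, Extension: path.Ext(u.Path)}

	// (*url.URL).Port returns a string; leave Port at 0 when it is absent.
	if p := u.Port(); p != "" {
		if r.Port, err = strconv.Atoi(p); err != nil {
			return nil, err
		}
	}

	return r, nil
}

func main() {
	r, err := parseRich("https://example.com:8443/docs/guide.pdf")
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Scheme, r.Hostname(), r.Port, r.Extension)
}
```

Note that a field named Port on the outer struct shadows the promoted Port method of the embedded *url.URL, so the method is reached through the embedded value instead.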
type URLExtractor ¶
type URLExtractor struct {
// contains filtered or unexported fields
}
URLExtractor is a struct that configures the URL extraction process. It allows specifying whether to include URL schemes and hosts in the extraction and supports custom regex patterns for these components.
func NewURLExtractor ¶
func NewURLExtractor(opts ...URLExtractorOptionsFunc) (extractor *URLExtractor)
NewURLExtractor creates a new URLExtractor instance with optional configuration. It applies the provided options to the extractor, allowing for customized behavior.
func (*URLExtractor) CompileRegex ¶
func (e *URLExtractor) CompileRegex() (regex *regexp.Regexp)
CompileRegex compiles a regex pattern based on the URLExtractor configuration. It dynamically constructs a regex pattern to accurately capture URLs from text, supporting various URL formats and components. The method ensures the regex captures the longest possible match for a URL, enhancing the accuracy of the extraction process.
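Go's regexp package matches alternations leftmost-first by default; calling Longest on a compiled Regexp switches it to leftmost-longest. Whether CompileRegex relies on that method is not stated here, so the sketch below only illustrates the difference, using a hypothetical pattern and helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern deliberately lists the shorter alternative first.
const pattern = `https?://[a-z.]+|https?://[a-z.]+/[a-z/]+`

// find returns the first match in text, optionally preferring the
// leftmost-longest match over the leftmost-first one.
func find(text string, longest bool) string {
	re := regexp.MustCompile(pattern)
	if longest {
		re.Longest()
	}
	return re.FindString(text)
}

func main() {
	text := "see http://example.com/path now"
	fmt.Println(find(text, false)) // http://example.com
	fmt.Println(find(text, true))  // http://example.com/path
}
```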
type URLExtractorInterface ¶
URLExtractorInterface defines the interface for URLExtractor, ensuring it implements certain methods.
type URLExtractorOptionsFunc ¶
type URLExtractorOptionsFunc func(*URLExtractor)
URLExtractorOptionsFunc defines a function type for configuring URLExtractor instances. This approach allows for flexible and fluent configuration of the extractor.
func URLExtractorWithHost ¶
func URLExtractorWithHost() URLExtractorOptionsFunc
URLExtractorWithHost returns an option function that includes hosts in the URLs to be extracted. Use it to ensure that only URLs containing a host component are captured.
func URLExtractorWithHostPattern ¶
func URLExtractorWithHostPattern(pattern string) URLExtractorOptionsFunc
URLExtractorWithHostPattern returns an option function to specify a custom regex pattern for matching URL hosts. This is useful for targeting specific domain names or IP address formats.
func URLExtractorWithScheme ¶
func URLExtractorWithScheme() URLExtractorOptionsFunc
URLExtractorWithScheme returns an option function to include URL schemes in the extraction process.
func URLExtractorWithSchemePattern ¶
func URLExtractorWithSchemePattern(pattern string) URLExtractorOptionsFunc
URLExtractorWithSchemePattern returns an option function to specify a custom regex pattern for matching URL schemes. This allows for fine-tuned control over which schemes are considered valid.
type URLParser ¶
type URLParser struct {
// contains filtered or unexported fields
}
URLParser encapsulates the logic for parsing URLs with additional domain-specific information. It enhances the standard URL parsing with the extraction of subdomain, root domain, and TLD. It also handles the addition of a default scheme if one is not present in the input URL.
func NewURLParser ¶
func NewURLParser(opts ...URLParserOptionsFunc) (up *URLParser)
NewURLParser creates a new URLParser with the given options. It initializes a DomainParser for parsing domain details and applies any additional configuration options.
func (*URLParser) DefaultScheme ¶
DefaultScheme returns the currently set default scheme of the URLParser.
func (*URLParser) Parse ¶
Parse takes a raw URL string and parses it into a URL struct. It adds domain-specific details like subdomain, root domain, and TLD to the parsed URL. The method also ensures a default scheme is set if the URL does not specify one.
func (*URLParser) WithDefaultScheme ¶
WithDefaultScheme allows setting a default scheme for the URLParser. This default scheme is used if the input URL doesn't specify a scheme.
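A default scheme matters because net/url treats a scheme-less input such as "example.com/a" as a relative path and leaves Host empty. The sketch below shows one possible way to apply a default before parsing; the parseWithDefault helper and its "://" check are assumptions, not this package's actual logic.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseWithDefault (hypothetical) prepends defaultScheme when rawURL has no
// "://" separator, then defers to the standard parser.
func parseWithDefault(rawURL, defaultScheme string) (*url.URL, error) {
	if defaultScheme != "" && !strings.Contains(rawURL, "://") {
		rawURL = defaultScheme + "://" + rawURL
	}
	return url.Parse(rawURL)
}

func main() {
	for _, raw := range []string{"example.com/a", "http://example.com/a"} {
		u, err := parseWithDefault(raw, "https")
		if err != nil {
			panic(err)
		}
		fmt.Println(u.Scheme, u.Host, u.Path)
	}
}
```

An input that already carries a scheme is left untouched, so the default only fills a gap and never overrides what the caller supplied.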
type URLParserInterface ¶
type URLParserInterface interface {
	WithDefaultScheme(scheme string)
	DefaultScheme() (scheme string)
	Parse(rawURL string) (parsedURL *URL, err error)
}
URLParserInterface defines the interface for URL parsing functionality.
type URLParserOptionsFunc ¶
type URLParserOptionsFunc func(*URLParser)
URLParserOptionsFunc defines a function type for configuring a URLParser.
func URLParserWithDefaultScheme ¶
func URLParserWithDefaultScheme(scheme string) URLParserOptionsFunc
URLParserWithDefaultScheme returns a URLParserOptionsFunc to set a default scheme. This is useful when parsing URLs that may not have a scheme included.