Documentation ¶
Index ¶
- Variables
- type Domain
- type DomainExtractor
- type DomainExtractorInterface
- type DomainExtractorOptionFunc
- type DomainInterface
- type DomainParser
- type DomainParserInterface
- type DomainParserOptionFunc
- type Extractor
- type ExtractorInterface
- type ExtractorOptionFunc
- type Parser
- type ParserInterface
- type ParserOptionFunc
- type URL
Constants ¶
This section is empty.
Variables ¶
var (
	// ExtractorSchemePattern defines a general pattern for matching URL schemes.
	// It matches any URL scheme that starts with alphabetical characters (a-z, A-Z), followed by
	// any combination of alphabets, dots (.), hyphens (-), or plus signs (+), and ends with "://".
	// Additionally, it matches schemes from a predefined list that do not require an authority (host),
	// ending with just a colon (":"). These are known as "no-authority" schemes (e.g., "mailto:").
	//
	// This pattern covers a broad range of schemes, making it versatile for extracting different types
	// of URLs, whether they require an authority component or not.
	ExtractorSchemePattern = `(?:[a-zA-Z][a-zA-Z.\-+]*://|` + anyOf(schemes.NoAuthority...) + `:)`

	// ExtractorKnownOfficialSchemePattern defines a pattern for matching officially recognized
	// URL schemes. These include well-known schemes such as "http", "https", "ftp", etc., as registered
	// with IANA. The pattern ensures that the scheme is followed by "://".
	//
	// This pattern ensures that only officially recognized schemes are matched.
	ExtractorKnownOfficialSchemePattern = `(?:` + anyOf(schemes.Official...) + `://)`

	// ExtractorKnownUnofficialSchemePattern defines a pattern for matching unofficial or less commonly
	// used URL schemes. These schemes may not be registered with IANA but are still valid in specific contexts,
	// such as application-specific schemes (e.g., "slack://", "zoommtg://").
	// The pattern ensures that the scheme is followed by "://".
	//
	// This pattern is useful for applications that work with unofficial or niche schemes.
	ExtractorKnownUnofficialSchemePattern = `(?:` + anyOf(schemes.Unofficial...) + `://)`

	// ExtractorKnownNoAuthoritySchemePattern defines a pattern for matching URL schemes that
	// do not require an authority component (host). These schemes are followed by a colon (":") rather than "://".
	// Examples include "mailto:", "tel:", and "sms:".
	//
	// This pattern is used for schemes where a host is not applicable, making it suitable for schemes
	// that involve direct communication (e.g., email or telephone).
	ExtractorKnownNoAuthoritySchemePattern = `(?:` + anyOf(schemes.NoAuthority...) + `:)`

	// ExtractorKnownSchemePattern combines the patterns for officially recognized, unofficial,
	// and no-authority-required schemes into a single comprehensive pattern.
	// It is case-insensitive (denoted by "(?i)") and matches the broadest possible range of URLs.
	//
	// This pattern is suitable for extracting any known scheme, regardless of its official status
	// or whether it requires an authority component.
	ExtractorKnownSchemePattern = `(?:(?i)(?:` + anyOf(schemes.Official...) + `|` + anyOf(schemes.Unofficial...) + `)://|` + anyOf(schemes.NoAuthority...) + `:)`

	// ExtractorIPv4Pattern defines a pattern for matching valid IPv4 addresses.
	// It matches four groups of 1 to 3 digits (0-255) separated by periods (e.g., "192.168.0.1").
	//
	// This pattern is essential for extracting or validating IPv4 addresses in URLs or hostnames.
	ExtractorIPv4Pattern = `` /* 206-byte string literal not displayed */

	// ExtractorNonEmptyIPv6Pattern defines a detailed pattern for matching valid, non-empty IPv6 addresses.
	// It accounts for various valid formats of IPv6 addresses, including those with elisions ("::") and IPv4
	// address representations.
	//
	// This pattern supports matching fully expanded IPv6 addresses, elided sections, and IPv4-mapped IPv6 addresses.
	ExtractorNonEmptyIPv6Pattern = `(?:` +
		`(?:[0-9a-fA-F]{1,4}:){7}(?:[0-9a-fA-F]{1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){6}(?:` + ExtractorIPv4Pattern + `|:[0-9a-fA-F]{1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){5}(?::` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,2}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){4}(?:(?::[0-9a-fA-F]{1,4}){0,1}:` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,3}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){3}(?:(?::[0-9a-fA-F]{1,4}){0,2}:` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,4}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){2}(?:(?::[0-9a-fA-F]{1,4}){0,3}:` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,5}|:)|` +
		`(?:[0-9a-fA-F]{1,4}:){1}(?:(?::[0-9a-fA-F]{1,4}){0,4}:` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,6}|:)|` +
		`:(?:(?::[0-9a-fA-F]{1,4}){0,5}:` + ExtractorIPv4Pattern + `|(?::[0-9a-fA-F]{1,4}){1,7})` +
		`)`

	// ExtractorIPv6Pattern is a comprehensive pattern that matches both fully expanded and compressed IPv6 addresses.
	// It also handles "::" elision and optional IPv4-mapped sections.
	ExtractorIPv6Pattern = `(?:` + ExtractorNonEmptyIPv6Pattern + `|::)`

	// ExtractorPortPattern defines a pattern for matching port numbers in URLs.
	// It matches valid port numbers (1 to 65535) that are typically found in network addresses.
	// The port number is preceded by a colon (":").
	ExtractorPortPattern = `(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-5][0-9]{3}\b)`

	// ExtractorPortOptionalPattern is similar to ExtractorPortPattern but makes the port number optional.
	// This is useful for matching URLs where the port may or may not be specified.
	ExtractorPortOptionalPattern = ExtractorPortPattern + `?`
)
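To illustrate how the generic half of ExtractorSchemePattern behaves, the sketch below copies only the `[a-zA-Z][a-zA-Z.\-+]*://` alternative into a standalone program. The `^` anchor and the helper name `hasGenericScheme` are assumptions added for this example; they are not part of the package.

```go
package main

import (
	"fmt"
	"regexp"
)

// genericScheme mirrors only the first alternative of ExtractorSchemePattern.
// The leading ^ is an addition for this sketch so that matching starts at the
// beginning of the input.
var genericScheme = regexp.MustCompile(`^(?:[a-zA-Z][a-zA-Z.\-+]*://)`)

// hasGenericScheme (hypothetical helper) reports whether s starts with a
// scheme followed by "://".
func hasGenericScheme(s string) bool {
	return genericScheme.MatchString(s)
}

func main() {
	inputs := []string{
		"https://example.com",
		"git+ssh://host/repo",
		"mailto:user@example.com", // no "://", so the generic alternative does not apply
		"example.com",
	}
	for _, s := range inputs {
		fmt.Printf("%q -> %v\n", s, hasGenericScheme(s))
	}
}
```

In the package itself, inputs such as "mailto:" are covered by the second alternative, which is built from the no-authority scheme list.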
Functions ¶
This section is empty.
Types ¶
type Domain ¶
Domain represents a parsed domain name, broken down into three main components:
- Subdomain: The subdomain part of the domain (e.g., "www" in "www.example.com").
- SLD: The root domain, also known as the second-level domain (SLD), which is the core part of the domain (e.g., "example" in "www.example.com").
- TLD: The top-level domain (TLD), which is the domain suffix or extension (e.g., "com" in "www.example.com").
This struct is useful in scenarios where you need to manipulate and analyze domain names. It can be applied in tasks such as:
- Domain validation (e.g., ensuring that domains conform to expected formats).
- URL parsing (e.g., breaking down a URL into its domain components).
- Domain classification (e.g., identifying and grouping URLs by subdomain, root domain, or TLD).
By splitting a domain into its components, you can easily identify domain hierarchies, manipulate specific parts of a domain, or analyze domain names for SEO, security, or categorization purposes.
Example:
domain := Domain{
	Subdomain: "www",     // Subdomain part ("www")
	SLD:       "example", // Second-level domain part ("example")
	TLD:       "com",     // Top-level domain part ("com")
}

// Output: "www.example.com"
fmt.Println(domain.String())
func (*Domain) String ¶
String reassembles the components of the domain (Subdomain, SLD, and TLD) back into a complete domain name string. Non-empty components are joined with a dot ("."). If any component is missing, it is omitted from the final output. This method is useful for reconstructing domain names after parsing.
Example:
- If Subdomain = "www", SLD = "example", and TLD = "com", the output will be "www.example.com".
- If Subdomain is empty, the output will be "example.com".
- If both Subdomain and TLD are empty, the output will be just the SLD "example".
Returns:
- domain (string): The reconstructed domain name string.
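The joining rule described above can be sketched with the standard library alone. The `domain` type here is a local stand-in declared for the example, not the package's own implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// domain is a local stand-in for the Domain type described above.
type domain struct {
	Subdomain, SLD, TLD string
}

// String joins the non-empty components with dots and skips empty ones.
func (d *domain) String() string {
	parts := make([]string, 0, 3)
	for _, p := range []string{d.Subdomain, d.SLD, d.TLD} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, ".")
}

func main() {
	fmt.Println((&domain{Subdomain: "www", SLD: "example", TLD: "com"}).String())
	fmt.Println((&domain{SLD: "example", TLD: "com"}).String())
	fmt.Println((&domain{SLD: "example"}).String())
}
```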
type DomainExtractor ¶
type DomainExtractor struct {
	RootDomainPattern     string // Custom regex pattern for matching the root domain (e.g., "example").
	TopLevelDomainPattern string // Custom regex pattern for matching the TLD (e.g., "com").
}
DomainExtractor is responsible for extracting domain names, including both root domains and top-level domains (TLDs), using regular expressions. It provides flexibility in the domain extraction process by allowing custom patterns for both root domains and TLDs.
func NewDomainExtractor ¶
func NewDomainExtractor(opts ...DomainExtractorOptionFunc) (extractor *DomainExtractor)
NewDomainExtractor creates and initializes a DomainExtractor with optional configurations. By default, it uses pre-defined patterns for extracting root domains and TLDs, but custom patterns can be applied using the provided options.
Returns:
- extractor: A pointer to the initialized DomainExtractor.
func (*DomainExtractor) CompileRegex ¶
func (e *DomainExtractor) CompileRegex() (regex *regexp.Regexp)
CompileRegex compiles a regular expression based on the configured DomainExtractor. It builds a regex that can match domains, combining the root domain pattern with the top-level domain (TLD) pattern. The method separates ASCII and Unicode TLDs and includes a punycode pattern to handle internationalized domain names (IDNs). It also ensures that the regex captures the longest possible domain match.
Returns:
- regex: The compiled regular expression for matching domain names.
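The documentation does not say how the longest match is obtained. One mechanism the standard library offers is `(*regexp.Regexp).Longest`, which switches a compiled expression from leftmost-first to leftmost-longest matching. The sketch below assumes that mechanism and uses a toy pattern, not the package's real one.

```go
package main

import (
	"fmt"
	"regexp"
)

// match returns the first match of a toy domain pattern in s. The TLD
// alternation deliberately lists the shorter "co" before "com".
func match(s string, longest bool) string {
	re := regexp.MustCompile(`example\.(?:co|com)`)
	if longest {
		re.Longest() // prefer the leftmost-longest match from here on
	}
	return re.FindString(s)
}

func main() {
	fmt.Println(match("example.com", false)) // example.co
	fmt.Println(match("example.com", true))  // example.com
}
```

Without leftmost-longest matching, the order of alternatives in a TLD list decides which suffix wins, which is why a long TLD list benefits from it.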
type DomainExtractorInterface ¶
DomainExtractorInterface defines the interface for domain extraction functionality. It ensures that any domain extractor can compile regular expressions to match domain names.
type DomainExtractorOptionFunc ¶
type DomainExtractorOptionFunc func(*DomainExtractor)
DomainExtractorOptionFunc defines a function type for configuring a DomainExtractor. It allows setting options like custom patterns for root domains and TLDs.
func DomainExtractorWithRootDomainPattern ¶
func DomainExtractorWithRootDomainPattern(pattern string) DomainExtractorOptionFunc
DomainExtractorWithRootDomainPattern returns an option function to configure the DomainExtractor with a custom regex pattern for matching root domains (e.g., "example" in "example.com").
Parameters:
- pattern: The custom root domain regex pattern.
Returns:
- A function that applies the custom root domain pattern to the DomainExtractor.
func DomainExtractorWithTLDPattern ¶
func DomainExtractorWithTLDPattern(pattern string) DomainExtractorOptionFunc
DomainExtractorWithTLDPattern returns an option function to configure the DomainExtractor with a custom regex pattern for matching top-level domains (TLDs) (e.g., "com" in "example.com").
Parameters:
- pattern: The custom TLD regex pattern.
Returns:
- A function that applies the custom TLD pattern to the DomainExtractor.
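The option functions above follow the functional-options pattern. A minimal standalone sketch of that pattern is shown below; the lowercase names and the default patterns are placeholders invented for the example and do not reflect the package's real defaults.

```go
package main

import "fmt"

// extractor is a local stand-in with the two exported fields shown above.
type extractor struct {
	RootDomainPattern     string
	TopLevelDomainPattern string
}

// optionFunc mutates an extractor during construction.
type optionFunc func(*extractor)

// withRootDomainPattern returns an option that overrides the root domain pattern.
func withRootDomainPattern(pattern string) optionFunc {
	return func(e *extractor) { e.RootDomainPattern = pattern }
}

// withTLDPattern returns an option that overrides the TLD pattern.
func withTLDPattern(pattern string) optionFunc {
	return func(e *extractor) { e.TopLevelDomainPattern = pattern }
}

// newExtractor applies defaults first, then each option in order.
func newExtractor(opts ...optionFunc) *extractor {
	e := &extractor{
		RootDomainPattern:     `[a-z0-9-]+`, // placeholder default for this sketch
		TopLevelDomainPattern: `[a-z]{2,}`,  // placeholder default for this sketch
	}
	for _, opt := range opts {
		opt(e)
	}
	return e
}

func main() {
	e := newExtractor(withTLDPattern(`(?:com|net)`))
	fmt.Println(e.RootDomainPattern, e.TopLevelDomainPattern)
}
```

Because options are applied after the defaults, a caller only names the settings it wants to change.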
type DomainInterface ¶
type DomainInterface interface {
String() (domain string)
}
DomainInterface defines an interface for domain representations.
type DomainParser ¶
type DomainParser struct {
// contains filtered or unexported fields
}
DomainParser is responsible for parsing domain names into their constituent parts: subdomain, root domain (SLD), and top-level domain (TLD). It utilizes a suffix array to efficiently identify TLDs from a comprehensive list of known TLDs (both standard and pseudo-TLDs). This allows the parser to split the domain into subdomain, root domain, and TLD components quickly and accurately.
The suffix array helps in handling a large number of known TLDs and enables fast lookups, even for complex domain structures where subdomains might be mistaken for TLDs.
Fields:
- sa (*suffixarray.Index): The suffix array index used for efficiently searching through known TLDs. This allows for rapid identification of the TLD in the domain string.
Example Usage:
parser := NewDomainParser()
domain := "www.example.com"
parsedDomain := parser.Parse(domain)

fmt.Println(parsedDomain.Subdomain) // Output: "www"
fmt.Println(parsedDomain.SLD)       // Output: "example"
fmt.Println(parsedDomain.TLD)       // Output: "com"
func NewDomainParser ¶
func NewDomainParser(opts ...DomainParserOptionFunc) (parser *DomainParser)
NewDomainParser creates a new DomainParser instance and initializes it with a comprehensive list of TLDs, including both standard TLDs and pseudo-TLDs. Additional options can be passed to customize the parser, such as using a custom set of TLDs.
Parameters:
- opts (variadic DomainParserOptionFunc): Optional configuration options.
Returns:
- parser (*DomainParser): A pointer to the initialized DomainParser.
func (*DomainParser) Parse ¶
func (p *DomainParser) Parse(domain string) (parsed *Domain)
Parse takes a full domain string (e.g., "www.example.com") and splits it into three main components: subdomain, root domain (SLD), and TLD. The method uses the suffix array to identify the TLD and then extracts the subdomain and root domain from the rest of the domain string.
Parameters:
- domain (string): The full domain string to be parsed.
Returns:
- parsed (*Domain): A pointer to a Domain struct containing the subdomain, root domain (SLD), and TLD.
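The suffix-array lookup can be approximated with the standard `index/suffixarray` package. The sketch below is only one plausible way to do it: the NUL delimiter, the longest-suffix-first loop, the fallback for unknown TLDs, and the helper names `newTLDIndex` and `split` are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"index/suffixarray"
	"strings"
)

// sep delimits TLD entries in the indexed data so that a lookup matches
// whole entries only.
const sep = "\x00"

// newTLDIndex builds a suffix array over a delimiter-joined TLD list.
func newTLDIndex(tlds ...string) *suffixarray.Index {
	return suffixarray.New([]byte(sep + strings.Join(tlds, sep) + sep))
}

// split (hypothetical helper) separates name into subdomain, SLD, and TLD.
func split(idx *suffixarray.Index, name string) (sub, sld, tld string) {
	labels := strings.Split(name, ".")
	// Try the longest candidate suffix first so "co.uk" wins over "uk".
	for i := 1; i < len(labels); i++ {
		candidate := strings.Join(labels[i:], ".")
		if len(idx.Lookup([]byte(sep+candidate+sep), 1)) > 0 {
			return strings.Join(labels[:i-1], "."), labels[i-1], candidate
		}
	}
	// No known TLD: treat the last label as the SLD (a choice made for this sketch).
	last := len(labels) - 1
	return strings.Join(labels[:last], "."), labels[last], ""
}

func main() {
	idx := newTLDIndex("com", "uk", "co.uk")
	fmt.Println(split(idx, "www.example.co.uk"))
	fmt.Println(split(idx, "example.com"))
}
```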
type DomainParserInterface ¶
type DomainParserInterface interface {
	Parse(domain string) (parsed *Domain)
	// contains filtered or unexported methods
}
DomainParserInterface defines the interface for domain parsing functionality.
type DomainParserOptionFunc ¶
type DomainParserOptionFunc func(*DomainParser)
DomainParserOptionFunc defines a function type for configuring a DomainParser instance. This allows customization options like specifying custom TLDs.
Example:
parser := NewDomainParser(DomainParserWithTLDs("custom", "tld"))
func DomainParserWithTLDs ¶
func DomainParserWithTLDs(TLDs ...string) DomainParserOptionFunc
DomainParserWithTLDs allows the DomainParser to be initialized with a custom set of TLDs. This option is useful for handling non-standard or niche TLDs that may not be included in the default set.
Parameters:
- TLDs (...string): One or more custom TLDs to be used by the DomainParser, passed as variadic arguments.
Returns:
- A DomainParserOptionFunc that applies the custom TLDs to the parser.
type Extractor ¶
type Extractor struct {
// contains filtered or unexported fields
}
Extractor is a struct that configures the URL extraction process. It provides options for controlling whether URL schemes and hosts are mandatory, and allows custom regular expression patterns to be specified for these components. This allows fine-grained control over the types of URLs that are extracted from text.
func NewExtractor ¶
func NewExtractor(opts ...ExtractorOptionFunc) (extractor *Extractor)
NewExtractor creates a new Extractor instance with optional configuration. The options customize how URLs are extracted, such as whether a URL scheme or host is required for a match.
func (*Extractor) CompileRegex ¶
CompileRegex constructs and compiles a regular expression based on the Extractor configuration. It builds a regex pattern that can capture various forms of URLs, including those with or without schemes and hosts. The method also supports custom patterns provided by the user, ensuring that the longest possible match for a URL is found, improving accuracy in URL extraction.
type ExtractorInterface ¶
ExtractorInterface defines the interface that Extractor should implement. It ensures that Extractor has the ability to compile regex patterns for URL extraction.
type ExtractorOptionFunc ¶
type ExtractorOptionFunc func(*Extractor)
ExtractorOptionFunc defines a function type for configuring Extractor instances. It allows users to pass options that modify the behavior of the Extractor, such as whether schemes or hosts are required in URL extraction.
func ExtractorWithHost ¶
func ExtractorWithHost() ExtractorOptionFunc
ExtractorWithHost returns an option function that configures the Extractor to require URL hosts in the extraction process.
func ExtractorWithHostPattern ¶
func ExtractorWithHostPattern(pattern string) ExtractorOptionFunc
ExtractorWithHostPattern returns an option function that allows specifying a custom regex pattern for matching URL hosts.
func ExtractorWithScheme ¶
func ExtractorWithScheme() ExtractorOptionFunc
ExtractorWithScheme returns an option function that configures the Extractor to require URL schemes in the extraction process.
func ExtractorWithSchemePattern ¶
func ExtractorWithSchemePattern(pattern string) ExtractorOptionFunc
ExtractorWithSchemePattern returns an option function that allows specifying a custom regex pattern for matching URL schemes.
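To show the effect of requiring a scheme, the sketch below builds two regular expressions from simplified parts and runs them over the same text. The scheme part is the generic alternative from ExtractorSchemePattern; the host and remainder patterns, and the `compile` helper, are simplified assumptions for the example and are much looser than what the package generates.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	schemePattern = `[a-zA-Z][a-zA-Z.\-+]*://`      // generic scheme followed by "://"
	hostPattern   = `[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}` // simplified host ending in an alphabetic TLD
	restPattern   = `\S*`                           // everything up to the next whitespace
)

// compile (hypothetical helper) makes the scheme mandatory or optional.
func compile(withScheme bool) *regexp.Regexp {
	scheme := `(?:` + schemePattern + `)`
	if !withScheme {
		scheme += `?`
	}
	return regexp.MustCompile(scheme + hostPattern + restPattern)
}

func main() {
	text := "see https://example.com/a and example.org"
	fmt.Println(compile(true).FindAllString(text, -1))  // only the URL with a scheme
	fmt.Println(compile(false).FindAllString(text, -1)) // both candidates
}
```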
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser is responsible for parsing URLs while also handling domain-related parsing through the use of a DomainParser. It extends basic URL parsing functionality by providing support for handling custom schemes and extracting domain components such as subdomains, root domains, and TLDs.
Fields:
- dp (*DomainParser): A reference to a `DomainParser` used for extracting subdomain, root domain, and TLD information from the host part of the URL.
- scheme (string): The default scheme to use when parsing URLs without a specified scheme. For example, if a URL is missing a scheme (e.g., "www.example.com") and the default scheme is "https", the parser prepends it, resulting in "https://www.example.com".
Methods:
- Parse(unparsed string) (parsed *URL, err error): Takes a raw URL string and parses it into a custom `URL` struct that includes both the standard URL components (via the embedded `net/url.URL`) and domain-specific details.
  - If the URL does not include a scheme, the default scheme is added (if specified).
  - The method then uses the DomainParser to break down the domain into subdomain, root domain, and TLD components.
Example Usage:
parser := NewParser(ParserWithDefaultScheme("https"))

parsedURL, err := parser.Parse("example.com/path")
if err != nil {
	log.Fatal(err)
}

fmt.Println(parsedURL.Scheme)     // Output: https
fmt.Println(parsedURL.Hostname()) // Output: example.com
fmt.Println(parsedURL.Domain.SLD) // Output: example
func NewParser ¶
func NewParser(opts ...ParserOptionFunc) (parser *Parser)
NewParser creates and initializes a new Parser with the given options. The Parser is also initialized with a DomainParser for extracting domain-specific details such as subdomain, root domain, and TLD. Additional configuration options can be applied using the variadic `opts` parameter.
Parameters:
- opts: A variadic list of `ParserOptionFunc` functions that can configure the Parser.
Returns:
- parser (*Parser): A pointer to the initialized Parser instance.
func (*Parser) Parse ¶
Parse takes a raw URL string and parses it into a custom URL struct that includes:
- Standard URL components from `net/url` (scheme, host, path, etc.)
- Domain-specific details such as subdomain, root domain, and TLD.
If the URL does not specify a scheme, the default scheme (if any) is added. The method also validates and parses the host and port (if specified).
Parameters:
- unparsed (string): The raw URL string to parse.
Returns:
- parsed (*URL): A pointer to the parsed URL struct containing both standard URL components and domain-specific details.
- err (error): An error if the URL cannot be parsed.
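The default-scheme step can be sketched with `net/url` alone. The helper `parseWithDefault` and its test for a missing scheme (no "://" in the input) are assumptions for the example; the package may detect a missing scheme differently.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseWithDefault (hypothetical helper) prepends defaultScheme when raw has
// no "://" separator, then defers to net/url for the actual parsing.
func parseWithDefault(raw, defaultScheme string) (*url.URL, error) {
	if defaultScheme != "" && !strings.Contains(raw, "://") {
		raw = defaultScheme + "://" + raw
	}
	return url.Parse(raw)
}

func main() {
	u, err := parseWithDefault("example.com/path", "https")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(u.Scheme, u.Hostname(), u.Path)
}
```

Without the prepended scheme, `url.Parse` would treat "example.com/path" as a relative path and leave the host empty, which is why the default scheme matters for host and domain extraction.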
type ParserInterface ¶
ParserInterface defines the interface that all Parser implementations must adhere to.
type ParserOptionFunc ¶
type ParserOptionFunc func(*Parser)
ParserOptionFunc defines a function type for configuring a Parser instance. It is used to apply various options such as setting the default scheme.
Example:
parser := NewParser(ParserWithDefaultScheme("https"))
func ParserWithDefaultScheme ¶
func ParserWithDefaultScheme(scheme string) ParserOptionFunc
ParserWithDefaultScheme returns a `ParserOptionFunc` that sets the default scheme for the Parser. This function allows you to specify a default scheme (e.g., "http" or "https") that will be added to URLs that don't provide one.
Parameters:
- scheme (string): The default scheme to set (e.g., "http" or "https").
Returns:
- A `ParserOptionFunc` that applies the default scheme to the Parser.
type URL ¶
URL extends the standard `net/url.URL` struct by embedding it and adding a field for domain-related information. The separate `Domain` struct breaks the domain down into subdomain, second-level domain (SLD), and top-level domain (TLD).
Fields:
- URL (*url.URL): Embeds the standard `net/url.URL` struct, which provides the base URL parsing functionality, such as handling the scheme, host, path, query parameters, and fragment. Fields and methods of the embedded `net/url.URL` can be used transparently.
- Domain (*Domain): A pointer to the `Domain` struct that contains parsed domain information, including:
  - Subdomain (string): The subdomain of the URL (e.g., "www" in "www.example.com").
  - SLD (string): The second-level (root) domain (e.g., "example" in "www.example.com").
  - TLD (string): The top-level domain, or domain suffix (e.g., "com" in "www.example.com").
This allows for better handling of domain components, which is useful in cases like:
- URL classification and domain analysis.
- Security or SEO applications where separating domain components is important.
Example Usage:
// Parse a URL using the standard url.Parse function.
parsedURL, err := url.Parse("https://www.example.com")
if err != nil {
	log.Fatal(err)
}

// Create an extended URL object and manually add domain information.
extendedURL := &URL{
	URL: parsedURL, // Embeds the parsed URL from the standard library.
	// Domain can be parsed separately or manually assigned.
	Domain: &Domain{
		Subdomain: "www",     // Subdomain part (e.g., "www").
		SLD:       "example", // Root domain part (e.g., "example").
		TLD:       "com",     // Top-level domain part (e.g., "com").
	},
}

// Access standard URL components.
fmt.Println(extendedURL.Scheme) // Output: https
fmt.Println(extendedURL.Host)   // Output: www.example.com
fmt.Println(extendedURL.Path)   // Output: (empty string, because the URL has no path)

// Access domain-specific information.
fmt.Println(extendedURL.Domain.Subdomain) // Output: www
fmt.Println(extendedURL.Domain.SLD)       // Output: example
fmt.Println(extendedURL.Domain.TLD)       // Output: com
Purpose:
This `URL` struct provides a more detailed breakdown of a URL's domain components, making it particularly useful for tasks involving domain analysis, URL classification, or scenarios where understanding subdomains, root domains, and TLDs is important.
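The embedding described above can be tried in isolation. In the sketch below, `extendedURL`, `domain`, and `wrap` are local stand-ins declared for the example; only the `net/url` calls are real library API.

```go
package main

import (
	"fmt"
	"net/url"
)

// domain is a local stand-in for the Domain struct.
type domain struct {
	Subdomain, SLD, TLD string
}

// extendedURL mirrors the shape described above: an embedded *url.URL plus a
// pointer to the domain breakdown.
type extendedURL struct {
	*url.URL
	Domain *domain
}

// wrap (hypothetical helper) parses raw and attaches the given domain parts.
func wrap(raw string, d *domain) (*extendedURL, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return nil, err
	}
	return &extendedURL{URL: u, Domain: d}, nil
}

func main() {
	e, err := wrap("https://www.example.com/docs?q=1", &domain{"www", "example", "com"})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	// Fields and methods of the embedded *url.URL are promoted.
	fmt.Println(e.Scheme, e.Hostname(), e.Path, e.Query().Get("q"))
	fmt.Println(e.Domain.SLD + "." + e.Domain.TLD)
}
```

Because the embedded value is a pointer, a zero-value `extendedURL` has a nil `*url.URL`, so promoted fields should only be read after a successful parse.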
Directories ¶
Path | Synopsis |
---|---|
gen | |
schemes | Package schemes provides a collection of constants and lists representing URL schemes. |
tlds | Package tlds provides a collection of constants and lists representing official top-level domains (TLDs) and pseudo or special-use TLDs. |
unicodes | Package unicodes provides constants for defining sets of allowed Unicode characters. |