Documentation ¶
Overview ¶
Sequence is a high-performance sequential log scanner, analyzer and parser. It goes through a log message sequentially and parses out the meaningful parts without the use of regular expressions. It can parse over 100,000 messages per second without the need to separate parsing rules by log source type.
Documentation and other information are available at https://github.com/leolee192/sequencer/wiki
Index ¶
Constants ¶
This section is empty.
Variables ¶
var (
	TagTypesCount   int
	TokenTypesCount = int(token__END__) + 1
)
var (
ErrNoMatch = errors.New("sequence: no pattern matched for this message")
)
Functions ¶
func ReadConfig ¶
Types ¶
type Analyzer ¶
type Analyzer struct {
// contains filtered or unexported fields
}
Analyzer builds an analysis tree that represents all the Sequences from messages. It can be used to determine all of the unique patterns for a large body of messages.
It is based on a single basic concept: across multiple log messages, if the tokens in the same position share the same parent and the same child, then the token in that position is likely a variable string, which means it is something we can extract. For example, take a look at the following two messages:
Jan 12 06:49:42 irc sshd[7034]: Accepted password for root from 218.161.81.238 port 4228 ssh2
Jan 12 14:44:48 jlz sshd[11084]: Accepted publickey for jlz from 76.21.0.16 port 36609 ssh2
The first token of each message is a timestamp, and the 3rd token of each message is the literal "sshd". The literals "irc" and "jlz" both share a common parent, which is a timestamp. They also both share a common child, which is "sshd". This means the token in between, the 2nd token in each message, likely represents a variable token in this message type. In this case, "irc" and "jlz" happen to represent the syslog host.
Looking further down the message, the literals "password" and "publickey" also share a common parent, "Accepted", and a common child, "for". So that means the token in this position is also a variable token (of type TokenString).
You can find several tokens that share a common parent and a common child in these two messages, which means each of these tokens can be extracted. Finally, we can determine that the single pattern that will match both is:
%time% %string% sshd [ %integer% ] : Accepted %string% for %string% from %ipv4% port %integer% ssh2
If later we add another message to this mix:
Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2
The Analyzer will determine that the literals "Accepted" in the 1st message, and "Failed" in the 3rd message share a common parent ":" and a common child "password", so it will determine that the token in this position is also a variable token. After all three messages are analyzed, the final pattern that will match all three messages is:
%time% %string% sshd [ %integer% ] : %string% %string% for %string% from %ipv4% port %integer% ssh2
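The parent/child heuristic described above can be sketched in a few lines of Go. This is a deliberately simplified illustration and not the package's Analyzer: it assumes every message has already been split into the same number of tokens, that typed tokens are already written as placeholders such as %time% and %integer%, and the names variablePositions, mergePattern and sampleMessages are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// variablePositions reports which positions hold differing tokens that share
// the same parent (previous token) and the same child (next token) in at
// least one pair of messages. It assumes all messages have equal length.
func variablePositions(msgs [][]string) []bool {
	if len(msgs) == 0 {
		return nil
	}
	n := len(msgs[0])
	variable := make([]bool, n)
	for i := 0; i < n; i++ {
		for a := 0; a < len(msgs) && !variable[i]; a++ {
			for b := a + 1; b < len(msgs); b++ {
				if msgs[a][i] == msgs[b][i] {
					continue
				}
				// The start and end of a message count as a shared neighbor.
				sameParent := i == 0 || msgs[a][i-1] == msgs[b][i-1]
				sameChild := i == n-1 || msgs[a][i+1] == msgs[b][i+1]
				if sameParent && sameChild {
					variable[i] = true
					break
				}
			}
		}
	}
	return variable
}

// mergePattern starts from the first message and replaces every variable
// position with %string%. Positions that differ without a shared parent and
// child are left untouched in this sketch.
func mergePattern(msgs [][]string) string {
	if len(msgs) == 0 {
		return ""
	}
	out := append([]string(nil), msgs[0]...)
	for i, v := range variablePositions(msgs) {
		if v {
			out[i] = "%string%"
		}
	}
	return strings.Join(out, " ")
}

// sampleMessages returns the three sshd messages with typed tokens
// already replaced by placeholders.
func sampleMessages() [][]string {
	return [][]string{
		strings.Fields("%time% irc sshd [ %integer% ] : Accepted password for root from %ipv4% port %integer% ssh2"),
		strings.Fields("%time% jlz sshd [ %integer% ] : Accepted publickey for jlz from %ipv4% port %integer% ssh2"),
		strings.Fields("%time% irc sshd [ %integer% ] : Failed password for root from %ipv4% port %integer% ssh2"),
	}
}

func main() {
	msgs := sampleMessages()
	fmt.Println(mergePattern(msgs[:2])) // first two messages only
	fmt.Println(mergePattern(msgs))     // all three messages
}
```

Run on the first two messages, the sketch keeps "Accepted" as a literal; once the third message is included, that position becomes %string% as well, mirroring the two patterns shown above.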
func NewAnalyzer ¶
func NewAnalyzer() *Analyzer
func (*Analyzer) Add ¶
Add adds a single message sequence to the analysis tree. It will not determine if the tokens share a common parent or child at this point. After all the sequences are added, then Finalize() should be called.
type Parser ¶
type Parser struct {
// contains filtered or unexported fields
}
Parser is a tree-based parsing engine for log messages. It builds a parsing tree based on the pattern sequences supplied and, for each message sequence, returns the matching pattern sequence. Each of the message tokens will be marked with its semantic tag type.
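To make the idea of a parsing tree concrete, here is a minimal sketch of a token trie that stores patterns and walks it for each message. It is not the package's Parser: the node type, its add, parse and parseLine methods, and errNoMatch (which only mirrors the role of ErrNoMatch) are hypothetical, messages are split on whitespace, and only the %string% and %integer% placeholders are handled.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errNoMatch is a hypothetical sentinel playing the role of ErrNoMatch.
var errNoMatch = errors.New("no pattern matched for this message")

// node is one level of the pattern tree, keyed by pattern token.
type node struct {
	children map[string]*node
	pattern  string // non-empty when a complete pattern ends at this node
}

func newNode() *node { return &node{children: map[string]*node{}} }

// add inserts a whitespace-separated pattern into the tree.
func (n *node) add(pattern string) {
	cur := n
	for _, tok := range strings.Fields(pattern) {
		next, ok := cur.children[tok]
		if !ok {
			next = newNode()
			cur.children[tok] = next
		}
		cur = next
	}
	cur.pattern = pattern
}

// matches reports whether a pattern token accepts a message token.
func matches(patTok, msgTok string) bool {
	switch patTok {
	case "%string%":
		return true
	case "%integer%":
		_, err := strconv.Atoi(msgTok)
		return err == nil
	default:
		return patTok == msgTok
	}
}

// parse walks the tree, trying an exact literal first, then %integer%,
// then %string%, and backtracks when a branch fails.
func (n *node) parse(tokens []string) (string, error) {
	if len(tokens) == 0 {
		if n.pattern != "" {
			return n.pattern, nil
		}
		return "", errNoMatch
	}
	for _, patTok := range []string{tokens[0], "%integer%", "%string%"} {
		child, ok := n.children[patTok]
		if !ok || !matches(patTok, tokens[0]) {
			continue
		}
		if p, err := child.parse(tokens[1:]); err == nil {
			return p, nil
		}
	}
	return "", errNoMatch
}

// parseLine tokenizes a message on whitespace and parses it.
func (n *node) parseLine(line string) (string, error) {
	return n.parse(strings.Fields(line))
}

func main() {
	root := newNode()
	root.add("Accepted %string% for %string% port %integer%")
	root.add("Failed %string% for %string% port %integer%")

	for _, msg := range []string{
		"Accepted password for root port 4228",
		"Rejected password for root port 4228",
	} {
		p, err := root.parseLine(msg)
		if errors.Is(err, errNoMatch) {
			fmt.Println("no match:", msg)
			continue
		}
		fmt.Println("matched:", p)
	}
}
```

Because patterns that share a prefix share nodes, the cost of matching depends on the length of the message rather than on the number of patterns, which is one plausible reason for choosing a tree over a list of rules.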
type Scanner ¶
type Scanner struct {
// contains filtered or unexported fields
}
Scanner is a sequential lexical analyzer that breaks a log message into a sequence of tokens. It is sequential because it goes through the log message sequentially, tokenizing each part of the message, without the use of regular expressions. The scanner currently recognizes time stamps, IPv4 addresses, URLs, MAC addresses, integers and floating point numbers.
For example, the following message
Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2
Returns the following Sequence:
Sequence{
	Token{TokenTime, TagUnknown, "Jan 12 06:49:42"},
	Token{TokenLiteral, TagUnknown, "irc"},
	Token{TokenLiteral, TagUnknown, "sshd"},
	Token{TokenLiteral, TagUnknown, "["},
	Token{TokenInteger, TagUnknown, "7034"},
	Token{TokenLiteral, TagUnknown, "]"},
	Token{TokenLiteral, TagUnknown, ":"},
	Token{TokenLiteral, TagUnknown, "Failed"},
	Token{TokenLiteral, TagUnknown, "password"},
	Token{TokenLiteral, TagUnknown, "for"},
	Token{TokenLiteral, TagUnknown, "root"},
	Token{TokenLiteral, TagUnknown, "from"},
	Token{TokenIPv4, TagUnknown, "218.161.81.238"},
	Token{TokenLiteral, TagUnknown, "port"},
	Token{TokenInteger, TagUnknown, "4228"},
	Token{TokenLiteral, TagUnknown, "ssh2"},
}
The following message
id=firewall time="2005-03-18 14:01:43" fw=TOPSEC priv=4 recorder=kernel type=conn policy=504 proto=TCP rule=deny src=210.82.121.91 sport=4958 dst=61.229.37.85 dport=23124 smac=00:0b:5f:b2:1d:80 dmac=00:04:c1:8b:d8:82
Will return
Sequence{
	Token{TokenLiteral, TagUnknown, "id"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "firewall"},
	Token{TokenLiteral, TagUnknown, "time"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenTime, TagUnknown, "2005-03-18 14:01:43"},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenLiteral, TagUnknown, "fw"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TOPSEC"},
	Token{TokenLiteral, TagUnknown, "priv"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4"},
	Token{TokenLiteral, TagUnknown, "recorder"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "kernel"},
	Token{TokenLiteral, TagUnknown, "type"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "conn"},
	Token{TokenLiteral, TagUnknown, "policy"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "504"},
	Token{TokenLiteral, TagUnknown, "proto"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TCP"},
	Token{TokenLiteral, TagUnknown, "rule"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "deny"},
	Token{TokenLiteral, TagUnknown, "src"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "210.82.121.91"},
	Token{TokenLiteral, TagUnknown, "sport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4958"},
	Token{TokenLiteral, TagUnknown, "dst"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "61.229.37.85"},
	Token{TokenLiteral, TagUnknown, "dport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "23124"},
	Token{TokenLiteral, TagUnknown, "smac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:0b:5f:b2:1d:80"},
	Token{TokenLiteral, TagUnknown, "dmac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:04:c1:8b:d8:82"},
}
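As a rough illustration of regex-free token typing, the sketch below classifies a single token using only standard-library parsers. It is an assumption-laden stand-in, not the Scanner's algorithm: the real scanner walks the message character by character, whereas this sketch expects tokens that are already separated, and the classify function and its lowercase kind names are hypothetical.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// classify returns a coarse kind for one token without regular expressions.
// Checks run from most to least specific; anything unrecognized is a literal.
func classify(tok string) string {
	if _, err := strconv.Atoi(tok); err == nil {
		return "integer"
	}
	// Reject anything with a colon so IPv4-mapped IPv6 text is not counted.
	if ip := net.ParseIP(tok); ip != nil && ip.To4() != nil && !strings.Contains(tok, ":") {
		return "ipv4"
	}
	if _, err := net.ParseMAC(tok); err == nil {
		return "mac"
	}
	// Require a decimal point so words such as "inf" or "nan" stay literals.
	if strings.Contains(tok, ".") {
		if _, err := strconv.ParseFloat(tok, 64); err == nil {
			return "float"
		}
	}
	return "literal"
}

func main() {
	for _, tok := range strings.Fields("sshd 7034 218.161.81.238 00:0b:5f:b2:1d:80 0.5") {
		fmt.Printf("%-20s %s\n", tok, classify(tok))
	}
}
```

Time stamps and URLs are left out of the sketch because recognizing them requires looking across several whitespace-separated fields or matching against a list of formats.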
func NewScanner ¶
func NewScanner() *Scanner
func (*Scanner) Scan ¶
Scan returns a Sequence, or a list of tokens, for the data string supplied. Scan is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.
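The two caveats above (results are overwritten by the next scan, and a scanner must not be shared between goroutines) can be illustrated with a stand-in type. The fieldScanner below is hypothetical and is not the package's Scanner; it only mimics the behavior of reusing one internal buffer between calls.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// fieldScanner is a hypothetical scanner that reuses one internal buffer,
// so a returned slice is only valid until the next call to scan.
type fieldScanner struct {
	buf []string
}

func (s *fieldScanner) scan(line string) []string {
	s.buf = append(s.buf[:0], strings.Fields(line)...)
	return s.buf
}

// countTokens fans lines out to worker goroutines. Each worker owns its own
// fieldScanner, so no scanner is ever shared between goroutines.
func countTokens(lines []string, workers int) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan string)
	counts := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s := &fieldScanner{} // one scanner per goroutine
			n := 0
			for line := range jobs {
				n += len(s.scan(line))
			}
			counts <- n
		}()
	}
	go func() {
		for _, l := range lines {
			jobs <- l
		}
		close(jobs)
	}()
	go func() {
		wg.Wait()
		close(counts)
	}()
	total := 0
	for n := range counts {
		total += n
	}
	return total
}

func main() {
	s := &fieldScanner{}
	first := s.scan("Accepted password")
	kept := append([]string(nil), first...) // copy before scanning again
	s.scan("Failed publickey")
	fmt.Println(first[0], kept[0]) // first now aliases the second scan

	fmt.Println(countTokens([]string{"a b c", "d e"}, 2))
}
```

Copying the result before the next call is the price of avoiding an allocation per message; callers that only inspect a result immediately can skip the copy.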
func (*Scanner) ScanJson ¶
ScanJson returns a Sequence, or a list of tokens, for the JSON string supplied. ScanJson is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.
ScanJson flattens a json string into key=value pairs, and it performs the following transformation:
- all {, }, [, ], ", characters are removed
- the colon between a key and its value is changed to "="
- nested objects have their keys concatenated with ".", so a json string like "userIdentity": {"type": "IAMUser"} will be returned as userIdentity.type=IAMUser
- arrays are flattened by appending an index number to the end of the key, starting with 0, so a json string like {"value":[{"open":"2014-08-16T13:00:00.000+0000"}]} will be returned as value.0.open = 2014-08-16T13:00:00.000+0000
- any key that has an empty value is skipped, so json strings like "reference":"" or "filterSet": {} will not show up in the Sequence
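The flattening rules listed above can be approximated with the standard encoding/json package. The sketch below is not how ScanJson is implemented (the description suggests it works on the raw string); flattenJSON, flatten and join are hypothetical names, the output is sorted only to make it deterministic, and treating JSON null as an empty value is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// join builds a dotted key such as "userIdentity.type" or "value.0.open".
func join(prefix, key string) string {
	if prefix == "" {
		return key
	}
	return prefix + "." + key
}

// flatten walks a decoded JSON value and records key=value pairs in out.
// Empty strings, empty objects, empty arrays and null produce no entry.
func flatten(prefix string, v interface{}, out map[string]string) {
	switch t := v.(type) {
	case map[string]interface{}:
		for k, child := range t {
			flatten(join(prefix, k), child, out)
		}
	case []interface{}:
		for i, child := range t {
			flatten(join(prefix, strconv.Itoa(i)), child, out)
		}
	case string:
		if t != "" {
			out[prefix] = t
		}
	case json.Number:
		out[prefix] = t.String()
	case bool:
		out[prefix] = strconv.FormatBool(t)
	}
}

// flattenJSON decodes data and returns sorted "key=value" strings.
func flattenJSON(data string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.UseNumber() // keep numbers as written instead of converting to float64
	var v interface{}
	if err := dec.Decode(&v); err != nil {
		return nil, err
	}
	out := map[string]string{}
	flatten("", v, out)
	pairs := make([]string, 0, len(out))
	for k, val := range out {
		pairs = append(pairs, k+"="+val)
	}
	sort.Strings(pairs)
	return pairs, nil
}

func main() {
	pairs, err := flattenJSON(`{"userIdentity":{"type":"IAMUser"},"value":[{"open":"2014-08-16T13:00:00.000+0000"}],"reference":"","filterSet":{}}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range pairs {
		fmt.Println(p)
	}
}
```

For the input in main, the sketch yields userIdentity.type=IAMUser and value.0.open=2014-08-16T13:00:00.000+0000, and drops the reference and filterSet keys.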
type Sequence ¶
type Sequence []Token
Sequence represents a list of tokens returned from the scanner, analyzer or parser.
func (Sequence) PrintTokens ¶
PrintTokens returns a multi-line representation of the tokens in the sequence.
type TagType ¶
type TagType int
TagType is the semantic representation of a token.
var (
	TagUnknown    TagType = 0
	TagMsgId      TagType // The message identifier
	TagMsgTime    TagType // The timestamp that’s part of the log message
	TagSeverity   TagType // The severity of the event, e.g., Emergency, …
	TagPriority   TagType // The priority of the event
	TagAppHost    TagType // The hostname of the host where the log message is generated
	TagAppIP      TagType // The IP address of the host where the application that generated the log message is running on.
	TagAppVendor  TagType // The type of application that generated the log message, e.g., Cisco, ISS
	TagAppName    TagType // The name of the application that generated the log message, e.g., asa, snort, sshd
	TagSrcDomain  TagType // The domain name of the initiator of the event, usually a Windows domain
	TagSrcZone    TagType // The originating zone
	TagSrcHost    TagType // The hostname of the originator of the event or connection.
	TagSrcIP      TagType // The IPv4 address of the originator of the event or connection.
	TagSrcIPNAT   TagType // The natted (network address translation) IP of the originator of the event or connection.
	TagSrcPort    TagType // The port number of the originating connection.
	TagSrcPortNAT TagType // The natted port number of the originating connection.
	TagSrcMac     TagType // The mac address of the host that originated the connection.
	TagSrcUser    TagType // The user that originated the session.
	TagSrcUid     TagType // The user id that originated the session.
	TagSrcGroup   TagType // The group that originated the session.
	TagSrcGid     TagType // The group id that originated the session.
	TagSrcEmail   TagType // The originating email address
	TagDstDomain  TagType // The domain name of the destination of the event, usually a Windows domain
	TagDstZone    TagType // The destination zone
	TagDstHost    TagType // The hostname of the destination of the event or connection.
	TagDstIP      TagType // The IPv4 address of the destination of the event or connection.
	TagDstIPNAT   TagType // The natted (network address translation) IP of the destination of the event or connection.
	TagDstPort    TagType // The destination port number of the connection.
	TagDstPortNAT TagType // The natted destination port number of the connection.
	TagDstMac     TagType // The mac address of the destination host.
	TagDstUser    TagType // The user at the destination.
	TagDstUid     TagType // The user id that originated the session.
	TagDstGroup   TagType // The group that originated the session.
	TagDstGid     TagType // The group id that originated the session.
	TagDstEmail   TagType // The destination email address
	TagProtocol   TagType // The protocol, such as TCP, UDP, ICMP, of the connection
	TagInIface    TagType // The incoming interface
	TagOutIface   TagType // The outgoing interface
	TagPolicyID   TagType // The policy ID
	TagSessionID  TagType // The session or process ID
	TagObject     TagType // The object affected.
	TagAction     TagType // The action taken
	TagCommand    TagType // The command executed
	TagMethod     TagType // The method in which the action was taken, for example, public key or password for ssh
	TagStatus     TagType // The status of the action taken
	TagReason     TagType // The reason for the action taken or the status returned
	TagBytesRecv  TagType // The number of bytes received
	TagBytesSent  TagType // The number of bytes sent
	TagPktsRecv   TagType // The number of packets received
	TagPktsSent   TagType // The number of packets sent
	TagDuration   TagType // The duration of the session
)
type Token ¶
type Token struct {
	Type  TokenType // Type is the type of token the Value represents.
	Tag   TagType   // Tag determines which tag the Value should be.
	Value string    // Value is the extracted string from the log message.
	// contains filtered or unexported fields
}
Token is a piece of information extracted from a log message. The Scanner will do its best to determine the TokenType which could be a time stamp, IPv4 or IPv6 address, a URL, a mac address, an integer or a floating point number. In addition, if the Scanner finds a token that's surrounded by %, e.g., %srcuser%, it will try to determine the correct tag type the token represents.
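The %name% convention can be pictured as a simple lookup on the text between the percent signs. The sketch below is only an illustration: tagFor and knownTags are hypothetical, and apart from %srcuser%, which appears in the text above, the placeholder spellings in the table are assumptions rather than names confirmed by the package.

```go
package main

import (
	"fmt"
	"strings"
)

// knownTags is a hypothetical lookup table from placeholder name to the
// name of a tag constant. Only "srcuser" is taken from the documentation.
var knownTags = map[string]string{
	"srcuser": "TagSrcUser",
	"srcip":   "TagSrcIP",   // assumed spelling
	"dstport": "TagDstPort", // assumed spelling
}

// tagFor returns the tag name for a token of the form %name%, and false
// when the token is not wrapped in percent signs or the name is unknown.
func tagFor(tok string) (string, bool) {
	if len(tok) < 3 || !strings.HasPrefix(tok, "%") || !strings.HasSuffix(tok, "%") {
		return "", false
	}
	tag, ok := knownTags[tok[1:len(tok)-1]]
	return tag, ok
}

func main() {
	for _, tok := range []string{"%srcuser%", "root", "%nosuchtag%"} {
		tag, ok := tagFor(tok)
		fmt.Println(tok, tag, ok)
	}
}
```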
type TokenType ¶
type TokenType int
TokenType is the lexical representation of a token.
const (
	TokenUnknown TokenType = iota // Unknown token
	TokenLiteral                  // Token is a fixed literal
	TokenTime                     // Token is a timestamp, in the format listed in TimeFormats
	TokenIPv4                     // Token is an IPv4 address, in the form of a.b.c.d
	TokenIPv6                     // Token is an IPv6 address
	TokenInteger                  // Token is an integer number
	TokenFloat                    // Token is a floating point number
	TokenURI                      // Token is a URL, in the form of http://... or https://...
	TokenMac                      // Token is a mac address
	TokenString                   // Token is a string that represents multiple possible values
)