README

sequence

sequence is a high performance sequential log scanner, analyzer and parser. It goes through a log message sequentially and parses out the meaningful parts, without the use of regular expressions. It can parse 100,000 - 200,000 messages per second (MPS) without the need to separate parsing rules by log source type.

If you have a set of logs you would like me to test out, please feel free to open an issue and we can arrange a way for me to download and test your logs.

Motivation

Log messages are notoriously difficult to parse because they all have different formats. Entire industries (see Splunk, ArcSight, Tibco LogLogic, Sumo Logic, Logentries, Loggly, LogRhythm, etc.) have been built to solve the problems of parsing, understanding and analyzing log messages.

Let's say you have a bunch of log files you would like to parse. The first problem you will typically run into is that you have no way of telling how many DIFFERENT types of messages there are, so you have no idea how much work it will be to develop rules to parse all of them. Not only that, you have hundreds of thousands, if not millions, of messages in front of you, and you have no idea which messages are worth parsing and which are not.

The typical workflow is to develop a set of regular expressions and keep testing them against the logs until some magical moment when all the logs you want parsed are parsed. Ask anyone who does this for a living and they will tell you this process is long, frustrating and error-prone.

Even after you have developed a set of regular expressions that match the original set of messages, if new messages come in, you will have to determine which of the new messages need to be parsed. And if you develop a new set of regular expressions to parse those new messages, you still have no idea whether the regular expressions will conflict with the ones you wrote before. If you write your regex parsers too liberally, they can easily match the wrong messages.

After all that, you will find that regex parsers are quite slow. A regex parser can typically handle several thousand messages per second. Given enough CPU resources on a large enough machine, regex parsers can probably reach tens of thousands of messages per second. Even to achieve that level of performance, you will likely need to limit the number of regular expressions the parser has: the more regex rules, the slower the parser will go.

To work around this performance issue, companies have tried to separate the regex rules for different log message types into different parsers. For example, they will have a parser for Cisco ASA logs, a parser for sshd logs, a parser for Apache logs, and so on. They then require the users to tell them which parser to use (usually by indicating the log source type of the originating IP address or host).

Sequence was developed to make analyzing and parsing log messages a lot easier and faster.

Performance

The following performance benchmarks were run on a single 4-core (2.8 GHz i7) MacBook Pro, although the tests only used 1 or 2 cores. The first file is a bunch of sshd logs, averaging 98 bytes per message. The second is a Cisco ASA log file, averaging 180 bytes per message. Last is a mix of ASA, sshd and sudo logs, averaging 136 bytes per message.

  $ ./sequence bench scan -i ../../data/sshd.all
  Scanned 212897 messages in 0.78 secs, ~ 272869.35 msgs/sec

  $ ./sequence bench parse -p ../../patterns/sshd.txt -i ../../data/sshd.all
  Parsed 212897 messages in 1.69 secs, ~ 126319.27 msgs/sec

  $ ./sequence bench parse -p ../../patterns/asa.txt -i ../../data/allasa.log
  Parsed 234815 messages in 2.89 secs, ~ 81323.41 msgs/sec

  $ ./sequence bench parse -d ../patterns -i ../data/asasshsudo.log
  Parsed 447745 messages in 4.47 secs, ~ 100159.65 msgs/sec

Performance can be improved by adding more cores:

  $ GOMAXPROCS=2 ./sequence bench scan -i ../../data/sshd.all -w 2
  Scanned 212897 messages in 0.43 secs, ~ 496961.52 msgs/sec

  $ GOMAXPROCS=2 ./sequence bench parse -p ../../patterns/sshd.txt -i ../../data/sshd.all -w 2
  Parsed 212897 messages in 1.00 secs, ~ 212711.83 msgs/sec

  $ GOMAXPROCS=2 ./sequence bench parse -p ../../patterns/asa.txt -i ../../data/allasa.log -w 2
  Parsed 234815 messages in 1.56 secs, ~ 150769.68 msgs/sec

  $ GOMAXPROCS=2 ./sequence bench parse -d ../patterns -i ../data/asasshsudo.log -w 2
  Parsed 447745 messages in 2.52 secs, ~ 177875.94 msgs/sec

Limitations
  • sequence does not handle multi-line logs. Each log message must appear as a single line, so multi-line logs must first be converted into single lines.
  • sequence has only been tested with a limited set of system (Linux, AIX, sudo, ssh, su, dhcp, etc.), network (ASA, PIX, Neoteris, CheckPoint, Juniper Firewall) and infrastructure application (apache, bluecoat, etc.) logs. If you have a set of logs you would like me to test out, please feel free to open an issue and we can arrange a way for me to download and test your logs.
Usage

To run the unit tests, you need to be in the top-level sequence directory:

go get github.com/leolee192/sequencer
cd $GOPATH/src/github.com/leolee192/sequencer
go test

To run the actual command, you need to:

cd $GOPATH/src/github.com/leolee192/sequencer/cmd/sequence
go run sequence.go

If you are hoping to run a few quick tests, you can find some sample log messages in loghub.

Documentation is available at the wiki.

History

The project originated from zentures/sequence, which was put on ice in 2017. Since I tried for weeks and could not contact the original author, I decided to restart the project here in Nov 2019.

Documentation

Overview

Sequence is a high performance sequential log scanner, analyzer and parser. It goes through a log message sequentially and parses out the meaningful parts, without the use of regular expressions. It can parse over 100,000 messages per second without the need to separate parsing rules by log source type.

Documentation and other information are available at https://github.com/leolee192/sequencer/wiki

Constants

This section is empty.

Variables

var (
	TagTypesCount   int
	TokenTypesCount = int(token__END__) + 1
)

var (
	ErrNoMatch = errors.New("sequence: no pattern matched for this message")
)

Functions

func ReadConfig

func ReadConfig(file string) error

Types

type Analyzer

type Analyzer struct {
	// contains filtered or unexported fields
}

Analyzer builds an analysis tree that represents all the Sequences from messages. It can be used to determine all of the unique patterns for a large body of messages.

It's based on a single basic concept: for multiple log messages, if the tokens in the same position share the same parent and the same child, then the token in that position is likely a variable string, which means it's something we can extract. For example, take a look at the following two messages:

Jan 12 06:49:42 irc sshd[7034]: Accepted password for root from 218.161.81.238 port 4228 ssh2
Jan 12 14:44:48 jlz sshd[11084]: Accepted publickey for jlz from 76.21.0.16 port 36609 ssh2

The first token of each message is a timestamp, and the 3rd token of each message is the literal "sshd". The literals "irc" and "jlz" both share a common parent, which is the timestamp, and a common child, which is "sshd". This means the token in between, the 2nd token in each message, likely represents a variable token in this message type. In this case, "irc" and "jlz" happen to represent the syslog host.

Looking further down the message, the literals "password" and "publickey" also share a common parent, "Accepted", and a common child, "for". So that means the token in this position is also a variable token (of type TokenString).

You can find several tokens in these two messages that share a common parent and child, which means each of these tokens can be extracted. Finally, we can determine that the single pattern that will match both is:

%time% %string% sshd [ %integer% ] : Accepted %string% for %string% from %ipv4% port %integer% ssh2

If later we add another message to this mix:

Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2

The Analyzer will determine that the literal "Accepted" in the 1st message and the literal "Failed" in the 3rd message share a common parent ":" and a common child "password", so the token in this position is also a variable token. After all three messages are analyzed, the final pattern that will match all three messages is:

%time% %string% sshd [ %integer% ] : %string% %string% for %string% from %ipv4% port %integer% ssh2
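
As a rough sketch of how these pieces fit together, the sshd messages above could be run through a Scanner and an Analyzer as below. The import path is assumed from the module path (github.com/leolee192/sequencer), Add is assumed to copy what it needs out of the scanned Sequence, and any ReadConfig setup the package may require is omitted; treat this as an illustration of the documented API rather than verified code.

package main

import (
	"fmt"
	"log"

	sequence "github.com/leolee192/sequencer" // import path assumed from the module path
)

func main() {
	messages := []string{
		"Jan 12 06:49:42 irc sshd[7034]: Accepted password for root from 218.161.81.238 port 4228 ssh2",
		"Jan 12 14:44:48 jlz sshd[11084]: Accepted publickey for jlz from 76.21.0.16 port 36609 ssh2",
	}

	scanner := sequence.NewScanner()
	analyzer := sequence.NewAnalyzer()

	// Scan each message into a Sequence and add it to the analysis tree.
	// A Sequence is only valid until the next Scan call, so it is added
	// before the loop moves on.
	for _, msg := range messages {
		seq, err := scanner.Scan(msg)
		if err != nil {
			log.Fatal(err)
		}
		if err := analyzer.Add(seq); err != nil {
			log.Fatal(err)
		}
	}

	// Finalize merges nodes that share a common parent and child.
	if err := analyzer.Finalize(); err != nil {
		log.Fatal(err)
	}

	// Ask the analyzer for the pattern that matches the first message.
	seq, err := scanner.Scan(messages[0])
	if err != nil {
		log.Fatal(err)
	}
	pattern, err := analyzer.Analyze(seq)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(pattern) // expected to resemble the %time% ... ssh2 pattern above
}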

func NewAnalyzer

func NewAnalyzer() *Analyzer

func (*Analyzer) Add

func (this *Analyzer) Add(seq Sequence) error

Add adds a single message sequence to the analysis tree. It does not determine whether the tokens share a common parent or child at this point. After all the sequences are added, Finalize() should be called.

func (*Analyzer) Analyze

func (this *Analyzer) Analyze(seq Sequence) (Sequence, error)

Analyze analyzes the message sequence supplied, and returns the unique pattern that will match this message.

func (*Analyzer) Finalize

func (this *Analyzer) Finalize() error

Finalize will go through the analysis tree and determine which tokens share a common parent and child, merge all the nodes that share at least one parent and one child, and finally compact the tree and remove all dead nodes.

type Message

type Message struct {
	Data string
	// contains filtered or unexported fields
}

func (*Message) Tokenize

func (this *Message) Tokenize() (Token, error)

Tokenize is similar to Scan except it returns one token at a time.

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser is a tree-based parsing engine for log messages. It builds a parsing tree based on the pattern sequences supplied, and for each message sequence it returns the matching pattern sequence. Each of the message tokens will be marked with its semantic tag type.

func NewParser

func NewParser() *Parser

func (*Parser) Add

func (this *Parser) Add(seq Sequence) error

Add will add a single pattern sequence to the parser tree. This effectively builds the parser tree so it can be used for parsing later.

func (*Parser) Parse

func (this *Parser) Parse(seq Sequence) (Sequence, error)

Parse will take the message sequence supplied and go through the parser tree to find the matching pattern sequence. If found, the pattern sequence is returned.
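
As a sketch of driving the Parser (same assumed import path and caveats as the Analyzer example above), a pattern string such as the one discovered by the Analyzer can itself be scanned into a pattern Sequence, since the Scanner recognizes %...% placeholders, and then added to the tree:

// buildAndParse is a sketch: add one pattern to a Parser, then parse a message against the tree.
func buildAndParse() error {
	scanner := sequence.NewScanner()
	parser := sequence.NewParser()

	// The pattern produced by the Analyzer can be fed back in as a pattern sequence.
	pseq, err := scanner.Scan("%time% %string% sshd [ %integer% ] : %string% %string% for %string% from %ipv4% port %integer% ssh2")
	if err != nil {
		return err
	}
	if err := parser.Add(pseq); err != nil {
		return err
	}

	// Parse a concrete message; ErrNoMatch is returned when nothing in the tree matches.
	mseq, err := scanner.Scan("Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2")
	if err != nil {
		return err
	}
	matched, err := parser.Parse(mseq)
	if err != nil {
		return err // may be sequence.ErrNoMatch
	}
	fmt.Println(matched.PrintTokens())
	return nil
}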

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner is a sequential lexical analyzer that breaks a log message into a sequence of tokens. It is sequential because it goes through the log message sequentially, tokenizing each part of the message, without the use of regular expressions. The scanner currently recognizes timestamps, IPv4 addresses, URLs, MAC addresses, integers and floating point numbers.

For example, the following message

Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2

Returns the following Sequence:

Sequence{
	Token{TokenTime, TagUnknown, "Jan 12 06:49:42"},
	Token{TokenLiteral, TagUnknown, "irc"},
	Token{TokenLiteral, TagUnknown, "sshd"},
	Token{TokenLiteral, TagUnknown, "["},
	Token{TokenInteger, TagUnknown, "7034"},
	Token{TokenLiteral, TagUnknown, "]"},
	Token{TokenLiteral, TagUnknown, ":"},
	Token{TokenLiteral, TagUnknown, "Failed"},
	Token{TokenLiteral, TagUnknown, "password"},
	Token{TokenLiteral, TagUnknown, "for"},
	Token{TokenLiteral, TagUnknown, "root"},
	Token{TokenLiteral, TagUnknown, "from"},
	Token{TokenIPv4, TagUnknown, "218.161.81.238"},
	Token{TokenLiteral, TagUnknown, "port"},
	Token{TokenInteger, TagUnknown, "4228"},
	Token{TokenLiteral, TagUnknown, "ssh2"},
}

The following message

id=firewall time="2005-03-18 14:01:43" fw=TOPSEC priv=4 recorder=kernel type=conn policy=504 proto=TCP rule=deny src=210.82.121.91 sport=4958 dst=61.229.37.85 dport=23124 smac=00:0b:5f:b2:1d:80 dmac=00:04:c1:8b:d8:82

Will return

Sequence{
	Token{TokenLiteral, TagUnknown, "id"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "firewall"},
	Token{TokenLiteral, TagUnknown, "time"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenTime, TagUnknown, "2005-03-18 14:01:43"},
	Token{TokenLiteral, TagUnknown, "\""},
	Token{TokenLiteral, TagUnknown, "fw"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TOPSEC"},
	Token{TokenLiteral, TagUnknown, "priv"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4"},
	Token{TokenLiteral, TagUnknown, "recorder"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "kernel"},
	Token{TokenLiteral, TagUnknown, "type"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "conn"},
	Token{TokenLiteral, TagUnknown, "policy"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "504"},
	Token{TokenLiteral, TagUnknown, "proto"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "TCP"},
	Token{TokenLiteral, TagUnknown, "rule"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenLiteral, TagUnknown, "deny"},
	Token{TokenLiteral, TagUnknown, "src"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "210.82.121.91"},
	Token{TokenLiteral, TagUnknown, "sport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "4958"},
	Token{TokenLiteral, TagUnknown, "dst"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenIPv4, TagUnknown, "61.229.37.85"},
	Token{TokenLiteral, TagUnknown, "dport"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenInteger, TagUnknown, "23124"},
	Token{TokenLiteral, TagUnknown, "smac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:0b:5f:b2:1d:80"},
	Token{TokenLiteral, TagUnknown, "dmac"},
	Token{TokenLiteral, TagUnknown, "="},
	Token{TokenMac, TagUnknown, "00:04:c1:8b:d8:82"},
}

func NewScanner

func NewScanner() *Scanner

func (*Scanner) Scan

func (this *Scanner) Scan(s string) (Sequence, error)

Scan returns a Sequence, or a list of tokens, for the data string supplied. Scan is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.
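
A minimal sketch of scanning a message and walking the resulting tokens (assuming the same imports as the Analyzer example above):

scanner := sequence.NewScanner()
seq, err := scanner.Scan("Jan 12 06:49:42 irc sshd[7034]: Failed password for root from 218.161.81.238 port 4228 ssh2")
if err != nil {
	log.Fatal(err)
}
for _, tok := range seq {
	// Each Token carries a lexical TokenType, a semantic TagType
	// (TagUnknown until a pattern assigns one) and the raw Value.
	fmt.Println(tok.Type, tok.Tag, tok.Value)
}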

func (*Scanner) ScanJson

func (this *Scanner) ScanJson(s string) (Sequence, error)

ScanJson returns a Sequence, or a list of tokens, for the json string supplied. ScanJson is not concurrent-safe, and the returned Sequence is only valid until the next time any Scan*() method is called. The best practice would be to create one Scanner for each goroutine.

ScanJson flattens a json string into key=value pairs, and it performs the following transformation:

  • all {, }, [, ], ", characters are removed
  • colons between keys and values are changed to "="
  • nested objects have their keys concatenated with ".", so a json string like "userIdentity": {"type": "IAMUser"} will be returned as userIdentity.type=IAMUser
  • arrays are flattened by appending an index number to the end of the key, starting with 0, so a json string like {"value":[{"open":"2014-08-16T13:00:00.000+0000"}]} will be returned as value.0.open = 2014-08-16T13:00:00.000+0000
  • skips any key that has an empty value, so json strings like "reference":"" or "filterSet": {} will not show up in the Sequence
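
For instance, a sketch of the flattening described above (assuming the same imports as earlier examples; the exact token layout of the result is an assumption based on these rules):

scanner := sequence.NewScanner()
seq, err := scanner.ScanJson(`{"userIdentity":{"type":"IAMUser"},"reference":""}`)
if err != nil {
	log.Fatal(err)
}
// Per the rules above, the nested key should come back flattened as
// userIdentity.type=IAMUser, and the empty "reference" value should be skipped.
fmt.Println(seq)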

type Sequence

type Sequence []Token

Sequence represents a list of tokens returned from the scanner, analyzer or parser.

func (Sequence) PrintTokens

func (this Sequence) PrintTokens() string

PrintTokens returns a multi-line representation of the tokens in the sequence.

func (Sequence) Signature

func (this Sequence) Signature() string

Signature returns a single-line string that represents a common pattern for this type of message, basically stripping any strings or literals from the message.

func (Sequence) String

func (this Sequence) String() string

String returns a single-line string that represents the pattern for the Sequence.

type TagType

type TagType int

TagType is the semantic representation of a token.

var (
	TagUnknown    TagType = 0
	TagMsgId      TagType // The message identifier
	TagMsgTime    TagType // The timestamp that’s part of the log message
	TagSeverity   TagType // The severity of the event, e.g., Emergency, …
	TagPriority   TagType // The priority of the event
	TagAppHost    TagType // The hostname of the host where the log message is generated
	TagAppIP      TagType // The IP address of the host where the application that generated the log message is running on.
	TagAppVendor  TagType // The type of application that generated the log message, e.g., Cisco, ISS
	TagAppName    TagType // The name of the application that generated the log message, e.g., asa, snort, sshd
	TagSrcDomain  TagType // The domain name of the initiator of the event, usually a Windows domain
	TagSrcZone    TagType // The originating zone
	TagSrcHost    TagType // The hostname of the originator of the event or connection.
	TagSrcIP      TagType // The IPv4 address of the originator of the event or connection.
	TagSrcIPNAT   TagType // The natted (network address translation) IP of the originator of the event or connection.
	TagSrcPort    TagType // The port number of the originating connection.
	TagSrcPortNAT TagType // The natted port number of the originating connection.
	TagSrcMac     TagType // The mac address of the host that originated the connection.
	TagSrcUser    TagType // The user that originated the session.
	TagSrcUid     TagType // The user id that originated the session.
	TagSrcGroup   TagType // The group that originated the session.
	TagSrcGid     TagType // The group id that originated the session.
	TagSrcEmail   TagType // The originating email address
	TagDstDomain  TagType // The domain name of the destination of the event, usually a Windows domain
	TagDstZone    TagType // The destination zone
	TagDstHost    TagType // The hostname of the destination of the event or connection.
	TagDstIP      TagType // The IPv4 address of the destination of the event or connection.
	TagDstIPNAT   TagType // The natted (network address translation) IP of the destination of the event or connection.
	TagDstPort    TagType // The destination port number of the connection.
	TagDstPortNAT TagType // The natted destination port number of the connection.
	TagDstMac     TagType // The mac address of the destination host.
	TagDstUser    TagType // The user at the destination.
	TagDstUid     TagType // The user id at the destination.
	TagDstGroup   TagType // The group at the destination.
	TagDstGid     TagType // The group id at the destination.
	TagDstEmail   TagType // The destination email address
	TagProtocol   TagType // The protocol, such as TCP, UDP, ICMP, of the connection
	TagInIface    TagType // The incoming interface
	TagOutIface   TagType // The outgoing interface
	TagPolicyID   TagType // The policy ID
	TagSessionID  TagType // The session or process ID
	TagObject     TagType // The object affected.
	TagAction     TagType // The action taken
	TagCommand    TagType // The command executed
	TagMethod     TagType // The method in which the action was taken, for example, public key or password for ssh
	TagStatus     TagType // The status of the action taken
	TagReason     TagType // The reason for the action taken or the status returned
	TagBytesRecv  TagType // The number of bytes received
	TagBytesSent  TagType // The number of bytes sent
	TagPktsRecv   TagType // The number of packets received
	TagPktsSent   TagType // The number of packets sent
	TagDuration   TagType // The duration of the session
)

func (TagType) String

func (this TagType) String() string

func (TagType) TokenType

func (this TagType) TokenType() TokenType

type Token

type Token struct {
	Type  TokenType // Type is the type of token the Value represents.
	Tag   TagType   // Tag determines which tag the Value should be.
	Value string    // Value is the extracted string from the log message.
	// contains filtered or unexported fields
}

Token is a piece of information extracted from a log message. The Scanner will do its best to determine the TokenType, which could be a timestamp, an IPv4 or IPv6 address, a URL, a MAC address, an integer or a floating point number. In addition, if the Scanner finds a token that's surrounded by %, e.g., %srcuser%, it will try to determine the correct tag type that the token represents.
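
For illustration, here is a hypothetical token as it might look after a tagged pattern has matched (the field values are made up, and the same assumed import path applies):

tok := sequence.Token{
	Type:  sequence.TokenLiteral, // lexical type as determined by the Scanner
	Tag:   sequence.TagSrcUser,   // semantic tag, e.g. as assigned via a %srcuser% placeholder
	Value: "root",                // the raw text extracted from the log message
}
fmt.Println(tok) // Token has its own String() method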

func (Token) String

func (this Token) String() string

type TokenType

type TokenType int

TokenType is the lexical representation of a token.

const (
	TokenUnknown TokenType = iota // Unknown token
	TokenLiteral                  // Token is a fixed literal
	TokenTime                     // Token is a timestamp, in the format listed in TimeFormats
	TokenIPv4                     // Token is an IPv4 address, in the form of a.b.c.d
	TokenIPv6                     // Token is an IPv6 address
	TokenInteger                  // Token is an integer number
	TokenFloat                    // Token is a floating point number
	TokenURI                      // Token is a URL, in the form of http://... or https://...
	TokenMac                      // Token is a MAC address
	TokenString                   // Token is a string that represents multiple possible values
)

func (TokenType) String

func (this TokenType) String() string

Directories

Path          Synopsis
cmd
cmd/sequence  Sequence is a high performance sequential log scanner, analyzer and parser.
