VV8 Post Processor

Note: This is a fork of the post-processor repo located in the visiblev8 repository.

Swiss-Army dumping ground of logic related to VisibleV8 (VV8) trace log processing. Originally tightly integrated into a single workflow (i.e., many assumptions/dependencies w.r.t. databases and filenames); slowly being transmogrified into a standalone, modular toolkit.

Building

To build the vv8-postprocessors, you need to have the following programs/languages installed:

  • python3 (Preferably > 3.8)
  • Rust (The project was tested using the 2021 edition)
  • Go (Any version > 1.13 should be sufficient to build vv8-postprocessors)
  • make

Once these programs are installed, you can build the post-processors using the make command. The build creates an artifacts/ folder containing all the binaries needed to run the post-processors.
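
For example, a minimal build-and-check sequence (run from the repository root):

$ make
$ ls artifacts/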

Quick Start

Assuming you have some VV8 log files (e.g., vv8-*.[0-9].log) in $PWD, you can quickly experiment with the supported modes/tools by letting the output go to stdout only (the default) and specifying a single "aggregator" (i.e., output mode) at a time. For example, to get a quick summary of features used by execution-context origin URL, you can do the following (note that this operation relies on idldata.json being in $PWD, too):

$ ./vv8-post-processor -aggs ufeatures vv8*.log

The output is a single JSON object containing all browser API features accessed globally (an array of strings under the allFeatures key) and an array of per-distinct-origin-URL feature arrays (the featureOrigins key).
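
For illustration only, the output might be shaped roughly like this (key names come from the description above; the structure of each per-origin entry and all values are assumptions):

{
  "allFeatures": ["Window.fetch", "Document.createElement"],
  "featureOrigins": [
    { "origin": "https://example.com/", "features": ["Window.fetch"] }
  ]
}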

You can combine multiple aggregation passes in a single run by specifying a +-delimited list of aggregator names as the argument to the -aggs flag when you run the post-processor. (This approach typically makes more sense in a batch-processing situation where outputs are being sent to databases.)
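
For example, to run the ufeatures and causality aggregators (both described below) in a single pass:

$ ./vv8-post-processor -aggs ufeatures+causality vv8*.log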

Input

Log file input can be read from named log files or from stdin (by specifying - as a filename). Filenames prefixed by the @ character are interpreted as MongoDB OIDs from our original MongoDB storage scheme; these require MongoDB credentials to be provided via environment variables.
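
For example, to pipe a log through stdin (the filename here is a placeholder):

$ cat vv8-example.0.log | ./vv8-post-processor -aggs ufeatures -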

Output Modes

By default, output goes to stdout (typically in some form of JSON, though each aggregator is free to use a different format).

The original workflow for which vv8-post-processor was written involved both MongoDB and PostgreSQL databases used in concert (for live collection of bulk data and for offline aggregation and analysis, respectively). Hence, most aggregators support mongo (MongoDB I/O required) and/or mongresql (both MongoDB and PostgreSQL I/O required). We do not document the particulars here, as we consider these modes to be deprecated for future development. The source code (including a SQL DDL schema file for PostgreSQL) can provide details for the stubbornly intrepid.

That said, a subsequent PostgreSQL-based workflow (via the Mfeatures aggregator; see the mega folder for schema details) has proved useful and fairly scalable, so you might want to check that out.

Other options

  • -submissionid: Specify the submission ID to which the logs are linked
  • -log-root: Manually specify a base name for a log file when streaming data from stdin (see the example below)
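
A hypothetical invocation combining both options (the argument values are placeholders):

$ cat vv8-example.0.log | ./vv8-post-processor -aggs ufeatures -submissionid 1234 -log-root vv8-example.0.log -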

What are all these aggregators?

  • call_args (broken): An aggregator that records every call being made and the associated arguments
  • poly_features/features/scripts/blobs: 4 different output modes for a single input-processing pass (the original one, actually) that extracts polymorphic and monomorphic feature sites (locations within scripts that used a given feature and how many times, with polymorphic and monomorphic instances kept separate), loaded-script hashes and metadata (i.e., URL or eval-parent hash), and the full binary dump of loaded scripts
  • create_element: emits records of each call to Document.createElement, its script context/location, and its first argument (i.e., what kind of element was being created)
  • causality/causality_graphml: 2 different output modes for a single input-processing pass that uses a bunch of heuristics to try to reconstruct script provenance (what script loaded what other script); the latter mode emits GraphML (i.e., XML)
  • ufeatures: a nice summary of features touched globally on a per-logfile basis
  • Mfeatures: the latest and probably best/richest aggregation of data into a fairly normalized entity-relationship schema of script/instance/feature/usage; requires PostgreSQL (see mega/postgres_schema.sql)
  • adblock: An aggregator that logs which URL and origin combinations are blocked by easyprivacy.txt and easylist.txt. It uses the Brave adblock engine implementation in Rust.

Note: To use the adblock post-processor you need to have the adblock binary and the easyprivacy.txt and easylist.txt files in the current working directory, or set the following environment variables to the paths of their respective locations.

  • ADBLOCK_BINARY
  • EASYLIST_FILE
  • EASYPRIVACY_FILE
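
For example (the paths are placeholders):

$ export ADBLOCK_BINARY=/path/to/adblock
$ export EASYLIST_FILE=/path/to/easylist.txt
$ export EASYPRIVACY_FILE=/path/to/easyprivacy.txt
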
  • fptp: This post-processor logs every script and whether or not it is a third party relative to:
    • The origin in which it was loaded
    • If available via the submission ID, the root domain in which it was loaded

Note: To use the fptp post-processor you need to have the entities.json file (generated as part of the build process) in the current working directory, or set the EMAP_FILE environment variable to the path of the file.
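
For example (the path is a placeholder):

$ export EMAP_FILE=/path/to/entities.json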
