videodb

package
v0.0.0-...-e05d22d Latest
Warning

This package is not in the latest version of its module.

Published: Dec 2, 2024 License: MIT Imports: 20 Imported by: 0

README

Video DB

This is my 2nd attempt at creating the recording database.

The 1st attempt was centered around the idea that there wouldn't be many recordings, and with an emphasis on labeling.

Here I'm focusing first on continuous recording. We need to be efficient when there are hundreds of hours of footage. The user must be able to scan through this footage easily, and we don't want to lose any frames.

Video Archive

All the video footage is stored inside our 'fsv' format archive. Initially, we'll just be supporting our own 'rf1' video format, but we could conceivably support more mainstream formats such as mp4, if that turns out to be useful.

The primary reason for using rf1 is that it is designed to withstand a system crash, and still retain all the video data that was successfully written to disk. The second reason is efficiency. 'rf1' files are extremely simple - we're just copying the raw NALUs to disk.

Database Design

The event table is fat. The objects field is a JSON object that contains up to 5 minutes of object box movement, or 10KB of data, whichever limit comes first.

The event_summary table is lean, and used to draw the colored timeline of a camera, where the colors allow the user to quickly tell which parts of the video had interesting detections.

The event_summary table stores information in fixed-size time slots. For example, if the time slot is 5 minutes, then there would be one record per camera, for every 5 minute segment of time. If there are zero events for a 5 minute segment, then we don't bother storing a record. But how do we know that 5 minutes is the ideal interval? The problem we're trying to solve here is quickly drawing a timeline below a camera that shows the points in time where interesting events occurred. Our average data usage in our event table is 12KB for 300 frames. At 10 FPS, that is 12KB for 30 seconds. If we make our event_summary segment size too large (eg 1 hour), then we'll end up having poor resolution on our timeline.

Resolution is the real constraint here. Let's work conservatively and say that the person has a 4K monitor. A single pixel on the timeline is probably not visible enough, but two pixels will be fine, so let's target a resolution of 2000 pixels on the timeline. That's 2000 segments. 2000 segments at 5 minutes per segment is about 167 hours, or roughly 7 days. That seems like a decent zoomed-out view, in terms of performance/quality tradeoff. But what happens when we try to zoom in, say to 2 days? 2 days split across 2000 pixels is 1.44 minutes per pixel. And what about 2 hours? 2 hours split across 2000 pixels is 3.6 seconds per pixel. This is a huge dynamic range, and it's making me wonder if we should just do a hierarchical (e.g. power-of-2 sizes) event summary table, like mip-maps.

Yes, I think we do hierarchical summaries - anything else will be splitting hairs, or reaching bad performance corner cases when zoomed in or zoomed out.

Event Summary Bitmaps

After considering this for some time, it occurred to me that we might as well represent the event summary as a bitmap. Imagine a bitmap that is 2048 wide and 32 high. The 2048 columns (X) are distinct time segments. The 32 rows (Y) are 32 different items that we're interested in, such as "person", "car", etc. 32 is a lot of object types, so we choose that to be conservative. A key thing is that these rows can literally be bitmaps, i.e. we only need a single bit to say whether such an object was found during that time segment. It's pretty obvious that this data will compress well. Even with zero compression, we still have a tiny amount of data per tile. At 2048 x 32 bits, we have 8KB raw. Assuming we get a 10:1 compression ratio, that's about 800 bytes per tile. Even at a compression ratio of 5:1, we can almost fit into a single network packet.
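To make the layout concrete, here is a minimal sketch of such a bitmap in Go. The names (`summaryBitmap`, `Set`, `Has`) are made up for illustration and are not part of this package; the bit order within a byte is also an assumption.

```go
package main

import "fmt"

const (
	bitmapWidth = 2048 // time segments (columns), as in the example above
	maxClasses  = 32   // object types (rows)
)

// summaryBitmap holds one bit per (class, segment) pair.
// Each row is bitmapWidth bits packed into bytes (hypothetical layout).
type summaryBitmap [maxClasses][bitmapWidth / 8]byte

// Set marks that an object of the given class was seen during the segment.
func (b *summaryBitmap) Set(class, segment int) {
	b[class][segment/8] |= byte(1) << uint(segment%8)
}

// Has reports whether the bit for (class, segment) is set.
func (b *summaryBitmap) Has(class, segment int) bool {
	return b[class][segment/8]&(byte(1)<<uint(segment%8)) != 0
}

func main() {
	var bm summaryBitmap
	bm.Set(0, 100) // e.g. class 0 = "person", seen in segment 100
	fmt.Println(bm.Has(0, 100), bm.Has(1, 100)) // true false
	fmt.Println(len(bm)*len(bm[0]), "bytes raw") // 8192 bytes raw
}
```

The raw size falls straight out of the array dimensions: 32 rows of 256 bytes each.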

I love this pure bitmap representation, because we no longer have to fuss over wasted space in our SQLite DB, or efficiency, or worst case. In addition, mipmap tiles are dead simple to reason about. The only thing that remains is to pick the lowest mip level, and the tile size. Here's a table showing some candidate numbers.

  • Segment: The duration of each time segment, at the finest granularity
  • Size: The number of segments per tile
  • Tile Duration: Duration of a full tile at the finest granularity
  • Raw Size: Raw size of bitmap, if capable of holding 32 object types
Segment  Size  Tile Duration  Raw Size  Compressed Size @ 5:1
1s       512   8.5m           2KB       409 bytes
1s       1024  17m            4KB       819 bytes
1s       2048  34.1m          8KB       1638 bytes
2s       512   17m            2KB       409 bytes
2s       1024  34.1m          4KB       819 bytes
2s       2048  68.3m          8KB       1638 bytes

At first it seems tempting to make the tile size large (1s, 2048 wide), but the problem is that we'd have to wait 34 minutes before our latest tile is created. If the user wants an event summary during that time, somebody (either server or client) will need to synthesize it from the dense event data, which can be on the order of megabytes for half an hour.

But hang on! We're expected to be running 24/7, so it should be easy for us to maintain an up to date tile by directly feeding our in-memory events to the tiler, in real-time. There's no need to roundtrip this stuff through the database. So then there's no longer any consideration of liveness, or performance overhead to build an up-to-date tile. The only thing that remains is to decide on the finest granularity. I think we might as well do tiles that are 1024 pixels wide, at 1 second per pixel granularity, because those are such nice round numbers.

Building Higher Level Tiles

The following diagram represents tiles being built. The dashes are moments that have already passed, and the x's are in the future. The blank portions are regions of time when the system was switched off.

Level 4 |---------------------------------------------------------xxxxxxxxxxxxxxxxxxxxxx|
Level 3 |---------------------------------------|-----------------xxxxxxxxxxxxxxxxxxxxxx|
Level 2 |-------------------|-------------------|-----------------xx|
Level 1 |---------|---------|---------|---------|         |-------xx|
Level 0 |----|----|----|----|----|----|----|----|--- |    |----|--xx|

Before going on, let's consider how many levels we need in practice. Our lowest level (level 0) has one pixel per second. Let's use a screen with resolution of 2000 horizontal pixels as our representative use case. This would be on the desktop. A phone screen would be much less (eg 300 horizontal CSS pixels).

Let's consider various tile levels and the time spans that they would represent on a 2000 pixel wide screen. Each level is 2x the number of seconds/pixel of the previous level.

Level  Pixels  Seconds/Pixel  Duration
0      2000    1              33.3m
1      2000    2              1.1h
2      2000    4              2.2h
3      2000    8              4.4h
4      2000    16             8.8h
5      2000    32             17.7h
6      2000    64             1.5d
7      2000    128            2.9d
8      2000    256            5.9d
9      2000    512            11.8d
10     2000    1024           23.7d

The bottom line is that we need to support going up to many levels. On a phone, it's likely to be even more levels, because of the reduced resolution.

It's clear that we need a mechanism which continuously builds higher level tiles in the background, so that they're ready for consumption at any time. If our dataset was generated once-off, then this would be trivial. However, our job is slightly more complicated, because our tiles are constantly being generated.

I'm thinking right now, that if we just do this one thing, everything will be OK: If a tile still has any portion of itself in the future, then we don't write it to disk. Whenever a caller requests tiles that aren't in the DB yet, we build them on the fly, from lower level tiles. I can't tell for sure, but it looks to me like the time to build up tiles like this should be O(max_level). Our levels won't get higher than about 10, and merging/compacting bitmaps should be very fast. Let's see!
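As a rough illustration of why merging should be cheap, here is a sketch of building one row of a parent tile from the matching rows of its two children. This is not the package's actual tiler: the `row` type, the bit order, and the use of nil for a missing child are all assumptions. Each parent pixel covers two child pixels and is set if either of them is set.

```go
package main

import "fmt"

const tileWidth = 1024 // pixels per tile, matching TileWidth

// row is one class's bitmap line: tileWidth bits packed into bytes.
type row [tileWidth / 8]byte

func (r *row) get(i int) bool { return r[i/8]&(byte(1)<<uint(i%8)) != 0 }
func (r *row) set(i int)      { r[i/8] |= byte(1) << uint(i%8) }

// mergeRows builds a parent-level row from two adjacent child rows.
// The left child fills the first half of the parent, the right child the
// second half. A nil child stands for a period with no data.
func mergeRows(left, right *row) row {
	var parent row
	for half, child := range []*row{left, right} {
		if child == nil {
			continue
		}
		for i := 0; i < tileWidth/2; i++ {
			if child.get(2*i) || child.get(2*i+1) {
				parent.set(half*tileWidth/2 + i)
			}
		}
	}
	return parent
}

func main() {
	var left, right row
	left.set(5)     // child pixel 5 lands in parent pixel 2
	right.set(1023) // last child pixel lands in the last parent pixel
	parent := mergeRows(&left, &right)
	fmt.Println(parent.get(2), parent.get(1023), parent.get(3)) // true true false
}
```

Building a level N tile on the fly repeats this merge once per level below it, which is where the O(max_level) intuition comes from, assuming the lower level tiles are already available.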

Documentation

Constants

const TileWidth = 1024

Number of pixels in one tile. At the highest resolution (level = 0), each pixel is 1 second.

Variables

var ErrInvalidTimeRange = errors.New("invalid time range in tileBuilder.updateObject")
var ErrNoTime = errors.New("no time data in TrackedObject for tileBuilder.updateObject")
var ErrNoVideoFound = errors.New("No video found")
var ErrTooManyClasses = errors.New("too many classes")

Functions

func DecompressTileToRawLines

func DecompressTileToRawLines(blob []byte) [][]byte

This is for debug/analysis, specifically to create an extract of raw lines so that we can test our bitmap compression codecs. Returns a list of 128-byte bitmaps (one 1024-bit line each).

func GetClassIDsInTileBlob

func GetClassIDsInTileBlob(tile []byte) ([]uint32, error)

Decode a tile enough to be able to find the list of class IDs inside it, and return that list of IDs.

func Migrations

func Migrations(log logs.Log) []migration.Migrator

func VideoStreamNameForCamera

func VideoStreamNameForCamera(cameraLongLivedName string, resolution defs.Resolution) string

Generate the name of the video stream for the given camera and resolution.

Types

type BaseModel

type BaseModel struct {
	ID int64 `gorm:"primaryKey" json:"id"`
}

BaseModel is our base class for a GORM model. The default GORM Model uses int, but we prefer int64

type Event

type Event struct {
	BaseModel
	Time       dbh.IntTime                         `json:"time"`       // Start of event
	Duration   int32                               `json:"duration"`   // Duration of event in milliseconds
	Camera     uint32                              `json:"camera"`     // LongLived camera name (via lookup in 'strings' table)
	Detections *dbh.JSONField[EventDetectionsJSON] `json:"detections"` // Objects detected in the event
}

An event is one or more frames of motion or object detection. For efficiency's sake, we limit events in the database to a max size and duration. SYNC-VIDEODB-EVENT

func (*Event) EndTime

func (e *Event) EndTime() time.Time

Return the end time of the event.

type EventDetectionsJSON

type EventDetectionsJSON struct {
	Resolution [2]int        `json:"resolution"` // Resolution of the camera on which the detection was run.
	Objects    []*ObjectJSON `json:"objects"`    // Objects detected in the event
}

SYNC-VIDEODB-EVENTDETECTIONS

type EventTile

type EventTile struct {
	Camera uint32 `gorm:"primaryKey;autoIncrement:false" json:"camera"` // LongLived camera name (via lookup in 'strings' table)
	Level  uint32 `gorm:"primaryKey;autoIncrement:false" json:"level"`  // 0 = lowest level
	Start  uint32 `gorm:"primaryKey;autoIncrement:false" json:"start"`  // Start time of tile (unix seconds / (1024 * 2^level))...... Rename to tileIdx?
	Tile   []byte `json:"tile"`                                         // Compressed tile data
}

SYNC-EVENT-TILE-JSON
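The comment on Start describes it as unix seconds divided by (1024 * 2^level). A small sketch of that mapping, and its inverse, might look like this. `tileIndex` and `tileStartUnix` are hypothetical helper names, not functions exported by this package.

```go
package main

import "fmt"

const tileWidth = 1024 // matches TileWidth

// tileIndex returns the index of the tile at `level` that contains the
// given unix time in seconds: unixSeconds / (1024 * 2^level).
func tileIndex(unixSeconds int64, level uint32) uint32 {
	return uint32(unixSeconds / (int64(tileWidth) << level))
}

// tileStartUnix is the inverse: the unix time in seconds at which the
// tile with the given index and level begins.
func tileStartUnix(idx uint32, level uint32) int64 {
	return int64(idx) * (int64(tileWidth) << level)
}

func main() {
	// Midnight UTC on 2 Dec 2024, used here only as a sample timestamp.
	const sample = int64(1733097600)
	for level := uint32(0); level <= 3; level++ {
		idx := tileIndex(sample, level)
		fmt.Printf("level %d: tile %d starts at unix %d\n", level, idx, tileStartUnix(idx, level))
	}
}
```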

type ObjectJSON

type ObjectJSON struct {
	ID            uint32               `json:"id"`            // Can be used to track objects across separate Event records
	Class         uint32               `json:"class"`         // eg "person", "car" (via lookup in 'strings' table)
	Positions     []ObjectPositionJSON `json:"positions"`     // Object positions throughout event
	NumDetections int32                `json:"numDetections"` // Total number of detections witnessed for this object, before filtering out irrelevant box movements (eg box jiggling around by a few pixels)
}

An object detected by the camera. SYNC-VIDEODB-OBJECT

type ObjectPositionJSON

type ObjectPositionJSON struct {
	Box        [4]int16 `json:"box"`        // [X1,Y1,X2,Y2]
	Time       int32    `json:"time"`       // Time in milliseconds relative to start of event.
	Confidence float32  `json:"confidence"` // NN confidence of detection (0..1)
}

Position of an object in a frame. SYNC-VIDEODB-OBJECTPOSITION

type TileRequest

type TileRequest struct {
	Level    uint32
	StartIdx uint32 // inclusive
	EndIdx   uint32 // exclusive
	Indices  map[uint32]bool
}

TileRequest is a request to read tiles. Do ONE of the following: 1. Populate StartIdx and EndIdx 2. Populate Indices
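The two ways of filling in a TileRequest could be sketched as follows. The struct is redeclared locally so that the example is self-contained; `rangeRequest` and `indexRequest` are hypothetical helpers, and treating the end time as exclusive is an assumption.

```go
package main

import "fmt"

// TileRequest mirrors the struct documented above.
type TileRequest struct {
	Level    uint32
	StartIdx uint32 // inclusive
	EndIdx   uint32 // exclusive
	Indices  map[uint32]bool
}

// rangeRequest covers every tile at `level` that overlaps the unix time
// range [startUnix, endUnix), using StartIdx and EndIdx.
func rangeRequest(level uint32, startUnix, endUnix int64) TileRequest {
	tileSeconds := int64(1024) << level
	return TileRequest{
		Level:    level,
		StartIdx: uint32(startUnix / tileSeconds),
		EndIdx:   uint32((endUnix + tileSeconds - 1) / tileSeconds), // round up
	}
}

// indexRequest asks for an explicit set of tile indices instead.
func indexRequest(level uint32, indices ...uint32) TileRequest {
	set := make(map[uint32]bool, len(indices))
	for _, idx := range indices {
		set[idx] = true
	}
	return TileRequest{Level: level, Indices: set}
}

func main() {
	r := rangeRequest(0, 1000, 3000)
	fmt.Println(r.StartIdx, r.EndIdx) // 0 3
	s := indexRequest(2, 7, 9)
	fmt.Println(len(s.Indices), s.Indices[7], s.Indices[8]) // 2 true false
}
```

Either value would then be passed to ReadEventTiles along with the camera name.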

type TrackedBox

type TrackedBox struct {
	Time       time.Time
	Box        nn.Rect
	Confidence float32
}

type TrackedObject

type TrackedObject struct {
	ID               uint32
	Camera           uint32
	CameraResolution [2]int
	Class            uint32
	Boxes            []TrackedBox
	LastSeen         time.Time // In case you're not updating Boxes, or Boxes is empty. Maybe you're not updating Boxes because the object hasn't moved.
	NumDetections    int32     // Naively equal to len(Boxes), but can be different if some detections were so similar to the previous that we filtered them out. NumDetections >= len(Boxes)
}

func (*TrackedObject) TimeBounds

func (t *TrackedObject) TimeBounds() (time.Time, time.Time)

Returns the min/max observed time of this object. We can have any mix of Boxes and LastSeen, but if none of them are set, then we return time.Time{} for both.

type VideoDB

type VideoDB struct {
	// Root directory
	// root/fsv/...         Video file archive
	// root/videos.sqlite   Our SQLite DB
	Root string

	Archive *fsv.Archive
	// contains filtered or unexported fields
}

VideoDB manages recordings

func NewVideoDB

func NewVideoDB(logger logs.Log, root string) (*VideoDB, error)

Open or create a video DB

func (*VideoDB) Close

func (v *VideoDB) Close()

func (*VideoDB) IDToString

func (v *VideoDB) IDToString(id uint32) (string, error)

func (*VideoDB) IDsToString

func (v *VideoDB) IDsToString(ids []uint32) ([]string, error)

func (*VideoDB) MaxTileLevel

func (v *VideoDB) MaxTileLevel() int

func (*VideoDB) ObjectDetected

func (v *VideoDB) ObjectDetected(camera string, cameraResolution [2]int, id uint32, detections []TrackedBox, class string)

This is the way our users inform us of a new object detection. We'll get one of these calls on every frame where an object is detected. id must be unique enough that by the time it wraps around, the previous object is no longer in frame. Also, id must be unique across cameras. This is currently the way our 'monitor' package works, but I'm just codifying it here.

func (*VideoDB) ReadEventTiles

func (v *VideoDB) ReadEventTiles(camera string, request TileRequest) ([]*EventTile, error)

Fetch event tiles in the range [startIdx, endIdx)

func (*VideoDB) ReadEvents

func (v *VideoDB) ReadEvents(camera string, startTime, endTime time.Time) ([]*Event, error)

func (*VideoDB) SetMaxArchiveSize

func (v *VideoDB) SetMaxArchiveSize(maxSize int64)

The archive won't delete any files until this is called, because it doesn't know yet what the size limit is.

func (*VideoDB) StringToID

func (v *VideoDB) StringToID(s string) (uint32, error)

Get a database-wide unique ID for the given string. At some point we should implement a cleanup method that gets rid of strings that are no longer used. It is beneficial to keep the IDs small, because smaller numbers produce smaller DB records due to varint encoding.
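To illustrate why small IDs are cheaper, the sketch below measures how many bytes Go's own unsigned varint encoding (encoding/binary) needs for a few values. This is only an illustration of the general idea: the encoding actually used for the DB records may differ in its details, and `varintLen` is a hypothetical helper.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varintLen returns how many bytes Go's unsigned varint encoding needs
// to store id (7 payload bits per byte).
func varintLen(id uint32) int {
	buf := make([]byte, binary.MaxVarintLen32)
	return binary.PutUvarint(buf, uint64(id))
}

func main() {
	for _, id := range []uint32{1, 127, 128, 16383, 16384, 4000000000} {
		fmt.Printf("id %10d -> %d byte(s)\n", id, varintLen(id))
	}
}
```

With this encoding, IDs below 128 fit in one byte and IDs below 16384 fit in two, so a camera or class ID that appears in every record is worth keeping small.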

func (*VideoDB) StringsToID

func (v *VideoDB) StringsToID(s []string) ([]uint32, error)

Resolve multiple strings to IDs

func (*VideoDB) VideoStartTimeForCamera

func (v *VideoDB) VideoStartTimeForCamera(camera string) (time.Time, error)

Find the timestamp of the oldest recorded frame for the given camera. Returns *ErrNoVideoFound* if no video footage can be found for the camera.
