videodb

package v0.0.0-...-5f82810

Published: Oct 11, 2024 License: MIT Imports: 20 Imported by: 0

README
Video DB

This is my 2nd attempt at creating the recording database.

The 1st attempt was centered on the idea that there wouldn't be many recordings, and it placed an emphasis on labeling.

Here I'm focusing first on continuous recording. We need to be efficient when there are hundreds of hours of footage. The user must be able to scan through this footage easily, and we don't want to lose any frames.

Video Archive

All the video footage is stored inside our 'fsv' format archive. Initially, we'll just be supporting our own 'rf1' video format, but we could conceivably support more mainstream formats such as mp4, if that turns out to be useful.

The primary reason for using rf1 is that it is designed to withstand a system crash, and still retain all the video data that was successfully written to disk. The second reason is efficiency. 'rf1' files are extremely simple - we're just copying the raw NALUs to disk.

Database Design

The event table is fat. The objects field is a JSON object that contains up to 5 minutes of object box movement, or 10KB of data, whichever limit comes first.

The event_summary table is lean, and used to draw the colored timeline of a camera, where the colors allow the user to quickly tell which parts of the video had interesting detections.

The event_summary table stores information in fixed-size time slots. For example, if the time slot is 5 minutes, then there would be one record per camera for every 5-minute segment of time. If there are zero events for a 5-minute segment, then we don't bother storing a record. But how do we know that 5 minutes is the ideal interval? The problem we're trying to solve here is quickly drawing a timeline below a camera that shows the points in time where interesting events occurred. Our average data usage in the event table is 12KB for 300 frames. At 10 FPS, that is 12KB for 30 seconds. If we make our event_summary segment size too large (eg 1 hour), then we'll end up with poor resolution on our timeline.

The resolution is really the constraint here. Let's work conservatively and say that the user has a 4K monitor. A single pixel on the timeline is probably not visible enough, but two pixels will be fine. So let's say we target a resolution of 2000 pixels on the timeline. That's 2000 segments. 2000 segments at 5 minutes per segment is 166 hours, or about 7 days. That seems like a decent zoomed-out view, in terms of the performance/quality tradeoff. But what happens when we try to zoom in, say to 2 days? 2 days split across 2000 pixels is 1.44 minutes per pixel. And what about 2 hours? 2 hours split across 2000 pixels is 3.6 seconds per pixel. This is a huge dynamic range, and it makes me wonder if we should just do a hierarchical (eg power-of-2 sizes) event summary table, like mip-maps.
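
To sanity-check that dynamic range, here's a throwaway sketch (not part of the package) that computes how much time one timeline pixel covers at a few zoom windows, assuming a 2000-pixel timeline:

package main

import (
	"fmt"
	"time"
)

func main() {
	const timelinePixels = 2000
	for _, window := range []time.Duration{
		7 * 24 * time.Hour, // fully zoomed out
		2 * 24 * time.Hour, // 2 days
		2 * time.Hour,      // 2 hours
	} {
		fmt.Printf("%v window -> %v per pixel\n", window, window/timelinePixels)
	}
}

// Output:
// 168h0m0s window -> 5m2.4s per pixel
// 48h0m0s window -> 1m26.4s per pixel
// 2h0m0s window -> 3.6s per pixel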

Yes, I think we should do hierarchical summaries - anything else will be splitting hairs, or hitting bad performance corner cases when zoomed far in or far out.

Event Summary Bitmaps

After considering this for some time, it occurred to me that we might as well represent the event summary as a bitmap. Imagine a bitmap that is 2048 wide and 32 high. The 2048 columns (X) are distinct time segments. The 32 rows (Y) are 32 different items that we're interested in, such as "person", "car", etc. 32 is a lot of object types, so we choose that number to be conservative. A key thing is that these rows can literally be bitmaps - i.e. we only need a single bit to say whether such an object was found during that time segment. It's pretty obvious that this data will compress well. Even with zero compression, we still have a tiny amount of data per segment. At 2048 x 32 bits, we have 8KB raw. Assuming we get a 10:1 compression ratio, that's 800 bytes for the whole bitmap. Even at a compression ratio of 5:1, we can almost fit into a single network packet.
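
As a rough sketch of the raw (uncompressed) layout - the names here are illustrative, not from the package - one row of bits per object class, one column per time segment:

// summaryBitmap is a hypothetical 2048x32 event-summary bitmap.
type summaryBitmap struct {
	// 32 object classes x 2048 time segments = 65536 bits = 8KB raw.
	// Each row covers one class; each bit within a row covers one time segment.
	rows [32][256]byte // 256 bytes = 2048 bits per row
}

// markSeen records that an object of class 'class' was seen in segment 'seg'.
func (b *summaryBitmap) markSeen(class, seg int) {
	b.rows[class][seg/8] |= 1 << (seg % 8)
}

// seen reports whether class 'class' was seen in segment 'seg'.
func (b *summaryBitmap) seen(class, seg int) bool {
	return b.rows[class][seg/8]&(1<<(seg%8)) != 0
}

Because most time segments contain no detections, each row is overwhelmingly zeros, which is why fairly modest compression ratios like 5:1 or 10:1 seem plausible.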

I love this pure bitmap representation, because we no longer have to fuss over wasted space in our SQLite DB, or efficiency, or worst-case behavior. In addition, mipmap tiles are dead simple to reason about. The only things that remain are to pick the segment duration of the lowest mip level, and the tile size. Here's a table showing some candidate numbers.

  • Segment: The duration of each time segment, at the finest granularity
  • Size: The number of segments per tile
  • Tile Duration: Duration of a full tile at the finest granularity
  • Raw Size: Raw size of bitmap, if capable of holding 32 object types
Segment | Size | Tile Duration | Raw Size | Compressed Size @ 5:1
1s      | 512  | 8.5m          | 2KB      | 409 bytes
1s      | 1024 | 17m           | 4KB      | 819 bytes
1s      | 2048 | 34.1m         | 8KB      | 1638 bytes
2s      | 512  | 17m           | 2KB      | 409 bytes
2s      | 1024 | 34.1m         | 4KB      | 819 bytes
2s      | 2048 | 68.3m         | 8KB      | 1638 bytes

At first it seems tempting to make the tile size large (1s, 2048 wide), but the problem with that is that we have to wait 34 minutes before our latest tile is complete. If the user wants an event summary during that time, somebody (either server or client) will need to synthesize it from the dense event data, which can be on the order of megabytes for half an hour.

But hang on! We're expected to be running 24/7, so it should be easy for us to maintain an up-to-date tile by directly feeding our in-memory events to the tiler, in real time. There's no need to round-trip this data through the database. So there's no longer any liveness concern, or performance overhead to build an up-to-date tile. The only thing that remains is to decide on the finest granularity. I think we might as well do tiles that are 1024 pixels wide, at 1 second per pixel granularity, because those are such nice round numbers.
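
With 1024 pixels per tile and 1 second per pixel at level 0, mapping a wall-clock time to a tile index and a pixel column is plain integer arithmetic, consistent with the tile keying noted on EventTile.Start below (unix seconds / (1024 * 2^level)). A minimal sketch (the function name is illustrative, not part of the package):

// tileIdxAndPixel maps a time to the tile index and pixel column at the
// given mip level. Level 0 is 1 second per pixel; each level doubles that.
func tileIdxAndPixel(t time.Time, level uint32) (tileIdx, pixel uint32) {
	secondsPerPixel := int64(1) << level
	pixelIdx := t.Unix() / secondsPerPixel // absolute pixel index at this level
	tileIdx = uint32(pixelIdx / 1024)      // 1024 = TileWidth
	pixel = uint32(pixelIdx % 1024)
	return tileIdx, pixel
}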

Building Higher Level Tiles

The following diagram represents tiles being built. The dashes are moments that have already passed, and the x's are in the future. The blank portions are regions of time when the system was switched off.

Level 4 |---------------------------------------------------------xxxxxxxxxxxxxxxxxxxxxx|
Level 3 |---------------------------------------|-----------------xxxxxxxxxxxxxxxxxxxxxx|
Level 2 |-------------------|-------------------|-----------------xx|
Level 1 |---------|---------|---------|---------|         |-------xx|
Level 0 |----|----|----|----|----|----|----|----|--- |    |----|--xx|

Before going on, let's consider how many levels we need in practice. Our lowest level (level 0) has one pixel per second. Let's use a screen with a resolution of 2000 horizontal pixels as our representative use case. This would be on the desktop. A phone screen would be much narrower (eg 300 horizontal CSS pixels).

Let's consider various tile levels and the time spans that they would represent on a 2000 pixel wide screen. Each level is 2x the number of seconds/pixel of the previous level.

Level | Pixels | Seconds/Pixel | Duration
0     | 2000   | 1             | 33.3m
1     | 2000   | 2             | 1.1h
2     | 2000   | 4             | 2.2h
3     | 2000   | 8             | 4.4h
4     | 2000   | 16            | 8.8h
5     | 2000   | 32            | 17.7h
6     | 2000   | 64            | 1.5d
7     | 2000   | 128           | 2.9d
8     | 2000   | 256           | 5.9d
9     | 2000   | 512           | 11.8d
10    | 2000   | 1024          | 23.7d
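
Picking the level for a given view is a small calculation: find the lowest level whose seconds-per-pixel lets the whole window fit across the available timeline pixels. A sketch (names are illustrative, not part of the package):

// levelForView returns the lowest mip level at which 'window' fits across
// 'screenPixels' timeline pixels, given 1 second per pixel at level 0.
func levelForView(window time.Duration, screenPixels int) uint32 {
	needSecondsPerPixel := int64(window.Seconds()) / int64(screenPixels)
	level := uint32(0)
	for int64(1)<<level < needSecondsPerPixel {
		level++
	}
	return level
}

// levelForView(2*time.Hour, 2000)  -> 2 (4 s/pixel, 2.2h span)
// levelForView(48*time.Hour, 2000) -> 7 (128 s/pixel, 2.9d span)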

The bottom line is that we need to support many levels. On a phone, we're likely to need even more levels, because of the reduced resolution.

It's clear that we need a mechanism which continuously builds higher level tiles in the background, so that they're ready for consumption at any time. If our dataset were generated once-off, then this would be trivial. However, our job is slightly more complicated, because our tiles are constantly being generated.

I'm thinking right now that if we just do this one thing, everything will be OK: if a tile still has any portion of itself in the future, then we don't write it to disk. Whenever a caller requests tiles that aren't in the DB yet, we build them on the fly, from lower level tiles. I can't tell for sure, but it looks to me like the time to build up tiles like this should be O(max_level). Our levels won't get higher than about 10, and merging/compacting bitmaps should be very fast. Let's see!
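
To make the "build on the fly" step concrete, here's a sketch of merging two adjacent level-N rows into one level-N+1 row, assuming the raw row-per-class layout sketched earlier (names are illustrative; the real tiles are stored compressed):

// mergeChildRows builds one parent row (level N+1) from the two adjacent
// child rows (level N) covering the same time span. Each parent bit is the
// OR of the two child bits it spans, so building a level-L tile from level 0
// takes O(L) merge passes.
func mergeChildRows(left, right []byte) []byte {
	parent := make([]byte, len(left)) // parent row has the same pixel width as one child
	for dst := 0; dst < len(parent)*8; dst++ {
		src := dst * 2 // first of the two child bits covered by this parent bit
		if childBit(left, right, src) || childBit(left, right, src+1) {
			parent[dst/8] |= 1 << (dst % 8)
		}
	}
	return parent
}

// childBit reads bit i from the concatenation of left and right.
func childBit(left, right []byte, i int) bool {
	row := left
	if i >= len(left)*8 {
		row, i = right, i-len(left)*8
	}
	return row[i/8]&(1<<(i%8)) != 0
}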

Documentation

Index

Constants

View Source
const TileWidth = 1024

Number of pixels in one tile. At the highest resolution (level = 0), each pixel is 1 second.

Variables

View Source
var ErrInvalidTimeRange = errors.New("invalid time range in tileBuilder.updateObject")
View Source
var ErrNoTime = errors.New("no time data in TrackedObject for tileBuilder.updateObject")
View Source
var ErrNoVideoFound = errors.New("No video found")
View Source
var ErrTooManyClasses = errors.New("too many classes")

Functions

func DecompressTileToRawLines

func DecompressTileToRawLines(blob []byte) [][]byte

This is for debug/analysis, specifically to create an extract of raw lines so that we can test our bitmap compression codecs. Returns a list of 128-byte bitmaps.

func GetClassIDsInTileBlob

func GetClassIDsInTileBlob(tile []byte) ([]uint32, error)

Decode a tile enough to be able to find the list of class IDs inside it, and return that list of IDs.

func Migrations

func Migrations(log logs.Log) []migration.Migrator

func VideoStreamNameForCamera

func VideoStreamNameForCamera(cameraLongLivedName string, resolution defs.Resolution) string

Generate the name of the video stream for the given camera and resolution.

Types

type BaseModel

type BaseModel struct {
	ID int64 `gorm:"primaryKey" json:"id"`
}

BaseModel is our base class for a GORM model. The default GORM Model uses int, but we prefer int64

type Event

type Event struct {
	BaseModel
	Time       dbh.IntTime                         `json:"time"`       // Start of event
	Duration   int32                               `json:"duration"`   // Duration of event in milliseconds
	Camera     uint32                              `json:"camera"`     // LongLived camera name (via lookup in 'strings' table)
	Detections *dbh.JSONField[EventDetectionsJSON] `json:"detections"` // Objects detected in the event
}

An event is one or more frames of motion or object detection. For efficiency's sake, we limit events in the database to a maximum size and duration. SYNC-VIDEODB-EVENT

type EventDetectionsJSON

type EventDetectionsJSON struct {
	Resolution [2]int        `json:"resolution"` // Resolution of the camera on which the detection was run.
	Objects    []*ObjectJSON `json:"objects"`    // Objects detected in the event
}

SYNC-VIDEODB-EVENTDETECTIONS

type EventTile

type EventTile struct {
	Camera uint32 `gorm:"primaryKey;autoIncrement:false" json:"camera"` // LongLived camera name (via lookup in 'strings' table)
	Level  uint32 `gorm:"primaryKey;autoIncrement:false" json:"level"`  // 0 = lowest level
	Start  uint32 `gorm:"primaryKey;autoIncrement:false" json:"start"`  // Start time of tile (unix seconds / (1024 * 2^level))...... Rename to tileIdx?
	Tile   []byte `json:"tile"`                                         // Compressed tile data
}

SYNC-EVENT-TILE-JSON

type ObjectJSON

type ObjectJSON struct {
	ID            uint32               `json:"id"`            // Can be used to track objects across separate Event records
	Class         uint32               `json:"class"`         // eg "person", "car" (via lookup in 'strings' table)
	Positions     []ObjectPositionJSON `json:"positions"`     // Object positions throughout event
	NumDetections int32                `json:"numDetections"` // Total number of detections witnessed for this object, before filtering out irrelevant box movements (eg box jiggling around by a few pixels)
}

An object detected by the camera. SYNC-VIDEODB-OBJECT

type ObjectPositionJSON

type ObjectPositionJSON struct {
	Box        [4]int16 `json:"box"`        // [X1,Y1,X2,Y2]
	Time       int32    `json:"time"`       // Time in milliseconds relative to start of event.
	Confidence float32  `json:"confidence"` // NN confidence of detection (0..1)
}

Position of an object in a frame. SYNC-VIDEODB-OBJECTPOSITION

type TileRequest

type TileRequest struct {
	Level    uint32
	StartIdx uint32 // inclusive
	EndIdx   uint32 // exclusive
	Indices  map[uint32]bool
}

TileRequest is a request to read tiles. Do ONE of the following:

  1. Populate StartIdx and EndIdx
  2. Populate Indices
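
For illustration, here are the two ways of filling in a request (the camera name and tile indices are made up, and vdb is assumed to be a *VideoDB obtained from NewVideoDB):

// Contiguous range of level-0 tiles: indices [100, 104).
rangeReq := videodb.TileRequest{Level: 0, StartIdx: 100, EndIdx: 104}
tiles, err := vdb.ReadEventTiles("driveway", rangeReq)

// Or: specific, non-contiguous tile indices at level 3.
sparseReq := videodb.TileRequest{
	Level:   3,
	Indices: map[uint32]bool{7: true, 9: true, 42: true},
}
moreTiles, err := vdb.ReadEventTiles("driveway", sparseReq)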

type TrackedBox

type TrackedBox struct {
	Time       time.Time
	Box        nn.Rect
	Confidence float32
}

type TrackedObject

type TrackedObject struct {
	ID               uint32
	Camera           uint32
	CameraResolution [2]int
	Class            uint32
	Boxes            []TrackedBox
	LastSeen         time.Time // In case you're not updating Boxes, or Boxes is empty. Maybe you're not updating Boxes because the object hasn't moved.
	NumDetections    int32     // Naively equal to len(Boxes), but can be different if some detections were so similar to the previous that we filtered them out. NumDetections >= len(Boxes)
}

func (*TrackedObject) TimeBounds

func (t *TrackedObject) TimeBounds() (time.Time, time.Time)

Returns the min/max observed time of this object. We can have any mix of Boxes and LastSeen, but if none of them are set, then we return time.Time{} for both.

type VideoDB

type VideoDB struct {
	// Root directory
	// root/fsv/...         Video file archive
	// root/videos.sqlite   Our SQLite DB
	Root string

	Archive *fsv.Archive
	// contains filtered or unexported fields
}

VideoDB manages recordings

func NewVideoDB

func NewVideoDB(logger logs.Log, root string) (*VideoDB, error)

Open or create a video DB

func (*VideoDB) Close

func (v *VideoDB) Close()

func (*VideoDB) IDToString

func (v *VideoDB) IDToString(id uint32) (string, error)

func (*VideoDB) IDsToString

func (v *VideoDB) IDsToString(ids []uint32) ([]string, error)

func (*VideoDB) MaxTileLevel

func (v *VideoDB) MaxTileLevel() int

func (*VideoDB) ObjectDetected

func (v *VideoDB) ObjectDetected(camera string, cameraResolution [2]int, id uint32, box nn.Rect, confidence float32, class string, lastSeen time.Time)

This is the way our users inform us of a new object detection. We'll get one of these calls on every frame where an object is detected. id must be unique enough that by the time it wraps around, the previous object is no longer in frame. Also, id must be unique across cameras. This is currently the way our 'monitor' package works, but I'm just codifying it here.
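
A hedged usage sketch (the camera name, object ID, confidence, and class are made up; the nn.Rect value is left as a zero value because its fields belong to the nn package, and vdb is assumed to be a *VideoDB):

// Called once per frame in which the object is visible.
var box nn.Rect // in real code this comes from the detector
vdb.ObjectDetected(
	"driveway",         // camera long-lived name
	[2]int{1920, 1080}, // resolution the detection ran at
	uint32(17),         // object ID, unique across cameras until it wraps
	box,
	0.87,     // NN confidence
	"person", // class (resolved to an ID via the strings table)
	time.Now(),
)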

func (*VideoDB) ReadEventTiles

func (v *VideoDB) ReadEventTiles(camera string, request TileRequest) ([]*EventTile, error)

Fetch event tiles in the range [startIdx, endIdx)

func (*VideoDB) ReadEvents

func (v *VideoDB) ReadEvents(camera string, startTime, endTime time.Time) ([]*Event, error)

func (*VideoDB) SetMaxArchiveSize

func (v *VideoDB) SetMaxArchiveSize(maxSize int64)

The archive won't delete any files until this is called, because it doesn't know yet what the size limit is.

func (*VideoDB) StringToID

func (v *VideoDB) StringToID(s string) (uint32, error)

Get a database-wide unique ID for the given string. At some point we should implement a cleanup method that gets rid of strings that are no longer used. It is beneficial to keep the IDs small, because smaller numbers produce smaller DB records due to varint encoding.

func (*VideoDB) StringsToID

func (v *VideoDB) StringsToID(s []string) ([]uint32, error)

Resolve multiple strings to IDs

func (*VideoDB) VideoStartTimeForCamera

func (v *VideoDB) VideoStartTimeForCamera(camera string) (time.Time, error)

Find the timestamp of the oldest recorded frame for the given camera. Returns *ErrNoVideoFound* if no video footage can be found for the camera.
