AI-based video surveillance system


Background

I was a founding engineer at a startup that is still in stealth. We were building surveillance systems that used AI (license plate recognition and facial recognition) to reduce crime. Here, I describe the system I built, which could connect to almost any off-the-shelf surveillance camera, run machine vision algorithms on the video, and notify users of events. The system was designed to handle massive data throughput (up to 100 Gb/s sustained) while remaining cost-effective and vendor-independent.

System Architecture

Ingestion

Traditional methods to ingest video from surveillance cameras face several challenges:

1. They typically require changes to the customer's network configuration, such as setting up port forwarding and adding firewall rules

2. They rely on RTSP, an unencrypted protocol

3. Under RTSP, the media is typically carried over RTP on plain UDP, which is prone to frame drops on unreliable networks like the internet

To address these limitations, instead of having our system connect directly to the surveillance cameras, we would run software in the same LAN as the cameras. This software would connect to the cameras with conventional RTSP (which is fine as long as the traffic stays inside the LAN) but would use SRT to stream the video to our servers.

By using SRT, instead of having to open a port on the customer's router so that our remote servers could initiate the connection, we were able to initiate the connection from our device, which allowed us to egress video out of protected networks without editing any network configuration (firewall rules, port forwarding). Furthermore, SRT allowed us to encrypt the video with negligible performance impact, and SRT's loss recovery (automatic retransmission, with optional forward error correction) made our streams more reliable on low-quality networks like the public internet. This software was written in Rust and used FFmpeg on Linux to perform the remuxing.
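As a rough sketch of that remuxing step (the camera URL, SRT endpoint, and passphrase below are hypothetical), the on-site software can shell out to FFmpeg to pull RTSP inside the LAN and push an encrypted SRT stream outward, copying codecs without re-encoding:

```rust
use std::process::{Child, Command};

/// Remux one camera's RTSP feed into an encrypted SRT stream.
/// A minimal sketch: real code would supervise the child process,
/// restart it on failure, and read these values from configuration.
fn relay_camera(rtsp_url: &str, srt_host: &str, passphrase: &str) -> std::io::Result<Child> {
    // mode=caller means *we* dial out, so no inbound firewall rule
    // or port forward is needed on the customer's network.
    let srt_url = format!("srt://{srt_host}:9000?mode=caller&passphrase={passphrase}&pbkeylen=16");
    Command::new("ffmpeg")
        .args([
            "-rtsp_transport", "tcp", // RTSP over TCP inside the LAN
            "-i", rtsp_url,           // e.g. rtsp://10.0.0.42:554/stream1
            "-c", "copy",             // remux only, no re-encoding
            "-f", "mpegts",           // SRT conventionally carries MPEG-TS
            &srt_url,
        ])
        .spawn()
}

fn main() -> std::io::Result<()> {
    // Hypothetical values, for illustration only.
    let mut child = relay_camera("rtsp://10.0.0.42:554/stream1", "ingest.example.com", "hypothetical-key")?;
    child.wait()?;
    Ok(())
}
```

Copying the codecs keeps CPU usage negligible, which matters on the low-power SoCs described in the "Hardware" section below.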

This software was also able to automatically detect the cameras present in the LAN by matching MAC address prefixes (OUIs) against known camera vendors.
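A minimal sketch of that discovery logic, assuming a Linux host (the OUI table below is illustrative, not our actual vendor list): read the kernel's ARP table and keep the entries whose MAC prefix matches a known camera vendor.

```rust
use std::fs;

// Illustrative OUI (first three bytes of the MAC) -> vendor table;
// a real deployment would ship a much larger, curated list.
const CAMERA_OUIS: &[(&str, &str)] = &[
    ("bc:ad:28", "Hikvision"),
    ("9c:8e:cd", "Amcrest"),
];

/// Scan the Linux ARP table for devices whose MAC prefix matches a
/// known surveillance-camera vendor. Note that /proc/net/arp only
/// lists hosts we have recently talked to, so a real scanner would
/// first ping-sweep the subnet to populate the table.
fn discover_cameras() -> std::io::Result<Vec<(String, String)>> {
    let arp = fs::read_to_string("/proc/net/arp")?;
    let mut found = Vec::new();
    for line in arp.lines().skip(1) { // skip the header row
        let cols: Vec<&str> = line.split_whitespace().collect();
        if cols.len() < 4 {
            continue;
        }
        let (ip, mac) = (cols[0], cols[3].to_lowercase());
        for (oui, vendor) in CAMERA_OUIS {
            if mac.starts_with(oui) {
                found.push((ip.to_string(), vendor.to_string()));
            }
        }
    }
    Ok(found)
}

fn main() -> std::io::Result<()> {
    for (ip, vendor) in discover_cameras()? {
        println!("possible {vendor} camera at {ip}");
    }
    Ok(())
}
```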

Recording

The system would store the video as 10-second .mp4 segments in an object store.
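A sketch of how that recording step might look (output paths are hypothetical), again leaning on FFmpeg, whose segment muxer can cut the incoming stream into timestamped ~10-second MP4 files that a separate task then uploads to the object store:

```rust
use std::process::{Child, Command};

/// Split an incoming stream into ~10-second MP4 segments on local
/// disk; a separate task would upload each finished file to the
/// object store and delete it. Paths here are hypothetical.
fn record_segments(input_url: &str, out_dir: &str) -> std::io::Result<Child> {
    Command::new("ffmpeg")
        .args([
            "-i", input_url,
            "-c", "copy",             // store exactly what the camera sent
            "-f", "segment",          // FFmpeg's segment muxer
            "-segment_time", "10",    // boundaries snap to keyframes,
            "-reset_timestamps", "1", //   so durations are approximate
            "-strftime", "1",
            &format!("{out_dir}/%Y%m%d-%H%M%S.mp4"), // timestamped filenames
        ])
        .spawn()
}
```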

Restreaming

When a user wanted to live-stream the video from a camera, we would use LL-HLS to stream it from our servers to the user, as this gave us a good balance between low latency and reliability. Since the traffic pattern for surveillance streaming is far less consumer-heavy (few viewers ever watch the same camera at once), we didn't have to implement a caching layer, but if thousands of users had tried to access video from the same camera, we could have set up caches between our LL-HLS server and the clients to distribute the load.
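For context, here is what a minimal LL-HLS media playlist might look like (segment names and durations are made up). Each ~4-second segment is split into sub-second parts, and CAN-BLOCK-RELOAD lets the player hold a request open until the next part is ready instead of re-polling, which is where the latency win over classic HLS comes from:

```
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:4
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0
#EXT-X-PART-INF:PART-TARGET=0.333
#EXT-X-MEDIA-SEQUENCE:100
#EXT-X-MAP:URI="init.mp4"
#EXTINF:4.0,
seg100.m4s
#EXT-X-PART:DURATION=0.333,URI="seg101.part0.m4s"
#EXT-X-PART:DURATION=0.333,URI="seg101.part1.m4s"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg101.part2.m4s"
```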

Infrastructure

Different components of this system ran on separate cloud providers so that we could independently optimize the cost of the compute-, networking-, and storage-intensive parts. All of it ran on Kubernetes clusters managed with ArgoCD and Terraform.

Front-end

The front-end was written in React with shadcn/ui components and deployed from a GCP storage bucket behind a CDN.

Back-end

The back-end was written in Rust and JavaScript (on Bun) and deployed to our K8s clusters.

Hardware

To facilitate connecting any off-the-shelf camera to our system, we made a device that plugged into the LAN containing the surveillance cameras and ran the software described in the "Ingestion" section. This way, the customer didn't need to run a server in their LAN. There was no need to design a custom board for this device, so I used off-the-shelf dev modules with Allwinner and Rockchip SoCs. The device ran a custom Linux image built with Buildroot.