Performance Optimization

Purpose and Scope

This document covers the resource management and optimization strategies that Deep-Live-Cam uses to make efficient use of memory (RAM and GPU VRAM), CPU threads, and GPU compute. It describes hardware detection, resource limiting, thread pool configuration, and graceful degradation when resources are constrained.

Deep-Live-Cam is specifically optimized for high-performance live mode and batch processing, featuring platform-specific enhancements for NVIDIA GPUs (CUDA), Apple Silicon (CoreML), and other hardware via ONNX Runtime.

Memory Management Architecture

Deep-Live-Cam implements platform-specific memory limiting to prevent system instability when processing large videos or high-resolution images.

RAM Limiting Strategy

The system provides configurable RAM limits through the --max-memory parameter, with platform-specific defaults and enforcement mechanisms.

Memory Limiting Flow: Configuration to platform-specific enforcement mechanisms

Sources: modules/core.py121-156

The limit_resources() function in modules/core.py139-156 implements three distinct strategies:

Platform	Memory Calculation	API Used	Purpose
macOS	`max_memory * 1024^6`	N/A (advisory)	Different unit scaling for Darwin
Windows	`max_memory * 1024^3`	`kernel32.SetProcessWorkingSetSize()`	Sets the process working-set (resident memory) bounds
Linux	`max_memory * 1024^3`	`resource.setrlimit(RLIMIT_DATA)`	Data segment size limit
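
The table above can be read as the following minimal sketch. It is not the project's code: the function names `memory_limit_bytes` and `apply_memory_limit` are hypothetical, the scale factors are copied from the table, and the macOS branch applies nothing because the table lists no enforcement API for it.

```python
import platform

GIB = 1024 ** 3


def memory_limit_bytes(max_memory: int, system: str) -> int:
    """Convert a --max-memory value into the byte count listed in the table."""
    if system == "darwin":
        # The table lists a different scale factor for macOS.
        return max_memory * 1024 ** 6
    return max_memory * GIB


def apply_memory_limit(max_memory: int) -> None:
    """Apply the limit using the API the table lists for the current platform."""
    system = platform.system().lower()
    memory = memory_limit_bytes(max_memory, system)
    if system == "windows":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # -1 is the pseudo-handle for the current process.
        kernel32.SetProcessWorkingSetSize(
            -1, ctypes.c_size_t(memory), ctypes.c_size_t(memory)
        )
    elif system == "linux":
        import resource

        # Soft and hard limits are set to the same value.
        resource.setrlimit(resource.RLIMIT_DATA, (memory, memory))
    # macOS: the table lists no enforcement API, so nothing is applied here.
```

Keeping the byte calculation separate from the platform call makes the arithmetic easy to check without changing any process limits.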

TensorFlow GPU Memory Growth

To prevent TensorFlow from allocating all available GPU memory upfront, the system enables memory growth for every detected GPU.

This configuration appears in modules/core.py141-143 and ensures TensorFlow only allocates GPU memory as needed, allowing ONNX Runtime and PyTorch to share GPU resources effectively.
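
A minimal sketch of this setting, using TensorFlow's public `tf.config.experimental` API, might look like the following. The wrapper name `enable_gpu_memory_growth`, its return value, and the guarded import are illustrative assumptions, not the project's code.

```python
def enable_gpu_memory_growth() -> int:
    """Turn on memory growth for each visible GPU; return how many were configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow is not installed, so there is nothing to configure.

    configured = 0
    for gpu in tf.config.experimental.list_physical_devices("GPU"):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError:
            # Memory growth has to be set before the GPU is initialized.
            pass
    return configured
```

The call needs to run before any TensorFlow operation touches the GPU, which is why it belongs in early resource setup.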

Sources: modules/core.py139-156

Thread Pool and Synchronization

Execution Thread Configuration

The system uses configurable thread pools for parallel frame processing, with intelligent defaults based on the execution provider. The suggest_execution_threads() function in modules/core.py131-136 returns provider-specific thread counts:

DirectML: 1 thread (DirectML manages its own threading)
ROCM: 1 thread (ROCMExecutionProvider is single-threaded)
Default: 8 threads (optimal for CUDA, CPU, CoreML)
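
The defaults listed above amount to a small lookup. The sketch below assumes the provider list is passed in as an argument (the real function may read it from global state) and uses the ONNX Runtime provider names `DmlExecutionProvider` and `ROCMExecutionProvider`.

```python
from typing import Sequence

# Providers for which the list above gives a single execution thread.
SINGLE_THREAD_PROVIDERS = ("DmlExecutionProvider", "ROCMExecutionProvider")


def suggest_execution_threads(execution_providers: Sequence[str]) -> int:
    """Return the default thread count for the selected execution providers."""
    if any(name in execution_providers for name in SINGLE_THREAD_PROVIDERS):
        return 1
    return 8
```

For example, `suggest_execution_threads(["CUDAExecutionProvider"])` falls through to the default of 8.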

Sources: modules/core.py131-136

Model Initialization Thread Safety

Frame processors use threading.Lock to ensure thread-safe singleton initialization of AI models. This prevents multiple threads from attempting to load heavy ONNX models into VRAM simultaneously.
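
The pattern can be sketched as a lock-guarded lazy singleton. The names `_MODEL`, `_LOCK`, and `get_model`, and the `load` callback, are hypothetical stand-ins for the frame processor's own globals and model loader.

```python
import threading
from typing import Any, Callable, Optional

_MODEL: Optional[Any] = None
_LOCK = threading.Lock()


def get_model(load: Callable[[], Any]) -> Any:
    """Return the shared model, loading it at most once across all threads."""
    global _MODEL
    with _LOCK:
        # Only the first thread to acquire the lock pays the loading cost.
        if _MODEL is None:
            _MODEL = load()
    return _MODEL
```

Every worker thread calls `get_model`; threads that arrive while the model is loading block on the lock and then reuse the loaded instance.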

Sources: modules/processors/frame/face_swapper.py24-25 modules/processors/frame/face_swapper.py84-88

OpenMP Thread Limiting

For GPU-accelerated execution, the system limits OpenMP threads to prevent CPU over-subscription. This setting in modules/core.py4-5 reportedly doubles CUDA performance by keeping CPU threads from competing with GPU operations.
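
A hedged sketch of the idea follows. It assumes the limit is applied through the standard `OMP_NUM_THREADS` environment variable and is triggered by an `--execution-provider` command-line flag; the helper name `limit_openmp_threads` and the exact trigger condition are assumptions, not confirmed details of modules/core.py.

```python
import os
import sys
from typing import MutableMapping, Sequence


def limit_openmp_threads(argv: Sequence[str], env: MutableMapping[str, str]) -> bool:
    """Set OMP_NUM_THREADS=1 when an execution provider is requested on the CLI."""
    if any(arg.startswith("--execution-provider") for arg in argv):
        env["OMP_NUM_THREADS"] = "1"
        return True
    return False


# Intended call site: the very top of the entry module, before importing
# libraries that read OMP_NUM_THREADS when they load.
# limit_openmp_threads(sys.argv, os.environ)
```

The ordering matters: OpenMP-backed libraries typically read the variable once at import time, so setting it later has no effect.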

Sources: modules/core.py1-5

Apple Silicon (M1-M5) Optimizations

Deep-Live-Cam includes specialized optimizations for macOS ARM64 architecture to maximize Neural Engine (ANE) utilization.

CoreML Model Graph Rewriting

The optimize_for_coreml function in modules/onnx_optimize.py40-94 performs several graph-level transformations to prevent CPU↔ANE round-trips:

Pad(reflect) Decomposition: Rewrites Pad(mode=reflect) (unsupported by CoreML) into Slice and Concat ops modules/onnx_optimize.py12-16
Shape/Gather Constant Folding: Replaces dynamic shape chains with constants when input dimensions are known modules/onnx_optimize.py6-10
Split → Slice Decomposition: Converts Split ops to Slice ops to maintain partition boundaries modules/onnx_optimize.py18-21
Scalar Gather Widening: Widens rank-0 indices to rank-1 to satisfy CoreML EP requirements modules/onnx_optimize.py22-26
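
To illustrate why the first rewrite is valid, the sketch below reproduces reflect padding on a single axis using only slicing, reversal, and concatenation. It works on plain Python lists rather than ONNX graph nodes, and the function name `reflect_pad_1d` is hypothetical.

```python
from typing import List


def reflect_pad_1d(values: List[float], left: int, right: int) -> List[float]:
    """Reflect-pad one axis using only slices and a concatenation."""
    n = len(values)
    if left >= n or right >= n:
        raise ValueError("reflect padding must be smaller than the axis length")
    # Take the interior neighbours of each edge (the edge itself is excluded),
    # then reverse them so they mirror outward.
    left_part = values[1:left + 1][::-1]
    right_part = values[n - 1 - right:n - 1][::-1]
    return left_part + values + right_part
```

Padding `[1, 2, 3, 4]` by two elements on each side yields `[3, 2, 1, 2, 3, 4, 3, 2]`, the same result a reflect-mode pad produces.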

Live Mode Enhancements

For real-time webcam performance on Mac, the face_swapper module implements:

Detection Caching: Caches face detections to avoid running the heavy detection model every frame modules/processors/frame/face_swapper.py35
Adaptive Quality: Adjusts processing parameters based on system load modules/processors/frame/face_swapper.py39
Frame Cache: Uses a deque to manage frame reuse and reduce allocation overhead modules/processors/frame/face_swapper.py34
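
Detection caching can be sketched as re-running the detector only every few frames and reusing the last result in between. The class name `DetectionCache`, the `detect` callback, and the fixed `interval` policy are assumptions; the actual refresh policy in face_swapper.py may differ.

```python
from typing import Any, Callable, List


class DetectionCache:
    """Re-run the detector every `interval` frames and reuse the result otherwise."""

    def __init__(self, detect: Callable[[Any], List[Any]], interval: int = 3) -> None:
        self._detect = detect
        self._interval = max(1, interval)
        self._frame_index = 0
        self._faces: List[Any] = []

    def get_faces(self, frame: Any) -> List[Any]:
        # Refresh on the first frame and then once per interval.
        if self._frame_index % self._interval == 0:
            self._faces = self._detect(frame)
        self._frame_index += 1
        return self._faces
```

The trade-off is latency against staleness: a larger interval saves more detector calls but lets the cached face positions drift further from the live frame.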

Sources: modules/onnx_optimize.py1-94 modules/processors/frame/face_swapper.py32-40

GPU Acceleration Strategies

CUDA and Tensor Core Optimization

The system prefers FP16 (Half Precision) models on modern NVIDIA GPUs (Turing architecture and newer) to reduce memory bandwidth usage and increase inference speed modules/processors/frame/face_swapper.py89-91

Additionally, the system initializes CUDA graph sessions for compatible models to minimize CPU-to-GPU launch overhead modules/processors/frame/face_swapper.py135-141
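
The two decisions above can be sketched as small helpers. The sketch assumes Turing is identified by CUDA compute capability 7.5 and that CUDA graphs are requested through the ONNX Runtime CUDA provider option `enable_cuda_graph`; the helper names `prefers_fp16` and `build_providers` are hypothetical.

```python
from typing import Any, Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def prefers_fp16(compute_capability: Tuple[int, int]) -> bool:
    """Return True for Turing (compute capability 7.5) and newer GPUs."""
    return compute_capability >= (7, 5)


def build_providers(use_cuda_graph: bool) -> List[Provider]:
    """Build a provider list suitable for onnxruntime.InferenceSession."""
    cuda_options: Dict[str, Any] = {}
    if use_cuda_graph:
        # Asks the CUDA execution provider to capture and replay a CUDA graph.
        cuda_options["enable_cuda_graph"] = "1"
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]
```

The resulting list would be passed as the `providers` argument when creating an `onnxruntime.InferenceSession`.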

Sources: modules/processors/frame/face_swapper.py89-141

OpenCV CUDA Integration

The modules.gpu_processing module provides drop-in replacements for standard OpenCV functions like GaussianBlur, resize, and addWeighted modules/gpu_processing.py12-17

These functions utilize cv2.cuda.GpuMat to perform image processing on the GPU. However, this is disabled by default (OPENCV_CUDA_PROCESSING=0) because the overhead of uploading and downloading small frames (e.g., webcam resolution) often exceeds the compute savings modules/gpu_processing.py31-34
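
A minimal sketch of the conditional path follows. It assumes the flag is enabled by the literal value `1` and falls back to the CPU when no CUDA device is visible; the function names are hypothetical, and the real module may parse the variable or choose the path differently.

```python
import os
from typing import Mapping, Tuple


def cuda_processing_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when OPENCV_CUDA_PROCESSING is explicitly set to 1."""
    return env.get("OPENCV_CUDA_PROCESSING", "0") == "1"


def resize(image, size: Tuple[int, int], env: Mapping[str, str] = os.environ):
    """Resize on the GPU when enabled and available, otherwise on the CPU."""
    import cv2  # imported lazily so the flag check works without OpenCV

    if cuda_processing_enabled(env) and cv2.cuda.getCudaEnabledDeviceCount() > 0:
        gpu_image = cv2.cuda_GpuMat()
        gpu_image.upload(image)  # host to device copy
        return cv2.cuda.resize(gpu_image, size).download()  # device to host copy
    return cv2.resize(image, size)
```

The upload and download calls are the fixed cost the text refers to: for a small frame they can take longer than the resize itself.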

GPU Processing: Conditional upload/download logic

Sources: modules/gpu_processing.py31-47 modules/gpu_processing.py86-107

Face Swapping Pipeline Optimization

Fast Paste-Back

The system uses a highly optimized "paste-back" mechanism. Instead of eroding and blurring the warped mask in output coordinates (which is O(N²) relative to face size), it pre-calculates a "soft alpha" mask in aligned-face space modules/processors/frame/face_swapper.py165-174

Optimization	Logic	Benefit
Alpha Caching	Caches feathered alpha template in `_paste_cache`	Reduces per-frame mask generation cost modules/processors/frame/face_swapper.py159-162
Affine Transform	Warps the soft mask per-frame	Feather radius scales naturally with transform modules/processors/frame/face_swapper.py172-173
In-place Writing	`_fast_paste_back` writes directly to target frame	Eliminates unnecessary frame copies benchmark_pipeline.py136-143
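
The alpha caching row can be illustrated with a simplified sketch. It builds a linearly feathered square mask in aligned-face space and stores it in a dictionary named `_paste_cache` after the table; the key layout, the linear ramp, and the use of nested lists instead of image arrays are all assumptions made to keep the example dependency-free.

```python
from typing import Dict, List, Tuple

Alpha = List[List[float]]

# Cache keyed by (size, feather); the key layout is an assumption.
_paste_cache: Dict[Tuple[int, int], Alpha] = {}


def feathered_alpha(size: int, feather: int) -> Alpha:
    """Return a square soft mask: 0.0 at the border, ramping to 1.0 inside."""
    feather = max(1, feather)
    key = (size, feather)
    cached = _paste_cache.get(key)
    if cached is not None:
        return cached  # reuse the template instead of rebuilding it per frame

    alpha: Alpha = []
    for y in range(size):
        row = []
        for x in range(size):
            # Distance in pixels to the nearest edge of the aligned face crop.
            border = min(x, y, size - 1 - x, size - 1 - y)
            row.append(min(1.0, border / feather))
        alpha.append(row)
    _paste_cache[key] = alpha
    return alpha
```

Because the template lives in aligned-face space, it depends only on the crop size and feather width, so one cached copy can serve every frame before being warped by that frame's affine transform.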

Sources: modules/processors/frame/face_swapper.py158-174 benchmark_pipeline.py136-143

Benchmarking and Bottleneck Analysis

The benchmark_pipeline.py utility allows developers to measure the latency of every stage in the 1080p pipeline benchmark_pipeline.py1-5
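
Per-stage latency measurement of this kind can be sketched with a small timing helper. The class name `StageTimer` and the stage names are hypothetical and do not describe the internals of benchmark_pipeline.py.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Collect wall-clock latency samples, in milliseconds, per named stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.samples.setdefault(name, []).append(elapsed_ms)

    def mean_ms(self, name: str) -> float:
        values = self.samples[name]
        return sum(values) / len(values)
```

Wrapping each stage in `with timer.stage("detect"):` and comparing the means shows which stage dominates the frame time.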

Sources: benchmark_pipeline.py107-151

Resource Lifecycle Management

GPU Memory Cleanup

The release_resources() function in modules/core.py158-160 is called after processing stages to release cached VRAM back to the OS, which is critical for preventing fragmentation in long-running batch jobs.
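
A hedged sketch of such a cleanup step follows. It assumes the cached VRAM is held by PyTorch's CUDA caching allocator and that cleanup only applies when the CUDA execution provider is selected; the provider list is passed as an argument here for illustration.

```python
from typing import Sequence


def release_resources(execution_providers: Sequence[str]) -> bool:
    """Release cached GPU memory; return True if a cache was actually cleared."""
    if "CUDAExecutionProvider" not in execution_providers:
        return False
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, so there is no cache to clear.
    if not torch.cuda.is_available():
        return False
    # Hands unused cached blocks back to the driver; live tensors are untouched.
    torch.cuda.empty_cache()
    return True
```

Calling this between processing stages keeps a long batch job from holding on to memory it no longer uses.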

Sources: modules/core.py158-160 modules/core.py194