This document covers the resource management and optimization strategies Deep-Live-Cam uses to make efficient use of memory (RAM and GPU VRAM), CPU threads, and GPU compute. These strategies include hardware detection, resource limiting, thread pool tuning, and graceful degradation when resources are constrained.
Deep-Live-Cam is specifically optimized for high-performance live mode and batch processing, featuring platform-specific enhancements for NVIDIA GPUs (CUDA), Apple Silicon (CoreML), and other hardware via ONNX Runtime.
Deep-Live-Cam implements platform-specific memory limiting to prevent system instability when processing large videos or high-resolution images.
The system provides configurable RAM limits through the --max-memory parameter, with platform-specific defaults and enforcement mechanisms.
Memory Limiting Flow: Configuration to platform-specific enforcement mechanisms
Sources: modules/core.py121-156
The limit_resources() function in modules/core.py139-156 implements three distinct strategies:
| Platform | Memory Calculation | API Used | Purpose |
|---|---|---|---|
| macOS | max_memory * 1024^6 | N/A (advisory) | Different unit scaling for Darwin |
| Windows | max_memory * 1024^3 | kernel32.SetProcessWorkingSetSize() | Hard process memory limit |
| Linux | max_memory * 1024^3 | resource.setrlimit(RLIMIT_DATA) | Data segment size limit |
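The table can be read as a small dispatch on the host platform. The following is a minimal sketch under stated assumptions: `max_memory` is taken to be in gigabytes, the helper names `memory_limit_bytes` and `apply_memory_limit` are illustrative rather than taken from modules/core.py, and the multipliers simply mirror the table.

```python
import platform


def memory_limit_bytes(max_memory_gb: int, system: str) -> int:
    """Convert a --max-memory value to bytes using the multipliers in the table."""
    if system.lower() == "darwin":
        return max_memory_gb * 1024 ** 6  # macOS scaling as listed in the table
    return max_memory_gb * 1024 ** 3      # Windows and Linux: gigabytes to bytes


def apply_memory_limit(max_memory_gb: int) -> None:
    """Apply the limit to the current process (hypothetical helper)."""
    system = platform.system().lower()
    limit = memory_limit_bytes(max_memory_gb, system)
    if system == "windows":
        import ctypes  # Windows-only call, so imported lazily
        ctypes.windll.kernel32.SetProcessWorkingSetSize(
            -1, ctypes.c_size_t(limit), ctypes.c_size_t(limit))
    elif system == "linux":
        import resource  # POSIX-only module
        resource.setrlimit(resource.RLIMIT_DATA, (limit, limit))
    # macOS: no enforcement call here, matching the advisory entry in the table
```

Keeping the byte calculation separate from the enforcement call makes the arithmetic testable without changing the limits of the running process.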
To prevent TensorFlow from allocating all available GPU memory upfront, the system enables memory growth for all detected GPUs.
This configuration appears in modules/core.py141-143 and ensures TensorFlow only allocates GPU memory as needed, allowing ONNX Runtime and PyTorch to share GPU resources effectively.
Sources: modules/core.py139-156
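The TensorFlow memory-growth setting described above might look like the sketch below. The wrapper name `enable_gpu_memory_growth` and the `ImportError` fallback are assumptions added for illustration; the two `tf.config.experimental` calls are the standard TensorFlow API for this setting.

```python
def enable_gpu_memory_growth() -> int:
    """Enable on-demand VRAM allocation; return the number of GPUs configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return 0  # TensorFlow not installed: nothing to configure

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before TensorFlow initializes the device
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```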
The system uses configurable thread pools for parallel frame processing, with defaults chosen according to the execution provider. The suggest_execution_threads() function in modules/core.py131-136 returns provider-specific thread counts.
Sources: modules/core.py131-136
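To illustrate how a configurable thread count feeds parallel frame processing, here is a minimal sketch using the standard library. The function `process_frames` and its parameters are hypothetical; the project's actual pool layout and default thread counts are not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def process_frames(frames: Sequence[Frame],
                   process: Callable[[Frame], Frame],
                   execution_threads: int) -> List[Frame]:
    """Run `process` over all frames on a pool of `execution_threads` workers."""
    workers = max(1, execution_threads)  # guard against a zero or negative setting
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so frame order is preserved
        return list(pool.map(process, frames))
```

For example, `process_frames(frames, swap_face, execution_threads=4)` would process up to four frames concurrently, where `swap_face` stands in for any per-frame function.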
Frame processors use threading.Lock to ensure thread-safe singleton initialization of AI models. This prevents multiple threads from attempting to load heavy ONNX models into VRAM simultaneously.
Sources: modules/processors/frame/face_swapper.py24-25 modules/processors/frame/face_swapper.py84-88
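The lock-guarded singleton pattern described above can be sketched as follows. The names `_model`, `_model_lock`, and `get_model`, and the `load` callback, are placeholders for illustration and are not copied from face_swapper.py.

```python
import threading
from typing import Any, Callable, Optional

_model: Optional[Any] = None
_model_lock = threading.Lock()


def get_model(load: Callable[[], Any]) -> Any:
    """Return the shared model, loading it at most once across all threads."""
    global _model
    with _model_lock:
        # Only the first thread to acquire the lock pays the loading cost
        if _model is None:
            _model = load()
    return _model
```

Every worker thread calls `get_model`, but the expensive `load` runs only once; later callers wait briefly on the lock and then reuse the same instance.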
For GPU-accelerated execution, the system limits OpenMP threads to prevent CPU over-subscription. This optimization in modules/core.py4-5 is credited with doubling CUDA performance, because it stops CPU worker threads from competing with the thread that feeds the GPU.
Sources: modules/core.py1-5
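A minimal sketch of the OpenMP limit is shown below. It assumes the limit is triggered by an `--execution-provider` command-line flag and applied before any numerical library is imported; the flag name and the helper `limit_omp_threads` are assumptions for illustration. `OMP_NUM_THREADS` itself is the standard OpenMP environment variable.

```python
import os
import sys
from typing import MutableMapping, Sequence


def limit_omp_threads(argv: Sequence[str],
                      environ: MutableMapping[str, str]) -> bool:
    """Pin OpenMP to one thread when an execution provider is requested."""
    if any(arg.startswith("--execution-provider") for arg in argv):
        environ["OMP_NUM_THREADS"] = "1"
        return True
    return False


if __name__ == "__main__":
    limit_omp_threads(sys.argv, os.environ)
    # Libraries that read OMP_NUM_THREADS at import time should be imported after this point
```

The ordering matters because OpenMP runtimes typically read the variable once, when the library is first loaded.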
Deep-Live-Cam includes specialized optimizations for macOS ARM64 architecture to maximize Neural Engine (ANE) utilization.
The optimize_for_coreml function in modules/onnx_optimize.py40-94 performs several graph-level transformations to prevent CPU↔ANE round-trips:
- Decomposes Pad(mode=reflect) (unsupported by CoreML) into Slice and Concat ops modules/onnx_optimize.py12-16
- Converts Split ops to Slice ops to maintain partition boundaries modules/onnx_optimize.py18-21

For real-time webcam performance on Mac, the face_swapper module implements:

- A deque to manage frame reuse and reduce allocation overhead modules/processors/frame/face_swapper.py34

Sources: modules/onnx_optimize.py1-94 modules/processors/frame/face_swapper.py32-40
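The reason a reflect-mode Pad can be replaced by Slice and Concat ops is that the padded border is just a reversed slice of the input. The sketch below shows that identity on a plain Python list; it is not the graph rewrite itself, which operates on ONNX nodes, and `reflect_pad_1d` is a hypothetical name.

```python
from typing import List


def reflect_pad_1d(values: List[float], pad: int) -> List[float]:
    """Reflect-pad without a Pad op: reversed slices concatenated around the input."""
    n = len(values)
    if not 0 <= pad < n:
        raise ValueError("pad must be non-negative and smaller than the input length")
    left = values[1:pad + 1][::-1]          # slice next to the left edge, reversed
    right = values[n - pad - 1:n - 1][::-1]  # slice next to the right edge, reversed
    return left + values + right             # concatenation replaces the Pad op


print(reflect_pad_1d([1, 2, 3, 4], 2))  # [3, 2, 1, 2, 3, 4, 3, 2]
```

The edge element itself is not repeated, which is what distinguishes reflect padding from symmetric padding.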
The system prefers FP16 (Half Precision) models on modern NVIDIA GPUs (Turing architecture and newer) to reduce memory bandwidth usage and increase inference speed modules/processors/frame/face_swapper.py89-91
Additionally, the system initializes CUDA graph sessions for compatible models to minimize CPU-to-GPU launch overhead modules/processors/frame/face_swapper.py135-141
Sources: modules/processors/frame/face_swapper.py89-141
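The precision choice and CUDA graph setup can be sketched as two small helpers. This is an illustration under assumptions: the helper names and model paths are hypothetical, the Turing threshold is expressed as CUDA compute capability 7.5, and `enable_cuda_graph` is the ONNX Runtime CUDA execution provider option for graph capture.

```python
from typing import Any, Dict, List, Tuple, Union

TURING = (7, 5)  # CUDA compute capability of the Turing generation

Provider = Union[str, Tuple[str, Dict[str, Any]]]


def pick_model_path(compute_capability: Tuple[int, int],
                    fp16_path: str, fp32_path: str) -> str:
    """Prefer the FP16 model on Turing or newer GPUs, else fall back to FP32."""
    return fp16_path if compute_capability >= TURING else fp32_path


def cuda_providers(use_cuda_graph: bool) -> List[Provider]:
    """Build an ONNX Runtime provider list, optionally enabling CUDA graphs."""
    options: Dict[str, Any] = {"enable_cuda_graph": "1"} if use_cuda_graph else {}
    return [("CUDAExecutionProvider", options), "CPUExecutionProvider"]
```

The returned list has the shape accepted by the `providers` argument of `onnxruntime.InferenceSession`. CUDA graphs require fixed input shapes, which is why they suit only compatible models.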
The modules.gpu_processing module provides drop-in replacements for standard OpenCV functions like GaussianBlur, resize, and addWeighted modules/gpu_processing.py12-17
These functions utilize cv2.cuda.GpuMat to perform image processing on the GPU. However, this is disabled by default (OPENCV_CUDA_PROCESSING=0) because the overhead of uploading and downloading small frames (e.g., webcam resolution) often exceeds the compute savings modules/gpu_processing.py31-34
GPU Processing: Conditional upload/download logic
Sources: modules/gpu_processing.py31-47 modules/gpu_processing.py86-107
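The opt-in behavior can be sketched as a dispatcher that checks the environment flag and otherwise stays on the CPU path. The helpers `gpu_processing_enabled` and `dispatch` are hypothetical, and the fallback to the CPU function when the GPU path raises is an assumption rather than documented behavior. The GPU and CPU implementations are passed in as callables so the sketch does not depend on an OpenCV build with CUDA support.

```python
import os
from typing import Any, Callable, Mapping


def gpu_processing_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """GPU image ops are off unless OPENCV_CUDA_PROCESSING is set to 1."""
    return environ.get("OPENCV_CUDA_PROCESSING", "0") == "1"


def dispatch(cpu_fn: Callable[..., Any], gpu_fn: Callable[..., Any],
             environ: Mapping[str, str] = os.environ) -> Callable[..., Any]:
    """Return a drop-in function that picks the GPU or CPU implementation per call."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if gpu_processing_enabled(environ):
            try:
                return gpu_fn(*args, **kwargs)  # upload, compute, download
            except Exception:
                pass  # assumed behavior: fall back to the CPU path on GPU failure
        return cpu_fn(*args, **kwargs)
    return wrapper
```

With this shape, a call site keeps a single function name while the transfer cost is paid only when the flag is enabled.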
The system uses a highly optimized "paste-back" mechanism. Instead of eroding and blurring the warped mask in output coordinates (which is O(N²) relative to face size), it pre-calculates a "soft alpha" mask in aligned-face space modules/processors/frame/face_swapper.py165-174
| Optimization | Logic | Benefit |
|---|---|---|
| Alpha Caching | Caches feathered alpha template in _paste_cache | Reduces per-frame mask generation cost modules/processors/frame/face_swapper.py159-162 |
| Affine Transform | Warps the soft mask per-frame | Feather radius scales naturally with transform modules/processors/frame/face_swapper.py172-173 |
| In-place Writing | _fast_paste_back writes directly to target frame | Eliminates unnecessary frame copies benchmark_pipeline.py136-143 |
Sources: modules/processors/frame/face_swapper.py158-174 benchmark_pipeline.py136-143
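The caching and in-place ideas from the table can be illustrated with a small pure-Python sketch. It omits the per-frame affine warp, uses nested lists instead of image arrays, and the names `soft_alpha` and `paste_in_place` are hypothetical; it only shows a feathered alpha template being built once and then reused to blend a patch directly into the target frame.

```python
from functools import lru_cache
from typing import List, Sequence, Tuple

Alpha = Tuple[Tuple[float, ...], ...]


@lru_cache(maxsize=8)
def soft_alpha(size: int, feather: int) -> Alpha:
    """Square alpha template: 1.0 inside, ramping linearly to 0.0 at the border."""
    def ramp(i: int) -> float:
        distance = min(i, size - 1 - i)  # distance to the nearest edge
        return min(1.0, distance / feather) if feather > 0 else 1.0
    return tuple(tuple(ramp(y) * ramp(x) for x in range(size)) for y in range(size))


def paste_in_place(frame: List[List[float]], patch: Sequence[Sequence[float]],
                   top: int, left: int, alpha: Alpha) -> None:
    """Blend `patch` into `frame` at (top, left) without copying the frame."""
    for y, row in enumerate(patch):
        for x, value in enumerate(row):
            a = alpha[y][x]
            target = frame[top + y][left + x]
            frame[top + y][left + x] = a * value + (1.0 - a) * target
```

Because the template depends only on its size and feather radius, repeated calls hit the cache and the per-frame work reduces to the blend itself.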
The benchmark_pipeline.py utility allows developers to measure the latency of every stage in the 1080p pipeline benchmark_pipeline.py1-5
Sources: benchmark_pipeline.py107-151
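Per-stage latency measurement of this kind can be sketched with `time.perf_counter`. The function `time_stages` and the stage names in the usage note are hypothetical and do not reproduce benchmark_pipeline.py.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]


def time_stages(stages: List[Stage], frame: Any, repeats: int = 10) -> Dict[str, float]:
    """Return the mean latency in milliseconds of each stage over `repeats` runs."""
    totals = {name: 0.0 for name, _ in stages}
    for _ in range(repeats):
        data = frame
        for name, stage in stages:
            start = time.perf_counter()
            data = stage(data)  # each stage consumes the previous stage's output
            totals[name] += time.perf_counter() - start
    return {name: total / repeats * 1000.0 for name, total in totals.items()}
```

A call such as `time_stages([("detect", detect), ("swap", swap), ("paste", paste)], frame)` would report where a frame spends its time, assuming those three callables exist. When stages run on the GPU, a synchronization step inside each callable is needed for the timings to reflect completed work.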
The release_resources() function in modules/core.py158-160 is called after processing stages to release cached VRAM back to the OS, which is critical for preventing fragmentation in long-running batch jobs.
Sources: modules/core.py158-160 modules/core.py194
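A release step of this kind might look like the sketch below. It assumes the cached VRAM is held by PyTorch's caching allocator, which this document does not state explicitly, and the name `release_cached_vram` is hypothetical. `torch.cuda.is_available()` and `torch.cuda.empty_cache()` are standard PyTorch calls.

```python
def release_cached_vram() -> bool:
    """Free unused cached GPU memory; return True if a release was attempted."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed: nothing to release
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # returns unused cached blocks to the CUDA driver
    return True
```

Memory still referenced by live tensors is unaffected, so the call is safe to make between processing stages.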