Mar 11, 2024
5 min read

Accelerating Image Processing with Fast Gaussian Blur in C++

An exploration of real-time Gaussian blur approximation using optimized C++ techniques, highlighting linear-time filtering and performance considerations on modern CPUs.

Introduction

As part of our GPU programming coursework at ESIEA, we explored various methods for accelerating computational tasks traditionally considered intensive. While many projects naturally leaned towards leveraging GPU power via CUDA, our work focused on the CPU side of the spectrum, seeking to achieve high performance using optimized C++ and parallel programming techniques. Specifically, we implemented a fast approximation of the Gaussian blur, commonly used in image processing, with a focus on linear-time filtering and cache-friendly execution.

Although not directly implemented on GPU hardware, this project offers valuable insights into performance engineering, memory locality, and algorithmic approximation—skills that directly translate to GPU programming and parallel computation.

Project Outline

Algorithm Selection

Gaussian blur is a cornerstone operation in computer vision and image processing, typically computed by convolution with a Gaussian kernel. However, the naive approach's per-pixel cost grows quadratically with the kernel radius (and still linearly even when the filter is separated into two 1D passes), making it unsuitable for real-time applications with large images or high blur radii.

To address this, we adopted an algorithm inspired by the work of Peter Kovesi and popularized by Ivan Kutskir, which approximates Gaussian blur using multiple box blur passes. This approach relies on the Central Limit Theorem: repeated convolution with box kernels gradually approximates a true Gaussian distribution. Crucially, this method operates in linear time with respect to image size and is independent of the blur radius (sigma), making it ideal for high-performance contexts.
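The box widths for the n passes can be derived in closed form from sigma. The sketch below (function and variable names are ours, not the project's API) follows the standard "boxes for Gauss" derivation popularized by Kutskir: it picks two adjacent odd widths and splits the passes between them so the combined variance matches the target Gaussian.

```cpp
#include <cmath>
#include <vector>

// Compute n box-filter widths whose repeated application approximates a
// Gaussian of standard deviation sigma (the "boxes for Gauss" scheme).
std::vector<int> boxes_for_gauss(float sigma, int n) {
    // Ideal box width that would match the Gaussian variance exactly.
    float w_ideal = std::sqrt(12.0f * sigma * sigma / n + 1.0f);
    int wl = static_cast<int>(std::floor(w_ideal));
    if (wl % 2 == 0) wl--;              // box widths must be odd
    int wu = wl + 2;
    // How many of the n passes should use the smaller width wl so that
    // the total variance stays closest to sigma^2.
    float m_ideal = (12.0f * sigma * sigma - n * wl * wl - 4.0f * n * wl - 3.0f * n)
                    / (-4.0f * wl - 4.0f);
    int m = static_cast<int>(std::round(m_ideal));
    std::vector<int> sizes(n);
    for (int i = 0; i < n; ++i) sizes[i] = (i < m ? wl : wu);
    return sizes;
}
```

For example, sigma = 10 with three passes yields widths 19, 19, and 21; note that the number of passes, not the widths themselves, drives the per-pixel cost.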

C++ Implementation

We developed a header-only C++ library implementing a fast Gaussian blur using templated functions and cache-optimized traversal. The main function exposed by the library allows flexible use with any image format stored in a raw buffer and supports various border policies (mirror, extend, wrap, crop).

template<typename T>
void fast_gaussian_blur(
    T*& in,              // input buffer (by reference: buffers may be swapped)
    T*& out,             // output buffer
    const int w,         // image width in pixels
    const int h,         // image height in pixels
    const int c,         // number of channels
    const float sigma,   // target Gaussian standard deviation
    const uint32_t n,    // number of box blur passes
    const Border p       // border policy: mirror, extend, wrap, or crop
);

The algorithm performs N horizontal box blur passes, followed by a buffer transposition, then N additional horizontal passes (which now act as vertical passes), and a final transposition to restore the image orientation. This method ensures better cache coherence than the traditional row-then-column traversal.

To mitigate cache miss penalties, especially during transposition, we process image blocks per thread. This design significantly improves memory locality, an essential factor in real-time performance on CPU architectures.
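A tiled transpose keeps both the read and write streams within cache-sized working sets. The following is a minimal sketch of that idea, with an illustrative 64-pixel tile edge (the project's actual block size may differ), parallelized with OpenMP as in the library:

```cpp
#include <algorithm>

// Cache-friendly tiled transpose: each thread copies small square tiles so
// that reads and writes both stay within a few cache lines at a time.
// Single-channel image; the 64-pixel tile edge is an illustrative choice.
void transpose_blocked(const float* in, float* out, int w, int h) {
    constexpr int B = 64; // tile edge in pixels
    #pragma omp parallel for collapse(2)
    for (int by = 0; by < h; by += B)
        for (int bx = 0; bx < w; bx += B)
            for (int y = by; y < std::min(by + B, h); ++y)
                for (int x = bx; x < std::min(bx + B, w); ++x)
                    out[x * h + y] = in[y * w + x];
}
```

A naive transpose streams through one buffer while striding through the other, evicting a cache line per write once the image is wider than the cache; tiling bounds that stride to the tile edge.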

Performance Optimization

All computations are parallelized using OpenMP, ensuring effective utilization of multi-core CPUs. Our tests showed that blurring a 2-million-pixel image (e.g., 1400×1400) completes in approximately 7 milliseconds on a Ryzen 7 2700X, making the algorithm viable for real-time or interactive applications.

We also compared various strategies for transposition and blur pass ordering, ultimately confirming that alternating all passes (horizontal followed by vertical) was unnecessary thanks to the separability and commutativity of box filters. The final pipeline minimizes memory access penalties while delivering high-quality approximations.

Results

Our fast Gaussian blur implementation achieves linear-time performance with respect to image size and exhibits consistent runtimes regardless of the chosen sigma. This contrasts with the quadratic growth observed in traditional Gaussian filtering methods.

[Figure: blur results on the original image for σ = 2, 5, 10, 30, and 50]

These results validate the quality of the approximation, especially for medium to large sigma values. While the algorithm may exhibit minor boundary inaccuracies and slightly fluctuating standard deviations at low sigma, these are generally imperceptible in practical scenarios.

Performance Analysis

The figure below illustrates execution times across various image sizes, highlighting the efficiency of our approach compared to naive and SIMD-accelerated Gaussian blur implementations.

Our cache-coherent block-wise transposition, coupled with sliding accumulator blur passes, offers superior scalability and stability, even for large images. Attempts to surpass this performance using ISPC (Intel SPMD Program Compiler) yielded no significant improvement, reaffirming the quality of our current implementation.

Key Learnings

This project provided valuable experience in the following areas:

  • Algorithmic Approximation: We explored how repeated simple operations (box filters) can converge to more complex mathematical results (Gaussian filtering), trading minimal accuracy for significant speed.
  • Cache Optimization: Efficient memory access patterns, particularly in image processing, are often more impactful than raw computation speed.
  • Parallel Programming: The use of OpenMP for multithreaded acceleration proved both accessible and effective for modern CPU architectures.
  • Cross-Platform Design: Our implementation remains portable and nearly dependency-free, relying only on the header-only STB libraries for image I/O.

Conclusion

While GPU acceleration remains the gold standard for high-throughput image processing, this project demonstrates that clever CPU-based implementations can achieve comparable real-time performance in specific use cases. Our fast Gaussian blur approximation is a testament to the power of algorithmic refinement, memory locality, and parallel execution.

Looking forward, this method could serve as a baseline for GPU porting, potentially achieving even greater speeds with CUDA or compute shaders—albeit at the cost of development complexity for generalized multi-channel support.

Thank you for reading about our work. The full source code and demonstration materials can be found in the project repository here.