Project · Sep 2025 – Dec 2025 · CUDA · GPU Programming

Neural Network Inference Engine

A multi-layer CNN with custom CUDA kernels and vectorized tensor operations, built to squeeze every drop of performance from NVIDIA GPUs.

The Overview

This project builds a neural network inference engine and optimizes it for both CPU and GPU execution. The goal is to understand where performance bottlenecks come from in real workloads, especially in matrix multiplication, which dominates most neural network computation. The engine is tested with a simple model trained on MNIST and implemented in both dense and sparse formats, which makes it possible to compare how memory access patterns, sparsity, and compute optimizations affect runtime and efficiency.

Neural Network

CNN architecture — layer breakdown and tensor flow.

Model Architecture

The model is a basic two-layer neural network for MNIST classification. It takes the input features, passes them through a hidden layer with a tanh activation, and produces outputs through a sigmoid activation. The predicted class is the argmax of the outputs.
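As a rough illustration of that forward pass, here is a minimal CPU reference sketch in C++. It is not the project's actual code: the `Matrix` struct, the `dense_layer` and `predict` names, the row-major layout, and the presence of bias vectors are all assumptions made for the example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix: rows * cols values stored contiguously.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // size must equal rows * cols
};

// Computes y = act(W * x + b), one output row at a time.
// The bias term is an assumption; the write-up does not mention one.
template <typename Activation>
std::vector<float> dense_layer(const Matrix& w, const std::vector<float>& bias,
                               const std::vector<float>& x, Activation act) {
    std::vector<float> y(w.rows);
    for (std::size_t r = 0; r < w.rows; ++r) {
        float sum = bias[r];
        for (std::size_t c = 0; c < w.cols; ++c) {
            sum += w.data[r * w.cols + c] * x[c];
        }
        y[r] = act(sum);
    }
    return y;
}

// Forward pass: tanh hidden layer, sigmoid output layer, argmax prediction.
std::size_t predict(const Matrix& w1, const std::vector<float>& b1,
                    const Matrix& w2, const std::vector<float>& b2,
                    const std::vector<float>& input) {
    auto hidden = dense_layer(w1, b1, input,
                              [](float v) { return std::tanh(v); });
    auto output = dense_layer(w2, b2, hidden,
                              [](float v) { return 1.0f / (1.0f + std::exp(-v)); });
    // Return the index of the largest output value.
    std::size_t best = 0;
    for (std::size_t i = 1; i < output.size(); ++i) {
        if (output[i] > output[best]) best = i;
    }
    return best;
}
```

Calling `predict` with the two weight matrices, their biases, and one flattened image returns the predicted class index. A GPU version would keep the same math and parallelize the loop over output rows.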

Although the architecture is simple, it is built entirely around matrix-vector and matrix-matrix operations, which makes it a good testbed for optimized linear algebra implementations. The dense and sparse versions share the same structure, so performance differences come only from the implementation, not the model itself.
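To make the dense versus sparse comparison concrete, the sketch below computes the same matrix-vector product from a dense row-major array and from a Compressed Sparse Row (CSR) copy of it. CSR is an assumption here, since the write-up does not name the sparse format the project uses, and `CsrMatrix`, `to_csr`, `dense_matvec`, and `csr_matvec` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Compressed Sparse Row storage: only nonzero weights are kept.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;  // column of each stored value
    std::vector<float> values;         // the nonzero weights
};

// Converts a row-major dense matrix to CSR, dropping exact zeros.
CsrMatrix to_csr(const std::vector<float>& dense, std::size_t rows, std::size_t cols) {
    CsrMatrix m;
    m.rows = rows;
    m.cols = cols;
    m.row_ptr.push_back(0);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float v = dense[r * cols + c];
            if (v != 0.0f) {
                m.col_idx.push_back(c);
                m.values.push_back(v);
            }
        }
        m.row_ptr.push_back(m.values.size());  // end offset of row r
    }
    return m;
}

// Dense y = W * x: touches every weight, including zeros.
std::vector<float> dense_matvec(const std::vector<float>& w, std::size_t rows,
                                std::size_t cols, const std::vector<float>& x) {
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            y[r] += w[r * cols + c] * x[c];
        }
    }
    return y;
}

// Sparse y = W * x: one multiply-add per stored nonzero.
std::vector<float> csr_matvec(const CsrMatrix& m, const std::vector<float>& x) {
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t k = m.row_ptr[r]; k < m.row_ptr[r + 1]; ++k) {
            y[r] += m.values[k] * x[m.col_idx[k]];
        }
    }
    return y;
}
```

Both functions return the same vector for the same weights, so they can be swapped inside an otherwise identical model. Note that the sparse version reads `x` through `col_idx`, an indirect access pattern that behaves differently in cache than the dense version's sequential reads.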

CUDA Kernels

Custom CUDA kernel implementation for vectorized tensor operations.

Performance Profiling

The system was benchmarked on CPU and GPU in both dense and sparse configurations. On the CPU, performance is limited mostly by memory bandwidth and cache efficiency, especially for larger matrices. On the GPU, parallel execution is much faster, particularly for large workloads and sparse operations. Sparse models cut computation significantly at higher sparsity levels, at the cost of some accuracy. Overall, the results show how sparsity and hardware-specific optimization affect runtime, with the GPU implementations giving the largest speedups in most cases.
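One way to see why sparsity only pays off at higher levels is a back-of-the-envelope cost model for a single matrix-vector product. The sketch below is illustrative only and does not reproduce the project's measurements. It assumes 32-bit float weights, a CSR layout with 32-bit indices, and it counts only weight traffic, ignoring the input and output vectors. `MatvecCost`, `dense_cost`, and `csr_cost` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Estimated work for one matrix-vector product.
struct MatvecCost {
    std::size_t multiply_adds;  // arithmetic work
    std::size_t weight_bytes;   // bytes of weight data read from memory
};

// Dense: every weight is read and multiplied, zero or not.
MatvecCost dense_cost(std::size_t rows, std::size_t cols) {
    return {rows * cols, rows * cols * sizeof(float)};
}

// CSR (assumed format): each nonzero costs one float value plus one
// 32-bit column index, and the matrix carries rows + 1 row offsets.
MatvecCost csr_cost(std::size_t rows, std::size_t nonzeros) {
    const std::size_t index_bytes = sizeof(std::uint32_t);
    return {nonzeros,
            nonzeros * (sizeof(float) + index_bytes) + (rows + 1) * index_bytes};
}
```

For a hypothetical 128 x 784 layer (784 being the size of a flattened MNIST image), `dense_cost(128, 784)` gives 100352 multiply-adds and 401408 weight bytes. At roughly 90% sparsity, `csr_cost(128, 10035)` gives 10035 multiply-adds and 80796 bytes. At 50% sparsity, `csr_cost(128, 50176)` halves the arithmetic but reads 401924 bytes, slightly more than the dense layout, because every stored value carries an index.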

Performance
View on GitHub →