Author: Me (and Copilot)
This project implements a simple neural network entirely in x86 assembly language to recognize handwritten digits from the MNIST dataset. The goal was to understand how neural networks work at the lowest level — from memory layout and arithmetic operations to training logic — without relying on high-level libraries or frameworks. It runs in a lightweight Debian Slim environment via Docker for easy setup.
Sometimes, we think we truly understand something, until we try to build it from scratch. When theory meets practice, every small detail becomes a challenge.
Not only is this a pure assembly implementation, but I also optimized it for performance:
- Using AVX-512 ZMM registers, I can compute 16 float32 operations in parallel, adding SIMD acceleration to the neural network computations (see the sketch below).
- As a result, this assembly implementation is roughly 7× faster than a Python implementation using NumPy (which itself relies on C libraries).
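To make that concrete, here is a minimal sketch of an AVX-512 dot-product inner loop, assuming 64-byte-aligned float32 buffers and a length that is a multiple of 16. The register assignments and labels are illustrative, not the project's actual routine:

```nasm
; Minimal sketch (not the project's actual code): accumulate a float32
; dot product 16 lanes at a time with AVX-512.
;   rsi = pointer to vector a, rdi = pointer to vector b, rcx = element count
;   assumes rcx is a multiple of 16 and both buffers are 64-byte aligned
dot_product_f32:
    vpxord      zmm0, zmm0, zmm0        ; running sums across 16 lanes
.dot_loop:
    vmovaps     zmm1, [rsi]             ; load 16 floats from a
    vfmadd231ps zmm0, zmm1, [rdi]       ; zmm0 += a[i..i+15] * b[i..i+15]
    add         rsi, 64
    add         rdi, 64
    sub         rcx, 16
    jnz         .dot_loop
    ; horizontal reduction of the 16 partial sums down to a scalar in xmm0
    vextractf64x4 ymm1, zmm0, 1
    vaddps      ymm0, ymm0, ymm1
    vextractf128 xmm1, ymm0, 1
    vaddps      xmm0, xmm0, xmm1
    vhaddps     xmm0, xmm0, xmm0
    vhaddps     xmm0, xmm0, xmm0        ; xmm0[0] now holds the dot product
    ret
```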
 
I decided to write this project in pure assembly language to push my limits and see what really happens behind high-level neural network frameworks. It was a deep dive into how each operation — from matrix multiplication to gradient updates — actually works at the CPU level.
Throughout this project, I faced many challenges, both in the implementation and in understanding neural network concepts. But this is just the beginning — I plan to keep building the deep learning and neural network pieces I want to understand from scratch, implementing everything in assembly to truly see how they work underneath.
| Layer | Size | Activation | 
|---|---|---|
| Input | 784 | – (flattened MNIST image) | 
| Hidden 1 | 128 | ReLU | 
| Hidden 2 | 64 | ReLU | 
| Output | 10 | Softmax | 
- Epochs: 10
- Batch size: 32
- Training samples: 60,000
- Test samples: 10,000
- Learning rate: 0.01
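In the code, these sizes and hyperparameters boil down to plain constants and data. A minimal sketch of how they might be declared in NASM (the names are illustrative, not necessarily the ones used in the project):

```nasm
; Illustrative constants only -- label names may differ from the project's.
INPUT_SIZE    equ 784          ; 28x28 flattened MNIST image
HIDDEN1_SIZE  equ 128
HIDDEN2_SIZE  equ 64
OUTPUT_SIZE   equ 10
EPOCHS        equ 10
BATCH_SIZE    equ 32
TRAIN_SAMPLES equ 60000
TEST_SAMPLES  equ 10000

section .data
learning_rate: dd 0.01         ; float32 learning rate
```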
 
- I first wanted to build this for Windows, but Windows API calls in assembly were too complicated to debug. So I switched to Linux, where I could use GDB in the terminal, which was hard but fun to learn.
- For the softmax and loss functions, I needed exp() and log(). Instead of writing them myself, I used the C math library by linking with -lc -lm (see the call sketch after this list).
- To help debug, I wrote the same neural network in Python to compare outputs and find where my assembly was wrong.
- I first implemented all operations using 64-bit doubles, but then switched to 32-bit floats to reduce memory usage and improve performance. This required changing memory layouts, instructions, and data handling throughout the code.
- I added parallelism using AVX-512 ZMM registers, allowing 16 float32 calculations to be performed simultaneously in dot products and matrix operations, which sped up computation significantly.
- I discovered that reading each MNIST image individually from disk was a major bottleneck. I solved this by loading the entire dataset into RAM at once, which drastically reduced file I/O overhead (sketched after this list).
- I tried to make my functions flexible and reusable across different parts of the neural network.
- I tried to minimize stack usage and rely on registers instead. It was really hard to find registers that weren't already being used somewhere else.
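As a sketch of the libm route mentioned above: under the System V AMD64 calling convention, a float argument and result both travel in xmm0, so a call to expf from NASM can look roughly like this. The alignment fix-up assumes nothing else has been pushed since function entry:

```nasm
extern  expf

; xmm0 = x  ->  xmm0 = expf(x)
; At function entry rsp is 8 bytes off 16-byte alignment, so one 8-byte
; adjustment restores the alignment the ABI requires at a call instruction.
    sub     rsp, 8
    call    expf wrt ..plt          ; float argument and result both in xmm0
    add     rsp, 8
; expf may clobber caller-saved registers (rax, rcx, rdx, rsi, rdi, r8-r11
; and all vector registers), so any live values must be saved around the call.
```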
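And a hypothetical sketch of the load-everything-into-RAM idea using raw Linux syscalls. The path and labels are illustrative, error handling is omitted, and a robust version would loop until read has returned all 47,040,016 bytes of the training-image file (16-byte IDX header plus 60,000 × 784 pixels):

```nasm
; Hypothetical sketch: slurp the whole MNIST training-image file into RAM
; with one open/read/close instead of one read per image.
IMAGE_FILE_SIZE equ 47040016

section .data
image_path:   db "data/train-images-idx3-ubyte", 0   ; path is illustrative

section .bss
image_buffer: resb IMAGE_FILE_SIZE

section .text
load_images:
    mov     rax, 2                  ; sys_open
    lea     rdi, [rel image_path]
    xor     esi, esi                ; O_RDONLY
    syscall

    mov     rdi, rax                ; fd returned by open
    xor     eax, eax                ; sys_read -- one big read
    lea     rsi, [rel image_buffer]
    mov     rdx, IMAGE_FILE_SIZE
    syscall

    mov     rax, 3                  ; sys_close (rdi still holds the fd)
    syscall
    ret
```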
 
The program starts from _start.asm and trains the neural network on MNIST data for 10 epochs. It processes images in batches, passing them through three layers (128 → 64 → 10 neurons) with ReLU activation. For each batch, it calculates the loss and uses backpropagation (backprop.asm) to compute gradients for all weights and biases (dW1, dbias1, ...).
After each batch, gradient.asm updates the weights using these gradients and a learning rate of 0.01, as sketched below. layers_buffer.asm provides the memory space for all layer outputs and gradients. Once training completes, the program tests the network on unseen images and prints the final accuracy.
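A hedged sketch of the kind of update gradient.asm is described as performing, w = w - lr * dw on 16 float32 values per iteration. The labels W1 and dW1 are hypothetical, and the constants are the illustrative ones sketched earlier:

```nasm
; Illustrative SGD step for one weight matrix; 784*128 weights is a
; multiple of 16, so no scalar tail loop is needed here.
sgd_update_w1:
    vbroadcastss zmm2, [rel learning_rate]  ; lr replicated into all 16 lanes
    lea     rsi, [rel W1]                   ; weights (hypothetical label)
    lea     rdi, [rel dW1]                  ; gradients from backprop
    mov     rcx, INPUT_SIZE * HIDDEN1_SIZE
.update_loop:
    vmovups zmm0, [rsi]                     ; 16 weights
    vmovups zmm1, [rdi]                     ; 16 gradients
    vfnmadd231ps zmm0, zmm1, zmm2           ; zmm0 = zmm0 - zmm1 * zmm2
    vmovups [rsi], zmm0
    add     rsi, 64
    add     rdi, 64
    sub     rcx, 16
    jnz     .update_loop
    ret
```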
The build.sh script assembles all NASM files into object files, links them with the math libraries, and produces the final executable.
This project can be built and run inside a Docker container with NASM and build tools installed. Follow these steps:
docker build -t nasm-assembly .

docker run \
  --volume="PATH/TO/PROJECT:/mnt/project" \
  --cpus=4 \
  --memory=4g \
  --memory-swap=4g \
  nasm-assembly
Replace PATH/TO/PROJECT with your local project folder, e.g.:
- Windows: C:/Users/YourName/NASM_MNIST
- Linux/Mac: /home/username/NASM_MNIST
Then open a shell inside the running container:
docker exec -it <container_id_or_name> bash
This project includes a build.sh script to assemble, link, and run the NASM MNIST neural network.
./build.sh
- This will assemble all .asm files, link them, and produce the executable ./mnist.
 
Run the neural network:
./mnist
This project includes a build_prod.sh script for creating an optimized production executable:
./build_prod.sh




