Fast, reproducible, and portable software development environments
Remote development on HPC clusters with VSCode
Matrix multiplication example implemented with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA
Accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.
A simple and understandable CUDA kernel for batch-matmul operation
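For readers unfamiliar with the operation, a batch matmul multiplies many independent matrix pairs of the same shape. The following is a minimal CPU reference sketch in plain Python, not the repository's kernel; the function name `batch_matmul` and the nested-list layout are assumptions made for illustration. A reference like this is commonly used to check a GPU kernel's output on small inputs.

```python
def batch_matmul(a, b):
    """Multiply matching pairs of matrices.

    a is a list of M x K matrices and b is a list of K x N matrices,
    each stored as nested lists. Returns a list of M x N matrices.
    (Hypothetical helper for illustration only.)
    """
    if len(a) != len(b):
        raise ValueError("batch sizes differ")

    out = []
    for mat_a, mat_b in zip(a, b):
        k = len(mat_b)  # shared inner dimension
        if any(len(row) != k for row in mat_a):
            raise ValueError("inner dimensions differ")
        n = len(mat_b[0]) if k else 0

        # Every output element depends only on one row of mat_a and one
        # column of mat_b, so all (batch, row, col) triples are independent.
        # That independence is what makes the operation easy to parallelize.
        out.append([
            [sum(row[p] * mat_b[p][j] for p in range(k)) for j in range(n)]
            for row in mat_a
        ])
    return out


if __name__ == "__main__":
    print(batch_matmul([[[1, 2], [3, 4]]], [[[5, 6], [7, 8]]]))
```

Each output element is computed independently of the others, which is why a GPU version can plausibly assign one thread per element.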
Repository for the Architecture of Computers and Parallel Systems course at VŠB
The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
University Project for "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a Parallelized Nearest Neighbor Upscaler using CUDA.
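As a point of reference, nearest-neighbor upscaling copies each source pixel into a block of output pixels. The sketch below is a plain-Python illustration under stated assumptions (integer scale factor, image stored as a nested list, hypothetical function name `upscale_nearest`); it is not the project's CUDA implementation.

```python
def upscale_nearest(image, factor):
    """Upscale a 2D image (list of rows) by an integer factor.

    Each output pixel (y, x) takes the value of the source pixel at
    (y // factor, x // factor). Hypothetical helper for illustration.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")

    height = len(image)
    width = len(image[0]) if height else 0

    # Output pixels do not depend on each other, so each one could be
    # computed by its own thread in a parallel version.
    return [
        [image[y // factor][x // factor] for x in range(width * factor)]
        for y in range(height * factor)
    ]


if __name__ == "__main__":
    for row in upscale_nearest([[1, 2], [3, 4]], 2):
        print(row)
```

Integer division maps every output coordinate back to exactly one source pixel, so no interpolation or neighboring reads are needed.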
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.
A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling, featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.