1. Detect the GPU microarchitecture (uarch) and deduce whether it has tensor cores.
2. Run a GEMM (how?) on the tensor cores to reach peak half-precision (FP16) performance.
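
A minimal sketch of the bookkeeping around these two steps, in Python. It assumes the compute capability and device name have already been queried elsewhere (for example from `cudaGetDeviceProperties` or `torch.cuda.get_device_capability`), and that the GEMM has already been timed. The function names `has_tensor_cores` and `gemm_tflops` are hypothetical, and the name-based exception is an illustrative heuristic, not an exhaustive list. How to actually launch the GEMM is left open here.

```python
def has_tensor_cores(major: int, minor: int, name: str = "") -> bool:
    """Heuristic guess at whether a CUDA device has tensor cores."""
    # Tensor cores first shipped with Volta, compute capability 7.0.
    if (major, minor) < (7, 0):
        return False
    # Assumption: GTX 16-series parts report 7.5 but lack tensor cores.
    # This name check is illustrative only; other parts may need adding.
    if (major, minor) == (7, 5) and "GTX 16" in name:
        return False
    return True


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for one (m x k) @ (k x n) GEMM that took `seconds`."""
    # Each of the m*n outputs needs k multiplies and k adds.
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    print(has_tensor_cores(8, 0, "NVIDIA A100"))           # Ampere
    print(has_tensor_cores(6, 1, "GeForce GTX 1080"))      # Pascal
    print(gemm_tflops(8192, 8192, 8192, seconds=0.01))
```

Comparing the `gemm_tflops` result against the vendor's quoted FP16 tensor-core peak gives a rough idea of how close the run got. Averaging over several timed iterations after a warm-up run is usually needed for a stable number.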