I tried running the examples (stories42M and stories15M) and compared throughput (tok/sec) against the original llama2.c; this variant runs slower. Is that to be expected?