Hi, I'm a beginner and would like to ask why vectorized memory access is actually making things slower.
After running elementwise.py
, the results I got differ quite a bit from what I expected.
Looking at the output, f32x4 is slower than f32 in almost every case; for f16, the x2 and x8 variants show no obvious speedup either.
Why is that? Shouldn't vectorized memory access be faster?
My GPU is a 2080 Ti 22G with 616 GB/s of memory bandwidth. Is the lack of speedup because my GPU's bandwidth is already fairly high?
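For context, a quick roofline-style sanity check is possible from the numbers in the question alone. Assuming the kernel is an elementwise add (c = a + b), each element costs two loads and one store, so at 616 GB/s the memory system, not the instruction stream, sets a floor on runtime. This is a minimal sketch under that assumption; `min_time_ms` is a hypothetical helper, not part of elementwise.py:

```python
def min_time_ms(s, k, bytes_per_elem, peak_gbs=616.0):
    """Bandwidth-bound lower bound for c = a + b over an S*K tensor.

    Traffic per element: 2 reads + 1 write of `bytes_per_elem` bytes.
    """
    traffic_bytes = 3 * bytes_per_elem * s * k
    return traffic_bytes / (peak_gbs * 1e9) * 1e3  # seconds -> ms

# fp32, S=K=1024: lower bound ~0.0204 ms vs. the measured ~0.0273 ms below
print(f"{min_time_ms(1024, 1024, 4):.4f} ms")
```

If the measured times sit close to this floor, the kernel is already bandwidth-bound, and wider loads cannot make it faster; vectorization mainly reduces instruction count, which only matters when the kernel is not saturating memory.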
-------------------------------------------------------------------------------------
S=1024, K=1024
out_f32: [0.96132743, 1.51430702], time:0.02730680ms
out_f32x4: [0.96132743, 1.51430702], time:0.02952051ms
out_f32_th: [0.96132743, 1.51430702], time:0.02986336ms
-------------------------------------------------------------------------------------
out_f16: [0.9609375, 1.51367188], time:0.01912737ms
out_f16x2: [0.9609375, 1.51367188], time:0.01694560ms
out_f16x8: [0.9609375, 1.51367188], time:0.01652718ms
out_f16x8pack: [0.9609375, 1.51367188], time:0.01584172ms
out_f16_th: [0.9609375, 1.51367188], time:0.01598263ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=1024, K=2048
out_f32: [-2.21341228, -0.22882888], time:0.05539894ms
out_f32x4: [-2.21341228, -0.22882888], time:0.05360365ms
out_f32_th: [-2.21341228, -0.22882888], time:0.05304718ms
-------------------------------------------------------------------------------------
out_f16: [-2.21289062, -0.22888184], time:0.02895474ms
out_f16x2: [-2.21289062, -0.22888184], time:0.02823472ms
out_f16x8: [-2.21289062, -0.22888184], time:0.02895665ms
out_f16x8pack: [-2.21289062, -0.22888184], time:0.02967000ms
out_f16_th: [-2.21289062, -0.22888184], time:0.02897358ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=1024, K=4096
out_f32: [1.17699647, 0.544092], time:0.10359192ms
out_f32x4: [1.17699647, 0.544092], time:0.10464501ms
out_f32_th: [1.17699647, 0.544092], time:0.10411191ms
-------------------------------------------------------------------------------------
out_f16: [1.17675781, 0.54394531], time:0.05496955ms
out_f16x2: [1.17675781, 0.54394531], time:0.05489397ms
out_f16x8: [1.17675781, 0.54394531], time:0.05424166ms
out_f16x8pack: [1.17675781, 0.54394531], time:0.05376101ms
out_f16_th: [1.17675781, 0.54394531], time:0.05454874ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=2048, K=1024
out_f32: [-1.46793842, -0.19991362], time:0.05346680ms
out_f32x4: [-1.46793842, -0.19991362], time:0.05228972ms
out_f32_th: [-1.46793842, -0.19991362], time:0.05246186ms
-------------------------------------------------------------------------------------
out_f16: [-1.46875, -0.19995117], time:0.02835703ms
out_f16x2: [-1.46875, -0.19995117], time:0.02766037ms
out_f16x8: [-1.46875, -0.19995117], time:0.02791238ms
out_f16x8pack: [-1.46875, -0.19995117], time:0.02756238ms
out_f16_th: [-1.46875, -0.19995117], time:0.02814460ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=2048, K=2048
out_f32: [1.18721819, 2.10879827], time:0.10185599ms
out_f32x4: [1.18721819, 2.10879827], time:0.10227609ms
out_f32_th: [1.18721819, 2.10879827], time:0.10236549ms
-------------------------------------------------------------------------------------
out_f16: [1.1875, 2.109375], time:0.05349374ms
out_f16x2: [1.1875, 2.109375], time:0.05265045ms
out_f16x8: [1.1875, 2.109375], time:0.05346727ms
out_f16x8pack: [1.1875, 2.109375], time:0.05260110ms
out_f16_th: [1.1875, 2.109375], time:0.05285811ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=2048, K=4096
out_f32: [1.48429728, 0.29218626], time:0.20148802ms
out_f32x4: [1.48429728, 0.29218626], time:0.20691109ms
out_f32_th: [1.48429728, 0.29218626], time:0.20491791ms
-------------------------------------------------------------------------------------
out_f16: [1.484375, 0.29223633], time:0.10527062ms
out_f16x2: [1.484375, 0.29223633], time:0.10620880ms
out_f16x8: [1.484375, 0.29223633], time:0.10499525ms
out_f16x8pack: [1.484375, 0.29223633], time:0.10438800ms
out_f16_th: [1.484375, 0.29223633], time:0.10462546ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=4096, K=1024
out_f32: [0.99221903, -0.55276316], time:0.10310864ms
out_f32x4: [0.99221903, -0.55276316], time:0.10449076ms
out_f32_th: [0.99221903, -0.55276316], time:0.10388064ms
-------------------------------------------------------------------------------------
out_f16: [0.9921875, -0.55273438], time:0.05691886ms
out_f16x2: [0.9921875, -0.55273438], time:0.05419183ms
out_f16x8: [0.9921875, -0.55273438], time:0.05423760ms
out_f16x8pack: [0.9921875, -0.55273438], time:0.05318284ms
out_f16_th: [0.9921875, -0.55273438], time:0.05289507ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=4096, K=2048
out_f32: [0.69422352, -2.08615279], time:0.20160818ms
out_f32x4: [0.69422352, -2.08615279], time:0.20334101ms
out_f32_th: [0.69422352, -2.08615279], time:0.20489359ms
-------------------------------------------------------------------------------------
out_f16: [0.69384766, -2.0859375], time:0.10406899ms
out_f16x2: [0.69384766, -2.0859375], time:0.10641527ms
out_f16x8: [0.69384766, -2.0859375], time:0.10471940ms
out_f16x8pack: [0.69384766, -2.0859375], time:0.10447454ms
out_f16_th: [0.69384766, -2.0859375], time:0.10432959ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
S=4096, K=4096
out_f32: [1.44236982, 1.11088383], time:0.39831758ms
out_f32x4: [1.44236982, 1.11088383], time:0.40422320ms
out_f32_th: [1.44236982, 1.11088383], time:0.40196872ms
-------------------------------------------------------------------------------------
out_f16: [1.44238281, 1.11132812], time:0.20373106ms
out_f16x2: [1.44238281, 1.11132812], time:0.20653439ms
out_f16x8: [1.44238281, 1.11132812], time:0.20352507ms
out_f16x8pack: [1.44238281, 1.11132812], time:0.20461726ms
out_f16_th: [1.44238281, 1.11132812], time:0.20387745ms
-------------------------------------------------------------------------------------
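The measured f32 times above can be converted into achieved bandwidth, again assuming an elementwise add that moves 3 × 4 bytes per element (2 reads + 1 write) — a sketch, not a definitive analysis:

```python
# Times copied from the out_f32 rows of the log above (ms).
measured_ms = {
    (1024, 1024): 0.02730680,
    (2048, 2048): 0.10185599,
    (4096, 4096): 0.39831758,
}
PEAK_GBS = 616.0  # 2080 Ti spec quoted in the question

for (s, k), t_ms in measured_ms.items():
    # bytes moved / elapsed seconds, in GB/s
    gbs = 3 * 4 * s * k / (t_ms * 1e-3) / 1e9
    print(f"S={s}, K={k}: {gbs:6.1f} GB/s ({gbs / PEAK_GBS:.0%} of peak)")
```

Under this assumption the scalar f32 kernel already reaches roughly 75–82% of the quoted 616 GB/s, which would explain why the x2/x4/x8 variants cannot pull ahead: there is little bandwidth headroom left for them to exploit.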