Vectorized memory access in elementwise #361

@H3AlO3

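Hello, I'm a beginner and would like to ask why vectorized memory access actually made things slower.
After running elementwise.py, I found that the results differ considerably from what I expected.
Looking at the output, f32x4 is slower than f32 in almost every case; for f16, x2 and x8 show no noticeable speedup either.
Why is that? Shouldn't vectorized memory access be faster?
My GPU is a 2080Ti 22G with a bandwidth of 616GB/s. Is the speedup negligible because my GPU's bandwidth is already fairly high?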

-------------------------------------------------------------------------------------
                                        S=1024, K=1024
           out_f32: [0.96132743, 1.51430702], time:0.02730680ms
         out_f32x4: [0.96132743, 1.51430702], time:0.02952051ms
        out_f32_th: [0.96132743, 1.51430702], time:0.02986336ms
-------------------------------------------------------------------------------------
           out_f16: [0.9609375, 1.51367188], time:0.01912737ms
         out_f16x2: [0.9609375, 1.51367188], time:0.01694560ms
         out_f16x8: [0.9609375, 1.51367188], time:0.01652718ms
     out_f16x8pack: [0.9609375, 1.51367188], time:0.01584172ms
        out_f16_th: [0.9609375, 1.51367188], time:0.01598263ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=1024, K=2048
           out_f32: [-2.21341228, -0.22882888], time:0.05539894ms
         out_f32x4: [-2.21341228, -0.22882888], time:0.05360365ms
        out_f32_th: [-2.21341228, -0.22882888], time:0.05304718ms
-------------------------------------------------------------------------------------
           out_f16: [-2.21289062, -0.22888184], time:0.02895474ms
         out_f16x2: [-2.21289062, -0.22888184], time:0.02823472ms
         out_f16x8: [-2.21289062, -0.22888184], time:0.02895665ms
     out_f16x8pack: [-2.21289062, -0.22888184], time:0.02967000ms
        out_f16_th: [-2.21289062, -0.22888184], time:0.02897358ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=1024, K=4096
           out_f32: [1.17699647, 0.544092], time:0.10359192ms
         out_f32x4: [1.17699647, 0.544092], time:0.10464501ms
        out_f32_th: [1.17699647, 0.544092], time:0.10411191ms
-------------------------------------------------------------------------------------
           out_f16: [1.17675781, 0.54394531], time:0.05496955ms
         out_f16x2: [1.17675781, 0.54394531], time:0.05489397ms
         out_f16x8: [1.17675781, 0.54394531], time:0.05424166ms
     out_f16x8pack: [1.17675781, 0.54394531], time:0.05376101ms
        out_f16_th: [1.17675781, 0.54394531], time:0.05454874ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=2048, K=1024
           out_f32: [-1.46793842, -0.19991362], time:0.05346680ms
         out_f32x4: [-1.46793842, -0.19991362], time:0.05228972ms
        out_f32_th: [-1.46793842, -0.19991362], time:0.05246186ms
-------------------------------------------------------------------------------------
           out_f16: [-1.46875, -0.19995117], time:0.02835703ms
         out_f16x2: [-1.46875, -0.19995117], time:0.02766037ms
         out_f16x8: [-1.46875, -0.19995117], time:0.02791238ms
     out_f16x8pack: [-1.46875, -0.19995117], time:0.02756238ms
        out_f16_th: [-1.46875, -0.19995117], time:0.02814460ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=2048, K=2048
           out_f32: [1.18721819, 2.10879827], time:0.10185599ms
         out_f32x4: [1.18721819, 2.10879827], time:0.10227609ms
        out_f32_th: [1.18721819, 2.10879827], time:0.10236549ms
-------------------------------------------------------------------------------------
           out_f16: [1.1875, 2.109375], time:0.05349374ms
         out_f16x2: [1.1875, 2.109375], time:0.05265045ms
         out_f16x8: [1.1875, 2.109375], time:0.05346727ms
     out_f16x8pack: [1.1875, 2.109375], time:0.05260110ms
        out_f16_th: [1.1875, 2.109375], time:0.05285811ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=2048, K=4096
           out_f32: [1.48429728, 0.29218626], time:0.20148802ms
         out_f32x4: [1.48429728, 0.29218626], time:0.20691109ms
        out_f32_th: [1.48429728, 0.29218626], time:0.20491791ms
-------------------------------------------------------------------------------------
           out_f16: [1.484375, 0.29223633], time:0.10527062ms
         out_f16x2: [1.484375, 0.29223633], time:0.10620880ms
         out_f16x8: [1.484375, 0.29223633], time:0.10499525ms
     out_f16x8pack: [1.484375, 0.29223633], time:0.10438800ms
        out_f16_th: [1.484375, 0.29223633], time:0.10462546ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=4096, K=1024
           out_f32: [0.99221903, -0.55276316], time:0.10310864ms
         out_f32x4: [0.99221903, -0.55276316], time:0.10449076ms
        out_f32_th: [0.99221903, -0.55276316], time:0.10388064ms
-------------------------------------------------------------------------------------
           out_f16: [0.9921875, -0.55273438], time:0.05691886ms
         out_f16x2: [0.9921875, -0.55273438], time:0.05419183ms
         out_f16x8: [0.9921875, -0.55273438], time:0.05423760ms
     out_f16x8pack: [0.9921875, -0.55273438], time:0.05318284ms
        out_f16_th: [0.9921875, -0.55273438], time:0.05289507ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=4096, K=2048
           out_f32: [0.69422352, -2.08615279], time:0.20160818ms
         out_f32x4: [0.69422352, -2.08615279], time:0.20334101ms
        out_f32_th: [0.69422352, -2.08615279], time:0.20489359ms
-------------------------------------------------------------------------------------
           out_f16: [0.69384766, -2.0859375], time:0.10406899ms
         out_f16x2: [0.69384766, -2.0859375], time:0.10641527ms
         out_f16x8: [0.69384766, -2.0859375], time:0.10471940ms
     out_f16x8pack: [0.69384766, -2.0859375], time:0.10447454ms
        out_f16_th: [0.69384766, -2.0859375], time:0.10432959ms
-------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------
                                        S=4096, K=4096
           out_f32: [1.44236982, 1.11088383], time:0.39831758ms
         out_f32x4: [1.44236982, 1.11088383], time:0.40422320ms
        out_f32_th: [1.44236982, 1.11088383], time:0.40196872ms
-------------------------------------------------------------------------------------
           out_f16: [1.44238281, 1.11132812], time:0.20373106ms
         out_f16x2: [1.44238281, 1.11132812], time:0.20653439ms
         out_f16x8: [1.44238281, 1.11132812], time:0.20352507ms
     out_f16x8pack: [1.44238281, 1.11132812], time:0.20461726ms
        out_f16_th: [1.44238281, 1.11132812], time:0.20387745ms
-------------------------------------------------------------------------------------
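
One way to check the bandwidth hypothesis is to convert the timings above into achieved throughput. The sketch below is only an estimate: it assumes the kernel computes c = a + b (two reads and one write per element), which the log itself does not confirm, and the helper name `effective_gbps` is made up for illustration.

```python
# Assumption: the kernel computes c = a + b, so it reads two arrays and writes one.
ARRAYS_TOUCHED = 3
PEAK_GBPS = 616.0  # peak bandwidth quoted in the question


def effective_gbps(s, k, bytes_per_elem, time_ms):
    """Return achieved memory throughput in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = ARRAYS_TOUCHED * s * k * bytes_per_elem
    return total_bytes / (time_ms * 1e-3) / 1e9


# (label, S, K, bytes per element, time in ms), copied from the log above
samples = [
    ("f32   4096x4096", 4096, 4096, 4, 0.39831758),
    ("f32x4 4096x4096", 4096, 4096, 4, 0.40422320),
    ("f16   4096x4096", 4096, 4096, 2, 0.20373106),
    ("f16x8 4096x4096", 4096, 4096, 2, 0.20352507),
]

for label, s, k, nbytes, ms in samples:
    gbps = effective_gbps(s, k, nbytes, ms)
    print(f"{label}: {gbps:7.1f} GB/s ({gbps / PEAK_GBPS:.0%} of peak)")
```

Under that assumption, the 4096x4096 runs come out at roughly 500 GB/s, about 80% of the quoted 616GB/s, for the scalar and vectorized kernels alike. That would be consistent with every variant already being limited by memory bandwidth, so wider loads leave little room for further speedup.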
