@Fiery Fiery commented Aug 18, 2025

Summary:
As the title says, this issue still exists in the most recent ITEP experiments, where we apply only ITEP to the baseline without changing the batch size and/or the number of trainers.

From the recent MAI results in f777920760, we see a QPS gap of about 3.5% with ITEP enabled (393 vs 403 P90).

The issue is visible in the trace:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Faps-aps-mai_to_flow-777920760-f777930060%2F0%2Frank-0.Aug_11_01_48_39.4443.pt.trace.json.gz&bucket=aps_traces

{F1981311699}

Differential Revision: D67302872

@meta-cla meta-cla bot added the "CLA Signed" label Aug 18, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D67302872

@facebook-github-bot
Contributor

@Fiery has exported this pull request. If you are a Meta employee, you can view the originating diff in D67302872.

Fiery pushed a commit to Fiery/torchrec that referenced this pull request Sep 16, 2025
…rch#3293)

Reviewed By: doIIarplus

Differential Revision: D67302872

Fiery pushed a commit to Fiery/torchrec that referenced this pull request Sep 16, 2025
…rch#3293)

Summary:
Pull Request resolved: meta-pytorch#3293

Labels: CLA Signed, fb-exported, meta-exported