Open
Commits
5726 commits
5814a00
build: Upgrade jet-client
ko3n1g Oct 21, 2025
bacc707
build: Upgrade JET
ko3n1g Oct 21, 2025
fd0df7d
ADLR/megatron-lm!3974 - Convert static to use dynamic under the hood
shanmugamr1992 Oct 21, 2025
754dfa2
ADLR/megatron-lm!4308 - build: Bump TE
ko3n1g Oct 22, 2025
3106714
ADLR/megatron-lm!4286 - Add cpu offloading interface
youngeunkwon0405 Oct 22, 2025
edcbc17
ADLR/megatron-lm!4170 - chore: delete utils_object_storage.py
shunjiad Oct 22, 2025
9cb2518
ADLR/megatron-lm!4316 - ci: Temporarily block external contributions
ko3n1g Oct 22, 2025
c0a595d
ADLR/megatron-lm!4272 - Track and cleanup NSys NVTX context
maanug-nv Oct 22, 2025
268fda0
ADLR/megatron-lm!3955 - Megatron-FSDP Expert Parallel (DeepSeek-v3) S…
shjwudp Oct 23, 2025
a3a1f06
ci: Add copyright checker for GitHub CI
ko3n1g Oct 23, 2025
f82223f
ci: Fix copyright checker (#1889)
ko3n1g Oct 23, 2025
6c57be9
ci: Run on dev
ko3n1g Oct 23, 2025
e7106d2
ci: Fix linter
ko3n1g Oct 23, 2025
4ddd50d
ci: Fix copyright checker (#1893)
ko3n1g Oct 23, 2025
2a01637
ci: Linting on main
ko3n1g Oct 23, 2025
218b0e0
ci(fix): HAS_RUN_TESTS_LABEL
ko3n1g Oct 23, 2025
e0b3d5b
ci: Fix linting
ko3n1g Oct 23, 2025
3364dba
ci(fix): Do not run linting on push
ko3n1g Oct 23, 2025
8325951
chore: Add codeowners (#1897)
ko3n1g Oct 23, 2025
2e38079
chore: Update codeowners
ko3n1g Oct 23, 2025
4d14c57
ci(fix): No copyright on push
ko3n1g Oct 23, 2025
a350a6e
ci: Extend queue-manager for dev branch (#1906)
ko3n1g Oct 23, 2025
adf4247
ci: Move test optimizer into its own bucket (#1909)
ko3n1g Oct 24, 2025
1edc4d6
ci: Use matrix for approval-bot
ko3n1g Oct 24, 2025
04e640b
ci: Update function name
ko3n1g Oct 24, 2025
c7f154f
ci: Adjust approval-bot for copy-pr-bot
ko3n1g Oct 24, 2025
019084e
ci: Parametrize workflow
ko3n1g Oct 24, 2025
aff784e
ci: Parametrize workflow
ko3n1g Oct 24, 2025
4d282bf
ci: Remove attribute
ko3n1g Oct 24, 2025
a597390
ci: Update container image tag to use GitHub SHA
ko3n1g Oct 24, 2025
94c6526
chore: Remove file
ko3n1g Oct 24, 2025
5bfda01
ci: Fix approval bot
ko3n1g Oct 24, 2025
7e661c9
ci: Configure cherrypick bot (#1925)
ko3n1g Oct 24, 2025
9697129
Ci approve dev (#1933)
ko3n1g Oct 25, 2025
6a8bbe9
ci: Update nightly schedule (#1934)
ko3n1g Oct 25, 2025
3bd66cf
ci: Bump pre-flight for runs on main/dev (#1935)
ko3n1g Oct 25, 2025
e965a15
ci: Allow skipping on main (#1936)
ko3n1g Oct 25, 2025
621d17c
Ko3n1g/ci/pr template community bot (#1937)
ko3n1g Oct 25, 2025
3e07859
ci: More granular unit tests buckets (#1932)
ko3n1g Oct 25, 2025
b2b9b42
Add sequence packing to RL (#1911)
tdene Oct 25, 2025
bf06bbe
chore: Update template (#1939)
ko3n1g Oct 25, 2025
ddddc2f
chore: Add description about who can merge (#1940)
ko3n1g Oct 25, 2025
6d1aa99
Ko3n1g/ci/fix main on eos (#1938)
ko3n1g Oct 25, 2025
0d5f25f
Ko3n1g/ci/internal mrs (#1942)
ko3n1g Oct 25, 2025
903c4b0
ci: Fix branch of approval bot (#1944)
ko3n1g Oct 25, 2025
47a99af
ci: Approvalbot for other branches (#1947)
ko3n1g Oct 25, 2025
768c312
ci(fix): Approval bot (#1949)
ko3n1g Oct 25, 2025
18f40ef
ci(fix): Approval gate
ko3n1g Oct 26, 2025
858dec9
ci: Approval gate rule
ko3n1g Oct 26, 2025
1daf7bf
ci: Update golden values nightly
ko3n1g Oct 26, 2025
856cabb
ci: Approval gate
ko3n1g Oct 26, 2025
07733ed
ci: Approval bot
ko3n1g Oct 26, 2025
e7648f5
ci: Sync branches
ko3n1g Oct 26, 2025
7b5e355
ci: Smaller image
ko3n1g Oct 26, 2025
cee934a
ci: Better output
ko3n1g Oct 26, 2025
391ece2
ci: sync branches
ko3n1g Oct 26, 2025
dd72236
ci: Fix sync bot
ko3n1g Oct 26, 2025
b16831b
ci: Finalize
ko3n1g Oct 26, 2025
da4b142
ci: Finalize
ko3n1g Oct 26, 2025
1426dad
Ko3n1g/ci/sync branches (#1956)
ko3n1g Oct 26, 2025
db1a2e8
ci: Increase time limit for main tests
ko3n1g Oct 26, 2025
2f384db
Ko3n1g/ci/add milestone (#1951)
ko3n1g Oct 26, 2025
e36a875
Remove M-FSDP testing under LTS environment (#1959)
shjwudp Oct 26, 2025
aed29d4
ci: Run on push to release branch (#1960)
ko3n1g Oct 26, 2025
f6d1db9
ci: Add golden values for inference
ko3n1g Oct 27, 2025
f34fa11
Fix typo in rl section of CODEOWNERS (#1968)
tdene Oct 27, 2025
4c2768a
ci: Update copyright checker (#1973)
ko3n1g Oct 27, 2025
df17c98
Ko3n1g/ci/auto reminder GitHub (#1955)
ko3n1g Oct 27, 2025
2dac593
ci: Update secret
ko3n1g Oct 27, 2025
ece4205
ci(fix): `Run tests` label (#1970)
ko3n1g Oct 28, 2025
f110cd0
ci(hotfix): Disable tests again
ko3n1g Oct 28, 2025
31cf35c
ci(hotfix): Add merge-group to copyright check
ko3n1g Oct 28, 2025
ca52416
ci(hotfix): Copyright check on merge-queue
ko3n1g Oct 28, 2025
04cb1b0
zarr soft deprecation (#2004)
dimapihtar Oct 28, 2025
6080057
Make `get_asyncio_loop` safe to use repeatedly (#1990)
tdene Oct 28, 2025
c5ac863
Update symmetric registration interface to sync-up with upstream pyto…
youngeunkwon0405 Oct 28, 2025
cd7314a
chore: Update codeowners (#2012)
ko3n1g Oct 28, 2025
75bf979
Deduplicate dynamic engine + coordinator. (#1981)
lmcafee-nvidia Oct 28, 2025
0b06db0
Safely access state dict args in load ckpt (#1957)
maanug-nv Oct 28, 2025
afa7361
Allow mixed-batch sampling in dynamic inference (#1927)
tdene Oct 29, 2025
69d23c4
Stop Nemo_CICD_Test from failing in forks (#2024)
tdene Oct 29, 2025
e640a89
Clean up dynamic inference step (#1992)
tdene Oct 29, 2025
e6e0769
ci: Auto-update copy-pr-bot vetters (#1850)
ko3n1g Oct 29, 2025
75c0721
Have datasets account for tokenizers which incorrectly define PAD (#2…
tdene Oct 29, 2025
c7a9003
ci: Enable integration tests (#2023)
ko3n1g Oct 29, 2025
f9a1fff
ci: Fix build-push-wheel workflow (#2022)
ko3n1g Oct 29, 2025
eb0a744
chore: Update tooling for interactive jobs (#2032)
ko3n1g Oct 29, 2025
d5a9645
revert(hotfix): ci: trustees_override (#2041)
ko3n1g Oct 30, 2025
9458be9
add missing warnings import in model parallel config (#2039)
yashaswikarnati Oct 30, 2025
bb21676
Reduce-scatter implementation with FP32 accumulation (#1967)
deepakn94 Oct 30, 2025
629af78
ci(fix): Workflows on `main` (#2045)
ko3n1g Oct 30, 2025
8b42b9e
build: Bump modelopt (#2046)
ko3n1g Oct 30, 2025
27be0ce
Remove TestCaptureFreezeGC unit test. (#1978)
lmcafee-nvidia Oct 30, 2025
852870c
ci: Add multi-approval action (#2051)
ko3n1g Oct 30, 2025
4c2e1c9
ci(hotfix): Repair codeowners file
ko3n1g Oct 30, 2025
a07e00b
ci(hotfix): Set docs allowed to fail
ko3n1g Oct 31, 2025
f559059
Ko3n1g/ci/test iteration time (#2067)
ko3n1g Oct 31, 2025
818e072
ci(hotfix): Remove performance for ckpt-resume
ko3n1g Oct 31, 2025
f248fcb
Allow inference test throughput to vary by 10% (#2070)
mathemakitten Oct 31, 2025
e715d2f
ci(hotfix): Inference test pipeline
ko3n1g Oct 31, 2025
aad8761
chore: Fix autoformatter (#2073)
ko3n1g Oct 31, 2025
e3ae351
ci(hotfix): Remove iteration-time from t5
ko3n1g Oct 31, 2025
87cbe76
ci(hotfix): disable inference test
ko3n1g Nov 1, 2025
d0d00b3
ci(hotfix): Disable inference test
ko3n1g Nov 2, 2025
88e3a8a
ci(hotfix): Bypass approvalbot in merge-queue (#2082)
ko3n1g Nov 2, 2025
53305bc
ci(hotfix): Enable merge-group for approval bot
ko3n1g Nov 2, 2025
7c16ca0
chore: Update local tooling (#2066)
ko3n1g Nov 2, 2025
dc7a0ca
Add extra RL files (#2077)
tdene Nov 2, 2025
5cfad7b
Prevent summary jobs from running in forks (#2083)
tdene Nov 2, 2025
ba21b69
ci: Fix test scope (#2091)
ko3n1g Nov 2, 2025
7ca2890
ci(hotfix): Remove publish workflows
ko3n1g Nov 3, 2025
a652e2c
Refactor the attention metadata into separate classes (#2001)
kanz-nv Nov 3, 2025
65cd27c
Guard against incorrectly using MoE prefill graphs (#2030)
tdene Nov 3, 2025
d3f1af4
Revert "Refactor the attention metadata into separate classes (#2001)"
ko3n1g Nov 3, 2025
5671e3a
Run mr-slim tests in lightweight-mode (#2106)
chtruong814 Nov 3, 2025
7487c53
Inference | Lazy compile UVM allocator. (#1977)
lmcafee-nvidia Nov 3, 2025
1307f87
chore: Reenable trustees (#2108)
ko3n1g Nov 3, 2025
282b74c
Revert "Inference | Lazy compile UVM allocator. (#1977)"
ko3n1g Nov 3, 2025
2cab46f
ci(fix): Changeset of copyright checker (#2110)
ko3n1g Nov 3, 2025
d4194b7
Ko3n1g/chore/update release settings (#2097)
ko3n1g Nov 3, 2025
5dee638
Remove unnecessary check on rotary_pos_cos (#2003)
santhnm2 Nov 4, 2025
aecce9e
(Reverted) Inference | Lazy compile UVM allocator. (#2125)
lmcafee-nvidia Nov 4, 2025
1586563
Refactor Attention Metadata to Separate Classes (#2112)
kanz-nv Nov 4, 2025
46e066b
Refactor model_provider to model_builder format for ModelOpt examples…
AAnoosheh Nov 5, 2025
26b2eb5
wandb Inference stats logging (#2026)
wdykas Nov 5, 2025
9be6d47
Make `PipelineParallelLayout` always return str from ` __repr__` (#2055)
ananthsub Nov 5, 2025
a32ff75
Add flash_attn_3 as first option for FA3 import (#2010)
santhnm2 Nov 5, 2025
f119a06
Add debugging hint for case when cudagraphs are created but no matchi…
mathemakitten Nov 5, 2025
eb48e81
ci: LTS container (#2133)
ko3n1g Nov 5, 2025
75f87c2
Revert "ci: LTS container (#2133)"
ko3n1g Nov 5, 2025
08c3771
Fix param init (#2033)
cuichenx Nov 5, 2025
f150f42
Hotfix to unit tests on hopper FA3 (#2143)
tdene Nov 5, 2025
10146c6
Add BytesIO to safe_globals (#2074)
tdene Nov 6, 2025
f167a85
add deprecation warning for legacy tokenizer system (#2145)
dimapihtar Nov 6, 2025
23a1dca
replay: ci: Bump LTS container (#2157)
ko3n1g Nov 6, 2025
0abff08
Hotfix to unit tests on hopper FA3 (bis) (#2179)
tdene Nov 7, 2025
0981e3c
Fix has_modelopt_state() for native Torch checkpoint format (#2160)
AAnoosheh Nov 7, 2025
c63b921
chore: Remove codeowners (#2175)
ko3n1g Nov 7, 2025
9aa14ed
Fix FP8 inference with sequence parallelism (#2009)
santhnm2 Nov 7, 2025
0f8fb9b
Replace ModelOpt generation server (#2147)
AAnoosheh Nov 7, 2025
e07c4a4
Add hybrid model support for dynamic inference engine (#1907)
santhnm2 Nov 7, 2025
82e846d
Async task and event loop safety in Megatron Core (#2025)
tdene Nov 10, 2025
c193bf5
Rename skip_prompt_log_probs (#2181)
tdene Nov 10, 2025
d6979d6
Dynamic inference context | UVM only. (#1983)
lmcafee-nvidia Nov 10, 2025
a59223d
Update copy-pr-bot.yaml [skip ci]
ko3n1g Nov 10, 2025
7055186
Revert "Dynamic inference context | UVM only. (#1983)"
ko3n1g Nov 10, 2025
75f7d50
ci: Run `auto-update-copy-pr-bot` only on forks (#2191)
ko3n1g Nov 10, 2025
2fef6bb
Inference throughput tests: refactor goldens to be in list format (#2…
mathemakitten Nov 10, 2025
1f6cde8
Enable TE custom quantization recipe (#2005)
negvet Nov 11, 2025
0acf6c2
Add MoE parameters to ModelOpt pruning example + conf fixes (#2205)
kevalmorabia97 Nov 11, 2025
49061f1
Add repr to pg collection class (#2089)
yashaswikarnati Nov 11, 2025
265af20
Move `data_samplers.py` from `legacy` to `training.datasets` & add `D…
asolergi-nv Nov 11, 2025
d82a6d8
Fix Megatron-FSDP checkpoint save failure (#2138)
shjwudp Nov 12, 2025
bcf2a59
Fix moe CODEOWNERS. (#2200)
jaredcasper Nov 12, 2025
08360ec
chore: Update LICENSE (#2219)
ko3n1g Nov 12, 2025
45b40bb
remove `megatron.training` dependency from `megatron.core` for FSDP c…
ananthsub Nov 12, 2025
909c746
Revert "remove `megatron.training` dependency from `megatron.core` fo…
ko3n1g Nov 12, 2025
7db8ae4
Tensorize dynamic inference mixed sampling (#2105)
tdene Nov 12, 2025
ac9221d
Revert "Tensorize dynamic inference mixed sampling (#2105)"
ko3n1g Nov 12, 2025
989d13e
Add unit test for inference DP coordinator (#2187)
tdene Nov 12, 2025
bb5a0fd
Inference linear layer (#1908)
sidsingh-nvidia Nov 12, 2025
34932c7
Update copy-pr-bot.yaml [skip ci]
github-actions[bot] Nov 13, 2025
b958982
ci(hotfix): Auto-update copy-pr-bot
github-actions[bot] Nov 13, 2025
dbc4a4f
chore: Prefer Nvidia email addresses for reminder bot (#2221)
ko3n1g Nov 13, 2025
aa4ec99
[Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter (…
shjwudp Nov 13, 2025
9d91916
Remove qwen symlink to fix for case-insensitive FS (#2235)
kevalmorabia97 Nov 13, 2025
7d3f4a0
Optimizer refactor: clean up public `get_megatron_optimizer` interfac…
deepakn94 Nov 13, 2025
9fd43fa
Fix CI for PR#1983 (#2245)
lmcafee-nvidia Nov 13, 2025
70f85eb
Enable kv cache in training for eagle (#1895)
yeyu-nvidia Nov 13, 2025
b7ef391
Fix aux-loss logging for hybrid models (#2197)
deepakn94 Nov 13, 2025
610a75e
Update flops calculation (for throughput) for hybrid MoEs (#2198)
deepakn94 Nov 13, 2025
2751749
Add MoE layer type to hybrid models (#2196)
deepakn94 Nov 13, 2025
9be7c7b
Tensorize dynamic inference mixed sampling (bis) (#2231)
tdene Nov 14, 2025
c4f83f0
Revert "Add MoE layer type to hybrid models (#2196)"
ko3n1g Nov 14, 2025
41eecc4
ci(hotfix): Checkout repo before install check
ko3n1g Nov 14, 2025
c4ba666
chore: Fix codeowners (#2264)
ko3n1g Nov 15, 2025
4696d42
Allow loading checkpoint from iteration 0 (#2199)
ananthsub Nov 17, 2025
a2d8519
ci: Skip install test in merge queue (#2281)
chtruong814 Nov 17, 2025
9a1c0d0
Add MoE layer type to hybrid models (#2259)
deepakn94 Nov 18, 2025
3df2009
Add the Hybrid-EP backend to the Flex Dispatcher (#2176)
Autumn1998 Nov 18, 2025
e8b9df1
[MAIN][NVFP4] Support NVFP4 MOE with Proper Padding (#1985)
zhongbozhu Nov 18, 2025
a755887
Update ModelOpt example readmes and advanced usage (#2273)
kevalmorabia97 Nov 18, 2025
dcd3b39
Fix UVM compatibility with CUDA 13. (#2243)
lmcafee-nvidia Nov 18, 2025
5e3fa28
ci: Add flaky marker to LTS tests (#2290)
ko3n1g Nov 18, 2025
29eed5d
Dynamic engine suspend/resume via prefill. (#1982)
lmcafee-nvidia Nov 18, 2025
3b83c3f
Revert "Dynamic engine suspend/resume via prefill. (#1982)"
ko3n1g Nov 18, 2025
19d0422
fix: Pass the timeout argument for the EP group (#2268)
yanring Nov 19, 2025
efdc681
JIT for MoE router and preprocess (#1919)
yaox12 Nov 19, 2025
00884a8
Hotfix to CI, until the fix gets reviewed (#2298)
tdene Nov 19, 2025
f885d9c
Add functional test for DP coordinator throughput (#2189)
tdene Nov 19, 2025
70db86a
Add asyncio Queue like in Python 3.13 (#2224)
tdene Nov 19, 2025
744505e
Fixes for PR#1982 (#2303)
lmcafee-nvidia Nov 19, 2025
314a378
Fix PP KV cache allocation and enable multi-node PP inference (#2182)
santhnm2 Nov 19, 2025
21968ea
Revert active-buffer-size-gb arg name. (#2257)
lmcafee-nvidia Nov 19, 2025
712dff8
feat: check: api backwards compatibility (#2251)
pablo-garay Nov 19, 2025
6c8cdd5
Add MambaInferenceStateConfig dataclass (#2265)
santhnm2 Nov 19, 2025
dc473f9
Fix typo in inference example (#2311)
santhnm2 Nov 20, 2025
7dec856
feat: initialization of API backward compatibility verification (#2310)
pablo-garay Nov 20, 2025
e4b7259
Fix Mamba TP and remove confusing legacy initialization (#2202)
jaredcasper Nov 20, 2025
8463257
Refactor KD to use ModelOpt plugins file (#2305)
AAnoosheh Nov 20, 2025
9ce2482
mcore trigger mbridge
pablo-garay Nov 20, 2025
c2b1c7c
mcore trigger mbridge
pablo-garay Nov 20, 2025
a813740
mcore trigger mbridge
pablo-garay Nov 20, 2025
7e18da2
Revert "Refactor KD to use ModelOpt plugins file (#2305)"
ko3n1g Nov 20, 2025
8e830a1
Fix dynamic context syntax and remove redundant tensors (#2336)
kanz-nv Nov 20, 2025
475d7fa
Improve asyncio exception handling (#2300)
tdene Nov 20, 2025
5ab6392
ci: Upload to testpypi only on main (#2342)
ko3n1g Nov 21, 2025
0634924
implement graph config (#2203)
kanz-nv Nov 21, 2025
ddc55cd
Revert "implement graph config (#2203)"
ko3n1g Nov 21, 2025
f7fb5ec
feat: required check adjustment (#2350)
pablo-garay Nov 21, 2025
e772e06
synthesize, optimize
pablo-garay Nov 21, 2025
2cc0736
synthesize, optimize
pablo-garay Nov 21, 2025
f426230
Change default baseline commit for api compat check
pablo-garay Nov 21, 2025
f07cb14
fix: load iteration 0 for release checkpoints (#2351)
ananthsub Nov 21, 2025
81a87e2
Break apart dynamic inference step into 2 methods (#2192)
tdene Nov 21, 2025
c90160d
Bugfix for Mamba with Chunked-Prefill (#2293)
sidsingh-nvidia Nov 21, 2025
c9d2c8f
Explicitly zero out padding token activations for dynamic inference (…
santhnm2 Nov 21, 2025
63d4e7d
Refactor KD to use ModelOpt plugins file (v2) (#2355)
AAnoosheh Nov 21, 2025
29a810e
Prevent unnecessarily overwriting the default Hugging Face chat templ…
santhnm2 Nov 21, 2025
7994405
add FIM dataset support (#2291)
dimapihtar Nov 21, 2025
e35495d
Update DEFAULT_BASELINE in workflow configuration
pablo-garay Nov 22, 2025
233b5b0
Revert "Explicitly zero out padding token activations for dynamic inf…
chtruong814 Nov 22, 2025
90c8536
Clean up DP coord code & unit test (#2277)
tdene Nov 22, 2025
8daf046
[4/4] Merge Megatron-RL into LM (#2002)
tdene Nov 22, 2025
53bbf7a
Update coordinator control logic to be compatible with RL (#2227)
tdene Nov 22, 2025
8954e04
ci: Update backwards compat check baseline to 53bbf7a (#2361)
chtruong814 Nov 22, 2025
d313c6d
Account for test regression caused by prints (#2354)
tdene Nov 22, 2025
14464d1
Remove dependency on `megatron.training` within `megatron.core` (#2274)
ananthsub Nov 22, 2025
9873958
Fixes for gpt-oss (#2038)
cuichenx Nov 22, 2025
26b2e72
update
pablo-garay Nov 24, 2025
326ec8c
[HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher (#2286)
Autumn1998 Nov 24, 2025
17cd106
ci: Remove nemo-ci environment (#2364)
chtruong814 Nov 24, 2025
278e058
ci: Pass COMMUNITY_PROJECT_ID to community bot (#2366)
chtruong814 Nov 24, 2025
d61029f
ci: Remove environment from community-bot (#2376)
chtruong814 Nov 24, 2025
9269dda
monitoring & results in mcore
pablo-garay Nov 24, 2025
77b65ed
Add mbridge_ref input to select MBridge branch
pablo-garay Nov 24, 2025
aa7a564
Fix: Use correct repo NVIDIA-NeMo/Megatron-Bridge and add mbridge_ref…
pablo-garay Nov 24, 2025
7f70e22
gha action
pablo-garay Nov 24, 2025
c28b84e
ci: Bump commit for api check to d61029f (#2386)
chtruong814 Nov 24, 2025
ab1e26e
tidy / synthesize / enhance
pablo-garay Nov 24, 2025
56e8810
Merge branch 'main' of https://github.com/NVIDIA/Megatron-LM
pablo-garay Nov 24, 2025
bc242d9
Revert: trigger_mbridge_tests.yml file change (#2389)
pablo-garay Nov 25, 2025
49eef58
build: Upgrade deps (#2289)
ko3n1g Nov 25, 2025
2a51d86
Change KV cache init to empty to speedup graph recording and first pr…
kanz-nv Nov 25, 2025
4c7d3d6
Handle UVM compile lock issues (#2299)
tdene Nov 25, 2025
14b791b
Remove experimental tags for fused kernels. (#2233)
Victarry Nov 25, 2025
ffb8c35
Reduce Overhead in Timers (#2210)
yaox12 Nov 25, 2025
60df5c2
Revert "build: Upgrade deps (#2289)"
ko3n1g Nov 25, 2025
ba9caf4
Fix the entropy sign. (#2374)
yobibyte Nov 25, 2025
77a2d8b
Remove RL use of mock dataloader and kill RL inference interface on e…
jon-barker Nov 25, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
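Reviewer note: the settings above interact in a way that is easy to miss. The following is a minimal sketch, assuming the file is plain INI that Python's `configparser` can read; the helper `load_flake8_settings` is hypothetical and is not how flake8 itself parses its configuration. It checks two things: whether `E501` (line too long) is in the ignore list, which would leave `max-line-length = 100` without effect for that check unless a plugin also reads the limit, and whether any `per-file-ignores` code is already ignored globally.

```python
import configparser

# The .flake8 contents from the diff above, copied verbatim.
FLAKE8_INI = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def load_flake8_settings(text: str) -> dict:
    """Hypothetical helper: parse a flake8 INI section into plain Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]

    # Comma-separated error codes that are ignored everywhere.
    ignored = {
        code.strip()
        for code in section.get("extend-ignore", "").split(",")
        if code.strip()
    }

    # Whitespace-separated "pattern:CODE,CODE" entries (simplified parsing).
    per_file = {}
    for entry in section.get("per-file-ignores", "").split():
        pattern, _, codes = entry.partition(":")
        per_file[pattern] = {c for c in codes.split(",") if c}

    return {
        "max_line_length": section.getint("max-line-length"),
        "ignored": ignored,
        "per_file": per_file,
    }


settings = load_flake8_settings(FLAKE8_INI)

# E501 is the line-length check; if it is ignored, the limit is not enforced by it.
line_length_enforced = "E501" not in settings["ignored"]

# Per-file codes that are already ignored globally add nothing.
redundant = {
    pattern: codes & settings["ignored"]
    for pattern, codes in settings["per_file"].items()
}

print(settings["max_line_length"], line_length_enforced, redundant)
```

If the intent is to enforce the 100-column limit through flake8, dropping `E501` from `extend-ignore` would be one option; if a formatter owns line length instead, the current setup is consistent and the `per-file-ignores` entry for `F401` is simply redundant.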
50 changes: 50 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,50 @@
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/models/gpt/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/gpt

megatron/core/models/multimodal/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/multi-modal

megatron/core/models/mamba/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba
megatron/core/ssm/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/hybrid-mamba

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/distributed/fsdp/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp

megatron/core/dist_checkpointing/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-checkpointing

megatron/core/optimizer/distrib_optimizer/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/dist-optimizer

megatron/core/inference/modelopt_support @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/quantization-and-inference

megatron/core/datasets/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/datasets

megatron/core/pipeline_parallel/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/pipeline-parallelism

megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech

megatron/core/inference/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/inference

megatron/core/parallel_state.py @NVIDIA/core-adlr @NVIDIA/core-nemo

megatron/core/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

megatron/post_training/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/post-training

.gitlab/ @NVIDIA/ci
.github/ @NVIDIA/ci
.gitlab-ci.yml @NVIDIA/ci
docker/ @NVIDIA/ci
tests/functional_tests/python_test_utils/ @NVIDIA/ci
tests/functional_tests/shell_test_utils/ @NVIDIA/ci
tests/test_utils/recipes/ @NVIDIA/ci
tests/unit_tests/run_ci_test.sh @NVIDIA/ci

megatron/rl/ @NVIDIA/reinforcement-learning
examples/rl/ @NVIDIA/reinforcement-learning
test/unit_tests/test_rl_utils.py @NVIDIA/reinforcement-learning
train_rl.py @NVIDIA/reinforcement-learning
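Reviewer note: GitHub documents that the last matching pattern in a CODEOWNERS file takes precedence, so the order of the entries above matters. The sketch below illustrates that rule with a deliberately simplified matcher: it assumes no wildcards and treats every pattern as anchored at the repository root, which is not GitHub's full matching logic. The function names are hypothetical. Under these assumptions, the file-specific entry for `fsdp_dtensor_checkpoint.py` appears to be shadowed by the later `megatron/core/transformer/` entry, and the same would apply to `megatron/core/inference/modelopt_support` versus the later `megatron/core/inference/`.

```python
from typing import List, Optional, Tuple

Rule = Tuple[str, List[str]]

# A subset of the CODEOWNERS entries from the diff above, in the same order.
SAMPLE = """\
megatron/core/ @NVIDIA/core-adlr @NVIDIA/core-nemo
megatron/core/transformer/fsdp_dtensor_checkpoint.py @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/megatron-fsdp
megatron/core/transformer/ @NVIDIA/core-adlr @NVIDIA/core-nemo
megatron/core/transformer/moe/ @NVIDIA/core-adlr @NVIDIA/core-nemo @NVIDIA/mixture-of-experts-adlr @NVIDIA/mixture-of-experts-devtech
"""


def parse_codeowners(text: str) -> List[Rule]:
    """Split each non-empty, non-comment line into (pattern, owners)."""
    rules: List[Rule] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, *owners = line.split()
        rules.append((pattern, owners))
    return rules


def matches(pattern: str, path: str) -> bool:
    """Simplified match: directory prefix or exact path, no wildcard support."""
    pattern = pattern.lstrip("/")
    if pattern.endswith("/"):
        return path.startswith(pattern)
    return path == pattern or path.startswith(pattern + "/")


def owners_for(rules: List[Rule], path: str) -> Optional[List[str]]:
    """Return the owners of the LAST rule that matches, or None if none match."""
    result: Optional[List[str]] = None
    for pattern, owners in rules:
        if matches(pattern, path):
            result = owners  # later rules override earlier ones
    return result


rules = parse_codeowners(SAMPLE)
print(owners_for(rules, "megatron/core/transformer/fsdp_dtensor_checkpoint.py"))
print(owners_for(rules, "megatron/core/transformer/moe/router.py"))
```

If the shadowing is unintended, moving the more specific entries below the broader directory entries would let them win; the `moe/` entry already follows that ordering.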
28 changes: 28 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,28 @@
---
name: Bug report
about: Create a report to help us improve the repository or project
title: ""
labels: bug
assignees: ''

---

**Describe the bug**

A clear and concise description of what the bug is.

**Steps/Code to reproduce bug**

Please list *minimal* steps or a code snippet for us to be able to reproduce the bug.

A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports.


**Expected behavior**

A clear and concise description of what you expected to happen.


**Additional context**

Add any other context about the problem here.
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,2 @@
blank_issues_enabled: false

20 changes: 20 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
---
name: Feature request
about: Suggest an idea for this project
title: ""
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce, the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see?

**New performance**
What speed or accuracy do you see after the update?

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue, state it here or link to a PR.

**Additional context**
Add any other context about the problem here.