@Gargi-jais11
Contributor

What changes were proposed in this pull request?

In Ozone, "ozone.metadata.dirs" is used in many places as the fallback when a more specific property is not defined. If OM, SCM, DN, and S3G are all installed on different nodes, this fallback causes no problem. But when they share a node, the directories conflict: OM has a Ratis directory, and so do SCM and DN. So for the fallback, we should add the component name to the directory, for example om.ratis, scm.ratis, and datanode.ratis.

For Ratis, the default directories change as follows:

Before fix:
/data/metadata/ratis/ ← SCM, OM, DataNode ALL tried to use this!❌ 

After fix:
/data/metadata/scm.ratis ← SCM only ✅ 
/data/metadata/om.ratis ← OM only ✅ 
/data/metadata/datanode.ratis ← DataNode only ✅
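
The naming scheme above can be sketched in a few lines of Java. This is a minimal illustration, not the code in this patch: the class `ComponentRatisDirSketch`, the `Component` enum, and the `defaultRatisDir` method are hypothetical names, and the sketch only builds paths without touching the file system.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class ComponentRatisDirSketch {

  /** Hypothetical stand-in for the component types named in this PR. */
  enum Component { SCM, OM, DATANODE }

  /**
   * Builds "<metadataDir>/<component>.ratis", e.g. /data/metadata/om.ratis.
   * Pure path arithmetic: nothing is created on disk.
   */
  static Path defaultRatisDir(String metadataDir, Component component) {
    String dirName = component.name().toLowerCase(Locale.ROOT) + ".ratis";
    return Paths.get(metadataDir, dirName);
  }

  public static void main(String[] args) {
    // Each component gets its own directory under the shared metadata dir.
    for (Component c : Component.values()) {
      System.out.println(c + " -> " + defaultRatisDir("/data/metadata", c));
    }
  }
}
```

Because every component derives a distinct directory name from the same base, colocated services no longer compete for a single ratis directory.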

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13866

How was this patch tested?

Added new UT.

@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review November 18, 2025 08:46
@ChenSammi
Contributor

In addition, please

  1. Remove getSCMRatisDirectory from SCMHAUtils, and rename getRatisStorageDir to getSCMRatisDirectory.
  2. Update the DB snapshot directories too: OZONE_SCM_HA_RATIS_SNAPSHOT_DIR, OZONE_OM_RATIS_SNAPSHOT_DIR.

@Gargi-jais11 Gargi-jais11 marked this pull request as draft December 6, 2025 08:44
@Gargi-jais11 Gargi-jais11 force-pushed the HDDS-13866-component-specific-ratis branch from b5e0003 to d757e55 Compare December 6, 2025 09:13
@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review December 9, 2025 05:46
@ChenSammi ChenSammi requested a review from Copilot December 9, 2025 06:12
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a critical issue where SCM, OM, and DataNode services fail to start when colocated on the same host due to conflicts over the shared Ratis directory. The fix introduces component-specific Ratis directory naming (e.g., scm.ratis, om.ratis, datanode.ratis) while maintaining backward compatibility with existing installations.

Key changes:

  • Modified getDefaultRatisDirectory() and getDefaultRatisSnapshotDirectory() to accept NodeType parameter for component-specific directory creation
  • Added backward compatibility logic to detect and use existing old directory structures during upgrades
  • Updated all callers to pass appropriate NodeType when requesting default Ratis directories
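
The backward-compatibility idea in the second bullet might look roughly like the following. This is a hedged sketch rather than the logic in ServerUtils.java: `RatisDirFallbackSketch`, `isNonEmptyDir`, and `resolveRatisDir` are hypothetical names, and treating an empty old directory as "not in use" is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class RatisDirFallbackSketch {

  /** True only if dir is an existing directory with at least one entry. */
  static boolean isNonEmptyDir(Path dir) {
    if (!Files.isDirectory(dir)) {
      return false;
    }
    try (Stream<Path> entries = Files.list(dir)) {
      return entries.findAny().isPresent();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /**
   * Prefers the legacy directory when it already holds data (upgrade case);
   * otherwise returns the new component-specific directory.
   */
  static Path resolveRatisDir(Path metadataDir, String legacyName, String componentName) {
    Path legacy = metadataDir.resolve(legacyName);
    if (isNonEmptyDir(legacy)) {
      return legacy;
    }
    return metadataDir.resolve(componentName);
  }

  public static void main(String[] args) throws IOException {
    Path meta = Files.createTempDirectory("meta");
    // Fresh install: no legacy directory exists yet.
    System.out.println(resolveRatisDir(meta, "ratis", "om.ratis"));
    // Legacy directory exists but is empty.
    Path legacy = Files.createDirectory(meta.resolve("ratis"));
    System.out.println(resolveRatisDir(meta, "ratis", "om.ratis"));
    // Legacy directory now holds data, as it would after running an older version.
    Files.createFile(legacy.resolve("in_use.lock"));
    System.out.println(resolveRatisDir(meta, "ratis", "om.ratis"));
  }
}
```

Keeping the decision in one function means each caller only has to supply its own legacy and component-specific directory names.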

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
ServerUtils.java Core implementation of component-specific Ratis directory logic with backward compatibility checks
TestServerUtils.java Comprehensive test suite validating new directory structure and backward compatibility scenarios
SCMHAUtils.java Updated SCM to use component-specific Ratis directories and removed deprecated getRatisStorageDir() method
OzoneManagerRatisUtils.java Updated OM to use component-specific Ratis directories
HddsServerUtil.java Updated DataNode to use component-specific Ratis directories
OzoneManager.java Updated reference from deprecated method to new getSCMRatisDirectory()
TestOzoneHARatisLogParser.java Updated test to use proper utility methods for getting Ratis directories
TestStorageContainerManager.java Updated test to use new method name
RatisUtil.java Updated to use new getSCMRatisDirectory() method
testOMHA.robot Updated test constant to reflect new OM-specific Ratis directory path

assertNotEquals(omRatisDir, dnRatisDir);

// Verify the base metadata dir exists
assertTrue(metaDir.exists());
Contributor

Who creates the base metadata dir? getDefaultRatisDirectory?

Contributor Author

Yes, getDefaultRatisDirectory creates the base metadata directory indirectly.
So when getDefaultRatisDirectory is called, it:

  • Calls getOzoneMetaDirPath() to get the metadata directory
  • That calls getDirectoryFromConfig(), which creates the metadata directory
  • Then returns the component-specific Ratis directory path
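
The call chain described above, where asking for a path creates the base directory as a side effect, can be illustrated with a small stand-alone sketch. The names `MetaDirSideEffectSketch`, `getOrCreateDir`, and `defaultRatisDirectory` are hypothetical and only mimic the behaviour described in this reply; that the component-specific directory itself is not created is an assumption of the sketch.

```java
import java.io.File;
import java.nio.file.Paths;

public class MetaDirSideEffectSketch {

  /** Hypothetical "get directory from config" helper: creates the directory if it is missing. */
  static File getOrCreateDir(String configuredPath) {
    File dir = new File(configuredPath);
    if (!dir.exists() && !dir.mkdirs()) {
      throw new IllegalArgumentException("Unable to create directory " + dir);
    }
    return dir;
  }

  /** Returns the component-specific path; only the base directory is created as a side effect. */
  static String defaultRatisDirectory(String metadataPath, String component) {
    File metaDir = getOrCreateDir(metadataPath);
    return Paths.get(metaDir.getPath(), component + ".ratis").toString();
  }

  public static void main(String[] args) {
    File base = new File(System.getProperty("java.io.tmpdir"), "meta-sketch-" + System.nanoTime());
    String ratisDir = defaultRatisDirectory(base.getPath(), "om");
    System.out.println("base exists: " + base.exists());
    System.out.println("ratis dir exists: " + new File(ratisDir).exists());
  }
}
```

A test that asserts the base directory exists after the call is therefore checking this side effect, not an explicit directory creation in the test itself.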

@Gargi-jais11 Gargi-jais11 marked this pull request as draft December 10, 2025 04:51
@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review December 10, 2025 07:08
Contributor

@aryangupta1998 aryangupta1998 left a comment

Thanks for the patch @Gargi-jais11. Can you please add a small unit test for the case where the existing old dir is empty?
Maybe something like this:

@Test
public void testEmptyOldSharedRatisIgnored() throws IOException {
    final File metaDir = new File(folder.toFile(), "upgradeMetaDir");
    final File oldSharedRatisDir = new File(metaDir, "ratis");
    final OzoneConfiguration conf = new OzoneConfiguration();
    conf.set(HddsConfigKeys.OZONE_METADATA_DIRS, metaDir.getPath());

    try {
        // Create old Ratis directory (empty)
        assertTrue(oldSharedRatisDir.mkdirs());

        // SCM should use new SCM path
        String scmRatisDir = ServerUtils.getDefaultRatisDirectory(conf, SCM);
        assertEquals(Paths.get(metaDir.getPath(), "scm.ratis").toString(), scmRatisDir);

        // OM should use new OM path
        String omRatisDir = ServerUtils.getDefaultRatisDirectory(conf, OM);
        assertEquals(Paths.get(metaDir.getPath(), "om.ratis").toString(), omRatisDir);

    } finally {
        FileUtils.deleteQuietly(metaDir);
    }
}

Contributor

@ChenSammi ChenSammi left a comment

Thanks @Gargi-jais11 .

Contributor

@aryangupta1998 aryangupta1998 left a comment

LGTM!

@adoroszlai adoroszlai marked this pull request as draft December 11, 2025 13:01
@adoroszlai
Contributor

@adoroszlai adoroszlai changed the title from "HDDS-13866. Ozone datanode startup exception on colocated hosts locked the storage directory:./ratis/" to "HDDS-13866. Use component-specific default directory for Ratis" on Dec 11, 2025
@Gargi-jais11
Contributor Author

I am looking into it

@ChenSammi ChenSammi marked this pull request as ready for review December 12, 2025 08:56
@adoroszlai adoroszlai marked this pull request as draft December 13, 2025 08:03
@adoroszlai
Contributor

adoroszlai commented Dec 13, 2025

Thanks @Gargi-jais11 for updating the patch. SCM now comes out of safe mode, but OM failed to start during upgrade:

2025-12-12 05:22:21,319 [main] WARN server.ServerUtils: Storage directory for Ratis is not configured. It is a good idea to map this to an SSD disk. Falling back to ozone.metadata.dirs
2025-12-12 05:22:21,321 [main] INFO server.ServerUtils: Found existing Ratis directory at old shared location: /data/metadata/ratis. Using it for backward compatibility during upgrade.
2025-12-12 05:22:21,322 [main] WARN server.ServerUtils: Storage directory for Ratis is not configured. It is a good idea to map this to an SSD disk. Falling back to ozone.metadata.dirs
2025-12-12 05:22:21,323 [main] INFO server.ServerUtils: Found existing Ratis directory at old shared location: /data/metadata/ratis. Using it for backward compatibility during upgrade.
2025-12-12 05:22:21,323 [main] ERROR om.OzoneManagerStarter: Cancelling prepare to start OM in upgrade mode failed with exception
java.io.IOException: Path of ozone.om.ratis.storage.dir and ozone.scm.ha.ratis.storage.dir should not be co located. Please change at least one path.
	at org.apache.hadoop.ozone.om.OzoneManager.initializeRatisDirs(OzoneManager.java:1614)
	at org.apache.hadoop.ozone.om.OzoneManager.<init>(OzoneManager.java:701)
	at org.apache.hadoop.ozone.om.OzoneManager.createOm(OzoneManager.java:884)
	at org.apache.hadoop.ozone.om.OzoneManagerStarter$OMStarterHelper.startAndCancelPrepare(OzoneManagerStarter.java:231)
	at org.apache.hadoop.ozone.om.OzoneManagerStarter.startOmUpgrade(OzoneManagerStarter.java:120)

https://github.com/Gargi-jais11/ozone/actions/runs/20156785230

@Gargi-jais11
Contributor Author

@ChenSammi and @aryangupta1998, please take another look at the patch.

Actual reason why the upgrade tests were failing:
Version: 2.0.0
OM logs show:

2025-12-12 05:19:34,415 [om1-impl-thread1] INFO storage.RaftStorageDirectory: The storage directory /data/metadata/ratis/5cb24680-b9e7-3c90-a862-d66704efc61c does not exist. Creating ...
2025-12-12 05:19:34,422 [om1-impl-thread1] INFO storage.RaftStorageDirectory: Lock on /data/metadata/ratis/5cb24680-b9e7-3c90-a862-d66704efc61c/in_use.lock acquired by nodename 7@om1

✅OM uses: /data/metadata/ratis

SCM logs show:

2025-12-12 05:18:53,458 [bdd3caaa-0deb-43cc-a4ee-d222722bcb29-impl-thread1] INFO storage.RaftStorageDirectory: Lock on /data/metadata/scm-ha/8d6a99a9-6b52-4ed5-bc55-c2abbca60551/in_use.lock acquired by nodename 7@scm1.org

✅ SCM uses: /data/metadata/scm-ha

Version: 2.2.0 Upgrade

OM logs show the bug:

2025-12-12 05:22:21,680 [main] INFO server.ServerUtils: Found existing Ratis directory at old shared location: /data/metadata/ratis.
2025-12-12 05:22:21,680 [main] INFO server.ServerUtils: Found existing Ratis directory at old shared location: /data/metadata/ratis.

The same message is printed twice: first for OM, then for the SCM check.
Then OM crashes:

2025-12-12 05:22:21,681 [main] ERROR om.OzoneManagerStarter: java.io.IOException: Path of ozone.om.ratis.storage.dir and ozone.scm.ha.ratis.storage.dir should not be co located.

SCM log falls back correctly:

2025-12-12 05:21:56,966 [main] INFO ha.SCMSnapshotProvider: Initializing SCM Snapshot Provider
2025-12-12 05:21:56,966 [main] WARN server.ServerUtils: Storage directory for Ratis is not configured. It is a good idea to map this to an SSD disk. Falling back to ozone.metadata.dirs
2025-12-12 05:21:56,967 [main] INFO server.ServerUtils: Found existing SCM Ratis directory at old location: /data/metadata/scm-ha. Using it for backward compatibility during upgrade.

Real issue: the failure was in OzoneManager's co-location check, even though the backward-compatibility fallback itself worked correctly.
The co-location check looks at which directories exist locally instead of what is actually configured. In distributed setups, SCM directories won't exist on OM machines, so the SCM lookup on the OM host also resolves to the old shared /data/metadata/ratis (as the repeated log message shows), and the check gives a false alarm.
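
One way to base such a check on configuration rather than on what exists on disk is sketched below. This is an illustration under assumptions, not the change made in OzoneManager: `CoLocationCheckSketch` and `isCoLocated` are hypothetical names, a plain `Map` stands in for the Ozone configuration object, and only the two property names are taken from the error message above.

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CoLocationCheckSketch {

  // Property names as they appear in the IOException message.
  static final String OM_RATIS_DIR_KEY = "ozone.om.ratis.storage.dir";
  static final String SCM_RATIS_DIR_KEY = "ozone.scm.ha.ratis.storage.dir";

  private static boolean isBlank(String s) {
    return s == null || s.trim().isEmpty();
  }

  /**
   * Reports a conflict only when both properties are explicitly set
   * and point to the same normalized path. Defaults and directories
   * that happen to exist locally are never consulted.
   */
  static boolean isCoLocated(Map<String, String> conf) {
    String omDir = conf.get(OM_RATIS_DIR_KEY);
    String scmDir = conf.get(SCM_RATIS_DIR_KEY);
    if (isBlank(omDir) || isBlank(scmDir)) {
      return false;
    }
    return Paths.get(omDir.trim()).normalize()
        .equals(Paths.get(scmDir.trim()).normalize());
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println("nothing configured: " + isCoLocated(conf));
    conf.put(OM_RATIS_DIR_KEY, "/data/ratis");
    conf.put(SCM_RATIS_DIR_KEY, "/data/ratis");
    System.out.println("same explicit path: " + isCoLocated(conf));
  }
}
```

With a check of this shape, an OM host that falls back to the old shared directory during upgrade would not be rejected merely because the SCM default resolves to the same place locally.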

@Gargi-jais11
Contributor Author

Replying to @adoroszlai's comment above (the OM startup failure during upgrade, "Path of ozone.om.ratis.storage.dir and ozone.scm.ha.ratis.storage.dir should not be co located", from https://github.com/Gargi-jais11/ozone/actions/runs/20156785230):

Thanks @adoroszlai for pointing out this log message. It helped a lot in debugging the issue.

@Gargi-jais11 Gargi-jais11 marked this pull request as ready for review December 13, 2025 16:25
Contributor

@adoroszlai adoroszlai left a comment

Thanks @Gargi-jais11 for fixing the test failure.

@adoroszlai
Contributor

Thanks @Gargi-jais11 for updating the patch.

@ChenSammi
Contributor

The CI is green. @adoroszlai, would you like to take another look?

@adoroszlai adoroszlai merged commit 4710ac9 into apache:master Dec 15, 2025
83 of 84 checks passed
@adoroszlai
Contributor

Thanks @Gargi-jais11 for the patch, @aryangupta1998, @ChenSammi, @sumitagrawl for the review.

5 participants