-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13866. Use component-specific default directory for Ratis #9318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDDS-13866. Use component-specific default directory for Ratis #9318
Conversation
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMHAUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/utils/HddsServerUtil.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
|
In addition, please
|
…onent-specific-ratis
b5e0003 to
d757e55
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical issue where SCM, OM, and DataNode services fail to start when colocated on the same host due to conflicts over the shared Ratis directory. The fix introduces component-specific Ratis directory naming (e.g., scm.ratis, om.ratis, dn.ratis) while maintaining backward compatibility with existing installations.
Key changes:
- Modified
getDefaultRatisDirectory()andgetDefaultRatisSnapshotDirectory()to acceptNodeTypeparameter for component-specific directory creation - Added backward compatibility logic to detect and use existing old directory structures during upgrades
- Updated all callers to pass appropriate
NodeTypewhen requesting default Ratis directories
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| ServerUtils.java | Core implementation of component-specific Ratis directory logic with backward compatibility checks |
| TestServerUtils.java | Comprehensive test suite validating new directory structure and backward compatibility scenarios |
| SCMHAUtils.java | Updated SCM to use component-specific Ratis directories and removed deprecated getRatisStorageDir() method |
| OzoneManagerRatisUtils.java | Updated OM to use component-specific Ratis directories |
| HddsServerUtil.java | Updated DataNode to use component-specific Ratis directories |
| OzoneManager.java | Updated reference from deprecated method to new getSCMRatisDirectory() |
| TestOzoneHARatisLogParser.java | Updated test to use proper utility methods for getting Ratis directories |
| TestStorageContainerManager.java | Updated test to use new method name |
| RatisUtil.java | Updated to use new getSCMRatisDirectory() method |
| testOMHA.robot | Updated test constant to reflect new OM-specific Ratis directory path |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/server/TestServerUtils.java
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
…PSHOT_DIR and other comments
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
| assertNotEquals(omRatisDir, dnRatisDir); | ||
|
|
||
| // Verify the base metadata dir exists | ||
| assertTrue(metaDir.exists()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who create the base metadata dir, getDefaultRatisDirectory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, getDefaultRatisDirectory creates the base metadata directory indirectly.
So when getDefaultRatisDirectory is called, it:
- Calls
getOzoneMetaDirPath()to get the metadata directory - That calls
getDirectoryFromConfig(), which creates the metadata directory - Then returns the component-specific Ratis directory path
hadoop-hdds/framework/src/test/java/org/apache/hadoop/hdds/server/TestServerUtils.java
Outdated
Show resolved
Hide resolved
aryangupta1998
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch @Gargi-jais11, can you please add a small unit test that tests when existing old dir is empty?
Maybe something like this,
@Test
public void testEmptyOldSharedRatisIgnored() throws IOException {
final File metaDir = new File(folder.toFile(), "upgradeMetaDir");
final File oldSharedRatisDir = new File(metaDir, "ratis");
final OzoneConfiguration conf = new OzoneConfiguration();
conf.set(HddsConfigKeys.OZONE_METADATA_DIRS, metaDir.getPath());
try {
// Create old Ratis directory (empty)
assertTrue(oldSharedRatisDir.mkdirs());
// SCM should use new SCM path
String scmRatisDir = ServerUtils.getDefaultRatisDirectory(conf, SCM);
assertEquals(Paths.get(metaDir.getPath(), "scm.ratis").toString(), scmRatisDir);
// OM should use new OM path
String omRatisDir = ServerUtils.getDefaultRatisDirectory(conf, OM);
assertEquals(Paths.get(metaDir.getPath(), "om.ratis").toString(), omRatisDir);
} finally {
FileUtils.deleteQuietly(metaDir);
}
}
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Show resolved
Hide resolved
ChenSammi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Gargi-jais11 .
aryangupta1998
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Upgrade test is timing out, please take a look: https://github.com/Gargi-jais11/ozone/actions/runs/20124486355/job/57764949606 |
|
I am looking into it |
|
Thanks @Gargi-jais11 for updating the patch. SCM now comes out of safe mode, but OM failed to start during upgrade: https://github.com/Gargi-jais11/ozone/actions/runs/20156785230 |
|
@ChenSammi and @aryangupta1998 . Please take another look on the patch. Actual Reason why the ✅OM uses: /data/metadata/ratis SCM logs show: ✅ SCM uses: /data/metadata/scm-ha Version: 2.2.0 Upgrade OM logs show the bug: Same message printed TWICE - first for OM, second for SCM check SCM log falls back correctly: Real Issue: Was in |
Thanks @adoroszlai for pointing out at this log message. It helped a lot in debugging the issue. |
|
Now CI has passed : https://github.com/Gargi-jais11/ozone/actions/runs/20194126863 |
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Gargi-jais11 for fixing the test failure.
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ServerUtils.java
Outdated
Show resolved
Hide resolved
|
Thanks @Gargi-jais11 for updating the patch. |
|
The CI is green. @adoroszlai , could you like to take another look? |
|
Thanks @Gargi-jais11 for the patch, @aryangupta1998, @ChenSammi, @sumitagrawl for the review. |
What changes were proposed in this pull request?
In Ozone, "ozone.metadata.dirs" is used in many places as the fallback solution if some specific properties is not defined. If OM, SCM, DN, S3g are all installed on different node, then this fallback will not cause any problem. But if on the same node, then there will be conflict, for example, OM has ratis directory, SCM also has ratis directory, so does DN. So for fallback solution, we should add component name to the directory, for example, om.ratis, scm.ratis and datanode.ratis .
But for ratis we need to handle the case as below:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13866
How was this patch tested?
Added new UT.