Conversation

@KodaiD (Contributor) commented Nov 6, 2025

Description

This PR adds DML support for the Blob adapter, enabling data manipulation operations on Azure Blob Storage.

Support for S3 and GCS will be addressed in subsequent PRs.

Related issues and/or PRs

Changes made

  • Added a DistributedStorage implementation for Object Storage.
  • Added unit tests.
  • Added integration tests.

Checklist

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • I have considered whether similar issues could occur in other products, components, or modules if this PR is for bug fixes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

N/A

Release notes

Added support for data manipulation operations over Azure Blob Storage.

@KodaiD self-assigned this Nov 6, 2025
@KodaiD added the "enhancement" (New feature or request) label Nov 6, 2025

@KodaiD (Contributor, Author) commented Nov 6, 2025

/gemini review

@gemini-code-assist (bot) left a comment:
Code Review

This pull request introduces data manipulation language (DML) support for the Blob Storage adapter, which is a significant and well-implemented feature. The changes include the core logic for handling mutations, numerous integration tests to ensure correctness, and necessary updates to error codes and utility classes. My review focuses on improving code clarity, robustness, and performance in the new classes. I've identified opportunities to remove dead code, enhance immutability, improve test coverage for conditional operations, and optimize a hashCode implementation. Overall, this is a solid contribution that greatly enhances the capabilities of the object storage adapter.

@gemini-code-assist (bot) left a comment:
Code Review

This pull request adds significant data manipulation capabilities to the Blob Storage adapter, a major step forward. The overall implementation is solid, with good use of handlers for different statement types and a clear read-modify-write pattern for mutations. I've identified a few areas for improvement, including a bug in data serialization, a suboptimal hash implementation, and opportunities to enhance test coverage and code readability. My detailed comments are below.

@gemini-code-assist (bot) left a comment:
Code Review

This pull request introduces Data Manipulation Language (DML) support for the Azure Blob Storage adapter, which is a significant enhancement. The changes include new classes for handling mutations and selections, as well as comprehensive integration and unit tests. The implementation correctly identifies and enforces several limitations inherent to object storage, such as the lack of index support, restricted conditional operations for BLOB types, and limitations on cross-partition scan ordering. The addition of specific error codes for these scenarios is also a positive step for clarity and debugging. Overall, the changes are well-structured and provide a solid foundation for DML operations in Object Storage.

@KodaiD KodaiD marked this pull request as ready for review November 7, 2025 01:26
Copilot AI review requested due to automatic review settings November 7, 2025 01:26
Copilot AI left a comment:

Copilot reviewed 56 out of 56 changed files in this pull request and generated no comments.



@KodaiD KodaiD requested review from a team, Torch3333, brfrn169, feeblefakie and komamitsu and removed request for a team November 7, 2025 01:44
// Assert
assertThat(keys).isEmpty();
}

@KodaiD (Contributor, Author) commented:

Since Blob Storage returns at most 5,000 keys per LIST operation, we should check whether getKeys can return more than 5,000 keys.

Ref: https://learn.microsoft.com/rest/api/storageservices/list-blobs
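The concern above can be sketched as follows. This is a minimal, self-contained illustration of aggregating paged LIST results past the 5,000-key page limit; PagedKeyLister, listPage, and the in-memory backing list are invented stand-ins for the actual service call, not the PR's real implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: getKeys must keep issuing LIST calls until the service
// stops returning full pages, instead of trusting a single 5,000-key response.
public class PagedKeyLister {
  static final int PAGE_LIMIT = 5000; // Blob Storage's per-response maximum

  private final List<String> backing; // simulated blob store contents

  public PagedKeyLister(List<String> backing) {
    this.backing = backing;
  }

  // Simulates one LIST call: returns at most PAGE_LIMIT keys starting at marker.
  List<String> listPage(int marker) {
    int end = Math.min(marker + PAGE_LIMIT, backing.size());
    return backing.subList(marker, end);
  }

  // Aggregates all pages; a short page signals that no continuation remains.
  public List<String> getKeys() {
    List<String> all = new ArrayList<>();
    int marker = 0;
    while (true) {
      List<String> page = listPage(marker);
      all.addAll(page);
      if (page.size() < PAGE_LIMIT) {
        break; // no more pages
      }
      marker += page.size();
    }
    return all;
  }
}
```

A test along these lines (e.g. 12,345 keys spanning three pages) would catch a getKeys implementation that silently truncates at the first page.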

@KodaiD KodaiD force-pushed the add-dml-support-for-blob-storage branch from 70f2d31 to 0a73daf Compare November 7, 2025 02:34

Column<?> column1 =
ColumnValueMapper.convert(
clusteringKey1.get(columnName), columnName, metadata.getColumnDataType(columnName));
A reviewer (Contributor) commented:

[minor] metadata.getColumnDataType(columnName) can be reused.

@KodaiD (Author) replied:

Fixed in 533f3cd.

Map<String, ObjectStorageRecord> partition =
getPartition(namespaceName, tableName, partitionKey, readVersionMap);
for (Mutation mutation : mutations) {
if (mutation instanceof Put) {
A reviewer (Contributor) commented:

Insert, Update and Upsert aren't passed to this method?

@KodaiD (Author) replied Nov 7, 2025:

DistributedStorage only handles Put. Insert, Update, and Upsert are translated into Put with a condition.
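The translation KodaiD describes can be sketched as below. This is an illustrative sketch only: the enum and helper names are invented here, and ScalarDB's actual condition types may be shaped differently, but the idea is that Insert becomes a Put guarded by "if not exists", Update a Put guarded by "if exists", and Upsert an unconditional Put.

```java
import java.util.Optional;

// Hedged sketch of how higher-level write operations collapse to Put at the
// DistributedStorage layer. Names are illustrative, not ScalarDB's API.
public class PutTranslation {
  enum Condition {
    PUT_IF_NOT_EXISTS, // record must not already exist
    PUT_IF_EXISTS      // record must already exist
  }

  // Returns the condition attached to the resulting Put, if any.
  static Optional<Condition> conditionFor(String operation) {
    switch (operation) {
      case "INSERT":
        return Optional.of(Condition.PUT_IF_NOT_EXISTS);
      case "UPDATE":
        return Optional.of(Condition.PUT_IF_EXISTS);
      case "UPSERT":
        return Optional.empty(); // unconditional Put
      default:
        throw new IllegalArgumentException("unknown operation: " + operation);
    }
  }
}
```

This is why the mutate path only needs `mutation instanceof Put` branches: by the time mutations reach the storage layer, the operation kind is encoded in the Put's condition rather than in the mutation's class.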

if (validationFailed) {
throw new ExecutionException(
String.format(
"A condition failed. ConditionalExpression: %s, Column: %s",
A reviewer (Contributor) commented:

How about including the operator in this error message?

@KodaiD (Author) replied:

Ah, that's a mistake. We should pass the expression, which includes both the column and the operator, instead of expectedColumn. Let me fix it.

@KodaiD (Author) replied:

Fixed in 533f3cd.

throw new IllegalArgumentException(
CoreError.OPERATION_CHECK_ERROR_ORDERING_NOT_PROPERLY_SPECIFIED.buildMessage(scan));
}
boolean rightOrder =
A reviewer (Contributor) commented:

Can you elaborate on why this variable is named rightOrder?

Also, if the first scan order matches the clustering key order but the second differs, will the scan fail?

@KodaiD (Author) replied Nov 7, 2025:

Sorry, maybe isValidOrder is more suitable. The variable indicates whether the specified scan ordering is valid.

@KodaiD (Author) replied Nov 7, 2025:

In ScalarDB, we can only read records sorted by the clustering order defined in the table metadata, in forward or reverse order. For example, if a table has clustering keys first (Order=ASC) and second (Order=DESC), we can specify the following scan order:

| first | second |
|-------|--------|
| ASC   | -      |
| ASC   | DESC   |
| DESC  | -      |
| DESC  | ASC    |

@brfrn169 I think we discussed this previously, but is this still correct?
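The rule behind this table can be captured in a few lines: a requested ordering is valid only if it is a prefix of the table's clustering orders, taken either all-forward or all-reversed. The sketch below is illustrative, not the PR's code; orders are modeled as booleans (true = ASC), and empty orderings are treated as invalid for simplicity.

```java
import java.util.List;

// Hypothetical check for whether a Scan's requested clustering-key ordering
// is the forward or the reverse of the metadata-defined clustering orders.
public class ScanOrderCheck {
  // metadataOrders: clustering orders from table metadata, in definition order.
  // requested: orders specified on the Scan, possibly a prefix of the above.
  static boolean isValidOrder(List<Boolean> metadataOrders, List<Boolean> requested) {
    if (requested.isEmpty() || requested.size() > metadataOrders.size()) {
      return false;
    }
    boolean forward = true; // every requested order matches metadata
    boolean reverse = true; // every requested order is the opposite of metadata
    for (int i = 0; i < requested.size(); i++) {
      if (requested.get(i).equals(metadataOrders.get(i))) {
        reverse = false;
      } else {
        forward = false;
      }
    }
    return forward || reverse;
  }
}
```

With metadata (ASC, DESC), this accepts exactly the four rows of the table above and rejects mixed requests such as (ASC, ASC), which answers the reviewer's second question.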

@KodaiD (Author) replied:

> Also, if the first scan order is the same as the clustering key order and the second scan order is different, it will fail?

So, the answer is "yes."


records.sort(
(o1, o2) ->
new ClusteringKeyComparator(metadata)
A reviewer (Contributor) commented:

I might be missing something, but why don't we simply sort the records based on Scan.ordering (and the metadata's clustering key orders) at this point? If it works, the following Collections.reverse() isn't needed?

@KodaiD (Author) replied:

Due to this specification, we need to sort the records by the clustering order defined in the table metadata. Then, we can decide whether to read the sorted records in forward or reverse order based on the order specified in the scan request.
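The approach KodaiD describes can be sketched like this. Integer keys stand in for clustering-key comparison, so this is a toy model of the design, not the adapter's code: records are always sorted by the metadata-defined clustering order first, and only the read direction varies.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: sort once by the metadata-defined clustering order, then reverse
// the whole result when the scan requests the reverse ordering.
public class SortThenMaybeReverse {
  static List<Integer> order(List<Integer> records, boolean reverseScan) {
    List<Integer> sorted = new ArrayList<>(records);
    Collections.sort(sorted); // metadata-defined clustering order
    if (reverseScan) {
      Collections.reverse(sorted); // read the sorted records backwards
    }
    return sorted;
  }
}
```

Sorting directly by Scan.ordering would work too, but since only the forward and reverse orders are ever valid (per the specification discussed above), a fixed sort plus an optional reverse keeps the comparator independent of the request.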

@KodaiD KodaiD requested a review from komamitsu November 7, 2025 07:25