Skip to content

Conversation

@fcofdez
Copy link
Contributor

@fcofdez fcofdez commented Nov 19, 2025

This commit adds a new StoredFieldsFormat, PerFieldStoredFieldsFormat,
that allows using different formats depending on a which field is being stored.
In Lucene there's similar formats for doc values, vectors, etc, but nothing for
stored fields. That's the reason why we needed to add this, since Lucene relies
on SPI at read time in order to load codecs, we needed to introduce a new
format that exposes NamedSPI allowing us to use the same pattern while
loading segments.

@elasticsearchmachine
Copy link
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

@Override
public void document(int docID, StoredFieldVisitor visitor) throws IOException {
for (StoredFieldsReader storedFieldsReader : formatToStoredFieldReaders.values()) {
storedFieldsReader.document(docID, visitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to have some interesting side effects, in that it will change the order in which fields are presented to the visitor. We have a few places in ES which rely on eg the _id field always being the first field read from stored fields, as that means we can then stop reading the rest of the data. Not a show-stopper, but something to be aware of and will need to be tested carefully.

ESStoredFieldsFormat format = ESStoredFieldsFormat.forName(formatName);
storedFieldsReader = format.fieldsReader(directory, si, fn, context);
var previous = formatStoredFieldReaders.put(formatName, storedFieldsReader);
assert previous == null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be only used by a single thread?

Copy link
Contributor

@burqen burqen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that this PR is isolated to only add the PerFieldStoredFieldsFormat together with the additional plumbing and adaptations. It makes it easy to review 👍 Thanks for that.
I found some minor things to cleaned up and I also posted some questions. Most of the questions are more for my own learning and doesn't need any action for approval of the PR itself.
Given that there are a series of follow-up PRs building on this one, I'd be open to letting the comments be addressed in a later PR and approve now.

Comment on lines +80 to +83
private static final Set<String> FILE_EXTENSIONS = Set.of(
STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION,
STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static final Set<String> FILE_EXTENSIONS = Set.of(
STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION,
STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME
);
private static final Set<String> FILE_EXTENSIONS = Set.of(
STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION,
STORED_FIELDS_BLOOM_FILTER_EXTENSION
);

Typo, right?

private static final byte BLOOM_FILTER_NOT_STORED = 0;
private static final ByteSizeValue MAX_BLOOM_FILTER_SIZE = ByteSizeValue.ofMb(8);
private static final String DEFAULT_SEGMENT_SUFFIX = "";
public static final ByteSizeValue DEFAULT_BLOOM_FILTER_SIZE = ByteSizeValue.ofKb(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final ByteSizeValue DEFAULT_BLOOM_FILTER_SIZE = ByteSizeValue.ofKb(2);

DEFAULT_BLOOM_FILTER_SIZE not used. Leave if used in later PRs ofc.


import java.io.IOException;

abstract class FilterESStoredFieldsFormat extends ESStoredFieldsFormat {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see "Filter"/"Filtered" naming pattern around the code base and it seems to mean "delegating" in one way or the other. Is there more to it than that? (Not a comment on the code, just asking to learn)

* <p> Files written by each stored fields format should use different file fileExtensions, this is enforced during the writer creation.</p>
*/
public abstract class PerFieldStoredFieldsFormat extends StoredFieldsFormat {
public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = "stored_field_format";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = "stored_field_format";
public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = PerFieldStoredFieldsFormat.class.getSimpleName() + ".format";

nit: Looking at PerFieldPostingsFormat. It uses the class name to form the attribute key (PerFieldPostingsFormat.class.getSimpleName() + ".format"). Maybe nice to use the same structure for consistency?

Comment on lines +267 to +269
final String formatName = fi.getAttribute(STORED_FIELD_FORMAT_ATTRIBUTE_KEY);
// Can be a format name be null if we're reading a segment from this codec?
if (formatName != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are unsure about if formatName can be null, a safer way could be to assert formatName != null instead of if (formatName != null).

Comment on lines +1 to +3
org.elasticsearch.index.codec.storedfields.ESZstd814StoredFieldsFormat
org.elasticsearch.index.codec.storedfields.ESLucene90StoredFieldsFormat
org.elasticsearch.index.codec.bloomfilter.ES93BloomFilterStoredFieldsFormat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the META-INF here even though we have the module-info?

}
});
var bloomFilterSizeInKb = atLeast(2);
conf.setCodec(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set the codec two times in this test, here and at line 93. I guess the later one which uses TestCodec it the one we want to keep.

@fcofdez
Copy link
Contributor Author

fcofdez commented Nov 21, 2025

I've discussed this offline with @tlrx and we've decided to try a different approach which might simplify things.

@fcofdez
Copy link
Contributor Author

fcofdez commented Nov 24, 2025

Closing this in favour of #138515

@fcofdez fcofdez closed this Nov 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants