-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add PerFieldStoredFieldsFormat to allow multiple stored field formats #138299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @fcofdez, I've created a changelog YAML for you. |
| @Override | ||
| public void document(int docID, StoredFieldVisitor visitor) throws IOException { | ||
| for (StoredFieldsReader storedFieldsReader : formatToStoredFieldReaders.values()) { | ||
| storedFieldsReader.document(docID, visitor); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to have some interesting side effects, in that it will change the order in which fields are presented to the visitor. We have a few places in ES which rely on eg the _id field always being the first field read from stored fields, as that means we can then stop reading the rest of the data. Not a show-stopper, but something to be aware of and will need to be tested carefully.
| ESStoredFieldsFormat format = ESStoredFieldsFormat.forName(formatName); | ||
| storedFieldsReader = format.fieldsReader(directory, si, fn, context); | ||
| var previous = formatStoredFieldReaders.put(formatName, storedFieldsReader); | ||
| assert previous == null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be only used by a single thread?
burqen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like that this PR is isolated to only add the PerFieldStoredFieldsFormat together with the additional plumbing and adaptations. It makes it easy to review 👍 Thanks for that.
I found some minor things to cleaned up and I also posted some questions. Most of the questions are more for my own learning and doesn't need any action for approval of the PR itself.
Given that there are a series of follow-up PRs building on this one, I'd be open to letting the comments be addressed in a later PR and approve now.
| private static final Set<String> FILE_EXTENSIONS = Set.of( | ||
| STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION, | ||
| STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| private static final Set<String> FILE_EXTENSIONS = Set.of( | |
| STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION, | |
| STORED_FIELDS_BLOOM_FILTER_FORMAT_NAME | |
| ); | |
| private static final Set<String> FILE_EXTENSIONS = Set.of( | |
| STORED_FIELDS_METADATA_BLOOM_FILTER_EXTENSION, | |
| STORED_FIELDS_BLOOM_FILTER_EXTENSION | |
| ); |
Typo, right?
| private static final byte BLOOM_FILTER_NOT_STORED = 0; | ||
| private static final ByteSizeValue MAX_BLOOM_FILTER_SIZE = ByteSizeValue.ofMb(8); | ||
| private static final String DEFAULT_SEGMENT_SUFFIX = ""; | ||
| public static final ByteSizeValue DEFAULT_BLOOM_FILTER_SIZE = ByteSizeValue.ofKb(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public static final ByteSizeValue DEFAULT_BLOOM_FILTER_SIZE = ByteSizeValue.ofKb(2); |
DEFAULT_BLOOM_FILTER_SIZE not used. Leave if used in later PRs ofc.
|
|
||
| import java.io.IOException; | ||
|
|
||
| abstract class FilterESStoredFieldsFormat extends ESStoredFieldsFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see "Filter"/"Filtered" naming pattern around the code base and it seems to mean "delegating" in one way or the other. Is there more to it than that? (Not a comment on the code, just asking to learn)
| * <p> Files written by each stored fields format should use different file fileExtensions, this is enforced during the writer creation.</p> | ||
| */ | ||
| public abstract class PerFieldStoredFieldsFormat extends StoredFieldsFormat { | ||
| public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = "stored_field_format"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = "stored_field_format"; | |
| public static final String STORED_FIELD_FORMAT_ATTRIBUTE_KEY = PerFieldStoredFieldsFormat.class.getSimpleName() + ".format"; |
nit: Looking at PerFieldPostingsFormat. It uses the class name to form the attribute key (PerFieldPostingsFormat.class.getSimpleName() + ".format"). Maybe nice to use the same structure for consistency?
| final String formatName = fi.getAttribute(STORED_FIELD_FORMAT_ATTRIBUTE_KEY); | ||
| // Can be a format name be null if we're reading a segment from this codec? | ||
| if (formatName != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are unsure about if formatName can be null, a safer way could be to assert formatName != null instead of if (formatName != null).
| org.elasticsearch.index.codec.storedfields.ESZstd814StoredFieldsFormat | ||
| org.elasticsearch.index.codec.storedfields.ESLucene90StoredFieldsFormat | ||
| org.elasticsearch.index.codec.bloomfilter.ES93BloomFilterStoredFieldsFormat |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the META-INF here even though we have the module-info?
| } | ||
| }); | ||
| var bloomFilterSizeInKb = atLeast(2); | ||
| conf.setCodec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We set the codec two times in this test, here and at line 93. I guess the later one which uses TestCodec it the one we want to keep.
|
I've discussed this offline with @tlrx and we've decided to try a different approach which might simplify things. |
|
Closing this in favour of #138515 |
This commit adds a new StoredFieldsFormat, PerFieldStoredFieldsFormat,
that allows using different formats depending on a which field is being stored.
In Lucene there's similar formats for doc values, vectors, etc, but nothing for
stored fields. That's the reason why we needed to add this, since Lucene relies
on SPI at read time in order to load codecs, we needed to introduce a new
format that exposes
NamedSPIallowing us to use the same pattern whileloading segments.