-
Notifications
You must be signed in to change notification settings - Fork 763
SOLR-17319: CompoundQueryComponent scribbles continued with CompoundQueryComponentTest #3648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
SOLR-17319: CompoundQueryComponent scribbles continued with CompoundQueryComponentTest #3648
Conversation
…Component to use it)
…ueryComponent to use it
…deprecate direct access' changes for illustrative purposes
solr/core/src/java/org/apache/solr/handler/component/CompoundResponseBuilder.java
Outdated
Show resolved
Hide resolved
…getParameterPrefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I freaking love the elegance / simplicity here. It re-uses / leverages the existing distributed search phases of the SearchComponent design (as I hinted in JIRA may be possible), and meanwhile making that design more flexible for this new use-case. And reuses TopDocs.rrf
logic :-) I imagine this approach is doing query work in parallel without any extra code? That said, clearly this is a draft / POC. Perhaps there are serious flaws that aren't easily evident in a code review. Testing was minimal. I see no accommodations for faceting to ensure we don't wastefully compute the docSet.
@ercsonusharma Did you see this? Please give this a good review. I suspect the best solution may be a combination of both efforts. For example your (recently redone) tests, documentation, user experience (how to use it) are nice.
I'd be tempted to run both solutions with distributed tracing enabled to a local Zipkin/Jaeger to visualize what's going on. I used to have a shelved code snippet to instrument a test to do this.
solr/core/src/java/org/apache/solr/handler/component/CompoundQueryComponent.java
Outdated
Show resolved
Hide resolved
Yes, that's my understanding too.
Yeah, my understanding is that with this sort of approach there is conceptually no (obvious) support for faceting. Today, Solr's client would make two queries and then outside-of-Solr do the RRF locally -- conceptually that is what the approach here does too. Because the query work is in parallel and the doc sets are (I think) not passed back to where the fusion happens, then truly-correct faceting would not implicitly be available. ... ... having said that, currently the sub-queries are there and there are only sub-queries, so I imagine a fusion-aware faceting component might do an OR of the sub-queries and then send that for faceting purposes only i.e. with |
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the process
method here and wonder how the individual queries from responseBuilders are executed on the shard level from the SearchIndexer?
final TopDocs[] hits = new TopDocs[crb.responseBuilders.size()]; | ||
for (int crb_idx = 0; crb_idx < crb.responseBuilders.size(); ++crb_idx) { | ||
|
||
final SolrDocumentList sdl = crb.responseBuilders.get(crb_idx).getResponseDocs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering where is this being populated?
if (rb instanceof CompoundResponseBuilder crb) { | ||
for (var rb_i : crb.responseBuilders) { | ||
if (rb_i.isThisFromMe(sreq)) { | ||
super.handleResponses(rb_i, sreq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using mergeIds on each of the crb and overriding the rb.resultIds?
Thanks for the draft @cpoerschke - appreciate it! Below are my initial observations:
I already checked TopDocs.rrf, and it is meant for rrf on TopDocs, especially for Lucene. I find it less effort in creating something ShardDoc.rrf rather than converting the SolrDocumentList to TopcDocs and back.
Query work is already happening in parallel by design of the distributed process. The things that need to be parallelised is merging the docs per query post querying, which I see sequentially here as well. I'm really doubtful at this point in time as to how the faceting & highlighting would work. |
…ghlightComponent to use it)
…et-fields finishStage did it i.e. now it happens after instead of before the fusion)
…ghtComponent to use it)
…nent and other (small) tweaks to match, for deterministic highlighting when document found for both sub-queries;
@@ -212,27 +212,31 @@ public void handleResponses(ResponseBuilder rb, ShardRequest sreq) {} | |||
|
|||
@Override | |||
public void finishStage(ResponseBuilder rb) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Hide whitespace' diff mode makes it easier to see that the changes to this method are small (and extremely subtle, should add comments really, later).
Thanks @ercsonusharma for starting to take a look here! Yes, there's relatively little code but I totally appreciate that it's not easy to understand.
That's an interesting idea, thanks @dsmiley for sharing! @hossman's "Lifecycle of a Solr Search Request" talk slides and recording from quite-a-while-ago-now also spring to mind for me here. Brief replies to some of your initial observations.
Yes, for each component it will need to be considered how it should behave in a fusion scenario and whether or not it should participate in the fusion stage, or just skip it, which is the implied default.
The distributed process provides parallelism on the shard level e.g. if we have 10 shards then all 10 will run in parallel. Within each shard however, as I understand the current #3418 code, the The parallelism here on #3648 is on the shard level and also on the sub-queries level:
Answering only on the highlighting work at this time, I've advanced the proof-of-concept further so that the test case does cover highlighting. The highlighting test passes and it may help to develop an understanding on how that extra fusion stage fits into the picture.
@dsmiley - I don't suppose you'd be in a position to dig up or dust off that code snippet? |
continuation of the earlier #2597 scribbles
https://issues.apache.org/jira/browse/SOLR-17319