-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[df] First integration of RHist filling
#20664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
As the classes are experimental, using the method will print a warning to inform the user.
This needs one indirection to construct the std::tuple and separate the weight argument.
| /// ### Example usage: | ||
| /// ~~~{.cpp} | ||
| /// auto h = std::make_shared<ROOT::Experimental::RHist<double>>(10, {5.0, 15.0}); | ||
| /// auto myHist = myDf.Hist(h, {"col0"}); | ||
| /// ~~~ | ||
| template <typename ColumnType = RDFDetail::RInferredType, typename... ColumnTypes, typename BinContentType> | ||
| RResultPtr<ROOT::Experimental::RHist<BinContentType>> | ||
| Hist(std::shared_ptr<ROOT::Experimental::RHist<BinContentType>> h, const ColumnNames_t &columnList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we should market this overload as "bring your own histogram" 😅
Jokes aside, this will enable some nice use cases, for example filling a shared histogram from multiple computation graphs, as demonstrated by the RDFHist.PtrRunGraphs unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
Test Results 21 files 21 suites 3d 18h 43m 15s ⏱️ Results for commit 46f531b. |
hahnjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things to discuss during review (after the break)
| { | ||
| std::shared_ptr h = std::make_shared<ROOT::Experimental::RHist<BinContentType>>(std::move(axes)); | ||
| if (h->GetNDimensions() != columnList.size()) { | ||
| throw std::runtime_error("Wrong number of columns for the specified number of histogram axes."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In histv7, we are consistently throwing std::invalid_argument while RDF prefers std::runtime_error. I think we have to decide which "consistency" to follow...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://en.cppreference.com/w/cpp/error/runtime_error.html and https://en.cppreference.com/w/cpp/error/invalid_argument.html :
std::runtime_error: It reports errors that are due to events beyond the scope of the program and cannot be easily predicted.std::invalid_argument: It reports errors that arise because an argument value has not been accepted.
I infer from this that any type of error that arises from breaking a pre-condition of a function, such as this case, should be an invalid_argument. In RDataFrame, we are just using std::runtime_error everywhere (for simplicity/consistency). If we were to apply the two definitions above, we should actually use both type of errors: runtime_error for things like errors in JITting, invalid_argument for this type of cases
| /// ROOT::Experimental::RRegularAxis axis(10, {5.0, 15.0}); | ||
| /// auto myHist = myDf.Hist({axis}, {"col0"}); | ||
| /// ~~~ | ||
| template <typename BinContentType = double, typename ColumnType = RDFDetail::RInferredType, typename... ColumnTypes> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One could also argue for BinContentType = int or long as default since it's unweighted filling. However, then a number of "post-processing steps" require conversion, for example scaling...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach: least number of operations by default.
| /// A list containing the names of the columns that will be passed when calling `Fill`. | ||
| /// (N columns for unweighted filling, or N+1 columns for weighted filling) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder to ourselves that we want to change this, and backport to 6.38 a warning message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #20816 to keep track of this comment 👍
| struct HistoND{}; | ||
| struct HistoNSparseD{}; | ||
| struct Hist{}; | ||
| struct HistWithWeight{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a second tag felt like the easiest solution, together with just a template argument bool WithWeight = false for RHistFillHelper. Other approaches are certainly possible...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is valid and given that everything is internal anyway I don't see a reason to spend more time on this.
vepadulano
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! The PR is already in an advanced state, I have left some minor comments
| std::tuple args{std::get<I>(columnValues)...}; | ||
| ROOT::Experimental::RWeight weight(std::get<sizeof...(ColumnTypes) - 1>(columnValues)); | ||
| fContexts[slot]->Fill(args, weight); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the extra args tuple necessary or can we directly pass columnValues to Fill?
| struct HistoND{}; | ||
| struct HistoNSparseD{}; | ||
| struct Hist{}; | ||
| struct HistWithWeight{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is valid and given that everything is internal anyway I don't see a reason to spend more time on this.
| } | ||
|
|
||
| #ifdef R__HAS_ROOT7 | ||
| // Action for RHist using RHistConcurrentFiller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Action for RHist using RHistConcurrentFiller | |
| // Action for RHist using RHistConcurrentFiller without weights |
| return std::make_unique<Action_t>(Helper_t(h, nSlots), columnList, std::move(prevNode), colRegister); | ||
| } | ||
|
|
||
| // Action for RHist using RHistConcurrentFiller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Action for RHist using RHistConcurrentFiller | |
| // Action for RHist using RHistConcurrentFiller with weights |
| /// A list containing the names of the columns that will be passed when calling `Fill`. | ||
| /// (N columns for unweighted filling, or N+1 columns for weighted filling) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #20816 to keep track of this comment 👍
| { | ||
| std::shared_ptr h = std::make_shared<ROOT::Experimental::RHist<BinContentType>>(std::move(axes)); | ||
| if (h->GetNDimensions() != columnList.size()) { | ||
| throw std::runtime_error("Wrong number of columns for the specified number of histogram axes."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://en.cppreference.com/w/cpp/error/runtime_error.html and https://en.cppreference.com/w/cpp/error/invalid_argument.html :
std::runtime_error: It reports errors that are due to events beyond the scope of the program and cannot be easily predicted.std::invalid_argument: It reports errors that arise because an argument value has not been accepted.
I infer from this that any type of error that arises from breaking a pre-condition of a function, such as this case, should be an invalid_argument. In RDataFrame, we are just using std::runtime_error everywhere (for simplicity/consistency). If we were to apply the two definitions above, we should actually use both type of errors: runtime_error for things like errors in JITting, invalid_argument for this type of cases
| /// During execution of the computation graph, the passed histogram must only be accessed with methods that are | ||
| /// allowed during concurrent filling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is always a tricky subject to communicate to users, so I wonder if we can be even more explicit here?
| /// During execution of the computation graph, the passed histogram must only be accessed with methods that are | |
| /// allowed during concurrent filling. | |
| /// During execution of the computation graph, the passed histogram must only be accessed with methods that are | |
| /// allowed during concurrent filling, i.e. that are thread-safe |
| /// ### Example usage: | ||
| /// ~~~{.cpp} | ||
| /// auto h = std::make_shared<ROOT::Experimental::RHist<double>>(10, {5.0, 15.0}); | ||
| /// auto myHist = myDf.Hist(h, {"col0"}); | ||
| /// ~~~ | ||
| template <typename ColumnType = RDFDetail::RInferredType, typename... ColumnTypes, typename BinContentType> | ||
| RResultPtr<ROOT::Experimental::RHist<BinContentType>> | ||
| Hist(std::shared_ptr<ROOT::Experimental::RHist<BinContentType>> h, const ColumnNames_t &columnList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great!
|
|
||
| std::shared_ptr h = std::make_shared<ROOT::Experimental::RHist<BinContentType>>(std::move(axes)); | ||
| if (h->GetNDimensions() != columnList.size()) { | ||
| throw std::runtime_error("Wrong number of columns for the specified number of histogram axes."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know both numbers, we could think about making the error a bit more friendly:
| throw std::runtime_error("Wrong number of columns for the specified number of histogram axes."); | |
| throw std::runtime_error("Wrong number of columns for the specified number of histogram axes: need '" + std::to_string(h->GetNDimensions()) + "', got '" + std::to_string(columnList.size()) + "'."); |
Applies similarly in other places.
| { | ||
| RDataFrame df(10); | ||
| const RRegularAxis axis(10, {5.0, 15.0}); | ||
| auto hist = df.Define("x", "rdfentry_ + 5.5").Hist({axis}, {"x"}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud, outside the scope of this PR, about how much JITting we strictly need. In principle, the bin content type defaults to double, so what we're JITting is the column type, I wonder if we could find a way to ask the type-system to just read the contents of the column truly as double, then we'd have at least one default scenario without JITting and without the need for template arguments.
As the classes are experimental, using the method will print a warning to inform the user.
Note that this is only a first version, there are a couple of advanced functionalities and better error handling that will come in future PRs.