@@ -78,6 +78,7 @@ pn = PandasLikeNamespace(
7878)
7979print(nw.col("a")._to_compliant_expr(pn))
8080```
81+
8182The result from the last line above is the same as we'd get from `pn.col('a')`, and it's
8283a `narwhals._pandas_like.expr.PandasLikeExpr` object, which we'll call `PandasLikeExpr` for
8384short.
@@ -215,6 +216,7 @@ pn = PandasLikeNamespace(
215216expr = (nw.col("a") + 1)._to_compliant_expr(pn)
216217print(expr)
217218```
219+
218220If we then extract a Narwhals-compliant dataframe from `df` by
219221calling `._compliant_frame`, we get a `PandasLikeDataFrame`, and that's an object which we can pass `expr` to!
220222
@@ -228,6 +230,7 @@ We can then view the underlying pandas Dataframe which was produced by calling `
228230```python exec="1" result="python" session="pandas_api_mapping" source="above"
229231print(result._native_frame)
230232```
233+
231234which is the same as we'd have obtained by just using the Narwhals API directly:
232235
233236```python exec="1" result="python" session="pandas_api_mapping" source="above"
@@ -238,49 +241,98 @@ print(nw.to_native(df.select(nw.col("a") + 1)))
238241
239242Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
240243to pandas. We can write something like
244+
241245```python
242246df: pl.DataFrame
243247df.group_by("a").agg((pl.col("c") > pl.col("b").mean()).max())
244248```
249+
245250To do this in pandas, we need to either use `GroupBy.apply` (sloooow), or do some crazy manual
246251optimisations to get it to work.
247252
248253In Narwhals, here's what we do:
249254
250255- if somebody uses a simple group-by aggregation (e.g. `df.group_by('a').agg(nw.col('b').mean())`),
251256 then on the pandas side we translate it to
252- ```python
253- df: pd.DataFrame
254- df.groupby("a").agg({"b": ["mean"]})
255- ```
257+
258+ ```python
259+ df: pd.DataFrame
260+ df.groupby("a").agg({"b": ["mean"]})
261+ ```
262+
256263- if somebody passes a complex group-by aggregation, then we use `apply` and raise a `UserWarning`, warning
257264 users of the performance penalty and advising them to refactor their code so that the aggregation they perform
258265 ends up being a simple one.
259266
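To see why the Polars expression `(pl.col("c") > pl.col("b").mean()).max()` does not map onto a single pandas aggregation, it can help to spell out what it computes for each group. The following is a plain-Python sketch with made-up data. It only illustrates the semantics and is not how Narwhals or pandas evaluates the expression:

```python
from collections import defaultdict

# Toy data: three columns stored as plain lists (hypothetical values).
rows = {
    "a": [1, 1, 2, 2],
    "b": [1.0, 3.0, 5.0, 5.0],
    "c": [2.5, 1.0, 4.0, 5.0],
}

# Collect the row indices that belong to each group of "a".
groups = defaultdict(list)
for i, key in enumerate(rows["a"]):
    groups[key].append(i)

result = {}
for key, idx in groups.items():
    # Step 1: aggregate "b" within the group.
    b_mean = sum(rows["b"][i] for i in idx) / len(idx)
    # Step 2: compare every "c" in the group against that (broadcast) mean.
    flags = [rows["c"][i] > b_mean for i in idx]
    # Step 3: aggregate the comparison results.
    result[key] = max(flags)

print(result)
```

The aggregation in step 3 depends on an elementwise comparison, which in turn depends on another aggregation. That nesting is what makes the aggregation "complex" in the sense used above.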
260- In order to tell whether an aggregation is simple, Narwhals uses the private `_depth` attribute of `PandasLikeExpr`:
267+ ## Nodes
268+
269+ If we have a Narwhals expression, we can look at the operations which make it up by accessing `_nodes`:
270+
271+ ```python exec="1" result="python" session="pandas_impl" source="above"
272+ import narwhals as nw
273+
274+ expr = nw.col("a").abs().std(ddof=1) + nw.col("b")
275+ print(expr._nodes)
276+ ```
277+
278+ Each node represents an operation. Here, we have 4 operations:
279+
280+ 1. Given some dataframe, select column `'a'`.
281+ 2. Take its absolute value.
282+ 3. Take its standard deviation, with `ddof=1`.
283+ 4. Add column `'b'` to the result.
284+
285+ Let's take a look at a couple of these nodes. Let's start with the third one:
286+
287+ ```python exec="1" result="python" session="pandas_impl" source="above"
288+ print(expr._nodes[2].as_dict())
289+ ```
290+
291+ This tells us a few things:
292+
293+ - We're performing an aggregation.
294+ - The name of the function is `'std'`. This will be looked up in the compliant object.
295+ - It takes keyword arguments `ddof=1`.
296+ - We'll look at `exprs`, `str_as_lit`, and `allow_multi_output` later.
297+
298+ For the evaluation to succeed, `PandasLikeExpr` must have a `std` method defined
299+ on it, which takes a `ddof` argument. And this is what the `CompliantExpr` Protocol is for: so
300+ long as a backend's implementation complies with the protocol, Narwhals will be able to
301+ unpack an `ExprNode` and turn it into a valid call.
302+
303+ Let's take a look at the fourth node:
304+
305+ ```python exec="1" result="python" session="pandas_impl" source="above"
306+ print(expr._nodes[3].as_dict())
307+ ```
308+
309+ Note how the `exprs` attribute is now populated. Indeed, we are adding another expression: `col('b')`.
310+ The `exprs` parameter holds arguments which are either expressions, or should be interpreted as expressions.
311+ The `str_as_lit` parameter tells us whether string arguments should be interpreted as literals (e.g. `lit('foo')`)
312+ or columns (e.g. `col('foo')`). Finally, `allow_multi_output` tells us whether multi-output expressions
313+ (more on this in the next section) are allowed to appear in `exprs`.
314+
315+ Note that the expression in `exprs` also has its own nodes:
261316
262317```python exec="1" result="python" session="pandas_impl" source="above"
263- print(pn.col("a").mean())
264- print((pn.col("a") + 1).mean())
318+ print(expr._nodes[3].exprs[0]._nodes)
265319```
266320
267- For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
268- which (efficient) elementary operation this corresponds to in pandas.
321+ It's nodes all the way down!
269322
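As a rough illustration of how a list of nodes could be turned into method calls on a backend object, here is a minimal sketch. `ToyNode`, `ToyCompliantExpr`, and `evaluate` are hypothetical names invented for this example. They are not Narwhals classes, and the real `ExprNode` carries more information (such as `exprs`):

```python
from dataclasses import dataclass, field
from statistics import stdev


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a method name plus keyword arguments."""

    name: str
    kwargs: dict = field(default_factory=dict)


class ToyCompliantExpr:
    """Hypothetical backend object implementing each operation as a method."""

    def __init__(self, values):
        self.values = values

    def abs(self):
        return ToyCompliantExpr([abs(v) for v in self.values])

    def std(self, *, ddof):
        # statistics.stdev is the sample standard deviation, i.e. ddof=1.
        if ddof != 1:
            raise NotImplementedError("this toy only supports ddof=1")
        return ToyCompliantExpr([stdev(self.values)])


def evaluate(start, nodes):
    """Apply each node in order by looking its name up on the current object."""
    current = start
    for node in nodes:
        current = getattr(current, node.name)(**node.kwargs)
    return current


nodes = [ToyNode("abs"), ToyNode("std", {"ddof": 1})]
result = evaluate(ToyCompliantExpr([-1.0, 2.0, -3.0]), nodes)
print(result.values)
```

The point of the sketch is the `getattr` lookup: as long as the backend object has a method matching each node's name and keyword arguments, the chain can be replayed.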
270323## Expression Metadata
271324
272- Let's try printing out a few expressions to the console to see what they show us:
325+ Let's try printing out some compliant expressions' metadata to see what it shows us:
273326
274- ```python exec="1" result="python" session="metadata" source="above"
327+ ```python exec="1" result="python" session="pandas_impl" source="above"
275328import narwhals as nw
276329
277- print(nw.col("a"))
278- print(nw.col("a").mean())
279- print(nw.col("a").mean().over("b"))
330+ print(nw.col("a")._to_compliant_expr(pn)._metadata)
331+ print(nw.col("a").mean()._to_compliant_expr(pn)._metadata)
332+ print(nw.col("a").mean().over("b")._to_compliant_expr(pn)._metadata)
280333```
281334
282- Note how they tell us something about their metadata. This section is all about
283- making sense of what that all means, what the rules are, and what it enables.
335+ This section is all about making sense of what that all means, what the rules are, and what it enables.
284336
285337Here's a brief description of each piece of metadata:
286338
@@ -293,8 +345,6 @@ Here's a brief description of each piece of metadata:
293345 - `ExpansionKind.MULTI_UNNAMED`: Produces multiple outputs whose names depend
294346 on the input dataframe. For example, `nw.nth(0, 1)` or `nw.selectors.numeric()`.
295347
296- - `last_node`: Kind of the last operation in the expression. See
297- `narwhals._expression_parsing.ExprKind` for the various options.
298348- `has_windows`: Whether the expression already contains an `over(...)` statement.
299349- `n_orderable_ops`: How many order-dependent operations the expression contains.
300350
@@ -311,8 +361,9 @@ Here's a brief description of each piece of metadata:
311361- `is_scalar_like`: Whether the output of the expression is always length-1.
312362- `is_literal`: Whether the expression doesn't depend on any column but instead
313363 only on literal values, like `nw.lit(1)`.
364+ - `nodes`: List of operations which this expression applies when evaluated.
314365
315- #### Chaining
366+ ### Chaining
316367
317368Say we have `expr.expr_method()`. How does `expr`'s `ExprMetadata` change?
318369This depends on `expr_method`. Details can be found in `narwhals/_expression_parsing`,
356407 then `n_orderable_ops` is decreased by 1. This is the only way that
357408 `n_orderable_ops` can decrease.
358409
359- ### Broadcasting
410+ ## Broadcasting
360411
361412When performing comparisons between columns and aggregations or scalars, we operate as if the
362413aggregation or scalar was broadcasted to the length of the whole column. For example, if we
@@ -377,3 +428,67 @@ Narwhals triggers a broadcast in these situations:
377428
378429Each backend is then responsible for doing its own broadcasting, as defined in each
379430`CompliantExpr.broadcast` method.
431+
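To make the broadcasting idea concrete, here is a small plain-Python sketch. The `broadcast` helper below is a hypothetical illustration operating on lists. It is not the implementation of any backend's `CompliantExpr.broadcast` method:

```python
def broadcast(values, length):
    """Repeat a length-1 result so that it lines up with a full column."""
    if len(values) == length:
        return values
    if len(values) == 1:
        return values * length
    raise ValueError("can only broadcast length-1 results")


column = [1.0, 2.0, 6.0]
# An aggregation produces a single value...
mean = [sum(column) / len(column)]
# ...which is broadcast before the elementwise comparison with the column.
mask = [x > m for x, m in zip(column, broadcast(mean, len(column)))]
print(mask)
```

The comparison itself stays elementwise. Only the length-1 aggregation result is stretched to match the column.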
432+ ## Elementwise push-down
433+
434+ SQL is picky about `over` operations. For example:
435+
436+ - `sum(a) over (partition by b)` is valid.
437+ - `sum(abs(a)) over (partition by b)` is valid.
438+ - `abs(sum(a)) over (partition by b)` is not valid.
439+
440+ In Polars, however, all three of the following are valid:
441+
442+ - `pl.col('a').sum().over('b')`
443+ - `pl.col('a').abs().sum().over('b')`
444+ - `pl.col('a').sum().abs().over('b')`
445+
446+ How can we retain Polars' level of flexibility when translating to SQL engines?
447+
448+ The answer is: by rewriting expressions. Specifically, we push down `over` nodes past elementwise ones.
449+ To see this, let's try printing the Narwhals equivalent of the last expression above (the one that SQL rejects):
450+
451+ ```python exec="1" result="python" session="pushdown" source="above"
452+ import narwhals as nw
453+
454+ print(nw.col("a").sum().abs().over("b"))
455+ ```
456+
457+ Note how Narwhals automatically inserted the `over` operation _before_ the `abs` one. In other words, instead
458+ of doing
459+
460+ - `sum` -> `abs` -> `over`
461+
462+ it did
463+
464+ - `sum` -> `over` -> `abs`
465+
466+ thus allowing the expression to be valid for SQL engines!
467+
468+ This is what we refer to as "pushing down `over` nodes". The idea is:
469+
470+ - Elementwise operations operate row-by-row and don't depend on the rows around them.
471+ - An `over` node partitions or orders a computation.
472+ - Therefore, an elementwise operation followed by an `over` operation is the same
473+ as doing the `over` operation followed by that same elementwise operation!
474+
475+ Note that the pushdown also applies to any arguments to the elementwise operation.
476+ For example, if we have
477+
478+ ```python
479+ (nw.col("a").sum() + nw.col("b").sum()).over("c")
480+ ```
481+
482+ then `+` is an elementwise operation and so can be swapped with `over`. We just need
483+ to take care to apply the `over` operation to all the arguments of `+`, so that we
484+ end up with
485+
486+ ```python
487+ nw.col("a").sum().over("c") + nw.col("b").sum().over("c")
488+ ```
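The rewrite described above can be sketched as a small recursive function over a toy expression tree. The `Col`, `Call`, and `Over` classes and the `push_down_over` and `render` helpers are hypothetical names made up for this example, and the sketch assumes every argument of an elementwise call is itself an expression. Narwhals' actual implementation works on its own node structures and may differ:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Call:
    name: str
    args: tuple
    elementwise: bool


@dataclass(frozen=True)
class Over:
    expr: object
    partition_by: str


def push_down_over(expr, partition_by):
    """Move `over` past elementwise calls, applying it to each argument."""
    if isinstance(expr, Call) and expr.elementwise:
        new_args = tuple(push_down_over(arg, partition_by) for arg in expr.args)
        return Call(expr.name, new_args, elementwise=True)
    # Anything else (e.g. an aggregation) is wrapped in `over` directly.
    return Over(expr, partition_by)


def render(expr):
    """Print the toy tree in a Narwhals-like syntax."""
    if isinstance(expr, Col):
        return f'col("{expr.name}")'
    if isinstance(expr, Over):
        return f'{render(expr.expr)}.over("{expr.partition_by}")'
    if expr.name == "+":
        return " + ".join(render(arg) for arg in expr.args)
    return f"{render(expr.args[0])}.{expr.name}()"


a_sum = Call("sum", (Col("a"),), elementwise=False)
b_sum = Call("sum", (Col("b"),), elementwise=False)
total = Call("+", (a_sum, b_sum), elementwise=True)

rewritten = push_down_over(total, "c")
print(render(rewritten))
```

The same function handles the earlier `sum` -> `abs` -> `over` case: wrapping `a_sum` in an elementwise `abs` call and pushing down `over("b")` yields `sum` -> `over` -> `abs`.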
489+
490+ !!! info
491+ In general, query optimisation is out-of-scope for Narwhals. We consider this
492+ expression rewrite acceptable because:
493+ - It's simple.
494+ - It allows us to evaluate operations which otherwise wouldn't be allowed for certain backends.