@@ -76,8 +76,9 @@ pn = PandasLikeNamespace(
7676 implementation=Implementation.PANDAS,
7777 version=Version.MAIN,
7878 )
79- print(nw.col("a")._to_compliant_expr(pn))
79+ print(nw.col("a")(pn))
8080 ```
81+
8182 The result from the last line above is the same as we'd get from `pn.col('a')`, and it's
8283 a `narwhals._pandas_like.expr.PandasLikeExpr` object, which we'll call `PandasLikeExpr` for
8384 short.
@@ -177,7 +178,7 @@ The way you access the Narwhals-compliant wrapper depends on the object:
177178
178179 - `narwhals.DataFrame` and `narwhals.LazyFrame`: use the `._compliant_frame` attribute.
179180 - `narwhals.Series`: use the `._compliant_series` attribute.
180- - `narwhals.Expr`: call the `._to_compliant_expr` method, and pass to it the Narwhals-compliant namespace associated with
181+ - `narwhals.Expr`: call the `.__call__` method, and pass to it the Narwhals-compliant namespace associated with
181182 the given backend.
182183
183184 🛑 BUT WAIT! What's a Narwhals-compliant namespace?
@@ -212,9 +213,10 @@ pn = PandasLikeNamespace(
212213 implementation=Implementation.PANDAS,
213214 version=Version.MAIN,
214215 )
215- expr = (nw.col("a") + 1)._to_compliant_expr(pn)
216+ expr = (nw.col("a") + 1)(pn)
216217 print(expr)
217218 ```
219+
218220 If we then extract a Narwhals-compliant dataframe from `df` by
219221 calling `._compliant_frame`, we get a `PandasLikeDataFrame` - and that's an object which we can pass `expr` to!
220222
@@ -228,6 +230,7 @@ We can then view the underlying pandas Dataframe which was produced by calling `
228230 ```python exec="1" result="python" session="pandas_api_mapping" source="above"
229231 print(result._native_frame)
230232 ```
233+
231234 which is the same as we'd have obtained by just using the Narwhals API directly:
232235
233236 ```python exec="1" result="python" session="pandas_api_mapping" source="above"
@@ -238,49 +241,42 @@ print(nw.to_native(df.select(nw.col("a") + 1)))
238241
239242 Group-by is probably one of Polars' most significant innovations (on the syntax side) with respect
240243 to pandas. We can write something like
244+
241245 ```python
242246 df: pl.DataFrame
243247 df.group_by("a").agg((pl.col("c") > pl.col("b").mean()).max())
244248 ```
249+
245250 To do this in pandas, we need to either use `GroupBy.apply` (sloooow), or do some crazy manual
246251 optimisations to get it to work.
247252
248253 In Narwhals, here's what we do:
249254
250255 - if somebody uses a simple group-by aggregation (e.g. `df.group_by('a').agg(nw.col('b').mean())`),
251256 then on the pandas side we translate it to
252- ```python
253- df: pd.DataFrame
254- df.groupby("a").agg({"b": ["mean"]})
255- ```
257+
258+ ```python
259+ df: pd.DataFrame
260+ df.groupby("a").agg({"b": ["mean"]})
261+ ```
262+
256263 - if somebody passes a complex group-by aggregation, then we use `apply` and raise a `UserWarning`, warning
257264 users of the performance penalty and advising them to refactor their code so that the aggregation they perform
258265 ends up being a simple one.
259266
260- In order to tell whether an aggregation is simple, Narwhals uses the private `_depth` attribute of `PandasLikeExpr`:
261-
262- ```python exec="1" result="python" session="pandas_impl" source="above"
263- print(pn.col("a").mean())
264- print((pn.col("a") + 1).mean())
265- ```
266-
267- For simple aggregations, Narwhals can just look at `_depth` and `function_name` and figure out
268- which (efficient) elementary operation this corresponds to in pandas.
269-
270267 ## Expression Metadata
271268
272- Let's try printing out a few expressions to the console to see what they show us:
269+ Let's try printing out some compliant expressions' metadata to see what it shows us:
273270
274- ```python exec="1" result="python" session="metadata" source="above"
271+ ```python exec="1" result="python" session="pandas_impl" source="above"
275272 import narwhals as nw
276273
277- print(nw.col("a"))
278- print(nw.col("a").mean())
279- print(nw.col("a").mean().over("b"))
274+ print(nw.col("a")(pn)._metadata)
275+ print(nw.col("a").mean()(pn)._metadata)
276+ print(nw.col("a").mean().over("b")(pn)._metadata)
280277 ```
281278
282- Note how they tell us something about their metadata. This section is all about
283- making sense of what that all means, what the rules are, and what it enables.
279+ This section is all about making sense of what that all means, what the rules are, and what it enables.
284280
285281 Here's a brief description of each piece of metadata:
286282
@@ -293,8 +289,6 @@ Here's a brief description of each piece of metadata:
293289 - `ExpansionKind.MULTI_UNNAMED`: Produces multiple outputs whose names depend
294290 on the input dataframe. For example, `nw.nth(0, 1)` or `nw.selectors.numeric()`.
295291
296- - `last_node`: Kind of the last operation in the expression. See
297- `narwhals._expression_parsing.ExprKind` for the various options.
298292 - `has_windows`: Whether the expression already contains an `over(...)` statement.
299293 - `n_orderable_ops`: How many order-dependent operations the expression contains.
300294
@@ -311,6 +305,7 @@ Here's a brief description of each piece of metadata:
311305 - `is_scalar_like`: Whether the output of the expression is always length-1.
312306 - `is_literal`: Whether the expression doesn't depend on any column but instead
313307 only on literal values, like `nw.lit(1)`.
308+ - `nodes`: List of operations which this expression applies when evaluated.
314309
315310#### Chaining
316311
@@ -377,3 +372,67 @@ Narwhals triggers a broadcast in these situations:
377372
378373 Each backend is then responsible for doing its own broadcasting, as defined in each
379374 `CompliantExpr.broadcast` method.
375+
376+ ### Elementwise push-down
377+
378+ SQL is picky about `over` operations. For example:
379+
380+ - `sum(a) over (partition by b)` is valid.
381+ - `sum(abs(a)) over (partition by b)` is valid.
382+ - `abs(sum(a)) over (partition by b)` is not valid.
383+
384+ In Polars, however, all three of the following are valid:
385+
386+ - `pl.col('a').sum().over('b')`
387+ - `pl.col('a').abs().sum().over('b')`
388+ - `pl.col('a').sum().abs().over('b')`
389+
390+ How can we retain Polars' level of flexibility when translating to SQL engines?
391+
392+ The answer is: by rewriting expressions. Specifically, we push down `over` nodes past elementwise ones.
393+ To see this, let's try printing the Narwhals equivalent of the last expression above (the one that SQL rejects):
394+
395+ ```python exec="1" result="python" session="pushdown" source="above"
396+ import narwhals as nw
397+
398+ print(nw.col("a").sum().abs().over("b"))
399+ ```
400+
401+ Note how Narwhals automatically inserted the `over` operation _before_ the `abs` one. In other words, instead
402+ of doing
403+
404+ - `sum` -> `abs` -> `over`
405+
406+ it did
407+
408+ - `sum` -> `over` -> `abs`
409+
410+ thus allowing the expression to be valid for SQL engines!
411+
412+ This is what we refer to as "pushing down `over` nodes". The idea is:
413+
414+ - Elementwise operations operate row-by-row and don't depend on the rows around them.
415+ - An `over` node partitions or orders a computation.
416+ - Therefore, an elementwise operation followed by an `over` operation is the same
417+ as doing the `over` operation followed by that same elementwise operation!
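
The equivalence in the bullets above can be checked by hand on a tiny example. The following is only a sketch in plain Python, not how Narwhals or any backend evaluates expressions: the helper `partition_sums` and the sample data are made up for illustration, and partitions are modelled as dictionary keys.

```python
from collections import defaultdict


def partition_sums(values, keys):
    """Hypothetical helper: return {key: sum of the values in that partition}."""
    totals = defaultdict(int)
    for value, key in zip(values, keys):
        totals[key] += value
    return dict(totals)


a = [1, -4, 2, -3]
b = ["x", "x", "y", "y"]
totals = partition_sums(a, b)

# sum -> abs -> over: take abs of each partition's total, then broadcast to rows.
abs_totals = {key: abs(total) for key, total in totals.items()}
original = [abs_totals[key] for key in b]

# sum -> over -> abs: broadcast each partition's total to rows, then take abs per row.
broadcast = [totals[key] for key in b]
pushed_down = [abs(value) for value in broadcast]

print(original)
print(pushed_down)
```

Both orderings give the same per-row result, because `abs` only ever looks at one value at a time.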
418+
419+ Note that the pushdown also applies to any arguments to the elementwise operation.
420+ For example, if we have
421+
422+ ```python
423+ (nw.col("a").sum() + nw.col("b").sum()).over("c")
424+ ```
425+
426+ then `+` is an elementwise operation and so can be swapped with `over`. We just need
427+ to take care to apply the `over` operation to all the arguments of `+`, so that we
428+ end up with
429+
430+ ```python
431+ nw.col("a").sum().over("c") + nw.col("b").sum().over("c")
432+ ```
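
To make the rewrite rule concrete, here is a toy sketch of pushing `over` past elementwise nodes on a small expression tree. It is not the Narwhals implementation: the classes `Col`, `Agg`, `Elementwise` and `Over` and the functions `push_down_over` and `render` are hypothetical names invented for this illustration, and literal arguments (such as the `1` in `nw.col("a") + 1`) are deliberately left out of scope.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Agg:
    func: str
    arg: Node


@dataclass(frozen=True)
class Elementwise:
    func: str
    args: tuple[Node, ...]


@dataclass(frozen=True)
class Over:
    arg: Node
    partition_by: str


Node = Union[Col, Agg, Elementwise, Over]


def push_down_over(node: Node) -> Node:
    """Move an `Over` below any elementwise node directly beneath it."""
    if isinstance(node, Over) and isinstance(node.arg, Elementwise):
        inner = node.arg
        # Apply `over` to every argument, recursing in case an argument
        # is itself an elementwise node.
        new_args = tuple(
            push_down_over(Over(arg, node.partition_by)) for arg in inner.args
        )
        return Elementwise(inner.func, new_args)
    return node


def render(node: Node) -> str:
    """Print a tree in method-chaining style, for inspection only."""
    if isinstance(node, Col):
        return f"col({node.name!r})"
    if isinstance(node, Agg):
        return f"{render(node.arg)}.{node.func}()"
    if isinstance(node, Over):
        return f"{render(node.arg)}.over({node.partition_by!r})"
    first, *rest = node.args
    return f"{render(first)}.{node.func}({', '.join(render(r) for r in rest)})"


expr = Over(
    Elementwise("__add__", (Agg("sum", Col("a")), Agg("sum", Col("b")))),
    "c",
)
rewritten = push_down_over(expr)
print(render(expr))
print(render(rewritten))
```

In this sketch the single `over("c")` wrapping the sum of two aggregations becomes one `over("c")` per argument of `+`, mirroring the rewrite described above.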
433+
434+ In general, query optimisation is out-of-scope for Narwhals. We consider this
435+ expression rewrite acceptable because:
436+
437+ - It's simple.
438+ - It allows us to evaluate operations which otherwise wouldn't be allowed for certain backends.