What is the DataFrame.lazy() method in Polars?

The DataFrame.lazy() method

The DataFrame.lazy() method in Polars starts a lazy computation on a DataFrame. Operations applied afterward are not executed immediately; instead, they are recorded as a computation graph (also called a query plan), which is the sequence of operations to be performed on a LazyFrame object. Deferring execution lets Polars optimize the whole query before running it and gives it more opportunities for parallelization.

Syntax

Let’s see the syntax of the lazy() method:

DataFrame.lazy()

Return value

The DataFrame.lazy() method returns a LazyFrame object. The LazyFrame object is similar to a DataFrame object, but it’s lazily evaluated.

Code

We create a sample DataFrame with four columns (a, b, c, and d) and call the lazy() method on it:

import polars as pl

df = pl.DataFrame(
    {
        "a": [50, 100, 35, 87],
        "b": [9.2, 5.4, 2.5, 13.4],
        "c": [True, True, False, True],
        "d": [23, 65, 83, 91],
    }
)

lazy_frame = df.lazy()
print(lazy_frame)

# Another example of the lazy() method with filter
lazy_frame2 = df.lazy().filter(pl.col("a") == 100)
print(lazy_frame2)

Explanation

Here’s a step-by-step explanation of the provided code:

  • Lines 3–10: We create a DataFrame named df using the pl.DataFrame() constructor. The DataFrame has four columns (a, b, c, and d) with some data.

  • Line 12: We apply the lazy() method to the DataFrame df, creating a LazyFrame named lazy_frame. This LazyFrame represents a computation query or graph of deferred operations.

  • Line 13: We print the representation of the LazyFrame.

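  • Lines 16–17: We apply the filter() method to the LazyFrame returned by df.lazy() and print the resulting LazyFrame. The filter is only recorded in the query plan at this point; it has not been executed.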

Note: Check out the Answer on the filter() function for more information.

Note that printing a LazyFrame directly does not display its data; it shows a description of the query plan instead. To view the actual content, we need to execute the query. Some LazyFrame operations are given below:

Operations on LazyFrame

Once we have a LazyFrame, we can apply a range of operations to it. Transformations such as filter() are only recorded in the query plan; nothing runs until we explicitly call a method that executes the query. Here are some of the methods that execute or inspect a query:

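  • fetch(): This executes the lazy operations on a small number of rows. It is meant for quick debugging, and recent Polars releases deprecate it in favor of combining head() with collect().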

  • collect(): This executes the lazy operations on all the data.

  • describe_plan(): This prints the unoptimized query plan.

  • describe_optimized_plan(): This prints the optimized query plan.

  • show_graph(): This displays the (un)optimized query plan as a Graphviz graph.

Now, let’s take a look at the fetch() operation:

import polars as pl

df = pl.DataFrame(
    {
        "a": [50, 100, 35, 87],
        "b": [9.2, 5.4, 2.5, 13.4],
        "c": [True, True, False, True],
        "d": [23, 65, 83, 91],
    }
)

# Execute the (empty) query on a small number of rows
lazy_frame = df.lazy()
print(lazy_frame.fetch(2))

# Execute the filter query on all the data
lazy_frame2 = df.lazy().filter(pl.col("a") == 100)
print(lazy_frame2.collect())

The fetch(2) call triggers execution and returns a DataFrame containing the first two rows of the original DataFrame, which print() then displays. The collect() method, on the other hand, executes the query on all the data and returns the result as a DataFrame object; here, that is the single row where a equals 100.

Conclusion

The DataFrame.lazy() method enables lazy evaluation of operations on DataFrames in Polars. Because execution is deferred, Polars can optimize and parallelize the whole query before running it, which can be crucial when dealing with large datasets. For complex queries, lazy operations can lead to more efficient and faster data processing.

Copyright ©2024 Educative, Inc. All rights reserved