Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
Introduction
Polars easily trumps other solutions due to its parallel execution engine, efficient algorithms and use of vectorization with SIMD (Single Instruction, Multiple Data).
- Easy to use. Write your queries the way they were intended. Polars will determine the most efficient way to execute them using its query optimizer.
- Embarrassingly parallel. Complete your queries faster! Polars fully utilizes the power of your machine by dividing the workload among the available CPU cores without any additional configuration or serialization overhead.
- Apache Arrow. Polars utilizes the Apache Arrow memory model, allowing you to easily integrate with existing tools in the data landscape. It supports zero-copy data sharing for efficient collaboration.
- Close to the metal. Polars is written from the ground up, designed close to the machine and without external dependencies. This allows for full control of the ecosystem (API, memory & execution).
- Written in Rust. The core of Polars is written in Rust, one of the fastest growing programming languages in the world. Rust allows for high performance with fine-grained control over memory.
- Out of core. Want to process large data sets that are bigger than your memory? The streaming API allows you to process your results efficiently, eliminating the need to keep all data in memory.
Install Polars:
> poetry add polars
Data types and structures
Data types
Polars supports a variety of data types that fall broadly under the following categories:
- Numeric data types: signed integers, unsigned integers, floating point numbers, and decimals.
- Nested data types: lists, structs, and arrays.
- Temporal: dates, datetimes, times, and time deltas.
- Miscellaneous: strings, binary data, Booleans, categoricals, enums, and objects.
All types support missing values, represented by the special value null. This is not to be conflated with the special value NaN in floating point data types.
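To make the distinction concrete, here is a minimal sketch (with made-up values) showing that null and NaN are detected by different predicates:
import polars as pl

s = pl.Series("values", [1.0, None, float("nan")])

# null marks a missing value; NaN is a valid IEEE 754 floating point value.
print(s.is_null())  # true only for the None entry
print(s.is_nan())   # true only for the NaN entry (the null entry stays null)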
Series
The core base data structures provided by Polars are series and dataframes. A series is a 1-dimensional homogeneous data structure. By “homogeneous” we mean that all elements inside a series have the same data type. The snippet below shows how to create a named series:
import polars as pl
s = pl.Series("ints", [1, 2, 3, 4, 5])
print(s)
shape: (5,)
Series: 'ints' [i64]
[
1
2
3
4
5
]
When creating a series, Polars will infer the data type from the values you provide.
You can specify a concrete data type to override the inference mechanism:
s1 = pl.Series("ints", [1, 2, 3, 4, 5])
s2 = pl.Series("uints", [1, 2, 3, 4, 5], dtype=pl.UInt64)
print(s1.dtype, s2.dtype)
Int64 UInt64
Dataframe
A dataframe is a 2-dimensional heterogeneous data structure that contains uniquely named series. By holding your data in a dataframe you will be able to use the Polars API to write queries that manipulate your data.
The snippet below shows how to create a dataframe from a dictionary of lists:
from datetime import date
df = pl.DataFrame(
{
"name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate": [
date(1997, 1, 10),
date(1985, 2, 15),
date(1983, 3, 22),
date(1981, 4, 30),
],
"weight": [57.9, 72.5, 53.6, 83.1], # (kg)
"height": [1.56, 1.77, 1.65, 1.75], # (m)
}
)
print(df)
shape: (4, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
└────────────────┴────────────┴────────┴────────┘
Inspecting a dataframe
In this subsection we will show some useful methods to quickly inspect a dataframe. We will use the dataframe we created earlier as a starting point.
Head
The function head shows the first rows of a dataframe. By default, you get the first 5 rows but you can also specify the number of rows you want:
print(df.head(3))
shape: (3, 4)
┌──────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞══════════════╪════════════╪════════╪════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
└──────────────┴────────────┴────────┴────────┘
Glimpse
The function glimpse is another function that shows the values of the first few rows of a dataframe, but formats the output differently from head. Here, each line of the output corresponds to a single column, making it easier to inspect wider dataframes:
print(df.glimpse(return_as_string=True))
Rows: 4
Columns: 4
$ name <str> 'Alice Archer', 'Ben Brown', 'Chloe Cooper', 'Daniel Donovan'
$ birthdate <date> 1997-01-10, 1985-02-15, 1983-03-22, 1981-04-30
$ weight <f64> 57.9, 72.5, 53.6, 83.1
$ height <f64> 1.56, 1.77, 1.65, 1.75
Tail
The function tail shows the last rows of a dataframe. By default, you get the last 5 rows but you can also specify the number of rows you want, similar to how head works:
print(df.tail(3))
shape: (3, 4)
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
└────────────────┴────────────┴────────┴────────┘
Sample
If you think the first or last rows of your dataframe are not representative of your data, you can use sample to get an arbitrary number of randomly selected rows from the dataframe. Note that the rows are not necessarily returned in the same order as they appear in the dataframe:
print(df.sample(2))
┌────────────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞════════════════╪════════════╪════════╪════════╡
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
└────────────────┴────────────┴────────┴────────┘
Describe
You can also use describe to compute summary statistics for all columns of your dataframe:
print(df.describe())
shape: (9, 5)
┌────────────┬────────────────┬─────────────────────┬───────────┬──────────┐
│ statistic ┆ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ f64 ┆ f64 │
╞════════════╪════════════════╪═════════════════════╪═══════════╪══════════╡
│ count ┆ 4 ┆ 4 ┆ 4.0 ┆ 4.0 │
│ null_count ┆ 0 ┆ 0 ┆ 0.0 ┆ 0.0 │
│ mean ┆ null ┆ 1986-09-04 00:00:00 ┆ 66.775 ┆ 1.6825 │
│ std ┆ null ┆ null ┆ 13.560082 ┆ 0.097082 │
│ min ┆ Alice Archer ┆ 1981-04-30 ┆ 53.6 ┆ 1.56 │
│ 25% ┆ null ┆ 1983-03-22 ┆ 57.9 ┆ 1.65 │
│ 50% ┆ null ┆ 1985-02-15 ┆ 72.5 ┆ 1.75 │
│ 75% ┆ null ┆ 1985-02-15 ┆ 72.5 ┆ 1.75 │
│ max ┆ Daniel Donovan ┆ 1997-01-10 ┆ 83.1 ┆ 1.77 │
└────────────┴────────────────┴─────────────────────┴───────────┴──────────┘
Schema
When talking about data (in a dataframe or otherwise) we can refer to its schema. The schema is a mapping of column or series names to the data types of those same columns or series.
Much like with series, Polars will infer the schema of a dataframe when you create it but you can override the inference system if needed. You can check the schema of a dataframe with schema:
print(df.schema)
Schema({'name': String, 'birthdate': Date, 'weight': Float64, 'height': Float64})
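As with series, you can override the inferred schema when constructing a dataframe. A minimal sketch, assuming we want the weight column stored as a 32-bit float instead of the inferred Float64:
df_override = pl.DataFrame(
    {
        "name": ["Alice Archer", "Ben Brown"],
        "weight": [57.9, 72.5],
    },
    schema_overrides={"weight": pl.Float32},
)
print(df_override.schema)  # Schema({'name': String, 'weight': Float32})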
Data types internals
Polars utilizes the Arrow Columnar Format for its data orientation. Following this specification allows Polars to transfer data to/from other tools that also use the Arrow specification with little to no overhead.
Polars gets most of its performance from its query engine, the optimizations it performs on your query plans, and from the parallelization that it employs when running your expressions.
Floating point numbers
Polars generally follows the IEEE 754 floating point standard for Float32 and Float64, with some exceptions:
- Any NaN compares equal to any other NaN, and greater than any non-NaN value (see the sketch below).
- Operations do not guarantee any particular behavior on the sign of zero or NaN, nor on the payload of NaN values. This is not just limited to arithmetic operations; e.g., a sort or group by operation may canonicalize all zeroes to +0 and all NaNs to a positive NaN without payload for efficient equality checks.
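A small sketch of the first rule, using made-up values: because NaNs compare equal to each other and greater than any non-NaN value, unique() collapses them to a single entry and sort() places them last:
import polars as pl

nan = float("nan")
s = pl.Series("x", [nan, 1.0, nan, 2.0])

# NaNs are treated as equal to each other and larger than any non-NaN value.
print(s.unique().sort())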
Polars always attempts to provide reasonably accurate results for floating point computations but does not provide guarantees on the error unless mentioned otherwise. Generally speaking 100% accurate results are infeasibly expensive to achieve (requiring much larger internal representations than 64-bit floats), and thus some error is always to be expected.
Full data types table
Type(s) | Details |
---|---|
Boolean | Boolean type that is bit packed efficiently. |
Int8, Int16, Int32, Int64 | Varying-precision signed integer types. |
UInt8, UInt16, UInt32, UInt64 | Varying-precision unsigned integer types. |
Float32, Float64 | Varying-precision signed floating point numbers. |
Decimal | Decimal 128-bit type with optional precision and non-negative scale. Use this if you need fine-grained control over the precision of your floats and the operations you make on them. See Python's decimal.Decimal for documentation on what a decimal data type is. |
String | Variable length UTF-8 encoded string data, typically human-readable. |
Binary | Stores arbitrary, varying length raw binary data. |
Date | Represents a calendar date. |
Time | Represents a time of day. |
Datetime | Represents a calendar date and time of day. |
Duration | Represents a time duration. |
Array | Arrays with a known, fixed shape per series; akin to numpy arrays. |
List | Homogeneous 1D container with variable length. |
Object | Wraps arbitrary Python objects. |
Categorical | Efficient encoding of string data where the categories are inferred at runtime. |
Enum | Efficient ordered encoding of a set of predetermined string categories. |
Struct | Composite product type that can store multiple fields. |
Null | Represents null values. |
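As an illustrative sketch with hypothetical values, a couple of the less common types from the table can be constructed explicitly:
import polars as pl

# Enum: a fixed, ordered set of categories known up front.
sizes = pl.Series("size", ["small", "large"], dtype=pl.Enum(["small", "medium", "large"]))

# Array: every element has the same fixed shape (here, 2 values per row).
points = pl.Series("point", [[1, 2], [3, 4]], dtype=pl.Array(pl.Int64, 2))

print(sizes.dtype, points.dtype)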
Expressions and contexts
Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. Expressions and contexts, which will be introduced here, are very important in achieving this readability while also allowing the Polars query engine to optimize your queries to make them run as fast as possible.
Expressions
In Polars, an expression is a lazy representation of a data transformation. Expressions are modular and flexible, which means you can use them as building blocks to build more complex expressions. Here is an example of a Polars expression:
import polars as pl
pl.col("weight") / (pl.col("height") ** 2)
As you might be able to guess, this expression takes a column named "weight" and divides its values by the square of the values in a column "height", computing a person's BMI.
The code above expresses an abstract computation that we can save in a variable, manipulate further, or just print:
bmi_expr = pl.col("weight") / (pl.col("height") ** 2)
print(bmi_expr)
[(col("weight")) / (col("height").pow([dyn int: 2]))]
Because expressions are lazy, no computations have taken place yet. That's what we need contexts for.
Contexts
Polars expressions need a context in which they are executed to produce a result. Depending on the context it is used in, the same Polars expression can produce different results. In this section, we will learn about the four most common contexts that Polars provides:
- select
- with_columns
- filter
- group_by
We use the dataframe defined earlier to show how each of the contexts works.
select
The selection context select applies expressions over columns. The context select may produce new columns that are aggregations, combinations of other columns, or literals:
result = df.select(
bmi=bmi_expr,
avg_bmi=bmi_expr.mean(),
ideal_max_bmi=25,
)
print(result)
shape: (4, 3)
┌───────────┬───────────┬───────────────┐
│ bmi ┆ avg_bmi ┆ ideal_max_bmi │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ i32 │
╞═══════════╪═══════════╪═══════════════╡
│ 23.791913 ┆ 23.438973 ┆ 25 │
│ 23.141498 ┆ 23.438973 ┆ 25 │
│ 19.687787 ┆ 23.438973 ┆ 25 │
│ 27.134694 ┆ 23.438973 ┆ 25 │
└───────────┴───────────┴───────────────┘
The expressions in a context select must produce series that are all the same length, or they must produce a scalar. Scalars will be broadcast to match the length of the remaining series. Literals, like the number used above, are also broadcast.
Note that broadcasting can also occur within expressions. For instance, consider the expression below:
result = df.select(deviation=(bmi_expr - bmi_expr.mean()) / bmi_expr.std())
print(result)
shape: (4, 1)
┌───────────┐
│ deviation │
│ --- │
│ f64 │
╞═══════════╡
│ 0.115645 │
│ -0.097471 │
│ -1.22912 │
│ 1.210946 │
└───────────┘
Both the subtraction and the division use broadcasting within the expression because the subexpressions that compute the mean and the standard deviation evaluate to single values.
The context select is very flexible and powerful and allows you to evaluate arbitrary expressions independent of, and in parallel to, each other. This is also true of the other contexts that we will see next.
with_columns
The context with_columns is very similar to the context select. The main difference between the two is that the context with_columns creates a new dataframe that contains the columns from the original dataframe and the new columns according to its input expressions, whereas the context select only includes the columns selected by its input expressions:
result = df.with_columns(
bmi=bmi_expr,
avg_bmi=bmi_expr.mean(),
ideal_max_bmi=25,
)
print(result)
shape: (4, 7)
┌────────────────┬────────────┬────────┬────────┬───────────┬───────────┬───────────────┐
│ name ┆ birthdate ┆ weight ┆ height ┆ bmi ┆ avg_bmi ┆ ideal_max_bmi │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i32 │
╞════════════════╪════════════╪════════╪════════╪═══════════╪═══════════╪═══════════════╡
│ Alice Archer ┆ 1997-01-10 ┆ 57.9 ┆ 1.56 ┆ 23.791913 ┆ 23.438973 ┆ 25 │
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 ┆ 23.141498 ┆ 23.438973 ┆ 25 │
│ Chloe Cooper ┆ 1983-03-22 ┆ 53.6 ┆ 1.65 ┆ 19.687787 ┆ 23.438973 ┆ 25 │
│ Daniel Donovan ┆ 1981-04-30 ┆ 83.1 ┆ 1.75 ┆ 27.134694 ┆ 23.438973 ┆ 25 │
└────────────────┴────────────┴────────┴────────┴───────────┴───────────┴───────────────┘
Because of this difference between select and with_columns, the expressions used in a context with_columns must produce series that have the same length as the original columns in the dataframe, whereas it is enough for the expressions in the context select to produce series that have the same length among them.
filter
The context filter filters the rows of a dataframe based on one or more expressions that evaluate to the Boolean data type:
result = df.filter(
pl.col("birthdate").is_between(date(1982, 12, 31), date(1996, 1, 1)),
pl.col("height") > 1.7,
)
print(result)
shape: (1, 4)
┌───────────┬────────────┬────────┬────────┐
│ name ┆ birthdate ┆ weight ┆ height │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ f64 ┆ f64 │
╞═══════════╪════════════╪════════╪════════╡
│ Ben Brown ┆ 1985-02-15 ┆ 72.5 ┆ 1.77 │
└───────────┴────────────┴────────┴────────┘
group_by and aggregations
In the context group_by, rows are grouped according to the unique values of the grouping expressions. You can then apply expressions to the resulting groups, which may be of variable lengths. When using the context group_by, you can use an expression to compute the groupings dynamically:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
).agg(pl.col("name"))
print(result)
shape: (2, 2)
┌────────┬─────────────────────────────────┐
│ decade ┆ name │
│ --- ┆ --- │
│ i32 ┆ list[str] │
╞════════╪═════════════════════════════════╡
│ 1990 ┆ ["Alice Archer"] │
│ 1980 ┆ ["Ben Brown", "Chloe Cooper", … │
└────────┴─────────────────────────────────┘
After using group_by we use agg to apply aggregating expressions to the groups. Since in the example above we only specified the name of a column, we get the groups of that column as lists.
We can specify as many grouping expressions as we'd like and the context group_by will group the rows according to the distinct values across the expressions specified. Here, we group by a combination of decade of birth and whether the person is shorter than 1.7 metres:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
(pl.col("height") < 1.7).alias("short?"),
).agg(pl.col("name"))
print(result)
shape: (3, 3)
┌────────┬────────┬─────────────────────────────────┐
│ decade ┆ short? ┆ name │
│ --- ┆ --- ┆ --- │
│ i32 ┆ bool ┆ list[str] │
╞════════╪════════╪═════════════════════════════════╡
│ 1980 ┆ true ┆ ["Chloe Cooper"] │
│ 1980 ┆ false ┆ ["Ben Brown", "Daniel Donovan"… │
│ 1990 ┆ true ┆ ["Alice Archer"] │
└────────┴────────┴─────────────────────────────────┘
The resulting dataframe, after applying aggregating expressions, contains one column per each grouping expression on the left and then as many columns as needed to represent the results of the aggregating expressions.
In turn, we can specify as many aggregating expressions as we want:
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
(pl.col("height") < 1.7).alias("short?"),
).agg(
pl.len(),
pl.col("height").max().alias("tallest"),
pl.col("weight", "height").mean().name.prefix("avg_"),
)
print(result)
shape: (3, 6)
┌────────┬────────┬─────┬─────────┬────────────┬────────────┐
│ decade ┆ short? ┆ len ┆ tallest ┆ avg_weight ┆ avg_height │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ bool ┆ u32 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪════════╪═════╪═════════╪════════════╪════════════╡
│ 1980 ┆ true ┆ 1 ┆ 1.65 ┆ 53.6 ┆ 1.65 │
│ 1990 ┆ true ┆ 1 ┆ 1.56 ┆ 57.9 ┆ 1.56 │
│ 1980 ┆ false ┆ 2 ┆ 1.77 ┆ 77.8 ┆ 1.76 │
└────────┴────────┴─────┴─────────┴────────────┴────────────┘
Expression expansion
The last example contained two grouping expressions and three aggregating expressions, and yet the resulting dataframe contained six columns instead of five. If we look closely, the last aggregating expression mentioned two different columns: “weight” and “height”.
Polars expressions support a feature called expression expansion. Expression expansion is like a shorthand notation for when you want to apply the same transform to multiple columns. As we have seen, the expression
pl.col("weight", "height").mean().name.prefix("avg_")
will compute the mean value of the columns “weight” and “height” and will rename them as “avg_weight” and “avg_height”, respectively.
In fact, the expression above is equivalent to using the two following expressions:
[
pl.col("weight").mean().alias("avg_weight"),
pl.col("height").mean().alias("avg_height"),
]
In this case, this expression expands into two independent expressions that Polars can execute in parallel.
In other cases, we may not be able to know in advance how many independent expressions an expression will unfold into.
Consider this simple but elucidative example:
(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
This expression will multiply all columns with data type Float64 by 1.1. The number of columns this applies to depends on the schema of each dataframe. In the case of the dataframe we have been using, it applies to two columns:
expr = (pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
result = df.select(expr)
print(result)
shape: (4, 2)
┌────────────┬────────────┐
│ weight*1.1 ┆ height*1.1 │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞════════════╪════════════╡
│ 63.69 ┆ 1.716 │
│ 79.75 ┆ 1.947 │
│ 58.96 ┆ 1.815 │
│ 91.41 ┆ 1.925 │
└────────────┴────────────┘
In the case of the dataframe df2 below, the same expression expands to 0 columns, because no column has the data type Float64:
df2 = pl.DataFrame(
{
"ints": [1, 2, 3, 4],
"letters": ["A", "B", "C", "D"],
}
)
result = df2.select(expr)
print(result)
shape: (0, 0)
┌┐
╞╡
└┘
Conclusion
Because expressions are lazy, when you use an expression inside a context Polars can try to simplify your expression before running the data transformation it expresses. Separate expressions within a context are embarrassingly parallel and Polars will take advantage of that, while also parallelizing expression execution when using expression expansion. Further performance gains can be obtained when using the lazy API of Polars, which is introduced next.
We have only scratched the surface of the capabilities of expressions. There are a ton more expressions and they can be combined in a variety of ways. See the section on expressions for a deeper dive on the different types of expressions available.
Lazy API
Polars supports two modes of operation: lazy and eager. The examples so far have used the eager API, in which the query is executed immediately. In the lazy API, the query is only evaluated once it is collected. Deferring the execution to the last minute can have significant performance advantages and is why the lazy API is preferred in most cases.
Let us demonstrate this with an example:
df = pl.read_csv("data/iris.csv")
df_small = df.filter(pl.col("sepal_length") > 5)
df_agg = df_small.group_by("species").agg(pl.col("sepal_width").mean())
print(df_agg)
Download the file: https://gitlab.com/xtec/python/data/-/raw/main/iris.zip
In this example we use the eager API to:
- Read the iris dataset.
- Filter the dataset based on sepal length.
- Calculate the mean of the sepal width per species.
Every step is executed immediately returning the intermediate results. This can be very wasteful as we might do work or load extra data that is not being used. If we instead used the lazy API and waited on execution until all the steps are defined then the query planner could perform various optimizations. In this case:
- Predicate pushdown: Apply filters as early as possible while reading the dataset, thus only reading rows with sepal length greater than 5.
- Projection pushdown: Select only the columns that are needed while reading the dataset, thus removing the need to load additional columns (e.g., petal length and petal width).
q = (
pl.scan_csv("data/iris.csv")
.filter(pl.col("sepal_length") > 5)
.group_by("species")
.agg(pl.col("sepal_width").mean())
)
df = q.collect()
These will significantly lower the load on memory & CPU, thus allowing you to fit bigger datasets in memory and process them faster. Once the query is defined you call collect to inform Polars that you want to execute it. You can learn more about the lazy API in its dedicated chapter.
Eager API. In many cases the eager API is actually calling the lazy API under the hood and immediately collecting the result. This has the benefit that within the query itself optimization(s) made by the query planner can still take place.
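A minimal, self-contained sketch of that idea (hypothetical data, assuming a recent Polars version where DataFrame.equals is available): the eager call behaves like building the equivalent lazy query and collecting it immediately:
import polars as pl

df_demo = pl.DataFrame({"a": [1, 2, 3]})

# Eager call...
eager = df_demo.filter(pl.col("a") > 1)
# ...is roughly the same as building the lazy query and collecting right away.
lazy = df_demo.lazy().filter(pl.col("a") > 1).collect()

print(eager.equals(lazy))  # True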
When to use which
In general, the lazy API should be preferred unless you are either interested in the intermediate results or are doing exploratory work and don't know yet what your query is going to look like.
Previewing the query plan
When using the lazy API you can use the function explain to ask Polars to create a description of the query plan that will be executed once you collect the results. This can be useful if you want to see what types of optimizations Polars performs on your queries.
We can ask Polars to explain the query q we defined above:
print(q.explain())
AGGREGATE
[col("sepal_width").mean()] BY [col("species")] FROM
Csv SCAN [docs/assets/data/iris.csv]
PROJECT 3/5 COLUMNS
SELECTION: [(col("sepal_length")) > (5.0)]
Immediately, we can see in the explanation that Polars did apply predicate pushdown, as it is only reading rows where the sepal length is greater than 5, and it did apply projection pushdown, as it is only reading the columns that are needed by the query.
The function explain can also be used to see how expression expansion will unfold in the context of a given schema. Consider the example expression from the section on expression expansion:
(pl.col(pl.Float64) * 1.1).name.suffix("*1.1")
We can use explain to see how this expression would evaluate against an arbitrary schema:
schema = pl.Schema(
{
"int_1": pl.Int16,
"int_2": pl.Int32,
"float_1": pl.Float64,
"float_2": pl.Float64,
"float_3": pl.Float64,
}
)
print(
pl.LazyFrame(schema=schema)
.select((pl.col(pl.Float64) * 1.1).name.suffix("*1.1"))
.explain()
)
SELECT [[(col("float_1")) * (1.1)].alias("float_1*1.1"), [(col("float_2")) * (1.1)].alias("float_2*1.1"), [(col("float_3")) * (1.1)].alias("float_3*1.1")] FROM
DF ["int_1", "int_2", "float_1", "float_2"]; PROJECT 3/5 COLUMNS; SELECTION: None
CONTINUE WITH:
- Expressions ...
TODO (next course)
Tycho
Project Tycho has a dataset containing a historical record of case counts for various diseases that have affected the US from 1888 to 2014.
Some years have many records (1900-1950) and others fewer, but in total we can have more than 10M observations, each of which has 10 columns of interest.
To access it you can register (it is free), but we don't need to, since we have the data from 2 years ago and it does not need to be up to date for this example.
We have uploaded a subset of 1M lines in this file (120 MB once decompressed):
https://gitlab.com/xtec/python/data/-/raw/main/tycho-mini.zip
You have a code example at this link: https://gitlab.com/xtec/python/polars/-/blob/main/app/tycho.py?ref_type=heads
Note the columns we originally have in the Tycho CSV file:
epi_week | Epidemiological week (usually from 1 to 52, occasionally 53). It is a necessary and very common metric in medical informatics. |
country | Country. This dataset only contains samples from the United States (US), so we can omit it. |
state | Abbreviation of the US state. |
loc | Full name of the US state. Keep state and loc (redundant info) only if we want to display maps. |
loc_type | In the dataset it can be CITY or STATE. |
disease | Disease. The text in square brackets [] indicates additional information; if our study does not need it, omitting it can save DataFrame memory. |
event | Each event can be one of 2 types, and it is important to distinguish them depending on what we want to study: CASES (number of cases) or DEATHS (number of deaths caused by the disease). To work with this study properly, a ratio of CASES to DEATHS can be computed, whose result will be between 0 and 1 (1 if all cases were fatal). |
number | Important! Number of cases (if event='CASES') or number of deaths (if event='DEATHS'). |
from_date,to_date | Start and end dates over which the number of cases is measured. They are (or should be) one-week intervals. |
url | The Tycho project has scanned and/or digitized paper documents to PDF (1980 and earlier) that document the recorded entries and should be very interesting. Unfortunately, the link provided does not work. |
Finally, note that the dataset is in Tidy format, which is what we want in order to produce statistics and charts.
It is essential to verify this before starting to investigate a dataset.
- Each row is an observation
- Each column is a variable
- Each value holds a single piece of data
If the dataset were not Tidy, we would have to preprocess and fix it until it is (see the reshaping sketch below).
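As a minimal, hypothetical sketch of such a fix (the numbers are made up): a wide table with one column per year is not Tidy, and unpivot (called melt in older Polars versions) reshapes it into one observation per row:
import polars as pl

# Hypothetical wide table: the year is a variable hidden in the column names.
wide = pl.DataFrame(
    {
        "city": ["NEW YORK", "CHICAGO"],
        "1910": [32403, 10145],
        "1911": [31800, 9800],
    }
)

# One row per (city, year) observation, i.e. a Tidy layout.
tidy = wide.unpivot(index="city", variable_name="year", value_name="deaths").with_columns(
    pl.col("year").cast(pl.Int32)
)
print(tidy)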
EXAMPLE. Take the Tycho dataset file of roughly 78 thousand lines that we have seen and perform this data selection (a possible sketch follows the list below):
- Drop 'country' and 'url' columns
- Rename 'evnt' to 'event'
- Cleanup the diseases removing the names in square brackets. (See hint below)
- Add a new column called 'year' of type 'int' with the year from the epi_week.
- Select rows where the year is 1910 or 1911.
- Add a new column called 'id' with a numerical unique identifier starting from 0
- Rename 'loc' to 'city', 'number' to 'deaths'
- Reorder columns as follows: ['id', 'year', 'epi_week', 'from_date', 'to_date', 'state', 'city', 'disease', 'deaths']
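One possible sketch of these steps, under a few assumptions: the raw file is called tycho.csv (hypothetical name), the raw event column is named evnt as stated above, epi_week is encoded as YYYYWW (so integer division by 100 yields the year), and a recent Polars version provides with_row_index:
import polars as pl

df = pl.read_csv("tycho.csv")  # hypothetical file name

result = (
    df.drop("country", "url")
    .rename({"evnt": "event"})
    # Remove any bracketed qualifier at the end of the disease name
    .with_columns(pl.col("disease").str.replace(r"\s*\[.*\]$", ""))
    # Assumes epi_week is encoded as YYYYWW
    .with_columns((pl.col("epi_week") // 100).cast(pl.Int32).alias("year"))
    .filter(pl.col("year").is_in([1910, 1911]))
    .with_row_index("id")  # unique identifier starting at 0
    .rename({"loc": "city", "number": "deaths"})
    .select("id", "year", "epi_week", "from_date", "to_date", "state", "city", "disease", "deaths")
)
print(result)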
Exercises
Perform the following queries on the file generated in the previous example:
1.- List all the cities that appear in the file, without repetitions.
- List of cities: you should get 247 of them.
┌──────────────────┐
│ city │
╞══════════════════╡
│ OAKLAND │
│ ANN ARBOR │
│ BIDDEFORD │
│ SARATOGA SPRINGS │
│ … │
│ BENNINGTON │
│ BRADDOCK │
│ CHARLOTTE │
│ CHICAGO │
└──────────────────┘
2.- List the total number of deaths for each disease, sorted by the number of deaths.
┌────────────────┬────────┐
│ disease ┆ deaths │
╞════════════════╪════════╡
│ TUBERCULOSIS ┆ 181972 │
│ SCARLET FEVER ┆ 110893 │
│ DIPHTHERIA ┆ 100049 │
│ TYPHOID FEVER ┆ 44291 │
│ WHOOPING COUGH ┆ 14713 │
└────────────────┴────────┘
3.- Show the number of deaths from tuberculosis in New York in 1910: 32403
import polars as pl
# We assume 'fixed_entries' is the DataFrame containing the data from the CSV
# https://gitlab.com/xtec/bio/pandas/-/raw/main/data/tycho/fixed_entries.csv?ref_type=heads
file_csv: str = "fixed_entries.csv"
fixed_entries: pl.DataFrame = pl.read_csv(source=file_csv, separator=",")
# 1. List of all cities
cities = fixed_entries.select('city').unique()
print("List of cities:", cities)
# 2. List of diseases sorted by number of deaths
disease_deaths = (
    fixed_entries.group_by('disease')
    .agg(pl.col('deaths').sum())
    .sort('deaths', descending=True)
)
print("Diseases sorted by number of deaths:\n", disease_deaths)
# 3. Filter by the specified conditions
tuberculosis_ny_1910 = fixed_entries.filter(
    (pl.col('disease') == 'TUBERCULOSIS') &
    (pl.col('city') == 'NEW YORK') &
    (pl.col('year') == 1910)
)
# Sum the number of deaths
num_deaths = tuberculosis_ny_1910.select(pl.col('deaths').sum()).item()
print(f"Number of deaths from TUBERCULOSIS in NEW YORK in 1910: {num_deaths}")
It would now remain to exploit this file a bit further, producing some charts or other statistics. Even though it is a history of diseases from many years ago, the way data is organized nowadays is quite similar: data banks in Tidy format.
That is why we consider it very useful and appropriate to perform queries like these.
Activitat: Electric Vehicle Population
In the next example, you’ll work with electric vehicle population data from Data.gov. This dataset contains information about electric and hybrid vehicles registered in the Washington State Department of Licensing. Each row in the data represents one car, and each column contains information about the car.
https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD
The key to efficiently working with files through the lazy API is to use Polars’ scan functionality.
When you scan a file, rather than reading the entire file into memory, Polars creates a LazyFrame that references the file's data. As before, no processing of the data occurs until you explicitly execute a query.
With the following code, you scan electric_cars.csv:
df = pl.scan_csv(filepath)
print(df)
> python3 electric_vehicle.py
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
Csv SCAN [data/electric-cars.csv]
PROJECT */17 COLUMNS
You create a LazyFrame, df, by using scan_csv(). Crucially, the data from the CSV file isn't stored in memory. Instead, the only thing df stores from electric_cars.csv is its schema.
If you want, you can ask Polars to resolve the schema of the data with collect_schema:
print(df.collect_schema())
Do not use the df.schema property to resolve the schema of a LazyFrame, because it is a potentially expensive operation; this property exists only for symmetry with the DataFrame class.
> python3 electric_vehicle.py
Schema({'VIN (1-10)': String, 'County': String, 'City': String, 'State': String, 'Postal Code': Int64, 'Model Year': Int64, 'Make': String, 'Model': String, 'Electric Vehicle Type': String, 'Clean Alternative Fuel Vehicle (CAFV) Eligibility': String, 'Electric Range': Int64, 'Base MSRP': Int64, 'Legislative District': Int64, 'DOL Vehicle ID': Int64, 'Vehicle Location': String, 'Electric Utility': String, '2020 Census Tract': Int64})
This allows you to see the file’s column names and their respective data types, and it also helps Polars optimize queries that you run on this data. In fact, Polars must know the schema before executing any step of a query plan.
You can now run a query on the data contained in electric_cars.csv using the lazy API. Your queries can have arbitrary complexity, and Polars will only store and process the necessary data. For instance, you could run the following query:
query = (
df.filter((pl.col("Model Year") >= 2018))
.filter(pl.col("Electric Vehicle Type") == "Battery Electric Vehicle (BEV)")
.group_by(["State", "Make"])
.agg(
pl.mean("Electric Range").alias("Average Electric Range"),
pl.min("Model Year").alias("Oldest Model Year"),
pl.len().alias("Number of Cars"),
)
.filter(pl.col("Average Electric Range") > 0)
.filter(pl.col("Number of Cars") > 5)
.sort(pl.col("Number of Cars"), descending=True)
)
print(query.collect())
> python3 electric_vehicle.py
shape: (22, 5)
┌───────┬───────────┬────────────────────────┬───────────────────┬────────────────┐
│ State ┆ Make ┆ Average Electric Range ┆ Oldest Model Year ┆ Number of Cars │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ i64 ┆ u32 │
╞═══════╪═══════════╪════════════════════════╪═══════════════════╪════════════════╡
│ WA ┆ TESLA ┆ 56.715893 ┆ 2018 ┆ 85408 │
│ WA ┆ CHEVROLET ┆ 88.73235 ┆ 2018 ┆ 8881 │
│ WA ┆ NISSAN ┆ 67.008487 ┆ 2018 ┆ 7423 │
│ WA ┆ FORD ┆ 0.083241 ┆ 2018 ┆ 7208 │
│ WA ┆ KIA ┆ 35.681039 ┆ 2018 ┆ 6239 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ MD ┆ TESLA ┆ 33.733333 ┆ 2018 ┆ 15 │
│ TX ┆ TESLA ┆ 105.785714 ┆ 2018 ┆ 14 │
│ NC ┆ TESLA ┆ 16.538462 ┆ 2018 ┆ 13 │
│ FL ┆ TESLA ┆ 63.875 ┆ 2019 ┆ 8 │
│ CO ┆ TESLA ┆ 35.833333 ┆ 2018 ┆ 6 │
└───────┴───────────┴────────────────────────┴───────────────────┴────────────────┘
In this query, you filter the data on all cars where the model year is 2018 or later and the electric vehicle type is Battery Electric Vehicle (BEV). You then compute the average electric range, the minimum model year, and the number of cars for each state and make. Lastly, you further filter the data where the average electric range is positive and where the number of cars for the state and make is greater than five.
Because this is a lazy query, no computation is performed until you call query.collect(). After the query is executed, only the data you asked for is stored and returned; nothing more.
Each row in the DataFrame returned from query.collect() tells you the average electric range, oldest model year, and number of cars for each state and make. For example, the first row tells you there are 85,408 Teslas from 2018 or later in Washington State, and their average electric range is around 56.7 miles.
With this example, you saw how Polars uses the lazy API to query data from files in a performant and memory-efficient manner. This powerful API gives Polars a huge leg up over other DataFrame libraries, and you should opt to use the lazy API whenever possible.
TODO
- https://docs.kanaries.net/es/topics/Polars/polars-dataframe
- https://cienciadedatos.net/documentos/pyml01-intro_polars_es
- https://www.linkedin.com/pulse/polars-redefiniendo-el-procesamiento-de-datos-en-i-magaña-fuentes-nb5ic/
- https://realpython.com/polars-python/
- For working with very large datasets: Spark, Dask and [Ray](https://www.ray.io/).