Patito offers a simple way to declare pydantic data models which double as schema for your polars data frames.

Introduction

patito is a dataframe validation library built on top of Polars.

The core idea of Patito is that you should define a so-called model for each of your data sources.

A model is a declarative python class which describes the general properties of a tabular data set: the names of all the columns, their types, value bounds, and so on…

These models can then be used to validate the data sources when they are ingested into your project’s data pipeline. In turn, your models become a trustworthy, centralized catalog of all the core facts about your data, facts you can safely rely upon during development.

Create a new project and install Patito:

$ poetry new patito --name app
$ cd patito
$ poetry add patito polars

DataFrame Validation

Let’s say that your project keeps track of products, and that these products have four core properties:

  1. A unique, numeric identifier
  2. A name
  3. An ideal temperature zone of either "dry", "cold", or "frozen"
  4. A product demand given as a percentage of the total sales forecast for the next week

In tabular form the data might look something like this:

product_id  name       temperature_zone  demand_percentage
1           Apple      dry               0.23%
2           Milk       cold              0.61%
3           Ice cubes  frozen            0.01%
...         ...        ...               ...

We now start to model the restrictions we want to put upon our data. In Patito this is done by defining a class which inherits from patito.Model and has one annotated field for each column in the data.

These models should preferably be defined in a centralized place, conventionally app/models.py, where you can easily find and refer to them.

app/models.py:

from typing import Literal

import patito as pt

class Product(pt.Model):
    product_id: int
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float

Here we have used typing.Literal from the standard library in order to specify that temperature_zone is not only a str, but specifically one of the literal values "dry", "cold", or "frozen".

You can now use this class to represent a single specific instance of a product:

app/main.py:

from models import Product

apple: Product = Product(
    product_id=1,
    name="Apple",
    temperature_zone="dry",
    demand_percentage=0.23,
)
print(apple)

The class also provides automatic input data validation. For instance, if you provide an invalid value for temperature_zone:

from models import Product

pizza: Product = Product(
    product_id=64,
    name="Pizza",
    temperature_zone="oven",
    demand_percentage=0.12,
)
> python app/main.py
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for Product
temperature_zone
  Input should be 'dry', 'cold' or 'frozen' [type=literal_error, input_value='oven', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/literal_error

You might notice that this looks suspiciously like pydantic's data models, and that is in fact because it is! Patito's model class is built upon pydantic.BaseModel and therefore offers all of pydantic's functionality.

But the difference is that Patito extends pydantic's validation of singular object instances to collections of the same objects represented as dataframes.

Let’s take this data and represent it as a polars dataframe.

from models import Product

import polars as pl

df = pl.DataFrame(
    {
        "product_id": [1, 2, 3],
        "name": ["Apple", "Milk", "Ice cubes"],
        "temperature_zone": ["dry", "cold", "frozen"],
        "demand_percentage": [0.23, 0.61, 0.01],
    }
)

We can now use Product.validate() in order to validate the content of our dataframe:

Product.validate(df)

Well, that wasn't really interesting…

The validate method simply returns the dataframe if no errors are found.

It is intended as a guard statement placed before any logic that requires the data to be valid: if the dataframe were incompatible with the model schema, .validate() would have raised an exception, so the code after the guard can safely rely on the schema.

Let's try this with invalid data, setting the temperature zone of one of the products to "oven":

import polars as pl

from models import Product
from patito.exceptions import DataFrameValidationError


df = pl.DataFrame(
    {
        "product_id": [64, 64],
        "name": ["Pizza", "Cereal"],
        "temperature_zone": ["oven", "dry"],
        "demand_percentage": [0.07, 0.16],
    }
)

try:
    Product.validate(df)
except DataFrameValidationError as e:
    print(e)
> python app/main.py
1 validation error for Product
temperature_zone
  Rows with invalid values: {'oven'}. (type=value_error.rowvalue)

Now we’re talking!

Patito allows you to define a single class which validates both singular object instances and dataframe collections without code duplication!

flowchart LR

    pydantic["pydantic.BaseModel
    ------------------------------
    Singular Instance Validation"]
    
    patito["patito.Model
    ------------------------------
    Singular Instance Validation
    +
    DataFrame Validation"]

    pydantic -- Same class definition -->  patito

Patito tries to rely as much as possible on pydantic's existing modelling concepts, naturally extending them to the dataframe domain where suitable. Model fields annotated with str will map to dataframe columns stored as pl.Utf8, int as pl.Int8/pl.Int16/…/pl.Int64, and so on. Field types wrapped in Optional allow null values, while bare types do not.

But certain modelling concepts are not applicable in the context of singular object instances, and are therefore necessarily not part of pydantic's API.

Take product_id as an example: you would expect this column to be unique across all products, so duplicates should be considered invalid. Pydantic has no way to express this, but Patito expands upon pydantic in various ways in order to represent dataframe-related constraints. One of these extensions is the unique parameter accepted by patito.Field, which lets you specify that all values of a given column must be unique.

from typing import Literal

import patito as pt

class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float

The patito.Field class accepts the same parameters as pydantic.Field, but adds additional dataframe-specific constraints documented here. In those cases where Patito's built-in constraints do not suffice, you can specify arbitrary constraints in the form of polars expressions which must evaluate to True for each row in order for the dataframe to be considered valid.

Let’s say we want to make sure that demand_percentage sums up to 100% for the entire dataframe, otherwise we might be missing one or more products. We can do this by passing the constraints parameter to patito.Field:

class Product(pt.Model):
    product_id: int = pt.Field(unique=True)
    name: str
    temperature_zone: Literal["dry", "cold", "frozen"]
    demand_percentage: float = pt.Field(constraints=pt.field.sum() == 100.0)

Here patito.field is a placeholder for the column described by the field, and is automatically replaced with polars.col("demand_percentage") before validation.

If we now use this improved class to validate df, we should detect new errors:

> python app/main.py
3 validation errors for Product
product_id
  2 rows with duplicated values. (type=value_error.rowvalue)
temperature_zone
  Rows with invalid values: {'oven'}. (type=value_error.rowvalue)
demand_percentage
  2 rows does not match custom constraints. (type=value_error.rowvalue)

Patito has now detected that product_id contains duplicates and that demand_percentage does not sum up to 100%!

Several more properties and methods are available on patito.Model as outlined here:

DataFrame

https://patito.readthedocs.io/en/latest/api/patito/DataFrame/index.html

Model

https://patito.readthedocs.io/en/latest/api/patito/Model/index.html

Field

https://patito.readthedocs.io/en/latest/api/patito/Field/index.html

As a final example, here is a model combining several of these features, validated through the model-aware Product.DataFrame class:

import polars as pl

import patito as pt


class Product(pt.Model):

    # Do not allow duplicates
    product_id: int = pt.Field(unique=True)

    # Price must be stored as unsigned 16-bit integers
    price: int = pt.Field(dtype=pl.UInt16)

    # The product name should be from 3 to 128 characters long
    name: str = pt.Field(min_length=3, max_length=128)

    # Represent colors in the form of upper-cased hex colors
    brand_color: str = pt.Field(pattern=r"^\#[0-9A-F]{6}$")


Product.DataFrame(
    {
        "product_id": [1, 1],
        "price": [400, 600],
        "brand_color": ["#ab00ff", "AB00FF"],
    }
).validate()
> python app/main.py
...