Pandas Cookbook ++ Great Expectations

Neil Williams

17 Oct 2024 • 3 min read

If there’s one thing the Pandas Cookbook (3rd Edition) excels at, it’s turning complex concepts into approachable, digestible bites.

The Data Validation section in Chapter 11 of my Early Review Copy did exactly that for me with Great Expectations, an essential tool for ensuring data quality. Until now, I hadn’t used it at all. However, in just a few clear and concise steps, the Pandas Cookbook took me through setting up Great Expectations and using it to validate a dataset. What really struck me was how this small recipe opened up a new world of possibilities for automating data quality checks in my own work.

The cookbook uses a vehicles dataset, which strikes a nice balance between complexity and clarity. It’s not a trivial dataset, but it’s one that any reader can quickly understand, making it the perfect choice for demonstrating data validation in a meaningful way. This keeps the focus on mastering the tool, rather than trying to decipher the data. I immediately found myself thinking about how I could adapt this same workflow to larger, more complicated datasets in my own projects.

One of the best aspects of this recipe, and of the Pandas Cookbook in general, is the side-by-side presentation of code snippets with their outputs. For someone like me—who doesn’t always want to fire up an IDE just to get a sense of the results—this approach is a huge time-saver. Just by scanning the code and output together, I could quickly get the gist of what Great Expectations was doing, and it was motivating enough to make me want to try it out on my own.

Sure, the cookbook only scratches the surface of Great Expectations, but it does so in a way that got me curious enough to dive into the official documentation and start playing around with more advanced features. By the end of reading and digesting the recipe, I had enough confidence to begin building my own validation tests and applying them to my datasets, ensuring that my data pipelines are as reliable as they can be.

Exercises

Inspired by the book and keen to get my hands dirty with Great Expectations, I got started straight away with two exercises and have three more in the backlog.

Exploring the Expectations Gallery

Trying out some of the more advanced expectations available in the Expectations Gallery has deepened my understanding of how flexible the tool can be. Each entry in the gallery includes a description, the applicable data sources, a summary of the arguments, some examples, notes and links to related expectations.

The gallery is organised by data source and data quality issue:

Data source: Pandas, Spark, SQLite, PostgreSQL, MySQL, MSSQL, Redshift, BigQuery, Snowflake
Data quality issue: Missingness, Pattern matching, Cardinality, Sets, Distribution, Numerical data, Schema, Volume
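
To make a few of these categories concrete, here is a minimal sketch in plain pandas of what a missingness, a numerical and a set check boil down to. It deliberately does not call the Great Expectations API, whose setup differs between versions; the comments only name the gallery expectations each check loosely corresponds to. The toy DataFrame, its column names and the fraction_passing helper are my own assumptions for illustration, not the book's vehicles dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch.
df = pd.DataFrame(
    {
        "make": ["Ford", "Toyota", None, "Honda"],
        "year": [2015, 2018, 2020, 1885],
        "fuel": ["Petrol", "Hybrid", "Diesel", "Steam"],
    }
)

ALLOWED_FUELS = {"Petrol", "Diesel", "Hybrid", "Electric"}


def fraction_passing(mask: pd.Series) -> float:
    """Share of rows for which a boolean check holds."""
    return float(mask.mean())


checks = {
    # Missingness: compare expect_column_values_to_not_be_null
    "make_not_null": fraction_passing(df["make"].notna()),
    # Numerical data: compare expect_column_values_to_be_between
    "year_between_1900_2025": fraction_passing(df["year"].between(1900, 2025)),
    # Sets: compare expect_column_values_to_be_in_set
    "fuel_in_allowed_set": fraction_passing(df["fuel"].isin(ALLOWED_FUELS)),
}

for name, share in checks.items():
    print(f"{name}: {share:.0%} of rows pass")
```

Each check fails on exactly one of the four rows here. Great Expectations adds the parts this sketch leaves out: a declarative way to state the rule, a structured result listing the unexpected values, and the ability to run the same rule against the other data sources in the gallery.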

Eland

As a first experiment with other data sources, I turned to my old favourite, Elasticsearch. In particular, I used Eland to help me load a dataset into Elastic Cloud and then as the conduit to Great Expectations. It was fun to see how these technologies fitted together and I will post the code in a few days.

Follow-on sprint

Next, I will expand my use through:

Automating Validation Checks: Set up Great Expectations in a larger project to automate checks for null values, duplicates, and outliers across multiple datasets.
Creating Custom Expectations: Write custom validation rules for specific business logic, such as ensuring certain columns always follow a predefined pattern (e.g., email addresses or timestamps).
Integrating Great Expectations with a Data Pipeline: Incorporate Great Expectations into an ETL (Extract, Transform, Load) pipeline to automatically validate data at various stages. In particular, I will be exploring integration with my Prefect setup.
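
As a rough picture of where these three items are heading, the sketch below uses only the Python standard library to show the shape of a validation gate inside a pipeline: a set of named checks (nulls, duplicates, an email pattern) runs before the transform step, and the pipeline stops if any of them fail. It is not Great Expectations or Prefect code. Every name in it (EMAIL_RE, CHECKS, validate, run_pipeline, the sample rows) is hypothetical, and the email pattern is deliberately simplistic.

```python
import re
from collections import Counter

# Deliberately simple pattern for illustration; not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_no_nulls(rows, column):
    """True if no row has a missing value in the column."""
    return all(row.get(column) is not None for row in rows)


def check_unique(rows, column):
    """True if every value in the column occurs exactly once."""
    counts = Counter(row.get(column) for row in rows)
    return all(n == 1 for n in counts.values())


def check_matches(rows, column, pattern):
    """True if every value in the column is a string matching the pattern."""
    return all(
        isinstance(row.get(column), str) and pattern.match(row[column]) is not None
        for row in rows
    )


# Named checks, in the order they should be reported.
CHECKS = {
    "id_not_null": lambda rows: check_no_nulls(rows, "id"),
    "id_unique": lambda rows: check_unique(rows, "id"),
    "email_pattern": lambda rows: check_matches(rows, "email", EMAIL_RE),
}


def validate(rows, checks):
    """Run every check and return the names of those that failed."""
    return [name for name, check in checks.items() if not check(rows)]


def run_pipeline(rows):
    """Validate the extracted rows, then apply a trivial transform."""
    failures = validate(rows, CHECKS)
    if failures:
        raise ValueError(f"validation failed: {', '.join(failures)}")
    return [{**row, "email": row["email"].lower()} for row in rows]


GOOD_ROWS = [
    {"id": 1, "email": "Ana@Example.com"},
    {"id": 2, "email": "bo@example.org"},
]
BAD_ROWS = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 1, "email": "not-an-email"},
]

print(run_pipeline(GOOD_ROWS))
print(validate(BAD_ROWS, CHECKS))
```

In the real version, I expect the hand-written checks to be replaced by expectations and the gate to become a step in the orchestrated flow, but the control flow should stay the same: validate first, and fail loudly before bad data moves downstream.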

Summing up

The Pandas Cookbook (3rd Edition) has done more than just introduce a tool—it has given me a springboard to explore new techniques and workflows in my data science projects. I can’t wait to keep pushing the boundaries of what’s possible with Pandas. If you haven’t already, I highly recommend giving Chapter 11 a read and seeing where it takes you.