August 16, 2023
10 Data Quality Checks in SQL, Pandas and Polars
In the era of big data and analytics, the quality of data plays a critical role in making informed decisions and extracting meaningful insights. However, ensuring data quality can be complex, requiring thorough checks and validations.
In this blog article, we explore 10 essential data quality checks using three powerful tools: SQL, Pandas, and Polars. Whether you prefer working with traditional SQL databases, the popular Pandas library in Python, or the efficient Polars library, we've got you covered. By leveraging these tools, you can identify issues like missing values, duplicates, data inconsistencies, and more.
Keep in mind that you may want to isolate ingested data for doing these data quality checks using a “data as code” approach, and if you are working with Apache Iceberg tables, that can be done effectively by ingesting the data onto a branch with a Dremio Arctic catalog and then merging your changes after validation.
For the Pandas and Polars examples, we'll assume we've loaded the data from a Parquet file into a DataFrame and a LazyFrame, respectively, as shown below. Note that the Polars snippets build lazy queries; call .collect() on any of them to materialize the results.
PANDAS
import pandas as pd

# Load the data from the Parquet file into a DataFrame
orders_received_df = pd.read_parquet('orders_received.parquet')
POLARS
import polars as pl

# Scan the data from the Parquet file into a LazyFrame
orders_received_lf = pl.scan_parquet('orders_received.parquet')
10 Data Quality Checks
The "orders_received" dataset represents a fictional collection of data that captures information about orders received by a company. This dataset provides valuable insights into the company's sales operations, enabling analysis and decision-making. It includes key attributes such as "order_number," a unique identifier for each order; "product_code," which indicates the specific product being ordered; "customer_id," used to associate orders with their respective customers; "quantity," representing the quantity of the product ordered; and "order_date," indicating the date when the order was received. These attributes collectively offer a comprehensive snapshot of the orders received.
{
  "type": "object",
  "properties": {
    "order_number": { "type": "integer" },
    "product_code": { "type": "string" },
    "customer_id": { "type": "integer" },
    "quantity": { "type": "integer" },
    "order_date": { "type": "string", "format": "date" }
  }
}
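To make the examples below easy to try, here is a minimal sketch that fabricates a small "orders_received.parquet" matching this schema; the values are invented purely for illustration.
PANDAS
# Fabricate a tiny sample dataset matching the schema above (illustrative values only)
sample_orders = pd.DataFrame({
    'order_number': [1001, 1002, 1003],
    'product_code': ['AB123', 'CD456', 'EF789'],
    'customer_id': [501, 502, 501],
    'quantity': [3, 1, 5],
    'order_date': pd.to_datetime(['2023-05-01', '2023-05-02', '2023-05-03']),
})
sample_orders.to_parquet('orders_received.parquet', index=False)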
Checking for NULL Values
This example helps identify orders in the "orders_received" table that have missing order numbers. It allows you to find any incomplete or improperly recorded orders.
SQL
SELECT * FROM orders_received WHERE order_number IS NULL;
PANDAS
# Checking for NULL values
null_orders = orders_received_df[orders_received_df['order_number'].isnull()]
POLARS
# Checking for NULL values
null_orders = orders_received_lf.filter(pl.col('order_number').is_null())
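Beyond a single column, the same idea extends to profiling NULL counts across every column at once; a minimal sketch:
PANDAS
# Count NULL values in every column
null_counts = orders_received_df.isnull().sum()
POLARS
# Count NULL values in every column (.collect() materializes the lazy query)
null_counts = orders_received_lf.select(pl.all().null_count()).collect()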
Counting Distinct Values
This technique counts how many times each product code appears in the "orders_received" table. It allows you to identify multiple occurrences of the same product code, which could indicate data entry errors or duplicates.
SQL
-- Getting all values appearing more than once
SELECT product_code, COUNT(*) AS count
FROM orders_received
GROUP BY product_code
HAVING COUNT(*) > 1;
PANDAS
# Getting all values appearing more than once
product_counts = orders_received_df['product_code'].value_counts()
multiple_occurrences = product_counts[product_counts > 1]
POLARS
# Getting all values appearing more than once
distinct_product_counts = orders_received_lf.groupby('product_code').agg(
    pl.count().alias('count')
)
multiple_occurrences = distinct_product_counts.filter(pl.col('count') > 1)
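It can also help to look at the plain distinct count alongside the per-value counts, as a quick sanity check on cardinality; a minimal sketch:
PANDAS
# Number of distinct product codes
distinct_products = orders_received_df['product_code'].nunique()
POLARS
# Number of distinct product codes
distinct_products = orders_received_lf.select(pl.col('product_code').n_unique()).collect()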
Validating Data Types
This example checks if the "quantity" column in the "orders_received" table contains only numeric values. It helps identify any instances where non-numeric or improperly formatted data is present.
SQL
-- PostgreSQL-style regex match; adjust the cast and operator for your database
SELECT * FROM orders_received
WHERE quantity::text !~ '^[0-9]+$';
PANDAS
# Validating data types
non_numeric_quantity = orders_received_df[
    ~orders_received_df['quantity'].astype(str).str.isnumeric()
]
POLARS
# Validating data types: a non-strict cast turns unparseable values into
# nulls instead of raising, so the null rows are the invalid ones
non_numeric_quantity = orders_received_lf.filter(
    pl.col('quantity').cast(pl.UInt32, strict=False).is_null()
)
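An alternative pandas formulation, assuming "quantity" may arrive as strings, is pd.to_numeric with errors='coerce', which turns unparseable values into NaN so the NaN rows are the invalid ones; a minimal sketch:
PANDAS
# Coerce to numeric; values that fail to parse become NaN
coerced = pd.to_numeric(orders_received_df['quantity'], errors='coerce')
non_numeric_quantity = orders_received_df[coerced.isna()]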
Identifying Duplicates
This query helps find duplicate order numbers in the "orders_received" table. It allows you to spot any orders that might have been recorded multiple times, ensuring data accuracy and preventing potential errors.
SQL
SELECT order_number, COUNT(*) AS duplicate_count
FROM orders_received
GROUP BY order_number
HAVING COUNT(*) > 1;
PANDAS
# Identifying duplicates
duplicate_orders = orders_received_df[
    orders_received_df.duplicated(subset='order_number', keep=False)
]
POLARS
# Identifying duplicates: a window expression counts rows per order_number,
# and rows whose count exceeds 1 are duplicates
duplicate_orders = orders_received_lf.filter(
    pl.count().over('order_number') > 1
)
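Once duplicates are found, a common follow-up is to keep only the first occurrence of each order number; a minimal sketch:
PANDAS
# Keep the first occurrence of each order number
deduplicated = orders_received_df.drop_duplicates(subset='order_number', keep='first')
POLARS
# Keep the first occurrence of each order number
deduplicated = orders_received_lf.unique(subset='order_number', keep='first')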
Checking Data Integrity with Foreign Keys
This example verifies that the "customer_id" in the "orders_received" table corresponds to valid customer records in the "customers" table. It helps maintain referential integrity by ensuring all orders are associated with existing customers.
SQL
SELECT * FROM orders_received WHERE customer_id NOT IN (SELECT customer_id FROM customers);
PANDAS
# Checking data integrity with foreign keys
# (assumes customers_df has already been loaded, e.g., from a customers file)
foreign_key_mismatch = orders_received_df[
    ~orders_received_df['customer_id'].isin(customers_df['customer_id'])
]
POLARS
# Checking data integrity with foreign keys: an anti-join keeps only the
# orders whose customer_id has no match in the customers table
customers_lf = pl.scan_csv("customers.csv")
foreign_key_mismatch = orders_received_lf.join(
    customers_lf.select('customer_id'), on='customer_id', how='anti'
)
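The same anti-join idea can be expressed in pandas with a left merge and indicator=True, where rows marked 'left_only' have no matching customer; a minimal sketch, assuming customers_df is already loaded:
PANDAS
# Left merge with an indicator column; 'left_only' rows have no matching customer
merged = orders_received_df.merge(
    customers_df[['customer_id']], on='customer_id', how='left', indicator=True
)
foreign_key_mismatch = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')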
Verifying Data Range or Constraints
This query checks if the "order_date" falls within a specified range, such as a particular year. It helps ensure that the orders recorded in the "orders_received" table adhere to specific temporal constraints or business rules.
SQL
SELECT * FROM orders_received
WHERE order_date < '2023-01-01' OR order_date > '2023-12-31';
PANDAS
# Verifying data range or constraints
out_of_range_dates = orders_received_df[
    (orders_received_df['order_date'] < '2023-01-01')
    | (orders_received_df['order_date'] > '2023-12-31')
]
POLARS
# Verifying data range or constraints: compare against date literals so the
# types line up with a Date column
from datetime import date

out_of_range_dates = orders_received_lf.filter(
    (pl.col('order_date') < date(2023, 1, 1))
    | (pl.col('order_date') > date(2023, 12, 31))
)
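These range comparisons are only reliable if "order_date" is an actual date type; if the column was loaded as plain strings, convert it first. A minimal sketch:
PANDAS
# Parse strings into datetimes; unparseable values become NaT
orders_received_df['order_date'] = pd.to_datetime(
    orders_received_df['order_date'], errors='coerce'
)
POLARS
# Parse strings into dates
orders_received_lf = orders_received_lf.with_columns(
    pl.col('order_date').str.strptime(pl.Date, '%Y-%m-%d')
)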
Identifying Records with Invalid or Inconsistent Formats
This example identifies "product_code" values in the "orders_received" table that do not conform to a desired alphanumeric format. It helps identify any entries that might have been incorrectly recorded or contain invalid characters.
SQL
SELECT * FROM orders_received WHERE NOT REGEXP_LIKE(product_code, '^[A-Za-z0-9]+$');
PANDAS
# Identifying records with invalid or inconsistent formats
# (na=False treats missing product codes as non-matching, i.e., invalid)
invalid_product_codes = orders_received_df[
    ~orders_received_df['product_code'].str.match(r'^[A-Za-z0-9]+$', na=False)
]
POLARS
# Identifying records with invalid or inconsistent formats: str.contains
# with an anchored regex avoids a slow Python-level apply
invalid_product_codes = orders_received_lf.filter(
    ~pl.col('product_code').str.contains(r'^[A-Za-z0-9]+$')
)
Detecting Outliers or Unusual Values
This query identifies any unusually high or low quantities by comparing the "quantity" values in the "orders_received" table with their statistical distribution. It helps detect potential data entry errors or outliers that deviate significantly from the norm.
SQL
SELECT * FROM orders_received
WHERE quantity < (SELECT AVG(quantity) - (3 * STDDEV(quantity)) FROM orders_received)
   OR quantity > (SELECT AVG(quantity) + (3 * STDDEV(quantity)) FROM orders_received);
PANDAS
# Detecting outliers or unusual values (three-sigma rule)
mean_qty = orders_received_df['quantity'].mean()
std_qty = orders_received_df['quantity'].std()
quantity_outliers = orders_received_df[
    (orders_received_df['quantity'] < mean_qty - 3 * std_qty)
    | (orders_received_df['quantity'] > mean_qty + 3 * std_qty)
]
POLARS
# Detecting outliers or unusual values (three-sigma rule)
quantity_outliers = orders_received_lf.filter(
    (pl.col('quantity') < pl.col('quantity').mean() - 3 * pl.col('quantity').std())
    | (pl.col('quantity') > pl.col('quantity').mean() + 3 * pl.col('quantity').std())
)
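The three-sigma rule assumes roughly normal data; for skewed distributions, an IQR-based check is a common, more robust alternative. A minimal pandas sketch:
PANDAS
# IQR-based outlier detection
q1 = orders_received_df['quantity'].quantile(0.25)
q3 = orders_received_df['quantity'].quantile(0.75)
iqr = q3 - q1
iqr_outliers = orders_received_df[
    (orders_received_df['quantity'] < q1 - 1.5 * iqr)
    | (orders_received_df['quantity'] > q3 + 1.5 * iqr)
]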
Detecting Stale Data
Detecting stale data is an important aspect of data quality checks to ensure the timeliness and relevance of the "orders_received" dataset. Stale data refers to outdated information that no longer reflects the current state of the business. By performing stale data checks, you can identify records that may have become obsolete or irrelevant due to time constraints. This helps maintain the accuracy and reliability of the dataset, allowing for more informed decision-making and analysis. Let’s assume our data should only represent orders from the last 90 days.
SQL
SELECT * FROM orders_received WHERE order_date < CURRENT_DATE - INTERVAL '90 days';
PANDAS
from datetime import datetime, timedelta

# Detecting stale data: anything older than 90 days
today = datetime.now().date()
cutoff = pd.Timestamp(today - timedelta(days=90))
stale_data = orders_received_df[orders_received_df['order_date'] < cutoff]
POLARS
from datetime import date, timedelta

# Detecting stale data: anything older than 90 days
stale_cutoff = date.today() - timedelta(days=90)
stale_data = orders_received_lf.filter(pl.col('order_date') < stale_cutoff)
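A quick complementary freshness check is to look at the most recent order date, which tells you at a glance how current the dataset is; a minimal sketch:
PANDAS
# Most recent order date in the dataset
latest_order = orders_received_df['order_date'].max()
print(f"Most recent order: {latest_order}")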
Detecting Changing Data
Detecting changing data is an important data quality check that helps identify records in the "orders_received" dataset where the values have been modified or updated since a specific reference point. This check allows you to spot instances where data has changed unexpectedly, which could be a result of data corruption or manual errors. By performing this check, you can ensure the accuracy and consistency of the dataset for reliable analysis and decision-making.
To detect changing data, you would typically need to have a reference point or a historical dataset to compare against the current dataset. By comparing corresponding values between the reference dataset and the current dataset, you can identify any differences or changes in the data.
SQL
SELECT *
FROM orders_received
WHERE EXISTS (
    SELECT *
    FROM orders_reference
    WHERE orders_reference.order_number = orders_received.order_number
      AND (
        orders_reference.order_date <> orders_received.order_date
        OR orders_reference.quantity <> orders_received.quantity
        -- Include other relevant columns for comparison
      )
);
PANDAS
# Perform an inner merge to compare the datasets
changing_data = pd.merge(
    orders_received_df, orders_reference_df,
    on='order_number', suffixes=('_current', '_reference')
)
# Parenthesize each comparison: | binds more tightly than != in Python
changing_data = changing_data[
    (changing_data['order_date_current'] != changing_data['order_date_reference'])
    | (changing_data['quantity_current'] != changing_data['quantity_reference'])
]
POLARS
# Perform an inner join to compare the datasets; the suffix is applied to
# the reference columns that clash with the current ones
changing_data = orders_received_lf.join(
    orders_reference_lf, on='order_number', suffix='_reference'
).filter(
    (pl.col('order_date') != pl.col('order_date_reference'))
    | (pl.col('quantity') != pl.col('quantity_reference'))
    # Include other relevant columns for comparison
)
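When both frames cover the same set of order numbers, pandas' DataFrame.compare gives a column-by-column diff in one call; a minimal sketch (compare requires identically labeled frames, so both are indexed by order_number and restricted to common columns first):
PANDAS
# Align both frames on order_number and diff the shared columns
current = orders_received_df.set_index('order_number').sort_index()
reference = orders_reference_df.set_index('order_number').sort_index()
common_cols = current.columns.intersection(reference.columns)
differences = current[common_cols].compare(reference[common_cols])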
Conclusion
By implementing these data quality checks, you can trust the accuracy of your data and make reliable decisions based on high-quality information. Data quality is an ongoing process, and these techniques serve as essential tools to ensure the integrity and usefulness of your datasets.