Streamline Data Pipelines: How to Use WhyLogs with PySpark for Effective Data Profiling and Validation | by Sarthak Sarbahi | Jan, 2024


Components of whylogs

Let's begin by understanding the key characteristics of whylogs.

  • Logging data: The core of whylogs is its ability to log data. Think of it as keeping a detailed diary of your data's characteristics. It records various aspects of your data, such as how many rows you have, the range of values in each column, and other statistical details (see the short sketch after this list).
  • Whylogs profiles: Once data is logged, whylogs creates "profiles". These profiles are like snapshots that summarize your data. They include statistics such as averages, counts, and distributions. This is helpful for understanding your data at a glance and tracking how it changes over time.
  • Data monitoring: With whylogs, you can track changes in your data over time. This is important because data often evolves, and what was true last month might not be true today. Monitoring helps you catch these changes and understand their impact.
  • Data validation: Whylogs lets you set up rules or constraints to ensure your data is as expected. For example, if a certain column should only contain positive numbers, you can set a rule for that. If something doesn't match your rules, you'll know there might be a problem.
  • Visualization: It's easier to understand data through visuals. Whylogs can create graphs and charts to help you see what's going on in your data, making it more accessible, especially for those who are not data experts.
  • Integrations: Whylogs supports integrations with a variety of tools, frameworks, and languages: Spark, Kafka, Pandas, MLflow, GitHub Actions, RAPIDS, Java, Docker, AWS S3 and more.
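
To make the first two points concrete, here's a minimal sketch (not from the original article, and using plain pandas rather than Spark) of what logging and profiling look like with whylogs:

# A minimal, illustrative example of logging and profiling with whylogs
import pandas as pd
import whylogs as why

# A tiny toy dataframe (made up purely for illustration)
toy_df = pd.DataFrame({"age": [25, 32, 47], "weight": [70.5, 80.1, 65.0]})

# Log the dataframe: whylogs scans it and records statistical summaries
results = why.log(toy_df)

# The resulting profile view is a lightweight snapshot of the data
profile_view = results.view()
print(profile_view.to_pandas())  # one row of metrics per column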

That's all we need to know about whylogs. If you're curious to learn more, I encourage you to check out the documentation. Next, let's set things up for the tutorial.

Environment setup

We'll use a Jupyter notebook for this tutorial. To make our code work anywhere, we'll use JupyterLab in Docker. This setup installs all needed libraries and gets the sample data ready. If you're new to Docker and want to learn how to set it up, check out this link.

Start by downloading the sample data (CSV) from here. This data is what we'll use for profiling and validation. Create a data folder in your project root directory and save the CSV file there. Next, create a Dockerfile in the same root directory.

Dockerfile for this tutorial (Image by author)

This Dockerfile is a set of instructions to create a specific environment for the tutorial. Let's break it down (a sketch of the full file follows the list):

  • The first line FROM quay.io/jupyter/pyspark-notebook tells Docker to use an existing image as the starting point. This image is a Jupyter notebook that already has PySpark set up.
  • The RUN pip install whylogs whylogs[viz] whylogs[spark] line adds the required libraries to this environment. It uses pip to add whylogs and its extra features for visualization (viz) and for working with Spark (spark).
  • The last line, COPY data/patient_data.csv /home/patient_data.csv, moves your data file into this environment. It takes the CSV file patient_data.csv from the data folder in your project directory and puts it in the /home/ directory inside the Docker environment.
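
The Dockerfile screenshot doesn't carry over here, but based on the three lines described above it should look roughly like this:

FROM quay.io/jupyter/pyspark-notebook

RUN pip install whylogs whylogs[viz] whylogs[spark]

COPY data/patient_data.csv /home/patient_data.csv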

By now your project directory should look something like this.

Project directory in VS Code (Image by author)

Awesome! Now, let's build a Docker image. To do this, type the following command in your terminal, making sure you're in your project's root folder.

docker build -t pyspark-whylogs .

This command creates a Docker image named pyspark-whylogs. You can see it in the 'Images' tab of your Docker Desktop app.

Docker image built (Image by author)

Next step: let's run this image to start JupyterLab. Type another command in your terminal.

docker run -p 8888:8888 pyspark-whylogs

This command launches a container from the pyspark-whylogs image. It makes sure you can access JupyterLab through port 8888 on your computer.

After running this command, you'll see a URL in the logs that looks like this: http://127.0.0.1:8888/lab?token=your_token. Click it to open the JupyterLab web interface.

Docker container logs (Image by author)

Great! Everything's set up for using whylogs. Now, let's get to know the dataset we'll be working with.

Understanding the dataset

We'll use a dataset about hospital patients. The file, named patient_data.csv, consists of 100k rows with these columns (a sketch of a matching explicit schema follows the list):

  • patient_id: Every affected person’s distinctive ID. Keep in mind, you may see the identical affected person ID greater than as soon as within the dataset.
  • patient_name: The title of the affected person. Completely different sufferers can have the identical title.
  • peak: The affected person’s peak in centimeters. Every affected person has the identical peak listed for each hospital go to.
  • weight: The affected person’s weight in kilograms. It’s all the time greater than zero.
  • visit_date: The date the affected person visited the hospital, within the format YYYY-MM-DD.
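
The tutorial below simply infers the schema from the CSV, but for reference, an explicit PySpark schema matching these columns could look like this (a sketch; the column types are assumptions, not taken from the original notebook):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical explicit schema for patient_data.csv (types are assumed)
patient_schema = StructType([
    StructField("patient_id", StringType(), True),
    StructField("patient_name", StringType(), True),
    StructField("height", DoubleType(), True),
    StructField("weight", DoubleType(), True),
    # Kept as a string so that malformed dates survive for validation later
    StructField("visit_date", StringType(), True),
])

# Usage (alternative to inferSchema):
# spark.read.option("header", True).schema(patient_schema).csv("/home/patient_data.csv")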

As for where this dataset came from, don't worry. It was created by ChatGPT. Next, let's start writing some code.

Getting started with PySpark

First, open a new notebook in JupyterLab. Remember to save it before you start working.

We'll begin by importing the needed libraries.

# Import libraries
from typing import Any
import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from whylogs.api.pyspark.experimental import collect_column_profile_views
from whylogs.api.pyspark.experimental import collect_dataset_profile_view
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.constraints.factories import condition_meets
from whylogs.core.constraints import ConstraintsBuilder
from whylogs.core.constraints.factories import no_missing_values
from whylogs.core.constraints.factories import greater_than_number
from whylogs.viz import NotebookProfileVisualizer
import pandas as pd
import datetime

Then, we'll set up a SparkSession. This lets us run PySpark code.

# Initialize a SparkSession
spark = SparkSession.builder.appName('whylogs').getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled","true")

After that, we'll create a Spark dataframe by reading the CSV file. We'll also check out its schema.

# Create a dataframe from CSV file
df = spark.read.option("header", True).option("inferSchema", True).csv("/home/patient_data.csv")
df.printSchema()

Next, let's peek at the data. We'll view the first row in the dataframe.

# First row from dataframe
df.show(n=1, vertical=True)

Now that we've seen the data, it's time to start data profiling with whylogs.

Data profiling with whylogs

To profile our data, we'll use two functions. First, there's collect_column_profile_views. This function collects detailed profiles for each column in the dataframe. These profiles give us stats like counts, distributions, and more, depending on how we set up whylogs.

# Profile the data with whylogs
df_profile = collect_column_profile_views(df)
print(df_profile)

Each column in the dataset gets its own ColumnProfileView object in a dictionary. We can examine various metrics for each column, like their mean values.

whylogs will look at every data point and statistically decide whether or not that data point is relevant to the final calculation

For example, let's look at the average height.

df_profile["height"].get_metric("distribution").imply.worth

Next, we'll also calculate the mean directly from the dataframe for comparison.

# Compare with mean from dataframe
df.select(F.mean(F.col("height"))).show()

However, profiling columns one by one isn't always enough. So, we use another function, collect_dataset_profile_view. This function profiles the whole dataset, not just single columns. We can combine it with Pandas to analyze all the metrics from the profile.

# Putting everything together
df_profile_view = collect_dataset_profile_view(input_df=df)
df_profile_view.to_pandas().head()

We can also save this profile as a CSV file for later use.

# Persist profile as a file
df_profile_view.to_pandas().reset_index().to_csv("/home/jovyan/patient_profile.csv", header=True, index=False)

The folder /home/jovyan in our Docker container comes from Jupyter's Docker Stacks (ready-to-use Docker images containing Jupyter applications). In these Docker setups, 'jovyan' is the default user for running Jupyter. The /home/jovyan folder is where Jupyter notebooks usually start and where you should put files so you can access them in Jupyter.
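
As an aside, whylogs profiles can also be persisted in their native binary format and loaded back later, which is handy for drift comparisons. If I recall the whylogs v1 API correctly, that looks roughly like this (paths are illustrative):

from whylogs.core import DatasetProfileView

# Write the profile in whylogs' binary format
df_profile_view.write("/home/jovyan/patient_profile.bin")

# ...and read it back later
restored_view = DatasetProfileView.read("/home/jovyan/patient_profile.bin")
print(restored_view.to_pandas().head())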

And that's how we profile data with whylogs. Next, we'll explore data validation.

Data validation with whylogs

For our data validation, we'll perform these checks:

  • patient_id: Make sure there are no missing values.
  • weight: Ensure every value is greater than zero.
  • visit_date: Check if dates are in the YYYY-MM-DD format.

Now, let's begin. Data validation in whylogs starts with data profiling. We can use the collect_dataset_profile_view function to create a profile, like we saw before.

However, this function normally builds a profile with standard metrics like average and count. But what if we need to check individual values in a column, as opposed to the other constraints, which can be checked against aggregate metrics? That's where condition count metrics come in. It's like adding a custom metric to our profile.

Let’s create one for the visit_date column to validate every row.

def check_date_format(date_value: Any) -> bool:
    date_format = '%Y-%m-%d'
    try:
        datetime.datetime.strptime(date_value, date_format)
        return True
    except ValueError:
        return False

visit_date_condition = {"is_date_format": Condition(Predicate().is_(check_date_format))}

Once we have our condition, we add it to the profile. We use a standard schema and add our custom check.

# Create condition count metric
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="visit_date", metrics=[ConditionCountMetricSpec(visit_date_condition)])

Then we re-create the profile with both standard metrics and our new custom metric for the visit_date column.

# Pass the schema to the logger with collect_dataset_profile_view
# This creates a profile with standard metrics as well as condition count metrics
df_profile_view_v2 = collect_dataset_profile_view(input_df=df, schema=schema)

With our profile ready, we can now set up our validation checks for each column.

builder = ConstraintsBuilder(dataset_profile_view=df_profile_view_v2)
builder.add_constraint(no_missing_values(column_name="patient_id"))
builder.add_constraint(condition_meets(column_name="visit_date", condition_name="is_date_format"))
builder.add_constraint(greater_than_number(column_name="weight", number=0))

constraints = builder.build()
constraints.generate_constraints_report()
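
If you want a simple pass/fail signal, for example to stop a pipeline run, the constraints object can also be checked programmatically (a sketch based on the standard whylogs v1 API):

# True only if every constraint passes
all_passed = constraints.validate()
print(f"All constraints passed: {all_passed}")

# Per-constraint results from the same report used below
for result in constraints.generate_constraints_report():
    print(result)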

We can also use whylogs to display a report of these checks.

# Visualize constraints report using Notebook Profile Visualizer
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

It'll be an HTML report showing which checks passed or failed.

whylogs constraints report (Image by author)

Here's what we find:

  • The patient_id column has no missing values. Good!
  • Some visit_date values don't match the YYYY-MM-DD format.
  • A few weight values are zero.

Let's double-check these findings in our dataframe. First, we check the visit_date format with PySpark code.

# Validate visit_date column
df \
    .withColumn("check_visit_date", F.to_date(F.col("visit_date"), "yyyy-MM-dd")) \
    .withColumn("null_check", F.when(F.col("check_visit_date").isNull(), "null").otherwise("not_null")) \
    .groupBy("null_check") \
    .count() \
    .show(truncate=False)

+----------+-----+
|null_check|count|
+----------+-----+
|not_null  |98977|
|null      |1023 |
+----------+-----+

It shows that 1023 out of 100,000 rows don't match our date format. Next, the weight column.

# Validate weight column
df \
    .select("weight") \
    .groupBy("weight") \
    .count() \
    .orderBy(F.col("weight")) \
    .limit(1) \
    .show(truncate=False)

+------+-----+
|weight|count|
+------+-----+
|0     |2039 |
+------+-----+

Again, our findings match whylogs. Nearly 2,000 rows have a weight of zero. And that wraps up our tutorial. You can find the notebook for this tutorial here.
