Skip to main content

Introduction to MLlib β€” Pipelines, Transformers, and Estimators

At DataVerse Labs, the data science team struggled with a familiar problem:

β€œOur ML code works… until someone touches it.”

Different preprocessing steps, inconsistent model versions, and messy feature engineering caused pipeline failures every week.

So the team switched to PySpark MLlib Pipelines β€” a structured, reproducible, production-friendly way to prepare data and train machine learning models.

In this chapter, you’ll understand Pipelines, Transformers, and Estimators one by one, with real examples and output.


1. What Are MLlib Pipelines?​

Think of a pipeline as an assembly line:

  1. Raw data enters
  2. It passes through multiple stages
  3. A prediction-ready dataset or model comes out

A PySpark ML Pipeline consists of stages, and each stage is either:

  • Transformer β†’ transforms data (e.g., VectorAssembler, StringIndexerModel)
  • Estimator β†’ learns from data and creates a transformer (e.g., LogisticRegression)

A pipeline ensures:

βœ” reproducibility
βœ” cleaner code
βœ” easy hyperparameter tuning
βœ” versioned, deployable ML workflow


2. Transformers β€” They Transform Data​

Transformers apply a function to your dataset.

Example: StringIndexerModel, VectorAssembler, StandardScalerModel.

Imagine NeoMart customer data:

Input Data

+--------+--------+-------+
|user_id |gender |income |
+--------+--------+-------+
|U1001 |Male |50000 |
|U1002 |Female |65000 |
|U1003 |Female |45000 |
+--------+--------+-------+

Code β€” StringIndexer (Transformer)​

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
model_indexer = indexer.fit(df)
df_indexed = model_indexer.transform(df)

df_indexed.show()

Output​

+--------+--------+-------+-------------+
|user_id |gender |income |gender_index |
+--------+--------+-------+-------------+
|U1001 |Male |50000 |1.0 |
|U1002 |Female |65000 |0.0 |
|U1003 |Female |45000 |0.0 |
+--------+--------+-------+-------------+

Transformers do not learn β€” they only apply existing logic.


3. Estimators β€” They Learn From Data​

Estimators produce transformers after fitting.

Examples:

  • StringIndexer β†’ produces StringIndexerModel
  • StandardScaler β†’ produces StandardScalerModel
  • LogisticRegression β†’ produces LogisticRegressionModel

Code β€” Logistic Regression (Estimator)​

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(training_df)

Here:

  • lr is an Estimator
  • model_lr is a Transformer created by learning from training data

4. Building a Full ML Pipeline (Step-by-Step)​

Let’s build a pipeline predicting whether a customer will purchase a product.

Input Data

+-------+--------+-------+------+
|gender |age |income |label |
+-------+--------+-------+------+
|Male |34 |50000 |1 |
|Female |28 |65000 |0 |
|Female |30 |45000 |1 |
+-------+--------+-------+------+

Pipeline Code​

from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="gender", outputCol="gender_index")

assembler = VectorAssembler(
inputCols=["gender_index", "age", "income"],
outputCol="features"
)

lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[indexer, assembler, lr])

model = pipeline.fit(df)
predictions = model.transform(df)

predictions.select("gender", "age", "income", "prediction").show()

Output Example​

+--------+---+-------+----------+
|gender |age|income |prediction|
+--------+---+-------+----------+
|Male |34 |50000 |1.0 |
|Female |28 |65000 |0.0 |
|Female |30 |45000 |1.0 |
+--------+---+-------+----------+

The entire process β€” encoding, assembling, training β€” is now automated.


5. Saving & Loading the Pipeline​

pipeline.write().overwrite().save("/tmp/pipeline_purchase")
model.write().overwrite().save("/tmp/pipeline_purchase_model")

To load later:

from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/pipeline_purchase_model")

Production teams love this step β€” saves retraining costs and ensures reproducibility.


6. Why Use MLlib Pipelines?​

βœ” Prevents messy ML code βœ” Reusable components βœ” Ensures consistent feature engineering βœ” Works seamlessly with Spark clusters βœ” Best for production-grade ML systems


Summary​

In this chapter, you learned:

✨ Transformers β†’ transform data ✨ Estimators β†’ learn from data ✨ Pipelines β†’ combine everything into a clean workflow

MLlib pipelines help companies like DataVerse Labs achieve clean, scalable, reproducible ML engineering.


Next Topic β†’ Regression & Classification in PySpark

Career