Introduction to MLlib β Pipelines, Transformers, and Estimators
At DataVerse Labs, the data science team struggled with a familiar problem:
βOur ML code worksβ¦ until someone touches it.β
Different preprocessing steps, inconsistent model versions, and messy feature engineering caused pipeline failures every week.
So the team switched to PySpark MLlib Pipelines β a structured, reproducible, production-friendly way to prepare data and train machine learning models.
In this chapter, youβll understand Pipelines, Transformers, and Estimators one by one, with real examples and output.
1. What Are MLlib Pipelines?β
Think of a pipeline as an assembly line:
- Raw data enters
- It passes through multiple stages
- A prediction-ready dataset or model comes out
A PySpark ML Pipeline consists of stages, and each stage is either:
- Transformer β transforms data (e.g., VectorAssembler, StringIndexerModel)
- Estimator β learns from data and creates a transformer (e.g., LogisticRegression)
A pipeline ensures:
β reproducibility
β cleaner code
β easy hyperparameter tuning
β versioned, deployable ML workflow
2. Transformers β They Transform Dataβ
Transformers apply a function to your dataset.
Example: StringIndexerModel, VectorAssembler, StandardScalerModel.
Imagine NeoMart customer data:
Input Data
+--------+--------+-------+
|user_id |gender |income |
+--------+--------+-------+
|U1001 |Male |50000 |
|U1002 |Female |65000 |
|U1003 |Female |45000 |
+--------+--------+-------+
Code β StringIndexer (Transformer)β
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
model_indexer = indexer.fit(df)
df_indexed = model_indexer.transform(df)
df_indexed.show()
Outputβ
+--------+--------+-------+-------------+
|user_id |gender |income |gender_index |
+--------+--------+-------+-------------+
|U1001 |Male |50000 |1.0 |
|U1002 |Female |65000 |0.0 |
|U1003 |Female |45000 |0.0 |
+--------+--------+-------+-------------+
Transformers do not learn β they only apply existing logic.
3. Estimators β They Learn From Dataβ
Estimators produce transformers after fitting.
Examples:
StringIndexerβ produces StringIndexerModelStandardScalerβ produces StandardScalerModelLogisticRegressionβ produces LogisticRegressionModel
Code β Logistic Regression (Estimator)β
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(featuresCol="features", labelCol="label")
model_lr = lr.fit(training_df)
Here:
lris an Estimatormodel_lris a Transformer created by learning from training data
4. Building a Full ML Pipeline (Step-by-Step)β
Letβs build a pipeline predicting whether a customer will purchase a product.
Input Data
+-------+--------+-------+------+
|gender |age |income |label |
+-------+--------+-------+------+
|Male |34 |50000 |1 |
|Female |28 |65000 |0 |
|Female |30 |45000 |1 |
+-------+--------+-------+------+
Pipeline Codeβ
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
assembler = VectorAssembler(
inputCols=["gender_index", "age", "income"],
outputCol="features"
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[indexer, assembler, lr])
model = pipeline.fit(df)
predictions = model.transform(df)
predictions.select("gender", "age", "income", "prediction").show()
Output Exampleβ
+--------+---+-------+----------+
|gender |age|income |prediction|
+--------+---+-------+----------+
|Male |34 |50000 |1.0 |
|Female |28 |65000 |0.0 |
|Female |30 |45000 |1.0 |
+--------+---+-------+----------+
The entire process β encoding, assembling, training β is now automated.
5. Saving & Loading the Pipelineβ
pipeline.write().overwrite().save("/tmp/pipeline_purchase")
model.write().overwrite().save("/tmp/pipeline_purchase_model")
To load later:
from pyspark.ml import PipelineModel
loaded_model = PipelineModel.load("/tmp/pipeline_purchase_model")
Production teams love this step β saves retraining costs and ensures reproducibility.
6. Why Use MLlib Pipelines?β
β Prevents messy ML code β Reusable components β Ensures consistent feature engineering β Works seamlessly with Spark clusters β Best for production-grade ML systems
Summaryβ
In this chapter, you learned:
β¨ Transformers β transform data β¨ Estimators β learn from data β¨ Pipelines β combine everything into a clean workflow
MLlib pipelines help companies like DataVerse Labs achieve clean, scalable, reproducible ML engineering.
Next Topic β Regression & Classification in PySpark