Study notes
Before training a model, the data must be preprocessed (prepared) to make it easier for the algorithm to fit.
The two most common preparation techniques are listed below, with a small code sketch after the list:
- Scaling numeric features - bring all values into the same range, e.g.
A | B | C |
---|---|---|
3 | 480 | 65 |
will become:
A | B | C |
---|---|---|
0.3 | 0.48 | 0.65 |
- Encoding categorical variables - a category such as Size (S, M, L)
will become:
Size: 0, 1, 2, or better (one-hot encoding):
Size_S | Size_M | Size_L |
---|---|---|
1 | 0 | 0 |
0 | 1 | 0 |
0 | 0 | 1 |
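A minimal sketch of both techniques on made-up toy data, using scikit-learn's MinMaxScaler and OneHotEncoder (the exact scaled values depend on the scaler chosen, so they will not match the table above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# numeric features on very different ranges -> rescale each column into [0, 1]
numeric = np.array([[3.0, 480.0, 65.0],
                    [7.0, 120.0, 30.0],
                    [1.0, 900.0, 90.0]])
print(MinMaxScaler().fit_transform(numeric))
# categorical feature Size -> one column per category (one-hot encoding)
sizes = np.array([['S'], ['M'], ['L']])
print(OneHotEncoder().fit_transform(sizes).toarray())
# columns are ordered alphabetically: Size_L, Size_M, Size_S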
The preprocessing steps and the algorithm are then packed together into a pipeline.
# Will be used:
# sklearn
#   compose
#     ColumnTransformer
#   pipeline
#     Pipeline
#   impute
#     SimpleImputer
#   preprocessing
#     StandardScaler
#     OneHotEncoder
#   linear_model
#     LinearRegression
#   ensemble
#     GradientBoostingRegressor
#     RandomForestRegressor
# Train the model
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np
# preprocessing for numeric columns (scale)
numeric_features = [1, 2]
numeric_transformer = Pipeline(
    steps=[
        ('scaler', StandardScaler())
    ]
)
# preprocessing for categorical columns (one-hot encode)
categorical_features = [3, 4]
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)
# combine both preprocessing steps defined above
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
    ]
)
# add both the preprocessing steps and the algorithm into the same pipeline
pipeline = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', GradientBoostingRegressor())
    ]
)
# Train the model (X_train and y_train are assumed to be prepared beforehand)
model = pipeline.fit(X_train, y_train)
print(model)
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('scaler',
StandardScaler())]),
[1, 2]),
('cat',
Pipeline(steps=[('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
[3, 4])])),
('regressor', GradientBoostingRegressor())])
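Because the preprocessing lives inside the pipeline, the fitted model can be evaluated directly on raw test data. A hedged sketch, assuming X_test and y_test exist with the same column layout as the training set:

from sklearn.metrics import mean_squared_error, r2_score
predictions = model.predict(X_test)   # preprocessing is applied automatically
print('MSE:', mean_squared_error(y_test, predictions))
print('R2:', r2_score(y_test, predictions))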
Now it is easy to swap in another algorithm:
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])
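Note that swapping the estimator gives a new, unfitted pipeline, so it has to be trained again before use (a sketch, reusing the same assumed training data):

rf_model = pipeline.fit(X_train, y_train)   # the new pipeline must be refitted
print(rf_model)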
# Save the trained pipeline to disk:
import joblib
joblib.dump(model, 'my_pipelined_job.pkl')
# Load the saved pipeline back into a model
loaded_model = joblib.load('my_pipelined_job.pkl')
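The loaded pipeline carries its fitted preprocessing with it, so it can predict on new raw rows directly; X_new below is an assumed dataset with the same column layout as X_train:

new_predictions = loaded_model.predict(X_new)   # no separate preprocessing step needed
print(new_predictions)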
References:
- Create machine learning models - Training | Microsoft Learn
- 1. Supervised learning — scikit-learn 1.2.1 documentation