Convert a pipeline with an XGBoost model

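sklearn-onnx only converts scikit-learn models into ONNX, but many libraries implement the scikit-learn API so that their models can be included in a scikit-learn pipeline. This example considers a pipeline that includes an XGBoost model. sklearn-onnx can convert the whole pipeline as long as it knows the converter associated with XGBClassifier. Let's see how to do it.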

Train an XGBoost classifier

import numpy
import onnxruntime as rt
from sklearn.datasets import load_iris, load_diabetes, make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier, XGBRegressor, DMatrix, train as train_xgb
from skl2onnx.common.data_types import FloatTensorType
from onnxmltools.convert.common.data_types import FloatTensorType as ml_tools_FloatTensorType
from skl2onnx import convert_sklearn, to_onnx, update_registered_converter
from skl2onnx.common.shape_calculator import (
    calculate_linear_classifier_output_shapes,
    calculate_linear_regressor_output_shapes,
)
from skl2onnx.convert import may_switch_bases_classes_order
from onnxmltools.convert.xgboost.operator_converters.XGBoost import convert_xgboost
from onnxmltools.convert import convert_xgboost as convert_xgboost_booster


data = load_iris()
X = data.data[:, :2]
y = data.target

ind = numpy.arange(X.shape[0])
numpy.random.shuffle(ind)
X = X[ind, :].copy()
y = y[ind].copy()

pipe = Pipeline([("scaler", StandardScaler()), ("xgb", XGBClassifier(n_estimators=3))])
pipe.fit(X, y)

# The conversion is expected to fail at this point.

try:
    convert_sklearn(
        pipe,
        "pipeline_xgboost",
        [("input", FloatTensorType([None, 2]))],
        target_opset={"": 12, "ai.onnx.ml": 2},
    )
except Exception as e:
    print(e)

# The error message says that no converter was found
# for XGBoost models. By default, sklearn-onnx
# only handles models from scikit-learn, but it can
# be extended to any model following the scikit-learn
# API as long as the module knows a converter exists
# for every model used in the pipeline. That's why
# we need to register a converter.

Unable to find a shape calculator for type '<class 'xgboost.sklearn.XGBClassifier'>'.
It usually means the pipeline being converted contains a
transformer or a predictor with no corresponding converter
implemented in sklearn-onnx. If the converted is implemented
in another library, you need to register
the converted so that it can be used by sklearn-onnx (function
update_registered_converter). If the model is not yet covered
by sklearn-onnx, you may raise an issue to
https://github.com/onnx/sklearn-onnx/issues
to get the converter implemented or even contribute to the
project. If the model is a custom model, a new converter must
be implemented. Examples can be found in the gallery.

Register the converter for XGBClassifier

The converter is implemented in onnxmltools: onnxmltools…XGBoost.py, along with the shape calculator: onnxmltools…Classifier.py.

update_registered_converter(
    XGBClassifier,
    "XGBoostXGBClassifier",
    calculate_linear_classifier_output_shapes,
    convert_xgboost,
    options={"nocl": [True, False], "zipmap": [True, False, "columns"]},
)
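
To make the idea of registration more concrete, here is a minimal toy sketch of a converter registry. It is not the skl2onnx implementation: the names `register`, `lookup`, `ToyClassifier`, `toy_shape` and `toy_convert` are all hypothetical and only illustrate why a lookup fails before a model class is registered and succeeds afterwards.

```python
# Toy model of a converter registry; NOT how skl2onnx is implemented.
from typing import Callable, Dict, Tuple

# Maps a model class to (alias, shape calculator, converter).
_REGISTRY: Dict[type, Tuple[str, Callable, Callable]] = {}


def register(model_type, alias, shape_fct, convert_fct):
    """Record which functions handle a given model class."""
    _REGISTRY[model_type] = (alias, shape_fct, convert_fct)


def lookup(model):
    """Return the registered entry for a model, or raise if there is none."""
    try:
        return _REGISTRY[type(model)]
    except KeyError:
        raise RuntimeError(
            f"Unable to find a shape calculator for type {type(model)!r}."
        ) from None


class ToyClassifier:
    """Hypothetical stand-in for a third-party model class."""


def toy_shape(model):
    # Hypothetical shape calculator: names of the outputs it would declare.
    return ("label", "probabilities")


def toy_convert(model):
    # Hypothetical converter: a placeholder for the generated ONNX nodes.
    return "onnx-nodes"


model = ToyClassifier()

# Before registration, the lookup fails.
try:
    lookup(model)
    found_before = True
except RuntimeError:
    found_before = False

# After registration, the same lookup succeeds.
register(ToyClassifier, "ToyToyClassifier", toy_shape, toy_convert)
alias, shape_fct, convert_fct = lookup(model)
print(found_before, alias)
```

In this sketch the registry is keyed by the exact class, which mirrors the fact that XGBClassifier and XGBRegressor are registered separately in this example.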

Convert again

with may_switch_bases_classes_order(XGBClassifier):
    # This context should not be needed anymore once this issue
    # is fixed in XGBoost.
    model_onnx = convert_sklearn(
        pipe,
        "pipeline_xgboost",
        [("input", FloatTensorType([None, 2]))],
        target_opset={"": 12, "ai.onnx.ml": 2},
    )

# And save.
with open("pipeline_xgboost.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

Compare the predictions

Predictions with XGBoost.

with may_switch_bases_classes_order(XGBClassifier):
    print("predict", pipe.predict(X[:5]))
    print("predict_proba", pipe.predict_proba(X[:1]))

predict [1 2 2 1 2]
predict_proba [[0.1554611  0.652432   0.19210689]]

Predictions with onnxruntime.

sess = rt.InferenceSession("pipeline_xgboost.onnx", providers=["CPUExecutionProvider"])
pred_onx = sess.run(None, {"input": X[:5].astype(numpy.float32)})
print("predict", pred_onx[0])
print("predict_proba", pred_onx[1][:1])

predict [1 2 2 1 2]
predict_proba [{0: 0.15546110272407532, 1: 0.6524320244789124, 2: 0.19210688769817352}]
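
The probabilities returned by onnxruntime come back as a list of dictionaries rather than as an array, so comparing them with `predict_proba` takes one extra step. The sketch below is one possible way to do that comparison with numpy alone. The helper `zipmap_to_array` is hypothetical (it is not part of skl2onnx or onnxruntime), and the numbers are copied from the outputs printed above.

```python
import numpy

# Values copied from the two outputs printed above.
expected = numpy.array([[0.1554611, 0.652432, 0.19210689]], dtype=numpy.float32)
onnx_probas = [
    {0: 0.15546110272407532, 1: 0.6524320244789124, 2: 0.19210688769817352}
]


def zipmap_to_array(rows):
    """Turn a list of {class: probability} dicts into a 2D float32 array.

    Columns are ordered by sorted class label (an assumption of this sketch).
    """
    classes = sorted(rows[0])
    return numpy.array(
        [[row[c] for c in classes] for row in rows], dtype=numpy.float32
    )


got = zipmap_to_array(onnx_probas)

# Raises an AssertionError if the two sets of probabilities disagree.
numpy.testing.assert_allclose(expected, got, rtol=1e-5)
print(got.shape)
```

A relative tolerance is used here because the ONNX graph computes in single precision, so exact equality is not a reasonable expectation.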

Same example with XGBRegressor

update_registered_converter(
    XGBRegressor,
    "XGBoostXGBRegressor",
    calculate_linear_regressor_output_shapes,
    convert_xgboost,
)


data = load_diabetes()
x = data.data
y = data.target
X_train, X_test, y_train, _ = train_test_split(x, y, test_size=0.5)

pipe = Pipeline([("scaler", StandardScaler()), ("xgb", XGBRegressor(n_estimators=3))])
pipe.fit(X_train, y_train)

with may_switch_bases_classes_order(XGBRegressor):
    print("predict", pipe.predict(X_test[:5]))

predict [108.139824 104.385315 178.4701   166.48694   99.2167  ]

ONNX

with may_switch_bases_classes_order(XGBRegressor):
    onx = to_onnx(
        pipe, X_train.astype(numpy.float32), target_opset={"": 12, "ai.onnx.ml": 2}
    )

sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
pred_onx = sess.run(None, {"X": X_test[:5].astype(numpy.float32)})
print("predict", pred_onx[0].ravel())

predict [108.13983  104.38532  178.4701   166.48694   99.216705]

Some discrepancies may appear. In that case, you should read Issues when switching to float.
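
One possible source of such discrepancies is that the ONNX graph in this example takes float32 inputs, whereas the scikit-learn part of the pipeline works in double precision. The sketch below is only an illustration of the mechanism with made-up numbers (`threshold` and `x` are not taken from the trained model): a value that lies just above a tree split threshold in float64 can become equal to it once both are rounded to float32, which sends the sample down the other branch.

```python
import numpy

threshold = 0.3        # hypothetical split threshold, stored as float64
x = threshold + 1e-9   # hypothetical feature value just above the threshold

# Double precision: x is strictly greater than the threshold,
# so the test "x <= threshold" is false.
goes_left_64 = x <= threshold

# Single precision: both values round to the same float32,
# so the same test becomes true.
goes_left_32 = numpy.float32(x) <= numpy.float32(threshold)

print(goes_left_64, goes_left_32)
```

When this happens for a sample, the two runtimes may follow different leaves, and the difference in the prediction can be much larger than the rounding error itself.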

Same with a Booster

A Booster cannot be inserted in a pipeline. It requires a different conversion function because it does not follow the scikit-learn API.

x, y = make_classification(
    n_classes=2, n_features=5, n_samples=100, random_state=42, n_informative=3
)
X_train, X_test, y_train, _ = train_test_split(x, y, test_size=0.5, random_state=42)

dtrain = DMatrix(X_train, label=y_train)

# Note: num_class is set to 3 although the generated data only has
# two classes (labels 0 and 1).
param = {"objective": "multi:softmax", "num_class": 3}
bst = train_xgb(param, dtrain, 10)

initial_type = [("float_input", ml_tools_FloatTensorType([None, X_train.shape[1]]))]

try:
    onx = convert_xgboost_booster(bst, "name", initial_types=initial_type)
    cont = True
except AssertionError as e:
    print("XGBoost is too recent or onnxmltools too old.", e)
    cont = False

if cont:
    sess = rt.InferenceSession(
        onx.SerializeToString(), providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name
    label_name = sess.get_outputs()[0].name
    pred_onx = sess.run([label_name], {input_name: X_test.astype(numpy.float32)})[0]
    print(pred_onx)

[0 0 1 1 0 1 0 1 0 1 0 0 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 1 1 1 1
 0 1 1 1 0 0 1 1 0 0 0 1 0]
