Choosing the appropriate output of a classifier

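A scikit-learn classifier usually returns a matrix of probabilities. By default, sklearn-onnx converts that matrix into a list of dictionaries in which each probability is mapped to its class id or name. This mechanism retains the class names but is slower. Let's see what other options are available.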

Train a model and convert it

from timeit import repeat
import numpy
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import onnxruntime as rt
import onnx
import skl2onnx
from skl2onnx.common.data_types import FloatTensorType
from skl2onnx import to_onnx
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

iris = load_iris()
X, y = iris.data, iris.target
X = X.astype(numpy.float32)
y = y * 2 + 10  # to get labels different from [0, 1, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = LogisticRegression(max_iter=500)
clr.fit(X_train, y_train)
print(clr)

onx = to_onnx(clr, X_train, target_opset=12)
LogisticRegression(max_iter=500)

Default behaviour: zipmap=True

The output type for the probabilities is a list of dictionaries.

sess = rt.InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
res = sess.run(None, {"X": X_test})
print(res[1][:2])
print("probabilities type:", type(res[1]))
print("type for the first observations:", type(res[1][0]))
[{10: 6.341787866404047e-06, 12: 0.030411386862397194, 14: 0.9695823192596436}, {10: 0.022245943546295166, 12: 0.9420960545539856, 14: 0.035658009350299835}]
probabilities type: <class 'list'>
type for the first observations: <class 'dict'>
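
When downstream code expects a matrix rather than dictionaries, the ZipMap output can be converted back by hand. The sketch below is only an illustration: `zipmap_to_matrix` is a hypothetical helper, not part of sklearn-onnx or onnxruntime, and it assumes every dictionary holds the same keys. The sample values are copied from the output above.

```python
def zipmap_to_matrix(rows):
    """Turn a list of {class: probability} dicts into (classes, matrix).

    Hypothetical helper: assumes every dictionary has the same keys.
    """
    if not rows:
        return [], []
    # Sort the keys so that the column order is deterministic.
    classes = sorted(rows[0])
    matrix = [[row[c] for c in classes] for row in rows]
    return classes, matrix


# First two observations, copied from the printed result above.
zipmap_output = [
    {10: 6.341787866404047e-06, 12: 0.030411386862397194, 14: 0.9695823192596436},
    {10: 0.022245943546295166, 12: 0.9420960545539856, 14: 0.035658009350299835},
]
classes, matrix = zipmap_to_matrix(zipmap_output)
print(classes)     # column order of the matrix
print(matrix[0])   # probabilities of the first observation
```

This conversion walks over every dictionary in Python, which is part of the cost the other options avoid.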

Option zipmap=False

The probabilities are now a matrix.

# Options are keyed by id(clr) so that they apply to this estimator only.
options = {id(clr): {"zipmap": False}}
onx2 = to_onnx(clr, X_train, options=options, target_opset=12)

sess2 = rt.InferenceSession(
    onx2.SerializeToString(), providers=["CPUExecutionProvider"]
)
res2 = sess2.run(None, {"X": X_test})
print(res2[1][:2])
print("probabilities type:", type(res2[1]))
print("type for the first observations:", type(res2[1][0]))
[[6.3417879e-06 3.0411387e-02 9.6958232e-01]
 [2.2245944e-02 9.4209605e-01 3.5658009e-02]]
probabilities type: <class 'numpy.ndarray'>
type for the first observations: <class 'numpy.ndarray'>
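
The matrix no longer carries the class identifiers, so a column index has to be mapped back to a class by the caller. The following sketch assumes the columns follow the order of `clr.classes_`, here `[10, 12, 14]`; `labels_from_probabilities` is a hypothetical helper written for this illustration. In practice the first output of the model already holds the predicted labels, so this is mainly useful to see how columns relate to classes.

```python
def labels_from_probabilities(probabilities, classes):
    """Return, for each row, the class with the largest probability."""
    labels = []
    for row in probabilities:
        # Index of the largest value in the row (argmax).
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best])
    return labels


classes = [10, 12, 14]  # assumption: same order as clr.classes_
probabilities = [       # first two rows of the matrix printed above
    [6.3417879e-06, 3.0411387e-02, 9.6958232e-01],
    [2.2245944e-02, 9.4209605e-01, 3.5658009e-02],
]
predicted = labels_from_probabilities(probabilities, classes)
print(predicted)  # [14, 12]
```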

Option zipmap="columns"

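This option removes the final ZipMap operator and splits the probabilities into columns. The final model produces one output for the label and one output per class.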

options = {id(clr): {"zipmap": "columns"}}
onx3 = to_onnx(clr, X_train, options=options, target_opset=12)

sess3 = rt.InferenceSession(
    onx3.SerializeToString(), providers=["CPUExecutionProvider"]
)
res3 = sess3.run(None, {"X": X_test})
for i, out in enumerate(sess3.get_outputs()):
    print(
        "output: '{}' shape={} values={}...".format(
            out.name, res3[i].shape, res3[i][:2]
        )
    )
output: 'output_label' shape=(38,) values=[14 12]...
output: 'i10' shape=(38,) values=[6.3417879e-06 2.2245944e-02]...
output: 'i12' shape=(38,) values=[0.03041139 0.94209605]...
output: 'i14' shape=(38,) values=[0.9695823  0.03565801]...
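
If the per-class outputs need to be handled together again, they can be regrouped by name. The sketch below is a plain-Python illustration under stated assumptions: `columns_to_matrix` is a hypothetical helper, and the dictionary stands in for something like `dict(zip([o.name for o in sess3.get_outputs()], res3))`, filled with the first two values of each output printed above.

```python
def columns_to_matrix(outputs, label_name="output_label"):
    """Stack the per-class columns into rows, keeping the column names.

    Hypothetical helper: `outputs` maps an output name to its values.
    """
    names = [name for name in outputs if name != label_name]
    # zip(*columns) transposes the columns into one row per observation.
    rows = [list(values) for values in zip(*(outputs[name] for name in names))]
    return names, rows


outputs = {
    "output_label": [14, 12],
    "i10": [6.3417879e-06, 2.2245944e-02],
    "i12": [0.03041139, 0.94209605],
    "i14": [0.9695823, 0.03565801],
}
names, rows = columns_to_matrix(outputs)
print(names)  # ['i10', 'i12', 'i14']
print(rows)
```

Keeping the names next to the rows preserves the link between each column and its class, which is the information ZipMap would otherwise provide.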

Let's compare prediction time

print("Average time with ZipMap:")
print(sum(repeat(lambda: sess.run(None, {"X": X_test}), number=100, repeat=10)) / 10)

print("Average time without ZipMap:")
print(sum(repeat(lambda: sess2.run(None, {"X": X_test}), number=100, repeat=10)) / 10)

print("Average time without ZipMap but with columns:")
print(sum(repeat(lambda: sess3.run(None, {"X": X_test}), number=100, repeat=10)) / 10)

# On this example, prediction is faster without ZipMap.
# The gain is even larger when the classes are strings
# rather than integers, because the final result
# (a list of dictionaries) may copy the same information
# many times with onnxruntime.
Average time with ZipMap:
0.003264389200558071
Average time without ZipMap:
0.0027052922996517736
Average time without ZipMap but with columns:
0.0026047332994494354
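
Note that `timeit.repeat` returns the total time of `number` executions for each repetition, so the figures above are the average time of 100 calls, not of one. A small helper can report the time of a single call instead. This is only a sketch: `time_per_call` is a hypothetical name, and the lambda is a stand-in workload that would be replaced by something like `lambda: sess.run(None, {"X": X_test})`.

```python
from timeit import repeat


def time_per_call(fct, number=100, rounds=10):
    """Return (best, mean) time in seconds for a single call of fct."""
    totals = repeat(fct, number=number, repeat=rounds)
    # Each total covers `number` calls, so divide to get one call.
    per_call = [total / number for total in totals]
    return min(per_call), sum(per_call) / len(per_call)


# Stand-in workload used only to keep the example self-contained.
best, mean = time_per_call(lambda: sum(range(1000)))
print("best: %.3e s, mean: %.3e s" % (best, mean))
```

Reporting the best time alongside the mean makes the comparison less sensitive to noise from other processes.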

Option zipmap=False and output_class_labels=True

Option zipmap=False looks like the better choice because it is faster, but the labels are lost in the process. The option output_class_labels exposes them as a third output.

# output_class_labels adds the class labels as a third output.
options = {id(clr): {"zipmap": False, "output_class_labels": True}}
onx4 = to_onnx(clr, X_train, options=options, target_opset=12)

sess4 = rt.InferenceSession(
    onx4.SerializeToString(), providers=["CPUExecutionProvider"]
)
res4 = sess4.run(None, {"X": X_test})
print(res4[1][:2])
print("probabilities type:", type(res4[1]))
print("class labels:", res4[2])
[[6.3417879e-06 3.0411387e-02 9.6958232e-01]
 [2.2245944e-02 9.4209605e-01 3.5658009e-02]]
probabilities type: <class 'numpy.ndarray'>
class labels: [10 12 14]
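
With the labels available as a separate output, the dictionaries produced by ZipMap can be rebuilt on demand, for example only for the rows that are displayed. The sketch below assumes the label order matches the column order of the probability matrix, as the output above suggests; `to_dicts` is a hypothetical helper and the values are copied from that output.

```python
def to_dicts(probabilities, class_labels):
    """Pair each row of probabilities with the class labels."""
    return [dict(zip(class_labels, row)) for row in probabilities]


class_labels = [10, 12, 14]  # third output of the model
probabilities = [            # first two rows of the second output
    [6.3417879e-06, 3.0411387e-02, 9.6958232e-01],
    [2.2245944e-02, 9.4209605e-01, 3.5658009e-02],
]
as_dicts = to_dicts(probabilities, class_labels)
print(as_dicts[0])
```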

Processing time.

print("Average time without ZipMap but with output_class_labels:")
print(sum(repeat(lambda: sess4.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
Average time without ZipMap but with output_class_labels:
0.003510921600536676

MultiOutputClassifier

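This model is equivalent to several classifiers, one for every label to predict. Instead of returning a matrix of probabilities, it returns a sequence of matrices. Let's first modify the labels to get a problem suited to a MultiOutputClassifier.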

y = numpy.vstack([y, y + 100]).T
y[::5, 1] = 1000  # Let's add a fourth class.
print(y[:5])
[[  10 1000]
 [  10  110]
 [  10  110]
 [  10  110]
 [  10  110]]

Let's train a MultiOutputClassifier.

X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = MultiOutputClassifier(LogisticRegression(max_iter=500))
clr.fit(X_train, y_train)
print(clr)

onx5 = to_onnx(clr, X_train, target_opset=12)

sess5 = rt.InferenceSession(
    onx5.SerializeToString(), providers=["CPUExecutionProvider"]
)
res5 = sess5.run(None, {"X": X_test[:3]})
print(res5)
MultiOutputClassifier(estimator=LogisticRegression(max_iter=500))
/home/xadupre/github/sklearn-onnx/skl2onnx/_parse.py:564: UserWarning: Option zipmap is ignored for model <class 'sklearn.multioutput.MultiOutputClassifier'>. Set option zipmap to False to remove this message.
  warnings.warn(
[array([[ 10, 110],
       [ 12, 112],
       [ 12, 114]], dtype=int64), [array([[9.6185070e-01, 3.8149051e-02, 2.1035541e-07],
       [4.4240355e-02, 9.4647479e-01, 9.2849005e-03],
       [8.0552045e-03, 6.3393027e-01, 3.5801452e-01]], dtype=float32), array([[7.9632753e-01, 8.7586209e-02, 1.8680112e-04, 1.1589945e-01],
       [2.6829399e-02, 7.2040278e-01, 5.2288104e-02, 2.0047970e-01],
       [7.8932270e-03, 3.6448562e-01, 5.4698211e-01, 8.0639035e-02]],
      dtype=float32)]]

Option zipmap is ignored. The labels are missing but they can be added back as a third output.

onx6 = to_onnx(
    clr,
    X_train,
    target_opset=12,
    options={"zipmap": False, "output_class_labels": True},
)

sess6 = rt.InferenceSession(
    onx6.SerializeToString(), providers=["CPUExecutionProvider"]
)
res6 = sess6.run(None, {"X": X_test[:3]})
print("predicted labels", res6[0])
print("predicted probabilities", res6[1])
print("class labels", res6[2])
predicted labels [[ 10 110]
 [ 12 112]
 [ 12 114]]
predicted probabilities [array([[9.6185070e-01, 3.8149051e-02, 2.1035541e-07],
       [4.4240355e-02, 9.4647479e-01, 9.2849005e-03],
       [8.0552045e-03, 6.3393027e-01, 3.5801452e-01]], dtype=float32), array([[7.9632753e-01, 8.7586209e-02, 1.8680112e-04, 1.1589945e-01],
       [2.6829399e-02, 7.2040278e-01, 5.2288104e-02, 2.0047970e-01],
       [7.8932270e-03, 3.6448562e-01, 5.4698211e-01, 8.0639035e-02]],
      dtype=float32)]
class labels [array([10, 12, 14], dtype=int64), array([ 110,  112,  114, 1000], dtype=int64)]
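
Each probability matrix in the sequence has its own set of class labels, so the two lists have to be read in parallel. The sketch below pairs them target by target and recovers the predicted labels; `predict_from_sequence` is a hypothetical helper and the numbers are copied from the output above.

```python
def predict_from_sequence(probabilities, class_labels):
    """Return one row of predicted labels per observation.

    probabilities: one matrix (list of rows) per target.
    class_labels: one list of labels per target, in column order.
    """
    n_rows = len(probabilities[0])
    predictions = [[] for _ in range(n_rows)]
    for matrix, labels in zip(probabilities, class_labels):
        for i, row in enumerate(matrix):
            # Index of the largest probability for this target.
            best = max(range(len(row)), key=row.__getitem__)
            predictions[i].append(labels[best])
    return predictions


probabilities = [
    [
        [9.6185070e-01, 3.8149051e-02, 2.1035541e-07],
        [4.4240355e-02, 9.4647479e-01, 9.2849005e-03],
        [8.0552045e-03, 6.3393027e-01, 3.5801452e-01],
    ],
    [
        [7.9632753e-01, 8.7586209e-02, 1.8680112e-04, 1.1589945e-01],
        [2.6829399e-02, 7.2040278e-01, 5.2288104e-02, 2.0047970e-01],
        [7.8932270e-03, 3.6448562e-01, 5.4698211e-01, 8.0639035e-02],
    ],
]
class_labels = [[10, 12, 14], [110, 112, 114, 1000]]
predictions = predict_from_sequence(probabilities, class_labels)
print(predictions)  # [[10, 110], [12, 112], [12, 114]]
```

The result agrees with the predicted labels returned as the first output of the model.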

Versions used for this example

print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
print("onnx: ", onnx.__version__)
print("onnxruntime: ", rt.__version__)
print("skl2onnx: ", skl2onnx.__version__)
numpy: 2.2.0
scikit-learn: 1.6.0
onnx:  1.18.0
onnxruntime:  1.21.0+cu126
skl2onnx:  1.18.0

Total running time of the script: (0 minutes 0.271 seconds)
