Choose appropriate output of a classifier¶
Scikit-learn classifiers usually return a matrix of probabilities. By default, sklearn-onnx converts that matrix into a list of dictionaries in which every probability is mapped to its class id or name. That mechanism keeps the class names but is slower. Let's see what other options exist.
Train a model and convert it¶
from timeit import repeat
import numpy
import sklearn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import onnxruntime as rt
import onnx
import skl2onnx
from skl2onnx import to_onnx
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
iris = load_iris()
X, y = iris.data, iris.target
X = X.astype(numpy.float32)
y = y * 2 + 10 # to get labels different from [0, 1, 2]
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = LogisticRegression(max_iter=500)
clr.fit(X_train, y_train)
print(clr)
onx = to_onnx(clr, X_train, target_opset=12)
LogisticRegression(max_iter=500)
Default behaviour: zipmap=True¶
The output type for the probabilities is a list of dictionaries.
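A minimal sketch of the inference call that produces the output below, assuming the session is named sess (the same name the timing comparison further down relies on):
sess = rt.InferenceSession(
    onx.SerializeToString(), providers=["CPUExecutionProvider"]
)
res = sess.run(None, {"X": X_test})
print(res[1][:2])
print("probabilities type:", type(res[1]))
print("type for the first observations:", type(res[1][0]))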
[{10: 0.9853358864784241, 12: 0.01466413028538227, 14: 3.883373267399293e-08}, {10: 0.0013562627136707306, 12: 0.5341124534606934, 14: 0.4645313024520874}]
probabilities type: <class 'list'>
type for the first observations: <class 'dict'>
Option zipmap=False¶
Probabilities are now a matrix.
options = {id(clr): {"zipmap": False}}
onx2 = to_onnx(clr, X_train, options=options, target_opset=12)
sess2 = rt.InferenceSession(
    onx2.SerializeToString(), providers=["CPUExecutionProvider"]
)
res2 = sess2.run(None, {"X": X_test})
print(res2[1][:2])
print("probabilities type:", type(res2[1]))
print("type for the first observations:", type(res2[1][0]))
[[9.8533589e-01 1.4664130e-02 3.8833733e-08]
[1.3562627e-03 5.3411245e-01 4.6453130e-01]]
probabilities type: <class 'numpy.ndarray'>
type for the first observations: <class 'numpy.ndarray'>
Option zipmap='columns'¶
This option removes the final ZipMap operator and splits the probabilities into columns. The converted model produces one output for the label and one output per class.
options = {id(clr): {"zipmap": "columns"}}
onx3 = to_onnx(clr, X_train, options=options, target_opset=12)
sess3 = rt.InferenceSession(
    onx3.SerializeToString(), providers=["CPUExecutionProvider"]
)
res3 = sess3.run(None, {"X": X_test})
for i, out in enumerate(sess3.get_outputs()):
    print(
        "output: '{}' shape={} values={}...".format(
            out.name, res3[i].shape, res3[i][:2]
        )
    )
output: 'output_label' shape=(38,) values=[10 12]...
output: 'i10' shape=(38,) values=[0.9853359 0.00135626]...
output: 'i12' shape=(38,) values=[0.01466413 0.53411245]...
output: 'i14' shape=(38,) values=[3.8833733e-08 4.6453130e-01]...
Let's compare prediction times¶
print("Average time with ZipMap:")
print(sum(repeat(lambda: sess.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
print("Average time without ZipMap:")
print(sum(repeat(lambda: sess2.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
print("Average time without ZipMap but with columns:")
print(sum(repeat(lambda: sess3.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
# The prediction is much faster without ZipMap
# on this example.
# The optimisation is even faster when the classes
# are described with strings and not integers
# as the final result (list of dictionaries) may copy
# many times the same information with onnxruntime.
Average time with ZipMap:
0.003566398999964804
Average time without ZipMap:
0.0027388382000481217
Average time without ZipMap but with columns:
0.008876207899970723
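The comment above notes that the gap widens when the classes are described with strings rather than integers. Below is a minimal sketch of that comparison; it is not part of the original example, and the names y_str, clr_str, onx_zip, onx_mat, sess_zip and sess_mat are illustrative.
# Hypothetical extension: the same classifier trained on string labels.
y_str = numpy.array(["class_%d" % v for v in y_train])
clr_str = LogisticRegression(max_iter=500)
clr_str.fit(X_train, y_str)

# Convert twice: once with the default ZipMap, once with zipmap=False.
onx_zip = to_onnx(clr_str, X_train, target_opset=12)
onx_mat = to_onnx(
    clr_str, X_train, options={id(clr_str): {"zipmap": False}}, target_opset=12
)
sess_zip = rt.InferenceSession(
    onx_zip.SerializeToString(), providers=["CPUExecutionProvider"]
)
sess_mat = rt.InferenceSession(
    onx_mat.SerializeToString(), providers=["CPUExecutionProvider"]
)

print("string labels, with ZipMap:")
print(sum(repeat(lambda: sess_zip.run(None, {"X": X_test}), number=100, repeat=10)) / 10)
print("string labels, without ZipMap:")
print(sum(repeat(lambda: sess_mat.run(None, {"X": X_test}), number=100, repeat=10)) / 10)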
Option zipmap=False and output_class_labels=True¶
Option zipmap=False seems the better choice because it is much faster, but the class labels are lost along the way. Option output_class_labels can be used to expose them as a third output.
options = {id(clr): {"zipmap": False, "output_class_labels": True}}
onx4 = to_onnx(clr, X_train, options=options, target_opset=12)
sess4 = rt.InferenceSession(
    onx4.SerializeToString(), providers=["CPUExecutionProvider"]
)
res4 = sess4.run(None, {"X": X_test})
print(res4[1][:2])
print("probabilities type:", type(res4[1]))
print("class labels:", res4[2])
[[9.8533589e-01 1.4664130e-02 3.8833733e-08]
[1.3562627e-03 5.3411245e-01 4.6453130e-01]]
probabilities type: <class 'numpy.ndarray'>
class labels: [10 12 14]
Processing time.
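A sketch of the measurement, following the same pattern as the comparison above and reusing sess4:
print("Average time without ZipMap but with output_class_labels:")
print(sum(repeat(lambda: sess4.run(None, {"X": X_test}), number=100, repeat=10)) / 10)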
Average time without ZipMap but with output_class_labels:
0.003412025500006166
MultiOutputClassifier¶
This model is equivalent to several classifiers, one per label to predict. Instead of returning a single matrix of probabilities, it returns a sequence of matrices. Let's first modify the labels to obtain a problem suited to a MultiOutputClassifier.
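A minimal sketch of that label construction, consistent with the rows printed below: the labels are stacked with a shifted copy, and every fifth row of the second column is overwritten to introduce an extra class.
y = numpy.vstack([y, y + 100]).T
y[::5, 1] = 1000  # add a fourth class to the second column
print(y[:5])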
[[ 10 1000]
[ 10 110]
[ 10 110]
[ 10 110]
[ 10 110]]
Let's train a MultiOutputClassifier.
X_train, X_test, y_train, y_test = train_test_split(X, y)
clr = MultiOutputClassifier(LogisticRegression(max_iter=500))
clr.fit(X_train, y_train)
print(clr)
onx5 = to_onnx(clr, X_train, target_opset=12)
sess5 = rt.InferenceSession(
    onx5.SerializeToString(), providers=["CPUExecutionProvider"]
)
res5 = sess5.run(None, {"X": X_test[:3]})
print(res5)
MultiOutputClassifier(estimator=LogisticRegression(max_iter=500))
/home/xadupre/github/sklearn-onnx/skl2onnx/_parse.py:569: UserWarning: Option zipmap is ignored for model <class 'sklearn.multioutput.MultiOutputClassifier'>. Set option zipmap to False to remove this message.
warnings.warn(
[array([[ 12, 112],
[ 10, 110],
[ 12, 112]], dtype=int64), [array([[3.4271918e-02, 9.4276208e-01, 2.2966042e-02],
[9.8361975e-01, 1.6380120e-02, 1.1866883e-07],
[2.5017133e-02, 9.3919086e-01, 3.5791978e-02]], dtype=float32), array([[2.2323465e-02, 6.9690281e-01, 6.7374893e-02, 2.1339883e-01],
[8.0926484e-01, 2.7745293e-02, 4.4738154e-05, 1.6294508e-01],
[1.6758386e-02, 5.4099339e-01, 6.5238215e-02, 3.7700996e-01]],
dtype=float32)]]
Option zipmap is ignored. The class labels are missing, but they can be added back as a third output.
onx6 = to_onnx(
    clr,
    X_train,
    target_opset=12,
    options={"zipmap": False, "output_class_labels": True},
)
sess6 = rt.InferenceSession(
    onx6.SerializeToString(), providers=["CPUExecutionProvider"]
)
res6 = sess6.run(None, {"X": X_test[:3]})
print("predicted labels", res6[0])
print("predicted probabilies", res6[1])
print("class labels", res6[2])
predicted labels [[ 12 112]
[ 10 110]
[ 12 112]]
predicted probabilities [array([[3.4271918e-02, 9.4276208e-01, 2.2966042e-02],
[9.8361975e-01, 1.6380120e-02, 1.1866883e-07],
[2.5017133e-02, 9.3919086e-01, 3.5791978e-02]], dtype=float32), array([[2.2323465e-02, 6.9690281e-01, 6.7374893e-02, 2.1339883e-01],
[8.0926484e-01, 2.7745293e-02, 4.4738154e-05, 1.6294508e-01],
[1.6758386e-02, 5.4099339e-01, 6.5238215e-02, 3.7700996e-01]],
dtype=float32)]
class labels [array([10, 12, 14], dtype=int64), array([ 110, 112, 114, 1000], dtype=int64)]
Versions used for this example:
print("numpy:", numpy.__version__)
print("scikit-learn:", sklearn.__version__)
print("onnx: ", onnx.__version__)
print("onnxruntime: ", rt.__version__)
print("skl2onnx: ", skl2onnx.__version__)
numpy: 2.3.1
scikit-learn: 1.6.1
onnx: 1.19.0
onnxruntime: 1.23.0
skl2onnx: 1.19.1
Total running time of the script: (0 minutes 1.331 seconds)