One model, many possible conversions with options
There is not just one way to convert a model. A newer release of ONNX may introduce a new operator that speeds up the converted model. The rational choice is to use that new operator, but it means the target runtime must provide an implementation for it. What if two different users need two different conversions of the same model? Let's see how this can be done.
Option zipmap
By design, every classifier is converted into an ONNX graph which outputs two results: the predicted label and the predicted probability for every label. By default, the labels are integers and the probabilities are stored in dictionaries. That is the purpose of the ZipMap operator at the end of the following graph.
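The code that produced the printout below did not survive extraction; a minimal sketch that reproduces it, assuming onnx.helper.printable_graph is used for the textual rendering (onnx.printer.to_text is the newer equivalent), could look like this:

import numpy
from onnx.helper import printable_graph
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import to_onnx

X, y = load_iris(return_X_y=True)
clr = LogisticRegression(max_iter=500).fit(X, y)

# Default conversion: the resulting graph ends with a ZipMap operator.
model_def = to_onnx(clr, X.astype(numpy.float32))
print(printable_graph(model_def.graph))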
graph ONNX(LogisticRegression) (
%X[FLOAT, ?x4]
) {
%label, %probability_tensor = LinearClassifier[classlabels_ints = [0, 1, 2], coefficients = [-0.374590873718262, 0.882017612457275, -2.25903177261353, -0.96484386920929, 0.463038802146912, -0.698963463306427, -0.0836651995778084, -0.888288736343384, -0.0884479060769081, -0.18305416405201, 2.34269690513611, 1.85313260555267], intercepts = [8.58371162414551, 2.95640826225281, -11.5401201248169], multi_class = 1, post_transform = 'SOFTMAX'](%X)
%output_label = Cast[to = 7](%label)
%probabilities = Normalizer[norm = 'L1'](%probability_tensor)
%output_probability = ZipMap[classlabels_int64s = [0, 1, 2]](%probabilities)
return %output_label, %output_probability
}
This operator is not efficient, as it copies every probability and label into a different container. For small classifiers, this copying usually takes a significant share of the runtime, so it makes sense to remove it.
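Assuming the same classifier as above, the second printout can be reproduced by disabling the operator at conversion time; a sketch:

# Hypothetical reconstruction: the zipmap option removes the final ZipMap.
model_def = to_onnx(
    clr, X.astype(numpy.float32), options={id(clr): {"zipmap": False}}
)
print(printable_graph(model_def.graph))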
graph ONNX(LogisticRegression) (
%X[FLOAT, ?x4]
) {
%label, %probability_tensor = LinearClassifier[classlabels_ints = [0, 1, 2], coefficients = [-0.374590873718262, 0.882017612457275, -2.25903177261353, -0.96484386920929, 0.463038802146912, -0.698963463306427, -0.0836651995778084, -0.888288736343384, -0.0884479060769081, -0.18305416405201, 2.34269690513611, 1.85313260555267], intercepts = [8.58371162414551, 2.95640826225281, -11.5401201248169], multi_class = 1, post_transform = 'SOFTMAX'](%X)
%probabilities = Normalizer[norm = 'L1'](%probability_tensor)
return %label, %probabilities
}
A graph may contain many classifiers, so it is important to be able to specify which classifier should keep its ZipMap and which should not. Options can therefore be specified per model, keyed by id.
from pprint import pformat
import numpy
from onnx.reference import ReferenceEvaluator
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from skl2onnx.common._registration import _converter_pool
from skl2onnx import to_onnx
from onnxruntime import InferenceSession
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, _ = train_test_split(X, y, random_state=11)
clr = LogisticRegression()
clr.fit(X_train, y_train)
model_def = to_onnx(
    clr, X_train.astype(numpy.float32), options={id(clr): {"zipmap": False}}
)
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities
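With the ZipMap removed, the probabilities come back as a plain tensor rather than a list of dictionaries. A quick check (the exact values depend on the trained model):

# The second output is now a regular numpy array of shape (n_samples, n_classes).
label, probabilities = oinf.run(None, {"X": X_test[:2].astype(numpy.float32)})
print(type(probabilities), probabilities.shape)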
Using the id function has one drawback: it is not picklable. It is better to use strings.
model_def = to_onnx(clr, X_train.astype(numpy.float32), options={"zipmap": False})
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities
Option in a pipeline
In a pipeline, sklearn-onnx uses the same naming convention.
pipe = Pipeline([("norm", MinMaxScaler()), ("clr", LogisticRegression())])
pipe.fit(X_train, y_train)
model_def = to_onnx(pipe, X_train.astype(numpy.float32), options={"clr__zipmap": False})
oinf = ReferenceEvaluator(model_def)
print(oinf)
ReferenceEvaluator(X) -> label, probabilities
Option raw_scores
By default, every classifier is converted into a graph which returns probabilities. But many models compute unscaled raw scores. First, with probabilities:
pipe = Pipeline([("norm", MinMaxScaler()), ("clr", LogisticRegression())])
pipe.fit(X_train, y_train)
model_def = to_onnx(
    pipe, X_train.astype(numpy.float32), options={id(pipe): {"zipmap": False}}
)
oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[0.8826898 , 0.10948468, 0.00782558],
[0.7944286 , 0.19729899, 0.00827242],
[0.8555814 , 0.13791925, 0.00649932],
[0.82628906, 0.16633531, 0.00737559],
[0.9005094 , 0.09238414, 0.00710642]], dtype=float32)]
Then with raw scores:
model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={id(pipe): {"raw_scores": True, "zipmap": False}},
)
oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[0.8826898 , 0.10948468, 0.00782558],
[0.7944286 , 0.19729899, 0.00827242],
[0.8555814 , 0.13791925, 0.00649932],
[0.82628906, 0.16633531, 0.00737559],
[0.9005094 , 0.09238414, 0.00710642]], dtype=float32)]
It does not seem to work... We need to specify that the option applies to a specific part of the pipeline and not to the pipeline as a whole.
model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={id(pipe.steps[1][1]): {"raw_scores": True, "zipmap": False}},
)
oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[ 2.2709217 , 0.18373239, -2.454654 ],
[ 1.9858665 , 0.59296364, -2.57883 ],
[ 2.2350655 , 0.40995252, -2.6450179 ],
[ 2.1072361 , 0.5042972 , -2.6115332 ],
[ 2.3729892 , 0.09598386, -2.468973 ]], dtype=float32)]
There are negative values: it works. Strings remain easier to use.
model_def = to_onnx(
    pipe,
    X_train.astype(numpy.float32),
    options={"clr__raw_scores": True, "clr__zipmap": False},
)
oinf = ReferenceEvaluator(model_def)
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]}))
[array([0, 0, 0, 0, 0]), array([[ 2.2709217 , 0.18373239, -2.454654 ],
[ 1.9858665 , 0.59296364, -2.57883 ],
[ 2.2350655 , 0.40995252, -2.6450179 ],
[ 2.1072361 , 0.5042972 , -2.6115332 ],
[ 2.3729892 , 0.09598386, -2.468973 ]], dtype=float32)]
Negative values again. We still get the raw scores.
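For this pipeline, the raw scores are expected to correspond to scikit-learn's decision_function, though the text above does not verify it; a hedged cross-check:

# Assumption: raw_scores matches decision_function for LogisticRegression.
print(pipe.decision_function(X[:5]))
print(oinf.run(None, {"X": X.astype(numpy.float32)[:5]})[1])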
Option decision_path
scikit-learn implements a function to retrieve the decision path. It can be enabled with the option decision_path.
clrrf = RandomForestClassifier(n_estimators=2, max_depth=2)
clrrf.fit(X_train, y_train)
clrrf.predict(X_test[:2])
paths, n_nodes_ptr = clrrf.decision_path(X_test[:2])
print(paths.todense())
model_def = to_onnx(
    clrrf,
    X_train.astype(numpy.float32),
    options={id(clrrf): {"decision_path": True, "zipmap": False}},
)
sess = InferenceSession(
    model_def.SerializeToString(), providers=["CPUExecutionProvider"]
)
[[1 0 1 0 1 1 0 1 0 1]
[1 0 1 0 1 1 0 1 0 1]]
The model produces 3 outputs.
print([o.name for o in sess.get_outputs()])
['label', 'probabilities', 'decision_path']
Let's display the last one.
res = sess.run(None, {"X": X_test[:2].astype(numpy.float32)})
print(res[-1])
[['10101' '10101']
['10101' '10101']]
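Each string appears to encode one 0/1 flag per node of a tree. Under that assumption, the strings can be matched against the dense matrix returned by scikit-learn's decision_path, using n_nodes_ptr to split each row into one segment per tree; a hedged sketch:

# Hypothetical comparison with the dense matrix printed earlier.
dense = paths.todense()
for i in range(dense.shape[0]):
    row = "".join(str(int(v)) for v in numpy.asarray(dense[i]).ravel())
    per_tree = [
        row[n_nodes_ptr[t] : n_nodes_ptr[t + 1]] for t in range(len(n_nodes_ptr) - 1)
    ]
    print(per_tree, res[-1][i])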
List of available options
Options are registered for every converter, so that any supported option can be detected while the conversion is running.
all_opts = set()
for k, v in sorted(_converter_pool.items()):
    opts = v.get_allowed_options()
    if not isinstance(opts, dict):
        continue
    name = k.replace("Sklearn", "")
    print("%s%s %r" % (name, " " * (30 - len(name)), opts))
    for o in opts:
        all_opts.add(o)

print("all options:", pformat(list(sorted(all_opts))))
LightGbmLGBMClassifier {'nocl': [True, False], 'zipmap': [True, False, 'columns']}
Skl2onnxTraceableCountVectorizer {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
Skl2onnxTraceableTfidfVectorizer {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
AdaBoostClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
BaggingClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
BayesianGaussianMixture {'score_samples': [True, False]}
BayesianRidge {'return_std': [True, False]}
BernoulliNB {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CalibratedClassifierCV {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CategoricalNB {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
ComplementNB {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
CountVectorizer {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
DecisionTreeClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
DecisionTreeRegressor {'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreeClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreeRegressor {'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreesClassifier {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
ExtraTreesRegressor {'decision_path': [True, False], 'decision_leaf': [True, False]}
FeatureHasher {'separator': None}
GaussianMixture {'score_samples': [True, False]}
GaussianNB {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
GaussianProcessClassifier {'optim': [None, 'cdist'], 'nocl': [False, True], 'output_class_labels': [False, True], 'zipmap': [False, True]}
GaussianProcessRegressor {'return_cov': [False, True], 'return_std': [False, True], 'optim': [None, 'cdist']}
GradientBoostingClassifier {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'nocl': [True, False]}
HistGradientBoostingClassifier {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'nocl': [True, False]}
HistGradientBoostingRegressor {'raw_scores': [True, False]}
IsolationForest {'score_samples': [True, False]}
KMeans {'gemm': [True, False]}
KNNImputer {'optim': [None, 'cdist']}
KNeighborsClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'optim': [None, 'cdist']}
KNeighborsRegressor {'optim': [None, 'cdist']}
KNeighborsTransformer {'optim': [None, 'cdist']}
KernelPCA {'optim': [None, 'cdist']}
LinearClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
LinearSVC {'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
LocalOutlierFactor {'score_samples': [True, False], 'optim': [None, 'cdist']}
MLPClassifier {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
MaxAbsScaler {'div': ['std', 'div', 'div_cast']}
MiniBatchKMeans {'gemm': [True, False]}
MultiOutputClassifier {'nocl': [False, True], 'output_class_labels': [False, True], 'zipmap': [False, True]}
MultinomialNB {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
NearestNeighbors {'optim': [None, 'cdist']}
OneVsOneClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True]}
OneVsRestClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
Pipeline {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
QuadraticDiscriminantAnalysis {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True]}
RadiusNeighborsClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'raw_scores': [True, False], 'output_class_labels': [False, True], 'optim': [None, 'cdist']}
RadiusNeighborsRegressor {'optim': [None, 'cdist']}
RandomForestClassifier {'zipmap': [True, False, 'columns'], 'raw_scores': [True, False], 'nocl': [True, False], 'output_class_labels': [False, True], 'decision_path': [True, False], 'decision_leaf': [True, False]}
RandomForestRegressor {'decision_path': [True, False], 'decision_leaf': [True, False]}
RobustScaler {'div': ['std', 'div', 'div_cast']}
SGDClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
SVC {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
Scaler {'div': ['std', 'div', 'div_cast']}
StackingClassifier {'zipmap': [True, False, 'columns'], 'nocl': [True, False], 'output_class_labels': [False, True], 'raw_scores': [True, False]}
TfidfTransformer {'nan': [True, False]}
TfidfVectorizer {'tokenexp': None, 'separators': None, 'nan': [True, False], 'keep_empty_string': [True, False], 'locale': None}
TunedThresholdClassifierCV {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
VotingClassifier {'zipmap': [True, False, 'columns'], 'output_class_labels': [False, True], 'nocl': [True, False]}
_ConstantPredictor {'zipmap': [True, False, 'columns'], 'nocl': [True, False]}
XGBoostXGBClassifier {'nocl': [True, False], 'zipmap': [True, False, 'columns']}
all options: ['decision_leaf',
'decision_path',
'div',
'gemm',
'keep_empty_string',
'locale',
'nan',
'nocl',
'optim',
'output_class_labels',
'raw_scores',
'return_cov',
'return_std',
'score_samples',
'separator',
'separators',
'tokenexp',
'zipmap']
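The listing shows that zipmap also accepts the value 'columns', which splits the probabilities into one output per class instead of a single tensor; a sketch (the output names depend on the class labels):

# Hypothetical: zipmap='columns' creates one probability output per class.
model_def = to_onnx(
    clr, X_train.astype(numpy.float32), options={id(clr): {"zipmap": "columns"}}
)
sess = InferenceSession(
    model_def.SerializeToString(), providers=["CPUExecutionProvider"]
)
print([o.name for o in sess.get_outputs()])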
Total running time of the script: (0 minutes 0.145 seconds)