Tricky issue when converting CountVectorizer or TfidfVectorizer

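This issue is described in scikit-learn/issues/13733. If a CountVectorizer or TfidfVectorizer produces a token that contains a space, skl2onnx cannot tell whether a vocabulary entry is a bi-gram or a unigram with a space.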
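The ambiguity can be seen without any vectorizer. scikit-learn builds n-gram keys by joining tokens with a single space, so different token sequences can collapse to the same string. The snippet below is a minimal sketch of that collision; `join_ngram` is a hypothetical helper written for illustration, not part of scikit-learn or skl2onnx.

```python
def join_ngram(tokens, sep=" "):
    """Collapse a tuple of tokens into one string key (hypothetical helper)."""
    return sep.join(tokens)


unigram_with_space = ("new york",)  # one token that contains a space
bigram = ("new", "york")            # two separate tokens

key_a = join_ngram(unigram_with_space)
key_b = join_ngram(bigram)

# Both keys are the string "new york": the token boundary is lost.
print(key_a == key_b)
```

Once the keys are equal, nothing in the string says which of the two token sequences produced it.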

A simple example that cannot be converted

import pprint
import numpy
from numpy.testing import assert_almost_equal
from onnxruntime import InferenceSession
from sklearn.feature_extraction.text import TfidfVectorizer
from skl2onnx import to_onnx
from skl2onnx.sklapi import TraceableTfidfVectorizer
import skl2onnx.sklapi.register  # noqa: F401

corpus = numpy.array(
    [
        "This is the first document.",
        "This document is the second document.",
        "Is this the first document?",
        "",
    ]
).reshape((4,))

# The token pattern accepts spaces, so a single token may contain a space.
pattern = r"\b[a-z ]{1,10}\b"
mod1 = TfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod1.fit(corpus)

TfidfVectorizer(ngram_range=(1, 2), token_pattern='\\b[a-z ]{1,10}\\b')


Unigrams and bi-grams are stored in the following container, which maps each of them to its column index.

pprint.pprint(mod1.vocabulary_)

{'document': 0,
 'document ': 1,
 'document  is the ': 2,
 'is the ': 3,
 'is the  second ': 4,
 'is this ': 5,
 'is this  the first ': 6,
 'second ': 7,
 'second  document': 8,
 'the first ': 9,
 'the first  document': 10,
 'this ': 11,
 'this  document ': 12,
 'this is ': 13,
 'this is  the first ': 14}

Conversion.

try:
    to_onnx(mod1, corpus)
except RuntimeError as e:
    print(e)

There were ambiguities between n-grams and tokens. 2 errors occurred. You can fix it by using class TraceableTfidfVectorizer.
You can learn more at https://github.com/scikit-learn/scikit-learn/issues/13733.
Unable to split n-grams 'is this  the first ' into tokens ('is', 'this', 'the', 'first ') existing in the vocabulary. Token 'is' does not exist in the vocabulary..
Unable to split n-grams 'this is  the first ' into tokens ('this', 'is', 'the', 'first ') existing in the vocabulary. Token 'this' does not exist in the vocabulary..
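The error comes from trying to recover token boundaries from a joined string. As a rough illustration only (this is not the algorithm skl2onnx uses), the hypothetical helper `candidate_splits` below enumerates every way to write a key as known unigrams joined by a single space. The `unigrams` set is copied from the vocabulary printed above; the toy vocabulary `{"a", "b", "a b"}` is an assumption added to show a key with two valid decompositions.

```python
def candidate_splits(ngram, unigrams, sep=" "):
    """Return every way to write `ngram` as known unigrams joined by `sep`."""
    results = [(ngram,)] if ngram in unigrams else []
    for i in range(1, len(ngram)):
        head, rest = ngram[:i], ngram[i:]
        # A valid split needs a known unigram followed by the separator.
        if head in unigrams and rest.startswith(sep):
            for tail in candidate_splits(rest[len(sep):], unigrams, sep):
                results.append((head,) + tail)
    return results


# Unigrams taken from the vocabulary printed earlier.
unigrams = {
    "document", "document ", "is the ", "is this ",
    "second ", "the first ", "this ", "this is ",
}

print(candidate_splits("is this  the first ", unigrams))

# Toy vocabulary (assumption): the same key has two decompositions.
print(candidate_splits("a b", {"a", "b", "a b"}))
```

Even an exhaustive search like this one cannot decide between several valid decompositions, which is why storing the tokens separately is the more reliable fix.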

TraceableTfidfVectorizer

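Class TraceableTfidfVectorizer is equivalent to sklearn.feature_extraction.text.TfidfVectorizer, but it stores the unigrams and bi-grams of the vocabulary as tuples instead of concatenating every piece into a single string.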

mod2 = TraceableTfidfVectorizer(ngram_range=(1, 2), token_pattern=pattern)
mod2.fit(corpus)

pprint.pprint(mod2.vocabulary_)

{('document',): 0,
 ('document ',): 1,
 ('document ', 'is the '): 2,
 ('is the ',): 3,
 ('is the ', 'second '): 4,
 ('is this ',): 5,
 ('is this ', 'the first '): 6,
 ('second ',): 7,
 ('second ', 'document'): 8,
 ('the first ',): 9,
 ('the first ', 'document'): 10,
 ('this ',): 11,
 ('this ', 'document '): 12,
 ('this is ',): 13,
 ('this is ', 'the first '): 14}
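To make the tuple keys concrete, here is a minimal sketch that tokenizes one sentence of the corpus with the same regular expression and builds tuple n-grams. It only mimics the idea; `tuple_ngrams` is a hypothetical helper, and the real TraceableTfidfVectorizer implementation may differ in its details.

```python
import re

PATTERN = re.compile(r"\b[a-z ]{1,10}\b")


def tuple_ngrams(text, ngram_range=(1, 2)):
    """Tokenize `text` and return its n-grams as tuples of tokens."""
    tokens = PATTERN.findall(text.lower())
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[i:i + n]))
    return grams


grams = tuple_ngrams("This is the first document.")

# Map each distinct n-gram to a column index, keeping token boundaries.
vocabulary = {gram: index for index, gram in enumerate(sorted(set(grams)))}
print(vocabulary)
```

Because every key is a tuple, the number of tokens in an entry is simply its length, and no splitting is needed at conversion time.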

Let's check that it produces the same results.

assert_almost_equal(mod1.transform(corpus).todense(), mod2.transform(corpus).todense())

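Conversion. The line import skl2onnx.sklapi.register was added to register the converters associated with these new classes. By default, only the converters for scikit-learn are declared.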

onx = to_onnx(mod2, corpus)
sess = InferenceSession(onx.SerializeToString(), providers=["CPUExecutionProvider"])
got = sess.run(None, {"X": corpus})

Let's check whether there are discrepancies...

assert_almost_equal(mod2.transform(corpus).todense(), got[0])
