# Unicode strings

> Original: [https://tensorflow.google.cn/tutorials/load_data/unicode](https://tensorflow.google.cn/tutorials/load_data/unicode)

## Introduction

Models that process natural language often handle different languages with different character sets. *Unicode* is a standard encoding system that is used to represent characters from almost all languages. Each character is encoded using a unique integer [code point](https://en.wikipedia.org/wiki/Code_point) between `0` and `0x10FFFF`. A *Unicode string* is a sequence of zero or more code points.

This tutorial shows how to represent Unicode strings in TensorFlow and how to manipulate them using Unicode equivalents of standard string ops. It also splits Unicode strings into tokens based on script detection.

```py
import tensorflow as tf
```

## The [`tf.string`](https://tensorflow.google.cn/api_docs/python/tf#string) data type

The basic TensorFlow [`tf.string`](https://tensorflow.google.cn/api_docs/python/tf#string) `dtype` lets you build tensors of byte strings. Unicode strings are UTF-8 encoded by default.

```py
tf.constant(u"Thanks 😊")
```

A [`tf.string`](https://tensorflow.google.cn/api_docs/python/tf#string) tensor can hold byte strings of varying lengths because the byte strings are treated as atomic units. The string length is not included in the tensor dimensions.

```py
tf.constant([u"You're", u"welcome!"]).shape
```

```
TensorShape([2])
```

Note: When you construct strings in Python, Unicode is handled differently in Python 2 and Python 3. In Python 2, Unicode strings are marked with the "u" prefix, as above. In Python 3, strings are Unicode-encoded by default.

## Representing Unicode

There are two standard ways to represent a Unicode string in TensorFlow:

* `string` scalar - the sequence of code points is encoded using a known [character encoding](https://en.wikipedia.org/wiki/Character_encoding).
* `int32` vector - each position contains a single code point.

For example, the following three values all represent the Unicode string `"语言处理"` (which means "language processing" in Chinese):

```py
# Unicode string, represented as a UTF-8 encoded string scalar.
text_utf8 = tf.constant(u"语言处理")
text_utf8
```

```py
# Unicode string, represented as a UTF-16-BE encoded string scalar.
text_utf16be = tf.constant(u"语言处理".encode("UTF-16-BE"))
text_utf16be
```

```py
# Unicode string, represented as a vector of Unicode code points.
text_chars = tf.constant([ord(char) for char in u"语言处理"])
text_chars
```

### Converting between representations

TensorFlow provides operations to convert between these representations:

* [`tf.strings.unicode_decode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode): converts an encoded string scalar to a vector of code points.
* [`tf.strings.unicode_encode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_encode): converts a vector of code points to an encoded string scalar.
* [`tf.strings.unicode_transcode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_transcode): converts an encoded string scalar to a different encoding.

```py
tf.strings.unicode_decode(text_utf8, input_encoding='UTF-8')
```

```py
tf.strings.unicode_encode(text_chars, output_encoding='UTF-8')
```

```py
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF-8',
                             output_encoding='UTF-16-BE')
```

### Batch dimensions

When decoding multiple strings, the number of characters in each string may differ. The result is a [`tf.RaggedTensor`](https://tensorflow.google.cn/guide/ragged_tensor), where the length of the innermost dimension varies with the number of characters in each string:

```py
# A batch of Unicode strings, each represented as a UTF8-encoded string.
batch_utf8 = [s.encode('UTF-8') for s in
              [u'hÃllo', u'What is the weather tomorrow', u'Göödnight', u'😊']]
batch_chars_ragged = tf.strings.unicode_decode(batch_utf8,
                                               input_encoding='UTF-8')
for sentence_chars in batch_chars_ragged.to_list():
  print(sentence_chars)
```

```
[104, 195, 108, 108, 111]
[87, 104, 97, 116, 32, 105, 115, 32, 116, 104, 101, 32, 119, 101, 97, 116, 104, 101, 114, 32, 116, 111, 109, 111, 114, 114, 111, 119]
[71, 246, 246, 100, 110, 105, 103, 104, 116]
[128522]
```

You can use this [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) directly, or convert it to a dense [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) with padding or to a [`tf.SparseTensor`](https://tensorflow.google.cn/api_docs/python/tf/sparse/SparseTensor) using the methods [`tf.RaggedTensor.to_tensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor#to_tensor) and [`tf.RaggedTensor.to_sparse`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor#to_sparse).

```py
batch_chars_padded = batch_chars_ragged.to_tensor(default_value=-1)
print(batch_chars_padded.numpy())
```

```
[[ 104 195 108 108 111 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
 [ 87 104 97 116 32 105 115 32 116 104 101 32 119 101 97 116 104 101 114 32 116 111 109 111 114 114 111 119]
 [ 71 246 246 100 110 105 103 104 116 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]
 [128522 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]]
```

```py
batch_chars_sparse = batch_chars_ragged.to_sparse()
```

When encoding multiple strings that all have the same length, you can use a [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) as input:

```py
tf.strings.unicode_encode([[99, 97, 116], [100, 111, 103], [99, 111, 119]],
                          output_encoding='UTF-8')
```

When encoding multiple strings with varying lengths, use a [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) as input:

```py
tf.strings.unicode_encode(batch_chars_ragged, output_encoding='UTF-8')
```

If you have a tensor with multiple strings in padded or sparse format, convert it to a [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) before calling `unicode_encode`:

```py
tf.strings.unicode_encode(
    tf.RaggedTensor.from_sparse(batch_chars_sparse),
    output_encoding='UTF-8')
```

```py
tf.strings.unicode_encode(
    tf.RaggedTensor.from_tensor(batch_chars_padded, padding=-1),
    output_encoding='UTF-8')
```

## Unicode operations

### Character length

The [`tf.strings.length`](https://tensorflow.google.cn/api_docs/python/tf/strings/length) operation has a `unit` parameter that indicates how lengths should be computed. `unit` defaults to `"BYTE"`, but it can be set to other values, such as `"UTF8_CHAR"` or `"UTF16_CHAR"`, to determine the number of Unicode code points in each encoded `string`.

```py
# Note that the final character takes up 4 bytes in UTF8.
thanks = u'Thanks 😊'.encode('UTF-8')
num_bytes = tf.strings.length(thanks).numpy()
num_chars = tf.strings.length(thanks, unit='UTF8_CHAR').numpy()
print('{} bytes; {} UTF-8 characters'.format(num_bytes, num_chars))
```

```
11 bytes; 8 UTF-8 characters
```

### Character substrings

Similarly, the [`tf.strings.substr`](https://tensorflow.google.cn/api_docs/python/tf/strings/substr) operation accepts the `unit` parameter and uses it to determine what kind of offsets the `pos` and `len` parameters contain.

```py
# default: unit='BYTE'. With len=1, we return a single byte.
tf.strings.substr(thanks, pos=7, len=1).numpy()
```

```
b'\xf0'
```

```py
# Specifying unit='UTF8_CHAR', we return a single character, which in this case
# is 4 bytes.
print(tf.strings.substr(thanks, pos=7, len=1, unit='UTF8_CHAR').numpy())
```

```
b'\xf0\x9f\x98\x8a'
```

### Splitting Unicode strings

The [`tf.strings.unicode_split`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_split) operation splits Unicode strings into substrings of individual characters:

```py
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
```

```
array([b'T', b'h', b'a', b'n', b'k', b's', b' ', b'\xf0\x9f\x98\x8a'],
      dtype=object)
```

### Byte offsets for characters

To align the character tensor generated by [`tf.strings.unicode_decode`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode) with the original string, it is useful to know the offset at which each character begins. The method [`tf.strings.unicode_decode_with_offsets`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_decode_with_offsets) is similar to `unicode_decode`, except that it returns a second tensor containing the start offset of each character.

```py
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
  print("At byte offset {}: codepoint {}".format(offset, codepoint))
```

```
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
```

## Unicode scripts

Each Unicode code point belongs to a single collection of code points known as a [script](https://en.wikipedia.org/wiki/Script_%28Unicode%29). A character's script helps determine which language the character might belong to. For example, knowing that 'Б' is in the Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.

TensorFlow provides the [`tf.strings.unicode_script`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_script) operation to determine which script a given code point uses. The script codes are `int32` values corresponding to [International Components for Unicode](http://site.icu-project.org/home) (ICU) [`UScriptCode`](http://icu-project.org/apiref/icu4c/uscript_8h.html) values.

```py
uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]
```

```
[17  8]
```

The [`tf.strings.unicode_script`](https://tensorflow.google.cn/api_docs/python/tf/strings/unicode_script) operation can also be applied to a multidimensional [`tf.Tensor`](https://tensorflow.google.cn/api_docs/python/tf/Tensor) or [`tf.RaggedTensor`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor) of code points:

```py
print(tf.strings.unicode_script(batch_chars_ragged))
```

## Example: simple segmentation

Segmentation is the task of splitting text into word-like units. This is often easy when space characters separate words, but some languages (such as Chinese and Japanese) do not use spaces, and some languages (such as German) contain long compounds that must be split in order to analyze their meaning. In web text, different languages and scripts are frequently mixed together, as in "NY株価" ("NY stock prices").

We can perform a very rough segmentation, without implementing any ML model, by using changes in script to approximate word boundaries. This works for strings like the "NY株価" example above. It also works for most languages that use spaces, because the space characters of the various scripts are all classified as USCRIPT_COMMON, a special script code that differs from that of any actual text.

```py
# dtype: string; shape: [num_sentences]
#
# The sentences to process. Edit this line to try out different inputs!
sentence_texts = [u'Hello, world.', u'世界こんにちは']
```

First, we decode the sentences into character code points and look up the script identifier for each character.

```py
# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_codepoint[i, j] is the codepoint for the j'th character in
# the i'th sentence.
sentence_char_codepoint = tf.strings.unicode_decode(sentence_texts, 'UTF-8')
print(sentence_char_codepoint)

# dtype: int32; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_script[i, j] is the unicode script of the j'th character in
# the i'th sentence.
sentence_char_script = tf.strings.unicode_script(sentence_char_codepoint)
print(sentence_char_script)
```

Next, we use those script identifiers to decide where word boundaries should be added. We add a word boundary at the beginning of each sentence, and at every character whose script differs from that of the previous character.

```py
# dtype: bool; shape: [num_sentences, (num_chars_per_sentence)]
#
# sentence_char_starts_word[i, j] is True if the j'th character in the i'th
# sentence is the start of a word.
sentence_char_starts_word = tf.concat(
    [tf.fill([sentence_char_script.nrows(), 1], True),
     tf.not_equal(sentence_char_script[:, 1:], sentence_char_script[:, :-1])],
    axis=1)

# dtype: int64; shape: [num_words]
#
# word_starts[i] is the index of the character that starts the i'th word (in
# the flattened list of characters from all sentences).
word_starts = tf.squeeze(tf.where(sentence_char_starts_word.values), axis=1)
print(word_starts)
```

```
tf.Tensor([ 0  5  7 12 13 15], shape=(6,), dtype=int64)
```

We can then use those start offsets to build a `RaggedTensor` containing the list of words from all batches:

```py
# dtype: int32; shape: [num_words, (num_chars_per_word)]
#
# word_char_codepoint[i, j] is the codepoint for the j'th character in the
# i'th word.
word_char_codepoint = tf.RaggedTensor.from_row_starts(
    values=sentence_char_codepoint.values,
    row_starts=word_starts)
print(word_char_codepoint)
```

Finally, we can segment the word code points `RaggedTensor` back into sentences:

```py
# dtype: int64; shape: [num_sentences]
#
# sentence_num_words[i] is the number of words in the i'th sentence.
sentence_num_words = tf.reduce_sum(
    tf.cast(sentence_char_starts_word, tf.int64),
    axis=1)

# dtype: int32; shape: [num_sentences, (num_words_per_sentence), (num_chars_per_word)]
#
# sentence_word_char_codepoint[i, j, k] is the codepoint for the k'th character
# in the j'th word in the i'th sentence.
sentence_word_char_codepoint = tf.RaggedTensor.from_row_lengths(
    values=word_char_codepoint,
    row_lengths=sentence_num_words)
print(sentence_word_char_codepoint)
```

To make the final result easier to read, we can encode it back into UTF-8 strings:

```py
tf.strings.unicode_encode(sentence_word_char_codepoint, 'UTF-8').to_list()
```

```
[[b'Hello', b', ', b'world', b'.'],
 [b'\xe4\xb8\x96\xe7\x95\x8c',
  b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf']]
```
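
As a cross-check on the boundary rule used in this example, the same idea can be sketched in plain Python without TensorFlow. The helper `toy_script` is hypothetical and only a rough stand-in for `tf.strings.unicode_script`: it knows a handful of code point ranges and is not the ICU script table. `split_on_script_change` is likewise an illustrative name, not part of any library.

```py
from itertools import groupby


def toy_script(char):
    """Hypothetical, very rough stand-in for tf.strings.unicode_script.

    It only covers a few code point ranges and is NOT the ICU script table.
    """
    cp = ord(char)
    if 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:
        return "LATIN"      # ASCII letters only
    if 0x3040 <= cp <= 0x309F:
        return "HIRAGANA"   # Hiragana block
    if 0x4E00 <= cp <= 0x9FFF:
        return "HAN"        # CJK Unified Ideographs block only
    return "COMMON"         # everything else, including spaces and punctuation


def split_on_script_change(text):
    """Start a new word wherever the script differs from the previous character."""
    # groupby collects consecutive characters that share the same script label.
    return ["".join(group) for _, group in groupby(text, key=toy_script)]


print(split_on_script_change("Hello, world."))  # ['Hello', ', ', 'world', '.']
print(split_on_script_change("世界こんにちは"))    # ['世界', 'こんにちは']
```

For these two sentences the sketch yields the same word lists as the TensorFlow pipeline. Unlike the pipeline, it handles one string at a time and does not track batch structure.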