# Fine-tuning a BERT model

> Original: [https://tensorflow.google.cn/official_models/fine_tuning_bert](https://tensorflow.google.cn/official_models/fine_tuning_bert)

In this example, we will work through fine-tuning a BERT model using the tensorflow-models PIP package.

The pretrained BERT model this tutorial is based on is also available on [TensorFlow Hub](https://tensorflow.org/hub). To see how to use it, refer to the [Hub Appendix](#hub_bert).

## Setup

### Install the TensorFlow Model Garden pip package

* `tf-models-official` is the stable Model Garden package. Note that it may not include the latest changes in the `tensorflow_models` GitHub repo. To include the latest changes, you may install `tf-models-nightly`, which is the nightly Model Garden package created daily automatically.
* pip will install all models and dependencies automatically.

```py
pip install -q tf-models-official==2.3.0
```

### Imports

```py
import os

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf

import tensorflow_hub as hub
import tensorflow_datasets as tfds
tfds.disable_progress_bar()

from official.modeling import tf_utils
from official import nlp
from official.nlp import bert

# Load the required submodules
import official.nlp.optimization
import official.nlp.bert.bert_models
import official.nlp.bert.configs
import official.nlp.bert.run_classifier
import official.nlp.bert.tokenization
import official.nlp.data.classifier_data_lib
import official.nlp.modeling.losses
import official.nlp.modeling.models
import official.nlp.modeling.networks
```

### Resources

This directory contains the configuration, vocabulary, and a pre-trained checkpoint used in this tutorial:

```py
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"
tf.io.gfile.listdir(gs_folder_bert)
```

```py
['bert_config.json',
 'bert_model.ckpt.data-00000-of-00001',
 'bert_model.ckpt.index',
 'vocab.txt']
```

You can get a pre-trained BERT encoder from [TensorFlow Hub](https://hub.tensorflow.google.cn/tensorflow/bert_en_uncased_L-12_H-768_A-12/2):

```py
hub_url_bert = "https://hub.tensorflow.google.cn/tensorflow/bert_en_uncased_L-12_H-768_A-12/2"
```

## The data

For this example we used the [GLUE MRPC dataset from TFDS](https://tensorflow.google.cn/datasets/catalog/glue#gluemrpc).

This dataset is not set up so that it can be directly fed into the BERT model, so this section also handles the necessary preprocessing.

### Get the dataset from TensorFlow Datasets

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

* Number of labels: 2.
* Size of training dataset: 3668.
* Size of evaluation dataset: 408.
* Maximum sequence length of training and evaluation dataset: 128.
```py
glue, info = tfds.load('glue/mrpc', with_info=True,
                       # It's small, load the whole dataset
                       batch_size=-1)
```

```py
Downloading and preparing dataset glue/mrpc/1.0.0 (download: 1.43 MiB, generated: Unknown size, total: 1.43 MiB) to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0...
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-train.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-validation.tfrecord
Shuffling and writing examples to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0.incompleteKZIBN9/glue-test.tfrecord
Dataset glue downloaded and prepared to /home/kbuilder/tensorflow_datasets/glue/mrpc/1.0.0. Subsequent calls will reuse this data.
```

```py
list(glue.keys())
```

```py
['test', 'train', 'validation']
```

The `info` object describes the dataset and its features:

```py
info.features
```

```py
FeaturesDict({
    'idx': tf.int32,
    'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
    'sentence1': Text(shape=(), dtype=tf.string),
    'sentence2': Text(shape=(), dtype=tf.string),
})
```

The two classes are:

```py
info.features['label'].names
```

```py
['not_equivalent', 'equivalent']
```

Here is one example from the training set:

```py
glue_train = glue['train']

for key, value in glue_train.items():
  print(f"{key:9s}: {value[0].numpy()}")
```

```py
idx      : 1680
label    : 0
sentence1: b'The identical rovers will act as robotic geologists , searching for evidence of past water .'
sentence2: b'The rovers act as robotic geologists , moving on six wheels .'
```

### The BERT tokenizer

To fine-tune a pre-trained model, you need to be sure that you're using exactly the same tokenization, vocabulary, and index mapping as were used during pre-training.

The BERT tokenizer used in this tutorial is written in pure Python (it's not built out of TensorFlow ops).
So you can't just plug it into your model as a `keras.layer` like you can with [`preprocessing.TextVectorization`](https://tensorflow.google.cn/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization).

The following code rebuilds the tokenizer that was used by the base model:

```py
# Set up tokenizer to generate Tensorflow dataset
tokenizer = bert.tokenization.FullTokenizer(
    vocab_file=os.path.join(gs_folder_bert, "vocab.txt"),
    do_lower_case=True)

print("Vocab size:", len(tokenizer.vocab))
```

```py
Vocab size: 30522
```

Tokenize a sentence:

```py
tokens = tokenizer.tokenize("Hello TensorFlow!")
print(tokens)
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
```

```py
['hello', 'tensor', '##flow', '!']
[7592, 23435, 12314, 999]
```

### Preprocess the data

This section manually preprocesses the dataset into the format expected by the model.

This dataset is small, so preprocessing can be done quickly and easily in memory. For larger datasets, the `tf_models` library includes some tools for preprocessing and re-serializing a dataset. See [Appendix: Re-encoding a large dataset](#re_encoding_tools) for details.

#### Encode the sentences

The model expects its two input sentences to be concatenated together.
This input is expected to start with a `[CLS]` "This is a classification problem" token, and each sentence should end with a `[SEP]` "Separator" token:

```py
tokenizer.convert_tokens_to_ids(['[CLS]', '[SEP]'])
```

```py
[101, 102]
```

Start by encoding all the sentences while appending a `[SEP]` token, and packing them into ragged tensors:

```py
def encode_sentence(s):
  tokens = list(tokenizer.tokenize(s.numpy()))
  tokens.append('[SEP]')
  return tokenizer.convert_tokens_to_ids(tokens)

sentence1 = tf.ragged.constant([
    encode_sentence(s) for s in glue_train["sentence1"]])
sentence2 = tf.ragged.constant([
    encode_sentence(s) for s in glue_train["sentence2"]])
```

```py
print("Sentence1 shape:", sentence1.shape.as_list())
print("Sentence2 shape:", sentence2.shape.as_list())
```

```py
Sentence1 shape: [3668, None]
Sentence2 shape: [3668, None]
```

Now prepend a `[CLS]` token, and concatenate the ragged tensors to form a single `input_word_ids` tensor for each example. [`RaggedTensor.to_tensor()`](https://tensorflow.google.cn/api_docs/python/tf/RaggedTensor#to_tensor) zero pads to the longest sequence.

```py
cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)
_ = plt.pcolormesh(input_word_ids.to_tensor())
```

![png](img/10d71bce93ec45ba7076ef15a37bcb28.png)

#### Mask and input type

The model expects two additional inputs:

* The input mask
* The input type

The mask allows the model to cleanly differentiate between the content and the padding. The mask has the same shape as the `input_word_ids`, and contains a `1` anywhere the `input_word_ids` is not padding.

```py
input_mask = tf.ones_like(input_word_ids).to_tensor()

plt.pcolormesh(input_mask)
```

![png](img/1f9a0765029471b20952ac80887f73a4.png)

The "input type" also has the same shape, but inside the non-padded region it contains a `0` or a `1` indicating which sentence the token is a part of.
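To make the layout of the three inputs concrete, here is a minimal framework-free sketch that packs one sentence pair by hand. The helper name `pack_pair` and its `max_len` parameter are hypothetical and not part of the tutorial's libraries. The ids `101` and `102` are the `[CLS]` and `[SEP]` ids printed earlier, the token ids come from the `"Hello TensorFlow!"` example, and padding with `0` mirrors the zero padding of `to_tensor()`.

```python
def pack_pair(ids_a, ids_b, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Pack two token-id lists into BERT-style inputs of length `max_len`."""
    word_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    mask = [1] * len(word_ids)

    padding = max_len - len(word_ids)
    if padding < 0:
        raise ValueError("max_len is shorter than the packed sequence")
    return {
        'input_word_ids': word_ids + [pad_id] * padding,
        'input_mask': mask + [0] * padding,
        'input_type_ids': type_ids + [0] * padding,
    }


# 'hello tensor ##flow !' paired with 'hello !', padded to 12 positions.
packed = pack_pair([7592, 23435, 12314, 999], [7592, 999], max_len=12)
for key, value in packed.items():
    print(f"{key:15s}: {value}")
```

The tutorial does the same thing for the whole dataset at once with ragged tensors. The type ids are built as follows: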
```py
type_cls = tf.zeros_like(cls)
type_s1 = tf.zeros_like(sentence1)
type_s2 = tf.ones_like(sentence2)
input_type_ids = tf.concat([type_cls, type_s1, type_s2], axis=-1).to_tensor()

plt.pcolormesh(input_type_ids)
```

![png](img/e06760b4112e8fd989cdb1f7a948bc17.png)

#### Put it all together

Collect the above text parsing code into a single function, and apply it to each split of the `glue/mrpc` dataset.

```py
def encode_sentence(s, tokenizer):
  tokens = list(tokenizer.tokenize(s))
  tokens.append('[SEP]')
  return tokenizer.convert_tokens_to_ids(tokens)

def bert_encode(glue_dict, tokenizer):
  num_examples = len(glue_dict["sentence1"])

  # Encode each sentence (with a trailing [SEP]) into a ragged tensor of ids.
  sentence1 = tf.ragged.constant([
      encode_sentence(s, tokenizer)
      for s in np.array(glue_dict["sentence1"])])
  sentence2 = tf.ragged.constant([
      encode_sentence(s, tokenizer)
      for s in np.array(glue_dict["sentence2"])])

  # Prepend [CLS] to every example and join the two sentences.
  cls = [tokenizer.convert_tokens_to_ids(['[CLS]'])]*sentence1.shape[0]
  input_word_ids = tf.concat([cls, sentence1, sentence2], axis=-1)

  # 1 for real tokens, 0 for padding.
  input_mask = tf.ones_like(input_word_ids).to_tensor()

  # 0 for [CLS] and the first sentence, 1 for the second sentence.
  type_cls = tf.zeros_like(cls)
  type_s1 = tf.zeros_like(sentence1)
  type_s2 = tf.ones_like(sentence2)
  input_type_ids = tf.concat(
      [type_cls, type_s1, type_s2], axis=-1).to_tensor()

  inputs = {
      'input_word_ids': input_word_ids.to_tensor(),
      'input_mask': input_mask,
      'input_type_ids': input_type_ids}

  return inputs
```

```py
glue_train = bert_encode(glue['train'], tokenizer)
glue_train_labels = glue['train']['label']

glue_validation = bert_encode(glue['validation'], tokenizer)
glue_validation_labels = glue['validation']['label']

glue_test = bert_encode(glue['test'], tokenizer)
glue_test_labels = glue['test']['label']
```

Each subset of the data has been converted to a dictionary of features, and a set of labels.
Each feature in the input dictionary has the same shape, and the number of labels should match:

```py
for key, value in glue_train.items():
  print(f'{key:15s} shape: {value.shape}')

print(f'glue_train_labels shape: {glue_train_labels.shape}')
```

```py
input_word_ids  shape: (3668, 103)
input_mask      shape: (3668, 103)
input_type_ids  shape: (3668, 103)
glue_train_labels shape: (3668,)
```

## The model

### Build the model

The first step is to download the configuration for the pre-trained model.

```py
import json

bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())

bert_config = bert.configs.BertConfig.from_dict(config_dict)

config_dict
```

```py
{'attention_probs_dropout_prob': 0.1,
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'max_position_embeddings': 512,
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'type_vocab_size': 2,
 'vocab_size': 30522}
```

The `config` defines the core BERT model, a Keras model that predicts `num_classes` outputs from inputs with maximum sequence length `max_seq_length`.

The `classifier_model` function returns both the classifier and the encoder.

```py
bert_classifier, bert_encoder = bert.bert_models.classifier_model(
    bert_config, num_labels=2)
```

The classifier has three inputs and one output:

```py
tf.keras.utils.plot_model(bert_classifier, show_shapes=True, dpi=48)
```

![png](img/906a04e5434908ec33033e39f2e83f6b.png)

Run it on a test batch of 10 examples from the training set.
The output is the logits for the two classes:

```py
glue_batch = {key: val[:10] for key, val in glue_train.items()}

bert_classifier(
    glue_batch, training=True
).numpy()
```

```py
array([[ 0.08382261,  0.34465584],
       [ 0.02057236,  0.24053624],
       [ 0.04930754,  0.1117427 ],
       [ 0.17041089,  0.20810834],
       [ 0.21667874,  0.2840511 ],
       [ 0.02325345,  0.33799925],
       [-0.06198866,  0.13532838],
       [ 0.084592  ,  0.20711854],
       [-0.04323687,  0.17096342],
       [ 0.23759182,  0.16801538]], dtype=float32)
```

The `TransformerEncoder` in the center of the classifier above **is** the `bert_encoder`.

Inspecting the encoder, we see its stack of `Transformer` layers connected to those same three inputs:

```py
tf.keras.utils.plot_model(bert_encoder, show_shapes=True, dpi=48)
```

![png](img/6d5e829de3a867f7bb56dff003b7e217.png)

### Restore the encoder weights

When built, the encoder is randomly initialized. Restore the encoder's weights from the checkpoint:

```py
checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()
```

**Note:** The pretrained `TransformerEncoder` is also available on [TensorFlow Hub](https://tensorflow.org/hub). See the [Hub appendix](#hub_bert) for details.

### Set up the optimizer

BERT adopts the Adam optimizer with weight decay (aka "[AdamW](https://arxiv.org/abs/1711.05101)"). It also employs a learning rate schedule that first warms up from 0 and then decays to 0.
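As a rough illustration of that warm-up-then-decay shape, the sketch below computes a learning rate per step in plain Python. It assumes a linear ramp followed by a linear decay. The function name `linear_warmup_then_decay` is hypothetical, and the library's own schedule may differ in detail, so treat this as a simplified model rather than the actual implementation. The step counts use the training set size of 3668 given earlier.

```python
def linear_warmup_then_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp from 0, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Step arithmetic for 3 epochs over 3668 examples with batches of 32.
epochs = 3
batch_size = 32
train_data_size = 3668
num_train_steps = int(train_data_size / batch_size) * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

for step in (0, warmup_steps // 2, warmup_steps, num_train_steps):
    lr = linear_warmup_then_decay(step, 2e-5, warmup_steps, num_train_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

The tutorial itself sets up the step counts and builds the optimizer with the library helper: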
```py
# Set up epochs and steps
epochs = 3
batch_size = 32
eval_batch_size = 32

train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
warmup_steps = int(epochs * train_data_size * 0.1 / batch_size)

# creates an optimizer with learning rate schedule
optimizer = nlp.optimization.create_optimizer(
    2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
```

This returns an `AdamWeightDecay` optimizer with the learning rate schedule set:

```py
type(optimizer)
```

```py
official.nlp.optimization.AdamWeightDecay
```

To see an example of how to customize the optimizer and its schedule, see the [Optimizer schedule appendix](#optiizer_schedule).

### Train the model

The metric is accuracy and we use sparse categorical cross-entropy as loss.

```py
metrics = [tf.keras.metrics.SparseCategoricalAccuracy('accuracy', dtype=tf.float32)]
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

bert_classifier.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics)

bert_classifier.fit(
      glue_train, glue_train_labels,
      validation_data=(glue_validation, glue_validation_labels),
      batch_size=32,
      epochs=epochs)
```

```py
Epoch 1/3
115/115 [==============================] - 26s 222ms/step - loss: 0.6151 - accuracy: 0.6611 - val_loss: 0.5462 - val_accuracy: 0.7451
Epoch 2/3
115/115 [==============================] - 24s 212ms/step - loss: 0.4447 - accuracy: 0.8010 - val_loss: 0.4150 - val_accuracy: 0.8309
Epoch 3/3
115/115 [==============================] - 24s 213ms/step - loss: 0.2830 - accuracy: 0.8964 - val_loss: 0.3697 - val_accuracy: 0.8480
```

Now run the fine-tuned model on a custom example to see that it works.

Start by encoding some sentence pairs:

```py
my_examples = bert_encode(
    glue_dict = {
        'sentence1':[
            'The rain in Spain falls mainly on the plain.',
            'Look I fine tuned BERT.'],
        'sentence2':[
            'It mostly rains on the flat lands of Spain.',
            'Is it working? This does not match.']
    },
    tokenizer=tokenizer)
```

The model should report class `1` "match" for the first example and class `0` "no-match" for the second:

```py
result = bert_classifier(my_examples, training=False)

# Take the argmax over the class axis, giving one predicted label per example.
result = tf.argmax(result, axis=-1).numpy()
result
```

```py
array([1, 0])
```

```py
np.array(info.features['label'].names)[result]
```

```py
array(['equivalent', 'not_equivalent'], dtype='<U14')
```

```py
..., 'dropout_rate': 0.1, 'initializer': ..., 'max_sequence_length': 512, 'num_layers': 12}
```

```py
manual_encoder = nlp.modeling.networks.TransformerEncoder(**transformer_config)
```

Restore the weights:

```py
checkpoint = tf.train.Checkpoint(model=manual_encoder)
checkpoint.restore(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()
```

Test run it:

```py
result = manual_encoder(my_examples, training=True)

print("Sequence output shape:", result[0].shape)
print("Pooled output shape:", result[1].shape)
```

```py
Sequence output shape: (2, 23, 768)
Pooled output shape: (2, 768)
```

Wrap it in a classifier:

```py
manual_classifier = nlp.modeling.models.BertClassifier(
        manual_encoder,
        num_classes=2,
        dropout_rate=transformer_config['dropout_rate'],
        initializer=tf.keras.initializers.TruncatedNormal(
          stddev=bert_config.initializer_range))
```

```py
manual_classifier(my_examples, training=True).numpy()
```

```py
array([[ 0.07863025, -0.02940944],
       [ 0.30274656,  0.27299827]], dtype=float32)
```

### Optimizers and schedules

The optimizer used to train the model was created using the `nlp.optimization.create_optimizer` function:

```py
optimizer = nlp.optimization.create_optimizer(
    2e-5, num_train_steps=num_train_steps, num_warmup_steps=warmup_steps)
```

That high-level wrapper sets up the learning rate schedules and the optimizer.
The base learning rate schedule used here is a linear decay to zero over the training run:

```py
epochs = 3
batch_size = 32
eval_batch_size = 32

train_data_size = len(glue_train_labels)
steps_per_epoch = int(train_data_size / batch_size)
num_train_steps = steps_per_epoch * epochs
```

```py
decay_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
      initial_learning_rate=2e-5,
      decay_steps=num_train_steps,
      end_learning_rate=0)

plt.plot([decay_schedule(n) for n in range(num_train_steps)])
```

![png](img/868f946086995ef931b7b454d904e14b.png)

This, in turn, is wrapped in a `WarmUp` schedule that linearly increases the learning rate to the target value over the first 10% of training:

```py
warmup_steps = num_train_steps * 0.1

warmup_schedule = nlp.optimization.WarmUp(
        initial_learning_rate=2e-5,
        decay_schedule_fn=decay_schedule,
        warmup_steps=warmup_steps)

# The warmup overshoots, because it warms up to the `initial_learning_rate`
# following the original implementation. You can set
# `initial_learning_rate=decay_schedule(warmup_steps)` if you don't like the
# overshoot.
plt.plot([warmup_schedule(n) for n in range(num_train_steps)])
```

![png](img/c542bc6784512a8abdc2e3a85a1e1905.png)

Then create the `nlp.optimization.AdamWeightDecay` using that schedule, configured for the BERT model:

```py
optimizer = nlp.optimization.AdamWeightDecay(
        learning_rate=warmup_schedule,
        weight_decay_rate=0.01,
        epsilon=1e-6,
        exclude_from_weight_decay=['LayerNorm', 'layer_norm', 'bias'])
```
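The `exclude_from_weight_decay` list determines which variables skip weight decay. The sketch below illustrates one plausible reading of it in plain Python, assuming each entry is searched for as a regular-expression pattern in the variable name. That matching rule is an assumption about the optimizer's behavior rather than its actual code, and both the helper `uses_weight_decay` and the variable names are hypothetical.

```python
import re


def uses_weight_decay(variable_name, exclude_patterns):
    """Return False if any exclusion pattern is found in the variable name."""
    return not any(re.search(p, variable_name) for p in exclude_patterns)


exclude = ['LayerNorm', 'layer_norm', 'bias']

# Hypothetical variable names, chosen only for illustration.
names = [
    'encoder/layer_0/attention/query/kernel',
    'encoder/layer_0/attention/query/bias',
    'encoder/layer_0/output_layer_norm/gamma',
]

decayed = [n for n in names if uses_weight_decay(n, exclude)]
print(decayed)
```

Under this assumption, only the kernel is decayed, while the bias and the layer-normalization parameter are left alone.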