This commit is contained in:
wizardforcel
2020-12-29 18:56:15 +08:00
parent 86a3892422
commit 6ae3ae8bb1
38 changed files with 64 additions and 64 deletions

@@ -48,7 +48,7 @@ colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
Pandas is a Python library with many helpful utilities for loading and working with structured data and can be used to download CSVs into a dataframe.
<aside class="note">**Note:** This dataset has been collected and analysed during a research collaboration of Worldline and the [Machine Learning Group](http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available [here](https://www.researchgate.net/project/Fraud-detection-5) and on the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.</aside>
**Note:** This dataset has been collected and analysed during a research collaboration of Worldline and the [Machine Learning Group](http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available [here](https://www.researchgate.net/project/Fraud-detection-5) and on the page of the [DefeatFraud](https://mlg.ulb.ac.be/wordpress/portfolio_page/defeatfraud-assessment-and-validation-of-deep-feature-engineering-and-learning-solutions-for-fraud-detection/) project.
```
file = tf.keras.utils
@@ -119,7 +119,7 @@ test_features = np.array(test_df)
Normalize the input features using the sklearn `StandardScaler`. This will set each feature's mean to 0 and standard deviation to 1.
<aside class="note">**Note:** The `StandardScaler` is only fit using the `train_features` to be sure the model is not peeking at the validation or test sets.</aside>
**Note:** The `StandardScaler` is only fit using the `train_features` to be sure the model is not peeking at the validation or test sets.
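A minimal sketch of this fit-on-train-only pattern (the array values below are illustrative placeholders, not rows from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the tutorial's train/validation feature arrays.
train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val_features = np.array([[1.5, 15.0]])

scaler = StandardScaler()
# Fit on the training data only, then reuse the fitted statistics everywhere.
train_scaled = scaler.fit_transform(train_features)
val_scaled = scaler.transform(val_features)

print(train_scaled.mean(axis=0))  # ~0 for each column
print(train_scaled.std(axis=0))   # ~1 for each column
```

Calling `transform` (not `fit_transform`) on the validation and test sets reuses the training-set mean and standard deviation, so no information leaks from those sets into preprocessing.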
```
scaler = StandardScaler()
@@ -151,7 +151,7 @@ Test features shape: (56962, 29)
```
<aside class="caution">**Caution:** If you want to deploy a model, it's critical that you preserve the preprocessing calculations. The easiest way is to implement them as layers and attach them to your model before export.</aside>
**Caution:** If you want to deploy a model, it's critical that you preserve the preprocessing calculations. The easiest way is to implement them as layers and attach them to your model before export.
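A framework-agnostic sketch of the idea: bundle the fitted preprocessing statistics with the model object so the scaling always travels with it (the class and names here are illustrative, not part of the tutorial's code):

```python
import numpy as np

class ScaledModel:
    """Wraps a predict function together with its preprocessing statistics,
    so scaling is applied inside the model rather than by the caller."""
    def __init__(self, mean, std, predict_fn):
        self.mean = mean
        self.std = std
        self.predict_fn = predict_fn

    def predict(self, raw_features):
        scaled = (np.asarray(raw_features) - self.mean) / self.std
        return self.predict_fn(scaled)

# Illustrative: mean/std "learned" on training data, trivial stand-in model.
model = ScaledModel(mean=np.array([2.0]), std=np.array([0.5]),
                    predict_fn=lambda x: x.sum(axis=1))
print(model.predict([[2.5]]))  # scaling happens inside the model: [1.]
```

In Keras, a preprocessing layer (such as the `Normalization` layer, whose exact module path varies by TF version) plays this role, and is exported as part of the saved model.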
### Look at the data distribution
@@ -234,7 +234,7 @@ Notice that there are a few metrics defined above that can be computed by the mo
* **Recall** is the percentage of **actual** positives that were correctly classified: $\frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$
* **AUC** refers to the Area Under the Curve of a Receiver Operating Characteristic curve (ROC-AUC). This metric is equal to the probability that a classifier will rank a random positive sample higher than a random negative sample.
<aside class="note">**Note:** Accuracy is not a helpful metric for this task. You can achieve 99.8%+ accuracy on this task by predicting False all the time.</aside>
**Note:** Accuracy is not a helpful metric for this task. You can achieve 99.8%+ accuracy on this task by predicting False all the time.
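The accuracy-vs-recall point is easy to verify directly (the 1000-sample label array below is hypothetical, chosen to mimic heavy imbalance):

```python
import numpy as np

# Hypothetical labels: 1000 samples, only 2 positives (heavily imbalanced).
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1
y_pred = np.zeros(1000, dtype=int)  # always predict "not fraud"

accuracy = (y_pred == y_true).mean()
tp = ((y_pred == 1) & (y_true == 1)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()
recall = tp / (tp + fn)

print(accuracy)  # 0.998 -- looks excellent
print(recall)    # 0.0   -- catches no fraud at all
```

This is why the tutorial tracks precision, recall, and AUC alongside accuracy.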
Read more:
@@ -249,7 +249,7 @@ Read more:
Now create and train your model using the function that was defined earlier. Notice that the model is fit using a larger-than-default batch size of 2048; this is important to ensure that each batch has a decent chance of containing a few positive samples. If the batch size were too small, many batches would likely have no fraudulent transactions to learn from.
<aside class="note">**Note:** This model will not handle the class imbalance well. You will improve it later in this tutorial.</aside>
**Note:** This model will not handle the class imbalance well. You will improve it later in this tutorial.
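The batch-size argument can be sanity-checked with a back-of-the-envelope calculation (the ~0.17% positive rate below is an approximate, illustrative figure for this dataset):

```python
# Probability that a random batch contains no positive samples, assuming
# independent draws: P(no positives) = (1 - p) ** batch_size.
p = 0.00172  # approximate fraction of fraudulent transactions

for batch_size in (32, 256, 2048):
    p_empty = (1 - p) ** batch_size
    print(batch_size, round(p_empty, 3))
```

With batches of 32, most batches contain no positives at all; at 2048, a batch with zero positives is rare, so nearly every gradient step sees some fraudulent examples.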
```
EPOCHS = 100
@@ -575,7 +575,7 @@ plot_metrics(baseline_history)
![png](img/f021b204e92d0e77d8439a03a43bb21e.png)
<aside class="note">**Note:** The validation curve generally performs better than the training curve. This is mainly because the dropout layer is not active when evaluating the model.</aside>
**Note:** The validation curve generally performs better than the training curve. This is mainly because the dropout layer is not active when evaluating the model.
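The training/evaluation asymmetry of dropout can be illustrated in a few lines (a hand-rolled sketch of inverted dropout, not Keras's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training):
    if not training:
        return x  # dropout is a no-op at evaluation time
    mask = rng.random(x.shape) >= rate
    return x * mask / (1 - rate)  # inverted-dropout scaling

x = np.ones(8)
print(dropout(x, 0.5, training=True))   # some units zeroed, rest scaled up
print(dropout(x, 0.5, training=False))  # unchanged
```

Since evaluation uses the full network while training randomly silences units, validation loss can look slightly better than training loss even without any data issue.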
### Evaluate metrics
@@ -698,7 +698,7 @@ Weight for class 1: 289.44
Now try re-training and evaluating the model with class weights to see how that affects the predictions.
<aside class="note">**Note:** Using `class_weights` changes the range of the loss. This may affect the stability of the training depending on the optimizer. Optimizers whose step size is dependent on the magnitude of the gradient, like [`optimizers.SGD`](https://tensorflow.google.cn/api_docs/python/tf/keras/optimizers/SGD), may fail. The optimizer used here, [`optimizers.Adam`](https://tensorflow.google.cn/api_docs/python/tf/keras/optimizers/Adam), is unaffected by the scaling change. Also note that because of the weighting, the total losses are not comparable between the two models.</aside>
**Note:** Using `class_weights` changes the range of the loss. This may affect the stability of the training depending on the optimizer. Optimizers whose step size is dependent on the magnitude of the gradient, like [`optimizers.SGD`](https://tensorflow.google.cn/api_docs/python/tf/keras/optimizers/SGD), may fail. The optimizer used here, [`optimizers.Adam`](https://tensorflow.google.cn/api_docs/python/tf/keras/optimizers/Adam), is unaffected by the scaling change. Also note that because of the weighting, the total losses are not comparable between the two models.
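The weights quoted above (e.g. about 289 for the positive class) follow the standard inverse-frequency scaling `total / (2 * count)`; a sketch with hypothetical class counts:

```python
# Inverse-frequency class weights, scaled by total / 2 so the overall
# loss magnitude stays in a similar range: weight_c = total / (2 * count_c).
neg, pos = 99_000, 1_000  # hypothetical counts, not the dataset's
total = neg + pos

weight_for_0 = total / (2.0 * neg)
weight_for_1 = total / (2.0 * pos)

class_weight = {0: weight_for_0, 1: weight_for_1}
print(class_weight)  # {0: ~0.51, 1: 50.0}
```

This dictionary is what gets passed to `model.fit(..., class_weight=class_weight)`, making each positive example count roughly a hundred times more toward the loss than a negative one in this sketch.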
```
weighted_model = make_model()
@@ -1064,7 +1064,7 @@ resampled_steps_per_epoch
Now try training the model with the resampled data set instead of using class weights to see how these methods compare.
<aside class="note">**Note:** Because the data was balanced by replicating the positive examples, the total dataset size is larger, and each epoch runs for more training steps.</aside>
**Note:** Because the data was balanced by replicating the positive examples, the total dataset size is larger, and each epoch runs for more training steps.
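The oversampling step can be sketched as follows (hypothetical labels; the tutorial itself resamples via NumPy/`tf.data` on the real features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 10 positives among 1000 samples.
labels = (np.arange(1000) < 10).astype(int)

pos_idx = np.where(labels == 1)[0]
neg_idx = np.where(labels == 0)[0]

# Oversample the positives with replacement until the classes are balanced.
resampled_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
balanced_idx = np.concatenate([neg_idx, resampled_pos])
rng.shuffle(balanced_idx)

balanced_labels = labels[balanced_idx]
print(len(balanced_idx))       # 1980 -- larger than the original 1000
print(balanced_labels.mean())  # 0.5  -- balanced
```

Because the balanced set is almost twice the original size here, an epoch over it takes correspondingly more steps, which is exactly the effect the note describes.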
```
resampled_model = make_model()