The Functional API
Good reference for beginner and intermediate use cases of the Functional API.
Module: tf.keras.applications
Pre-trained models that ship with TensorFlow. The majority are image recognition models trained on the 1000-class ImageNet dataset.
If you want other pre-trained models, you can find them here: https://modelzoo.co/, but some may require manual coding to load the weights and do the appropriate pre-processing.
preprocessing_function
This is important to use if we are transfer learning from a pre-trained model. We should match the pre-processing done for that pre-trained model, otherwise the inputs will be invalid.
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)
Note that we do both preprocessing and augmentation on training data.
Test/validation data is just pre-processed without augmentation. We don't want the test/validation data to deviate too much from real-world inputs.
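A minimal sketch of matching the pre-trained model's pre-processing (assuming TF 2.x and ResNet50 as the example base model; any of the tf.keras.applications models exposes its own preprocess_input):

```python
# Sketch: reuse the pre-trained model's own preprocessing for transfer learning.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet50 import preprocess_input

# Note: no rescale here -- preprocess_input handles scaling for ResNet50.
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,  # match the pre-trained model
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
```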
flow_from_directory
this can be used to 1) load, and 2) optionally augment image data from a directory.
Typically, you would create a directory structure with folders indicating each class label:
data\
    cat\
        images of cats
    dog\
        images of dogs
    monkey\
        images of monkeys
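A sketch of the call itself, assuming TF 2.x (here the directory tree is generated on the fly with random images purely for illustration):

```python
# flow_from_directory expects one sub-folder per class label.
import os, tempfile
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img

# Build a tiny data/ tree with one image per class, just for illustration.
root = tempfile.mkdtemp()
for label in ("cat", "dog", "monkey"):
    os.makedirs(os.path.join(root, label))
    img = array_to_img(np.random.randint(0, 255, (32, 32, 3)).astype("uint8"))
    img.save(os.path.join(root, label, "example.png"))

datagen = ImageDataGenerator(rescale=1.0 / 255)
generator = datagen.flow_from_directory(
    root,                      # folder whose sub-folders are the class labels
    target_size=(32, 32),      # resize every image to this size
    batch_size=2,
    class_mode="categorical",  # one-hot labels for multi-class
)
```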
padding='valid'
Set to 'valid' unless you want to keep the same image size for input and output.
'valid' turns off padding.
'same' pads the input so that the output surface (excluding depth) is the same size as the input after strided convolution. This is useful, for example, when stacking many convolutional layers, so the feature map doesn't shrink at every layer.
Padding is typically implemented by applying zeros. Technically though, padding can be any number (such as -1, 1, 0, ...)
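A quick sketch of how 'valid' and 'same' affect the output size along one spatial dimension (this mirrors Keras's formulas; the function name is mine):

```python
import math

def conv_output_size(n, kernel_size, stride, padding):
    """Spatial output size of a conv layer along one dimension.

    'valid' uses no padding; 'same' pads so output = ceil(input / stride).
    """
    if padding == "valid":
        return math.floor((n - kernel_size) / stride) + 1
    elif padding == "same":
        return math.ceil(n / stride)
    raise ValueError(padding)

# 28-pixel input, 3x3 kernel:
conv_output_size(28, 3, 1, "valid")  # 26 -- the image shrinks
conv_output_size(28, 3, 1, "same")   # 28 -- size preserved
conv_output_size(28, 3, 2, "same")   # 14 -- halved by the stride
```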
Conv2D
The 2D here indicates that each filter's convolution is applied over a 2-dimensional surface of the input tensor. We do not count the filter depth when we say 2D.
This is usually for image data.
1D convolution can be done on sequential data, such as text or time series.
strides=(1, 1)
Set to a value >= kernel_size if you don't want the filters to apply to overlapping windows of the input data.
Otherwise, set to a value smaller than kernel_size so that the windows overlap. Typically we set (1, 1), which moves the window 1 step across and 1 step down.
kernel_size
2D window of filters (i.e. the learnt weights).
Total number of weights is (if use_bias=True):
kernel_height × kernel_width × input_channels × output_depth + output_depth
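The formula above as a small sketch (the function name is mine; the example value matches what model.summary() reports for such a layer):

```python
def conv2d_param_count(kernel_size, input_channels, filters, use_bias=True):
    """Trainable parameters in a Conv2D layer: one (kh x kw x channels)
    weight block per filter, plus one bias per filter."""
    kh, kw = kernel_size
    weights = kh * kw * input_channels * filters
    return weights + (filters if use_bias else 0)

# e.g. a 3x3 Conv2D with 32 filters on an RGB (3-channel) input:
conv2d_param_count((3, 3), input_channels=3, filters=32)  # 896
```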
filters
Depth of the output (one output channel per filter). The total number of 2D kernels is input_channels × filters.
units
important: sets the output shape. Input shape is implied by the previous layer.
(aka number of neurons).
Just your regular densely-connected NN layer.
Every input is connected via a weight to every output.
shape=None,
Most important to set, this is the input to the model.
E.g. shape=(10, ) if the model has 10 inputs
shape=(28, 28, 3) if the model is taking in an RGB image of 28x28
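A minimal Functional API sketch tying shape and units together, assuming TF 2.x (layer sizes here are arbitrary):

```python
# Input fixes the model's input shape; each Dense layer's `units` fixes
# its output size, and the next layer infers its input from the previous one.
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(10,))                # model takes 10 features
x = layers.Dense(16, activation="relu")(inputs)     # 16 units -> output size 16
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary classification head
model = tf.keras.Model(inputs=inputs, outputs=outputs)
```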
kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None, **kwargs)
use defaults, don't change unless you have a reason to.
activation=None
important to set.
a) Binary classification: 'sigmoid'
b) Multi-class classification: 'softmax'
c) Regression: None (or 'linear'; they are equivalent)
use_bias=True
use the default, adds flexibility to the network to shift the output up/down vertically
data_format
rarely set, use defaults
kernel_initializer: Initializer for the kernel weights matrix.
bias_initializer: Initializer for the bias vector.
kernel_regularizer: Regularizer function applied to the kernel weights matrix.
bias_regularizer: Regularizer function applied to the bias vector.
activity_regularizer: Regularizer function applied to the output of the layer (its "activation").
kernel_constraint: Constraint function applied to the kernel weights matrix.
bias_constraint: Constraint function applied to the bias vector.
Rarely customized. Defaults are good enough.
validation_data=None
There is no concept of cross-validation here. This is just the validation data.
Cross-validation will have to be implemented manually in your code, if you choose to do it.
shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None, validation_batch_size=None, validation_freq=1, max_queue_size=10, workers=1, use_multiprocessing=False
typically keep defaults unless you have a good reason not to.
batch_size=None
batch_size is typically a number smaller than the dataset size.
If you are using GPU, you may sometimes need to reduce the batch_size when GPU complains it is out of memory.
run_eagerly=None
recommend leave as default
loss_weights=None, weighted_metrics=None,
less commonly used
metrics=None
This specifies the additional metrics.
By default, you will always get the loss, even if you don't set any metrics.
You'll see that for classification, we often add the 'acc' (accuracy) metric.
Precision/recall is also available. See the metrics section for an example.
Note that generally you can't simply pass a scikit-learn metric here; it has to be implemented as a TensorFlow metric. See the metrics section.
loss=None
Very important to set. Depends on the task.
a) Binary classification: binary_crossentropy
b) Multi-class classification: categorical_crossentropy
c) Regression: mse or mae
These are the standard losses. Stick to them unless you have a reason not to.
optimizer='rmsprop'
'adam' is a good choice most of the time. It adapts per-parameter learning rates using momentum estimates, so the default learning rate usually works without tuning.
'sgd' is another choice if you want to learn very slowly (we will use this in transfer learning fine tuning).
'rmsprop' is a less common choice; it is sometimes used for Recurrent Neural Networks. However, 'adam' is also good for RNNs.
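The compile-time choices above (optimizer, loss, metrics) in one sketch, assuming TF 2.x and a binary classification task (the layer sizes are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",            # good default; adapts per-parameter step sizes
    loss="binary_crossentropy",  # matches the sigmoid output
    metrics=["acc"],             # loss is always reported; 'acc' is extra
)
```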
adapt
This is like "fit" in TfidfVectorizer. It computes the vocabulary and assigns a unique integer index to each token.
"transform" is done as a separate step, this way:
vectorizer(X_train.values).numpy()
output_sequence_length
Important to set. This will keep all text token sequences the same length, either through truncation or padding.
Shorter sequences are right-padded with zeros (index 0 is reserved for padding).
The fixed length token sequences will determine your Neural Network input shape.
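The adapt/transform steps above in one sketch, assuming a recent TF 2.x (the layer has since moved out of experimental to tf.keras.layers.TextVectorization; the toy texts are mine):

```python
# adapt() builds the vocabulary ("fit"); calling the layer transforms text.
import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_sequence_length=5,  # pad/truncate every sequence to 5 tokens
)
texts = ["the cat sat", "the dog sat on the mat"]
vectorizer.adapt(texts)                 # like "fit": learn the vocabulary
sequences = vectorizer(texts).numpy()   # like "transform": text -> int ids
```

The first sentence has only 3 tokens, so it is right-padded with zeros to length 5; the second has 6 and gets truncated.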
ngrams
If using pre-trained word vectors like Glove, Word2Vec, check if those word vectors support ngrams before setting this.
Otherwise the ngrams will be considered out of vocab.
max_tokens
This will control the vocabulary size. The vocabulary is ordered by token frequency, so once max_tokens is reached, less frequent words are considered OOV (out of vocab).
tf.keras.layers.experimental.preprocessing.TextVectorization
requires TF 2.1 or later.
class_weight : dict or 'balanced', default=None
Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Set this to 'balanced' if your classes are imbalanced and you would like to keep the imbalance in the dataset. It does not automatically guarantee a better result, but will try to weigh minority samples higher.
Keeping the imbalance is sometimes necessary when you want to train a population-specific model (rather than a general-purpose model), or if you don't want to discard samples or manufacture synthetic samples, which are the usual balancing techniques.
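The 'balanced' heuristic above as a small pure-Python sketch (the function name is mine; it mirrors scikit-learn's n_samples / (n_classes * count) formula):

```python
from collections import Counter

def balanced_class_weights(y):
    """weight_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# 9 negatives vs 1 positive: the minority class gets a much larger weight.
weights = balanced_class_weights([0] * 9 + [1])  # class 1 gets weight 5.0
```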
kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf'
The shape of the kernel can be used to tailor the type of decision boundary you would like to draw.
For gentle curves, select 'rbf'. This is usually a good default.
For more angular curves, select 'poly'. However, poly tends to be more difficult to tune because the shape changes dramatically based on the degree of polynomial.
If you want to draw a straight-line boundary, then use 'linear'.
sigmoid is rarely used.
C : float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
keras.plot_model to plot the model architecture inline.
Returns a Jupyter notebook Image object if Jupyter is installed, which enables in-line display of the model plots in notebooks.
3D tensor with shape: (batch_size, input_length, output_dim).
Embedding always produces a 3-D tensor. Each token is expanded into a vector of output_dim size. Generally this works well with RNNs or CNNs. For Dense layer, we will need to flatten back to a 2D tensor.
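A sketch of the 3D-to-2D point above, assuming TF 2.x (the vocabulary size and dimensions are arbitrary):

```python
# Embedding turns each integer token id into an output_dim-sized vector,
# producing (batch, sequence_length, output_dim); Flatten restores 2D
# before a Dense layer.
import tensorflow as tf

inputs = tf.keras.Input(shape=(5,))  # sequences of 5 token ids
x = tf.keras.layers.Embedding(input_dim=1000, output_dim=8)(inputs)
x = tf.keras.layers.Flatten()(x)     # (batch, 5, 8) -> (batch, 40)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```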
penalty='l2'
See annotation for LogisticRegression estimator here: https://hyp.is/ayr3TvDsEeqnS_vS5fzwbQ/scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
loss='hinge',
"hinge" means SVM with linear kernel, which uses the support vector - hyperplane method to find the widest boundary of separation.
"log" means LogisticRegression. The difference between this and the LogisticRegression estimator, is that this one uses Stochastic Gradient Descent to find the parameters. Whereas LogisticRegression estimator uses some other optimizer to find the weights. They are both ok in practice.
The other losses are worth considering if the dataset contains outliers.
C=1.0,
The C parameter comes hand in hand with the penalty parameter. If you set C < 1.0, regularization strength increases. C must be a positive number. Basically, regularization constrains the weights of Logistic Regression so that they don't grow too big (i.e. overfit to training data by putting too much emphasis on a feature seen in training). Conceptually, regularization tries to "smooth out" the parameters (L2) or drop some parameters (L1).
If you set C > 1.0, that reduces the regularization by a factor of 1/C, so C = 2.0 gives half strength. Reducing regularization is useful when you notice that your training accuracy is dropping too much.
solver='lbfgs'
Unlike SGDClassifier, which uses SGD to solve the linear equation (by doing first-order gradient descent on the loss function), LogisticRegression uses second-order gradient solvers. Second order is good for finding the global minimum, but the exact solution is very expensive to compute. The different solvers here are ways to approximate the second-order derivatives. You can probably get away with the default lbfgs, which is a quasi-Newton method. https://en.wikipedia.org/wiki/Limited-memory_BFGS
l1_ratio=None
Tune the l1_ratio if you use elastic net. It's basically a slider value, for example if l1_ratio is 0.4, that means you use 0.4 L1 regularization, and 0.6 L2 regularization. It helps you choose between the two forms of regularization, depending on your goals. L2 tends to result in smoother weights (smaller across all features), L1 tends to drop features.
penalty='l2'
If your training accuracy is poor (e.g. 60% or less), suggest setting to 'none', because there's no point regularizing. If your training accuracy is good, then keep as 'l2'. If you have a lot of features to remove, you can also set 'elasticnet' or 'l1' to incorporate some L1 regularization, which acts as an automated feature selector.
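A sketch tying penalty, C, solver, and l1_ratio together, assuming scikit-learn is installed (the toy dataset and values are mine):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Default-ish setup: L2 penalty, lbfgs solver; smaller C = stronger regularization.
clf_l2 = LogisticRegression(penalty="l2", C=0.5, solver="lbfgs").fit(X, y)

# Elastic net requires the saga solver plus an l1_ratio (here 0.4 L1 / 0.6 L2).
clf_en = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.4, C=1.0, max_iter=5000
).fit(X, y)
```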