Part 0: Intro
Deep Learning is a powerful toolset, but it also involves a steep learning curve and a radical paradigm shift.
For those new to Deep Learning, there are many levers to learn and different approaches to try out. Even more frustratingly, designing deep learning architectures can be equal parts art and science, without some of the rigorous theoretical backing found in longer-studied linear models.
In this article, we’ll work through some of the basic principles of deep learning, by discussing the fundamental building blocks in this exciting field. Take a look at some of the primary ingredients for getting started below, and don’t forget to bookmark this page as your Deep Learning cheat sheet!
What is a layer?
A layer is an atomic unit within a deep learning architecture. Networks are generally built by stacking successive layers.
What properties do all layers have?
Almost all layers will have the following (see the sketch after this list):
- Weights (free parameters), which create a linear combination of the outputs from the previous layer
- An activation, which allows for non-linearities
- A bias node, equivalent to one incoming variable that is always set to 1
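As a concrete illustration, here is a minimal Keras sketch of those three properties in a single Dense layer; the layer sizes are arbitrary choices for illustration.

```python
from tensorflow.keras import layers

# A single fully connected (Dense) layer, built for 10 incoming features
dense = layers.Dense(4, activation="relu")   # the activation adds the non-linearity
dense.build(input_shape=(None, 10))

weights, bias = dense.get_weights()
print(weights.shape)  # (10, 4) -- linear combination of the previous layer's outputs
print(bias.shape)     # (4,)    -- one bias per neuron, conceptually an input fixed at 1
```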
What changes between layer types?
There are many different layers for many different use cases. Different layers may allow for combining adjacent inputs (convolutional layers) or dealing with multiple timesteps in a single observation (RNN layers).
Difference between DL book and Keras Layers
Frustratingly, there is some inconsistency in how layers are referred to and utilized. For example, the Deep Learning Book commonly refers to architectures (whole networks) rather than specific layers, so its discussion of convolutional neural networks focuses on the convolutional layer as a sub-component of the network.
1D vs 2D
Some layers have 1D and 2D varieties. A good rule of thumb is:
- 1D: Temporal (time series, text)
- 2D: Spatial (images)
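For example, Keras exposes 1D and 2D varieties of the convolutional layer; the shapes below are arbitrary, chosen only to show how each variety slides over its input.

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1D: temporal data, e.g. 100 time steps with 8 features each
conv1d = layers.Conv1D(filters=16, kernel_size=3)        # slides along the time axis
print(conv1d(tf.zeros((1, 100, 8))).shape)               # (1, 98, 16)

# 2D: spatial data, e.g. a 64x64 RGB image
conv2d = layers.Conv2D(filters=16, kernel_size=(3, 3))   # slides over height and width
print(conv2d(tf.zeros((1, 64, 64, 3))).shape)            # (1, 62, 62, 16)
```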
| Layer | Data Types | Weights from last layer | Comment | Further Reading | Keras docs |
|---|---|---|---|---|---|
| Input | All | There are none! | | | |
| Embedding | Categorical, text | OHE categorical input -> vector | Word2Vec is an example of an embedding | link | link |
| Dense | All | Get fed to each neuron | | link | link |
| Dropout | Most | Get fed to each neuron, with some probability | Useful for regularization | link | link |
| Convolutional | Text, time series, image | Adjacent weights get combined | Foundational for computer vision. Generally paired w/ Max Pooling | link | link |
| Max Pooling | Text, time series, image | Take max of adjacent weights | Foundational for computer vision. Generally paired w/ Convolutional | link | link |
| RNN | Text, time series | Each 'timestep' gets fed in order | Generally replaced w/ LSTM | link | link |
| LSTM | Text, time series | Each 'timestep' gets fed in order | Smart improvement over RNN, to avoid vanishing gradients | link | link |
| Bidirectional | Text, time series | Get passed on both forwards and backwards | Layer wrapper that gives time steps forwards and backwards. Standard for RNN / LSTM layers | link | link |
Part 1: Standard layers
Input layer
- Simple pass through
- Needs to align w/ the shape of upcoming layers
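In Keras, the input layer is just a declaration of the incoming data's shape; the shape below is an arbitrary example.

```python
from tensorflow.keras import Input

# Declares the shape of a single observation; no weights are learned here
inputs = Input(shape=(28, 28, 1))  # e.g. a 28x28 grayscale image
```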
Embedding layer
- Categorical / text to vector
- Vector can be used with other (linear) algorithms
- Can use transfer learning / pre-trained embeddings (see example)
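A minimal Keras sketch, assuming a vocabulary of 10,000 tokens and a 64-dimensional embedding; both sizes are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Maps each of 10,000 integer token ids to a dense 64-dimensional vector
embedding = layers.Embedding(input_dim=10_000, output_dim=64)

# A batch of 2 sequences, 5 token ids each -> (2, 5, 64) after embedding
tokens = tf.constant([[1, 4, 7, 2, 0], [3, 3, 9, 1, 5]])
print(embedding(tokens).shape)
```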
Dense layer
- Vanilla, default layer
- Many different activations
- Probably want to use the ReLU activation
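For example, in Keras (the unit counts and activations below are illustrative choices, not prescriptions):

```python
from tensorflow.keras import layers

# Every incoming value is fed to every neuron; the activation is configurable
hidden = layers.Dense(units=128, activation="relu")    # a common default for hidden layers
output = layers.Dense(units=1, activation="sigmoid")   # e.g. a binary classification head
```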
Dropout layer
- Helpful for regularization
- Generally should not be used after the input layer
- Can select the fraction of weights (p) to be dropped
- Weights are scaled at train / test time, so the average weight is the same for both
- Weights are not dropped at test time
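A small Keras sketch showing that dropout only fires during training; the rate of 0.5 is an arbitrary choice.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Randomly zeroes 50% of incoming values during training only
dropout = layers.Dropout(rate=0.5)

x = tf.ones((1, 4))
print(dropout(x, training=True))   # some entries zeroed, survivors scaled up by 1/(1 - rate)
print(dropout(x, training=False))  # inference: values pass through unchanged
```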
Part 2: Specialized layers
Convolutional layer
- Take a subset of the input
- Create a linear combination of the elements in that subset
- Replace the subset (multiple values) with the linear combination (single value)
- Weights for the linear combination are learned
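A sketch of the usual convolution + max pooling pairing on image data, assuming an arbitrary 28x28 grayscale input and filter counts chosen purely for illustration.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                            # 28x28 grayscale images
    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),   # learned linear combos of 3x3 patches
    layers.MaxPooling2D(pool_size=(2, 2)),                      # keep only the max of each 2x2 patch
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),                     # e.g. a 10-class output
])
model.summary()
```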
Time series & text layers
- Helpful when input has a specific order
- Time series (e.g. stock closing prices for 1 week)
- Text (e.g. words on a page, given in a certain order)
- Text data is generally preceded by an embedding layer
- Generally should be paired w/
Simple RNN
- Each time step is concatenated with the last time step's output
- This concatenated input is fed into a dense layer equivalent
- The output of the dense layer equivalent is this time step's output
- Generally, only the output from the last time step is used
- Special handling for the first time step
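A minimal Keras sketch of that behavior; the unit count, batch size, and sequence shape are arbitrary.

```python
import tensorflow as tf
from tensorflow.keras import layers

# 32 hidden units; by default only the final time step's output is returned
rnn = layers.SimpleRNN(units=32)

# A batch of 4 sequences, each with 10 time steps of 8 features
x = tf.zeros((4, 10, 8))
print(rnn(x).shape)  # (4, 32) -- one vector per sequence, from the last time step
```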
LSTM
- Improvement on Simple RNN, with an internal 'memory state'
- Avoids the issue of exploding / vanishing gradients
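A minimal sketch of an embedding + LSTM stack for text, assuming an arbitrary vocabulary size, sequence length, and unit count.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100,), dtype="int32"),            # sequences of 100 token ids
    layers.Embedding(input_dim=10_000, output_dim=64),
    layers.LSTM(units=32),                                 # drop-in replacement for SimpleRNN
    layers.Dense(1, activation="sigmoid"),
])
```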
Bidirectional
- Layer wrapper that feeds each time step both forwards and backwards; standard for RNN / LSTM layers
- There for utility use!
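A one-line Keras sketch of the wrapper; the 32-unit LSTM is an arbitrary choice.

```python
from tensorflow.keras import layers

# Wraps an LSTM so each sequence is read forwards and backwards;
# the two 32-unit outputs are concatenated into a 64-dimensional vector
bi_lstm = layers.Bidirectional(layers.LSTM(units=32))
```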