Hustle and Flow
The Add and Norm layers manage the flow of information through the network and stabilize the training process.
The Add layer, also known as the residual connection or skip connection, performs element-wise addition between the output of the previous layer and the output of the attention or feed-forward sub-layer.
This preserves the original information while letting the model incorporate the new information captured by the sub-layer. It also helps gradients propagate during training, which combats the vanishing gradient problem and allows the model to learn effectively.
The normalization layer helps with generalization and reduces the impact of variations in the inputs.
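To make the Add & Norm step concrete, here is a minimal sketch, assuming PyTorch (the article does not name a framework) and using a plain linear layer as a stand-in for the attention or feed-forward sub-layer:

```python
import torch
import torch.nn as nn

d_model = 512                            # embedding size (assumed)
layer_norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # stand-in for attention / feed-forward

x = torch.randn(2, 10, d_model)          # (batch, sequence length, embedding)

# Add: element-wise addition of the sub-layer's input and output
# (the residual / skip connection), then Norm: layer normalization.
out = layer_norm(x + sublayer(x))
print(out.shape)                         # torch.Size([2, 10, 512])
```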
Decoders
The decoder generates the output. It takes as input the encoder's output, which has already been processed.
Similar to the encoder, the decoder has an attention layer that determines the relationships and dependencies between the output tokens it is generating.
In addition, the decoder has a “masked” multi-head attention layer. The mask ensures that each output token can only depend on tokens that appear before it, which prevents forward peeking or look-aheads. This lets the model generate the output one token at a time, using only the information available up to that point.
The reasoning is that during output, the decoder should not have knowledge of any upcoming tokens.
Now the decoder can use the input from the encoder to generate the next token in the output sequence.
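A minimal sketch of how the mask works, assuming PyTorch: scores for future positions are set to negative infinity before the softmax, so their attention weights become zero and each token attends only to itself and earlier tokens.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 8
q = k = v = torch.randn(seq_len, d_model)    # toy decoder representations

scores = q @ k.T / d_model ** 0.5            # scaled dot-product scores

# Upper-triangular mask: position i must not see positions j > i.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)          # future positions get weight 0
output = weights @ v
print(weights[0])                            # the first token attends only to itself
```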
I’m Hungry
The feed-forward networks take the weighted sums from the attention layers, apply transformations, and introduce non-linearity. This refines the representations of the tokens.
The feed-forward network's output is then used to calculate the probability that each candidate token is the next token in the output sequence.
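As a rough sketch, assuming PyTorch and typical layer sizes that the text does not specify, the feed-forward network and the output projection that turns its result into next-token probabilities might look like this:

```python
import torch
import torch.nn as nn

d_model, d_ff, vocab_size = 512, 2048, 10000   # sizes are assumptions

# Two linear transformations with a non-linearity in between
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
output_head = nn.Linear(d_model, vocab_size)   # projects onto the vocabulary

x = torch.randn(2, 10, d_model)                # (batch, sequence, embedding)
h = feed_forward(x)                            # refined token representations
probs = torch.softmax(output_head(h), dim=-1)  # next-token probabilities
print(probs.shape)                             # torch.Size([2, 10, 10000])
```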
Enter Player Two
Vision Transformer neural networks leverage the success of transformer neural networks in natural language processing (NLP) and apply it to images. The typical pipeline, with a code sketch of the pre-processing steps after this list, is:
- Image pre-processing: “Tokenize” an image, i.e. break it up into patches.
- Flatten the patches
- Reduce the dimensionality of the flattened patches
- Embed position
- Feed the sequence to a standard transformer encoder
- Pretrain the model with image labels (fully supervised)
- Fine-tune on a downstream dataset for image classification
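A minimal sketch of the pre-processing steps above, assuming PyTorch, 16×16 patches, and a reduced embedding size of 256 (none of these values come from the text):

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)       # (batch, channels, height, width)
patch_size, d_model = 16, 256           # assumed sizes

# 1. Tokenize: break the image into 16x16 patches -> (1, 196, 3*16*16)
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(1, 2).flatten(2)

# 2-3. Flatten (done above) and reduce dimensionality with a linear projection
proj = nn.Linear(3 * patch_size * patch_size, d_model)
tokens = proj(patches)                  # (1, 196, 256)

# 4. Add learned position embeddings
pos_embed = nn.Parameter(torch.zeros(1, tokens.shape[1], d_model))
tokens = tokens + pos_embed

# 5. Feed the sequence to a standard transformer encoder
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(tokens)
print(out.shape)                        # torch.Size([1, 196, 256])
```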
Quantum Vision Transformer
Now that we’ve covered what a Vision Transformer Network is, we need to see where quantum fits in.
Based on available information, a quantum vision transformer consists of these parts:
- Image pre-processing: “Tokenize” an image, i.e. break it up into parts.
- Quantum encoding: Use one of various methods to encode the image data into quantum states (one possible encoding is sketched after this list)
- Execute the Transformer Network
- Retrieve and analyze the results
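The source does not say which encoding method is used, so this is only one possible sketch of the quantum encoding step: angle encoding with Qiskit, where each pixel of a small flattened patch sets the rotation angle of its own qubit.

```python
import numpy as np
from qiskit import QuantumCircuit

# A tiny flattened 2x2 patch with pixel values normalized to [0, 1] (made up)
patch = np.array([0.1, 0.5, 0.8, 0.3])

qc = QuantumCircuit(len(patch))
for qubit, value in enumerate(patch):
    qc.ry(np.pi * value, qubit)   # rotation angle proportional to the pixel value

print(qc)                          # text drawing of the encoding circuit
```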