Guillermo Gonzalez – CS 5243 – Computer Vision – UTSA
Introduction
This body of work outlines an effort to study the available literature on quantum vision transformers and to design a plan for implementing a simple quantum transformer neural network. The end goal is to execute that design on quantum computer simulators in hopes of obtaining the expected results, and then potentially run the same application on a real quantum computer.
Noisy Computers
Quantum computers are currently in an era called NISQ, which stands for Noisy Intermediate-Scale Quantum. The term “noisy” comes from the fact that current quantum computers are highly error-prone: their qubits are “noisy.” These machines have not yet reached the point where reliable results can be generated, due to various errors including environmental effects exhibited as qubit decoherence. Decoherence, or loss of state, means that currently available qubits cannot hold their state for very long. Gate fidelity is also a problem: because hardware technologies are still being developed, current quantum computers consist of logic gates with sub-par gate fidelity (logic-gate quality), which introduces more errors into the results. Much research is being done to reduce the effects of these errors, including efforts in error mitigation and error correction. Another approach is to utilize aspects of classical machine learning methods to refine the results derived from today’s noisy quantum computers.
Press On
Still, much research is being performed in hopes that the algorithms being written today will be useful once reliable quantum computers come online. Many researchers work with quantum simulators that simulate real quantum computers, which allows them to test out their methods and algorithms and to perfect them for future use. A side effect of these efforts is that research in this area sometimes leads to “quantum-inspired” solutions that can be applied immediately in the classical world: looking at things through a different lens can spawn a new way to solve an existing problem, and to solve it now. Research teams are working in industries such as finance, materials science, chemistry, biology, cryptography, automotive, space, and defense. In some of these industries, there is hope that quantum computers can help with tasks like autonomous vehicle navigation, space debris detection, and cancer identification, tasks that are currently performed classically using machine learning methods. In the context of this study, the hope is that quantum computing could significantly advance the field of computer vision.
Fast Forward to QML
Quantum Machine Learning (QML) is considered a very hot topic in the realm of quantum computing. Researchers are working to devise models, circuits, methods, and algorithms that leverage quantum computing for various QML tasks, including typical image processing tasks such as edge detection, image classification, and object identification. QML, like classical machine learning, spans many industries and research areas; one of these is computer vision. Various efforts are pursuing the incorporation of quantum computing concepts into classical machine learning techniques to gain benefits such as the ability to handle larger data sets and faster execution of QML models that are growing in complexity. Many recent QML research papers cover leveraging the potential power of quantum computing in classical machine learning methods, for example Quantum Convolutional Neural Networks (QCNNs), quantum support vector machines, and quantum nearest-neighbor algorithms; however, very few tackle the topic of Quantum Transformer Neural Networks, and more specifically Quantum Vision Transformers. The goal of this project is to study this area of QML as it applies to computer vision.
“Transformers, Robots in Disguise”
Classical transformer neural networks have been used in image recognition and natural language processing (NLP) tasks, where they have been shown to rival CNNs. Transformers use neither recurrent nor convolutional layers; their architecture is comparatively simple. Transformer training is considered a form of semi-supervised learning in that the networks are pre-trained on unlabeled data and then fine-tuned on labeled data. A transformer is essentially a neural network that uses a mechanism called attention, which considers a global context while processing the entire data specimen element by element. Because of the attention mechanism, transformers are more contextual. In addition, the architecture is highly parallelizable, which could be exploited within a quantum algorithm. A sample transformer is shown to the right.
Pay Attention
As mentioned, at the heart of a transformer is a mechanism known as “attention.” This means, for example in the case of NLP, that each word is not just processed individually in a sequence, but treated in the context in which it appears, e.g., at the beginning, at the end, or next to another important word. In reference to computer vision, imagine viewing a video clip not just frame by frame, but by following the movement and actions of the characters and their progression through the frame, including position, posture, facial expressions, and expected progression. All of this helps you understand the context of the scenes.
As such, a transformer network attempts to mimic this process of “attention.” The hope is to understand the sequence of a scene and the relationships and dependencies between individual elements, for example pixels in an image, whether they are near or far from each other. Similarly, a word appearing in a sentence may depend on a word that appeared earlier in the paragraph.
Break it Down
The two primary parts of the Transformer Network are the Encoder and the Decoder.
Two other important parts of the network are the Input and the Output.
The first step in dealing with an input is tokenization: the input is split into individual parts. In the case of images, this means breaking an image down into smaller image patches.
Next is embedding, where each individual token is assigned a density matrix based on the token’s meaning and its similarities and relationships to the other tokens.
The assignment of the embedding can be performed by an external model.
In addition, the input fed into the transformer network leverages positional encoding: an additional vector that carries information about the position of a token amongst the other tokens.
Once the density matrices are associated with the tokens, all of the tokens are submitted to the transformer network at once. The transformer network itself knows nothing about the order of the tokens; order information comes only from the positional encoding.
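The input pipeline described above (tokenize, embed, add positional encoding) can be sketched in a few lines of NumPy. This is an illustrative classical sketch under simplifying assumptions: tokens are embedded as plain vectors via a random projection rather than the density matrices described above, and the sinusoidal positional encoding from the original Transformer architecture is used.

```python
import numpy as np

def patchify(image, patch=4):
    """Tokenize: split an HxW image into flattened patch 'tokens'."""
    h, w = image.shape
    patches = [image[r:r + patch, c:c + patch].ravel()
               for r in range(0, h, patch)
               for c in range(0, w, patch)]
    return np.stack(patches)                      # (num_tokens, patch*patch)

def embed(tokens, dim=8, seed=0):
    """Embed: project each token into a d-dimensional vector.
    (A trained model would learn this projection; here it is random.)"""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(tokens.shape[1], dim))
    return tokens @ proj                          # (num_tokens, dim)

def positional_encoding(num_tokens, dim):
    """Sinusoidal positional encoding, added so the network sees token order."""
    pos = np.arange(num_tokens)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

image = np.arange(64, dtype=float).reshape(8, 8)  # toy 8x8 "image"
tokens = patchify(image)                          # 4 patch tokens of length 16
x = embed(tokens) + positional_encoding(len(tokens), 8)
print(x.shape)                                    # (4, 8)
```

All tokens in `x` would then be submitted to the network at once, with only the added positional-encoding component distinguishing their order.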
Encoders/Decoders
A typical transformer network consists of six encoders and six decoders.
Encoders
Our original input is passed through several encoders. The encoder itself consists of a self-attention mechanism layer and a feed forward neural network layer. This encoder captures the contextual nature of the tokens and their dependencies.
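As a rough illustration of how these two layers compose, here is a minimal single-head encoder block in NumPy. This is a simplified sketch, not this project's implementation: it uses one attention head, random untrained weights, and omits the layer normalization found in production transformers.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv          # Query, Key, Value per token
    scores = q @ k.T / np.sqrt(k.shape[-1])   # token-vs-token similarity
    return softmax(scores) @ v                # attention-weighted values

def encoder_block(x, params):
    wq, wk, wv, w1, w2 = params
    x = x + self_attention(x, wq, wk, wv)     # residual add around attention
    x = x + np.maximum(0, x @ w1) @ w2        # residual add around FFN (ReLU)
    return x

rng = np.random.default_rng(1)
d = 8
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(5)]
x = rng.normal(size=(4, d))                   # 4 tokens, embedding dim 8
out = encoder_block(x, params)
print(out.shape)                              # (4, 8)
```

In a full transformer, several such blocks are stacked so each encoder refines the contextual representation produced by the previous one.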
Cerberus – Multihead
One type of self-attention layer is the Multi-Head Attention layer. An attention layer helps determine which tokens depend on one another and how strong those dependencies are.
What are the parts?
Each token will be assigned three values: query, key and value.
Query: token looks for other tokens to pay attention to
Key: the token being looked at by other tokens
Value: the meaning or information about a token
Using the scaled dot-product formula, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, the self-attention layer calculates a similarity score by comparing each token to every other token. Specifically, the Query of each token is compared to the Key of every other token.
The higher the similarity score, the stronger the relationship. This provides guidance on how much attention each token should give to the other tokens: tokens with higher scores get more attention.
The Query, Key, and Value are each represented as a vector; to calculate attention, the Query vector of one token is compared to the Key vector of another token to determine similarity.
The similarity scores are converted to attention weights using softmax.
Attention weights are used to compute the weighted sum of the Value vectors.
The weighted sums are used to update the representation of each of the tokens.
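The steps above (similarity scores from Query/Key comparisons, softmax into attention weights, weighted sum of Values) can be traced in a short NumPy sketch. The weight matrices here are random stand-ins for what a trained model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                                    # 4 tokens, embedding dim 8
x = rng.normal(size=(n, d))                    # token embeddings
wq, wk, wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))

q, k, v = x @ wq, x @ wk, x @ wv               # Query, Key, Value per token

# 1. Similarity scores: each token's Query against every token's Key.
scores = q @ k.T / np.sqrt(d)                  # (n, n)

# 2. Softmax turns each row of scores into attention weights (rows sum to 1).
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# 3. Weighted sum of the Value vectors updates each token's representation.
updated = weights @ v                          # (n, d)

print(np.allclose(weights.sum(axis=1), 1.0))   # True
```

Row i of `weights` records how much attention token i pays to every other token, and `updated[i]` is token i's new, context-aware representation.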