I understand that you are trying to create a Vision Transformer (ViT) model that takes multiple images as input and generates a sequence.
Creating a Vision Transformer (ViT) model that receives multiple images as input in MATLAB involves adapting the ViT architecture to handle sequences of images. The original ViT architecture is designed for single-image classification tasks. To modify it for sequential multi-image input, you would treat each image as a token in the sequence and process these tokens in a way similar to how transformers process sequential data in natural language processing (NLP).
Here is a conceptual outline of how you could approach this:
Step 1: Preprocessing:
- Resize all images to a fixed size.
- Flatten each image into a 1D vector or use patches as tokens, as done in ViT.
- Optionally, add positional encoding to retain the order of the images.
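Here is a minimal preprocessing sketch for this step. It assumes the inputs are RGB images stored in a cell array named imgs, uses a fixed random projection purely for illustration (in a real model this projection would be learnable), and adds a simple sinusoidal positional encoding; all dimension values are examples, not requirements.

imgSize   = [224 224];      % fixed spatial size for every image
embedDim  = 512;            % token embedding dimension (example value)
numImages = numel(imgs);    % imgs: 1-by-N cell array of RGB images (assumed)

% In a trained model this projection would be learnable; a fixed random
% matrix is used here only to keep the sketch self-contained.
W = randn(embedDim, prod(imgSize)*3, 'single') / sqrt(prod(imgSize)*3);

tokens = zeros(embedDim, numImages, 'single');
for k = 1:numImages
    im = im2single(imresize(imgs{k}, imgSize));   % resize to the fixed size
    tokens(:, k) = W * im(:);                     % flatten and embed the image
end

% Sinusoidal positional encoding so the order of the images is retained.
pos    = 0:numImages-1;
d      = (0:embedDim-1)';
tokens = tokens + single(sin(pos ./ (10000 .^ (d/embedDim))));

% Package as a dlarray with channel (C) and time (T) dimensions.
X = dlarray(tokens, 'CT');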
Step 2: Transformer Encoder:
- Use a series of transformer encoder layers to process the sequence of image tokens.
- Each transformer encoder layer would include multi-head self-attention and feedforward neural networks.
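As a rough sketch, one encoder block could be built from built-in layers as shown below. This assumes Deep Learning Toolbox R2023a or later for selfAttentionLayer, and all sizes are example values.

embedDim = 512;     % must match the token embedding dimension
numHeads = 8;       % number of attention heads (example value)
ffnDim   = 2048;    % hidden size of the feedforward network (example value)

encoderBlock = [
    layerNormalizationLayer(Name="ln1")
    selfAttentionLayer(numHeads, embedDim, Name="attn")   % multi-head self-attention
    layerNormalizationLayer(Name="ln2")
    fullyConnectedLayer(ffnDim, Name="ffn1")              % feedforward network
    geluLayer(Name="gelu")
    fullyConnectedLayer(embedDim, Name="ffn2")
    ];

Note that a full ViT encoder also wraps the attention and feedforward sub-blocks in residual (skip) connections, which you can add with additionLayer and connectLayers when assembling the final layerGraph or dlnetwork, and several such blocks are normally stacked.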
Step 3: Sequence Decoder:
- After processing the images through the transformer encoder, you need to decode the output into a sequence.
- You can use an RNN, LSTM, or another transformer decoder to generate the output sequence.
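For example, a simple LSTM-based decoder could look like the sketch below; a transformer decoder with cross-attention would instead require custom layers. The sizes are assumed example values.

embedDim       = 512;   % dimension of the encoder output tokens
numHiddenUnits = 256;   % LSTM hidden size (example value)

decoderLayers = [
    sequenceInputLayer(embedDim, Name="encOut")        % encoder output as input
    lstmLayer(numHiddenUnits, OutputMode="sequence")   % one output per sequence position
    dropoutLayer(0.1)
    ];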
Step 4: Output Layer:
- The output layer produces the final sequence, for example by applying a classification head (a fully connected layer followed by softmax) at every position of the output sequence.
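A sketch of such an output head, assuming numClasses output classes per position, could be:

numClasses = 10;   % number of classes per output position (example value)

outputHead = [
    fullyConnectedLayer(numClasses, Name="fc_out")   % per-position class scores
    softmaxLayer(Name="softmax")                     % per-position class probabilities
    ];

With the dlnetwork/trainnet workflow (R2023b or later) you would specify a loss such as "crossentropy" at training time; with the older trainNetwork workflow you would append a classificationLayer instead.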
In MATLAB, you can use the Deep Learning Toolbox to create custom layers and models. MATLAB does currently provide a pre-defined ViT model (the visionTransformer function, available through a support package), but it targets single-image classification, so this scenario would require you to implement the transformer layers manually. You can follow this documentation for steps on how to define custom deep learning layers:
For additional information on the pre-defined deep learning layers available in MATLAB, you can refer to the following link:
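Putting the pieces together, a minimal end-to-end sketch (a single encoder block, no residual connections, example sizes throughout) could be assembled into a dlnetwork like this:

embedDim   = 512;
numHeads   = 8;
ffnDim     = 2048;
numHidden  = 256;
numClasses = 10;

layers = [
    sequenceInputLayer(embedDim, Name="tokens")        % one token per input image
    layerNormalizationLayer
    selfAttentionLayer(numHeads, embedDim)             % encoder: multi-head self-attention
    layerNormalizationLayer
    fullyConnectedLayer(ffnDim)
    geluLayer
    fullyConnectedLayer(embedDim)
    lstmLayer(numHidden, OutputMode="sequence")        % decoder
    fullyConnectedLayer(numClasses)                    % per-position classification head
    softmaxLayer
    ];

net = dlnetwork(layers);
analyzeNetwork(net)   % optional: inspect layer sizes and connections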
I hope this resolves your query.
With regards,
Debraj.