Evolving Deep Architectures: A New Blend of CNNs and Transformers Without Pre-Training Dependencies

Modelling in computer vision is gradually moving from Convolutional Neural Networks (CNNs) to Vision Transformers, owing to the strength of self-attention mechanisms in capturing global dependencies within the data. Although Vision Transformers have been shown to surpass CNNs in performance while requiring less computational power, their need for pre-training on large-scale datasets can become burdensome. Relying on pre-trained models carries critical limitations, including limited flexibility to adjust network structures and mismatches between source and target domains. To address this, a new hybrid architecture blending CNNs and Transformers is proposed. This project proposes an architecture that modifies the SegFormer Transformer with two convolutional modules, achieving a pixel accuracy of 0.6956 on MS COCO.
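The hybrid design summarised above, a transformer encoder augmented with convolutional modules, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual SegFormer modification: the depthwise convolution, single-head attention, and all shapes are assumptions chosen for clarity.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    # x: (H, W, C); kernels: (3, 3, C). Zero-padded depthwise convolution,
    # a common way to inject local inductive bias (an assumption here).
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + 3, j:j + 3, :]      # (3, 3, C) neighbourhood
            out[i, j] = np.sum(patch * kernels, axis=(0, 1))
    return out

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over flattened patches:
    # this is the mechanism that captures global dependencies.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def hybrid_block(x, kernels, Wq, Wk, Wv):
    # Convolutional module first (local features), then attention
    # (global context) -- the CNN/Transformer blend in miniature.
    local = depthwise_conv3x3(x, kernels)
    H, W, C = local.shape
    tokens = local.reshape(H * W, C)
    return self_attention(tokens, Wq, Wk, Wv).reshape(H, W, C)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))                  # toy feature map
out = hybrid_block(
    x,
    rng.standard_normal((3, 3, 16)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((16, 16)),
)
print(out.shape)  # (8, 8, 16): spatial layout and channels preserved
```

Such a block can be trained from random initialisation, which is the point of the proposed design: the convolutional modules supply the local inductive bias that otherwise has to be learned from large-scale pre-training.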
Manu Kiiskilä
Finland