A recap of the models I am currently working with.
Providing pros and cons so each model's properties are easy to see at a glance.
Detectron2:
concept: next-generation platform for object detection and segmentation.
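A minimal usage sketch, assuming detectron2 is installed and using one of its model-zoo Mask R-CNN configs (the config name, score threshold, and image path are illustrative):
```python
# Detectron2 usage sketch: load a model-zoo config, build a predictor, run it
# on one image. Config name / threshold / image path are illustrative.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # keep detections above 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")             # BGR, as detectron2 expects
outputs = predictor(image)                    # dict with an "instances" field
print(outputs["instances"].pred_classes)
```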
DETR (DEtection TRansformer), 2020
model purpose: object detection
concept:
Set-based global loss that forces unique predictions via bipartite (Hungarian) matching; see the matching sketch below.
Transformer encoder-decoder architecture
pros: better at large-object detection than Faster R-CNN (could be due to its non-local computations).
cons: did not show strong results on benchmarks; small objects perform poorly.
ref:
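A minimal sketch of the bipartite matching step using SciPy's Hungarian solver; the cost here is a simplified class-probability + L1 box cost (the real DETR cost also adds a generalized-IoU term, and the weight below is illustrative):
```python
# DETR-style bipartite matching sketch: build a cost matrix between the
# predictions and the ground-truth objects, then solve the assignment with the
# Hungarian algorithm so each ground-truth box gets a unique prediction.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: [Q, num_classes], pred_boxes: [Q, 4] (cx, cy, w, h)
    # gt_labels: [G], gt_boxes: [G, 4]
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # [Q, G]
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance, [Q, G]
    cost = cost_class + 5.0 * cost_bbox                 # illustrative weighting
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # prediction pred_idx[i] is matched to GT gt_idx[i]
```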
YOLO (anchor-free in v8):
- concept:
- pros: one-stage, faster inference
- cons: generally less accurate than two-stage detectors.
CenterNet:
- concept: one-stage; Hourglass backbone + prediction module; center pooling (see the sketch below).
- pros:
- cons:
- ref: here
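A rough sketch of the center-pooling idea (the response at each location becomes the max along its row plus the max along its column); the official implementation uses cascaded corner pooling on separate conv branches, so this is only an approximation:
```python
# Center pooling sketch: add the horizontal max and the vertical max at every
# position, so a true object center (which sees strong responses along both
# directions) gets boosted. Simplified vs. the paper's cascaded corner pools.
import torch

def center_pool(feat: torch.Tensor) -> torch.Tensor:
    # feat: [N, C, H, W] feature map
    row_max = feat.max(dim=3, keepdim=True).values  # max over width  -> [N, C, H, 1]
    col_max = feat.max(dim=2, keepdim=True).values  # max over height -> [N, C, 1, W]
    return row_max + col_max                        # broadcasts to [N, C, H, W]
```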
Mask R-CNN:
- model type: instance segmentation
- concept:
successor of Faster R-CNN
two-stage
ResNet + FPN (Feature Pyramid Network) to get feature maps at different scales
The RPN (Region Proposal Network) runs over these feature maps to generate many RoIs; anchors are compared to ground truth by IoU to supervise the proposals (IoU-based loss)
RoIAlign: maps each RoI back onto the feature map without quantization, so features stay aligned with the original image (see the sketch below)
Each RoI then goes through classification and bbox-regression heads, plus a small fully convolutional network that predicts a per-instance segmentation mask.
- pros:
- cons: training and inference are time-consuming.
any further papers for improvement? → improved Mask R-CNN variants
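A small RoIAlign sketch using torchvision's op; the stride-16 feature map and the box coordinates are illustrative assumptions:
```python
# RoIAlign sketch: bilinearly sample each RoI from the feature map (no
# quantization), producing a fixed 7x7 grid per RoI for the downstream heads.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 68)              # stride-16 feature map
boxes = torch.tensor([[0., 48., 64., 320., 256.]])  # [batch_idx, x1, y1, x2, y2] in image coords
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
```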
ViT:
- concept:
Linear projection of image patches + position embeddings (see the sketch below)
Transformer encoder (multi-head self-attention)
MLP head (for classification)
- pros:
ViT has fewer inductive biases than a CNN, so it can see the global picture across the whole image.
- cons:
Weaker at detecting local features (e.g. eyes, mouth) compared to a CNN.
If the dataset is small, its performance is worse than a CNN's.
Computation is expensive.
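A compact sketch of the pipeline above (patch projection, [CLS] token, position embeddings, encoder, MLP head); sizes are illustrative, and PyTorch's built-in encoder layers stand in for the paper's exact block:
```python
# ViT sketch: a strided conv implements the linear projection of flattened
# patches; a [CLS] token plus learned position embeddings feed a transformer
# encoder; the [CLS] output goes to the classification head.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=10):
        super().__init__()
        num_patches = (img // patch) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)             # MLP head

    def forward(self, x):                               # x: [B, 3, H, W]
        x = self.proj(x).flatten(2).transpose(1, 2)     # [B, N, dim] patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos       # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                       # classify from the [CLS] token
```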
GTR (Graph Transformer):
- concept:
- pros: can learn higher-dimensional representations
- cons: because the model is high-dimensional, it needs more data; otherwise it does not perform well
- ref: here
TimeSformer:
- concept:
Spatio-temporal Self-Attention for video.
Divided space-time attention (T + S), see the sketch below:
T: temporal attention across the T frames, at the same spatial location
S: spatial attention within each frame, over that frame's S patches
- pros: low training and inference cost; works on videos longer than 1 minute.
- cons: video data benchmarks.
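A shape-only sketch of divided space-time attention: temporal attention first (tokens grouped by spatial position), then spatial attention (tokens grouped by frame). `attn_t` and `attn_s` stand in for multi-head self-attention blocks; residuals and layer norms are omitted:
```python
# Divided space-time attention sketch: reshape so temporal attention sees the
# T frames at one spatial position, then reshape so spatial attention sees the
# S patches of one frame. Shapes only; norms/residuals omitted.
import torch

def divided_space_time(x, attn_t, attn_s, B, T, S):
    # x: [B, T*S, D] patch tokens for a clip (frame-major order)
    D = x.size(-1)
    xt = x.view(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
    xt = attn_t(xt)                                   # temporal attention over T
    xs = xt.view(B, S, T, D).permute(0, 2, 1, 3).reshape(B * T, S, D)
    xs = attn_s(xs)                                   # spatial attention over S
    return xs.view(B, T, S, D).reshape(B, T * S, D)
```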
Segment Anything:
- model purpose: instance segmentation
- concept:
- pros:
- cons: it needs a prompt (e.g. points or boxes) as guidance; see the usage sketch below.
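A minimal prompting sketch, assuming the `segment_anything` package and a downloaded ViT-H checkpoint (paths and the prompt point are placeholders):
```python
# Segment Anything needs a prompt: here a single foreground point.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                      # compute the image embedding once
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),        # (x, y) prompt point
    point_labels=np.array([1]),                 # 1 = foreground
    multimask_output=True)                      # return several candidate masks
print(masks.shape, scores)
```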
DINOv2:
- model purpose: semantic segmentation, using connected-component segmentation
- concept:
Self-supervised Vision Transformer.
The self-supervised approach includes:
momentum encoder.
multi-crop training.
and the use of small patches with ViTs
knowledge distillation:
two identical architectures: student and teacher networks.
Backprop updates only the student; the teacher is updated via an EMA (exponential moving average) of the student's weights.
- pros:
self-supervised learning helps the model learn features useful for semantic segmentation.
pre-trained on a large dataset.
- cons: heavy model with billions of parameters, expensive to fine-tune → currently using DINOv2 + LoRA (check John's code) as a solution.
LoRA: freezes the pretrained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture (see the sketch below).
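A minimal LoRA sketch: wrap a frozen pretrained linear layer and add a trainable low-rank update (the rank and scaling below are illustrative, and this is not John's actual code):
```python
# LoRA sketch: the pretrained weight stays frozen; only the low-rank matrices
# A and B (update = scale * B @ A) receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```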
Mask DINO:
- concept:
Adds a mask prediction branch.
Adds noise to ground-truth boxes (denoising training) to speed up convergence; see the sketch below.
Unified query selection for masks: initializes both content and anchor-box queries for the decoder.
- pros: performs well across CV tasks (detection and segmentation)
- cons: GPU memory is a limitation
- ref:
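A rough sketch of the box-noising idea used for denoising training (the noise scales are illustrative, not the paper's settings):
```python
# Denoising-training sketch: jitter ground-truth boxes (center shift + size
# scaling); the decoder is then trained to reconstruct the clean boxes, which
# speeds up convergence. Noise scales are illustrative.
import torch

def noise_gt_boxes(gt_boxes, center_noise=0.2, size_noise=0.2):
    # gt_boxes: [N, 4] as (cx, cy, w, h), normalized to [0, 1]
    cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
    cxcy = cxcy + (torch.rand_like(cxcy) * 2 - 1) * center_noise * wh
    wh = wh * (1 + (torch.rand_like(wh) * 2 - 1) * size_noise)
    return torch.cat([cxcy, wh], dim=1).clamp(0.0, 1.0)
```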
Swin Transformer:
- concept:
Hierarchical feature maps: patch merging progressively reduces the feature-map size
Window Multi-Head Self-Attention (W-MSA): attention is computed only within each local window
Shifted Window Multi-Head Self-Attention (SW-MSA): shifts the window grid so information can flow between windows (see the sketch below)
- pros: reduces computation (window attention scales linearly with image size instead of quadratically)
- cons: none noted from the paper
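A small sketch of window partitioning and the cyclic shift applied before SW-MSA (window size and feature-map shape are illustrative):
```python
# Swin sketch: split the feature map into non-overlapping windows for W-MSA;
# cyclically shift it (torch.roll) before SW-MSA so neighbouring windows can
# exchange information in the next block.
import torch

def window_partition(x, ws):
    # x: [B, H, W, C] -> [num_windows * B, ws * ws, C]
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(2, 56, 56, 96)                         # stage-1 feature map
windows = window_partition(x, ws=7)                    # attention runs per window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # cyclic shift before SW-MSA
shifted_windows = window_partition(shifted, ws=7)
print(windows.shape, shifted_windows.shape)            # both [128, 49, 96]
```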
U-Net, Attention U-Net, Optimized U-Net for brain tumor segmentation:
- concept:
Encoder-decoder architecture composed of convolutional layers.
Upper (shallow) layers capture a local view; lower (deep) layers capture a global view.
Each encoder level has a skip connection that passes its local features to the matching decoder level, so the model keeps local detail (see the sketch below).
The receptive field grows larger as the layers go deeper.
- pros: suitable for medical images because local features are preserved; also suitable for high-resolution images.
- cons: excessive downsampling leads to more loss of spatial information
- ref:
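A tiny encoder-decoder sketch with one skip connection, to show how encoder features are concatenated back into the decoder (channel counts and depth are illustrative):
```python
# Minimal U-Net-style sketch: the encoder's pre-pooling features are
# concatenated into the decoder via a skip connection, so local detail
# survives the downsampling path.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                  # shallow, local, high-resolution features
        m = self.mid(self.down(e))       # deeper, more global features
        u = self.up(m)                   # upsample back to the encoder resolution
        u = torch.cat([u, e], dim=1)     # skip connection keeps local detail
        return self.head(self.dec(u))
```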
InternImage:
- concept:
CNN-based foundation model (built on deformable convolution v3)
Long-range dependencies (large effective receptive field): the sampling-point offsets are flexible and learned
Adaptive spatial aggregation: the aggregation weights are learnable and conditioned on the input x (see the sketch below)
- pros:
potential to be foundation model
- cons:
Latency is an issue
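A simplified stand-in for the two ideas above, using torchvision's deformable convolution (DCNv2-style offsets and modulation mask); InternImage's actual operator is DCNv3, which additionally uses grouping and softmax-normalized weights:
```python
# Deformable-conv sketch: the sampling offsets and the per-point aggregation
# weights (mask) are both predicted from the input, giving flexible sampling
# points and input-conditioned aggregation. DCNv2-style stand-in for DCNv3.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformBlock(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2 * k * k, k, padding=k // 2)  # (dy, dx) per sampling point
        self.mask = nn.Conv2d(ch, k * k, k, padding=k // 2)        # weight per sampling point
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, x):
        offsets = self.offset(x)            # offsets conditioned on the input
        mask = torch.sigmoid(self.mask(x))  # aggregation weights conditioned on the input
        return deform_conv2d(x, offsets, self.weight, padding=1, mask=mask)

y = DeformBlock()(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])
```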