Model comparisons

A recap of the models I am currently working on.

Pros and cons are listed for each model so its properties are easy to look up.


concept: next-generation platform for object detection and segmentation.

DETR (DEtection TRansformer), 2020

model purpose: object detection


Set-based global loss that forces unique predictions via bipartite matching.

Transformer encoder-decoder architecture

pros: better at large-object detection than Faster R-CNN (possibly due to the non-local computations of the transformer).

cons: did not show good results in benchmarks; small objects did not perform well.
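
The bipartite matching behind the set-based loss can be sketched as follows. This is only an illustration under stated assumptions: the cost is a plain L1 distance between box centres (a simplification chosen for this sketch), `match` is a hypothetical helper, and a brute-force search over permutations stands in for the Hungarian algorithm, so it only works for tiny sets.

```python
from itertools import permutations

def match(preds, targets):
    """Find the one-to-one assignment of targets to predictions with the lowest total cost.

    preds, targets: lists of (x, y) box centres, with len(preds) >= len(targets).
    Brute force over permutations (assumption: tiny sets only).
    """
    def cost(p, t):
        return abs(p[0] - t[0]) + abs(p[1] - t[1])

    best_total, best_assign = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        total = sum(cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    # best_assign[j] is the index of the prediction matched to target j
    return list(best_assign), best_total

preds = [(0.9, 0.9), (0.1, 0.1), (0.5, 0.5)]
targets = [(0.0, 0.0), (1.0, 1.0)]
assignment, total = match(preds, targets)
```

Each target gets exactly one prediction, which is what forces unique predictions; the leftover prediction (index 2 here) would be treated as "no object".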




YOLO (anchor-free in v8):

  • concept:
  • pros: one-stage, faster
  • cons: less accurate than two-stage detectors.


  • concept: one-stage; Hourglass network + prediction module; center pooling
  • pros:
  • cons:
  • ref: here


Mask R-CNN:

  • model type: instance segmentation
  • concept:

Successor of Faster R-CNN.


ResNet + FPN (Feature Pyramid Network) to get feature maps at different scales.

The feature maps are sent to the RPN (Region Proposal Network), which generates many candidate ROIs; the candidates are compared with the ground truth by IoU to produce proposals (using IoU loss).

ROI Align: maps each ROI on the feature map back to the original image without losing alignment.

Each ROI is used for classification and bounding-box regression, and each ROI also goes through a fully convolutional network that predicts its segmentation mask.
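
The IoU comparison between candidate boxes and the ground truth can be sketched as below. This is a minimal sketch: `label_proposals` and the 0.5 threshold are assumptions made for illustration, not values taken from the notes above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, pos_thresh=0.5):
    """Mark a proposal positive if its best IoU with any ground-truth box reaches pos_thresh.

    pos_thresh is an assumed value for this sketch.
    """
    return [max(iou(p, g) for g in gt_boxes) >= pos_thresh for p in proposals]

gt = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
labels = label_proposals(proposals, gt)
```

The second proposal overlaps the ground truth by only 25 / 175 of the union, so it falls below the assumed threshold.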

  • pros:
  • cons: training and inference are both time-consuming.

Any further papers with improvements? → improved Mask R-CNN


ViT (Vision Transformer):

  • concept:

Linear projection (patch + position embedding)

Transformer encoder (self-attention + multi-head mechanism)

MLP head (does the classification)
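
The patch step that feeds the linear projection can be sketched as below. This is a minimal pure-Python sketch: `patchify` is a hypothetical helper, the image is a single-channel nested list, and the learned projection and position embedding are left out.

```python
def patchify(image, patch):
    """Split an H x W image (nested lists) into flattened patch x patch vectors, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 toy image
tokens = patchify(image, patch=2)                          # 4 tokens of length 4
```

Each flattened patch becomes one token for the encoder; for example, a 224×224 image with 16×16 patches gives 14 × 14 = 196 tokens.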

  • pros:

ViT has fewer inductive biases than a CNN, so it can see the global picture across the whole image.

  • cons:

Weaker at detecting local features (e.g. eyes, mouth) compared to a CNN.

If the dataset is small, its performance is worse than a CNN's.

Computation is expensive.

GTR (graph transformer):

  • concept:
  • pros: can learn higher-dimensional representations
  • cons: because the model is high-dimensional, it needs more data; otherwise it does not perform well
  • ref: here


  • concept:

Spatio-temporal Self-Attention for video.

Divided space-time attention(T+S):

T: the video has T frames.

S: each frame is split into S patches.
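
Why dividing attention into a temporal pass and a spatial pass is cheaper can be shown with a small count. This is a sketch under assumptions: S is read as the number of patches (spatial tokens) per frame, any class token is ignored, and `comparisons_per_token` is a hypothetical helper.

```python
def comparisons_per_token(num_frames, patches_per_frame):
    """Number of query-key comparisons for one token, ignoring any class token."""
    # Joint space-time attention: attend to every patch in every frame.
    joint = num_frames * patches_per_frame
    # Divided attention: a temporal pass over T frames, then a spatial pass over S patches.
    divided = num_frames + patches_per_frame
    return joint, divided

joint, divided = comparisons_per_token(num_frames=8, patches_per_frame=196)
```

With 8 frames of 196 patches each, one token makes 1568 comparisons under joint attention but only 204 under the divided scheme.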

  • pros: low training and inference cost; works on videos longer than one minute.
  • cons: video data benchmark.

Segment Anything:

  • model purpose: instance segmentation
  • concept:
  • pros:
  • cons: it needs a prompt as guidance.


  • model purpose: semantic segmentation, using connected-component segmentation
  • concept:

Self-supervised Vision Transformer.

The self-supervised approach contains:

a momentum encoder,

multi-crop training,

and the use of small patches with ViTs.

knowledge distillation:

two identical architectures, student-teacher networks.

Backpropagation runs only on the student network; the teacher network's weights are updated as an EMA (exponential moving average) of the student's.
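
The EMA update of the teacher can be sketched as below. This is a minimal sketch with weights as flat lists; `ema_update` is a hypothetical helper, and the momentum values are illustrative assumptions.

```python
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher weight towards the student weight; no gradients touch the teacher."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 1.0]
student = [1.0, 1.0]   # pretend these came out of a gradient step on the student
for _ in range(3):
    # momentum=0.5 is used here only so the drift is easy to follow by hand
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed version of the student.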

  • pros:

self-supervised learning can help to learn semantic segmentation.

pre-trained on large dataset.

  • cons: heavy model with 10B parameters, expensive to fine-tune → currently using DINOv2 + LoRA (check John's code) as the solution.

LoRA: freezes the pretrained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture.
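
The LoRA idea can be sketched for a single linear layer as below. This is a minimal pure-Python sketch, not the code referenced above: `LoRALinear` is a hypothetical class, the constant initialisation of `a` is chosen only for determinism (an assumption), and no training loop is shown.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    def __init__(self, weight, rank, alpha=1.0):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                             # pretrained weight, kept frozen
        self.a = [[0.01] * d_in for _ in range(rank)]    # trainable, rank x d_in
        self.b = [[0.0] * rank for _ in range(d_out)]    # trainable, d_out x rank, zero init
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)                    # frozen path: W x
        update = matvec(self.b, matvec(self.a, x))       # low-rank path: B (A x)
        return [y + self.scale * u for y, u in zip(base, update)]

layer = LoRALinear(weight=[[1.0, 2.0], [3.0, 4.0]], rank=1)
before = layer.forward([1.0, 1.0])
layer.b = [[1.0], [0.0]]            # pretend training changed B
after = layer.forward([1.0, 1.0])
```

Because `b` starts at zero, the layer initially behaves exactly like the frozen model. Only `a` and `b` are trained, which is rank × (d_in + d_out) values instead of d_in × d_out.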

Mask DINO:

  • concepts:

Adds a mask prediction branch.

Adds noise to ground-truth boxes to speed up convergence.

Unified query selection for masks: initializes both content queries and anchor-box queries for the decoder.
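
The box-noising step can be sketched as below. This is a sketch under assumptions: boxes are (cx, cy, w, h) in normalised coordinates, `noise_box` is a hypothetical helper, and the noise scale of 0.4 is an illustrative value rather than a setting from the paper.

```python
import random

def noise_box(box, scale=0.4, rng=random):
    """Jitter the centre and size of a (cx, cy, w, h) box, then clamp to [0, 1]."""
    cx, cy, w, h = box
    cx += rng.uniform(-1, 1) * scale * w / 2    # shift centre by a fraction of the half-width
    cy += rng.uniform(-1, 1) * scale * h / 2    # shift centre by a fraction of the half-height
    w *= 1 + rng.uniform(-1, 1) * scale         # rescale width
    h *= 1 + rng.uniform(-1, 1) * scale         # rescale height
    return tuple(min(max(v, 0.0), 1.0) for v in (cx, cy, w, h))

rng = random.Random(0)                          # fixed seed so the sketch is repeatable
gt_box = (0.5, 0.5, 0.2, 0.2)
noised = [noise_box(gt_box, rng=rng) for _ in range(4)]
```

The noised boxes are fed to the decoder as extra queries, and the model is trained to recover the original ground-truth box from each of them.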

  • pros: good on CV tasks
  • cons: GPU memory limitation
  • ref:

Swin Transformer:

  • concepts:

Hierarchical feature maps: patch merging decreases the feature-map size at each stage.

Window Multi-Head Self-Attention (W-MSA): attention is computed only within each window.

Shifted Window Multi-Head Self-Attention (SW-MSA): lets information flow between windows.
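
How shifting the windows lets neighbouring windows exchange information can be sketched as below. This is a sketch only: `window_ids` is a hypothetical helper that labels which window each grid position falls into, the shift is implemented as a cyclic shift, and the attention masking needed for positions that wrap around the border is left out.

```python
def window_ids(size, window, shift=0):
    """For a size x size grid, return the window index of each position after a cyclic shift."""
    per_row = size // window
    ids = []
    for r in range(size):
        row = []
        for c in range(size):
            sr, sc = (r + shift) % size, (c + shift) % size   # cyclic shift of the grid
            row.append((sr // window) * per_row + (sc // window))
        ids.append(row)
    return ids

regular = window_ids(size=4, window=2)            # plain windows
shifted = window_ids(size=4, window=2, shift=1)   # shifted by half a window
```

Positions (1, 1) and (1, 2) sit in different windows in the regular layout but share a window after the shift, so alternating the two layouts connects neighbouring windows.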

  • pros: reduce computation
  • cons: none given in the paper

U-Net, Attention U-Net, Optimized U-Net for brain tumor segmentation

  • concept:

Encoder-decoder architecture composed of conv layers.

Upper layers capture a local view; lower layers capture a global view.

Each level has a ‘skip connection’ that sends its local features to the decoder, so the model keeps local features.

The receptive field becomes larger as the layers go deeper.
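
The growth of the receptive field with depth can be checked with a small calculation. This is a sketch: `receptive_field` is a hypothetical helper, and the layer list is an assumed U-Net-like encoder (two 3×3 convs followed by a 2×2 max-pool, repeated twice).

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf, jump, out = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # striding multiplies the spacing between output positions
        out.append(rf)
    return out

# Assumed encoder: conv3, conv3, pool2, conv3, conv3, pool2
encoder = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
fields = receptive_field(encoder)
```

After the first pooling step each 3×3 conv adds 4 input pixels instead of 2, which is why the deeper levels see a more global view.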

  • pros: suitable for medical images because local features are preserved; suitable for high-resolution images.
  • cons: excessive downsampling leads to more loss of spatial information
  • ref:



  • concept:

CNN-based foundation model

Long-range dependencies (large effective receptive fields): the sampling-point offsets are flexible.

Adaptive spatial aggregation: the weights are learnable and conditioned on the input X.
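
The two properties above can be sketched for a single output location in the style of a deformable convolution. This is a sketch under assumptions: the offsets and scores are passed in by hand (in a real model they would be predicted from the input X), the softmax normalisation of the weights is a choice made for this sketch, and `deformable_aggregate` is a hypothetical helper.

```python
import math

def bilinear(fmap, y, x):
    """Sample a 2D feature map at a fractional (y, x), treating out-of-range pixels as 0."""
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < len(fmap) and 0 <= xx < len(fmap[0]):
                total += fmap[yy][xx] * (1 - abs(y - yy)) * (1 - abs(x - xx))
    return total

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

def deformable_aggregate(fmap, centre, offsets, scores):
    """Weighted sum of samples taken at centre + offset for each sampling point."""
    weights = softmax(scores)                     # aggregation weights sum to 1
    cy, cx = centre
    return sum(w * bilinear(fmap, cy + dy, cx + dx)
               for w, (dy, dx) in zip(weights, offsets))

fmap = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
out = deformable_aggregate(fmap, centre=(1, 1),
                           offsets=[(0.0, 0.0), (0.5, 0.5), (-1.0, 1.0)],
                           scores=[0.0, 0.0, 0.0])
```

Because the offsets are free to move, the sampling points are not tied to a fixed kernel grid, and because the weights depend on the scores, the aggregation can change per input.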

  • pros:

Has the potential to be a foundation model.

  • cons:

Latency is an issue



