A recap of the models I am currently working with.
Providing pros and cons so each model's properties are easy to see at a glance.
Detectron2:
concept: next-generation platform for object detection and segmentation.
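A minimal usage sketch, assuming detectron2 is installed and using one of its model-zoo Mask R-CNN configs (the config name, score threshold, and image path are illustrative):
```python
# Detectron2 usage sketch: load a model-zoo config, build a predictor, run it
# on one image. Config name / threshold / image path are illustrative.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # keep detections above 0.5

predictor = DefaultPredictor(cfg)
image = cv2.imread("example.jpg")             # BGR, as detectron2 expects
outputs = predictor(image)                    # dict with an "instances" field
print(outputs["instances"].pred_classes)
```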
DETR (DEtection TRansformer), 2020
model purpose: object detection
concept:
Set-based global loss that forces unique predictions via bipartite (Hungarian) matching; see the matching sketch below.
Transformer encoder-decoder architecture
pros: better at large-object detection than Faster R-CNN (could be due to its non-local computations).
cons: did not show strong results on benchmarks; small objects perform poorly.
ref:
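A minimal sketch of the bipartite matching step using SciPy's Hungarian solver; the cost here is a simplified class-probability + L1 box cost (the real DETR cost also adds a generalized-IoU term, and the weight below is illustrative):
```python
# DETR-style bipartite matching sketch: build a cost matrix between the
# predictions and the ground-truth objects, then solve the assignment with the
# Hungarian algorithm so each ground-truth box gets a unique prediction.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: [Q, num_classes], pred_boxes: [Q, 4] (cx, cy, w, h)
    # gt_labels: [G], gt_boxes: [G, 4]
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # [Q, G]
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # L1 distance, [Q, G]
    cost = cost_class + 5.0 * cost_bbox                 # illustrative weighting
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx  # prediction pred_idx[i] is matched to GT gt_idx[i]
```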
YOLO (anchor-free in v8):
- concept:
- pros: one-stage, faster inference
- cons: generally less accurate than two-stage detectors.
CenterNet:
- concept: one-stage; Hourglass backbone + prediction module; center pooling (see the sketch below).
- pros:
- cons:
- ref: here
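A rough sketch of the center-pooling idea (the response at each location becomes the max along its row plus the max along its column); the official implementation uses cascaded corner pooling on separate conv branches, so this is only an approximation:
```python
# Center pooling sketch: add the horizontal max and the vertical max at every
# position, so a true object center (which sees strong responses along both
# directions) gets boosted. Simplified vs. the paper's cascaded corner pools.
import torch

def center_pool(feat: torch.Tensor) -> torch.Tensor:
    # feat: [N, C, H, W] feature map
    row_max = feat.max(dim=3, keepdim=True).values  # max over width  -> [N, C, H, 1]
    col_max = feat.max(dim=2, keepdim=True).values  # max over height -> [N, C, 1, W]
    return row_max + col_max                        # broadcasts to [N, C, H, W]
```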
Mask R-CNN:
- model type: instance segmentation
- concept:
successor of Faster R-CNN
two-stage
ResNet + FPN (Feature Pyramid Network) to get feature maps at different scales
The RPN (Region Proposal Network) runs over these feature maps to generate many RoIs; anchors are compared to ground truth by IoU to supervise the proposals (IoU-based loss)
RoIAlign: maps each RoI back onto the feature map without quantization, so features stay aligned with the original image (see the sketch below)
Each RoI then goes through classification and bbox-regression heads, plus a small fully convolutional network that predicts a per-instance segmentation mask.
- pros:
- cons: training and inference are time-consuming.
any further papers for improvement? → improved Mask R-CNN variants
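A small RoIAlign sketch using torchvision's op; the stride-16 feature map and the box coordinates are illustrative assumptions:
```python
# RoIAlign sketch: bilinearly sample each RoI from the feature map (no
# quantization), producing a fixed 7x7 grid per RoI for the downstream heads.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 68)              # stride-16 feature map
boxes = torch.tensor([[0., 48., 64., 320., 256.]])  # [batch_idx, x1, y1, x2, y2] in image coords
pooled = roi_align(features, boxes, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7])
```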
ViT:
- concept:
Linear projection of image patches + position embeddings (see the sketch below)
Transformer encoder (multi-head self-attention)
MLP head (for classification)
- pros:
ViT has fewer inductive biases than a CNN, so it can see the global picture across the whole image.
- cons:
Weaker at detecting local features (e.g. eyes, mouth) compared to a CNN.
If the dataset is small, its performance is worse than a CNN's.
Computation is expensive.
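A compact sketch of the pipeline above (patch projection, [CLS] token, position embeddings, encoder, MLP head); sizes are illustrative, and PyTorch's built-in encoder layers stand in for the paper's exact block:
```python
# ViT sketch: a strided conv implements the linear projection of flattened
# patches; a [CLS] token plus learned position embeddings feed a transformer
# encoder; the [CLS] output goes to the classification head.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img=224, patch=16, dim=192, depth=4, heads=3, classes=10):
        super().__init__()
        num_patches = (img // patch) ** 2
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, classes)             # MLP head

    def forward(self, x):                               # x: [B, 3, H, W]
        x = self.proj(x).flatten(2).transpose(1, 2)     # [B, N, dim] patch tokens
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos       # prepend [CLS], add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                       # classify from the [CLS] token
```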
GTR (Graph Transformer):
- concept:
- pros: can learn higher-dimensional representations
- cons: because the model is high-dimensional, it needs more data; otherwise it does not perform well
- ref: here
TimeSformer:
- concept:
Spatio-temporal Self-Attention for video.
Divided space-time attention (T + S), see the sketch below:
T: temporal attention across the T frames, at the same spatial location
S: spatial attention within each frame, over that frame's S patches
- pros: low training and inference cost; works on videos longer than 1 minute.
- cons: video data benchmarks.
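A shape-only sketch of divided space-time attention: temporal attention first (tokens grouped by spatial position), then spatial attention (tokens grouped by frame). `attn_t` and `attn_s` stand in for multi-head self-attention blocks; residuals and layer norms are omitted:
```python
# Divided space-time attention sketch: reshape so temporal attention sees the
# T frames at one spatial position, then reshape so spatial attention sees the
# S patches of one frame. Shapes only; norms/residuals omitted.
import torch

def divided_space_time(x, attn_t, attn_s, B, T, S):
    # x: [B, T*S, D] patch tokens for a clip (frame-major order)
    D = x.size(-1)
    xt = x.view(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
    xt = attn_t(xt)                                   # temporal attention over T
    xs = xt.view(B, S, T, D).permute(0, 2, 1, 3).reshape(B * T, S, D)
    xs = attn_s(xs)                                   # spatial attention over S
    return xs.view(B, T, S, D).reshape(B, T * S, D)
```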
Segment Anything:
- model purpose: instance segmentation
- concept:
- pros:
- cons: it needs a prompt (e.g. points or boxes) as guidance; see the usage sketch below.
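A minimal prompting sketch, assuming the `segment_anything` package and a downloaded ViT-H checkpoint (paths and the prompt point are placeholders):
```python
# Segment Anything needs a prompt: here a single foreground point.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                      # compute the image embedding once
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),        # (x, y) prompt point
    point_labels=np.array([1]),                 # 1 = foreground
    multimask_output=True)                      # return several candidate masks
print(masks.shape, scores)
```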
DINOv2:
- model purpose: semantic segmentation, using connected-component segmentation
- concept:
Self-supervised Vision Transformer.
The self-supervised approach includes:
momentum encoder.
multi-crop training.
and the use of small patches with ViTs
knowledge distillation:
two identical architectures: student and teacher networks.
Backprop updates only the student; the teacher is updated via an EMA (exponential moving average) of the student's weights.
- pros:
self-supervised learning helps the model learn features useful for semantic segmentation.
pre-trained on a large dataset.
- cons: heavy model with billions of parameters, expensive to fine-tune → currently using DINOv2 + LoRA (check John's code) as a solution.
LoRA: freezes the pretrained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture (see the sketch below).
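A minimal LoRA sketch: wrap a frozen pretrained linear layer and add a trainable low-rank update (the rank and scaling below are illustrative, and this is not John's actual code):
```python
# LoRA sketch: the pretrained weight stays frozen; only the low-rank matrices
# A and B (update = scale * B @ A) receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```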
Mask DINO:
- concept:
Adds a mask prediction branch.
Adds noise to ground-truth boxes (denoising training) to speed up convergence; see the sketch below.
Unified query selection for masks: initializes both content and anchor-box queries for the decoder.
- pros: performs well across CV tasks (detection and segmentation)
- cons: GPU memory is a limitation
- ref:
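A rough sketch of the box-noising idea used for denoising training (the noise scales are illustrative, not the paper's settings):
```python
# Denoising-training sketch: jitter ground-truth boxes (center shift + size
# scaling); the decoder is then trained to reconstruct the clean boxes, which
# speeds up convergence. Noise scales are illustrative.
import torch

def noise_gt_boxes(gt_boxes, center_noise=0.2, size_noise=0.2):
    # gt_boxes: [N, 4] as (cx, cy, w, h), normalized to [0, 1]
    cxcy, wh = gt_boxes[:, :2], gt_boxes[:, 2:]
    cxcy = cxcy + (torch.rand_like(cxcy) * 2 - 1) * center_noise * wh
    wh = wh * (1 + (torch.rand_like(wh) * 2 - 1) * size_noise)
    return torch.cat([cxcy, wh], dim=1).clamp(0.0, 1.0)
```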
Swin Transformer:
- concept:
Hierarchical feature maps: patch merging progressively reduces the feature-map size
Window Multi-Head Self-Attention (W-MSA): attention is computed only within each local window
Shifted Window Multi-Head Self-Attention (SW-MSA): shifts the window grid so information can flow between windows (see the sketch below)
- pros: reduces computation (window attention scales linearly with image size instead of quadratically)
- cons: none noted from the paper
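A small sketch of window partitioning and the cyclic shift applied before SW-MSA (window size and feature-map shape are illustrative):
```python
# Swin sketch: split the feature map into non-overlapping windows for W-MSA;
# cyclically shift it (torch.roll) before SW-MSA so neighbouring windows can
# exchange information in the next block.
import torch

def window_partition(x, ws):
    # x: [B, H, W, C] -> [num_windows * B, ws * ws, C]
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

x = torch.randn(2, 56, 56, 96)                         # stage-1 feature map
windows = window_partition(x, ws=7)                    # attention runs per window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))  # cyclic shift before SW-MSA
shifted_windows = window_partition(shifted, ws=7)
print(windows.shape, shifted_windows.shape)            # both [128, 49, 96]
```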
U-Net, Attention U-Net, Optimized U-Net for brain tumor segmentation:
- concept:
Encoder-decoder architecture composed of convolutional layers.
Upper (shallow) layers capture a local view; lower (deep) layers capture a global view.
Each encoder level has a skip connection that passes its local features to the matching decoder level, so the model keeps local detail (see the sketch below).
The receptive field grows larger as the layers go deeper.
- pros: suitable for medical images because local features are preserved; also suitable for high-resolution images.
- cons: excessive downsampling leads to more loss of spatial information
- ref:
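A tiny encoder-decoder sketch with one skip connection, to show how encoder features are concatenated back into the decoder (channel counts and depth are illustrative):
```python
# Minimal U-Net-style sketch: the encoder's pre-pooling features are
# concatenated into the decoder via a skip connection, so local detail
# survives the downsampling path.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e = self.enc(x)                  # shallow, local, high-resolution features
        m = self.mid(self.down(e))       # deeper, more global features
        u = self.up(m)                   # upsample back to the encoder resolution
        u = torch.cat([u, e], dim=1)     # skip connection keeps local detail
        return self.head(self.dec(u))
```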
InternImage:
- concept:
CNN-based foundation model (built on deformable convolution v3)
Long-range dependencies (large effective receptive field): the sampling-point offsets are flexible and learned
Adaptive spatial aggregation: the aggregation weights are learnable and conditioned on the input x (see the sketch below)
- pros:
potential to be foundation model
- cons:
Latency is an issue
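A simplified stand-in for the two ideas above, using torchvision's deformable convolution (DCNv2-style offsets and modulation mask); InternImage's actual operator is DCNv3, which additionally uses grouping and softmax-normalized weights:
```python
# Deformable-conv sketch: the sampling offsets and the per-point aggregation
# weights (mask) are both predicted from the input, giving flexible sampling
# points and input-conditioned aggregation. DCNv2-style stand-in for DCNv3.
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformBlock(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2 * k * k, k, padding=k // 2)  # (dy, dx) per sampling point
        self.mask = nn.Conv2d(ch, k * k, k, padding=k // 2)        # weight per sampling point
        self.weight = nn.Parameter(torch.randn(ch, ch, k, k) * 0.01)

    def forward(self, x):
        offsets = self.offset(x)            # offsets conditioned on the input
        mask = torch.sigmoid(self.mask(x))  # aggregation weights conditioned on the input
        return deform_conv2d(x, offsets, self.weight, padding=1, mask=mask)

y = DeformBlock()(torch.randn(1, 64, 32, 32))
print(y.shape)   # torch.Size([1, 64, 32, 32])
```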