MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

KingsmanVince@kbin.social · 1 year ago

Demystifying CLIP Data

KingsmanVince@kbin.social · 1 year ago

Indeed, it would be great if the authors did so. I personally found some unofficial implementations:

KingsmanVince@kbin.social · edited 1 year ago

PaLI-3 Vision Language Models: Smaller, Faster, Stronger

KingsmanVince@kbin.social · 1 year ago

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

KingsmanVince@kbin.social · 1 year ago

Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models

KingsmanVince@kbin.social · 1 year ago

IIRC, DETR generates a sequence to predict the bounding boxes of objects. I think this paradigm can be applied to such models. “Think before you locate” could be a new path to explore.

KingsmanVince@kbin.social · 1 year ago

CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

KingsmanVince@kbin.social · 1 year ago

Scaling Vision-Language Models with Sparse Mixture of Experts

KingsmanVince@kbin.social · 1 year ago

Hydra-MoE: A new class of Open-Source Mixture of Experts

KingsmanVince@kbin.social · 1 year ago

Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks

KingsmanVince@kbin.social · 1 year ago

Foundational Models Defining a New Era in Vision: A Survey and Outlook

KingsmanVince@kbin.social · 1 year ago

https://github.com/FudanDISC/weakly-supervised-mVLP/tree/master

KingsmanVince@kbin.social · 1 year ago

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

KingsmanVince@kbin.social · 1 year ago

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

KingsmanVince@kbin.social · 1 year ago

Vision Language Transformers: A Survey

KingsmanVince@kbin.social · 1 year ago

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

KingsmanVince@kbin.social · 1 year ago

RecycleGPT: An Autoregressive Language Model with Recyclable Module

KingsmanVince@kbin.social · 1 year ago

AI replaces programmers for real

KingsmanVince@kbin.social · 1 year ago

I also want to share some resources.
For PyTorch:

https://pytorch.org/tutorials/ The basic tutorials cover the fundamentals, but some of the more advanced ones might be outdated.
https://www.learnpytorch.io/ The author focuses mostly on computer vision, but gives an overview from research to production.

For TPUs:

https://github.com/ayaka14732/tpu-starter A full guide to using TPUs with JAX.

KingsmanVince