
Try it on https://aistudio.google.com/models/gemini-2-5-flash-image

Google

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash …

Gemini Team

Distilling vision-language models on millions of videos

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this …

Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from …

Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut

CausalLM Is Not Optimal for In-Context Learning

Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model …

Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut

PaLI-X: On Scaling up a Multilingual Vision and Language Model

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the …

Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, et al. (43 authors)

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to …

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, et al. (54 authors)

Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering

Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge …

Jialin Wu, Raymond J. Mooney

Multi-Modal Answer Validation for Knowledge-Based VQA

The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to …

Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi

Improving VQA and its Explanations by Comparing Competing Explanations

Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer …

Jialin Wu, Liyan Chen, Raymond J. Mooney