Jialin Wu Homepage
Publications
Gemini 2.5 Flash Image
Try it on https://aistudio.google.com/models/gemini-2-5-flash-image
Google
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash …
Gemini Team
PDF
Distilling vision-language models on millions of videos
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this …
Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krähenbühl, Liangzhe Yuan
PDF
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from …
Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut
PDF
CausalLM Is Not Optimal for In-Context Learning
Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model …
Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
PDF
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the …
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, et al. (43 authors)
PDF
RT-2: Vision-language-action models transfer web knowledge to robotic control
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to …
Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, et al. (54 authors)
PDF
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
Most Outside-Knowledge Visual Question Answering (OK-VQA) systems employ a two-stage framework that first retrieves external knowledge …
Jialin Wu, Raymond J. Mooney
PDF
Multi-Modal Answer Validation for Knowledge-Based VQA
The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to …
Jialin Wu, Jiasen Lu, Ashish Sabharwal, Roozbeh Mottaghi
PDF
Improving VQA and its Explanations by Comparing Competing Explanations
Most recent state-of-the-art Visual Question Answering (VQA) systems are opaque black boxes that are only trained to fit the answer …
Jialin Wu, Liyan Chen, Raymond J. Mooney
PDF
CoNAN: A Complementary Neighboring-based Attention Network for Referring Expression Generation
Daily scenes are complex in the real world due to occlusion, undesired lighting conditions, etc. Although humans handle those …
Jungjun Kim, Hanbin Ko, Jialin Wu
PDF
Self-Critical Reasoning for Robust Visual Question Answering
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because …
Jialin Wu, Raymond J. Mooney
PDF
Generating Question Relevant Captions to Aid Visual Question Answering
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We …
Jialin Wu, Zeyuan Hu, Raymond J. Mooney
PDF
Faithful Multimodal Explanation for Visual Question Answering
AI systems’ ability to explain their reasoning is critical to their utility and trustworthiness. Deep neural networks have …
Jialin Wu, Raymond J. Mooney
PDF
Dynamic Filtering with Large Sampling Field for ConvNets
We propose a dynamic filtering strategy with large sampling field for ConvNets (LS-DFN), where the position-specific kernels learn from …
Jialin Wu, Dai Li, Yu Yang, Chandrajit Bajaj, Xiangyang Ji
PDF