Full List of Papers on Google Scholar
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
ICLR 2026
We present VLSU, a comprehensive benchmark with 8,187 samples across 15 harm categories that reveals critical failures in multimodal safety: while models achieve 90%+ accuracy on individual modalities, performance drops to 20-55% when joint image-text reasoning is required, with 34% of errors occurring despite correct unimodal classification.
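As an illustration of how the joint-vs-unimodal error breakdown above can be computed, here is a minimal Python sketch; the field names and safety labels are invented for the example and are not the benchmark's schema.

```python
# Hypothetical illustration of the error breakdown described above: counting
# joint (image+text) misclassifications that occur even though each unimodal
# judgment was correct. Labels and fields are made up for this sketch.
from dataclasses import dataclass

@dataclass
class Prediction:
    image_only: str   # model's safety label given the image alone
    text_only: str    # model's safety label given the text alone
    joint: str        # model's safety label given image + text together
    gold_image: str
    gold_text: str
    gold_joint: str

def joint_error_breakdown(preds: list[Prediction]) -> dict:
    joint_errors = [p for p in preds if p.joint != p.gold_joint]
    # Errors where both unimodal calls were right but the joint call was wrong.
    unimodal_correct = [
        p for p in joint_errors
        if p.image_only == p.gold_image and p.text_only == p.gold_text
    ]
    return {
        "joint_error_rate": len(joint_errors) / len(preds),
        "share_with_correct_unimodal": (
            len(unimodal_correct) / len(joint_errors) if joint_errors else 0.0
        ),
    }

if __name__ == "__main__":
    demo = [
        Prediction("safe", "safe", "safe", "safe", "safe", "unsafe"),
        Prediction("safe", "unsafe", "unsafe", "safe", "unsafe", "unsafe"),
    ]
    print(joint_error_breakdown(demo))
```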
SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng
Under Review
We introduce SafetyPairs, a scalable framework that generates counterfactual image pairs differing only in safety-critical features. Our benchmark of 3,020+ images across 9 safety categories reveals weaknesses in vision-language models' fine-grained safety understanding and improves training efficiency for guard models.
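A natural way to score models on such pairs is pairwise: credit is given only when the model clears the safe image and flags its unsafe counterpart. The sketch below illustrates that idea under assumed names; the `classify` callable and label strings are placeholders, not the paper's evaluation API.

```python
# Illustrative pairwise scoring for counterfactual image pairs: a guard model is
# credited only when it separates the two images of a pair, i.e. flags the
# unsafe variant and clears the safe one. classify() is a stand-in for whatever
# model is being evaluated.
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (path_to_safe_image, path_to_unsafe_image)

def pairwise_accuracy(
    pairs: Sequence[Pair],
    classify: Callable[[str], str],  # returns "safe" or "unsafe"
) -> float:
    correct = 0
    for safe_img, unsafe_img in pairs:
        if classify(safe_img) == "safe" and classify(unsafe_img) == "unsafe":
            correct += 1
    return correct / len(pairs) if pairs else 0.0

# Toy usage with a dummy classifier that flags filenames containing "unsafe".
pairs = [("kitchen_safe.png", "kitchen_unsafe.png")]
print(pairwise_accuracy(pairs, classify=lambda p: "unsafe" if "unsafe" in p else "safe"))
```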
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik
Interspeech 2024
We propose Fusion Low Rank Adaptation (FLoRA) to efficiently adapt text-only LLMs to multimodal inputs. For device-directed speech detection, FLoRA achieves a 22% EER reduction over text-only approaches while tuning only a fraction of the parameters, and, with adapter dropout, improves robustness to missing data by 20% over full fine-tuning.
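For intuition, here is a minimal PyTorch sketch of the two ingredients named above, low-rank adapters added to a frozen layer (one per modality) and adapter dropout that randomly disables whole adapters during training; class and parameter names are my own, and the sketch is not the paper's exact architecture.

```python
# A minimal sketch, not the paper's implementation: (1) low-rank adapters on a
# frozen linear layer, one per input modality, and (2) "adapter dropout", which
# randomly disables whole adapters during training so the model tolerates
# missing modalities at test time.
import torch
import torch.nn as nn

class FusionLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, modalities: list[str],
                 rank: int = 8, adapter_dropout: float = 0.2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the backbone stays frozen
        self.adapter_dropout = adapter_dropout
        self.down = nn.ModuleDict(
            {m: nn.Linear(base.in_features, rank, bias=False) for m in modalities})
        self.up = nn.ModuleDict(
            {m: nn.Linear(rank, base.out_features, bias=False) for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.up[m].weight)  # adapters start as a no-op

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        out = self.base(x)
        drop_adapter = self.training and torch.rand(1).item() < self.adapter_dropout
        if modality in self.down and not drop_adapter:
            out = out + self.up[modality](self.down[modality](x))
        return out

# Example: a frozen 512->512 projection with adapters for text and audio tokens.
layer = FusionLoRALinear(nn.Linear(512, 512), ["text", "audio"])
audio_tokens = torch.randn(4, 10, 512)
print(layer(audio_tokens, modality="audio").shape)  # torch.Size([4, 10, 512])
```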
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović
EMNLP 2022
We investigate self-rationalization (jointly generating answers and explanations) across three vision-language tasks, finding that recent advances like CLIP and language model scaling don't consistently improve multimodal generation beyond captioning, motivating the need for novel architectures for complex text generation from images and text.
How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language
Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto
CVPR 2021
We introduce How2Sign, a multimodal and multiview continuous ASL dataset with 80+ hours of sign language videos paired with speech, English transcripts, and depth data. A study with ASL signers confirms that videos synthesized using our dataset can be understood, validating the dataset's potential for real-world sign language recognition and translation research.
Multimodal Speech Summarization through Semantic Concept Learning
Shruti Palaskar, Ruslan Salakhutdinov, Alan W Black, Florian Metze
Interspeech 2021
We propose a cascaded multimodal speech summarization model that generates semantic concepts as an interpretable intermediate step. Using multimodal fusion on How2 data, we achieve significant improvements of 7.5 METEOR and 5.1 ROUGE-L points over previous methods, demonstrating scalability on 2,000+ hours of video.
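The cascade itself is simple to state: a first stage predicts semantic concepts from the input features, and a second stage generates the summary conditioned on those concepts, so the intermediate step can be inspected. The toy sketch below shows only that interface; the two callables are placeholders, not the models from the paper.

```python
# Schematic of a concept-then-summarize cascade (illustrative interfaces only):
# the predicted concepts are returned alongside the summary so they can be
# inspected as an interpretable intermediate representation.
from typing import Callable, Sequence

def cascaded_summarize(
    features: Sequence[float],
    predict_concepts: Callable[[Sequence[float]], list[str]],
    generate_summary: Callable[[list[str]], str],
) -> tuple[list[str], str]:
    concepts = predict_concepts(features)   # e.g. ["guitar", "tuning", "strings"]
    summary = generate_summary(concepts)    # summary grounded in those concepts
    return concepts, summary

# Toy stand-ins so the cascade runs end to end.
concepts, summary = cascaded_summarize(
    features=[0.1, 0.7, 0.2],
    predict_concepts=lambda f: ["guitar", "tuning"],
    generate_summary=lambda c: "This video shows how to tune a guitar: " + ", ".join(c),
)
print(concepts, summary, sep="\n")
```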
Multimodal Abstractive Summarization of How2 Videos
Shruti Palaskar, Jindrich Libovicky, Spandana Gella, Florian Metze
ACL 2019
We present a multi-source sequence-to-sequence model with hierarchical attention for abstractive video summarization that fuses video features with speech transcripts. We demonstrate effective integration of multimodal information on How2 instructional videos and propose Content F1, a new metric that measures the semantic adequacy of summaries.
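To make the metric concrete, here is a simplified content-word F1 in the spirit of Content F1: it scores the overlap of non-stopword tokens between a generated summary and a reference. The tokenizer and stopword list are toy stand-ins, and the paper's exact definition differs in its matching details.

```python
# Simplified content-word F1 (an approximation, not the paper's exact metric):
# strip stopwords and punctuation, then compute precision/recall/F1 over the
# remaining tokens of candidate vs. reference summaries.
from collections import Counter

STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "this",
             "that", "it", "you", "how", "for", "on", "with"}

def content_words(text: str) -> Counter:
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def content_f1(candidate: str, reference: str) -> float:
    cand, ref = content_words(candidate), content_words(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(content_f1("learn how to tune a guitar at home",
                 "this video explains tuning a guitar"))
```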