I am a Senior Research Scientist at Apple. I lead research at the intersection of reinforcement learning, reasoning, and alignment for multimodal foundation models. My work develops novel RL techniques that advance multimodal reasoning capabilities while ensuring safety. My research has been deployed at scale in production systems and published at venues including ICLR, EMNLP, CVPR, and ACL.

I received my PhD from the School of Computer Science at Carnegie Mellon University in 2022, where I was advised by Florian Metze and Alan W Black. My doctoral research focused on multimodal learning from video, developing architectures for fusing audio, visual, and textual modalities for tasks like video summarization and vision-language reasoning.

My work has been recognized with the Meta (Facebook) Fellowship, the Center for Machine Learning and Health Fellowship, and selection as a Rising Star in EECS by UC Berkeley. During my PhD, I interned at the Allen Institute for AI, Meta AI, and Abridge AI. I earned my MS from Carnegie Mellon University (2018) and my BE from Pune Institute of Computer Technology (2016).

News

[Jan 2026] VLSU accepted to ICLR 2026!
[Nov 2025] Paper on JointMMSafe accepted to NeurIPS 2025 Workshop on multimodal foundation model safety.
[Nov 2024] I started working on AI Safety at Apple. Particularly excited about multimodal safety and vision-language alignment!
[Jun 2024] Paper accepted to Interspeech 2024 on efficient on-device multimodal LLMs.
[Sep 2022] Paper on multimodal reasoning accepted to Findings of EMNLP 2022.
[Jul 2022] Joined Apple as a Senior Research Scientist working on multimodal foundation models.
[May 2022] Defended my PhD!
[Mar 2021] Paper on the How2Sign dataset accepted at CVPR 2021.
[Oct 2020] Invited to participate in the Rising Stars in EECS Workshop at UC Berkeley.
[Jun 2019] Paper on multimodal video summarization accepted at ACL 2019.
[Dec 2018] Received the Facebook Fellowship.

Research

Full List of Papers on Google Scholar

VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
Shruti Palaskar, Leon Gatys, Mona Abdelrahman, Mar Jacobo, Larry Lindsey, Rutika Moharir, Gunnar Lund, Yang Xu, Navid Shiee, Jeffrey Bigham, Charles Maalouf, Joseph Yitan Cheng
ICLR 2026
We present VLSU, a comprehensive benchmark of 8,187 samples across 15 harm categories that reveals critical failures in multimodal safety: models achieve over 90% accuracy on individual modalities, yet performance drops to 20-55% when joint image-text reasoning is required, and 34% of errors occur despite correct unimodal classification.
SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng
Under Review
We introduce SafetyPairs, a scalable framework that generates counterfactual image pairs differing only in safety-critical features. Our benchmark of 3,020+ images across 9 safety categories reveals weaknesses in vision-language models' fine-grained safety understanding and improves training efficiency for guard models.
Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection
Shruti Palaskar, Oggi Rudovic, Sameer Dharur, Florian Pesce, Gautam Krishna, Aswin Sivaraman, Jack Berkowitz, Ahmed Hussen Abdelaziz, Saurabh Adya, Ahmed Tewfik
Interspeech 2024
We propose Fusion Low Rank Adaptation (FLoRA) to efficiently adapt text-only LLMs to multimodal inputs. For device-directed speech detection, FLoRA achieves a 22% reduction in equal error rate (EER) over text-only approaches while tuning only a fraction of the parameters, and with adapter dropout it improves robustness to missing data by 20% over full fine-tuning.
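
To make the idea concrete, here is a minimal PyTorch sketch of a LoRA-style fusion branch with adapter dropout, assuming the extra modality is fused additively into a frozen linear projection; the class name, shapes, and fusion point are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a LoRA-style fusion branch with adapter dropout.
# Names, shapes, and the fusion point are assumptions, not the paper's code.
import torch
import torch.nn as nn

class FusionLoRALinear(nn.Module):
    """Frozen text-only projection plus a low-rank branch driven by another modality."""

    def __init__(self, base: nn.Linear, modality_dim: int, rank: int = 8,
                 adapter_dropout: float = 0.2):
        super().__init__()
        self.base = base                                   # frozen text-only weight
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(modality_dim, rank, bias=False)      # low-rank "A"
        self.up = nn.Linear(rank, base.out_features, bias=False)   # low-rank "B"
        nn.init.zeros_(self.up.weight)                     # adapter starts as a no-op
        self.adapter_dropout = adapter_dropout

    def forward(self, text_hidden: torch.Tensor,
                modality_feat: torch.Tensor | None = None) -> torch.Tensor:
        out = self.base(text_hidden)
        if modality_feat is None:
            return out                 # degrade gracefully to the text-only model
        # Adapter dropout: occasionally drop the fusion branch during training so the
        # model also learns to cope when the extra modality is missing at inference.
        if self.training and torch.rand(()).item() < self.adapter_dropout:
            return out
        return out + self.up(self.down(modality_feat))


# Example: fuse a 512-dim audio embedding into a 768-dim text projection.
layer = FusionLoRALinear(nn.Linear(768, 768), modality_dim=512)
text_hidden = torch.randn(2, 10, 768)        # (batch, seq, hidden)
audio_feat = torch.randn(2, 10, 512)         # time-aligned audio features (assumed)
print(layer(text_hidden, audio_feat).shape)  # torch.Size([2, 10, 768])
```

Zero-initializing the up-projection means training starts from the unchanged text-only model, which is the usual LoRA convention.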
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
Shruti Palaskar, Akshita Bhagia, Yonatan Bisk, Florian Metze, Alan W Black, Ana Marasović
Findings of EMNLP 2022
We investigate self-rationalization (jointly generating answers and explanations) across three vision-language tasks, finding that recent advances like CLIP and language model scaling don't consistently improve multimodal generation beyond captioning, motivating the need for novel architectures for complex text generation from images and text.
How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language
Amanda Duarte, Shruti Palaskar, Lucas Ventura, Deepti Ghadiyaram, Kenneth DeHaan, Florian Metze, Jordi Torres, Xavier Giro-i-Nieto
CVPR 2021
We introduce How2Sign, a multimodal and multiview continuous ASL dataset with 80+ hours of sign language video paired with speech, English transcripts, and depth data. A study with ASL signers confirms that videos synthesized using our dataset can be understood, supporting its real-world applicability for sign language recognition and translation research.
Multimodal Speech Summarization through Semantic Concept Learning
Shruti Palaskar, Ruslan Salakhutdinov, Alan W Black, Florian Metze
Interspeech 2021
We propose a cascaded multimodal speech summarization model that generates semantic concepts as an interpretable intermediate step. Using multimodal fusion on How2 data, we achieve significant improvements of 7.5 METEOR and 5.1 ROUGE-L points over previous methods, demonstrating scalability on 2,000+ hours of video.
Multimodal Abstractive Summarization of How2 Videos
Shruti Palaskar, Jindrich Libovicky, Spandana Gella, Florian Metze
ACL 2019
We present a multi-source sequence-to-sequence model with hierarchical attention for abstractive video summarization that fuses video features with speech transcripts. We demonstrate effective integration of multimodal information on How2 instructional videos and propose Content F1, a new metric measuring the semantic adequacy of summaries.
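
As a rough illustration of what a content-based metric can look like, here is a short Python sketch that scores F1 over content-word overlap between a generated and a reference summary; the stopword list, tokenization, and the absence of an alignment step are simplifications, so the paper's exact definition of Content F1 may differ.

```python
# Rough sketch of a content-word F1, in the spirit of Content F1; the paper's
# exact definition (e.g., how words are matched) may differ.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "are",
             "it", "this", "that", "with", "for"}

def content_f1(hypothesis: str, reference: str) -> float:
    """F1 over overlapping content words, ignoring a small stopword list."""
    hyp = Counter(w for w in hypothesis.lower().split() if w not in STOPWORDS)
    ref = Counter(w for w in reference.lower().split() if w not in STOPWORDS)
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(content_f1("mix the flour and sugar in a bowl",
                 "combine flour and sugar in a large bowl"))  # ~0.67
```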

Other

In my free time, I am an amateur barista, baker, painter, and pilot.