Why speech recognition models are hiding their real strategy
Audio-visual speech recognition (AVSR) systems do something remarkable: they watch your lips and listen to your voice simultaneously, piecing together understanding even when one channel is corrupted by noise. This mirrors how humans actually communicate. You can understand someone in a loud restaurant by reading their lips when audio fails. AVSR models attempt the same feat, combining acoustic features extracted from sound waves with visual features extracted from video of the speaker's face.
The promise is compelling. When audio quality degrades, video information should compensate. When visual information is ambiguous (the speaker is turned away), audio takes over. In theory, robust multimodal systems leverage whichever source is most reliable at any given moment. In practice, nobody knows how these models actually balance audio and video in real time. We measure overall error rates, but we remain blind to the internal decision-making process. That gap between what we observe and what we understand is where this research lives.
The ablation experiment shows us almost nothing
The first clue that something is wrong appears when you ablate modalities. Remove audio from an AVSR model, and the word error rate rises. Remove video, and the error rate rises too. Both modalities matter. But this experiment answers the wrong question. It tells you that both modalities are useful, not how a model weights them during decoding, how that weighting changes under noise, or whether the model trusts one modality more than it should.
A chart comparing WER across six AVSR models when audio or video is removed. The y-axis shows WER percentage on a logarithmic scale. Gray bars represent audio ablation, black bars represent video ablation. All bars show significant increases from baseline.
When audio or video is removed from six AVSR models in clean conditions, word error rate increases in both cases, but this tells us nothing about how the model balances these modalities in real time or how that balance changes under noise.
Existing interpretability methods struggle with this problem. Attention weights, the most common tool for understanding neural networks, show you where a model looks, not how much each input actually influenced the output. A model might attend to video frames while extracting the crucial information from audio. Gradient-based attribution methods measure sensitivity but not contribution. The specific challenge for AVSR is that audio and video have different temporal structures, different dimensionalities, and different information densities. A fairness framework that works for single-modality models doesn't naturally extend to comparing cross-modal contributions.
What's needed is a method that answers a precise question: "If I remove this modality and recompute the model's prediction, how much does the answer change?" This is exactly the problem that Shapley values were designed to solve.
Shapley values as a fairness lens for credit assignment
Shapley values come from cooperative game theory, where they solve a deceptively hard problem: how to fairly divide credit among players who cooperate to produce a result. Imagine three people working on a project that generates 1,200 units of value. You can't simply look at what each person produces alone because they interact. Alice might produce 500 alone, but with Bob she produces 800 total, meaning Bob adds 300. With Charlie added to Alice, they produce 1,000 together, so Charlie's contribution depends on who else is present.
The Shapley value computes a player's fair contribution by calculating their marginal value in every possible coalition and averaging those contributions. This single fairness principle satisfies four key axioms simultaneously: efficiency (all value is distributed), symmetry (equivalent players get equal credit), linearity (the method works additively), and null player (players who don't help don't get credit).
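The three-person example above can be computed exactly by averaging marginal contributions over every ordering in which the coalition could form. A minimal sketch in Python, using an internally consistent set of coalition values (the solo and pairwise numbers not fixed by the text are assumed for illustration):

```python
from itertools import permutations

# Coalition values for the three-person project. Numbers not stated
# in the text (Bob's and Charlie's solo output, the Bob+Charlie pair)
# are assumptions chosen for illustration.
v = {
    frozenset(): 0,
    frozenset({"Alice"}): 500,
    frozenset({"Bob"}): 200,
    frozenset({"Charlie"}): 300,
    frozenset({"Alice", "Bob"}): 800,
    frozenset({"Alice", "Charlie"}): 1000,
    frozenset({"Bob", "Charlie"}): 600,
    frozenset({"Alice", "Bob", "Charlie"}): 1200,
}

def shapley(players, v):
    """Average each player's marginal contribution over every
    possible ordering in which the coalition could form."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            totals[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: t / len(orderings) for p, t in totals.items()}

phi = shapley(["Alice", "Bob", "Charlie"], v)
# Efficiency axiom: the three values sum to the grand-coalition value.
```

Note how the efficiency axiom shows up directly: the three averaged contributions always sum to exactly what the full coalition produces.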
For AVSR, the game is structured like this: what happens to the model's prediction if we remove audio? If we remove video? If we remove both? A Shapley value for audio is the average marginal contribution of audio features across all possible subsets of audio and video inputs. This directly answers the question: across all ways the model could have used the available information, how much did audio actually matter?
Why this matters more than it initially appears: audio and video aren't independent. Removing audio might hurt the model less when the video is highly informative, but the same audio could be crucial in a different scenario. Shapley values naturally account for these interactions. A Shapley matrix can be constructed where each row represents an input feature from either audio or video, and each column represents a generated word. Each cell contains the Shapley value: how much that input contributed to that output. From this single matrix, three distinct types of analysis become possible.
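A sketch of how such a matrix could be built in the simplest case, treating the two whole modalities as the players. The per-token coalition scores below are hypothetical stand-ins for rerunning the model with modalities masked out; the real method attributes at the feature level, which this two-row simplification ignores:

```python
import numpy as np

def shapley_matrix(scores):
    """Build a 2 x T Shapley matrix (row 0: audio, row 1: video;
    columns: generated tokens). `scores[t]` maps the retained-modality
    pair (audio, video) to the model's score for token t. With two
    players, each Shapley value is the average of the player's two
    marginal contributions (joining first vs. joining second)."""
    M = np.zeros((2, len(scores)))
    for t, v in enumerate(scores):
        M[0, t] = 0.5 * ((v[(1, 0)] - v[(0, 0)]) + (v[(1, 1)] - v[(0, 1)]))
        M[1, t] = 0.5 * ((v[(0, 1)] - v[(0, 0)]) + (v[(1, 1)] - v[(1, 0)]))
    return M

# Hypothetical scores: one token where audio dominates, one where
# video compensates.
scores = [
    {(0, 0): -5.0, (1, 0): -1.5, (0, 1): -2.5, (1, 1): -1.0},
    {(0, 0): -5.0, (1, 0): -4.0, (0, 1): -2.0, (1, 1): -1.5},
]
M = shapley_matrix(scores)
```

Even this toy version exhibits the interaction effect described above: the same audio scores earn different credit depending on how informative the video is for that token.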
Three lenses for understanding modality decisions
The researchers introduce three different ways to extract insights from the Shapley matrix, each answering a different question about model behavior.
An overview diagram showing how the Shapley matrix generates three types of analyses. The matrix is divided into audio features and video features on the y-axis and token positions on the x-axis. Three arrows point to three different analyses: Global SHAP (aggregating across all tokens), Generative SHAP (tracking contributions across decoding steps), and Temporal Alignment SHAP (visualizing input-output correspondence).
From the Shapley matrix, three complementary analyses emerge. Global SHAP aggregates contributions across all tokens to show overall modality balance. Generative SHAP tracks how balance evolves as the model decodes word by word. Temporal Alignment SHAP reveals which input frames actually influenced which outputs.
Global SHAP aggregates all Shapley values for audio features across all generated tokens and all test samples. Do the same for video. What's the average split between modalities? This is the headline statistic: on average, how much does this model rely on audio versus video? This becomes far more interesting when measured across varying noise levels.
A bar chart showing global audio and video contributions across six AVSR models under different SNR conditions. Clean conditions (light bars), medium noise (medium bars), and severe noise (dark bars) are compared for each model. Audio consistently dominates across all conditions.
Global audio and video contributions across six AVSR models at clean and noisy acoustic conditions reveal a striking pattern: audio contributions remain high even under severe noise, suggesting a persistent audio bias in model behavior.
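Given per-sample Shapley matrices, the Global SHAP aggregation reduces to summing attributions per modality across all tokens and samples. A minimal sketch (the use of absolute values and the row split into audio and video features are assumptions for illustration, not details from the paper):

```python
import numpy as np

def global_shap(matrices, num_audio_rows):
    """Aggregate per-sample Shapley matrices (features x tokens) into
    one audio/video split. The first `num_audio_rows` rows of each
    matrix are assumed to be audio features, the rest video."""
    audio = video = 0.0
    for M in matrices:
        audio += np.abs(M[:num_audio_rows]).sum()
        video += np.abs(M[num_audio_rows:]).sum()
    total = audio + video
    return audio / total, video / total
```

Running this separately per SNR condition yields exactly the kind of grouped bar chart shown in the figure.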
Generative SHAP tracks how modality contributions change as the model generates tokens sequentially. The model doesn't freeze its strategy at the beginning. Instead, early tokens might rely more heavily on one modality, while later tokens shift toward the other. Under noise, these dynamics change substantially. This reveals the model is doing something adaptive during decoding, not applying a static formula.
A line plot showing how audio and video contributions change as the model generates tokens. Clean conditions (solid lines) and noisy conditions (dashed lines, SNR -10 dB) are compared. The curves diverge significantly over time.
As models generate tokens sequentially, their modality preferences shift. Clean and noisy conditions show markedly different patterns, with noise causing the model to diverge from its baseline strategy as decoding progresses.
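Curves like these can be derived from the same Shapley matrix by aggregating per column (per decoding step) instead of over everything at once. A hedged sketch, again assuming an audio/video row split and absolute-value aggregation:

```python
import numpy as np

def generative_shap(M, num_audio_rows):
    """Per-decoding-step audio share: for each generated token
    (column of M), the fraction of total absolute attribution that
    comes from the audio rows. Returns an array of length T."""
    audio = np.abs(M[:num_audio_rows]).sum(axis=0)
    total = np.abs(M).sum(axis=0)
    return audio / np.where(total == 0, 1.0, total)  # guard empty columns
```

Plotting this array for a clean and a noisy run of the same utterance would reproduce the diverging solid/dashed curves in the figure.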
Temporal Alignment SHAP creates heatmaps showing which audio frames and video frames contributed most to each generated token. When the model predicts a specific word, which parts of the video were most relevant: early face movements, the middle of the clip, or the end? Which parts of the audio? This reveals whether the model respects temporal structure or gets confused about timing.
Two heatmaps for AV-HuBERT showing temporal alignment. The top heatmaps show audio feature contributions in clean vs noisy conditions, with a clear diagonal pattern indicating temporal coherence. The bottom shows grouped video features (early, middle, late) with similar diagonal structure preserved under noise.
Temporal alignment remains coherent even under severe noise. The diagonal structure in heatmaps indicates that the model respects the temporal correspondence between inputs and outputs, rather than developing spurious or scrambled relationships.
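One way to quantify the diagonal structure the heatmaps show is the fraction of attribution mass falling near the scaled diagonal. This score is an illustrative construction, not the paper's metric:

```python
import numpy as np

def diagonal_mass(H, band=1):
    """Fraction of attribution mass within `band` rows of the scaled
    diagonal of an input-frames x output-tokens heatmap. Values near
    1.0 mean inputs influence temporally corresponding outputs."""
    n_in, n_out = H.shape
    total = np.abs(H).sum()
    if total == 0:
        return 0.0
    mass = 0.0
    for j in range(n_out):
        # Row of H that corresponds in time to output token j.
        center = round(j * (n_in - 1) / max(n_out - 1, 1))
        lo, hi = max(0, center - band), min(n_in, center + band + 1)
        mass += np.abs(H[lo:hi, j]).sum()
    return mass / total
```

A perfectly diagonal heatmap scores 1.0; a scrambled or reversed one scores much lower, making the clean-versus-noisy comparison a single number per condition.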
What emerges under pressure
Across six different AVSR models tested on two benchmarks at varying noise levels, three striking patterns emerge.
First: audio bias persists under noise. Intuition suggests that as audio degrades, models should shift toward visual information. Video becomes relatively more valuable when audio is corrupted. Yet the data reveals something unexpected. Even under severe noise (SNR -10 dB, where the noise is ten times as powerful as the speech), models maintain high audio contributions in their Shapley-based attribution. The audio Shapley values barely decrease.
This contradicts the narrative that AVSR systems are adaptively balancing modalities. The models aren't saying "audio is getting worse, I'll trust it less." Instead, they're assigning persistently high credit to audio even when audio is heavily corrupted. This is either because the audio encoder still extracts useful information from the noisy signal, or, more likely, because the model has learned a strong inductive bias that audio is usually reliable. This prior doesn't adapt appropriately under extreme conditions.
A bar chart showing global SHAP audio contributions across different acoustic noise types. Bars remain consistently high across white noise, Babble noise, and other conditions, even at severe SNR levels. Numbers above bars indicate achieved WER percentages.
Audio Shapley contributions remain stubbornly high across different noise types and SNR levels, revealing that the audio bias persists regardless of noise characteristics. The model doesn't learn to distrust corrupted audio proportionally.
Second: modality balance evolves during generation. The Global SHAP analysis measures an average, but the Generative SHAP analysis reveals that this average hides all the interesting behavior. Early in decoding, the model might favor one modality strongly. As it generates more tokens, the balance shifts. Under noise, these shifts become more pronounced. This suggests the model is doing something genuinely adaptive, not applying a fixed strategy.
Third: temporal alignment holds under noise. When you examine which input frames actually influenced which outputs, the correspondence remains temporally coherent even under severe noise. The heatmaps show diagonal structure (early inputs influencing early outputs, later inputs influencing later outputs), not scattered random patterns. The model isn't getting confused about timing or learning spurious correlations across distant points in time.
The hidden audio bias that explains everything
These three findings synthesize into a single insight: AVSR models suffer from a persistent audio bias that prevents them from adapting optimally to severe noise.
Consider the tension between the ablation results and the Shapley analysis. The ablation experiment shows that removing audio causes large increases in error rate. This might suggest audio is genuinely more informative. But the Shapley analysis reveals a different story. The model assigns high Shapley values to audio not because audio is reliably more informative across all conditions, but because the model learned a strong prior from realistic training data where audio is typically dominant. This prior doesn't turn off when audio becomes unreliable.
Why does this happen? Speech is predominantly an audio phenomenon. Training data naturally features high-quality audio as the primary signal and video as supplementary. Models develop an inductive bias favoring audio. This is adaptive in normal conditions but becomes maladaptive under extreme noise. The model can't distrust audio sufficiently even when it's heavily corrupted.
This explains a persistent mystery in AVSR research: why do naive models saturate in performance under extreme noise? They don't fail because they can't use visual information. They fail because they haven't learned to weight modalities appropriately based on input reliability. Throwing more noisy training data at the problem doesn't fully solve it, because the audio bias is structural, not just a function of training data composition.
Implications and open frontiers
For practitioners building AVSR systems, Dr. SHAP-AV becomes a diagnostic tool. Before deployment, run these three analyses. If you see persistent audio bias, you need explicit mechanisms to adjust modality weighting dynamically. If temporal alignment breaks down under noise, investigate your temporal modeling layers. This shifts Shapley-based attribution from a post-hoc explanation tool into an active diagnostic for uncovering design flaws.
The finding that SNR is the dominant factor driving modality weighting suggests that future architectures should explicitly estimate input signal quality and adjust their strategy accordingly. Confidence-based reweighting, noise-dependent scaling, or learnable modality gates trained to respond to input degradation could resolve the audio bias problem.
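In skeleton form, one of these suggested fixes, a noise-dependent modality gate, could look like the following. The sigmoid gate on an estimated SNR is a hypothetical design choice for illustration, not an architecture from the paper:

```python
import numpy as np

def gated_fusion(audio_feat, video_feat, snr_estimate, tau=0.0, k=0.5):
    """Hypothetical noise-aware fusion: a sigmoid of the estimated
    SNR (in dB) sets the audio weight, so heavily corrupted audio is
    down-weighted instead of trusted by default. `tau` is the SNR at
    which the two modalities are weighted equally; `k` sets how
    sharply the gate responds to degradation."""
    w_audio = 1.0 / (1.0 + np.exp(-k * (snr_estimate - tau)))
    return w_audio * audio_feat + (1.0 - w_audio) * video_feat
```

In a real system the SNR estimate would itself be predicted from the input (or the gate learned end to end); the point is only that the fusion weight becomes an explicit, noise-sensitive quantity rather than an implicit prior baked into training.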
This work also demonstrates that Shapley attribution extends beyond interpretability into diagnosis. Similar methods could be applied to other multimodal tasks: video understanding, embodied AI, medical imaging, or any system combining heterogeneous information sources. The question "are we using all modalities fairly, or is one being neglected?" becomes systematically answerable.
Several questions remain. Does the audio bias appear in other multimodal settings, or is it specific to speech where one modality naturally dominates? Can these Shapley-based diagnostics inform better training objectives that encourage balanced modality development? Do models trained adversarially against audio noise show different patterns than models trained blindly on noisy data? How does the choice of modality-specific encoder architecture shape these attribution patterns?
The deeper contribution is methodological. This work treats interpretability not as an afterthought or a compliance requirement, but as a scientific instrument for discovering how models actually work. By asking "how do these models balance multimodal information," the authors uncover architectural flaws that performance metrics alone would never reveal. This approach, applying rigorous attribution methods to diagnose model behavior before it reaches production, points toward a future where interpretability guides design rather than explaining results after the fact.
This is a Plain English Papers summary of a research paper called Dr. SHAP-AV: Decoding Relative Modality Contributions via Shapley Attribution in Audio-Visual Speech Recognition.
