
Convo-Max

Tools for Conversation · 2025 · Remote

Multimodal automated transcription for conversation analysis.

Research Questions

How might a transcription pipeline preserve overlap, gaps, and prosody for conversation analysis, rather than smoothing them out to maximize word accuracy?

A multimodal Convo-Box session, transcribed and opened in the ELAN annotation software.

Overview

Convo-Max investigates how automated transcription systems can support conversation analysis (CA) without collapsing interactional structure into flat text.

Rather than optimizing for benchmark accuracy or end‑to‑end automation, the system treats transcription as a provisional structuring task. Its goal is to preserve timing, overlap, and multimodal alignment while making uncertainty explicit for human review.

The project advances a clear position: automation should amplify context, not replace interpretive judgment.

Role in the Program

Convo-Max is the processing layer of the Tools for Conversation program.

It ingests synchronized multimodal recordings from Convo-Box and Convo-Recorder, transforming raw media into CA‑compatible annotations that can be inspected, corrected, and extended by human analysts.

The system explicitly flags regions where automation is unreliable rather than smoothing them out.

Problem

Conversation analysis relies on fine‑grained alignment of speech, timing, overlap, and embodied cues such as gaze and facial expression.

Mainstream ASR systems are optimized for clean, single‑speaker audio and treat overlap as error. In CA, however, overlap is a site of negotiation and meaning.

As a result, fully automated transcription systems fail precisely where CA is most sensitive: overlapping talk, interruption, repair, and prosodic detail. Recent evaluations show that some commercial ASR systems handle overlap better than older baselines, yet they still do not reliably preserve the full set of interaction-relevant details, such as systematic representation of non-lexical tokens and multimodal structure, which CA treats as analytically central (Mazeland, 2006).

These failures are compounded by capture constraints and by transcription formats that flatten multimodal structure into linear text.

Design Position

Convo-Max was designed around three guiding principles:

  1. Automation is assistive, not authoritative. All output is provisional and intended for human correction.

  2. Multimodality is structural. Audio, video, and spatial metadata are treated as co‑equal inputs.

  3. Uncertainty is informative. Low‑confidence regions are surfaced to guide human attention.

Methodology & System Architecture

Convo-Max operates as a modular, multimodal transcription pipeline configured for conversation analysis rather than benchmark performance.

Multimodal Inputs

The pipeline integrates synchronized inputs including:

  • mixed and separated audio streams
  • per‑speaker video streams
  • spatial metadata from Convo-Recorder
  • session‑level configuration data

This fusion enables the system to reason about overlap, speaker attribution, and embodied cues beyond what single‑stream audio allows.
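One way to represent a fused session is a small record type grouping the synchronized inputs. All field names and file paths below are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SessionInputs:
    """Synchronized multimodal inputs for one recording session (illustrative)."""
    mixed_audio: str                    # path to the mixed room recording
    separated_audio: dict[str, str]     # per-speaker audio paths, keyed by speaker ID
    video: dict[str, str]               # per-speaker video paths
    doa_telemetry: str                  # direction-of-arrival log from Convo-Recorder
    config: dict = field(default_factory=dict)  # session-level settings

# Hypothetical two-speaker session layout:
session = SessionInputs(
    mixed_audio="session01/mixed.wav",
    separated_audio={"A": "session01/spk_a.wav", "B": "session01/spk_b.wav"},
    video={"A": "session01/cam_a.mp4", "B": "session01/cam_b.mp4"},
    doa_telemetry="session01/doa.jsonl",
)
```

Keeping the streams in one record makes it explicit that downstream stages receive all modalities together rather than audio alone.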

Spatial–Acoustic Reasoning

Audio diarization is combined with direction‑of‑arrival (DOA) telemetry.

Rather than treating DOA as a continuous localization signal, Convo-Max interprets instability in spatial angle as a marker of collision (overlap). This allows the system to detect regions where attribution is fundamentally uncertain rather than attempting false precision.
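The instability-as-overlap idea can be sketched as a rolling-variance check over the DOA stream. The function name, window size, and sample data are illustrative; the 60° threshold echoes the one used in the evaluation:

```python
import statistics

def flag_overlap_regions(doa_angles, window=10, threshold=60.0):
    """Flag frame windows whose DOA standard deviation suggests overlapping talk.

    doa_angles: per-frame direction-of-arrival estimates in degrees.
    Returns (start_frame, end_frame) spans where rolling std exceeds threshold.
    """
    flagged = []
    for start in range(0, len(doa_angles) - window + 1):
        win = doa_angles[start:start + window]
        if statistics.pstdev(win) > threshold:
            flagged.append((start, start + window))
    return flagged

# A stable single speaker (~20 deg), then a collision between two directions:
angles = [20] * 20 + [20, 160] * 10
spans = flag_overlap_regions(angles)
```

The detector makes no attempt to localize either speaker during the collision; it only marks the region as attributionally uncertain, which matches the design position of surfacing rather than resolving ambiguity.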

Visual Mouth‑Activity Cues

Visual signals derived from face detection and mouth activity are used selectively to disambiguate quiet or ambiguous segments.

These cues are treated as conditional and context‑dependent, not universally reliable, and are only applied when capture geometry supports meaningful inference.
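The conditional use of visual evidence can be sketched as a simple gate on detector confidence and camera geometry; the function name and thresholds here are illustrative, not the project's actual values:

```python
def use_visual_cue(face_conf, camera_yaw_deg, conf_floor=0.8, max_yaw=30.0):
    """Decide whether a mouth-activity cue is trustworthy for a segment.

    Visual evidence is applied only when the face detector is confident
    and the camera is close to face-on (small yaw relative to the speaker).
    """
    return face_conf >= conf_floor and abs(camera_yaw_deg) <= max_yaw
```

Segments that fail the gate simply fall back to the acoustic tiers, so an off-axis camera degrades gracefully instead of injecting unreliable attributions.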

Tier‑Based Output (ELAN‑Compatible)

Instead of producing a flat transcript, Convo-Max outputs structured annotations in formats compatible with ELAN (EAF).

Automated output is restricted to empirically stable tiers, including:

  • word‑level timing
  • speaker segments
  • overlap and gap regions
  • selected prosodic and embodied cues

This structure preserves analytic flexibility and supports iterative human correction.
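As a sketch of tier-based output, the snippet below assembles a minimal EAF-style XML document from (tier, start_ms, end_ms, value) tuples using only the standard library. A real export would also carry the HEADER, LINGUISTIC_TYPE, and media descriptors ELAN expects, so treat this as illustrative rather than the project's actual serializer:

```python
import xml.etree.ElementTree as ET

def to_eaf(segments):
    """Serialize (tier, start_ms, end_ms, value) tuples into minimal EAF XML."""
    root = ET.Element("ANNOTATION_DOCUMENT", {"VERSION": "3.0", "FORMAT": "3.0"})
    time_order = ET.SubElement(root, "TIME_ORDER")
    tiers, ts_count, ann_count = {}, 0, 0
    for tier, start, end, value in segments:
        tiers.setdefault(tier, []).append((start, end, value))
    for tier, anns in tiers.items():
        tier_el = ET.SubElement(root, "TIER",
                                {"TIER_ID": tier, "LINGUISTIC_TYPE_REF": "default"})
        for start, end, value in anns:
            # Each time-aligned annotation references two shared time slots.
            ts1, ts2 = f"ts{ts_count + 1}", f"ts{ts_count + 2}"
            ts_count += 2
            ET.SubElement(time_order, "TIME_SLOT",
                          {"TIME_SLOT_ID": ts1, "TIME_VALUE": str(start)})
            ET.SubElement(time_order, "TIME_SLOT",
                          {"TIME_SLOT_ID": ts2, "TIME_VALUE": str(end)})
            ann_count += 1
            ann = ET.SubElement(ET.SubElement(tier_el, "ANNOTATION"),
                                "ALIGNABLE_ANNOTATION",
                                {"ANNOTATION_ID": f"a{ann_count}",
                                 "TIME_SLOT_REF1": ts1, "TIME_SLOT_REF2": ts2})
            ET.SubElement(ann, "ANNOTATION_VALUE").text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical fragment: one speaker tier plus an overlap-region tier.
eaf_xml = to_eaf([("speaker_A", 0, 1200, "hola"), ("overlap", 900, 1400, "OVERLAP")])
```

Because overlap and gap regions live on their own tiers, an analyst can correct or extend them in ELAN without disturbing the word-level timing tier.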

Evaluation Approach

The pipeline was evaluated against a human-corrected golden transcript (128 words, 2 speakers, 66 seconds, Spanish) under controlled configurations. Five experiments tested whether spatial metadata (DOA angles, declared speaker positions) improved outcomes.

| Configuration | WER | Speaker Accuracy | Alignment IoU |
| --- | --- | --- | --- |
| Mixed audio + diarization (baseline) | 17.19% | 93.10% | 0.191 |
| Mixed + DOA fusion (with positions) | 17.19% | 93.10% | 0.191 |
| Separated audio + channel labels | 28.12% | 90.91% | 0.227 |
| Separated + DOA fusion (with positions) | 28.12% | 90.91% | 0.227 |

Mixed audio with Pyannote diarization achieved the best results on both transcription accuracy and speaker attribution. Adding DOA spatial metadata did not change outcomes in this 2-speaker scenario. The pipeline processed DOA data correctly (confirmed via pipeline stats), but the diarization baseline was already 93% accurate, leaving no room for spatial correction.
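The two headline metrics can be made concrete with small reference implementations. `wer` is the standard Levenshtein distance over word tokens; `interval_iou` reads alignment IoU as intersection-over-union of time intervals, which may differ in detail from the evaluation's exact definition:

```python
def wer(ref, hyp):
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    if not r:
        return 0.0 if not h else 1.0
    d = list(range(len(h) + 1))       # DP row over hypothesis positions
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1,        # deletion
                      d[j - 1] + 1,    # insertion
                      prev + (rw != hw))  # substitution or match
            prev, d[j] = d[j], cur
    return d[len(h)] / len(r)

def interval_iou(a, b):
    """Intersection-over-union of two (start, end) time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0
```

For example, `wer("a b c", "a x c")` is one substitution over three reference words, i.e. about 0.33, and two half-overlapping one-second segments score an IoU of about 0.33 as well.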

An additional AMI Corpus benchmark (120s, 4 speakers, 21 overlap regions) showed pipeline accuracy of 95.5% in clean speech dropping to 72.7% during overlap. This pattern aligns with a broader body of speech-processing research showing that overlapping speech remains a persistent challenge for automatic transcription and speaker diarization, where simultaneous talkers complicate segmentation, speaker attribution, and recognition accuracy (Park et al., 2022; Bullock et al., 2020). Findings were used diagnostically to refine capture assumptions, signal integration, and confidence thresholds rather than as performance claims.

Evaluation is limited to a single short recording with two speakers and clear turn-taking. Results should not be generalized to multi-speaker, overlapping, or longer recordings.

Key Insights

1. Overlap Is the Dominant Failure Mode

Across real conversational data, overlapping speech, not background noise, emerged as the primary source of transcription error. Findings from the AMI benchmark show accuracy dropping from 95.5% to 72.7% during overlap, a 22.7-point degradation. The 23.40% word deletion rate was driven largely by overlapping talk, which accounted for 26.8% of all words.

These results suggest that multi-party timing and turn-taking dynamics, rather than acoustic cleanliness alone, define the central challenge for reliable conversational transcription.

2. DOA Instability Is a Signal, Not Noise

Spatial variance increased sharply during simultaneous speech. In the AMI Corpus benchmark (4 speakers, 21 overlap regions), DOA standard deviation rose from 34° to 104° for the most affected speaker — a threefold change. Rather than treating this instability as measurement error, it functioned as an indicator of overlap. A DOA instability detector at 60° threshold achieved 90.5% recall for overlap detection, flagging roughly 29% of the audio for closer analysis. An improved diarization-aware variant achieved 100% recall at the cost of lower precision.

These results are from a single AMI Corpus specimen (120s). The magnitude of DOA instability during overlap varied across speakers (1.3x to 3.1x increase in standard deviation).
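The recall figures for the instability detector can be computed by checking whether each ground-truth overlap region is touched by at least one flagged span; `detector_recall` and the sample regions below are illustrative:

```python
def detector_recall(truth_regions, flagged_regions):
    """Fraction of ground-truth overlap regions hit by at least one flagged span.

    Regions are (start, end) pairs in seconds; a hit is any nonzero
    temporal intersection between a truth region and a flagged span.
    """
    def hit(t):
        return any(max(t[0], f[0]) < min(t[1], f[1]) for f in flagged_regions)
    return sum(hit(t) for t in truth_regions) / len(truth_regions)

# Two true overlap regions, one of which the detector flagged:
recall = detector_recall([(0.0, 1.0), (2.0, 3.0)], [(0.5, 0.8)])  # 0.5
```

Precision would be computed symmetrically over the flagged spans, which is where the diarization-aware variant pays for its 100% recall.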

3. Mixed and Separated Audio Reveal Different Error Profiles

Mixed audio produced both lower WER (17.19%) and higher speaker accuracy (93.10%) than separated audio (28.12% WER, 90.91% speaker accuracy) in the tested 2-speaker scenario. Speaker separation introduced transcription artifacts (higher deletion and substitution rates) that offset its channel-isolation advantage for attribution.

The pipeline supports both processing modes. The results suggest that for simple 2-speaker conversations with clear turn-taking, mixed audio with algorithmic diarization is sufficient. Separated audio may prove more valuable in multi-speaker or heavily overlapping scenarios where diarization algorithms degrade, though this remains untested.

4. Camera Geometry Shapes Embodied Inference

Embodied cues depend heavily on capture geometry. In the AMI Corpus benchmark, two speakers produced near-identical DOA angles (~0-12 degrees apart), making acoustic-only disambiguation infeasible for the mic array. Visual mouth detection became the only viable disambiguation strategy, but only when cameras were positioned face-on. Off-axis placement degraded facial-cue interpretation, and when face detection failed entirely, the visual tier returned empty.

Note: AMI geometry is inferred from DOA telemetry, not from ground-truth calibration data. The Pi5 golden recording, where speaker geometry is known, showed that DOA fusion correctly processed spatial data but did not change outcomes compared to diarization alone.

5. Preserved Uncertainty Guides Review

Rather than suppressing low‑confidence output, Convo-Max marks uncertain regions so human analysts can focus attention where judgment matters most.

References

Bullock, L., et al. (2020). Overlap-aware speaker diarization with end-to-end neural networks. In ICASSP 2020 (pp. 7114–7118).

Mazeland, H. (2006). Conversation analysis. In K. Brown (Ed.), Encyclopedia of language and linguistics (2nd ed., Vol. 3, pp. 153–163). Elsevier.

Park, T. J., Kanda, N., Dimitriadis, D., Han, K. J., Watanabe, S., & Narayanan, S. (2022). A review of speaker diarization: Recent advances with deep learning. Computer Speech & Language, 76, 101415.

Tags: transcription · diarization · multimodal · conversation-analysis · elan