
Convo-Script

Tools for Conversation·2025·Remote

Lightweight multimodal annotation software for rapid human correction.

Research Questions

How might a correction interface guide human attention to uncertain regions, so that review time is spent where human judgment matters most rather than trudging through confident segments?

Low friction interface for high-focus manual work.


Overview

Convo-Script investigates how human attention and judgment can be effectively supported during the review and correction of multimodal conversational data.

Rather than treating annotation as a mechanical clean-up step after automation, the project reframes correction as a critical interpretive activity. The interface is designed to surface uncertainty, preserve provenance, and reduce friction so that expert judgment is applied where it matters most.

Automation serves as a guide, not an authority. Human analysts remain the final interpreters of conversational meaning.

Role in the Program

Convo-Script is the verification and correction layer of the Tools for Conversation program.

It sits downstream of Convo-Max, enabling analysts to review provisional transcripts, resolve ambiguity, and produce trusted reference data ("golden" transcripts) suitable for conversation analysis and methodological evaluation.

The tool completes the program's end-to-end research loop:

Capture → Encode → Process → Verify

Problem

Meaningful conversation analysis demands sustained attention to fine-grained interactional details such as overlaps, repairs, timing, and gesture. The analyst's interpretive focus is the scarce resource, and the one that matters most.

The paradox is that current annotation tools often tax this attention rather than protecting it. Environments like ELAN offer powerful multimodal capabilities, but their technical overhead means analysts spend significant effort navigating the interface rather than interpreting the interaction. Friction competes directly with depth.

This isn't just an ergonomic issue: transcription interfaces shape what gets seen. Choices about layout, segmentation, and notation determine what becomes visible and what gets missed (Ochs, 1979; Goodwin, 1994). And because annotation work distributes cognition across people, tools, and representations (Hutchins, 1995), a sub-optimal interface doesn't just slow analysis down, it constrains what analysis is possible.

Convo-Script starts from the premise that this difficulty is not an inherent cost of rigorous work — it's a design problem. The question is how to build an environment that conserves expert attention and directs it where it matters most.

Design Position

Convo-Script was designed around three guiding principles:

  1. Attention is the scarce resource. Interfaces should direct human effort to analytically meaningful moments.

  2. Uncertainty should be visible. Low-confidence regions are signals to surface, not errors to hide.

  3. Correction must be fast and reversible. Analysts should be able to iterate without fear of corrupting data.

Automation is used to introduce productive friction, flagging uncertainty for review, while the human interface is designed to remove friction during correction.

Interface & Workflow

Convo-Script was designed as a verification environment for reviewing synchronized multimodal recordings rather than as a general-purpose annotation tool. The workspace uses a fixed, scroll-free layout in which waveform, video, and transcript remain stationary while segments advance through view. Every segment can be replayed with a single click or keypress, reducing navigation overhead during iterative correction.

The interface renders each token as a discrete chip linked to timing and speaker attribution. Words carry confidence indicators and provenance metadata so that uncertainty remains visible during review rather than being collapsed into a single "final" transcript.

Multimodal Alignment and Data Model

Convo-Script operates on outputs produced by Convo-Max. Audio, video, diarization labels, and word-level timestamps are ingested as synchronized streams and mapped to a unified timeline. Each token preserves its attribution provenance: direction-of-arrival (DOA) clustering, visual verification, visual correction, or marked ambiguity.

A structured transcript schema stores:

  • word-level timing and speaker assignment
  • confidence tiers and provenance markers
  • overlap boundaries and repair segments
  • version history across correction passes

This schema allows the transcript to function as both a readable artifact and a diagnostic layer for upstream processing decisions.
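The schema above can be sketched as a minimal data model. Field and class names here are illustrative assumptions, not Convo-Script's actual file format:

```python
from dataclasses import dataclass, field
from enum import Enum

class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

class Provenance(Enum):
    DOA_CLUSTERING = "doa_clustering"       # direction-of-arrival attribution
    VISUAL_VERIFICATION = "visual_verification"
    VISUAL_CORRECTION = "visual_correction"
    MARKED_AMBIGUOUS = "marked_ambiguous"

@dataclass
class Token:
    text: str
    start: float          # seconds on the unified timeline
    end: float
    speaker: str
    confidence: Confidence
    provenance: Provenance

@dataclass
class TranscriptVersion:
    version: int
    tokens: list          # list of Token

@dataclass
class Transcript:
    versions: list = field(default_factory=list)  # full history of passes

    def latest(self):
        """Most recent correction pass; earlier versions are preserved."""
        return self.versions[-1]
```

Keeping every pass as a distinct `TranscriptVersion` is what lets the transcript act as both a readable artifact and a diagnostic layer: any version can be diffed against any other.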

Correction Workflow

Verification proceeds through iterative human review rather than automated post-processing. Analysts listen, replay, and revise segments directly within the synchronized workspace. Corrections generate new transcript versions while preserving prior states, enabling comparison across passes and supporting the construction of a trusted "golden" reference transcript.
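The version-preserving correction step could look like the following sketch, where each edit copies the latest pass rather than mutating it (the dict layout is a simplifying assumption):

```python
import copy

def apply_correction(versions, index, new_text):
    """Create a new transcript version with one token corrected.

    All prior versions are preserved, so passes can be compared
    and any correction can be rolled back by discarding a version.
    """
    new_pass = copy.deepcopy(versions[-1])   # never mutate earlier passes
    new_pass[index]["text"] = new_text
    versions.append(new_pass)
    return versions
```

Because earlier states survive untouched, an analyst can iterate freely: a bad edit costs one discarded version, not a corrupted transcript.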

The workflow emphasizes selective attention rather than linear playback. Ambiguous regions, such as overlap, speaker transitions, or low-confidence tokens, can be isolated quickly for repeated listening.
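Selective attention of this kind amounts to a triage filter over the token stream. A minimal sketch, assuming tokens carry `confidence`, `overlap`, and `speaker` fields as above:

```python
def flag_for_review(tokens):
    """Return indices of tokens an analyst should inspect first:
    low-confidence words, marked overlap, and speaker transitions."""
    flagged = []
    prev_speaker = None
    for i, tok in enumerate(tokens):
        if tok.get("confidence") == "low" or tok.get("overlap"):
            flagged.append(i)
        elif prev_speaker is not None and tok["speaker"] != prev_speaker:
            flagged.append(i)  # speaker transition: a common ambiguity site
        prev_speaker = tok["speaker"]
    return flagged
```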

Evaluation and Diagnostic Integration

Convo-Script is integrated with an evaluation dashboard used alongside Convo-Max. Corrected transcripts serve as reference material for comparing pipeline configurations across audio processing modes (mixed, separated) and fusion settings (with or without spatial metadata). The system tracks WER and speaker accuracy across runs, supports side-by-side comparison, and enables regression checks across processing changes.
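WER, the headline metric here, is the word-level edit distance between a hypothesis and the golden reference, normalized by reference length. A self-contained sketch (the dashboard's actual implementation is not shown in this write-up):

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over word sequences,
    divided by the number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)
```

Computing this per run against the same golden transcript is what makes regression checks across pipeline configurations meaningful.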

Evaluation

Evaluation focused on workflow efficiency and interpretive clarity rather than benchmark performance. Using Convo-Max alongside Convo-Script, a single "golden" transcript was produced through sustained, multi-pass correction (six file versions), demonstrating that the tool supported iterative refinement of a synchronized exemplar recording.

The golden transcript functioned diagnostically: it was used to evaluate pipeline output across five controlled configurations varying audio processing mode and spatial metadata, refine capture and processing assumptions upstream, and validate regression behavior. The evaluation dashboard tracks WER and speaker accuracy across runs, enables side-by-side comparison, and supports configuration testing.

Evaluation remains limited in scope. The system has not yet been assessed by multiple analysts, no formal timing comparisons against ELAN have been conducted, and long-form or large-scale corpora remain untested. These are all priorities for future work.

Key Insights

1. Uncertainty Improves Focus

The interface was designed to surface uncertainty so that difficult segments become immediately visible during review. Each word appears as a colored chip with three confidence tiers — unmarked (high), dashed border (medium), or orange border with warning icon (low) — alongside attribution provenance (DOA clustering, visual verification, visual correction, or marked ambiguous). Observations are based on single-analyst use during development rather than formal user testing.
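The three-tier rendering rule can be sketched as a mapping from a confidence score to chip styling. The numeric thresholds below are assumptions for illustration; only the three tiers and their visual treatments come from the design described above:

```python
def chip_style(confidence):
    """Map a token confidence score in [0, 1] to one of three
    visual tiers. Threshold values are illustrative assumptions."""
    if confidence >= 0.9:
        return {"tier": "high", "border": "none", "icon": None}
    if confidence >= 0.6:
        return {"tier": "medium", "border": "dashed", "icon": None}
    return {"tier": "low", "border": "orange", "icon": "warning"}
```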

2. Ergonomics as a Design Hypothesis

Convo-Script adopts a fixed, scroll-free workspace where waveform, video, and transcript remain stationary while segments move through view. The goal is to reduce navigation overhead and make repeated listening faster, especially for overlap and repair sequences. This reflects a design hypothesis — not yet a validated outcome — that lowering interaction friction will change what analysts choose to examine closely. No comparative studies with tools like ELAN have been conducted.

3. Human Judgment Remains Central

The golden transcript schema intentionally includes categories that current automation cannot produce, such as collaborative vs. competitive overlap or interruption vs. smooth transition. A six-version correction sequence created by a single analyst demonstrated that even with timing, speaker labels, confidence scores, and visual cues, interpretive decisions required iterative human judgment. Multi-analyst evaluation remains future work.

Future Directions

  1. Evaluate review workflows on longer recordings
  2. Explore comparative studies with traditional annotation tools
  3. Refine uncertainty signals to further reduce review effort

Programmatic Connections

Convo-Script operationalizes the program's epistemic stance:

  • Convo-Box determines what is available to be reviewed.
  • Convo-Recorder makes spatial assumptions explicit and inspectable.
  • Convo-Max generates provisional structure and uncertainty signals.

Convo-Script ensures that interpretive authority remains with the human analyst.

Ethics

All automated annotations are explicitly provisional. Human corrections are preserved alongside original output, supporting transparency and accountability.

All data reviewed through the system was collected with informed consent and used solely for research purposes.

References

Goodwin, C. (1994). Professional vision. American Anthropologist, 96(3), 606–633.

Hutchins, E. (1995). Cognition in the wild. MIT Press.

Ochs, E. (1979). Transcription as theory. In E. Ochs & B. B. Schieffelin (Eds.), Developmental pragmatics (pp. 43–72). Academic Press.

Tags: annotation · transcription · human-in-the-loop · elan · conversation-analysis