Convo-Recorder

Tools for Conversation · 2025 · Remote

Multi-camera recording software for improved transcription

Research Questions

How might recording software make spatial and temporal assumptions explicit at capture time - so that downstream analysis can rely on encoded geometry rather than post-hoc inference?

Speaker participant settings encode useful information for downstream transcription.

Overview

Convo-Recorder investigates how recording-time infrastructure — hardware, synchronization, and spatial configuration — shapes what can be known about conversational interaction.

The system combines hardware-level spatial capture (a four-microphone array producing raw direction-of-arrival, or DOA, telemetry) with explicit metadata encoding (declared speaker positions, environmental context). Raw spatial telemetry has proven immediately useful for overlap detection; explicit metadata encoding establishes infrastructure for spatial reasoning that remains to be validated in more challenging multi-speaker scenarios.

The project advances the position that recording-time infrastructure — particularly spatial audio hardware and tight synchronization — provides structural signals that downstream analysis can exploit. Overlap detection via DOA instability (90.5% recall in benchmark testing) demonstrates this: the mic array's raw spatial telemetry flags regions of simultaneous speech automatically, without requiring the operator to declare speaker positions. The system also supports explicit spatial declaration, which establishes ground-truth context for future evaluation in more challenging multi-speaker scenarios.

Role in the Program

Convo-Recorder is the encoding layer of the Tools for Conversation program.

Running on the Convo-Box hardware, its role is to translate physical conversational space into synchronized media streams and structured metadata at the moment of capture. By forcing spatial configuration to be declared explicitly, the system reduces reliance on blind inference in later stages of processing.

Problem

Most recording software treats spatial configuration as incidental.

Speaker position, distance, orientation, and environmental conditions are typically undocumented, forcing downstream systems to infer structure from audio alone. When inference fails, errors are attributed to transcription or modeling rather than to missing capture-time information.

In conversation analysis (CA), where overlap timing, interruption, and pause length carry social meaning, even millisecond-scale misalignment can change how an interaction is interpreted. CA transcription conventions mark overlap onset and measure silences in tenths of seconds precisely because these temporal details contribute to the organization and intelligibility of talk (Mazeland, 2006).

Methodology & System Design

Convo-Recorder was designed to support explicit spatial declaration, tight synchronization, and minimally intrusive operation during live interaction.

Stream Synchronization

Capturing multiple media streams with millisecond accuracy proved to be a central challenge.

The naive approach was to issue simultaneous start commands to cameras and microphones. This failed due to independent device clocks, variable latency, operating system scheduling, and clock drift. These factors produced streams that appeared synchronized but were offset by hundreds of milliseconds, which is unacceptable for conversation analysis.

Sensor Timestamp Strategy

Modern camera modules expose sensor-level timestamps, indicating the precise moment each frame was captured according to the device's hardware clock.

By logging sensor timestamps alongside system monotonic timestamps, Convo-Recorder can detect drift, correlate timestamps across streams, and realign media with millisecond precision. To support this strategy, capture is restricted to CSI cameras, avoiding USB devices whose latency characteristics proved too unstable for reliable synchronization.
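The pairing of sensor and monotonic timestamps can be illustrated with a small sketch. It assumes each frame yields one pair of timestamps in seconds and fits a straight line between the two clocks by least squares; the slope gives the relative drift and the fitted line maps sensor time back onto the monotonic timeline. The names `ClockFit` and `fit_clock` are hypothetical and are not taken from the Convo-Recorder codebase.

```python
from dataclasses import dataclass


@dataclass
class ClockFit:
    """Linear model: sensor_time = rate * monotonic_time + offset_s."""
    rate: float      # sensor seconds elapsed per monotonic second
    offset_s: float  # sensor clock reading at monotonic time zero

    @property
    def drift_ppm(self) -> float:
        # Relative drift of the sensor clock in parts per million.
        return (self.rate - 1.0) * 1e6

    def to_monotonic(self, sensor_ts: float) -> float:
        # Map a sensor timestamp onto the shared monotonic timeline.
        return (sensor_ts - self.offset_s) / self.rate


def fit_clock(sensor_ts: list[float], mono_ts: list[float]) -> ClockFit:
    """Ordinary least-squares fit over paired timestamps (seconds)."""
    n = len(sensor_ts)
    if n < 2 or n != len(mono_ts):
        raise ValueError("need at least two paired timestamps")
    mean_m = sum(mono_ts) / n
    mean_s = sum(sensor_ts) / n
    var_m = sum((m - mean_m) ** 2 for m in mono_ts)
    if var_m == 0:
        raise ValueError("monotonic timestamps must not all be equal")
    cov = sum((m - mean_m) * (s - mean_s) for m, s in zip(mono_ts, sensor_ts))
    rate = cov / var_m
    return ClockFit(rate=rate, offset_s=mean_s - rate * mean_m)
```

Applying one fit per stream and converting every sensor timestamp with `to_monotonic` places all streams on a common timeline. A real pipeline would probably refit over sliding windows, since drift is not guaranteed to stay constant.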

Speaker Geometry Configuration

Early analysis revealed that one useful piece of spatial information existed but was not being recorded explicitly: who sat where relative to the microphone array.

To address this, the system provides a configuration interface that allows the operator to position speakers around a virtual representation of the microphone array, encoding identity, angular position, and distance at capture time.

This interface functions as a methodological intervention, replacing blind post-hoc inference with explicit declaration.
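As a rough illustration of what such a declaration could look like once serialized, the sketch below stores identity, angular position, and distance per speaker and writes them out as JSON. The field names, the 0 to 360 degree convention, and the helper `nearest_speaker` are assumptions made for this example; the actual Convo-Recorder schema is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SpeakerPosition:
    speaker_id: str    # operator-assigned label (hypothetical field name)
    angle_deg: float   # bearing from the array's 0° reference, in [0, 360)
    distance_m: float  # distance from the array centre in metres


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees."""
    d = abs(a_deg - b_deg) % 360
    return min(d, 360 - d)


def nearest_speaker(doa_deg: float, speakers: list[SpeakerPosition]) -> SpeakerPosition:
    """Return the declared speaker whose bearing is closest to a DOA estimate."""
    return min(speakers, key=lambda s: angular_distance(doa_deg, s.angle_deg))


def to_metadata(speakers: list[SpeakerPosition]) -> str:
    """Serialize the declared geometry as JSON for storage beside the media."""
    return json.dumps({"speakers": [asdict(s) for s in speakers]}, indent=2)
```

Comparing bearings with `angular_distance` rather than plain subtraction keeps a speaker declared at 350° close to a DOA estimate of 5°, which matters for any seat near the array's zero reference.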

Environmental Modeling

Inconsistent DOA estimates initially appeared as noise. Further inspection revealed reflections from nearby surfaces as the cause.

The system was extended to encode basic environmental context, such as room depth and reflective boundaries, allowing downstream processing to interpret spatial audio signals more reliably. The current implementation assumes a static speaker configuration, reflecting common seated conversational contexts.
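To show why a reflective boundary can produce a misleading direction estimate, the sketch below uses the textbook image-source construction in two dimensions. It assumes the array sits at the origin, a single flat wall runs parallel to the array's 0° axis at a known distance, and the reflection is specular. This is a simplification for illustration only, not the project's environmental model, and `reflected_doa_deg` is a hypothetical helper.

```python
import math


def to_xy(angle_deg: float, distance_m: float) -> tuple[float, float]:
    """Convert a bearing and distance into Cartesian coordinates (metres)."""
    a = math.radians(angle_deg)
    return distance_m * math.cos(a), distance_m * math.sin(a)


def reflected_doa_deg(angle_deg: float, distance_m: float, wall_y_m: float) -> float:
    """Apparent bearing of a first-order reflection off the wall y = wall_y_m.

    The wall is assumed to be parallel to the x axis. Mirroring the speaker
    across the wall gives the image source; the reflection appears to arrive
    from the direction of that image.
    """
    x, y = to_xy(angle_deg, distance_m)
    y_image = 2 * wall_y_m - y
    return math.degrees(math.atan2(y_image, x)) % 360
```

Under these assumptions, a speaker at 0° and 1 m with a wall 1 m to the side produces a reflection arriving from roughly 63°, far enough from the true bearing to look like a second source if the wall is not accounted for.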

Remote Operation

To minimize observer effects, Convo-Recorder supports headless remote control over a local network.

This design choice prioritizes conversational naturalism, allowing the researcher to configure and start recordings without remaining physically present in the interaction space.

Evaluation

Evaluation focused on whether synchronized media and metadata were sufficient for downstream analysis rather than on interface performance metrics.

Synchronization accuracy is characterized across the full signal chain:

Source               Jitter
CSI sensor capture   0.1–1 µs (deterministic)
Thread scheduling    10–100 µs
Frame buffering      1–3 ms
ReSpeaker audio      2–5 ms
Oscillator drift     25–50 ppm

Current tests indicate 10–100 ms alignment across audio–video streams, with sub-millisecond synchronization between CSI cameras. Short exemplar recordings (2–3 minutes, two speakers) were successfully processed with Convo-Max and reviewed in Convo-Script, demonstrating sufficient temporal coherence for multi-party conversation analysis.

Evaluation remains limited to short recordings; long-duration capture, drift behavior, and extended synchronization stability remain priorities for future work. The system currently includes 15 automated tests covering worker lifecycle, multi-camera coordination, and graceful degradation.
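The reason long-duration capture needs separate evaluation follows from simple arithmetic on the oscillator drift figure in the table above. Assuming two free-running clocks that diverge at a constant relative rate and no correction is applied, the accumulated offset is the drift rate multiplied by the elapsed time. The function name below is hypothetical.

```python
def accumulated_drift_ms(drift_ppm: float, duration_s: float) -> float:
    """Offset in milliseconds built up by a constant relative clock drift."""
    return drift_ppm * 1e-6 * duration_s * 1e3


if __name__ == "__main__":
    # Compare a short exemplar recording with an hour-long session.
    for minutes in (3, 60):
        low = accumulated_drift_ms(25, minutes * 60)
        high = accumulated_drift_ms(50, minutes * 60)
        print(f"{minutes} min: {low:.1f} to {high:.1f} ms")
```

At 25–50 ppm, a 3-minute recording accumulates about 4.5–9 ms of uncorrected offset, while a 60-minute session accumulates about 90–180 ms. Results from short recordings therefore say little about whether alignment holds over extended sessions.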

Key Findings

  1. Hardware-level spatial signal outperforms explicit declaration. The mic array's raw DOA telemetry proved more immediately useful than the operator's declared speaker geometry. DOA instability during overlapping speech was detectable directly from frame-to-frame angle variance, achieving 90.5% recall for overlap detection without any capture-time spatial declaration. In contrast, explicitly encoding speaker positions and DOA zones did not measurably improve speaker attribution in a controlled A/B test (93% accuracy with and without metadata). The raw spatial signal from hardware is the primary contribution; structured metadata declaration remains an untested hypothesis for more complex scenarios.

  2. Environmental context shapes spatial audio interpretation. Reflective surfaces and room geometry significantly affect DOA signal quality, producing directional artifacts that could mislead downstream analysis. The system encodes basic environmental context (room dimensions, reflective boundaries) as metadata. While this metadata is available to the processing pipeline, its effect on inference accuracy has not yet been isolated in controlled evaluation.
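The idea behind the first finding, flagging overlap from unstable DOA readings, can be sketched as follows. The estimator, window length, and threshold are assumptions chosen for illustration: circular variance is used here because bearings wrap around at 0°/360°, and the exact measure and parameters behind the reported recall are not stated above.

```python
import math


def circular_variance(angles_deg: list[float]) -> float:
    """1 minus the mean resultant length: 0 = identical bearings, 1 = fully spread."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return 1.0 - math.hypot(c, s)


def flag_unstable_windows(
    doa_deg: list[float], window: int = 10, threshold: float = 0.2
) -> list[bool]:
    """Slide a fixed-size window over DOA frames and flag high-variance windows.

    `window` and `threshold` are hypothetical values, not tuned parameters.
    Entry i of the result describes frames i .. i + window - 1.
    """
    return [
        circular_variance(doa_deg[start:start + window]) > threshold
        for start in range(len(doa_deg) - window + 1)
    ]
```

A single speaker near 0° yields bearings such as 358°, 2°, 359°, which a linear variance would misread as highly unstable but circular variance treats as tightly clustered. Bearings that jump between two well-separated directions push the variance up and mark the window as a candidate overlap region.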

Future Directions

  1. Evaluate synchronization stability for long-form recordings
  2. Improve drift-correction strategies across extended sessions

Programmatic Connections

Convo-Recorder packages synchronized media and metadata for downstream tools:

  • Convo-Box provides the physical capture platform.
  • Convo-Max provides provisional multimodal transcription.
  • Convo-Script supports human review and correction.

Ethics

The software is designed for research transparency. Spatial metadata is collected solely to support analysis, not surveillance.

All recordings were conducted with informed consent, and encoded data was used exclusively for research purposes.

References

Mazeland, H. (2006). Conversation analysis. In K. Brown (Ed.), Encyclopedia of language and linguistics (2nd ed., Vol. 3, pp. 153–163). Elsevier.

Tags: recording-software, spatial-audio, doa, raspberry-pi, multi-camera