Overview
Convo-Recorder investigates how recording-time infrastructure — hardware, synchronization, and spatial configuration — shapes what can be known about conversational interaction.
The system combines hardware-level spatial capture (a four-microphone array producing raw direction-of-arrival, or DOA, telemetry) with explicit metadata encoding (declared speaker positions, environmental context). Raw spatial telemetry has proven immediately useful for overlap detection; explicit metadata encoding establishes infrastructure for spatial reasoning that remains to be validated in more challenging multi-speaker scenarios.
The project's position is that recording-time infrastructure, particularly spatial audio hardware and tight synchronization, provides structural signals that downstream analysis can exploit. Overlap detection via DOA instability (90.5% recall in benchmark testing) demonstrates this: the mic array's raw spatial telemetry flags regions of simultaneous speech automatically, without requiring the operator to declare speaker positions. The system also supports explicit spatial declaration, which establishes ground-truth context for future evaluation in more challenging multi-speaker scenarios.
Role in the Program
Convo-Recorder is the encoding layer of the Tools for Conversation program.
Running on the Convo-Box hardware, its role is to translate physical conversational space into synchronized media streams and structured metadata at the moment of capture. By forcing spatial configuration to be declared explicitly, the system reduces reliance on blind inference in later stages of processing.
Problem
Most recording software treats spatial configuration as incidental.
Speaker position, distance, orientation, and environmental conditions are typically undocumented, forcing downstream systems to infer structure from audio alone. When inference fails, errors are attributed to transcription or modeling rather than to missing capture-time information.
In conversation analysis (CA), where overlap timing, interruption, and pause length carry social meaning, even millisecond-scale misalignment can change how an interaction is interpreted. CA transcription conventions mark overlap onset and measure silences in tenths of seconds precisely because these temporal details contribute to the organization and intelligibility of talk (Mazeland, 2006).
Methodology & System Design
Convo-Recorder was designed to support explicit spatial declaration, tight synchronization, and minimally intrusive operation during live interaction.
Stream Synchronization
Capturing multiple media streams with millisecond accuracy proved to be a central challenge.
The naive approach was to issue simultaneous start commands to cameras and microphones. This failed due to independent device clocks, variable latency, operating system scheduling, and clock drift. These factors produced streams that appeared synchronized but were offset by hundreds of milliseconds, which is unacceptable for conversation analysis.
Sensor Timestamp Strategy
Modern camera modules expose sensor-level timestamps, indicating the precise moment each frame was captured according to the device's hardware clock.
By logging sensor timestamps alongside system monotonic timestamps, Convo-Recorder can detect drift, correlate timestamps across streams, and realign media with millisecond precision. To support this strategy, capture is restricted to CSI cameras, avoiding USB devices whose latency characteristics proved too unstable for reliable synchronization.
Speaker Geometry Configuration
Early analysis revealed that useful spatial information, namely who sat where relative to the microphone array, existed but was not being recorded explicitly.
To address this, the system provides a configuration interface that allows the operator to position speakers around a virtual representation of the microphone array, encoding identity, angular position, and distance at capture time.
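As an illustration of what such a declaration might look like once encoded, the sketch below stores identity, angular position, and distance per speaker and serializes the layout to JSON. The field names, the degrees-from-array-reference convention, the JSON layout, and the `nearest_speaker` helper are all hypothetical; the actual Convo-Recorder schema is not specified here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Speaker:
    """One declared speaker position (hypothetical schema)."""
    speaker_id: str
    angle_deg: float   # bearing measured from the array's 0-degree reference
    distance_m: float  # distance from the array centre


def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two bearings, in degrees (0 to 180)."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def nearest_speaker(doa_deg: float, speakers: list[Speaker]) -> Speaker:
    """Return the declared speaker whose bearing is closest to a DOA estimate."""
    return min(speakers, key=lambda s: angular_distance(doa_deg, s.angle_deg))


# Two seated speakers on opposite sides of the array (illustrative values).
layout = [Speaker("A", 350.0, 1.2), Speaker("B", 170.0, 1.0)]
metadata = json.dumps({"speakers": [asdict(s) for s in layout]}, indent=2)

# A DOA estimate of 5 degrees wraps around 0 and is closest to speaker A.
print(nearest_speaker(5.0, layout).speaker_id)
```

The wrap-around handling in `angular_distance` matters because bearings near 0 and 360 degrees describe nearly the same direction.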
This interface functions as a methodological intervention, replacing blind post-hoc inference with explicit declaration.
Environmental Modeling
Inconsistent DOA estimates initially appeared as noise. Further inspection revealed reflections from nearby surfaces as the cause.
The system was extended to encode basic environmental context, such as room depth and reflective boundaries, allowing downstream processing to interpret spatial audio signals more reliably. The current implementation assumes a static speaker configuration, reflecting common seated conversational contexts.
Remote Operation
To minimize observer effects, Convo-Recorder supports headless remote control over a local network.
This design choice prioritizes conversational naturalism, allowing the researcher to configure and start recordings without remaining physically present in the interaction space.
Evaluation
Evaluation focused on whether synchronized media and metadata were sufficient for downstream analysis rather than on interface performance metrics.
Synchronization accuracy is characterized across the full signal chain:
| Source | Jitter / drift |
|---|---|
| CSI sensor capture | 0.1–1 µs (deterministic) |
| Thread scheduling | 10–100 µs |
| Frame buffering | 1–3 ms |
| ReSpeaker audio | 2–5 ms |
| Oscillator drift | 25–50 ppm |
Current tests indicate audio–video alignment within 10–100 ms and sub-millisecond synchronization between CSI cameras. Short exemplar recordings (2–3 minutes, two speakers) were successfully processed with Convo-Max and reviewed in Convo-Script, demonstrating temporal coherence sufficient for conversation analysis of these two-speaker recordings.
Evaluation remains limited to short recordings; long-duration capture, drift behavior, and extended synchronization stability remain priorities for future work. The system currently includes 15 automated tests covering worker lifecycle, multi-camera coordination, and graceful degradation.
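The oscillator drift figure helps explain why recording length matters. As a back-of-the-envelope sketch, assuming a constant drift rate that is left uncorrected, a clock running 25–50 ppm off an ideal reference accumulates roughly 4.5–9 ms of offset over three minutes and 90–180 ms over an hour. The function name and the constant-rate assumption are illustrative; real oscillators vary with temperature and other conditions.

```python
def drift_offset_ms(duration_s: float, drift_ppm: float) -> float:
    """Offset accumulated by a clock with constant drift, in milliseconds."""
    return duration_s * drift_ppm * 1e-6 * 1e3


# Compare a short exemplar recording with a long-form session.
for label, duration_s in [("3 min", 180), ("1 h", 3600)]:
    low = drift_offset_ms(duration_s, 25)
    high = drift_offset_ms(duration_s, 50)
    print(f"{label}: {low:.1f} to {high:.1f} ms")
```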
Key Findings
- Hardware-level spatial signal outperforms explicit declaration. The mic array's raw DOA telemetry proved more immediately useful than the operator's declared speaker geometry. DOA instability during overlapping speech was detectable directly from frame-to-frame angle variance, achieving 90.5% recall for overlap detection without any capture-time spatial declaration. In contrast, explicitly encoding speaker positions and DOA zones did not measurably improve speaker attribution in a controlled A/B test (93% accuracy with and without metadata). The raw spatial signal from hardware is the primary contribution; the value of structured metadata declaration in more complex scenarios remains untested.
- Environmental context shapes spatial audio interpretation. Reflective surfaces and room geometry significantly affect DOA signal quality, producing directional artifacts that could mislead downstream analysis. The system encodes basic environmental context (room dimensions, reflective boundaries) as metadata. While this metadata is available to the processing pipeline, its effect on inference accuracy has not yet been isolated in controlled evaluation.
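The DOA-instability finding in the first item can be sketched as follows. This is one possible reading of "frame-to-frame angle variance", not the benchmarked detector: it uses circular variance (so that angles near 0 and 360 degrees are treated as close) over a trailing window and flags frames where it exceeds a threshold. The window length, the threshold, and the function names are assumptions chosen for illustration.

```python
import math


def circular_variance(angles_deg: list[float]) -> float:
    """1 minus the mean resultant length: 0 for identical angles, near 1 when scattered."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    return 1.0 - math.hypot(c, s)


def flag_unstable_frames(doa_deg: list[float], window: int = 5,
                         threshold: float = 0.2) -> list[bool]:
    """Flag each frame whose trailing window of DOA estimates is unstable."""
    flags = []
    for i in range(len(doa_deg)):
        start = max(0, i - window + 1)
        flags.append(circular_variance(doa_deg[start:i + 1]) > threshold)
    return flags


# One steady speaker, then DOA jumping between two directions (illustrative).
stream = [90.0] * 6 + [90.0, 270.0, 95.0, 265.0, 90.0, 270.0]
print(flag_unstable_frames(stream))
```

In this sketch a clean change of speaker would also raise the windowed variance for a few frames, so a practical detector would need additional logic to separate turn changes from sustained overlap.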
Future Directions
- Evaluate synchronization stability for long-form recordings
- Improve drift-correction strategies across extended sessions
Programmatic Connections
Convo-Recorder packages synchronized media and metadata for downstream tools:
- Convo-Box provides the physical capture platform.
- Convo-Max provides provisional multimodal transcription.
- Convo-Script supports human review and correction.
Ethics
The software is designed for research transparency. Spatial metadata is collected solely to support analysis, not surveillance.
All recordings were conducted with informed consent, and encoded data was used exclusively for research purposes.
References
Mazeland, H. (2006). Conversation analysis. In K. Brown (Ed.), Encyclopedia of language and linguistics (2nd ed., Vol. 3, pp. 153–163). Elsevier.
