Overview
Convo-Box investigates how recording hardware design shapes what can be known about in-person conversation.
Conversation is spatial, embodied, and often multi-party. Timing, orientation, overlap, and mutual visibility are structural features of interaction, not noise to be suppressed (Sacks, Schegloff, & Jefferson, 1974; Mondada, 2019). Convo-Box was designed to preserve these features at capture time, before any software inference or analysis occurs.
The project treats recording hardware as an epistemic boundary: once conversational structure is flattened during capture, it cannot be recovered downstream.
Role in the Program
Convo-Box is the physical instrumentation layer of the Tools for Conversation program.
Its role is to capture conversational data in a form that preserves spatial and temporal structure, so that downstream tools (recording software, transcription pipelines, and annotation interfaces) operate on richer, less distorted input.
By establishing explicit capture constraints, the system ensures that failures in analysis can be traced to physical recording conditions rather than hidden limitations of commodity hardware.
At the portfolio level, Tools for Conversation addresses the infrastructural conditions of conversation, while Local Conversation Studio addresses its social and cultural conditions.
Problem
Most recording systems are designed for telepresence, not analysis.
They assume:
- a single active speaker,
- orderly turn-taking,
- screen-facing participants, and
- aggressive noise suppression.
Real-world conversation rarely conforms to these assumptions. Multi-party talk involves overlap, off-axis orientation, embodied cues, and shifting attention.
When researchers rely on standard webcams or conference microphones, spatial relationships are flattened and fine-grained timing cues are lost before transcription or modeling begins. These losses define a hard ceiling on what can later be inferred.
Design Response
Convo-Box was designed as a tabletop, in-situ recording instrument optimized for studying face-to-face conversation rather than facilitating remote communication.
The system is built around the Raspberry Pi 5 platform to prioritize open components, configurability, and precise control over capture behavior.
Hardware architecture choices, such as interface type, sensor geometry, and synchronization strategy, are treated as epistemic decisions, not merely engineering tradeoffs.
Spatial Audio Capture
The device integrates a four-microphone array capable of frequently updated direction-of-arrival (DOA) estimation.
Unlike conference microphones that suppress instability, the array captures raw spatial variance. Directional fluctuation is treated as potential conversational signal, such as overlapping speech, rather than noise to be eliminated.
Multi-Camera Synchronization
The vision system uses two synchronized CSI cameras rather than USB inputs.
Early prototypes combining USB and CSI cameras revealed that bus-level latency drift was sufficient to destroy the millisecond-level timing required for conversation analysis. Restricting capture to CSI cameras enabled tighter synchronization and more reliable alignment between audio and video streams.
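One way to check whether two camera streams hold millisecond-level alignment is to pair frames by timestamp and inspect the residual offsets. The following is a minimal sketch, not the project's actual tooling: the function name `pair_frames`, the 1 ms tolerance, and the sample timestamps are all hypothetical, and it assumes each stream provides sorted per-frame timestamps in seconds on a shared clock.

```python
from bisect import bisect_left


def pair_frames(ts_a, ts_b, tolerance_s=0.001):
    """Pair each timestamp in ts_a with the nearest timestamp in ts_b.

    Both lists must be sorted ascending, in seconds, on a shared clock
    (an assumption of this sketch). Returns (i, j, offset_s) tuples for
    pairs whose absolute offset is within tolerance_s, where
    offset_s = ts_b[j] - ts_a[i].
    """
    pairs = []
    for i, t in enumerate(ts_a):
        k = bisect_left(ts_b, t)
        # The nearest neighbour is either just before or just after position k.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(ts_b)]
        if not candidates:
            continue
        j = min(candidates, key=lambda idx: abs(ts_b[idx] - t))
        offset_s = ts_b[j] - t
        if abs(offset_s) <= tolerance_s:
            pairs.append((i, j, offset_s))
    return pairs


# Hypothetical timestamps: camera B's third frame arrives 3 ms late.
camera_a = [0.000, 0.050, 0.100, 0.150]
camera_b = [0.0004, 0.0503, 0.1030, 0.1502]
matched = pair_frames(camera_a, camera_b)
print([(i, j) for i, j, _ in matched])
```

Frames that fall outside the tolerance are dropped rather than force-matched, so a growing number of unmatched frames over a recording would be one visible symptom of the kind of latency drift described above.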
Physical Fabrication
The system is housed in a custom, locally fabricated two-tier enclosure with vibration isolation to decouple the microphone array from table-surface noise.
The low-profile, non-interactive form factor was intentionally chosen to minimize social intrusion, allowing the device to function as a passive listener rather than an attention-directing artifact.
Evaluation
The primary evaluation criterion for Convo-Box is its ability to capture short segments of multi-party conversation with sufficient fidelity for downstream analysis.
The device was successfully operated with Convo-Recorder and Convo-Max to produce a synchronized exemplar recording that served as the reference input for subsequent pipeline development.
The device captures two video streams (2× CSI Camera Module 3 at 1280×720, 20 fps) and 6-channel spatial audio (ReSpeaker XVF3800 at 16 kHz), with DSP telemetry polled at 25 Hz.
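The three capture rates imply different time grids, which matters when aligning streams. As a rough illustration, the sketch below derives the period of each stream from the stated rates. The class name `CaptureConfig` and its property names are hypothetical and are not taken from the Convo-Box codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureConfig:
    """Capture rates as stated in the text; the structure itself is illustrative."""
    video_fps: int = 20
    audio_rate_hz: int = 16_000
    telemetry_rate_hz: int = 25

    @property
    def frame_period_ms(self) -> float:
        # Time between consecutive video frames.
        return 1000 / self.video_fps

    @property
    def telemetry_period_ms(self) -> float:
        # Time between consecutive DSP telemetry polls.
        return 1000 / self.telemetry_rate_hz

    @property
    def audio_samples_per_frame(self) -> float:
        # Number of audio samples spanning one video frame.
        return self.audio_rate_hz / self.video_fps


cfg = CaptureConfig()
print(cfg.frame_period_ms, cfg.telemetry_period_ms, cfg.audio_samples_per_frame)
```

At these rates a video frame spans 50 ms, a telemetry poll occurs every 40 ms, and each video frame covers 800 audio samples, so telemetry and video ticks do not coincide one-to-one and need timestamp-based alignment.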
| Capture Criterion | Status | Specification |
|---|---|---|
| Short conversation (2-3 min) | Met | Exemplar recordings produced and processed |
| Two speakers | Met | Support for more speakers planned for later phases |
| Spatial audio for overlap analysis | Met | 6-channel (beamformed + 4 raw + loopback), 4-beam DOA telemetry |
| One camera per speaker | Met | CSI camera per speaker |
| Facial orientation and expression | Met | Face-on alignment required for mouth detection |
| Mutual line-of-sight | Met | Physical enclosure design preserves sightlines |
| Remote operation | Met | Flask web API, MJPEG preview, CLI interface |
| Synchronization | 10-100 ms | Sub-millisecond between CSI cameras via libcamera software sync |
Caveat: Long-duration media drift has not yet been fully characterized. A typical consumer-grade clock oscillator tolerance of 25-50 ppm suggests the streams may drift by about 180-360 ms over 2 hours.
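The drift estimate follows directly from the definition of parts per million: a clock that is off by 1 ppm gains or loses 1 microsecond per second. The short sketch below reproduces the arithmetic. The function name `drift_ms` is hypothetical, and the calculation treats the ppm figure as the net relative offset between the two stream clocks, which is a simplifying assumption.

```python
def drift_ms(duration_s: float, tolerance_ppm: float) -> float:
    """Accumulated drift in milliseconds for a given relative clock error.

    Assumes tolerance_ppm is the net rate difference between the two
    clocks being compared, held constant for the whole recording.
    """
    drift_s = duration_s * tolerance_ppm / 1_000_000
    return drift_s * 1_000


two_hours_s = 2 * 60 * 60  # 7200 seconds
for ppm in (25, 50):
    print(f"{ppm} ppm over 2 h: {drift_ms(two_hours_s, ppm):.0f} ms")
```

For a 2-3 minute exemplar recording the same formula gives drift under 10 ms at 50 ppm, which is why the caveat mainly concerns long sessions.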
Key Insights
- Recording is not neutral. Capture choices actively shape what can be analyzed; data format, not just quality, defines epistemic limits.
- Camera geometry affects inference. Small changes in camera height and angle materially impact facial-feature and orientation analysis.
- Reflective surfaces distort spatial audio. DOA estimation is sensitive to reflective surfaces and speaker placement. Recording environmental context as metadata provides downstream systems with information to account for these effects, though the impact on interpretation accuracy remains to be evaluated systematically.
- DOA instability is a hardware-level conversational signal. In benchmark analysis (AMI Corpus, 4 speakers, 21 overlap regions), DOA standard deviation increased up to threefold during overlapping speech. A DOA instability detector operating on raw mic array telemetry achieved 90.5% recall for overlap detection, flagging 29% of audio for closer analysis. This required no capture-time spatial declaration; the signal is an inherent property of the four-microphone array. This suggests that hardware-level spatial audio capture, rather than explicit metadata encoding, is the primary spatial contribution of the recording platform.
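To make the DOA instability idea concrete, the following is a minimal sketch of a detector that flags samples where the spread of recent DOA angles is high. It is not necessarily the detector used in the benchmark: the function names, the one-second trailing window (25 samples at the 25 Hz telemetry rate), and the 20-degree threshold are all assumptions chosen for illustration. It uses the circular standard deviation so that angles near 0 and 360 degrees are treated as close together.

```python
import math


def circular_std_deg(angles_deg):
    """Circular standard deviation, in degrees, of angles given in degrees."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    # Mean resultant length, clamped to (0, 1] to guard against rounding.
    r = min(1.0, max(math.hypot(s, c), 1e-12))
    return math.degrees(math.sqrt(-2.0 * math.log(r)))


def flag_unstable(doa_deg, window=25, threshold_deg=20.0):
    """Return one boolean per sample.

    A sample is flagged when the circular std of the trailing window
    (up to `window` samples, ending at that sample) exceeds threshold_deg.
    Both parameters are illustrative defaults, not measured values.
    """
    flags = []
    for i in range(len(doa_deg)):
        start = max(0, i - window + 1)
        flags.append(circular_std_deg(doa_deg[start:i + 1]) > threshold_deg)
    return flags


# Hypothetical telemetry: 2 s of a steady source, then 2 s of rapid switching.
steady = [90.0] * 50
switching = [60.0, 200.0] * 25
flags = flag_unstable(steady + switching)
print(sum(flags), "of", len(flags), "samples flagged")
```

Flagged regions would then be passed on for closer analysis rather than treated as confirmed overlap, consistent with the recall-oriented use described above.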
Programmatic Connections
Convo-Box defines the capture boundaries for all downstream tools in the program:
- Convo-Recorder encodes spatial and temporal metadata at capture time.
- Convo-Max processes synchronized multimodal recordings into provisional analytic structure.
- Convo-Script supports human review and correction where capture ambiguity persists.
Together, these tools reinforce the program's core thesis: limitations in conversational AI are often infrastructural, not algorithmic.
Ethics
All recordings were conducted with informed consent. The device was intentionally visible and non-deceptive, prioritizing participant comfort and transparency over covert capture.
Data collected with Convo-Box was used solely for research and tool development.
References
Mondada, L. (2019). Contemporary issues in conversation analysis: Embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47–62.
Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696–735.
