Convo-Box

Overview

Convo-Box investigates how recording hardware design shapes what can be known about in-person conversation.

Conversation is spatial, embodied, and often multi-party. Timing, orientation, overlap, and mutual visibility are structural features of interaction, not noise to be suppressed (Sacks, Schegloff, & Jefferson, 1974; Mondada, 2019). Convo-Box was designed to preserve these features at capture time, before any software inference or analysis occurs.

The project treats recording hardware as an epistemic boundary: once conversational structure is flattened during capture, it cannot be recovered downstream.

Role in the Program

Convo-Box is the physical instrumentation layer of the Tools for Conversation program.

Its role is to capture conversational data in a form that preserves spatial and temporal structure, so that downstream tools (recording software, transcription pipelines, and annotation interfaces) operate on richer, less distorted input.

By establishing explicit capture constraints, the system ensures that failures in analysis can be traced to physical recording conditions rather than hidden limitations of commodity hardware.

At the portfolio level, Tools for Conversation addresses the infrastructural conditions of conversation, while Local Conversation Studio addresses its social and cultural conditions.

Problem

Most recording systems are designed for telepresence, not analysis.

They are designed around:

a single active speaker,
orderly turn-taking,
screen-facing participants, and
aggressive noise suppression.

Real-world conversation rarely conforms to these assumptions. Multi-party talk involves overlap, off-axis orientation, embodied cues, and shifting attention.

When researchers rely on standard webcams or conference microphones, spatial relationships are flattened and fine-grained timing cues are lost before transcription or modeling begins. These losses define a hard ceiling on what can later be inferred.

Design Response

Convo-Box was designed as a tabletop, in-situ recording instrument optimized for studying face-to-face conversation rather than facilitating remote communication.

The system is built around the Raspberry Pi 5 platform to prioritize open components, configurability, and precise control over capture behavior.

Hardware architecture choices, such as interface type, sensor geometry, and synchronization strategy, are treated as epistemic decisions, not merely engineering tradeoffs.

Spatial Audio Capture

The device integrates a four-microphone array that supports direction-of-arrival (DOA) estimation at a high update rate.

Unlike conference microphones that suppress instability, the array captures raw spatial variance. Directional fluctuation is treated as potential conversational signal, such as overlapping speech, rather than noise to be eliminated.

Multi-Camera Synchronization

The vision system uses two synchronized CSI cameras rather than USB inputs.

Early prototypes combining USB and CSI cameras revealed that bus-level latency drift was sufficient to destroy the millisecond-level timing required for conversation analysis. Restricting capture to CSI cameras enabled tighter synchronization and more reliable alignment between audio and video streams.
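One way to check alignment between two video streams is to pair each frame from one camera with the nearest-in-time frame from the other, then look at how the offsets change over the recording. The following is a minimal sketch of that idea, not the procedure used in Convo-Box. It assumes per-frame timestamps (in seconds) have already been extracted from each stream; the function names and the synthetic timestamps (a 2 ms fixed offset and a 40 ppm rate error) are hypothetical.

```python
from bisect import bisect_left
from statistics import mean


def nearest_offsets(ts_a, ts_b):
    """For each timestamp in ts_a, return (nearest timestamp in ts_b) - ts_a.

    Both lists must be sorted ascending, in seconds; ts_b must be non-empty.
    """
    offsets = []
    for t in ts_a:
        i = bisect_left(ts_b, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = ts_b[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda u: abs(u - t))
        offsets.append(nearest - t)
    return offsets


def drift_ppm(ts_a, offsets):
    """Least-squares slope of offset against time, in parts per million."""
    t_mean, o_mean = mean(ts_a), mean(offsets)
    num = sum((t - t_mean) * (o - o_mean) for t, o in zip(ts_a, offsets))
    den = sum((t - t_mean) ** 2 for t in ts_a)
    return 1e6 * num / den


# Synthetic data (hypothetical): 60 s at 20 fps; stream B starts 2 ms late
# and its clock runs 40 ppm fast relative to stream A.
FPS = 20
ts_a = [n / FPS for n in range(60 * FPS)]
ts_b = [0.002 + t * (1 + 40e-6) for t in ts_a]

offsets = nearest_offsets(ts_a, ts_b)
print(f"first offset: {offsets[0] * 1e3:.3f} ms")
print(f"last offset:  {offsets[-1] * 1e3:.3f} ms")
print(f"estimated drift: {drift_ppm(ts_a, offsets):.1f} ppm")
```

A constant offset can be corrected once after capture. A non-zero slope means the streams diverge over time, which is the failure described for mixed USB and CSI capture.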

Physical Fabrication

The system is housed in a custom, locally fabricated two-tier enclosure with vibration isolation to decouple the microphone array from table-surface noise.

The low-profile, non-interactive form factor was intentionally chosen to minimize social intrusion, allowing the device to function as a passive listener rather than an attention-directing artifact.

Evaluation

The primary evaluation criterion for Convo-Box is its ability to capture short segments of multi-party conversation with sufficient fidelity for downstream analysis.

The device was successfully operated with Convo-Recorder and Convo-Max to produce a synchronized exemplar recording that served as the reference input for subsequent pipeline development.

The device captures two video streams (two CSI-connected Camera Module 3 units at 1280x720, 20 fps) and 6-channel spatial audio (ReSpeaker XVF3800 at 16 kHz), with DSP telemetry polled at 25 Hz.
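The three capture rates stated above relate to one another by simple ratios, which matters when aligning streams on a common timeline. The sketch below works through that arithmetic. It assumes ideal clocks that start together, and the constant and function names are illustrative rather than taken from the Convo-Box software.

```python
from fractions import Fraction
from math import gcd

AUDIO_RATE_HZ = 16_000  # audio sample rate stated for the ReSpeaker XVF3800
VIDEO_FPS = 20          # frame rate stated for each CSI camera
TELEMETRY_HZ = 25       # stated DSP telemetry polling rate


def samples_per_tick(tick_rate_hz, sample_rate_hz=AUDIO_RATE_HZ):
    """Audio samples spanned by one video frame or one telemetry poll."""
    return Fraction(sample_rate_hz, tick_rate_hz)


def sample_index(tick, tick_rate_hz, sample_rate_hz=AUDIO_RATE_HZ):
    """Index of the audio sample at which the given tick begins (ideal clocks)."""
    return (tick * sample_rate_hz) // tick_rate_hz


# Video frames and telemetry polls fall on the same instant gcd(20, 25) = 5
# times per second, i.e. every 4 frames and every 5 polls.
coincide_hz = gcd(VIDEO_FPS, TELEMETRY_HZ)

print("samples per video frame:   ", samples_per_tick(VIDEO_FPS))
print("samples per telemetry poll:", samples_per_tick(TELEMETRY_HZ))
print("frames and polls coincide every", Fraction(1, coincide_hz), "s")
```

Because both ratios are whole numbers, each video frame and each telemetry poll maps to an exact audio sample index under these assumptions.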

Capture Criterion	Status	Specification
Short conversation (2-3 min)	Met	Exemplar recordings produced and processed
Two speakers	Met	Support for more in later phases
Spatial audio for overlap analysis	Met	6-channel (beamformed + 4 raw + loopback), 4-beam DOA telemetry
One camera per speaker	Met	CSI camera per speaker
Facial orientation and expression	Met	Face-on alignment required for mouth detection
Mutual line-of-sight	Met	Physical enclosure design preserves sightlines
Remote operation	Met	Flask web API, MJPEG preview, CLI interface
Synchronization	10-100ms	Sub-millisecond between CSI cameras via libcamera software sync

Caveat: Long-duration media drift has not yet been fully characterized. A typical consumer-grade clock oscillator tolerance of 25-50 ppm suggests that the streams could drift by roughly 180-360 ms over 2 hours (7,200 s).
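The drift estimate in the caveat is the oscillator tolerance multiplied by the recording duration. The short sketch below reproduces that arithmetic and also estimates how long it takes for drift to reach one video frame interval (50 ms at 20 fps). It treats the tolerance as the net rate error between two streams, which is an assumption; two independent oscillators at opposite extremes could diverge by up to twice as much. The function names are illustrative.

```python
def drift_ms(tolerance_ppm, duration_s):
    """Accumulated drift in milliseconds for a given net rate error."""
    # ppm * 1e-6 * seconds * 1e3 ms/s simplifies to ppm * seconds / 1000.
    return tolerance_ppm * duration_s / 1000


def seconds_until_drift(limit_ms, tolerance_ppm):
    """Recording time after which accumulated drift reaches limit_ms."""
    return limit_ms * 1000 / tolerance_ppm


TWO_HOURS_S = 2 * 60 * 60
FRAME_INTERVAL_MS = 1000 / 20  # one frame at 20 fps

for ppm in (25, 50):
    print(
        f"{ppm} ppm: {drift_ms(ppm, TWO_HOURS_S):.0f} ms over 2 h; "
        f"one frame of drift after {seconds_until_drift(FRAME_INTERVAL_MS, ppm) / 60:.1f} min"
    )
```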

Key Insights

Recording is not neutral. Capture choices actively shape what can be analyzed; data format, not just quality, defines epistemic limits.
Camera geometry affects inference. Small changes in camera height and angle materially impact facial-feature and orientation analysis.
Reflective surfaces distort spatial audio. DOA estimation is sensitive to reflective surfaces and speaker placement. Recording environmental context as metadata provides downstream systems with information to account for these effects, though the impact on interpretation accuracy remains to be evaluated systematically.
DOA instability is a hardware-level conversational signal. In benchmark analysis (AMI Corpus, 4 speakers, 21 overlap regions), DOA standard deviation increased up to threefold during overlapping speech. A DOA instability detector operating on raw mic array telemetry achieved 90.5% recall for overlap detection, flagging 29% of audio for closer analysis. This required no capture-time spatial declaration; the signal is an inherent property of the four-microphone array. This suggests that hardware-level spatial audio capture, rather than explicit metadata encoding, is the primary spatial contribution of the recording platform.
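The idea of a DOA instability detector can be illustrated with a sliding-window circular standard deviation over the DOA angle stream. This is a minimal sketch and not the detector evaluated above: the window length, the threshold, and the synthetic angle sequence are all hypothetical, and it assumes DOA telemetry arrives as one angle in degrees per poll at 25 Hz.

```python
import math


def circular_std_deg(angles_deg):
    """Circular standard deviation, in degrees, of angles given in degrees."""
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    # Mean resultant length, clamped so the logarithm stays finite and <= 0.
    r = min(max(math.hypot(c, s), 1e-12), 1.0)
    return math.degrees(math.sqrt(-2.0 * math.log(r)))


def flag_unstable(doa_deg, window=25, threshold_deg=20.0):
    """One flag per sample: True where the trailing window's spread exceeds the threshold.

    window=25 is about one second at 25 Hz; both defaults are illustrative.
    """
    flags = []
    for i in range(len(doa_deg)):
        start = max(0, i - window + 1)
        flags.append(circular_std_deg(doa_deg[start:i + 1]) > threshold_deg)
    return flags


# Synthetic telemetry (hypothetical): 2 s of one speaker at 90 degrees, then
# 2 s in which the estimate alternates between 90 and 210 degrees.
stable = [90.0] * 50
overlap = [90.0 if k % 2 == 0 else 210.0 for k in range(50)]
doa = stable + overlap

flags = flag_unstable(doa)
print(f"flagged fraction: {sum(flags) / len(flags):.2f}")
```

A circular statistic is used so that angles near the 0/360 degree wrap-around are not mistaken for large jumps. In practice the threshold would need tuning against labeled overlap regions.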

Programmatic Connections

Convo-Box defines the capture boundaries for all downstream tools in the program:

Convo-Recorder encodes spatial and temporal metadata at capture time.
Convo-Max processes synchronized multimodal recordings into provisional analytic structure.
Convo-Script supports human review and correction where capture ambiguity persists.

Together, these tools reinforce the program's core thesis: limitations in conversational AI are often infrastructural, not algorithmic.

Ethics

All recordings were conducted with informed consent. The device was intentionally visible and non-deceptive, prioritizing participant comfort and transparency over covert capture.

Data collected with Convo-Box was used solely for research and tool development.

References

Mondada, L. (2019). Contemporary issues in conversation analysis: Embodiment and materiality, multimodality and multisensoriality in social interaction. Journal of Pragmatics, 145, 47-62.

Sacks, H., Schegloff, E. A., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50(4), 696-735.

Tags: hardware, recording, conversation-analysis, raspberry-pi, spatial-audio