
AI-QA

Local Conversation Studio · 2025 · Remote

AI-assisted qualitative analysis for exploring interview data.

Research Questions

  1. How can AI assist qualitative labeling while keeping codes provisional and revisable?

  2. How can embedding-based clustering be used to better explore transcript data?

  3. What forms of evaluation are appropriate when the goal is sensemaking and retrieval, not prediction?

AI-assisted network analysis reveals interview segment themes & relationships

Overview

AI-QA explores how AI systems can support qualitative analysis without collapsing interpretation into fixed labels, authoritative summaries, or predictive outputs.

Rather than treating qualitative coding as a task to automate, the project reframes it as an alignment problem: how to keep machine-assisted organization responsive to human interpretation, provisional reasoning, and contextual nuance.

AI-QA advances a distinct position: it investigates qualitative alignment, not automation, positioning AI as a navigational aid for sensemaking rather than an engine for answers.

Role in the Program

AI-QA functions as the analytic engine of the Local Conversation Studio program.

Its role is to transform raw fieldwork materials, including interviews, observations, and conversational fragments, into structured, explorable evidence that can inform design while remaining grounded in original voices and contexts.

The system explicitly does not generate summaries, recommendations, or answers for end users. It supports researcher-led sensemaking and sits between ethnographic fieldwork and intervention design, preserving human interpretive authority throughout the analytic process.

Problem

Traditional qualitative analysis offers rigor and interpretive depth but does not scale easily. Fully automated AI analysis can scale, but often introduces semantic drift, cultural misreadings, or false authority.

Systems optimized for prediction, summarization, or retrieval tend to flatten ambiguity and erase provenance: precisely the qualities that qualitative research depends on.

AI-QA addresses this tension by treating AI as an organizational partner, not an expert. Interpretive judgment remains human; AI supports navigation, pattern discovery, and traceability rather than decision-making.

Methodology & System Architecture

AI-QA was developed as a layered analytic workflow with explicit feedback loops for refinement and alignment.

The dataset consists of 595 semantically significant clips selected from 26 audio-only interviews conducted as part of the Learning from Oaxaca project.

Phase 1: Interactive Labeling (Scaffolding)

The first phase establishes a revisable interpretive structure rather than ground truth.

  • Human anchors:
    150 human-coded anchor clips span 15 interviews, each clip accompanied by a detailed analytical memo explaining the coding rationale.

  • Iterative codebook:
    A hierarchical codebook was developed from the human-coded clips and revised iteratively, eventually settling at 139 codes across 8 domains.

The final 139-code hierarchy is organized into 8 analytical domains:

Domain                  Codes   Description
I. Place                18      How places exist, are experienced, and become meaningful
II. Identity            18      Place-based selves and group boundaries
III. Knowledge          22      Ways of knowing tied to place
IV. Encounter           15      Interactions between different positionalities
V. Political Economy    24      Material conditions, economic systems, power relations
VI. Temporal            13      Continuity, change, and their narration
VII. Stance             15      How things are talked about (evaluative, epistemic, affective)
VIII. Language          14      Language as both topic and medium

Each code includes definitions, positive and negative examples in Spanish and English, and explicit decision trees for ambiguous cases.
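As a sketch of what one such entry might look like in code (the field names and example content are illustrative, not the project's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class CodeEntry:
    """One entry in the hierarchical codebook (illustrative schema)."""
    code: str                 # e.g. "direct_experience"
    domain: str               # one of the 8 analytical domains
    definition: str
    positive_examples: list[str] = field(default_factory=list)  # Spanish + English
    negative_examples: list[str] = field(default_factory=list)
    when_not_to_apply: str = ""      # added in v5.2 to curb overgeneralization
    decision_tree: list[str] = field(default_factory=list)      # ordered questions for ambiguous cases

# Hypothetical example entry
direct_experience = CodeEntry(
    code="direct_experience",
    domain="III. Knowledge",
    definition="Knowledge grounded in evidential witnessing, not mere first-person narration.",
    positive_examples=["Yo vi cómo se preparaba el mezcal / I saw how the mezcal was prepared"],
    negative_examples=["Fui al mercado ayer / I went to the market yesterday (narration only)"],
    when_not_to_apply="Do not apply to first-person narration without an evidential claim.",
    decision_tree=[
        "Does the speaker claim to have witnessed the phenomenon?",
        "Is the witnessing offered as evidence, not just story sequencing?",
    ],
)
```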

The codebook is grounded in place attachment theory (Tuan, Massey, Proshansky) and epistemologies of the South (Santos). Applied across 595 clips, it produced 2,575 code assignments (avg 4.3 codes per clip).

Codes were treated as alignment infrastructure rather than definitive claims, supporting coordination across multiple evaluative perspectives (human and computational) while remaining provisional and revisable.

Phase 2: Embedding-Based Clustering

After semantic labeling stabilized, embeddings were used to test how analytical codes reshaped relationships among clips. Each clip was embedded under multiple configurations (transcript-only and transcript-plus-code) to compare semantic structure with and without analytic framing.
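A minimal sketch of the two configurations, assuming a sentence-transformers model and k-means clustering (the model name, the code-injection format, the toy clips beyond the mezcal example, and the cluster count are all assumptions, not the project's documented stack):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Assumed model; the project's actual embedding stack is not documented here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy stand-ins for the 595 clips; each pairs a transcript with its codes.
clips = [
    {"transcript": "Vendemos mezcal artesanal a los visitantes.",
     "codes": ["commodification", "local_visitor", "artisanal_production"]},
    {"transcript": "Las mayordomías reúnen dinero para las fiestas del pueblo.",
     "codes": ["community_organizing", "cyclical_time"]},
    {"transcript": "Las tradiciones son importantes.",
     "codes": ["tradition"]},
]

# Configuration A: transcript-only embeddings.
emb_plain = model.encode([c["transcript"] for c in clips])

# Configuration B: transcript-plus-code embeddings. Appending codes as text is
# one simple way to inject analytic framing; not necessarily the project's.
emb_coded = model.encode(
    [c["transcript"] + " [codes: " + ", ".join(c["codes"]) + "]" for c in clips]
)

# Cluster each configuration; k=2 suits the toy data (the real analysis would tune k).
labels_plain = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb_plain)
labels_coded = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb_coded)
```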

Clusters were examined through three analytic passes:

Divergent (edge detection)
Analytical codes primarily affected clips at semantic boundaries: 85–93% of boundary clips were reassigned once codes were added, indicating that the codebook resolves ambiguity rather than redefining dominant themes. For example, a clip about artisanal mezcal sales may sit between the Political Economy and Encounter clusters in transcript-only space but stabilizes once codes like [commodification, local_visitor, artisanal_production] are applied.
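One way to quantify this pass is to mark clips far from their centroid as "boundary" and measure how often their cluster changes between configurations. The sketch below assumes contiguous integer labels (as k-means produces) and an illustrative distance-quantile threshold:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def boundary_reassignment_rate(emb, labels_a, labels_b, edge_quantile=0.75):
    """Fraction of boundary clips whose cluster changes once codes are added.

    'Boundary' = distance to own centroid above the edge_quantile cutoff;
    the 0.75 threshold is illustrative, not the project's definition.
    """
    centroids = np.array([emb[labels_a == k].mean(axis=0) for k in np.unique(labels_a)])
    dists = np.linalg.norm(emb - centroids[labels_a], axis=1)
    is_edge = dists > np.quantile(dists, edge_quantile)

    # Align cluster IDs across configurations by maximum overlap, so a merely
    # relabeled-but-identical clustering does not count as reassignment.
    n = max(labels_a.max(), labels_b.max()) + 1
    overlap = np.zeros((n, n))
    for a, b in zip(labels_a, labels_b):
        overlap[a, b] += 1
    rows, cols = linear_sum_assignment(-overlap)
    mapping = dict(zip(cols, rows))
    labels_b_aligned = np.array([mapping.get(b, b) for b in labels_b])

    changed = labels_a != labels_b_aligned
    return float(changed[is_edge].mean())
```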

Convergent (stabilization)
Transcript embeddings alone preserved most thematic structure (~70%; kNN preservation 61–68%). The codebook modulated the remaining structure, with minimal impact near cluster centroids. Temporal interview structure remained highly stable across configurations (r > 0.89), suggesting that coding refines interpretation without collapsing conversational form.
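The kNN-preservation figure can be computed as the mean overlap between each clip's nearest-neighbor sets in the two embedding spaces. A sketch, with k and the exact overlap definition as assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_preservation(emb_a, emb_b, k=10):
    """Mean fraction of each clip's k nearest neighbors (excluding itself)
    shared between two embedding spaces; 1.0 means identical local structure."""
    _, idx_a = NearestNeighbors(n_neighbors=k + 1).fit(emb_a).kneighbors(emb_a)
    _, idx_b = NearestNeighbors(n_neighbors=k + 1).fit(emb_b).kneighbors(emb_b)
    # Column 0 of each index array is the clip itself, so slice it off.
    overlaps = [len(set(a[1:]) & set(b[1:])) / k for a, b in zip(idx_a, idx_b)]
    return float(np.mean(overlaps))
```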

Temporal (narrative arcs)
Cluster membership showed strong sequential persistence: clips were 7.6× more likely to remain in the same cluster as the preceding clip. Narrative flow appears to be a property of the interaction itself, robust to embedding strategy and labeling scheme.
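A sketch of how a sequential-persistence ratio might be computed; the chance baseline used here (two independent draws sharing a cluster) is an assumption, not necessarily how the 7.6× figure was derived:

```python
import numpy as np

def sequential_persistence(labels, interview_ids):
    """Ratio of observed same-cluster adjacency to the chance rate implied by
    cluster sizes; values > 1 indicate narrative arcs persist across clips."""
    labels = np.asarray(labels)
    interview_ids = np.asarray(interview_ids)
    # Only compare consecutive clips drawn from the same interview.
    same_interview = interview_ids[1:] == interview_ids[:-1]
    observed = (labels[1:] == labels[:-1])[same_interview].mean()
    # Chance baseline: probability that two independent draws share a cluster.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    chance = (p ** 2).sum()
    return float(observed / chance)
```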

Evaluation Approach

The primary component evaluated and refined is the codebook, which was validated using a three-pass pipeline: structural code-name matching, LLM-as-judge quality assessment, and human-driven codebook revision.

Validation Cycle               Good Rate   Key Finding
v4 (first cycle, 2,323 codes)  66%         direct_experience overgeneralized to any first-person narration
v5.1 (second cycle)            61.5%       Tightened definitions exposed 87 false positives on direct_experience, 52 on local_visitor
v5.2 (final, 2,575 codes)      94.8%       0 bad codes remaining after decision trees and when_not_to_apply sections added

The v5.1 drop from 66% to 61.5% was not a regression but diagnostic progress: the codebook's tightened definitions exposed previously invisible failures. The code direct_experience was being applied to any first-person narration rather than evidential witnessing; local_visitor was triggered by any mention of tourism rather than actual encounters; and the retired code neutral_description was still applied 27 times despite deprecation.

These patterns echo a broader challenge in LLM-assisted analysis: plausible but incorrect outputs are a persistent failure mode that demands structured provenance tracking and human review (Huang et al., 2023). Each failure class drove specific revisions: decision trees, explicit when_not_to_apply guidance with bilingual examples, and one code retirement with a critical warning.
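The first pass, structural code-name matching, can be as simple as checking every applied code against the current codebook and its retired-code list. A sketch, reusing the illustrative CodeEntry schema from Phase 1 (the assignment format is also an assumption):

```python
def structural_check(assignments, codebook, retired=frozenset({"neutral_description"})):
    """First validation pass: flag applied codes that are unknown to the
    current codebook or that were retired. `assignments` maps clip IDs to
    lists of code names; the schema is illustrative."""
    valid = {entry.code for entry in codebook}
    flags = []
    for clip_id, codes in assignments.items():
        for code in codes:
            if code in retired:
                flags.append((clip_id, code, "retired code still applied"))
            elif code not in valid:
                flags.append((clip_id, code, "unknown code name"))
    return flags
```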

Key Insights

1. The “centroid trap”

Analytic leverage consistently concentrates at the edges of clusters, not their centers. In the corpus, 85–90% of edge clips change cluster assignment when analytical codes are added (vs. ~50% of core clips). Edge clips sit 1.67–1.75× farther from centroids than core clips.

A concrete example: a clip about mayordomías (community members pooling money for celebrations) sits at the edge of a cluster because it blends community organizing (Place), economic exchange (Political Economy), and cyclical time (Temporal).

It is analytically rich precisely because it refuses neat categorization. A nearby core clip simply says "traditions are important", which is prototypical of its cluster but adds nothing the cluster label doesn't already convey.
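A sketch of the edge/core split behind these figures, again using distance to the cluster centroid with an illustrative quantile threshold:

```python
import numpy as np

def edge_core_profile(emb, labels, edge_quantile=0.75):
    """Split clips into core/edge by distance to their cluster centroid and
    report how much farther edge clips sit (threshold is illustrative)."""
    centroids = np.array([emb[labels == k].mean(axis=0) for k in np.unique(labels)])
    dists = np.linalg.norm(emb - centroids[labels], axis=1)
    edge = dists > np.quantile(dists, edge_quantile)
    return {
        "edge_to_core_distance_ratio": float(dists[edge].mean() / dists[~edge].mean()),
        "n_edge": int(edge.sum()),
        "n_core": int((~edge).sum()),
    }
```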

2. Codebooks as alignment infrastructure

Well-structured codebooks functioned not merely as labels, but as shared scaffolding between human interpretation and computational analysis.

By remaining provisional and revisable, the codebook supported alignment without imposing false certainty or erasing disagreement.

Future Directions

  1. Conduct structured human review of 2,000+ applied codes to log errors and ambiguities
  2. Extend alignment analysis across additional qualitative datasets
  3. Compare alternative embedding strategies with respect to interpretive stability

Programmatic Connections

Insights from AI-QA informed other components of the program:

  • Local Knowledge Cache:
    Evaluation strategies and edge-case analysis became the basis for claim-level review and quality control.

  • Local Language Cards:
    Temporal and interactional insights informed the design of analog interventions that support reciprocal learning.

References

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., & Liu, T. (2023). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.

qualitative-research · ai-labeling · embeddings · clustering · human-ai-collaboration