A growing corpus of structured argumentative speech.
Every round on Debate AI produces a motion-tagged, format-aware, graded exchange of arguments. The internal corpus feeds a nightly learning loop that sharpens the AI on real usage. An opt-in subset is available for licensing to AI research organizations studying argumentative dialogue, voice-mode language models, and debate pedagogy.
Why this dataset is hard to replicate
Adversarial, not monologic
Open-web text is mostly one side talking past the other. Every row here is a real-time rebuttal under a format clock, often with point-of-information (POI) interruptions. That's the part of training data that scraped op-eds, podcasts, and Reddit threads don't cover.
Format-aware register
Policy spreads tagged cards. APDA stays impromptu. LD argues value/criterion. BP runs whip extensions. Each format has its own grammar of speech, and the corpus captures the register switching that a general-purpose model has no signal for.
Graded outcomes
Rounds get user ratings (1–5) and judge ballots with speaker points and weighing. Those signals can be converted into preference-pair labels for RLHF or DPO without commissioning a separate human-labeling pipeline.
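As a rough illustration of how ratings could become preference pairs, the sketch below groups responses that answer the same prompt and marks the higher-rated one as preferred. The `RatedSpeech` type, its field names, and the `min_gap` threshold are assumptions made for this example, not the actual export schema; the `prompt`/`chosen`/`rejected` layout is simply a common convention for DPO-style training data.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass(frozen=True)
class RatedSpeech:
    # Hypothetical fields, chosen for illustration only.
    prompt: str    # the motion plus whatever the speech is responding to
    response: str  # the speech text
    rating: int    # user rating on the 1-5 scale


def preference_pairs(speeches: List[RatedSpeech], min_gap: int = 1) -> List[Dict[str, str]]:
    """Pair responses to the same prompt; the higher-rated one is 'chosen'."""
    by_prompt: Dict[str, List[RatedSpeech]] = {}
    for speech in speeches:
        by_prompt.setdefault(speech.prompt, []).append(speech)

    pairs: List[Dict[str, str]] = []
    for prompt, group in by_prompt.items():
        for a, b in combinations(group, 2):
            if abs(a.rating - b.rating) < min_gap:
                continue  # ties (or near-ties) carry no preference signal
            chosen, rejected = (a, b) if a.rating > b.rating else (b, a)
            pairs.append(
                {"prompt": prompt, "chosen": chosen.response, "rejected": rejected.response}
            )
    return pairs


speeches = [
    RatedSpeech("motion-1", "rebuttal A", 5),
    RatedSpeech("motion-1", "rebuttal B", 2),
    RatedSpeech("motion-1", "rebuttal C", 5),
    RatedSpeech("motion-2", "rebuttal D", 3),
]
pairs = preference_pairs(speeches)
```

Raising `min_gap` keeps only pairs with a clear rating difference, which trades volume for cleaner labels. Judge ballots could feed the same function if speaker points were mapped onto a comparable scale.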
What's in the licensable subset
Opt-in only. The toggle lives in every user's profile, off by default, with the legal terms in privacy §6. When a user turns it on, future rounds (typed and voice) carry a contributable: true flag; everything else stays internal.
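The consent rule above, that only rounds created after opt-in are flagged, could be expressed roughly as follows. This is a minimal sketch: the function name and the idea of storing an opt-in timestamp are assumptions, and it does not cover what happens if a user later switches the toggle off.

```python
from datetime import datetime, timezone
from typing import Optional


def is_contributable(round_created_at: datetime, opted_in_at: Optional[datetime]) -> bool:
    """Return True only for rounds created at or after the user's opt-in.

    opted_in_at is None while the toggle is off, which is the default.
    """
    if opted_in_at is None:
        return False
    return round_created_at >= opted_in_at


# Hypothetical user who opted in on 2026-05-25.
opt_in = datetime(2026, 5, 25, tzinfo=timezone.utc)
earlier_round = datetime(2026, 5, 20, tzinfo=timezone.utc)
later_round = datetime(2026, 5, 27, tzinfo=timezone.utc)

flags = [is_contributable(earlier_round, opt_in), is_contributable(later_round, opt_in)]
```

Comparing against the opt-in time means rounds played before consent was given are never swept into the licensable subset retroactively.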
Each row, after anonymization, is shaped roughly:
Anonymized means stripped of name, email, account id, IP, and any device fingerprints. What remains is the speech and its structural metadata. Voice audio is never stored; only the text transcript is eligible.
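The stripping step described above might look like the sketch below. Every key name here is hypothetical, chosen to mirror the identifiers listed in the text; the real export schema is only available on request.

```python
from typing import Any, Dict

# Identifier fields named in the text. The key names are assumptions.
IDENTIFYING_KEYS = frozenset(
    {"name", "email", "account_id", "ip", "device_fingerprint"}
)


def anonymize(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row without any identifying fields."""
    return {key: value for key, value in row.items() if key not in IDENTIFYING_KEYS}


# Hypothetical raw row; the values are placeholders, not real data.
raw_row = {
    "name": "Example User",
    "email": "user@example.com",
    "account_id": "acct-0001",
    "ip": "203.0.113.7",
    "device_fingerprint": "fp-placeholder",
    "motion": "example motion text",
    "format": "BP",
    "mode": "voice",          # typed or voice; only the transcript is kept
    "transcript": "example speech text",
    "user_rating": 4,
    "contributable": True,
}

clean_row = anonymize(raw_row)
```

Dropping fields by key removes the account-level identifiers but does not scrub personal details a speaker might mention inside the transcript itself; that would need a separate pass.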
Per-format internal counts
Snapshot from the last nightly aggregation. Includes all generations, not just the opt-in subset, so you can see where the volume is concentrated.
The growth curve, not the row count
Volume today is small. What's compounding is the architecture: a learning loop that has been writing every generation to the corpus since 2026-05-13, a consent layer that went live on 2026-05-25, and a daily distillation pass that reshapes the AI based on rated outputs. The licensable subset is just starting; its value lies in what the dataset becomes at scale, not in what it is this week.
License inquiries
Open to conversations with AI research orgs, academic labs, and dataset aggregators. Happy to share a sample export under NDA and walk through the schema in detail.
Contact: feedback@debateai.com. The consent terms are in privacy §6.