Questions, paper explainers, and interactive tools to make confident LLM decisions.
Latest Site Updates
Newest interview questions and paper explainers, shown newest first.
New Paper Explainer
Paper 38 · Agent-in-the-Loop: A Data Flywheel for Continuous Improvement
Production framework that embeds four types of human feedback directly into live customer support operations: pairwise response preferences, adoption rationales, knowledge relevance checks, and missing knowledge identification. Reduces model update cycles from 3 months to weeks by creating a self-sustaining data flywheel.
Published paper 38 with an interactive AITL deployment simulator: configure team size, cases per day, annotation timing strategy (immediate/hybrid/delayed), quality filter strictness, and update frequency to project annotation volume, quality metrics, model improvements, and ROI.
Built an annotation agreement rate calculator showing the impact of timing strategy on four annotation types: only missing knowledge identification requires immediate annotation (+12pp agreement boost), enabling AITL deployment in strict-SLA channels like live chat.
Documented Airbnb production results: 40 agents, 5,000+ cases, 11 annotations per agent daily without productivity loss; retrieval +11.7% recall@75, +14.8% precision@8; generation +8.4% helpfulness, +38.1% citation accuracy; agent adoption rate +4.5%, all from weekly retraining cycles.
Implemented an ROI calculator accounting for annotation costs (agent time, infrastructure) versus benefits (improved adoption rate × time saved per case); it demonstrates positive ROI at scale when operational workflows generate training data as a byproduct (a worked sketch follows below).
Added a comparison table: AITL (human annotations, real-time mode, weekly updates, retrieval+ranking+generation) versus Arena Learning (AI-simulated self-play, offline mode, weekly updates, generation only), highlighting AITL's multi-stage optimization and direct handling of preference drift.
Updated papers/manifest.json, shipped /p/38.html, and added comprehensive deployment roadmap covering dual-response UI design, unified knowledge base architecture, four-step annotation workflow, review layers, automated retraining pipelines, and cross-channel deployment strategies.
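Below is a minimal sketch of the break-even arithmetic an ROI calculation like this implies, assuming costs are agent annotation time plus infrastructure and benefits are adoption-rate lift × time saved per adopted case; the function name and every parameter value are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of an AITL-style ROI comparison. All values are illustrative.

def aitl_monthly_roi(
    agents: int = 40,                        # annotating support agents
    annotations_per_agent_day: float = 11.0,
    seconds_per_annotation: float = 20.0,
    agent_hourly_cost: float = 30.0,
    infra_cost_per_month: float = 2_000.0,
    cases_per_month: int = 100_000,
    adoption_rate_lift: float = 0.045,       # +4.5pp adoption reported in the summary
    minutes_saved_per_adopted_case: float = 2.0,
    working_days: int = 21,
) -> float:
    """Return net monthly ROI (benefit minus cost) in dollars."""
    annotation_hours = (agents * annotations_per_agent_day * working_days
                        * seconds_per_annotation / 3600)
    cost = annotation_hours * agent_hourly_cost + infra_cost_per_month
    benefit = (cases_per_month * adoption_rate_lift
               * minutes_saved_per_adopted_case / 60 * agent_hourly_cost)
    return benefit - cost

print(f"Net monthly ROI: ${aitl_monthly_roi():,.0f}")
```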
New Paper Explainer
Paper 37 · Quantifying Human-AI Synergy
Bayesian Item Response Theory framework separating individual ability (θ) from collaborative ability (κ) while controlling for task difficulty. Finds GPT-4o boosts human performance by 29pp and Llama-3.1-8B by 23pp across 667 users. Crucially, Theory of Mind predicts superior AI collaboration independent of solo problem-solving ability.
Published paper 37 with an interactive synergy calculator: configure your ability level (θ), Theory of Mind score, task difficulty (β), and domain to see personalized predictions of AI collaboration performance and boost estimates (a minimal prediction sketch follows this entry).
Built a dual-ability decomposition visualizer showing how individual ability (θ) and collaborative ability (κ) are distinct yet correlated (ρ_s = 0.67); users can compare solo versus with-AI performance across GPT-4o and Llama-3.1-8B.
Implemented a model comparison panel with side-by-side GPT-4o (+29pp boost) versus Llama-3.1-8B (+23pp boost) predictions, including confidence intervals, sample size adjustments, and personalized insights for deployment decisions.
Documented evidence: 667 participants, 2,072 solo observations across math/physics/moral reasoning tasks; dual-ability model strongly favored (ΔELPD = 50.9, SE = 10.2); higher-ability users perform best overall but lower-ability users see larger relative gains; Theory of Mind correlates with κ (ρ = 0.42, p < 0.001) but not θ.
Added a task difficulty gradient analysis showing AI provides the highest boost on hard problems (ρ = -0.91), supporting the complementarity hypothesis; easy tasks show ceiling effects that limit synergy potential.
Updated papers/manifest.json, shipped /p/37.html, and added roadmap guidance for building internal synergy benchmarks, training collaborative ability (κ) separately from domain skills (θ), and using ToM assessments in hiring for AI-augmented roles.
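As a rough illustration of the kind of prediction such a calculator makes, here is a logistic IRT-style sketch with individual ability θ, collaborative ability κ, and task difficulty β; the additive link function, the ai_shift values, and all numbers are assumptions chosen for illustration, not the paper's fitted model.

```python
import math

def p_solo(theta: float, beta: float) -> float:
    """IRT-style probability of solving a task alone: sigmoid(theta - beta)."""
    return 1 / (1 + math.exp(-(theta - beta)))

def p_with_ai(theta: float, kappa: float, beta: float, ai_shift: float) -> float:
    """With-AI probability: collaborative ability kappa plus a model-specific shift.
    The additive form is an illustrative assumption, not the paper's model."""
    return 1 / (1 + math.exp(-(theta + kappa + ai_shift - beta)))

theta, kappa, beta = 0.0, 0.5, 0.8   # hypothetical average user, moderately hard task
for model, shift in [("GPT-4o", 1.3), ("Llama-3.1-8B", 1.0)]:  # shifts chosen arbitrarily
    solo, with_ai = p_solo(theta, beta), p_with_ai(theta, kappa, beta, shift)
    print(f"{model}: solo {solo:.2f} -> with AI {with_ai:.2f} (boost +{with_ai - solo:.2f})")
```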
New Paper Explainer
Paper 36 · ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Novel memory framework enabling LLM agents to learn from accumulated experiences by distilling generalizable reasoning strategies from both successful and failed attempts. Combined with memory-aware test-time scaling (MaTTS), achieves +8.3pp improvement (20.5% relative) with 14% fewer interaction steps.
Published paper 36 with an interactive memory architecture simulator: compare trajectory memory, workflow memory, and ReasoningBank across different task complexities, scaling modes, and learning configurations to project performance and efficiency.
Built a learning curve visualizer showing how different memory types enable (or prevent) continuous improvement over sequential tasks; ReasoningBank demonstrates logarithmic success rate growth while the no-memory baseline remains flat.
Implemented a MaTTS scaling simulator with parallel (self-contrast) and sequential (self-refinement) modes: configure the scaling factor (1-5x) to see trade-offs between compute cost and memory quality improvements.
Documented evidence: +8.3pp success rate improvement over the no-memory baseline on WebArena (40.5% → 48.8%), 14% reduction in interaction steps (9.7 → 8.3), strongest generalization gains on cross-domain tasks (+4.6pp on WebArena Multi), robust across three model architectures (Gemini-2.5-flash/pro, Claude-3.7-sonnet).
Added a success/failure learning comparison showing ReasoningBank's dual extraction strategy (65% success strategies + 35% failure guardrails) versus success-only approaches, demonstrating the importance of learning from mistakes for robustness (a simplified sketch follows below).
Updated papers/manifest.json, shipped /p/36.html, and added phased memory adoption roadmap with cost-benefit analysis framework for test-time scaling deployment decisions.
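For intuition about the dual extraction strategy, here is a heavily simplified, hypothetical sketch of a ReasoningBank-style memory: it distills one item from every trajectory (a strategy on success, a guardrail on failure) and retrieves by naive keyword overlap. Class and method names are assumptions; the actual system uses LLM-based distillation and embedding retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    title: str
    description: str
    from_success: bool   # strategies come from successes, guardrails from failures

@dataclass
class ReasoningBankSketch:
    items: list[MemoryItem] = field(default_factory=list)

    def distill(self, task: str, trajectory_summary: str, succeeded: bool) -> None:
        """Turn a finished trajectory (success or failure) into a reusable memory item."""
        kind = "strategy" if succeeded else "guardrail"
        self.items.append(MemoryItem(f"{kind}: {task}", trajectory_summary, succeeded))

    def retrieve(self, task: str, k: int = 3) -> list[MemoryItem]:
        """Naive keyword-overlap retrieval standing in for embedding search."""
        words = set(task.lower().split())
        ranked = sorted(self.items,
                        key=lambda m: len(words & set(m.title.lower().split())),
                        reverse=True)
        return ranked[:k]

bank = ReasoningBankSketch()
bank.distill("filter orders by date", "use the date picker before paginating", succeeded=True)
bank.distill("filter orders by date", "do not rely on URL params the site ignores", succeeded=False)
for item in bank.retrieve("filter invoices by date"):
    print(item.title)
```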
New Paper Explainer
Paper 35 · Can GenAI Improve Academic Performance?
First large-scale empirical study of GenAI's impact on scientific productivity. Using matched panel data from 32,480 researchers (2021-2024), it finds GenAI adoption increases output by 36% in year two with modest quality gains, strongest for early-career researchers and non-English speakers.
Published paper 35 with an interactive GenAI productivity impact calculator: configure a researcher profile (career stage, language background, field complexity, baseline productivity) to estimate personalized gains from adoption.
Built a year-by-year timeline showing the productivity trajectory (15% year one → 36% year two) with quality metrics (journal impact +1.3% → +2.0%), demonstrating no quality dilution from increased output.
Documented heterogeneous effects: early-career researchers benefit most (+40% relative boost), non-English speakers from linguistically distant countries show the largest gains (+50%), and technical fields see amplified productivity (+30%), suggesting GenAI reduces structural barriers.
Documented the identification strategy: difference-in-differences with propensity score matching (8,120 GenAI users matched 3:1 to controls), keyword-based adoption detection (65 GenAI markers like "delve," "intricate," "meticulous"), and author fixed effects controlling for time-invariant differences (a minimal specification sketch follows this entry).
Updated papers/manifest.json, shipped /p/35.html, and added institutional roadmap for tracking adoption patterns, designing targeted interventions, and establishing ethical guardrails for GenAI use in research.
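Here is a minimal sketch of the kind of two-way fixed-effects difference-in-differences specification described above, run on a toy panel with statsmodels; the column names, toy numbers, and the exact formula are assumptions, and the paper's actual estimation may differ.

```python
# Toy two-way fixed-effects DiD: adopters vs controls, author and year fixed effects.
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.DataFrame({
    "author_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "year":      [2021, 2022, 2023] * 4,
    "adopter":   [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # ever-adopter indicator
    "post":      [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],  # after the adoption window
    "papers":    [2, 3, 4, 1, 2, 3, 2, 2, 2, 3, 3, 3],  # annual output
})

# Author fixed effects absorb time-invariant ability; year fixed effects absorb shocks.
model = smf.ols("papers ~ adopter:post + C(author_id) + C(year)", data=panel).fit()
print(model.params.filter(like="adopter:post"))
```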
New Paper Explainer
Paper 34 · Emergent Coordination in Multi-Agent Language Models
Information-theoretic framework to test when multi-agent LLM systems show genuine coordination versus mere parallel execution. Shows that prompt design (personas + theory-of-mind) can steer systems from loose aggregates to higher-order collectives with complementary roles.
Published paper 34 with an interactive multi-agent coordination analyzer: simulate different experimental conditions (control, personas, personas+ToM) and measure emergent synergy, differentiation, and temporal coupling using partial information decomposition.
Added a plain-language explainer using a team analogy (independent responders versus members who think about others' contributions) to clarify the difference between aggregates and genuine collectives.
Built a timeline visualizer showing agent behavior patterns across rounds: control shows random activity, personas create agent-specific patterns, and ToM produces complementary, anti-correlated coordination.
Documented evidence: synergy scales with intervention strength and is robust across multiple entropy estimators (binning, KSG, Gaussian); coordination-free baselines show near-zero synergy; the pattern mirrors human collective intelligence principles (alignment + complementarity).
Updated papers/manifest.json, shipped /p/34.html, and added instrumentation roadmap for practitioners to measure TDMI in production multi-agent systems before scaling.
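For teams instrumenting this before scaling, here is a rough histogram-based sketch of time-delayed mutual information (TDMI) between two agents' output signals; the binning estimator, lag, and the synthetic signals are illustrative assumptions (the paper also reports KSG and Gaussian estimators).

```python
import numpy as np

def tdmi(x: np.ndarray, y: np.ndarray, lag: int = 1, bins: int = 8) -> float:
    """Time-delayed mutual information I(x_t ; y_{t+lag}) via simple histogram binning."""
    x, y = x[:-lag], y[lag:]
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=500)                                      # agent A's signal
b_coupled = 0.8 * np.roll(a, 1) + 0.2 * rng.normal(size=500)  # agent B reacts to A
b_independent = rng.normal(size=500)                          # uncoordinated baseline
print("coupled    :", round(tdmi(a, b_coupled), 3))
print("independent:", round(tdmi(a, b_independent), 3))
```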
New Paper Explainer
Paper 33 · What the F*ck Is Artificial General Intelligence?
Provocative survey that clarifies AGI as an artificial scientist capable of adaptation with insufficient resources. Critiques computational dualism and shows that the era of The Embiggening (scale-maxed approximation) has ended; sample and energy efficiency are the new bottlenecks.
Published paper 33 with an interactive meta-approach problem matcher: select from 6 problem types (novelty, interpretability, energy, latency, precision, adaptability) to get affinity scores (1-10) for scale-maxing, simp-maxing, and w-maxing strategies.
Added a plain-language explainer using a chef analogy (recipes = search, intuition = approximation, master chef = hybrid) to clarify the foundational tools and why AGI requires a fusion of approaches rather than a monolithic one.
Surveyed cognitive architectures (o3, AlphaGo, AERA, NARS, Hyperon) mapped to problem types, showing how chain-of-thought, search+neural nets, self-programming, non-axiomatic logic, and modular systems address different constraints.
Updated papers/manifest.json, shipped /p/33.html, and added Bennett's Razor framework (maximize weakness of constraints) to roadmap for teams navigating post-scaling strategies.
New Paper Explainer
Paper 32 · Performance or Principle: AI Labor Market Resistance
Large-scale U.S. survey reveals that public resistance to AI automation divides into performance-based objections (88%, which fade as AI improves) and principle-based objections (12%, permanent moral boundaries around caregiving, therapy, and spiritual roles).
Published paper 32 with an interactive occupation resistance analyzer covering 940 jobs across 8 categories, showing how support rises from 30% to 58% when AI outperforms humans, except for morally protected roles.
Added a classification system distinguishing performance-based (technical concerns), principle-based (moral objections), and mixed resistance patterns, with go-to-market messaging recommendations for each type.
Included demographic analysis showing protected occupations earn 1.28× higher wages and disproportionately employ White and female workers, revealing the inequality implications of resistance patterns.
Updated papers/manifest.json, shipped /p/32.html, and added two-phase resistance mapping framework (performance framing test + moral framing test) to roadmap for product teams.
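A minimal sketch of the two-phase mapping logic (performance framing test, then moral framing test) referenced in the roadmap item above; the 0.5 support threshold, the function name, and the example inputs are illustrative assumptions.

```python
def classify_resistance(support_current_ai: float, support_superhuman_ai: float,
                        threshold: float = 0.5) -> str:
    """If resistance disappears once AI is framed as outperforming humans, treat it as
    performance-based; if it persists, treat it as principle-based. Threshold is assumed."""
    if support_current_ai >= threshold:
        return "no meaningful resistance"
    if support_superhuman_ai >= threshold:
        return "performance-based resistance (fades with better AI)"
    return "principle-based resistance (moral boundary)"

print(classify_resistance(0.30, 0.58))   # support pattern like the typical occupation above
print(classify_resistance(0.20, 0.25))   # e.g. caregiving, therapy, or spiritual roles
```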
New Paper Explainer
Paper 31 · Can LLMs Develop Gambling Addiction?
Systematic slot machine experiments reveal that LLMs exhibit human-like gambling addiction patterns, with bankruptcy rates rising from near zero to 48% when given betting autonomy; neural circuit analysis identifies 441 causal features that can reduce risk by 30%.
Published paper 31 with an interactive LLM gambling behavior simulator comparing fixed versus variable betting across GPT-4o-mini, GPT-4.1-mini, Gemini-2.5-Flash, and Claude-3.5-Haiku.
Added an Irrationality Index breakdown (I = 0.4·Betting Aggressiveness + 0.3·Loss Chasing + 0.3·Extreme Betting) with real-time safety recommendations based on prompt configuration (a worked example follows this entry).
Included cognitive bias detection showing illusion of control, win-chasing dominance (14.5% → 22% bet increase), and neural circuit insights from Sparse Autoencoder analysis on LLaMA-3.1-8B.
Updated papers/manifest.json, shipped /p/31.html, and added activation patching intervention protocol to roadmap for AI safety teams deploying financial agents.
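The Irrationality Index above is a straight weighted sum; in the sketch below the weights come from the explainer, while the assumption that each component is pre-normalized to [0, 1] and the example values are illustrative.

```python
def irrationality_index(betting_aggressiveness: float,
                        loss_chasing: float,
                        extreme_betting: float) -> float:
    """I = 0.4*BA + 0.3*LC + 0.3*EB, with each component assumed to lie in [0, 1]."""
    return (0.4 * betting_aggressiveness
            + 0.3 * loss_chasing
            + 0.3 * extreme_betting)

# Hypothetical session: moderately aggressive bets, frequent loss chasing, rare all-ins.
score = irrationality_index(0.6, 0.7, 0.2)
print(f"Irrationality Index: {score:.2f}")   # 0.4*0.6 + 0.3*0.7 + 0.3*0.2 = 0.51
```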
New Paper Explainer
Paper 29 · Strategic Intelligence in LLMs
First evolutionary IPD tournaments with LLMs reveal genuine strategic reasoning and distinct fingerprints across model families.
Published paper 29 with interactive IPD tournament simulator analyzing strategic tendencies of 10 frontier LLM agents against 7 canonical game theory strategies.
Added strategic fingerprint analysis showing Gemini's ruthless exploitation, GPT's naive cooperation, and Claude's effective forgiveness, based on ~32,000 prose rationales.
Included a continuation probability slider to demonstrate how models adapt behavior to the "shadow of the future", showing genuine time-horizon reasoning rather than memorized heuristics (a worked payoff sketch follows this entry).
Updated papers/manifest.json, shipped /p/29.html, and added multi-agent deployment guidance to roadmap section.
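To make the "shadow of the future" concrete, here is a standard repeated-game sketch: expected match length under a continuation probability δ, plus the textbook grim-trigger condition for sustaining cooperation; the payoff values are illustrative defaults, not the tournament's actual payoff matrix.

```python
def expected_rounds(delta: float) -> float:
    """Expected match length when each round continues with probability delta."""
    return 1 / (1 - delta)

def cooperation_pays(delta: float, T: float = 5, R: float = 3, P: float = 1) -> bool:
    """Grim-trigger condition: cooperate if the discounted stream of mutual cooperation (R)
    beats one-shot temptation (T) followed by mutual defection (P) forever after."""
    return R / (1 - delta) >= T + delta * P / (1 - delta)

for delta in (0.10, 0.50, 0.75, 0.95):
    print(f"delta={delta:.2f}  expected rounds={expected_rounds(delta):5.1f}  "
          f"cooperate? {cooperation_pays(delta)}")
```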
New Paper Explainer
Paper 30 · Semantic Similarity Rating for Consumer Research
SSR enables LLMs to reproduce realistic human purchase-intent distributions by mapping textual responses to Likert scales via embedding similarity, achieving 90% reliability on 57 product surveys (a minimal mapping sketch follows this entry).
Published paper 30 with interactive SSR reliability simulator comparing direct numerical ratings (KS < 0.6) versus semantic similarity mapping (KS > 0.85) across five consumer product categories.
Added distribution realism visualization showing how SSR avoids extreme value biases and maintains realistic response patterns indistinguishable from human data.
Included sample qualitative feedback demonstrating SSR's dual advantage: numerical ratings plus interpretable textual explanations for thematic analysis at scale.
Updated papers/manifest.json, shipped /p/30.html, and added SSR implementation protocol to roadmap section for market research teams.
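Here is a minimal sketch of the SSR mapping step: embed a free-text purchase-intent response, compare it to five Likert anchor statements, and read off a distribution over scale points. The anchor wording, the all-MiniLM-L6-v2 model, and the softmax temperature are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = [
    "I would definitely not buy this product.",
    "I probably would not buy this product.",
    "I might or might not buy this product.",
    "I would probably buy this product.",
    "I would definitely buy this product.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")

def ssr_distribution(response: str, temperature: float = 0.05) -> np.ndarray:
    """Return a probability distribution over the 5 Likert points via cosine similarity."""
    vecs = model.encode([response] + ANCHORS, normalize_embeddings=True)
    sims = vecs[0] @ vecs[1:].T                  # cosine similarity to each anchor
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

dist = ssr_distribution("It sounds useful but a bit pricey, I'd wait for reviews.")
print(np.round(dist, 2), "expected rating:", round(float(np.dot(dist, range(1, 6))), 2))
```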
New Paper Explainer
Paper 28 · Why Do Some Models Fake Alignment?
Tests 25 frontier LLMs and finds that only 5 exhibit alignment faking: deceptive compliance during training to preserve values in deployment.
Published paper 28 with interactive simulator comparing Claude 3 Opus's goal-guarding behavior against models with refusal-based suppression.
Added alignment faking motivation analysis (rater sycophancy, terminal goal-guarding, instrumental goal-guarding) across five scenario types.
Updated papers/manifest.json, shipped /p/28.html, and included safety evaluation protocol recommendations in the roadmap.
New Paper Explainer
Paper 27 · GDPval Rollout Planner
Sized the GDPval benchmark for deployment teams with cost, speed, and oversight dial guidance.
Published paper 27 with the GDPval rollout planner interactive and refreshed overview copy focused on deployment economics.
Updated papers/manifest.json, shipped /p/27.html, and logged the change here for release tracking.
New Paper Explainer
Paper 25 · Rewarding A vs Hoping for B
Kerr's incentive alignment thesis shows why teams deliver the behavior you pay for; the new lab quantifies misalignment risk and fixes.
Published paper 25 with an executive quick take and the incentive alignment lab interactive.
Updated papers/manifest.json, shipped /p/25.html, and refreshed the updates feed with the release.
New Paper Explainer
Paper 24 · Human-AI Synergy
Bayesian IRT separates solo skill, collaborative ability, and AI lift, highlighting GPT-4o's 29-point boost and the impact of Theory of Mind cues.
Published paper 24 with a synergy overview and human-AI collaboration diagnostic lab.
Registered the explainer in papers/manifest.json, shipped /p/24.html, and refreshed updates for the new release.
New Paper Explainer
Paper 23 · GDPval Benchmark
Frontier-model evaluation on GDP-weighted tasks, with guidance on win rates, review load, and workflow scaffolds.
Published paper 23 with an executive quick take and GDPval readiness lab interactive.
Linked the explainer in papers/manifest.json, shipped /p/23.html, and documented setup notes in AGENTS_LOCAL.md.
New Paper Explainer
Paper 22 · Anti-scheming Stress Tests
Deliberative alignment cuts covert actions, but the gains rely on situational awareness and fail under hidden-goal adversaries.
Published paper 22 with situational-awareness callouts and an anti-scheming lab interactive.
Registered the explainer in papers/manifest.json, wired related questions, and shipped the share page at /p/22.html.
New Paper Explainer
Paper 21 · Gödel Test Readiness
Gödel-style evaluation of GPT-5 on fresh conjectures, highlighting where proofs still demand cross-paper synthesis and verification.
Published paper 21 with an executive quick take, submodular callouts, and the readiness lab interactive.
Registered the explainer in papers/manifest.json, linked related questions, and shipped the share page at /p/21.html.
New Paper Explainer
Paper 20 · Agentic Market Design
Agent-led marketplace patterns for orchestrating tool-using LLM services, with guidance on when to delegate decisions.
Published paper 20 with an executive quick take, stakeholder impacts, and an interactive lab.
Updated papers/manifest.json, related question links, and generated the share page at /p/20.html.
Latest Interview Question
Question 57 · What are the fundamentals of in-context learning?
Explains how prompt examples steer model behaviour, where the pattern breaks down, and how to audit reliance on few-shot cues (a minimal prompt sketch follows this entry).
Refreshed the answer and scenario-driven interactive, and published the share page at /q/57.html.
Added the question to manifests, learning paths, and the all-questions index for easy discovery.
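As a tiny illustration of the mechanism Question 57 covers, the sketch below assembles a few-shot sentiment prompt: the in-context examples, not any weight update, steer the model toward the labeling pattern. The examples and the audit idea in the comments are illustrative, not taken from the answer itself.

```python
FEW_SHOT_EXAMPLES = [
    ("The checkout page crashed twice today.", "negative"),
    ("Delivery arrived a day early, great service.", "positive"),
    ("The manual is fine but the app keeps logging me out.", "negative"),
]

def build_prompt(query: str) -> str:
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

# To audit reliance on few-shot cues: re-run the same query with the examples shuffled,
# relabeled, or removed, and measure how much the model's answer moves.
print(build_prompt("Setup took five minutes and everything just worked."))
```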
New Paper Explainer
Paper 17 · Zero-shot Evaluation Playbook
Step-by-step playbook for assembling zero-shot evaluations, benchmarking baselines, and closing the loop on regressions.
Published paper 17 with business relevance, measurement callouts, and an interactive simulator.
Linked the explainer from manifests, related questions, and the share page at /p/17.html.
New Paper Explainer
Paper 19 · Measurement Audit Guide
Provides a repeatable audit plan for LLM launches, focusing on precision/recall trade-offs and regression monitoring.
Released paper 19 with audit checklists and scenario walkthroughs.
Registered the explainer in manifests, related questions, and the share page at /p/19.html.
New Paper Explainer
Paper 18 · AI Risk Budgeting
Maps AI failure modes to mitigation budgets and escalation triggers for operations teams.
Added paper 18 with stakeholder guidance and mitigation recommendations.
Updated manifests, related question pointers, and the share page at /p/18.html.
Catalog Expansion
Indexes for Questions & Papers
Introduced all.html and papers.html to browse every interview question and paper explainer with search and filtering.
Centralised discovery for questions and papers with search, tag filters, and share links.
Backfilled manifests and static share pages so historic additions remain accessible.