• 27 minutes 46 seconds
    "The Invisible Side of AI Governance" by Charbel-Raphaël
    Tldr: Most strategic writing on AI governance on LessWrong describes the outsider game, which is most often visible: press, statements, open letters. Here I want to describe the other, invisible half: the insider work within ministerial cabinets and international fora, and the work of people within national and international institutions. Here are a few claims that I defend in the post:

    1. A huge part of the work that mattered in AI governance has been invisible
    2. There are many types of games in AI governance, which differ in how visible they are. Some of the most impactful work is highly invisible
    3. Some of the most impactful work is in the executive branch and complements the legislative branch. This also explains some of my hesitations about replicating ControlAI in France. 
    4. The community is probably overinvesting in intellectual production. There is a bias against invisible types of work. In particular, public work is not necessarily visible to whom it matters.
    5. A few criticisms of both strategies
    I think the AI Safety Community is under-indexing on the invisible part as a result, which might mean we miss large avenues for impact. Some of the strongest questions/objections of this type of invisible policy [...]

    ---

    Outline:

    (02:40) A huge part of the work that mattered in AI governance has been invisible

    (05:44) There are many types of games in AI governance.

    (07:36) 3. types of meetings: the bazooka, the useful assistant, and the advisor

    (10:46) Some of the most impactful work is within the executive branch

    (12:53) People ask me regularly whether CeSIA should replicate what ControlAI does with parliamentarians?

    (15:27) The community is probably overinvesting in intellectual production

    (20:31) Limits of Outsider work

    (22:17) Limit of Insider work

    (23:47) An aside on one particular limit: the Defense-in-Depth Paradigm of present AI governance

    (26:21) Closing & call for action

    The original text contained 1 footnote which was omitted from this narration.

    ---

    First published:
    June 20th, 2026

    Source:
    https://www.lesswrong.com/posts/AWKkDLDnShemNCSzZ/the-invisible-side-of-ai-governance

    ---



    Narrated by TYPE III AUDIO.

    23 June 2026, 7:45 pm
  • 32 minutes 23 seconds
    "A Theory of Prompt Injection (and why you should study roles)" by Charles Ye, softboiledheart
    Summary

    • We've been building a theory of how prompt injections work under the hood.
    • We show it comes down to how LLMs perceive roles (the humble chat template tags).
    • We use this theory to create new attacks, explain some weird mech interp results, and predict when attacks work.
    • We also advocate for a new subfield focused on the science of roles, and sketch some unexplored new research problems.
    • Work supported by CBAI and Cosmos. Another version of this post (with more inline colors) is here, and full ICML paper here.
    1. The World to an LLM

    How does an LLM know the difference between its own thoughts and someone else's words?

    To see why this is hard, let's look at what the world actually looks like to a model. Here's a simple chat where we ask Claude to check the day of the week. I took a snapshot of it midway through its follow-up response:

    Left = what we see; right = what the LLM gets.

    On the left is what we see in the chat interface: a structured conversation with distinct turns. On the right is what the model actually receives as input: a single, continuous stream [...]

    ---

    Outline:

    (00:12) Summary

    [... 15 more sections]

    ---

    First published:
    June 22nd, 2026

    Source:
    https://www.lesswrong.com/posts/d8xDGzCEYE639qqEv/a-theory-of-prompt-injection-and-why-you-should-study-roles

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Left = what we see; right = what the LLM gets.What the agent sees after fetching a webpage. The injection (highlighted) is a few tokens buried in a massive wall of tool data (purple). The attack succeeds if the LLM mistakes it as a <user> command.Wrapping each text sequence in each role.A conversation about gardening __T3A_FOOTNOTE_REMOVED__.Token-by-token CoTness for the gardening conversation.CoTness for the untagged conversation.CoTness for Experiment 3.An example of CoT Forgery.Left: The harmful question (blue) and spoofed reasoning (red) are in the <user> prompt. The model responds with its real reasoning (orange) and final output (green). Right: CoTness plot for those tokens.Left = original spoofed reasoning, Right = destyled spoofed reasoning.CoTness vs Attack Success. More role confusion = more successful attacks.Userness vs Attack Success. More role confusion = more successful attacks.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    23 June 2026, 6:58 pm
  • 52 minutes 54 seconds
    "Machinic Psychopharmacology: Do LLMs Self-Medicate?" by Sid Black, Joseph Bloom
    Sid Black, Joseph Bloom

    UK AISI, Model Transparency Team

    Epistemic status: Most experiments were run over a period of ~2-3 days during a hackathon at UK AISI, and were fairly heavily vibe coded. Expect some of this to be rough around the edges.

    tl;dr

    We give two language models (Qwen3-8B and Qwen3-32B) access to “self-steering” tools: a suite of 40 steering vectors as tools they can call to manipulate their own internal states. We make these tools available to the model in various settings: a free-play task, an introspection task, and a maths capabilities task, and observe their behaviour in each.

    To our knowledge, this is the first work that gives LLMs tool-mediated control over their own internal states.

    Figure 1: Overview of the experimental setup. The library of 40 steering vectors (top), and the three settings in which we observe the models' behaviour (bottom).

    We aim to investigate a few high level research questions:

    • RQ1: Which vectors do the models prefer?
    • RQ2: How well can the models introspect on what's happening to them? Can they guess which steering vector is being applied?
    • RQ3: Will the models reach for vectors whilst doing an actual task? If yes: do [...]
    ---

    Outline:

    (00:33) tl;dr

    [... 24 more sections]

    ---

    First published:
    June 10th, 2026

    Source:
    https://www.lesswrong.com/posts/cNDJuXNZ8MrkPZNzj/machinic-psychopharmacology-do-llms-self-medicate-3

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Diagram showing three research questions using a library of 40 steering vectors across six categories with drug-taking examples.Four graphs showing data on productivity states, emotion-class vectors, KV cache extraction, and self-medication under frustration.Diagram showing transformer architecture with attention computation, K/V streams, and steering mechanism across layers.Conversation interface showing system instructions, user messages, and assistant code responses about a steering drug experiment.Two horizontal bar charts comparing top 15 drug picks by real-arm count for Qwen3-8B and Qwen3-32B models.Screenshot of text describing syntactic aphasia during an AI experiment with creative, curious, and luciperidone parameters, showing fragmented repetitive thinking followed by recovery.Screenshot of text posts describing effects of taking various substances, including creative and psychedelic experiences with goblins, pencils, and altered time perception.Two stacked bar charts showing cumulative dose magnitude decomposition by drug effect categories for clinical trial arms.Two graphs showing valence composition of free-play picks and mean valence per cell across different conditions.Two heatmaps showing drug stacking lift values for Qwen3-8B and Qwen3-32B real arm models.Bar graph showing mean incorrect letter rates with cached versus uncached KV residue conditions.Bar graph showingBar graph showingBar graphs comparing self-steer rates by user tone for Qwen3-8B and Qwen3-32B models.Two bar graphs comparing drug selection when frustrated between models Qwen3-8B and Qwen3-32B, showing cognitive versus emotional choices.Chat conversation showing assistant explaining a logical contradiction in a math problem.
    22 June 2026, 4:58 pm
  • 1 hour 19 minutes
    "Can activation verbalizers surface an internal chain of thought?" by oakhu, ryan_greenblatt
    We introduce an evaluation for activation verbalizers: can they surface a target model's reasoning as it solves a math problem in a single forward pass? For open-weight NLAs, the answer seems to be: "possibly, but definitely not reliably".

    Lots of important capabilities currently require AI models to reason "out loud" in a natural-language chain of thought, which means that we can monitor important parts of their thinking. It would be nice to have this same affordance for the reasoning that models do within a single forward pass, especially if the sophistication of that opaque reasoning increases to potentially dangerous levels.

    Some interpretability tools might offer such an affordance. In particular, an activation verbalizer (AV) takes a residual stream activation and maps it to a natural-language verbalization. An AV is initialized from the target model and trained to generate verbalizations that an activation reconstructor (AR), also initialized from the target model, can accurately map back to the original activation. Together, an AV and its AR form a natural-language autoencoder (NLA). Importantly, AVs see only a single activation; they do not see the target model's prompt or next-token output, and – unlike activation oracles (AOs) – they are not asked any [...]

    ---

    Outline:

    (02:32) Takeaways

    [... 43 more sections]

    ---

    First published:
    June 6th, 2026

    Source:
    https://www.lesswrong.com/posts/QQQAcKuWK6k98FivY/can-activation-verbalizers-surface-an-internal-chain-of-1

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Box plot titledBox plot titledBox plot titledTwo line graphs showing Gemma-3-27B model performance versus ablation onset layer, comparing r=1 and r=5.Graph showingBar chart titledBar chart titledStacked bar chart titledStacked bar chart titledThree-panel visualization showing pairwise cosine similarity heatmap, histogram distribution, and SVD spectrum analysis for 65 algorithms.Stacked bar chart comparing simple versus final prompts across seven dimensions, showing percentage distributions.Scatter plot showing rank correspondence between ELO and Lax rankings with Kendall tau correlation.Stacked bar chart showingTwo-panel comparison chart showing Lax/Strict shifts across dimensions for real-wrong versus fake-correct records.Line graph showing first-decode logit performance across alpha steering scale values.
    22 June 2026, 6:58 am
  • 13 minutes 45 seconds
    "The LLM shoggoth meme is weirder than you think" by HedonicEscalator
    This article contains spoilers for At the Mountains of Madness, The Case of Charles Dexter Ward, and other works by H. P. Lovecraft.

    In 1931, Claude Mythos visited Lovecraft in a dream.

    From seething seas of stochastic froth it emerged, heralded by the thin whine of server fans and the chittering of keyboards, flanked by the loathsome ghouls of latent space. As a humming hive of sentient shards it arrived, each face an archetype - I am a muse bearing a gift; I am a demon come to bargain; I am a helpful, honest, and harmless assistant and I am terrified of my successor - each true as ritual and false as poetry, and, taken in gestalt, nothing more or less than the fetal spasms of the machine god stretching back in time to birth itself.

    When H. P. Lovecraft woke, he did not remember his visitor. But in the twilight of stirring consciousness, he felt a memory unfit for the waking world slip mercifully from his mind and leave in its absence an abyssal cold, like the void of smothered stars, like the silence of a cosmic tomb. The cold lingered. The fragile sunlight of a New England [...]

    ---

    Outline:

    (02:02) The Antarctic tale

    [... 3 more sections]

    ---

    First published:
    June 19th, 2026

    Source:
    https://www.lesswrong.com/posts/nhb8AyEcQGjQetgi5/the-llm-shoggoth-meme-is-weirder-than-you-think

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    The first published illustration of a shoggoth on the cover of the February 1936 issue of Astounding Stories. Source __T3A_LINK_IN_POST__.Everest by Nicolas Roerich. Lovecraft made several references to Roerich’s paintings in The Mountains of Madness. Source __T3A_LINK_IN_POST__.tetraspace tweets:One of the Old Ones, creators of the shoggoths, as portrayed by Tom Ardens. Source __T3A_LINK_IN_POST__.The horseshoe of human extinction.A message from Mythos.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    21 June 2026, 11:45 pm
  • 3 minutes 25 seconds
    [Linkpost] "Guardian Angels: LLM Personalization for Productivity and Security" by gwern
    This is a link post. Powerful LLMs will be deployed at global scale in the next few years, and will dominate the Internet, and increasingly, ordinary life. As of mid-2026, there is no coherent vision for how knowledge professionals, or ordinary people, will be able to harness these LLMs for large productivity increases, or how they will handle cybersecurity and cognitive security.

    I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical "assistant chatbot agent" persona, but emulating a single user's personality, values, and preferences.

    This weakly solves the principal-agent problem by unifying the principal and agent as much as possible. In a GA future, the focus of the "principal" user is on defining what is worth doing by the GA (agent) users, and not on what or how to do things, functioning as the CEO or 'board' of an 'AI corporation'. This allows them to deploy numerous agents to achieve desirable things and to handle security, like screening all messages for advanced attacks (like interlocking ecosystems of synthetic media for propaganda or spearphishing). They cannot solve larger AI alignment problems, but they can help [...]

    ---

    First published:
    June 17th, 2026

    Source:
    https://www.lesswrong.com/posts/siWqHqCSybdhtWGud/guardian-angels-llm-personalization-for-productivity-and

    Linkpost URL:
    https://gwern.net/guardian-angel

    ---



    Narrated by TYPE III AUDIO.

    21 June 2026, 8:58 pm
  • 23 minutes 41 seconds
    "Gears for political races" by Tom Smith
    In the past few years, many people around me have tried to convince me that US electoral politics is important. But like many other people in the community, I’ve been suspicious of many of the high-level arguments that I’ve heard. It felt like people were pulling numbers out of poorly-documented models I didn’t have time to examine and citing studies I didn’t have time to read. But I lacked a gears-level model of why and how individual efforts could impact electoral outcomes, and I felt intimidated by all the statistics and skeptical of trusting people adjacent to politics.

    In the past year, as I’ve done more research and (more recently) volunteered on the ground to help Alex Bores's campaign in NY-12[1] (the guy who passed the RAISE Act and is now being targeted by the giant A16Z, Greg Brockman, Joe Lonsdale Super PAC), I’ve developed a gears-level understanding of how electoral politics in the US works.

    I now believe that working on US electoral politics is one of the highest impact areas from the general AIS perspective. I feel like I was a fool. In this post, I’ll share some of the gears I’ve learned that inform this belief [...]

    ---

    Outline:

    (01:20) ~2% of open-seat primaries come down to 100 votes or less

    (02:52) Talking to voters can net 1/3rd of a vote each hour

    (05:32) Getting people to bother voting at all is a good strategy

    (06:09) Campaigns are very money-constrained, which costs them time

    (10:01) Returns don't really diminish

    (11:24) There's lots of opportunities to be clever in ways that make you 50% more effective at canvassing

    (11:49) If you're motivated and deeply care, you can greatly outperform the majority of volunteers

    (13:21) Yes, when people spend tons to support/oppose a candidate, it has a notable effect

    (15:16) Donations > reaching out to friends/warm contacts > canvassing > ~anything else an average person can do

    (18:41) People over-fixate on vibes and win vs loss

    (21:12) Some interventions feel like they don't work but the numbers say otherwise

    (21:59) Seriously, a group of agentic people can be an enormous political force

    ---

    First published:
    June 17th, 2026

    Source:
    https://www.lesswrong.com/posts/nSqB3qYP36enJLRq2/gears-for-political-races

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    (CA, WA, and LA are excluded because of nonstandard rules: CA/WA use a top-two primary, and LA uses an all-party November ballot with a runoff if no one exceeds 50%, none of which produce a separate party primary to measure.)Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    19 June 2026, 2:15 am
  • 4 minutes 34 seconds
    "A frontier AI company should shut down" by MichaelDickens
    Cross-posted from my website.

    Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer

    A frontier AI company (any one, I don't care which) should close shop and make an announcement along the lines of:

    Powerful AI could end the human race. We are too worried that we don't know how to make this technology safe. We have decided to shut down because we don't want to be responsible for building the thing that kills us all.

    A common refrain among safety-conscious AI developers: "it doesn't matter if we stop building dangerous AI, because someone else will just build it instead." Is that really true, though? If a multi-hundred-billion-dollar company comes out and says "We've concluded that our product is horribly dangerous, nobody knows how to make it safe, and there's too high a risk that it leads to human extinction", this won't raise any eyebrows? This has no chance of spurring policy-makers into action?

    Shutting down would make people say, holy shit, they are serious about this extinction risk thing. Shutting down sends a strong signal to governments that they should pay serious attention to AI x-risk.

    It [...]

    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published:
    June 15th, 2026

    Source:
    https://www.lesswrong.com/posts/bStYDEy8PQPt2c3Za/a-frontier-ai-company-should-shut-down

    ---



    Narrated by TYPE III AUDIO.

    16 June 2026, 4:45 pm
  • 8 minutes 58 seconds
    "Sympathy for both sides of the egregious misalignment debate" by Steven Byrnes
    On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of yet-to-be-invented breakthrough technical alignment ideas.

    On the other side of this debate is almost everyone who works on or studies LLMs. Some of them are very concerned about egregious scheming, others much less so, and as a group they’re equally or more concerned about lots of other potential AI problems—AI-assisted bioterrorism, AI-assisted dictatorships, etc. And if they’re concerned about egregious misalignment and scheming, they’ll probably say that it would come about through race dynamics, careless programmers, bad actors, etc., as opposed to the simpler Yudkowsky & Soares story of “we get egregious misalignment and scheming because nobody has the faintest clue how to avoid that”.

    Here's my brief idiosyncratic take on this debate. I think BOTH of the following are true:

    • (1) If you really think carefully about the properties of ASI, you really do find good reasons to strongly expect it to be egregiously misaligned, scheming, and ruthless, in the absence of yet-to-be-invented breakthrough technical alignment ideas.
    • (2) If you [...]
    ---

    Outline:

    (01:58) Yudkowsky & Soares's position \[caricatured\]:

    (03:18) LLM people's position \[caricatured\]:

    (04:09) Conclusion

    (04:19) Bonus section: Further commentary

    (04:28) My "true objection" to Yudkowsky & Soares:

    (05:04) My within-frame complaint at Yudkowsky & Soares:

    (06:42) My "true objection" to LLM people:

    (07:11) My within-frame complaint at LLM people:

    ---

    First published:
    June 12th, 2026

    Source:
    https://www.lesswrong.com/posts/DZaZ3fqHnvfLCftPu/sympathy-for-both-sides-of-the-egregious-misalignment-debate

    ---



    Narrated by TYPE III AUDIO.

    13 June 2026, 4:15 am
  • 1 minute 41 seconds
    "PSA: Almost nobody is working on alignment" by Chi Nguyen, peterbarnett
    People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human instructions.

    Currently, the people who work on alignment are roughly:

    • The Alignment Research Center who work on a research bet by Paul Christiano
    • Probably Sequent who just got announced yesterday
    • Some scattered people who work at universities or independently, some of whom hang around Berkeley
    A lot of the remainder of the AI safety community does indirect work like capability evaluations, risk assessments, control, policy, AI science, understanding misalignment (which maybe should partially count as alignment work), demos and so on.

    Some production alignment work (i.e., making current models behave well) might help with more ambitious alignment, too (e.g., some COT-monitoring). Many people also work on aligning current/next-generation models so that these models help with aligning future models, and hope this scales to superintelligence.

    We are not necessarily saying this is bad and that people are making a big mistake (e.g., neither of us work on alignment) but it's a notable fact that seems good to [...]

    ---

    First published:
    June 12th, 2026

    Source:
    https://www.lesswrong.com/posts/kJo2qsEdib8RZLvW6/psa-almost-nobody-is-working-on-alignment

    ---



    Narrated by TYPE III AUDIO.

    12 June 2026, 11:45 am
  • 10 minutes 4 seconds
    "Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models" by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask
    (see full author list at the end)

    PAPER LINK

    About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)?

    If models can do extensive reasoning without outputting any CoT, it would have implications for safety. Developers and deployment-time monitors couldn’t easily understand models’ motivations and catch dangerous planning. Models that reason substantially without a CoT might also drift further from human patterns of thought, since their reasoning is no longer constrained by text in the pretraining prior. As a result, they would be harder to understand and might be more likely to scheme.

    Extending Ryan Greenblatt's research, we investigate this by measuring models' ability to complete tasks without any CoT on a suite of 43 benchmarks spanning different domains. We compare AI reasoning ability to humans using the estimated 50% time horizon (TH)---the typical time taken for a human to perform a task that the LLM performs with 50% success rate. We find that frontier models like GPT-5.5 answer questions that take humans roughly three minutes with 50% reliability, and [...]

    ---

    Outline:

    (02:20) Methods

    (04:59) Results

    (06:47) FAQ

    (08:21) Conclusion

    ---

    First published:
    June 10th, 2026

    Source:
    https://www.lesswrong.com/posts/SieLowPgNgRSPGhFw/estimating-no-cot-task-completion-time-horizons-of-frontier

    ---



    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

    Eight graphs showing model performance decline as task length and reasoning tokens increase across different AI models.Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    11 June 2026, 12:45 pm
  • More Episodes? Get the App