LessWrong Curated Podcast

LessWrong

Audio version of the posts shared in the LessWrong Curated newsletter.

  • 27 minutes 19 seconds
    “Catastrophic sabotage as a major threat model for human-level AI systems” by evhub
    Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.

    Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic's views or for that matter anyone's views but my own—it's just a collection of some of my personal thoughts.

    First, some high-level thoughts on what I want to talk about here:

    • I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent.
      • While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]
    ---

    Outline:

    (02:31) Why is catastrophic sabotage a big deal?

    (02:45) Scenario 1: Sabotage alignment research

    (05:01) Necessary capabilities

    (06:37) Scenario 2: Sabotage a critical actor

    (09:12) Necessary capabilities

    (10:51) How do you evaluate a model's capability to do catastrophic sabotage?

    (21:46) What can you do to mitigate the risk of catastrophic sabotage?

    (23:12) Internal usage restrictions

    (25:33) Affirmative safety cases

    ---

    First published:
    October 22nd, 2024

    Source:
    https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human

    ---

    Narrated by TYPE III AUDIO.

    15 November 2024, 12:30 am
  • 22 minutes 11 seconds
    “The Online Sports Gambling Experiment Has Failed” by Zvi
Related: Book Review: On the Edge: The Gamblers

    I have previously been heavily involved in sports betting. That world was very good to me. The times were good, as were the profits. It was a skill game, and a form of positive-sum entertainment, and I was happy to participate and help ensure the sophisticated customer got a high quality product. I knew it wasn’t the most socially valuable enterprise, but I certainly thought it was net positive.

    When sports gambling was legalized in America, I was hopeful it too could prove a net positive force, far superior to the previous obnoxious wave of daily fantasy sports. It brings me no pleasure to conclude that this was not the case. The results are in. Legalized mobile gambling on sports, let alone casino games, has proven to be a huge mistake. The societal impacts are far worse than I expected. Table [...]

    ---

    Outline:

    (01:02) The Short Answer

    (02:01) Paper One: Bankruptcies

    (07:03) Paper Two: Reduced Household Savings

    (08:37) Paper Three: Increased Domestic Violence

    (10:04) The Product as Currently Offered is Terrible

    (12:02) Things Sharp Players Do

    (14:07) People Cannot Handle Gambling on Smartphones

    (15:46) Yay and Also Beware Trivial Inconveniences (a future full post)

    (17:03) How Does This Relate to Elite Hypocrisy?

    (18:32) The Standard Libertarian Counterargument

    (19:42) What About Other Prediction Markets?

    (20:07) What Should Be Done

    The original text contained 3 images which were described by AI.

    ---

    First published:
    November 11th, 2024

    Source:
    https://www.lesswrong.com/posts/tHiB8jLocbPLagYDZ/the-online-sports-gambling-experiment-has-failed

    ---

    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    12 November 2024, 3:45 pm
  • 4 minutes 40 seconds
    “o1 is a bad idea” by abramdemski
    This post comes a bit late with respect to the news cycle, but I argued in a recent interview that o1 is an unfortunate twist on LLM technologies, making them particularly unsafe compared to what we might otherwise have expected:

    The basic argument is that the technology behind o1 doubles down on a reinforcement learning paradigm, which puts us closer to the world where we have to get the value specification exactly right in order to avert catastrophic outcomes.

    RLHF is just barely RL.

    - Andrej Karpathy

    Additionally, this technology takes us further from interpretability. If you ask GPT4 to produce a chain-of-thought (with prompts such as "reason step-by-step to arrive at an answer"), you know that in some sense, the natural-language reasoning you see in the output is how it arrived at the answer.[1] This is not true of systems like o1. The o1 training rewards [...]



    The original text contained 1 footnote which was omitted from this narration.

    ---

    First published:
    November 11th, 2024

    Source:
    https://www.lesswrong.com/posts/BEFbC8sLkur7DGCYB/o1-is-a-bad-idea

    ---

    Narrated by TYPE III AUDIO.

    12 November 2024, 2:30 pm
  • 10 minutes 10 seconds
    “Current safety training techniques do not fully transfer to the agent setting” by Simon Lermen, Govind Pimpale
TL;DR: I'm presenting three recent papers which all share a similar finding: safety training techniques don’t transfer well from chat models to the agents built from them. In other words, models won’t tell you how to do something harmful, but they are often willing to directly execute harmful actions. However, all papers find that attack methods like jailbreaks, prompt-engineering, and refusal-vector ablation do transfer.

    Here are the three papers:

    1. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
    2. Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
    3. Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
    What are language model agents

Language model agents are a combination of a language model and scaffolding software. Regular language models are typically limited to being chatbots, i.e. they receive messages and reply to them. However, scaffolding gives these models access to tools which they can [...]
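
    To make the scaffolding idea above concrete, here is a minimal, hypothetical sketch in Python of an agent loop. The call_model stub, the web_search tool, and the "TOOL:" convention are placeholders invented for this illustration, not an API from any of the three papers; real agent frameworks are considerably more elaborate.

    # Minimal sketch of a language model agent: a plain chat model plus
    # scaffolding that lets it call tools. All names here are hypothetical.

    def call_model(messages):
        # Placeholder for a real chat-model API call; returns the model's next
        # message as plain text. Swap in an actual model client here.
        if len(messages) == 1:
            return "TOOL: web_search: current weather in Berlin"
        return "Done."

    def web_search(query):
        # Placeholder tool that the scaffolding exposes to the model.
        return f"(search results for {query!r})"

    TOOLS = {"web_search": web_search}

    def run_agent(task, max_steps=5):
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(messages)
            messages.append({"role": "assistant", "content": reply})
            if reply.startswith("TOOL:"):
                # Assumed convention: the model requests a tool with a line like
                # "TOOL: <tool name>: <argument>".
                _, name, arg = (part.strip() for part in reply.split(":", 2))
                messages.append({"role": "tool", "content": TOOLS[name](arg)})
            else:
                return reply  # No tool call, so treat this as the final answer.
        return None

    print(run_agent("What is the weather in Berlin?"))

    This distinction is why the papers evaluate agents separately: refusal training targets what the model says in chat, while an agent's outputs are turned into actions by the scaffolding.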

    ---

    Outline:

    (00:55) What are language model agents

    (01:36) Overview

    (03:31) AgentHarm Benchmark

    (05:27) Refusal-Trained LLMs Are Easily Jailbroken as Browser Agents

    (06:47) Applying Refusal-Vector Ablation to Llama 3.1 70B Agents

    (08:23) Discussion

    ---

    First published:
    November 3rd, 2024

    Source:
    https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to

    ---

    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    9 November 2024, 8:30 pm
  • 21 minutes
    “Explore More: A Bag of Tricks to Keep Your Life on the Rails” by Shoshannah Tekofsky
    At least, if you happen to be near me in brain space.

    What advice would you give your younger self?

That was the prompt for a class I taught at PAIR 2024. About a quarter of participants ranked it in their top 3 courses at the camp, and half of them listed it as their favorite.

    I hadn’t expected that.

    I thought my life advice was pretty idiosyncratic. I never heard of anyone living their life like I have. I never encountered this method in all the self-help blogs or feel-better books I consumed back when I needed them.

    But if some people found it helpful, then I should probably write it all down.

    Why Listen to Me Though?

    I think it's generally worth prioritizing the advice of people who have actually achieved the things you care about in life. I can’t tell you if that's me [...]

    ---

    Outline:

    (00:46) Why Listen to Me Though?

    (04:22) Pick a direction instead of a goal

    (12:00) Do what you love but always tie it back

    (17:09) When all else fails, apply random search

    The original text contained 3 images which were described by AI.

    ---

    First published:
    September 28th, 2024

    Source:
    https://www.lesswrong.com/posts/uwmFSaDMprsFkpWet/explore-more-a-bag-of-tricks-to-keep-your-life-on-the-rails

    ---

    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    4 November 2024, 7:15 pm
  • 29 minutes 37 seconds
    “Survival without dignity” by L Rudolf L
    I open my eyes and find myself lying on a bed in a hospital room. I blink.

    "Hello", says a middle-aged man with glasses, sitting on a chair by my bed. "You've been out for quite a long while."

    "Oh no ... is it Friday already? I had that report due -"

    "It's Thursday", the man says.

    "Oh great", I say. "I still have time."

    "Oh, you have all the time in the world", the man says, chuckling. "You were out for 21 years."

    I burst out laughing, but then falter as the man just keeps looking at me. "You mean to tell me" - I stop to let out another laugh - "that it's 2045?"

    "January 26th, 2045", the man says.

    "I'm surprised, honestly, that you still have things like humans and hospitals", I say. "There were so many looming catastrophes in 2024. AI misalignment, all sorts of [...]

    ---

    First published:
    November 4th, 2024

    Source:
    https://www.lesswrong.com/posts/BarHSeciXJqzRuLzw/survival-without-dignity

    ---

    Narrated by TYPE III AUDIO.

    4 November 2024, 4:45 pm
  • 2 minutes 58 seconds
    “The Median Researcher Problem” by johnswentworth
    Claim: memeticity in a scientific field is mostly determined, not by the most competent researchers in the field, but instead by roughly-median researchers. We’ll call this the “median researcher problem”.

    Prototypical example: imagine a scientific field in which the large majority of practitioners have a very poor understanding of statistics, p-hacking, etc. Then lots of work in that field will be highly memetic despite trash statistics, blatant p-hacking, etc. Sure, the most competent people in the field may recognize the problems, but the median researchers don’t, and in aggregate it's mostly the median researchers who spread the memes.

    (Defending that claim isn’t really the main focus of this post, but a couple pieces of legible evidence which are weakly in favor:

    • People did in fact try to sound the alarm about poor statistical practices well before the replication crisis, and yet practices did not change, so clearly at least [...]
    ---

    First published:
    November 2nd, 2024

    Source:
    https://www.lesswrong.com/posts/vZcXAc6txvJDanQ4F/the-median-researcher-problem-1

    ---

    Narrated by TYPE III AUDIO.

    4 November 2024, 12:30 pm
  • 4 minutes 18 seconds
    “The Compendium, A full argument about extinction risk from AGI” by adamShimi, Gabriel Alfour, Connor Leahy, Chris Scammell, Andrea_Miotti
This is a link post. We (Connor Leahy, Gabriel Alfour, Chris Scammell, Andrea Miotti, Adam Shimi) have just published The Compendium, which brings together in a single place the most important arguments that drive our models of the AGI race, and what we need to do to avoid catastrophe.

    We felt that something like this has been missing from the AI conversation. Most of these points have been shared before, but a “comprehensive worldview” doc has been missing. We’ve tried our best to fill this gap, and welcome feedback and debate about the arguments. The Compendium is a living document, and we’ll keep updating it as we learn more and change our minds.

    We would appreciate your feedback, whether or not you agree with us:

    • If you do agree with us, please point out where you think the arguments can be made stronger, and contact us if there are [...]
    ---

    First published:
    October 31st, 2024

    Source:
    https://www.lesswrong.com/posts/prm7jJMZzToZ4QxoK/the-compendium-a-full-argument-about-extinction-risk-from

    ---

    Narrated by TYPE III AUDIO.

    1 November 2024, 1:30 pm
  • 11 minutes 1 second
    “What TMS is like” by Sable
There are two nuclear options for treating depression: ketamine and TMS. This post is about the latter.

TMS stands for Transcranial Magnetic Stimulation. Basically, it fixes depression via magnets, which is about the second or third most magical thing that magnets can do.

    I don’t know a whole lot about the neuroscience - this post isn’t about the how or the why. It's from the perspective of a patient, and it's about the what.

    What is it like to get TMS?

    TMS

    The Gatekeeping

    For Reasons™, doctors like to gatekeep access to treatments, and TMS is no different. To be eligible, you generally have to have tried multiple antidepressants for several years and had them not work or stop working. Keep in mind that, while safe, most antidepressants involve altering your brain chemistry and do have side effects.

    Since TMS is non-invasive, doesn’t involve any drugs, and has basically [...]

    ---

    Outline:

    (00:35) TMS

    (00:38) The Gatekeeping

    (01:49) Motor Threshold Test

    (04:08) The Treatment

    (04:15) The Schedule

    (05:20) The Experience

    (07:03) The Sensation

    (08:21) Results

    (09:06) Conclusion

    The original text contained 2 images which were described by AI.

    ---

    First published:
    October 31st, 2024

    Source:
    https://www.lesswrong.com/posts/g3iKYS8wDapxS757x/what-tms-is-like

    ---

    Narrated by TYPE III AUDIO.

    ---

    Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    31 October 2024, 10:45 pm
  • 28 minutes 38 seconds
    “The hostile telepaths problem” by Valentine
    Epistemic status: model-building based on observation, with a few successful unusual predictions. Anecdotal evidence has so far been consistent with the model. This puts it at risk of seeming more compelling than the evidence justifies just yet. Caveat emptor.

    Imagine you're a very young child. Around, say, three years old.

    You've just done something that really upsets your mother. Maybe you were playing and knocked her glasses off the table and they broke.

    Of course you find her reaction uncomfortable. Maybe scary. You're too young to have detailed metacognitive thoughts, but if you could reflect on why you're scared, you wouldn't be confused: you're scared of how she'll react.

    She tells you to say you're sorry.

    You utter the magic words, hoping that will placate her.

    And she narrows her eyes in suspicion.

    "You sure don't look sorry. Say it and mean it."

    Now you have a serious problem. [...]

    ---

    Outline:

    (02:16) Newcomblike self-deception

    (06:10) Sketch of a real-world version

    (08:43) Possible examples in real life

    (12:17) Other solutions to the problem

    (12:38) Having power

    (14:45) Occlumency

    (16:48) Solution space is maybe vast

    (17:40) Ending the need for self-deception

    (18:21) Welcome self-deception

    (19:52) Look away when directed to

    (22:59) Hypothesize without checking

    (25:50) Does this solve self-deception?

    (27:21) Summary

    The original text contained 7 footnotes which were omitted from this narration.

    ---

    First published:
    October 27th, 2024

    Source:
    https://www.lesswrong.com/posts/5FAnfAStc7birapMx/the-hostile-telepaths-problem

    ---

    Narrated by TYPE III AUDIO.

    28 October 2024, 7:30 am
  • 11 minutes 5 seconds
    “A bird’s eye view of ARC’s research” by Jacob_Hilton
    This post includes a "flattened version" of an interactive diagram that cannot be displayed on this site. I recommend reading the original version of the post with the interactive diagram, which can be found here.

    Over the last few months, ARC has released a number of pieces of research. While some of these can be independently motivated, there is also a more unified research vision behind them. The purpose of this post is to try to convey some of that vision and how our individual pieces of research fit into it.

    Thanks to Ryan Greenblatt, Victor Lecomte, Eric Neyman, Jeff Wu and Mark Xu for helpful comments.

    A bird's eye view

    To begin, we will take a "bird's eye" view of ARC's research.[1] As we "zoom in", more nodes will become visible and we will explain the new nodes.

    An interactive version of the [...]

    ---

    Outline:

(00:43) A bird's eye view

    (01:00) Zoom level 1

    (02:18) Zoom level 2

    (03:44) Zoom level 3

    (04:56) Zoom level 4

(07:14) How ARC's research fits into this picture

    (07:43) Further subproblems

    (10:23) Conclusion

    The original text contained 2 footnotes which were omitted from this narration.

    The original text contained 3 images which were described by AI.

    ---

    First published:
    October 23rd, 2024

    Source:
    https://www.lesswrong.com/posts/ztokaf9harKTmRcn4/a-bird-s-eye-view-of-arc-s-research

    ---

    Narrated by TYPE III AUDIO.

    27 October 2024, 5:30 pm