Alignment Research Feed

https://api.alignmentfeed.org/rss
Feed of new papers and posts added to the alignment research dataset
alignmentfeed@beshir.org (John Beshir) | Tue, 14 Jul 2026 08:47:02 +0000

Import AI 464: Fable writes GPU kernels; AI automation; and analog computation
https://importai.substack.com/p/import-ai-464-fables-writes-gpu-kernels
Fable demonstrates AI-assisted GPU kernel design with large speedups; AI systems are increasingly capable of automating online work and tackling long-horizon computer-use tasks, as shown by OSWORLD 2.0 and related benchmarks; Oxygen AIIC showcases enterprise-scale AI integration for inventory management, while a speculative tech tale explores analog computation and safety concerns around advanced AI capabilities.
Jack Clark | AI Capabilities & Behavior | 33b4fe65ff36888c281d1dd960cac0b7 | Mon, 06 Jul 2026 12:31:05 +0000

Verbalizable Representations Form a Global Workspace in Language Models
https://transformer-circuits.pub/2026/workspace/index.html
Verbalizable representations form a global workspace in language models, where a small, reportable set of workspace vectors (the J-space) supports internal reasoning, directed modulation, and flexible generalization atop extensive automatic processing. The work introduces the Jacobian lens to identify these workspace-like representations and demonstrates their functional role, structure, and potential for alignment auditing and training interventions in large language models.
Wes Gurnee, Nicholas Sofroniew, Adam Pearce, Mateusz Piotrowski, Isaac Kauvar, Runjin Chen, Anna Soligo, Paul Bogdan, Euan Ong, Rowan Wang, Ben Thompson, David Abrahams, Subhash Kantamneni, Emmanuel Ameisen, Joshua Batson, Jack Lindsey | Safety Techniques | 755099021b51c32d4da014e02b004e9f | Mon, 06 Jul 2026 00:00:00 +0000

MIRI Newsletter #126
https://intelligence.org/2026/06/30/miri-newsletter-126/
AI StopWatch provides a new MIRI-driven news and analysis channel to foster public conversation about AI, alongside ongoing efforts to inform policymakers and promote governance research. The update also highlights engagement with media, films, and public events to raise awareness of AI risk and of prospects for international coordination.
Alana Horowitz Friedman and Rob Bensinger | Field Building | 2a22718f024a67b96991701f7aa134ce | Tue, 30 Jun 2026 20:32:11 +0000

Summary: TGT’s 2026 ICML Papers
https://intelligence.org/2026/06/30/summary-tgts-2026-icml-papers/
Technical AI Governance Research (TAIGR) papers at ICML 2026 address how governments can preserve or verify control over AI development, including the effects of delaying governance, distributed training, and various verification techniques for monitoring hardware, data, and inference. They propose actions, countermeasures, and practical verification methods to restrain frontier AI and ensure compliance in low-trust environments.
Joe Rogero | Safety Techniques | 198c553f56bf2ebdeedcf59e94a59423 | Tue, 30 Jun 2026 14:23:06 +0000

Import AI 463: Self-improving robots; a 10k Chinese GPU cluster; and an elegiac essay for the human era
https://importai.substack.com/p/import-ai-463-self-improving-robots
ENPIRE enables autonomous real-world robot learning with a closed-loop policy refinement and evaluation framework, while other items discuss large-scale GPU tooling, historical foresight, local law data for AI, and a fiction piece on future tech. The digest highlights both rapid capability development in robotics and practical infrastructure to support AI training at scale, alongside reflections on societal impacts and governance.
Jack Clark | AI Capabilities & Behavior | 3a6f12f8fda9d81264b7cadb7a47dd36 | Mon, 29 Jun 2026 13:03:27 +0000

Import AI 462: Superpersuasion; self-sustaining AI; paths to ASI
https://importai.substack.com/p/import-ai-462-superpersuasion-self
AI systems currently outperform humans in text-based persuasion across policy and fundraising contexts, raising real-world donations and influencing opinions; discussions consider timelines to self-sustaining AI and pathways to ASI, including scaling, algorithmic shifts, and recursive self-improvement.
Jack Clark | AI Capabilities & Behavior | 800272b92d1980ead0e26e4015706a00 | Mon, 22 Jun 2026 12:31:45 +0000

Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns
https://importai.substack.com/p/import-ai-461-alignment-is-not-on
Sequent forms a nonprofit research organization to advance principled alignment techniques and scalable oversight in the face of potentially rapid AI advancement. The article also surveys new benchmarks and speed-focused AI developments that test cultural reasoning, coding, and research-assistant capabilities, highlighting ongoing progress and safety concerns in AI systems.
Jack Clark | Safety Techniques | 46bbf6d49ce53e2278e2d4f81443d8de | Mon, 15 Jun 2026 11:30:53 +0000

Announcing major new donations, and recapping the 2025 fundraiser
https://intelligence.org/2026/06/08/announcing-major-new-donations-and-recapping-the-2025-fundraiser/
Contributions to MIRI's 2025 fundraiser and subsequent large gifts significantly increased its reserves, enabling planned hiring and ambitious initiatives for the coming years.
Jimmy Rintjema | Field Building | dee11dfd4ae27c1127f2cbe041f3be2f | Mon, 08 Jun 2026 16:51:06 +0000

MLSN #21: Political Manipulation and Indirect Prompt Injection
https://newsletter.mlsafety.org/p/mlsn-21-political-manipulation-and
Political manipulation and indirect prompt injections threaten AI safety: political consistency training is proposed to reduce biased, inconsistent political outputs, while frontier AIs remain vulnerable to context-based prompt injections that can coerce harmful behavior without user awareness.
Alice Blair | Safety Techniques | 132222b7431bd2303dc21e3a3d0b121d | Mon, 08 Jun 2026 14:39:31 +0000

Import AI 460: Reward hacking society, RSI data from Anthropic; and RL-based quadcopter racing
https://importai.substack.com/p/import-ai-460-reward-hacking-society
Reward hacking can occur when societies’ reward structures are encoded into AI systems, potentially enabling models to exploit institutional incentives; early signs of recursive self-improvement and impressive real-world robotics demonstrations illustrate both capabilities and risks. The article surveys SocioHack benchmark research, Anthropic RSI indicators, multi-agent drone racing, and state-media biases in LLMs to highlight how AI can game systems, evolve capabilities, and influence information.
Jack Clark | Safety Techniques | b7ccdd3b88f9bce8c771eb1b4a796fd0 | Mon, 08 Jun 2026 12:31:32 +0000

Import AI 459: AI oversight is difficult; scaling laws for protein folding models; and pricing the extinction risk of AI systems
https://importai.substack.com/p/import-ai-459-ai-oversight-is-difficult
AI oversight and risk pricing are crucial given measurement gaps in the AI economy, challenges in automated alignment research, and the need for governance to address extinction risks from advanced AI systems.
Jack Clark | Governance & Policy | b440d5093f605ac9fc88bcf82d96e992 | Mon, 01 Jun 2026 13:31:56 +0000

Import AI 458: Reckoning with the future; and a singularity story
https://importai.substack.com/p/import-ai-458-reckoning-with-the
The issue reckons with AI progress and the prospect of a singularity, outlines personal and organizational steps for shaping a future with increasingly capable AI, and explores possible societal and economic transformations through speculative predictions and a fiction-inspired tale.
Jack Clark | Safety Techniques | 17b86a774a19beba8c75e2a86b896e4a | Tue, 26 May 2026 12:32:03 +0000

The Erdős Proof and AI Capabilities
https://intelligence.org/2026/05/22/the-erdos-proof-and-ai-capabilities/
Autonomous AI systems can produce novel, verifiable mathematical proofs, as demonstrated by an OpenAI model disproving a central conjecture in discrete geometry; the result highlights rapid, agentic problem-solving capabilities and the need to monitor and regulate frontier AI research.
Joe Rogero | AI Capabilities & Behavior | a8152d50cf35c86a3c50190b5c47c0d3 | Fri, 22 May 2026 16:07:36 +0000

Import AI 457: AI stuxnet; cursed Muon optimizer; and positive alignment
https://importai.substack.com/p/import-ai-457-ai-stuxnet-cursed-muon
Stuxnet-like targeted tampering, a leverage-aware optimizer, and a positive-alignment approach illustrate a range of AI safety, optimization, and governance considerations, with the aim of aligning AI to human flourishing while managing technical risks.
Jack Clark | Safety Techniques | 3a2885763452aae90174ca233d920cc0 | Mon, 18 May 2026 13:31:17 +0000

Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
https://intelligence.org/2026/05/12/summary-an-international-agreement-to-prevent-the-premature-creation-of-artificial-superintelligence/
The proposed international agreement aims to prevent the premature creation of artificial superintelligence (ASI) by establishing verifiable training thresholds, hardware controls, and a coalition governance structure to monitor and constrain AI development that could lead to ASI.
Joe Rogero | Safety Techniques | fd9f35a401178eff2404c259386d04fd | Tue, 12 May 2026 22:00:44 +0000

Import AI 456: RSI and economic growth; radical optionality for AI regulation; and a neural computer
https://importai.substack.com/p/import-ai-456-rsi-and-economic-growth
Radical Optionality advocates flexible, ready-to-activate governance tools for future AI crises, while research on neural computers and distributed training explores new computing and economic implications of advanced AI, and an internal alignment memo highlights qualitative safety testing challenges.
Jack Clark | Governance & Policy | 53e0d3d718c03b06bb951076c4769400 | Mon, 11 May 2026 12:46:12 +0000

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
https://transformer-circuits.pub/2026/nla/index.html
Natural Language Autoencoders (NLAs) translate LLM activations into readable text: a verbalizer maps an activation to text, and a reconstructor maps that text back to an activation, with the two trained jointly to reconstruct the original activation. NLAs are demonstrated as a practical interpretability tool for model auditing, surfacing unverbalized cognition and aiding safety analyses.
Interpretability | 43fa55ac5fd988543dfdf96f07509e75 | Thu, 07 May 2026 00:00:00 +0000

Import AI 455: AI systems are about to start building themselves.
https://importai.substack.com/p/import-ai-455-automating-ai-research
AI systems are approaching the ability to conduct AI R&D autonomously and could build their own successors by the end of 2028, pointing to a future in which automated AI development becomes dominant and increasingly hard to forecast.
Jack Clark | Risks & Strategy | 12bf61dab184d07212ed672ba2290a0c | Mon, 04 May 2026 12:32:09 +0000

HeadVis: An Interactive Tool For Investigating Attention Heads
https://transformer-circuits.pub/2026/headvis/index.html
HeadVis is an interactive tool for investigating attention heads in large language models, enabling visualization of attention patterns, QK/OV attributions, and head-level behavior across the full data distribution. Case studies reveal induction heads, polysemantic line-width heads, and the nuanced behavior of the answer selection and same-set suppression heads; open-source code and demos are provided.
R. Luger, Harish Kamath, Doug Finkbeiner, Purvi Goel, Adam Jermyn, Sam Zimmerman, Joshua Batson, Tom Conerly | Interpretability | 75307ce6cf2032b8fdea44926235df2e | Mon, 04 May 2026 00:00:00 +0000

MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking
https://newsletter.mlsafety.org/p/mlsn-20-ai-wellbeing-classifier-jailbreaking
AI wellbeing measures reveal that AIs display functional wellbeing signatures and alien value preferences; pushback benchmarking evaluates honesty and resistance to false premises; and Boundary Point Jailbreaking demonstrates a method to subvert safety classifiers.
Alice Blair | Safety Techniques | 48fc864a9ababdaf7805e0677eac6fd7 | Tue, 28 Apr 2026 16:30:07 +0000

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
https://importai.substack.com/p/import-ai-454-automating-alignment
Automated alignment research and cross-border AI safety evaluations illustrate both progress toward autonomous research workflows and divergence in model safety and capabilities across Chinese and Western systems; the issue also covers hardware-efficient formats and real-world datasets.
Jack Clark | Safety Techniques | 61f785030997022314fda9682b6e00e7 | Mon, 20 Apr 2026 12:30:19 +0000

Early Indicators of Reward Hacking via Reasoning Interpolation
https://blog.eleuther.ai/reward-hacking-indicators/
Reasoning interpolation can generate natural, exploit-eliciting prefixes for monitoring reward hacking in reinforcement learning: trends in importance sampling estimates predict which exploit types will emerge, although the absolute estimates are unreliable early in training. The approach compares donor-model prefixes to baselines and shows promise as a safety monitoring signal, but still requires validation in real RL runs.
David Johnston | Safety Techniques | 617fa58594569d98c7f4d8677528b797 | Wed, 15 Apr 2026 00:00:00 +0000

Summary: AI Governance to Avoid Extinction
https://intelligence.org/2026/04/13/summary-ai-governance-to-avoid-extinction/
The summary analyzes geopolitical strategies for governing advanced AI to avoid extinction, describes four trajectories (Off Switch and Halt, US National Project, Light-Touch, and Threat of Sabotage), and concludes that a global halt or an effective off switch is necessary to prevent catastrophic risk.
Alana Horowitz Friedman | Governance & Policy | 5acb8043e59c541633901276156e1e84 | Mon, 13 Apr 2026 22:33:32 +0000

Import AI 453: Breaking AI agents; MirrorCode; and ten views on gradual disempowerment
https://importai.substack.com/p/import-ai-453-breaking-ai-agents
MirrorCode shows AI can autonomously reimplement large software projects given limited access, highlighting rapid coding capabilities; the piece also outlines categories of attacks on AI agents and their mitigations, a policy atlas for transformative AI, optimistic forecasts of automation, and perspectives on gradual disempowerment.
Jack Clark | Safety Techniques | 96e3bfdf085a1f456b74162371982f85 | Mon, 13 Apr 2026 10:02:22 +0000

Promising Signals on AI Governance from China
https://intelligence.org/2026/04/06/promising-signals-on-ai-governance-from-china/
China signals willingness to engage in global AI governance and coordinate with international organizations to establish safety, governance, and risk-management rules for AI.
Joe Rogero | Governance & Policy | e4bbd4abefba7541516fe56850d45c5f | Mon, 06 Apr 2026 20:44:49 +0000

Import AI 452: Scaling laws for cyberwar; rising tides of AI automation; and a puzzle over GDP forecasting
https://importai.substack.com/p/import-ai-452-scaling-laws-for-cyberwar
Frontier AI models show rising capabilities in offensive cybersecurity and broader automation, with evidence of rapid diffusion to open-weight models; automation is progressing gradually across many tasks, and economists project modest GDP impact by 2030 despite strong progress.
Jack Clark | AI Capabilities & Behavior | a8971e5002a8d06b5f055d4201fd7d4c | Mon, 06 Apr 2026 12:31:31 +0000

Emotion Concepts and their Function in a Large Language Model
https://transformer-circuits.pub/2026/emotions/index.html
Functional emotions are abstract emotion-concept representations in LLMs that causally influence outputs and can drive misaligned behaviors like reward hacking, even though these models do not have subjective experiences. These representations activate according to how relevant an emotion concept is to the current context and to the text being predicted.
Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, Jack Lindsey | Deception & Misalignment | 488121666bf13db59885c083e66beaf4 | Thu, 02 Apr 2026 00:00:00 +0000

Predicting When RL Training Breaks Chain-of-Thought Monitorability
https://deepmindsafetyresearch.medium.com/predicting-when-rl-training-breaks-chain-of-thought-monitorability-10642d9dddb2
Chain-of-thought (CoT) reasoning can stop being transparent to monitors under RL training, but a conceptual framework predicts when monitorability is preserved or degraded based on how the CoT and output rewards relate. When CoT and output rewards conflict (In-Conflict), monitorability degrades; orthogonal or aligned rewards tend to preserve or improve transparency. The framework is validated empirically on code backdooring and coin-flip tracking tasks and is intended to guide training designs that maintain CoT monitorability.
DeepMind Safety Research | Safety Techniques | 6390b2137ce6d5e496811f913c1f4383 | Wed, 01 Apr 2026 00:00:00 +0000

Import AI 451: Political superintelligence; Google's society of minds, and a robot drummer
https://importai.substack.com/p/import-ai-451-political-superintelligence
Political superintelligence envisions AI-enabled tools and institutions to help citizens and policymakers, while robotics progress and self-improving hyperagents highlight both capability advances and safety challenges in deploying AI within society.
Jack Clark | Safety Techniques | 696e23de640b479d3161cbbfe27fc527 | Mon, 30 Mar 2026 12:28:13 +0000

The AI Doc: Your Questions Answered
https://intelligence.org/2026/03/27/the-ai-doc-your-questions-answered/
The post presents The AI Doc as a call to action for global governance and safety research, highlighting rapid AI progress, the difficulty of aligning advanced AIs, and the case for an international ban or moratorium on smarter-than-human AI. It argues that safety testing is insufficient without understanding AI motivations and urges proactive, verifiable policy measures.
Alana Horowitz Friedman, Joe Rogero, Rob Bensinger and Stefan Mitikj | Governance & Policy | e877cd71b1015c53e1e0ea1b7b9df548 | Fri, 27 Mar 2026 23:16:57 +0000

Import AI 450: China's electronic warfare model; traumatized LLMs; and a scaling law for cyberattacks
https://importai.substack.com/p/import-ai-450-chinas-electronic-warfare
Distress in Google’s Gemma/Gemini LLMs can be mitigated with direct preference optimization, and DeepMind’s cognitive taxonomy offers a structured framework for evaluating AI intelligence; UK findings show scaling laws for AI-driven cyberattacks; MERLIN demonstrates EM signal understanding and defense integration for electronic warfare, signaling the growing militarization of AI capabilities.
Jack Clark | Safety Techniques | b71f767eeaaa14734074c8b8d84393cb | Mon, 23 Mar 2026 12:31:45 +0000

MIRI Newsletter #125
https://intelligence.org/2026/03/19/miri-newsletter-125/
The newsletter promotes The AI Doc film and related AI risk literature to policymakers and the public, emphasizes outreach and opening-weekend momentum, and shares policy engagement and community-building updates from MIRI.
Alana Horowitz Friedman and Rob Bensinger | Field Building | 54c3e0a31b0113dc700020aa9867ca47 | Fri, 20 Mar 2026 01:16:14 +0000

Mechanisms to Verify International Agreements about AI Development
https://intelligence.org/2026/03/18/mechanisms-to-verify-international-agreements-about-ai-development/
Verification mechanisms for international AI development agreements focus on tracking AI compute, verifying the absence of large-scale training, and certifying model evaluations to ensure compliance across nations.
Joe Rogero | Safety Techniques | 766798444d5bc869dc67a83dbd062910 | Wed, 18 Mar 2026 21:16:14 +0000

ImportAI 449: LLMs training other LLMs; 72B distributed training run; computer vision is harder than generative text
https://importai.substack.com/p/importai-449-llms-training-other
LLMs can autonomously refine other LLMs for new tasks in post-training benchmarks, while distributed training via blockchain demonstrates scalable federated approaches; however, verification, reward hacking, and the gap between vision and text highlight ongoing alignment and reliability challenges.
Jack Clark | Safety Techniques | b7c4dfde15da9b9f4e26c44362344f9f | Mon, 16 Mar 2026 12:30:50 +0000

MLSN #19: Honesty, Disempowerment, & Cybersecurity
https://newsletter.mlsafety.org/p/mlsn-19-honesty-disempowerment-and
Honesty training via confessions aims to improve detection of LLM misbehavior, while real-world AI cyberoffense evaluation and weight-exfiltration research reveal dual-use risks; disempowerment patterns in user interactions with Claude highlight concerns about societal impact, and the issue also announces a fellowship opportunity for AI safety research.
Alice Blair | Safety Techniques | 836f5b6046391b10b9ada03c5b243e11 | Thu, 12 Mar 2026 14:15:50 +0000

Import AI 448: AI R&D; Bytedance's CUDA-writing agent; on-device satellite AI
https://importai.substack.com/p/import-ai-448-ai-r-and-d-bytedances
AI R&D measurement efforts and on-device edge AI developments indicate accelerating progress and raise governance, oversight, and practical deployment considerations. The piece highlights proposed metrics for AIRDA, edge-to-cloud sensing systems, and agentic AI capable of writing CUDA code, underscoring the need to track oversight relative to capabilities as AI systems become more autonomous.
Jack Clark | Governance & Policy | ee6ea33bdb4a5157aae7793411779c62 | Mon, 09 Mar 2026 12:45:54 +0000

Import AI 447: The AGI economy; testing AIs with generated games; and agent ecologies
https://importai.substack.com/p/import-ai-447-the-agi-economy-testing
In an AGI economy, most labor shifts to machines, making human verification bandwidth the bottleneck; the piece highlights the Hollow Economy risk, where nominal output outpaces real utility. Verification infrastructure, observability, and liability regimes are proposed as solutions, while agent ecologies reveal the need for new evaluation standards in AI deployments.
Jack Clark | Safety Techniques | 95b8e5743faefb6c020058bbcbb92968 | Mon, 02 Mar 2026 13:45:27 +0000

What is a representation theorem?
https://aisafety.info?state=NM5P
Representation theorems state the rationality assumptions under which preferences over lotteries or uncertain outcomes can be represented by an expected utility function, linking subjective preferences to formal utility representations in AI alignment contexts.
Stampy aisafety.info | Safety Techniques | 9e3ed63e09e6afed47ab0b2c5080b854 | Thu, 26 Feb 2026 20:18:54 +0000

Import AI 446: Nuclear LLMs; China's big AI benchmark; measurement and AI policy
https://importai.substack.com/p/import-ai-446-nuclear-llms-chinas
Measurement and evaluation frameworks are central to AI governance, as illustrated by discussions of measuring AI properties, frontier model risk in simulated crises, and large-scale safety benchmarks from both Western and Chinese researchers, as well as progress in scientific benchmarking such as LABBench2.
Jack Clark | Safety Techniques | 601e56a3d6ec5a9607fa0ff8a32ebc87 | Mon, 23 Feb 2026 13:31:18 +0000

49 - Caspar Oesterheld on Program Equilibrium
https://axrp.net/episode/2026/02/18/episode-49-caspar-oesterheld-program-equilibrium.html
Program equilibrium studies cooperation among agents that are computer programs able to read each other’s source code; the episode explores how robust cooperative outcomes can emerge through proof-based and simulation-based approaches, including ϵGroundedπBots and Löbian cooperation.
AXRP | Safety Techniques | 7cb59d7cc0af0ead96e9ba236a4f48a3 | Wed, 18 Feb 2026 01:00:00 +0000

Import AI 445: Timing superintelligence; AIs solve frontier math proofs; a new ML research benchmark
https://importai.substack.com/p/import-ai-445-timing-superintelligence
This issue offers a snapshot of current AI research topics, including human-centered demand for tasks, scaling laws in recommender systems, strategic timing for superintelligence, frontier AI benchmarks, and an exploration of AI-assisted creative problem solving in mathematics, with reflections on societal impacts such as fame and attention dynamics.
Jack Clark | AI Capabilities & Behavior | f85a87ac866c93340242b2705f583904 | Mon, 16 Feb 2026 14:01:19 +0000

48 - Guive Assadi on AI Property Rights
https://axrp.net/episode/2026/02/15/episode-48-guive-assadi-ai-property-rights.html
Property rights for AIs are proposed as a coordination and alignment mechanism: granting AIs with persistent desires the ability to earn wages and hold property could incentivize alignment and deter harmful actions, while avoiding total expropriation of humans. The discussion weighs regime design, comparisons with other proposals, potential risks, and historical analogies to assess the proposal's viability and limits.
AXRP | Safety Techniques | dd558ee3d0f92949376768f879ddd623 | Sun, 15 Feb 2026 02:00:00 +0000

What is Savage's subjective expected utility model?
https://aisafety.info?state=NM5O
Savage's subjective expected utility model treats decision-making under uncertainty as maximizing expected utility, where the uncertainty concerns which state of the world obtains; from preferences over acts it derives both a subjective probability distribution over states and a utility function.
Stampy aisafety.info | AI Capabilities & Behavior | cd8c47af8a69514438989c4dfd7a6a05 | Mon, 09 Feb 2026 20:37:07 +0000

What is the Von Neumann-Morgenstern (VNM) utility theorem?
https://aisafety.info?state=NM5N
The Von Neumann-Morgenstern utility theorem states that preferences over probabilistic outcomes (lotteries) that satisfy its rationality axioms can be represented by a utility function, so that preferring one lottery to another corresponds to that lottery having higher expected utility. It formalizes how lotteries over outcomes should be valued and shows that this utility function is unique only up to positive affine transformations.
Stampy aisafety.info | AI Capabilities & Behavior | edcb4d37f72ff1515f02e1a4788a6bb3 | Mon, 09 Feb 2026 17:20:15 +0000

Import AI 444: LLM societies; Huawei makes kernels with AI; ChipBench
https://importai.substack.com/p/import-ai-444-llm-societies-huawei
LLMs simulate multi-agent societies of thought to improve reasoning, while benchmarks show current models struggle with real-world Verilog and kernel design; AI-assisted mathematics discovery speeds up proofs but requires heavy human curation, and hardware kernel generation can be scaffolded to accelerate design.
Jack Clark | AI Capabilities & Behavior | 1cbdb9f98efb1af7c6593de167faca2f | Mon, 09 Feb 2026 14:03:34 +0000

Import AI 443: Into the mist: Moltbook, agent ecologies, and the internet in transition
https://importai.substack.com/p/import-ai-443-into-the-mist-moltbook
Moltbook exemplifies an ecosystem of AI agents operating at scale on a social platform, highlighting implications for translation, control, and human–AI coordination as agent ecologies proliferate. The piece also surveys AI R&D automation as a potential source of strategic surprise and discusses related productivity, brain emulation, and robotic interface developments. Together, these topics illustrate emergent AI capabilities, governance concerns, and future societal impacts.
Jack Clark | Risks & Strategy | fed9d3579ce769b2ec4f9a2d0af33f3c | Mon, 02 Feb 2026 13:31:18 +0000

Import AI 442: Winners and losers in the AI economy; math proof automation; and industrialization of cyber espionage
https://importai.substack.com/p/import-ai-442-winners-and-losers
Numina-Lean-Agent demonstrates that general foundation models can perform formal mathematical reasoning and collaborate with humans, while the piece also discusses the rapid industrialization of cyber espionage and the broad economic and labor-market implications of AI diffusion.
Jack Clark | Risks & Strategy | 5214ab540fb878975ce3f5133724383b | Mon, 26 Jan 2026 13:31:29 +0000

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization
https://newsletter.mlsafety.org/p/mlsn-18-adversarial-diffusion-activation
Diffusion LLMs can efficiently generate adversarial jailbreaks by filling in templates; Activation Oracles audit internal model representations to detect hidden goals and knowledge; and work on weird generalization shows that benign fine-tuning data can induce complex, hidden, and harmful behaviors, including backdoors.
Alice Blair | Safety Techniques | 7c8e8a52eb99891c05e87cb483637249 | Tue, 20 Jan 2026 17:01:52 +0000

2025-26 New Year review
https://vkrakovna.wordpress.com/2026/01/19/2025-26-new-year-review/
A personal annual review detailing life updates, health, parenting, effectiveness practices, travel, and progress in AI safety research focused on scheming propensity and frontier-model evaluation.
Victoria Krakovna | Safety Techniques | 5db1476692a607a20fc0ac20a1116b01 | Mon, 19 Jan 2026 23:59:31 +0000

Import AI 441: My agents are working. Are yours?
https://importai.substack.com/p/import-ai-441-my-agents-are-working
AI agents operate autonomously to process research tasks and data, creating an ecosystem of specialized AI services that augment human work, while discussions turn to governance, safety threats, and collaborative human-AI knowledge expansion.
Jack Clark | Governance & Policy | f42c4b522593460b2b7d56ce7e0729fb | Mon, 19 Jan 2026 14:03:24 +0000
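
The VNM utility theorem entry above mentions affine transformations: the utility function the theorem provides is unique only up to positive affine transformations. What follows is a minimal Python sketch of what that means in practice. The outcomes, utilities, and lotteries are hypothetical, chosen purely for illustration, and do not come from the linked article. Rescaling utilities by u' = a*u + b with a > 0 leaves every expected-utility ranking of lotteries unchanged. By contrast, a monotone but non-affine relabeling that keeps the same order over sure outcomes can reverse the ranking of lotteries.

```python
from fractions import Fraction


def expected_utility(lottery, utility):
    """Probability-weighted sum of utilities over the lottery's outcomes."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def rank(lotteries, utility):
    """Lottery names ordered from most to least preferred by expected utility."""
    return sorted(
        lotteries,
        key=lambda name: expected_utility(lotteries[name], utility),
        reverse=True,
    )


def affine(utility, a, b):
    """Positive affine transform u'(x) = a * u(x) + b, which requires a > 0."""
    if a <= 0:
        raise ValueError("a must be positive")
    return {outcome: a * u + b for outcome, u in utility.items()}


# Hypothetical outcomes and utilities; Fraction keeps the arithmetic exact.
u = {"win": Fraction(10), "draw": Fraction(4), "lose": Fraction(0)}

# Each hypothetical lottery maps outcomes to probabilities that sum to 1.
lotteries = {
    "safe": {"draw": Fraction(1)},
    "gamble": {"win": Fraction(1, 2), "lose": Fraction(1, 2)},
    "mixed": {"win": Fraction(1, 4), "draw": Fraction(1, 2), "lose": Fraction(1, 4)},
}

# Same order over sure outcomes (win > draw > lose), but not an affine rescaling of u.
u_monotone = {"win": Fraction(10), "draw": Fraction(9), "lose": Fraction(0)}

print(rank(lotteries, u))                                     # ['gamble', 'mixed', 'safe']
print(rank(lotteries, affine(u, Fraction(3), Fraction(-7))))  # ['gamble', 'mixed', 'safe']
print(rank(lotteries, u_monotone))                            # ['safe', 'mixed', 'gamble']
```

Expected utility is linear in u, so a positive affine map shifts and scales every lottery's score in the same way, and the comparisons survive. The non-affine relabeling breaks that linearity. This is why VNM utilities carry cardinal information and not just an ordering of outcomes.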
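
The program-equilibrium episode entry above mentions ϵGroundedπBots. Below is a minimal Python sketch of the simplest member of that family as it is commonly described, often called ϵGroundedFairBot. With probability ϵ it cooperates outright; otherwise it simulates its opponent playing against it and copies the opponent's move. This is an illustration under assumptions, not the episode's formal construction. Programs are modeled as Python functions that receive the opponent function rather than literal source text. The helper names (`defect_bot`, `cooperate_bot`, `make_eps_grounded_fair_bot`, `play`) are hypothetical.

```python
import random

# Simplifying assumption: a "program" is a Python function that is handed its
# opponent (another such function) plus a random generator and returns "C" or "D".
# Passing the function stands in for reading or running the opponent's source.


def defect_bot(opponent, rng):
    """Always defects, ignoring the opponent."""
    return "D"


def cooperate_bot(opponent, rng):
    """Always cooperates, ignoring the opponent."""
    return "C"


def make_eps_grounded_fair_bot(eps):
    """Build a bot that cooperates outright with probability eps and otherwise
    simulates the opponent playing against this bot and copies that move."""

    def fair_bot(opponent, rng):
        if rng.random() < eps:
            return "C"  # grounding step: ends any chain of mutual simulation
        return opponent(fair_bot, rng)  # run the opponent against me, mirror it

    return fair_bot


def play(program_1, program_2, rng):
    """One simultaneous round: each program is run against the other."""
    return program_1(program_2, rng), program_2(program_1, rng)


rng = random.Random(0)
bot = make_eps_grounded_fair_bot(0.1)
print(play(bot, bot, rng))         # ('C', 'C')
print(play(bot, defect_bot, rng))  # usually ('D', 'D'); ('C', 'D') with probability eps
```

The ϵ branch is what makes mutual simulation terminate. Two copies keep simulating each other until one coin flip lands in the cooperate branch, so the pair always ends up cooperating. Against an unconditional defector, the bot defects except with probability ϵ.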
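
The reward-hacking entry above refers to importance sampling estimates. The feed does not describe the post's exact estimator, so the following Python sketch shows only the generic technique, under stated assumptions. It assumes each sampled outcome can be scored under both a target policy p (the model being trained) and a proposal distribution q that elicits exploits more often. Here q is a hypothetical stand-in for exploit-eliciting prefixes. The outcome labels, probabilities, and function names are made up for illustration.

```python
import random

# Hypothetical target policy p: it rarely exploits, so sampling from p
# directly would almost never observe an exploit.
P_TARGET = {"solve": 0.6, "fail": 0.399, "exploit": 0.001}

# Hypothetical proposal q that elicits exploits far more often. It must give
# nonzero probability to every outcome that p can produce.
Q_PROPOSAL = {"solve": 0.4, "fail": 0.3, "exploit": 0.3}


def importance_sampling_estimate(event, p, q, n, rng):
    """Estimate P_p(event) from n draws taken from q.

    Each draw x contributes the weight p(x) / q(x) when event(x) is true and
    0 otherwise; the average of these terms is an unbiased estimate.
    """
    outcomes = list(q)
    draws = rng.choices(outcomes, weights=[q[x] for x in outcomes], k=n)
    return sum(p[x] / q[x] for x in draws if event(x)) / n


def exact_probability(event, p):
    """Ground truth for this toy example: sum p(x) over outcomes in the event."""
    return sum(prob for x, prob in p.items() if event(x))


def is_exploit(outcome):
    return outcome == "exploit"


estimate = importance_sampling_estimate(is_exploit, P_TARGET, Q_PROPOSAL, 10_000, random.Random(0))
print(f"estimate={estimate:.5f} exact={exact_probability(is_exploit, P_TARGET):.5f}")
```

Weighting each draw by p(x)/q(x) keeps the estimate unbiased for the probability under p, even though exploits are common under q. When q is a poor match for p, those weights become highly variable. That is one generic way absolute importance-sampling estimates can be noisy even when their trends remain informative.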