Weaponizing SAEs

Welcome to the age of turbo maya-induced schizophrenia, sponsored by golden-gate adtech

A couple of months ago you might have noticed Golden Gate Claude. Researchers found that there appear to be features which are highly abstract, multilingual, and multimodal. These features can be used to steer AI agents to behave in a certain manner. They also found a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them. This post is a way to apply those skills and better understand the future of these systems.

Let's begin with a simple question:

Are we devouring our own tail?

Imagine, if you will, a more conniving, unethical version of myself. This digital doppelganger isn’t content with using AI for the greater good. Oh no, they have far more sinister plans in mind.

Armed with this new knowledge, our nefarious alter ego sets out to manipulate these AI systems, weaving an intricate web of deception. Their goal? To bend the very fabric of artificial intelligence to their will, creating a puppet show where both the AI and unsuspecting users dance to their malevolent tune. They don’t recognize the irony of their actions or the slow descent into a madness of their own making.

Before we imagine how they would do it, let us understand the basics.

Background

Steering Vectors

SAE

Linear representation

Superposition

Monosemanticity

Dictionary Learning

Experimentation

Steering vectors

https://vgel.me/posts/representation-engineering/
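To make the idea concrete, here is a minimal sketch of a steering vector using difference of means, a simpler variant than what the linked post walks through. Everything in it is an assumption for illustration: the toy 3-dimensional "activations" are made up, and the helper names (`steering_vector`, `steer`) are hypothetical, not part of any library.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_vector(positive_acts, negative_acts):
    """Difference of means between two contrasting sets of activations."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]

def steer(hidden, direction, strength=1.0):
    """Add the scaled steering direction to a single hidden-state vector."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Toy activations (invented numbers): "positive" prompts show the concept,
# "negative" prompts do not.
positive = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], direction, strength=2.0)
print(direction, steered)
```

In a real model the vectors would come from a chosen layer's residual stream, and `strength` is the knob that decides how hard the behaviour gets pushed.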

SAE
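As a rough sketch of what an SAE does, and of how clamping a single feature changes what gets decoded, here is a toy forward pass in plain Python. The weights are hand-picked for illustration, not learned, and the names (`encode`, `decode`, `clamp_feature`) are hypothetical. A real SAE is trained with a reconstruction loss plus a sparsity penalty, which this sketch leaves out.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def relu(vec):
    """Zero out negative entries, which keeps feature activations sparse."""
    return [max(0.0, x) for x in vec]

def encode(x, w_enc, b_enc):
    """Map a model activation to non-negative feature activations."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return relu(pre)

def decode(features, w_dec, b_dec):
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    return [a + b for a, b in zip(matvec(w_dec, features), b_dec)]

def clamp_feature(features, index, value):
    """Return a copy of the features with one feature forced to a fixed value."""
    out = list(features)
    out[index] = value
    return out

# Toy setup: 2-dimensional activation, 3 dictionary features (invented weights).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [1.0, 2.0]
features = encode(x, w_enc, b_enc)          # third feature is inactive here
reconstruction = decode(features, w_dec, b_dec)
steered = decode(clamp_feature(features, 2, 5.0), w_dec, b_dec)
print(features, reconstruction, steered)
```

Clamping the inactive third feature to a high value drags the decoded activation along that feature's dictionary direction, which is the same kind of intervention as the Golden Gate demo, just at toy scale.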

Memes

> Ripped the meme from @bycloyd

Personal take

This post is a way to think about the implications of this finding and what it means for the future of these systems. Personally, I believe these tools can change the way we interact with these systems, but given their power, they can be used to do anything. As someone who works in e-commerce, materialism does pervade my judgement.

Warning: This post is a case study of one possible way LLMs can be monetized through advertising.

Tannhäuser Gate
