Interp Explorer

MNIST-SAE Log 2

Work Done

Fixed the SAE training and testing code
Analyzed max activations of the MNIST model, the SAE, and the meta-SAE

Confusions

What else can I do besides max activations to track features + latents?
What happens when I look at the decoder instead of the encoder? Can I take something out of the decoder, or is the activation just the output?
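
A rough sketch for the decoder question, assuming a standard SAE with a ReLU encoder and a linear decoder (my actual architecture may differ). Under that assumption the output is a weighted sum of decoder rows, so each row of the decoder weight is a direction that one latent writes into the output. The names `W_enc`, `W_dec`, `b_enc`, `b_dec`, the shapes, and the random weights are all placeholders, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64-dim input to the SAE, 256 latents.
d_in, n_latents = 64, 256

# Stand-ins for trained weights (random here, purely for illustration).
W_enc = rng.normal(size=(d_in, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_in))
b_dec = np.zeros(d_in)

def encode(x):
    # ReLU encoder: these are the latent activations tracked so far.
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z):
    # Linear decoder: the output is a weighted sum of decoder rows.
    return z @ W_dec + b_dec

x = rng.normal(size=(8, d_in))
z = encode(x)
x_hat = decode(z)

# The decoder row for latent k is the direction that latent writes to the output.
k = 4  # arbitrary index for the example
direction_k = W_dec[k]

# Contribution of latent k alone to each reconstruction in the batch.
contribution_k = z[:, [k]] * direction_k  # shape (8, d_in)

# Summing every latent's contribution (plus the bias) recovers the full output.
all_contributions = z[:, :, None] * W_dec[None, :, :]  # (8, n_latents, d_in)
reconstructed = all_contributions.sum(axis=1) + b_dec
```

So in this setup there is something to take out of the decoder: the rows themselves, and the per-latent contribution `activation * row`. The activation alone says how strongly a latent fired, the decoder row says what it adds to the output.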

Observations

The SAE features become sparser as you stack more SAEs
The activation values also become really small (< 1)
In the meta-SAE, neuron 4 activated a lot
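
To put numbers on these observations, here is a minimal sketch of some sparsity statistics. It assumes the activations for one SAE level are collected into a non-negative `(n_samples, n_latents)` array. The function name `sparsity_stats`, the `eps` threshold, and the toy data are made up for illustration.

```python
import numpy as np

def sparsity_stats(acts, eps=1e-8):
    """Summarize sparsity for an (n_samples, n_latents) activation matrix."""
    active = acts > eps  # boolean mask of latents that fired
    return {
        # Average number of active latents per input (L0 count).
        "l0_per_sample": active.sum(axis=1).mean(),
        # Fraction of all entries that are nonzero.
        "frac_active": active.mean(),
        # Mean value over the entries that fired (0.0 if nothing fired).
        "mean_active_value": acts[active].mean() if active.any() else 0.0,
        # How often each latent fires across the dataset.
        "fire_rate_per_latent": active.mean(axis=0),
    }

# Toy activations: 3 samples, 4 latents.
toy_acts = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.2, 0.4, 0.0, 0.0],
    [0.0, 0.3, 0.0, 0.1],
])
stats = sparsity_stats(toy_acts)
most_frequent_latent = int(stats["fire_rate_per_latent"].argmax())
```

Running the same function on the activations of each level (model, SAE, meta-SAE) would give comparable numbers, and `fire_rate_per_latent` would flag a latent that fires on most inputs.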

Next Steps

Analyze the activations to look for:
- Which predictions had the most activations
- What was the spread of predictions for each neuron
- Track the sparsity count
Look at showing-sae-latents-are-not-atomic-using-meta-saes to see how they analyzed meta-SAEs
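
A possible starting point for the activation analysis above, sketched under the assumption that activations are stored as an `(n_samples, n_latents)` array alongside an integer array of the model's predicted digits. The function name `prediction_spread`, the `eps` threshold, and the toy data are hypothetical.

```python
import numpy as np

def prediction_spread(acts, preds, n_classes=10, eps=1e-8):
    """Count, for each latent, how often it fires on each predicted class.

    acts:  (n_samples, n_latents) activations
    preds: (n_samples,) integer predicted labels
    Returns an (n_latents, n_classes) count matrix.
    """
    active = (acts > eps).astype(np.int64)              # (n_samples, n_latents)
    one_hot = np.eye(n_classes, dtype=np.int64)[preds]  # (n_samples, n_classes)
    return active.T @ one_hot

# Toy data: 4 samples, 2 latents.
toy_acts = np.array([
    [0.0, 0.9],
    [0.3, 0.8],
    [0.0, 0.7],
    [0.5, 0.0],
])
toy_preds = np.array([3, 3, 7, 1])

counts = prediction_spread(toy_acts, toy_preds)

# Which predictions had the most activations: total firings per predicted class.
active_per_class = counts.sum(axis=0)
busiest_class = int(active_per_class.argmax())

# Spread of predictions for each latent: number of distinct classes it fires on.
classes_per_latent = (counts > 0).sum(axis=1)
```

Each row of `counts` is the spread of predictions for one latent. A row concentrated on one digit suggests a class-specific latent, while a flat row suggests a latent that fires regardless of the prediction.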

Stream Link: https://youtube.com/live/vBzGeV1ZaTQ