Work Done
- Fixed SAE training + testing code
- Analyzed max activations of mnist, sae, and meta-sae
Confusions
- What else can I do besides max activations to track features + latents?
- What happens when I do the decoder instead of the encoder? Can I take something out of the decoder or is the activation just the output.
Observations
- The sae features become more sparse as you do more SAEs
- The activation values become really small as well <1
- the meta-sae had neuron 4 activate a lot
Next Steps
- Analyze the activations to look for
- Which predictions had the most activations
- What was the spread of predictions for each neuron
- Track the sparsity count
- Look at showing-sae-latents-are-not-atomic-using-meta-saes to see how they analyzed meta saes
Stream Link: https://youtube.com/live/vBzGeV1ZaTQ