Goal: Figure out what features (as in curves, numbers, thoughts, ideas) are inside MNIST. Want to break it apart akin to breaking apart a transformer with an SAE (visualizations and stuff)
Work Done
- analyzed MNIST through generating maximally activating images for each neuron in layer fc2
- looked at SAE reconstructions + maximal activations for SAE and Meta-SAE
- also looked at cosine similarity between them
- and max activating features
- also made a 2d PCA
Confusions
- What do the maximally activating images mean?
- What do the cosine sims + PCA mean
- Should I switch to doing this on gpt2 or a transformer model (maybeAudioLM](https://google-research.github.io/seanet/audiolm/examples/) or whisper)
- How does any of this relate to features?
- Does improving the optimization algorithm significantly change the optimized images generated (does it parse the internals meaningfully better)
Observations
- The 2d Latent space of SAE has a right Angle. Appears there are two directions.
- The cosine similarity is far more sparse on the Meta-SAE than the SAE
- There is only one direction in the latent space of the MEta-SAE
- There are two distinct types of noise images in the pictures === The first is like a barcode while the others are a gray fuzzy mess
Next Steps
- Find out how to save everything to a pdf as reports
- Analyze it? for... seeing what each neuron in the SAE represents
- But how did they run max activaitons through the meta-sae?? they as in: - showing-sae-latents-are-not-atomic-using-meta-saes
- Generate thea ctivation map for fc1 on MNIST
Big Question: How do I look at a neuron in an SAE and tell what feature of the larger model it is representing?
Stream Link: https://youtube.com/live/pX6yByN8AFo