Work Done
- Established rudimentary interp via EMNIST classification labels
- Fixed up some of the experiment code to save the indices
- Got preliminary findings that there is an optimal depth for meta-SAEs (a minimal depth sketch follows this list)
- Found that the average number of activations increases with depth, which implies some level of fine-grainedness
- The max activation counts hit their peak two depths in
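To pin down what "depth" means in the bullets above, here is a minimal sketch of the meta-SAE stack, assuming a plain ReLU SAE with an L1 sparsity penalty and full-batch training. `hidden_acts`, the depth count, the width multiplier, and the firing threshold are all placeholder assumptions, not the actual experiment code:

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Minimal sparse autoencoder: linear encode -> ReLU -> linear decode."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

def train_sae(acts, d_hidden, l1_coeff=1e-3, steps=200, lr=1e-3):
    """Fit an SAE on an (N, d_in) activation tensor: MSE reconstruction + L1 sparsity."""
    sae = SAE(acts.shape[1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, feats = sae(acts)
        loss = torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Stack meta-SAEs: each level trains on the feature activations of the level below.
# `hidden_acts` is assumed to be (N, d) activations from the trained MNIST model.
acts = hidden_acts
mean_active_by_depth = []
for depth in range(4):
    sae = train_sae(acts, d_hidden=2 * acts.shape[1])
    with torch.no_grad():
        _, feats = sae(acts)
    active = (feats > 1e-6).sum(dim=1).float()  # how many features fire per sample
    mean_active_by_depth.append(active.mean().item())
    print(f"depth {depth}: mean active features = {active.mean():.1f}")
    acts = feats.detach()  # the next meta-SAE trains on this level's features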
Confusions
- Am I actually doing what I think I am doing?
- What I am trying to do: take a trained MNIST model, feed it MNIST data, and train an SAE on those activations. Then track the SAE's feature values on EMNIST data (in this case a balanced dataset of letters and numbers) to see what patterns show up outside of digits (pipeline sketch after this list)
- The max activations check whether any neurons fire exclusively on one label, OR whether there are basic trends like letters vs. numbers (see the second sketch below)
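A minimal sketch of that pipeline, assuming a torchvision EMNIST "balanced" split and an MLP-style classifier; `mnist_model`, the `fc1` layer name, and `sae` (from the depth sketch above) are placeholders for the real experiment objects:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed: `mnist_model` is the trained classifier (an MLP here) and `sae` is the
# depth-0 SAE from the sketch above; `fc1` is a placeholder hidden-layer name.
emnist = datasets.EMNIST(root="data", split="balanced", download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(emnist, batch_size=256, shuffle=False)

captured = {}
def hook(module, inputs, output):
    captured["acts"] = output.detach()

handle = mnist_model.fc1.register_forward_hook(hook)

all_feats, all_labels = [], []
with torch.no_grad():
    for imgs, lbls in loader:
        mnist_model(imgs.flatten(1))  # forward pass fills captured["acts"]
        _, f = sae(captured["acts"])  # SAE features for this batch
        all_feats.append(f)
        all_labels.append(lbls)
handle.remove()

feats = torch.cat(all_feats)    # (N, d_hidden) SAE activations on EMNIST
labels = torch.cat(all_labels)  # 47 balanced classes: digits and letters
```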
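And to make "fires exclusively on one label" concrete, a sketch of the max-activation check using the `feats`/`labels` tensors from the previous snippet; the `top_k` and purity threshold are arbitrary choices:

```python
import torch

# Which label does each feature's strongest firing concentrate on?
num_classes = int(labels.max()) + 1
top_k = 20

top_vals, top_idx = feats.topk(top_k, dim=0)  # each feature's top-k activating examples
top_labels = labels[top_idx]                  # (top_k, d_hidden) labels of those examples

for j in range(feats.shape[1]):
    counts = torch.bincount(top_labels[:, j], minlength=num_classes)
    best = counts.argmax().item()
    purity = counts[best].item() / top_k      # 1.0 => fires exclusively on one label
    if purity > 0.9:
        print(f"feature {j}: {purity:.0%} of top-{top_k} examples are label {best}")
```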
Next Steps
- Plot some of the data using the techniques from GPT (a plotting sketch follows this list)
- Create a preliminary writeup using these charts and any others to show what I have so far
- REPLICATE, REPLICATE, REPLICATE with other datasets (like fruit or other EMNIST splits) --> basically take all the classification datasets
- Also need to look at the analysis from the meta-SAE paper
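A minimal plotting sketch for the depth results, assuming the `mean_active_by_depth` list collected in the depth sketch under Work Done:

```python
import matplotlib.pyplot as plt

# Plot the per-depth stats collected in the depth sketch
# (`mean_active_by_depth` holds one mean count per meta-SAE depth).
plt.plot(range(len(mean_active_by_depth)), mean_active_by_depth, marker="o")
plt.xlabel("meta-SAE depth")
plt.ylabel("mean active features per sample")
plt.title("Activation count vs. meta-SAE depth")
plt.savefig("depth_vs_active_features.png", dpi=150)
```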
Cool Stuff
- Fruit Classification Dataset: Has 141 Classes
- SAE Vision Explainability
- Transcoders as a way to interpret MLP layers (sketch below)
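For reference, a sketch of the transcoder idea: instead of reconstructing a layer's activations, a transcoder learns a sparse map from an MLP layer's input to its output. The training loop here mirrors the SAE sketch above and is an assumption, not the exact setup from any paper:

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse map from an MLP layer's input to its output: the wide ReLU
    bottleneck gives interpretable features for what the MLP computes."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        f = torch.relu(self.enc(x))
        return self.dec(f), f

def train_transcoder(mlp_in, mlp_out, d_hidden, l1_coeff=1e-3, steps=200, lr=1e-3):
    """Fit dec(relu(enc(mlp_in))) ~ mlp_out with an L1 sparsity penalty on f.
    mlp_in / mlp_out are (N, d) activation tensors captured around the MLP layer."""
    tc = Transcoder(mlp_in.shape[1], d_hidden, mlp_out.shape[1])
    opt = torch.optim.Adam(tc.parameters(), lr=lr)
    for _ in range(steps):
        pred, f = tc(mlp_in)
        loss = torch.mean((pred - mlp_out) ** 2) + l1_coeff * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return tc
```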