Work Done
- Established rudimentary interp via EMNIST classification labels
- Fixed up some of the experiment code to save the indices
- Got preliminary findings that there is an optimal depth for meta-SAEs (a minimal depth sketch follows this list)
- Found that the average number of activations increases with depth, which implies some level of fine-grainedness
- The max activation counts hit their peak two depths in
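To pin down what "depth" means in the bullets above, here is a minimal sketch of the meta-SAE stack, assuming a plain ReLU SAE with an L1 sparsity penalty and full-batch training. `hidden_acts`, the depth count, the width multiplier, and the firing threshold are all placeholder assumptions, not the actual experiment code:

```python
import torch
import torch.nn as nn

class SAE(nn.Module):
    """Minimal sparse autoencoder: linear encode -> ReLU -> linear decode."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse feature activations
        return self.dec(f), f

def train_sae(acts, d_hidden, l1_coeff=1e-3, steps=200, lr=1e-3):
    """Fit an SAE on an (N, d_in) activation tensor: MSE reconstruction + L1 sparsity."""
    sae = SAE(acts.shape[1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, feats = sae(acts)
        loss = torch.mean((recon - acts) ** 2) + l1_coeff * feats.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return sae

# Stack meta-SAEs: each level trains on the feature activations of the level below.
# `hidden_acts` is assumed to be (N, d) activations from the trained MNIST model.
acts = hidden_acts
mean_active_by_depth = []
for depth in range(4):
    sae = train_sae(acts, d_hidden=2 * acts.shape[1])
    with torch.no_grad():
        _, feats = sae(acts)
    active = (feats > 1e-6).sum(dim=1).float()  # how many features fire per sample
    mean_active_by_depth.append(active.mean().item())
    print(f"depth {depth}: mean active features = {active.mean():.1f}")
    acts = feats.detach()  # the next meta-SAE trains on this level's features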
Confusions
- Am I actually doing what I think I am doing?
- What I am trying to do: take a trained MNIST model, feed it MNIST data, and train an SAE on those activations. Then track the SAE's feature values on EMNIST data (in this case a balanced dataset of letters and numbers) to see what patterns show up outside of digits (pipeline sketch after this list)
- The max activations check whether any neurons fire exclusively on one label, OR whether there are basic trends like letters vs. numbers (see the second sketch below)
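A minimal sketch of that pipeline, assuming a torchvision EMNIST "balanced" split and an MLP-style classifier; `mnist_model`, the `fc1` layer name, and `sae` (from the depth sketch above) are placeholders for the real experiment objects:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed: `mnist_model` is the trained classifier (an MLP here) and `sae` is the
# depth-0 SAE from the sketch above; `fc1` is a placeholder hidden-layer name.
emnist = datasets.EMNIST(root="data", split="balanced", download=True,
                         transform=transforms.ToTensor())
loader = DataLoader(emnist, batch_size=256, shuffle=False)

captured = {}
def hook(module, inputs, output):
    captured["acts"] = output.detach()

handle = mnist_model.fc1.register_forward_hook(hook)

all_feats, all_labels = [], []
with torch.no_grad():
    for imgs, lbls in loader:
        mnist_model(imgs.flatten(1))  # forward pass fills captured["acts"]
        _, f = sae(captured["acts"])  # SAE features for this batch
        all_feats.append(f)
        all_labels.append(lbls)
handle.remove()

feats = torch.cat(all_feats)    # (N, d_hidden) SAE activations on EMNIST
labels = torch.cat(all_labels)  # 47 balanced classes: digits and letters
```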
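And to make "fires exclusively on one label" concrete, a sketch of the max-activation check using the `feats`/`labels` tensors from the previous snippet; the `top_k` and purity threshold are arbitrary choices:

```python
import torch

# Which label does each feature's strongest firing concentrate on?
num_classes = int(labels.max()) + 1
top_k = 20

top_vals, top_idx = feats.topk(top_k, dim=0)  # each feature's top-k activating examples
top_labels = labels[top_idx]                  # (top_k, d_hidden) labels of those examples

for j in range(feats.shape[1]):
    counts = torch.bincount(top_labels[:, j], minlength=num_classes)
    best = counts.argmax().item()
    purity = counts[best].item() / top_k      # 1.0 => fires exclusively on one label
    if purity > 0.9:
        print(f"feature {j}: {purity:.0%} of top-{top_k} examples are label {best}")
```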
Next Steps
- Plot some of the data using the techniques from GPT (a plotting sketch follows this list)
- Create a preliminary writeup using these charts and any others to show what I have so far
- REPLICATE, REPLICATE, REPLICATE with other datasets (like fruit or other EMNIST splits) --> basically take all the classification datasets
- Also need to look at the analysis from the meta-SAE paper
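A minimal plotting sketch for the depth results, assuming the `mean_active_by_depth` list collected in the depth sketch under Work Done:

```python
import matplotlib.pyplot as plt

# Plot the per-depth stats collected in the depth sketch
# (`mean_active_by_depth` holds one mean count per meta-SAE depth).
plt.plot(range(len(mean_active_by_depth)), mean_active_by_depth, marker="o")
plt.xlabel("meta-SAE depth")
plt.ylabel("mean active features per sample")
plt.title("Activation count vs. meta-SAE depth")
plt.savefig("depth_vs_active_features.png", dpi=150)
```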
Cool Stuff
- Fruit Classification Dataset: Has 141 Classes
- SAE Vision Explainability
- Transcoders as a way to interpret MLP layers (sketch below)
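For reference, a sketch of the transcoder idea: instead of reconstructing a layer's activations, a transcoder learns a sparse map from an MLP layer's input to its output. The training loop here mirrors the SAE sketch above and is an assumption, not the exact setup from any paper:

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Sparse map from an MLP layer's input to its output: the wide ReLU
    bottleneck gives interpretable features for what the MLP computes."""
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        f = torch.relu(self.enc(x))
        return self.dec(f), f

def train_transcoder(mlp_in, mlp_out, d_hidden, l1_coeff=1e-3, steps=200, lr=1e-3):
    """Fit dec(relu(enc(mlp_in))) ~ mlp_out with an L1 sparsity penalty on f.
    mlp_in / mlp_out are (N, d) activation tensors captured around the MLP layer."""
    tc = Transcoder(mlp_in.shape[1], d_hidden, mlp_out.shape[1])
    opt = torch.optim.Adam(tc.parameters(), lr=lr)
    for _ in range(steps):
        pred, f = tc(mlp_in)
        loss = torch.mean((pred - mlp_out) ** 2) + l1_coeff * f.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return tc
```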