PhD Defense: Interpretability of Deep Models Across Different Architectures and Modalities

Talk
Hamid Kazemi
Time: 
04.08.2024 10:30 to 12:00
Location: 
IRB 5165

The quest to understand deep models has been a longstanding pursuit in research. Specifically, model inversion aims to uncover the inner workings of a model with respect to a target class. This process is crucial for interpreting the inner mechanisms of neural architectures, deciphering the knowledge models acquire, and clarifying their behaviors. However, prevailing model inversion techniques often depend on complex regularizers, such as total variation or feature regularization, that require meticulous calibration for each network to generate satisfactory images. We present Plug-In Inversion, a method that relies on a straightforward set of augmentations and sidesteps the need for extensive hyperparameter tuning. We demonstrate the efficacy of our approach by applying it to invert Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs).

Applying model inversion to CLIP models produces images that are semantically aligned with the provided target prompts. These inverted images offer an opportunity to examine different facets of CLIP models, such as their capacity to fuse concepts and their incorporation of gender biases. Particularly noteworthy is the occurrence of NSFW (Not Safe For Work) images during model inversion. This phenomenon arises even for semantically innocuous prompts, such as "a beautiful landscape," as well as prompts involving celebrity names.

While feature visualizations and image reconstructions have provided valuable insights into the workings of Convolutional Neural Networks (CNNs), these methods have struggled to interpret ViT representations due to their inherent complexity. Nevertheless, we demonstrate that, when applied to the appropriate representations, feature visualizations can indeed succeed with ViTs. This understanding enables us to visually explore ViTs and the information they extract from images.

In the realm of image-based tasks, networks have been extensively studied using feature visualization, which generates interpretable images that activate individual feature maps. These visualizations aid in comprehending and interpreting what the networks perceive. In particular, they reveal the semantic meaning of features at different layers, with shallow features representing edges and deeper features denoting objects. Although this approach has proven effective for vision models, our comprehension of networks that process auditory inputs, such as automatic speech recognition (ASR) models, remains limited because their inputs are non-visual. To address this, we explore methods to sonify, rather than visualize, their feature maps.
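For readers unfamiliar with input-space optimization, the sketch below illustrates the generic idea underlying class-conditional model inversion and feature visualization: an input is optimized by gradient ascent so that a frozen network assigns it a high score for a chosen target. This is a minimal sketch only, not the Plug-In Inversion method presented in the talk; the ResNet-18 model, the target class, the learning rate, and the specific augmentations are illustrative assumptions rather than details from the thesis.

```python
# Minimal sketch of class-conditional model inversion by gradient ascent on the
# input. A frozen classifier scores a randomly initialized image, and the image
# is updated to raise the logit of a chosen target class. Simple augmentations
# are applied at each step in place of hand-tuned image priors.
# Model, target class, and hyperparameters below are illustrative assumptions.
import torch
import torchvision

# Frozen pretrained classifier (illustrative choice of architecture).
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 207  # hypothetical ImageNet label chosen for illustration
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

# Simple augmentations applied to the image before each forward pass.
augment = torch.nn.Sequential(
    torchvision.transforms.RandomResizedCrop(224, scale=(0.7, 1.0), antialias=True),
    torchvision.transforms.RandomHorizontalFlip(),
)

for step in range(500):
    optimizer.zero_grad()
    logits = model(augment(image))
    loss = -logits[0, target_class]  # maximize the target-class logit
    loss.backward()
    optimizer.step()
```

The same optimization loop extends to the other settings discussed in the abstract by swapping the objective: a similarity score against a text prompt's embedding for CLIP inversion, the activation of a chosen feature map for feature visualization, or an ASR model's internal features when optimizing an audio waveform for sonification.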