VLM-Lens (EMNLP 2025 System Demonstration)

arXiv | GitHub

This beta version runs an instruction with up to two images through a VLM of your choice, computes the cosine similarity between the embeddings obtained for each image at a specified layer, and visualizes the probability distribution of the first response token for each image.
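For readers unfamiliar with the similarity score, here is a minimal sketch of cosine similarity between two embedding vectors. It assumes each image's layer output has already been reduced to a single fixed-length vector; how the demo pools or selects that vector is not stated here, and the function name and sample values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for two images taken from the same layer.
emb_image_1 = [0.2, -1.0, 0.5, 0.3]
emb_image_2 = [0.1, -0.8, 0.7, 0.0]

similarity = cosine_similarity(emb_image_1, emb_image_2)
print(f"cosine similarity: {similarity:.4f}")
```

The score lies between -1 and 1. Values near 1 mean the two embeddings point in nearly the same direction at that layer.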

Instructions:

1. Select a VLM from the dropdown
2. Select a layer from the available embedding layers
3. Upload two images for comparison
4. Enter your instruction/question about the images
5. Adjust the number of top tokens to display (1-20)
6. Click "Analyze" to see the first token probability distributions side by side

Note: You can upload just one image if you prefer single image analysis.
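
The top-token display can be thought of as a softmax over the model's logits for the first generated token, followed by a top-k selection. The sketch below illustrates that idea under stated assumptions: the function name, the toy vocabulary, and the logit values are hypothetical, and the demo's actual implementation may differ.

```python
import math


def top_k_token_probs(logits, vocab, k=5):
    """Return the k most probable (token, probability) pairs, highest first."""
    if len(logits) != len(vocab):
        raise ValueError("logits and vocab must have the same length")
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Numerically stable softmax: subtract the max logit before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical first-token logits over a toy five-token vocabulary.
vocab = ["Yes", "No", "The", "A", "It"]
logits = [3.1, 2.4, 0.3, -0.5, 0.0]

top_tokens = top_k_token_probs(logits, vocab, k=3)
for token, prob in top_tokens:
    print(f"{token!r}: {prob:.3f}")
```

Running the same function on the logits obtained for each image gives two ranked lists that can be compared side by side.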