Model folding on large vision models

As vision models such as Vision Transformers (ViTs), CLIP, or SAM grow larger, they require significant memory and computation, which limits their deployment in resource-constrained settings. Traditional compression methods such as pruning, quantization, and distillation reduce model size but often compromise accuracy or require retraining. Model Folding is a recent technique that merges clusters of similar neurons, reducing the parameter count while preserving the model's data statistics and offering a new trade-off between size and performance. While effective on smaller tasks, its impact on large vision models remains unexplored.
This thesis aims to investigate model folding on large-scale vision architectures, evaluating its effectiveness both on its own and in combination with other compression techniques such as quantization.
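To make the core idea concrete, the following is a minimal sketch of folding a pair of linear layers: similar hidden neurons (rows of the first weight matrix) are grouped by k-means, each cluster is replaced by its mean neuron, and the next layer's columns are summed accordingly so that exactly duplicated neurons fold away without changing the function. This is an illustrative toy, not the exact procedure of the published Model Folding method.

```python
import numpy as np

def fold_linear_pair(W1, b1, W2, k, iters=20):
    """Fold layer 1 (W1: [n, d], b1: [n]) followed by layer 2 (W2: [m, n])
    down to k hidden neurons by clustering similar neurons of layer 1.
    Illustrative sketch only, not the published Model Folding algorithm."""
    feats = np.concatenate([W1, b1[:, None]], axis=1)  # cluster on [w | b]
    # Farthest-point initialization, then plain k-means.
    centers = [feats[0]]
    for _ in range(1, k):
        d = np.min([((feats - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(0)
    # Merged neuron = cluster mean; downstream columns are summed.
    W1f = np.stack([W1[assign == j].mean(0) for j in range(k)])
    b1f = np.array([b1[assign == j].mean() for j in range(k)])
    W2f = np.stack([W2[:, assign == j].sum(1) for j in range(k)], axis=1)
    return W1f, b1f, W2f
```

For neurons that are exact duplicates, the folded network computes the same output with fewer parameters; for merely similar neurons, folding introduces an approximation error that the thesis would quantify on real models.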
Interested? Please contact us for more details!


Student Target Groups:

  • Students of ICE;
  • Students of Computer Science;
  • Students of Software Engineering.

Thesis Type:

  • Master Thesis / Master Project

Goals and Tasks:

  • Conduct a literature review on model compression for large vision models;
  • Select and analyze one or more large-scale vision models (e.g., ViT, CLIP, SAM);
  • Implement model folding on selected models;
  • Evaluate performance trade-offs in terms of accuracy, size, and compute on public datasets;
  • Present your findings in a final presentation and written report.

Requirements:

  • Solid knowledge of neural networks and model architectures;
  • Programming skills in Python and experience with PyTorch or TensorFlow;
  • (Optional) Familiarity with Vision Transformers, CLIP, or SAM models.

Used Tools & Equipment:

  • A computation cluster of TU Graz

Start:

  • a.s.a.p.

Contact: