Pruna AI, a European startup working on algorithms for compressing artificial intelligence models, is making its optimization framework open source on Thursday.
Pruna AI is creating a framework that applies several efficiency techniques, such as caching, pruning, quantization, and distillation, to a given AI model.
“We also standardize saving and loading compressed models, applying combinations of these compression techniques, and evaluating your compressed model after it’s compressed,” Pruna AI co-founder and CTO John Rachwan told TechCrunch.
In particular, the Pruna AI framework can evaluate whether compression causes a significant loss of quality, as well as how large the resulting performance gain is.
“To use a metaphor, we are similar to how Hugging Face standardized transformers and diffusers – what to call them, how to store them, how to load them, etc. We are doing the same thing, but for performance methods,” he added.
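To make the evaluation step concrete, here is a minimal sketch of what comparing a model before and after compression might look like. This is purely illustrative and not Pruna's actual API: the `evaluate_compression` helper, the stand-in "models," and the toy quality function are all hypothetical.

```python
import time

def evaluate_compression(base_model, compressed_model, inputs, quality_fn):
    """Report what a compression framework would measure: the speedup of
    the compressed model and how much output quality it gives up."""
    def bench(model):
        start = time.perf_counter()
        outputs = [model(x) for x in inputs]
        return time.perf_counter() - start, outputs

    base_time, base_out = bench(base_model)
    comp_time, comp_out = bench(compressed_model)
    return {
        "speedup": base_time / comp_time,
        "quality_drop": quality_fn(base_out) - quality_fn(comp_out),
    }

# Stand-ins for real models: the "base model" does 100x more work per call.
def base_model(x):
    return sum(i * x for i in range(100_000)) % 97

def compressed_model(x):
    return sum(i * x for i in range(1_000)) % 97

report = evaluate_compression(
    base_model,
    compressed_model,
    inputs=range(20),
    quality_fn=lambda outs: sum(outs) / len(outs),  # placeholder quality score
)
print(f"speedup: {report['speedup']:.1f}x, quality drop: {report['quality_drop']:.2f}")
```

In a real framework the quality function would be a task-specific metric (perplexity, image fidelity, word error rate) and the benchmark would run on representative inputs, but the shape of the report is the same: one number for speed, one for the quality trade-off.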
Large AI labs already use various compression methods. For example, OpenAI uses distillation to create faster versions of its flagship models.
This is probably how OpenAI developed GPT-4 Turbo, a faster version of GPT-4. Similarly, the Flux.1-schnell image generation model is a distilled version of the Flux.1 model from Black Forest Labs.
Distillation is a technique for extracting knowledge from a large AI model using a teacher-student setup. Developers send queries to the teacher model and record its outputs. Sometimes the answers are compared against a dataset to check how accurate they are. These outputs are then used to train a smaller student model that learns to approximate the teacher’s behavior.
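The core of that training step is a loss that pushes the student's output distribution toward the teacher's. The sketch below shows one common formulation, KL divergence over temperature-softened outputs; it is a generic illustration of distillation, not how OpenAI or Black Forest Labs implement it, and the example logits are made up.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by temperature."""
    z = logits / temperature
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.
    Minimizing this trains the student to mimic the teacher's behavior."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])  # recorded teacher outputs (hypothetical)
student = np.array([3.5, 1.2, 0.4])  # student's current outputs (hypothetical)
loss = distillation_loss(teacher, student)
print(f"distillation loss: {loss:.4f}")
```

The temperature softens both distributions so the student also learns from the teacher's relative preferences among wrong answers, not just its top pick; the loss reaches zero only when the student matches the teacher exactly.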
“Big companies tend to build all of this in-house. And what you can find in the open source world is usually based on individual methods. For example, say, one quantization method for LLMs, or one caching method for diffusion models,” said Rachwan. “But you can’t find a tool that brings all those methods together and makes them easy to use and combine. And that’s the great value that Pruna brings to the table right now.”

Although Pruna AI supports all kinds of models, from large language models to diffusion models, speech-to-text models, and computer vision models, the company is currently focusing on image and video generation models.
Existing users of Pruna AI include Scenario and PhotoRoom. In addition to the open source version, Pruna AI offers an enterprise version with advanced optimization features, including an optimization agent.
“The most interesting feature we’ll be releasing soon will be a compression agent,” Rachwan said. “Essentially, you give it your model and say: ‘I want more speed, but don’t reduce my accuracy by more than 2%.’ And then the agent will simply work its magic. It will find the best combination for you and return it to you. You don’t have to do anything as a developer.”
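Under the hood, an agent like that has to search the space of compression techniques for the fastest combination that stays within the accuracy budget. The sketch below shows the idea with a brute-force search; the technique catalogue, its speedup factors, and its accuracy costs are all invented for illustration and say nothing about how Pruna's agent actually works.

```python
import itertools

# Hypothetical catalogue: technique -> (speedup factor, accuracy cost in %)
TECHNIQUES = {
    "quantization": (1.8, 1.2),
    "pruning": (1.4, 0.9),
    "caching": (1.3, 0.0),
}

def best_combination(max_accuracy_drop=2.0):
    """Exhaustively try every combination of techniques and return the
    fastest one whose cumulative accuracy drop stays within the budget."""
    best, best_speedup = (), 1.0
    names = list(TECHNIQUES)
    for r in range(1, len(names) + 1):
        for combo in itertools.combinations(names, r):
            speedup, drop = 1.0, 0.0
            for t in combo:
                s, d = TECHNIQUES[t]
                speedup *= s  # assume speedups compound
                drop += d     # assume accuracy costs add up
            if drop <= max_accuracy_drop and speedup > best_speedup:
                best, best_speedup = combo, speedup
    return best, best_speedup

combo, speedup = best_combination(max_accuracy_drop=2.0)
print(f"best combo: {combo}, speedup: {speedup:.2f}x")
```

With these made-up numbers, quantization plus pruning would exceed the 2% budget, so the search settles on quantization plus caching. A real agent would measure speed and accuracy empirically rather than assume they compound so neatly, but the constraint-satisfaction structure is the same.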
Pruna AI charges by the hour for its pro version. “It’s similar to how you would think of a GPU when you rent one from AWS or any other cloud service,” said Rachwan.
And if your model is a critical part of your AI infrastructure, an optimized model will end up saving you a lot of money on inference. For example, Pruna AI has shrunk a Llama model to an eighth of its size without much quality loss using its compression framework. Pruna AI hopes its customers will think of its compression framework as an investment that pays for itself.
A few months ago, Pruna AI raised $6.5 million in seed funding. The startup’s investors include EQT Ventures, Daphni, Motier Ventures, and Kima Ventures.