Kalaido Documentation
Kalaido Overview
Our primary focus has been to unleash the potential of diffusion models.
- We construct a multi-stage diffusion pipeline involving both “training-free” approaches and minimal training, which enables:
- Richer detail in the image, higher fidelity, and photorealism, delivering the desired images in fewer steps, saving time and GPU costs, and reducing carbon footprint.
- Roughly 40% fewer steps, and correspondingly lower generation time and carbon footprint, while maintaining the features above.
- State-of-the-art post-training alignment via “golden pairs”, built on an optimized direct preference optimization (DPO) recipe (a selection sketch follows this list).
- A multi-stage pipeline that focuses on composition in the first part of generation, producing the base latent, which the second model then refines for better aesthetics (with a stronger focus on human aesthetics).
- Style templates that let users generate images in multiple captivating themes.
- AI-based prompt enhancement that outputs an improved prompt, cutting iteration time by roughly 2x and maximizing creative output.
- We aim to incorporate our human-behaviour-based micro-stimuli framework to further improve the efficiency and accuracy of our prompts.
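The paper listed under Paper Publication(s) describes quality-aware pair ranking for this alignment step. Below is a minimal sketch of the “golden pair” selection idea, under our own assumptions: the record layout, the aesthetic scores, and the minimum score gap are hypothetical placeholders, not Kalaido's actual data or thresholds. Candidate generations for the same prompt are compared under a quality/aesthetic reward, and only (preferred, rejected) pairs with a large score gap are kept for DPO-style training.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Candidate:
    image_path: str         # hypothetical path to a generated image
    aesthetic_score: float  # score from a quality/aesthetic reward model (assumed)

def golden_pairs(candidates, min_gap=1.5):
    """Return (preferred, rejected, gap) tuples whose quality gap exceeds min_gap.

    Pairs with a small gap carry a weak, noisy preference signal; filtering and
    ranking by the gap is the intuition behind quality-aware pair ranking.
    """
    pairs = []
    for a, b in combinations(candidates, 2):
        win, lose = (a, b) if a.aesthetic_score >= b.aesthetic_score else (b, a)
        gap = win.aesthetic_score - lose.aesthetic_score
        if gap >= min_gap:
            pairs.append((win, lose, gap))
    # Rank surviving pairs so the "most golden" pairs come first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)

if __name__ == "__main__":
    cands = [
        Candidate("gen_0.png", 5.1),
        Candidate("gen_1.png", 7.4),
        Candidate("gen_2.png", 6.9),
    ]
    for win, lose, gap in golden_pairs(cands):
        print(f"prefer {win.image_path} over {lose.image_path} (gap={gap:.1f})")
```

The resulting pairs would then feed a standard DPO objective; only the pair-selection step is sketched here.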
Kalaido focuses on unleashing the potential of latent diffusion models with a mix of pre-training and training-free approaches.
Model Details
The diffusion pipeline comprises a cascaded structure of two diffusion models (both operating in latent space) along with one vanilla LoRA and multiple style LoRAs. The training-free approaches fix prevailing issues with the denoising process that otherwise lead to plastic or lower-fidelity images.
Size of Training Data: LAION-2B dataset subset comprising 30M+ images.
The diffusion and LoRA models are pre-trained on the aesthetic subset of the LAION-2B dataset, which comprises around 30M images. This subset contains image-prompt pairs in English that score highly under an aesthetic reward model.
Size of Model: 2.7 billion parameters
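As a rough illustration of the curation described above, the sketch below keeps only records whose aesthetic score clears a threshold. The field names, sample values, and the cutoff are illustrative assumptions, not the exact filters used to build the training subset.

```python
# Minimal sketch of aesthetic filtering; field names and the cutoff are assumed.
records = [
    {"url": "https://example.com/a.jpg", "caption": "a snowy mountain lake", "aesthetic_score": 6.8},
    {"url": "https://example.com/b.jpg", "caption": "blurry storefront photo", "aesthetic_score": 4.2},
    {"url": "https://example.com/c.jpg", "caption": "portrait in golden-hour light", "aesthetic_score": 7.1},
]

AESTHETIC_THRESHOLD = 6.0  # hypothetical cutoff defining the "aesthetic subset"

curated = [r for r in records if r["aesthetic_score"] >= AESTHETIC_THRESHOLD]
print(f"kept {len(curated)} of {len(records)} records")
```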
Technical Architecture

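As a rough illustration of the cascaded two-stage structure described under Model Details, the sketch below uses the Hugging Face diffusers library with public SDXL base and refiner checkpoints as stand-ins. The checkpoint names, step counts, hand-off point, LoRA path, and guidance_rescale value are assumptions for illustration only; they are not Kalaido's models, weights, or settings.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Stage 1: "composition" model. A public SDXL base checkpoint is used purely as a stand-in.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Optional style LoRA, analogous to the style templates mentioned above.
# The path is a hypothetical placeholder.
# base.load_lora_weights("path/to/style_lora.safetensors")

# Stage 2: "refinement" model that polishes the intermediate latent.
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "portrait of a violinist at dusk, soft rim lighting, 85mm photo"

# First part of generation: composition, stopping early and returning the latent.
latent = base(
    prompt=prompt,
    num_inference_steps=24,
    denoising_end=0.8,
    guidance_rescale=0.7,  # one example of a training-free denoising tweak
    output_type="latent",
).images

# Second part: the refiner picks up the intermediate latent and finishes
# denoising with an emphasis on aesthetics.
image = refiner(
    prompt=prompt,
    image=latent,
    num_inference_steps=24,
    denoising_start=0.8,
).images[0]
image.save("two_stage_sketch.png")
```

The part that mirrors the pipeline described above is the hand-off of the intermediate latent from the composition stage to the refinement stage; guidance_rescale stands in, loosely, for the training-free fixes to the denoising process mentioned earlier.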
Benchmarking Details
The first benchmarking table shows that we not only achieve a 20x gain in training efficiency (for alignment) but also surpass vanilla state-of-the-art methods such as DPO, both on quantitative metrics and in a human-evaluation-based qualitative study.


Paper Publication(s)
- Effective Text-to-Image Alignment with Quality Aware Pair Ranking (NeurIPS’24 Adaptive Foundational Model)
- Visual Prompting Methods for GPT-4V Based Zero-Shot Graphic Layout Design Generation (ICLR’24)
All rights reserved © 2025 Fractal Analytics Inc.