Pixels to Pixar: Using AI to Transform Photos into Cartoon Art and 3D Animation

/imagine Pixels to Pixar: Using AI to Transform Photos into Cartoon Art and 3D Animation --ar 16:9

My father was trained as an artist. I remember one of the first lessons that he taught me was to stop looking at the world as being full of things, and start thinking about what the eyes see as a combination of colours, shapes and shadows. His first assignment was to "go outside, choose a tree, focus on a small 2-inch square piece of the bark, and recreate what you see as a pencil sketch". I think he wanted me to draw something that my mind couldn't make sense of: to focus on the interplay of light and dark and to explore what my pencil was capable of, free of any bias from my understanding of the actual objects. I was lucky to have a good teacher.

When I see a new tool or piece of software, I start to think of it in the same way that I considered that piece of bark. I look past the object in front of me and consider the whole picture - what components would I combine to achieve this? How could I use the techniques and concepts that I know to build the same thing? Just like an artist has a toolbox of brush strokes and types of paint, I have APIs and code/architecture patterns.

Lately, I’ve found myself completely immersed in the world of image generation tools. Of course, platforms like DALL-E, Midjourney, Stable Diffusion, and Flux are incredible for transforming text into anything you can imagine. But recently my curiosity has taken a turn toward something even more intriguing: "image-to-image" tools. How do they actually work? And which ones really shine? That’s the rabbit hole I’ve been diving into.

What this Covers

  • Introduction to image-to-image generation with AI
  • Comparison of transformer and diffusion models for image creation
  • Use of ControlNets for guiding diffusion models
  • Customizing diffusion models with LoRAs for specific styles
  • Step-by-step process for "cartoonifying" images
  • Example of creating Pixar-style cartoons with depth maps and LoRAs.

Image-To-Image Generations

While "text-to-image" is amazingly cool, I am fascinated by the deluge of new apps that work from an image as a starting point. These "image-to-image" apps promise to take your photos or images and apply filters or maybe turn them into cartoons. I imagine that many users glance at the results and think, 'Ah, that’s just AI.' But for me, it just kicks off my curiosity. Exactly what kind of AI drives them? How is the whole thing architected? Can I re-create this? For over a month now, I’ve been on a mission to unravel their mechanics - recreating their effects, dissecting their processes, and asking: How can we harness large models to modify images? What techniques yield the most compelling results? And, more importantly, can I make a Pixar-style cartoon version of myself?

The two most prominent methods for image generation are transformer models and diffusion models, so let's start there.

Transformers v. Diffusion Models

Transformer models were originally designed for sequence data and have had massive success in the conversational space: the "T" in ChatGPT stands for transformer. They have since been adapted to generate images by processing input data (e.g., text prompts or partial images) through attention mechanisms, then predicting patches, pixels, or tokens one after another based on learned relationships. The first version of DALL-E by OpenAI operated in this space. You can think of it as a direct map from words to pixels.
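To make "sequentially generate tokens" a bit more concrete, here is a toy sketch in Python. It is not how any real model works internally: `next_token_scores` is a hypothetical stand-in for a trained transformer, and the "image" is just a 4x4 grid of palette indices. The point is only the shape of the loop, where each new token is chosen based on the prompt and everything generated so far.

```python
import random

PALETTE_SIZE = 4        # toy "vocabulary" of image tokens (assumption)
WIDTH, HEIGHT = 4, 4    # toy image size (assumption)

def next_token_scores(prompt_tokens, image_tokens):
    """Hypothetical stand-in for a transformer's output scores.

    A real model would run attention over the prompt and all previous
    image tokens. This toy version simply favours repeating the last token.
    """
    scores = [1.0] * PALETTE_SIZE
    if image_tokens:
        scores[image_tokens[-1]] += 2.0
    elif prompt_tokens:
        scores[prompt_tokens[0] % PALETTE_SIZE] += 2.0
    return scores

def generate(prompt_tokens, seed=0):
    """Generate the image one token at a time, left to right, top to bottom."""
    rng = random.Random(seed)
    image_tokens = []
    for _ in range(WIDTH * HEIGHT):
        scores = next_token_scores(prompt_tokens, image_tokens)
        token = rng.choices(range(PALETTE_SIZE), weights=scores, k=1)[0]
        image_tokens.append(token)
    # Reshape the flat token sequence into rows of the image.
    return [image_tokens[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]

grid = generate(prompt_tokens=[2])
for row in grid:
    print(row)
```

Because every token depends on the ones before it, generation is inherently sequential, which is the property the comparison table below refers to as "autoregressive".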

Diffusion models, such as Stable Diffusion, were specifically designed for image generation. They generate images by starting with random noise and iteratively refining it through a learned denoising process that is usually guided by conditioning inputs like text prompts. This makes them more complex to train, and slightly slower at inference (image building) time. The models are trained to reverse a diffusion process: noise is gradually added to training images, and the model learns to predict the clean image from the noisy one. The number of steps taken at generation time is a parameter that can be altered to achieve different results.
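Here is a deliberately tiny Python sketch of that refinement loop. Everything in it is an assumption for illustration: `toy_denoiser` cheats by already knowing the target "image" (a real model is a neural network that has to predict it from the noisy input and the prompt), and `step_size` is a made-up constant rather than a real noise schedule. What it does show is the role of the step count: start from noise, repeatedly nudge toward the model's prediction, and more steps land closer.

```python
import random

def toy_denoiser(noisy, target):
    """Hypothetical stand-in for the trained network.

    A real denoiser predicts the clean image from the noisy one (plus the
    prompt). This toy version just returns a known target.
    """
    return list(target)

def generate(target, num_steps, step_size=0.3, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # start from pure noise
    for _ in range(num_steps):
        predicted_clean = toy_denoiser(x, target)
        # Move part of the way from the current image toward the prediction.
        x = [xi + step_size * (pi - xi) for xi, pi in zip(x, predicted_clean)]
    return x

def mean_abs_error(image, target):
    return sum(abs(a - b) for a, b in zip(image, target)) / len(target)

target = [0.0, 0.5, 1.0, 0.5]   # a four-"pixel" image (assumption)
for steps in (0, 5, 20):
    error = mean_abs_error(generate(target, steps), target)
    print(f"{steps:>2} steps -> mean error {error:.4f}")
```

In this toy the error shrinks by a fixed factor per step, so the trade-off is plain: more steps cost more compute and get closer to the prediction.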

Feature | Transformer Models | Diffusion Models
Training Complexity | High, due to attention mechanisms and large datasets. | High, due to iterative denoising learning.
Inference Speed | Typically faster for autoregressive models (but still slower than GANs). | Slower due to iterative refinement.
Output Quality | Good, with a strong global context understanding. | Excellent, with fine-grained details.
Versatility | Great for multi-modal tasks (e.g., text-to-image). | Primarily image-focused, with emerging text integration.
Computational Cost | High due to large attention matrices. | High due to multiple forward passes.
Stability | Prone to challenges like mode collapse in some cases. | Very stable, robust against mode collapse.

Attempt #1 - Using Transformers

How could each of these models be used to build my cartoon-filter clone? I started experimenting first with transformers. The best process I found was to have a multi-modal model (e.g., ChatGPT) describe the original image, and then turn that description into a prompt to feed into a transformer image generation model (e.g., DALL-E). I couldn't find a way to directly inject an image into a transformer model. Transformers are fantastic at using vision to understand an image, but they are essentially still seeing objects and not qualities like the colours, shadows and contours that my dad taught me to see. The results were interesting, but I wanted more customization, and the outputs didn't resemble my input image closely enough.

Inside the Weird World of Stable Diffusion

Customizing Stable Diffusion models is a popular area of exploration in the AI world, with researchers and developers delving into how to fine-tune these models to create more tailored results. A vibrant and eclectic community has emerged around the customization of massive foundational models like Stable Diffusion or Flux, bending their capabilities to suit an array of creative purposes. It reminds me a lot of the modding scene in video games. Tweak a parameter here or there and you can transform the output entirely.

You can get a glimpse into the imaginative (and occasionally eccentric) depths of this community on websites like Civitai. Fair warning: these corners of the internet come with a strong undercurrent of NSFW content, much of it carrying the unmistakable vibe of “sexy anime.” Users share their LoRAs (Low-Rank Adaptations), which are a handy way to fine-tune the behaviour of large pre-trained models cheaply and quickly. By applying a LoRA to your image model, you unlock the ability to tailor your outputs to specific styles or genres simply by incorporating a targeted keyword into your prompts. I stumbled upon one that perfectly captured the essence of a 3D Pixar aesthetic. When paired with my diffusion model of choice, Flux, it enabled me to generate the kind of images I was after. While the results without the LoRA were still quite good, the LoRA gives a noticeable boost in quality and opens the door to exploring more styles later.
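Why are LoRAs so cheap to train and share? The core idea is that instead of updating a big weight matrix W directly, you learn two thin matrices B and A and add their product on top: W' = W + (alpha / r) * B * A, where r is the (small) rank. The Python sketch below shows that merge on tiny hand-written matrices. The matrix values and sizes are made up for illustration; real LoRAs apply this to many layers inside the model.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply (kept dependency-free on purpose)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w, lora_b, lora_a, alpha):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    rank = len(lora_a)              # A has shape (rank, d_in)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as W
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def lora_param_count(d_out, d_in, rank):
    """Number of trainable values in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Toy example: a 2x3 base weight with a rank-1 adapter (illustrative values).
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
lora_b = [[1.0],
          [2.0]]
lora_a = [[0.5, 0.0, 1.0]]

merged = merge_lora(w, lora_b, lora_a, alpha=1.0)
print(merged)

# Size comparison for a single 1024x1024 layer at rank 8.
print(1024 * 1024, "full weights vs", lora_param_count(1024, 1024, 8), "LoRA weights")
```

The size comparison at the end is the whole appeal: the adapter for that one layer is a small fraction of the full matrix, which is why LoRA files are small enough to swap around freely.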

ControlNets - Guiding Generation with Input Images

I trawled through message boards and YouTube tutorials hoping to find someone who had solved the problem I was interested in. Amid the noise, I stumbled upon a gem: a paper introducing a technique called ControlNets (2023). This method stood out as both elegant and effective, using tools like edge detection and pose estimation to guide diffusion models with surprising precision.

Image: From the original paper introducing ControlNets for Stable Diffusion models. Left to right: guidance image + generations without prompts, followed by generations with additional text prompt guidance. (https://arxiv.org/abs/2302.05543 - "Adding Conditional Control to Text-to-Image Diffusion Models")

Finally, it felt like I had struck gold. ControlNets promised a practical, intuitive approach to taming these powerful models. They proposed a way to take some simple outlines and guide a Stable Diffusion model to "paint" using them as guidance. The process is fairly mathematical, but there are good tools available to accomplish all of this. The most popular are Automatic1111 and its spiritual successor, ComfyUI. I wanted to be able to build my own software, so I looked for a version of these tools behind an API. I like fal.ai for that purpose, and it even allows you to deploy a ComfyUI workflow as an endpoint if you want to really do some customization. Through some trial and error, I came up with a recipe that works pretty well.
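Where do those "simple outlines" come from? Usually from a classical image-processing step run on the input photo before the diffusion model ever sees it. As a rough illustration, here is a dependency-free Python sketch that turns a tiny grayscale image into a binary edge map using a Sobel gradient with a threshold. This is a simplified stand-in, not the detector used by any particular tool, and the threshold value and the 5x6 test image are arbitrary choices for the example.

```python
def sobel_edges(gray, threshold=1.0):
    """Return a binary edge map (1 = edge) for a 2D list of brightness values.

    Border pixels are left as 0 because the 3x3 kernel does not fit there.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                edges[y][x] = 1
    return edges

# A 5x6 image: dark on the left half, bright on the right half.
photo = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

for row in sobel_edges(photo):
    print(row)
```

The output marks the vertical boundary between the dark and bright halves. An edge map like this (or a depth map, or a pose skeleton) is what gets passed alongside the text prompt as the control image.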

The Cartoon Filter Recipe

  1. Given an input image, derive an outline or depth map.
  2. Generate a description of the image using a transformer model like ChatGPT.
  3. Use that description plus the LoRA I found with Flux to get the final output.

The prompt that I used along with the outlines / depth maps for these generations was "3d animation style, pixar, big eyes. + [ description of original image ]". I found that the depth map version above did a better job of leaving space for those big cartoony animated eyes that I was after. If you want to try this for yourself, you can combine a couple of web tools: ChatGPT for the description of the image, and Fal's Flux + depth with LoRAs tool to generate the image.
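If you would rather wire the recipe up in code, the three steps fit together roughly like the Python sketch below. Only the style prefix comes from the prompt quoted above. The `CartoonPipeline` class and its three callables (`make_control_image`, `describe_image`, `generate_image`) are hypothetical names invented for this sketch: in a real build each one would wrap an actual service (a depth estimator, a multi-modal model, and a Flux + LoRA endpoint), and here they are replaced with stubs so the flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable

# Style prefix taken from the prompt used in this post.
STYLE_PREFIX = "3d animation style, pixar, big eyes."

def build_prompt(description: str) -> str:
    """Combine the fixed style prefix with a description of the photo."""
    return f"{STYLE_PREFIX} {description.strip()}"

@dataclass
class CartoonPipeline:
    # All three callables are placeholders for real services (assumptions).
    make_control_image: Callable[[bytes], bytes]   # step 1: outline or depth map
    describe_image: Callable[[bytes], str]         # step 2: image description
    generate_image: Callable[[str, bytes], bytes]  # step 3: prompt + control -> image

    def run(self, photo: bytes) -> bytes:
        control = self.make_control_image(photo)
        description = self.describe_image(photo)
        prompt = build_prompt(description)
        return self.generate_image(prompt, control)

# Stub implementations so the sketch runs without any external service.
pipeline = CartoonPipeline(
    make_control_image=lambda photo: b"depth:" + photo,
    describe_image=lambda photo: "a smiling man wearing glasses",
    generate_image=lambda prompt, control: prompt.encode() + b"|" + control,
)

result = pipeline.run(b"photo-bytes")
print(result)
```

Keeping each step behind its own callable makes it easy to swap a depth map for an outline, or one model for another, without touching the rest of the flow.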

Playing with "image-to-image" tools has been a lot of fun. From figuring out how transformers and diffusion models tick to tweaking things with LoRAs and ControlNets, it’s been equal parts trial, error, and surprise wins. Honestly, there’s something addictive about seeing an idea click, and it was satisfying to realize my photo-to-cartoon goal. I’m still experimenting, but one thing’s for sure: I’m not done messing with this yet.

🎨
Want to try this out? I added the whole photo to cartoon flow to You Marv - the Practical AI that you Text. Send an image with the text "@cartoonify" and get back a cartoon version of your image in seconds.

Further Reading / Resources

  • How artists see: Painters may view scenes in a way that is similar to how the world really is.
  • FLUX.1 [dev] Depth with LoRAs | Image to Image | AI Playground | fal.ai: Generate high-quality images from depth maps using Flux.1 [dev] depth estimation model. The model produces accurate depth representations for scene understanding and 3D visualization.
  • Adding Conditional Control to Text-to-Image Diffusion Models: We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with “zero convolutions” (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.