Condition Stable Diffusion images with ControlNet

September 21 2023

Background

This post covers another aspect of Stable Diffusion AI image generation, which I have written about previously. With ControlNet, you can ‘condition’ the generation of an image on a spatial input such as a segmentation map or a scribble. That is, the model is constrained to produce images that follow the form of the conditioning image. We can turn a cartoon drawing into a realistic photo, for example, or place a different face in a portrait. A text prompt is still provided to guide the generation process, just as normal. You can do this yourself using the Diffusers library, which exposes a StableDiffusionControlNetPipeline similar to its other pipelines. Much of the code here is adapted from this colab notebook.

Install

pip install -q diffusers==0.14.0 transformers xformers git+https://github.com/huggingface/accelerate.git
pip install -q opencv-contrib-python
pip install -q controlnet_aux

Code

First we create the pipe object. The controlnet argument lets us provide a particular trained ControlNetModel instance while keeping the pre-trained diffusion model weights the same. Here we use the Canny edge ControlNet checkpoint.

import os
import numpy as np
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler

# load the Canny edge ControlNet weights
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
# combine with the standard Stable Diffusion v1.5 weights; safety_checker=None disables the NSFW filter
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16, safety_checker=None
)

# faster scheduler plus memory optimisations
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
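Because the base model weights are unchanged, trying a different kind of conditioning is just a matter of loading another ControlNet checkpoint and rebuilding the pipeline. A minimal sketch, using the scribble checkpoint from the same lllyasviel collection:

# sketch only: re-use the same base model with a different conditioning type
scribble_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
scribble_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=scribble_controlnet,
    torch_dtype=torch.float16, safety_checker=None
)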

The code below makes the conditioning image. We put the source image through the Canny pre-processor, which performs edge detection and returns an edge map. The controlnet_prompt function then takes the canny_img and provides it to the pipeline along with the prompt, saving each generated image to disk.

from diffusers.utils import load_image

def conditioning(image):
    """Convert a PIL image into a Canny edge map for use as the ControlNet condition."""
    import cv2
    from PIL import Image

    image = np.array(image)

    # lower/upper thresholds for the Canny edge detector
    low_threshold = 100
    high_threshold = 200

    image = cv2.Canny(image, low_threshold, high_threshold)
    # the edge map is single channel; stack it to 3 channels for the pipeline
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    canny_image = Image.fromarray(image)
    return canny_image
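As a usage example, load_image (imported above) can fetch a source image from a URL or local path before it is passed to conditioning. The URL here is only a placeholder:

# the URL is a placeholder for your own source image
source = load_image("https://example.com/portrait.png")
canny_image = conditioning(source)
canny_image.save("canny_condition.png")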

def controlnet_prompt(prompt, canny_img, n=1, style=None, path='.'):
    """Generate n images conditioned on canny_img and save them under path."""
    if style is not None:
        prompt += ' by %s' %style

    for c in range(n):
        # fresh random seed per image so repeated calls give different results
        random_seed = np.random.randint(1000)
        generator = torch.Generator(device="cpu").manual_seed(random_seed)
        output = pipe(
            prompt,
            canny_img,
            negative_prompt="disfigured, monochrome, lowres, bad anatomy, worst quality, low quality",
            generator=generator,
            num_inference_steps=20,
        )
        image = output.images[0]
        if not os.path.exists(path):
            os.makedirs(path)
        # find an unused filename derived from the prompt
        i = 1
        imgfile = os.path.join(path, prompt[:90]+'_%s.png' %i)
        while os.path.exists(imgfile):
            i += 1
            imgfile = os.path.join(path, prompt[:90]+'_%s.png' %i)
        image.save(imgfile, 'png')
    return image
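Putting the two functions together, a typical call looks something like the following (the prompt, style and output folder are just examples):

# generate three variations conditioned on the edge map
result = controlnet_prompt(
    "portrait of a woman in a garden",
    canny_image,
    n=3,
    style="rembrandt",
    path="output",
)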

Mona Lisa smile

Let’s try using this to make the Mona Lisa smile. This can be done by using the original painting as the condition, with the face erased (this may not be necessary). We then use the prompt “mona lisa smiling, style of leonardo da vinci”. The results are compared to the original below. Not totally convincing, but it shows the concept.
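In code, the steps are roughly as follows; the filename is a placeholder for wherever you have saved the (face-erased) painting:

# hypothetical filename for the edited painting
mona = load_image("mona_lisa_face_erased.png")
mona_canny = conditioning(mona)
controlnet_prompt("mona lisa smiling, style of leonardo da vinci", mona_canny, n=4, path="mona_lisa")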

Superman actors

Here is the same idea with some actors as Superman, using a picture of Superman with the face erased.
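The same pattern applies, looping over actor names in the prompt; the filename and names here are purely illustrative:

# hypothetical filename for the edited Superman image
superman_canny = conditioning(load_image("superman_face_erased.png"))
for actor in ["nicolas cage", "hugh jackman", "keanu reeves"]:
    controlnet_prompt("%s as superman, photorealistic" % actor, superman_canny, path="superman")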

These are just small examples. There are many other specific uses detailed elsewhere, such as in this much more complete guide, which covers using ControlNet in AUTOMATIC1111, a popular and full-featured Stable Diffusion GUI.