Really large (GPT-3-sized) language models have many more parameters than diffusion models, so it's difficult to load them locally unless you have a server with 8x 3090 or 3x A100 GPUs. As far as I know, Petals is the only way to fine-tune and run inference on 100B+ parameter models from Colab.
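To put rough numbers on that, here is a back-of-the-envelope sketch. The 176B parameter count and the 8-bit/16-bit weight formats are my assumptions for illustration (the comment above doesn't state a precision), and the estimate covers weights only, not activations or attention caches.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, ignoring activations and attention caches."""
    return n_params * bytes_per_param / 1e9

n_params = 176e9  # assumption: a 176B-parameter model

for label, nbytes in [("16-bit", 2), ("8-bit", 1)]:
    print(label, "weights:", weight_memory_gb(n_params, nbytes), "GB")

# Total GPU memory of the two server setups mentioned above.
print("8x RTX 3090 (24 GB each):", 8 * 24, "GB")
print("3x A100 (80 GB each):", 3 * 80, "GB")
```

Under these assumptions the 8-bit weights fit on either server with little room to spare, while the 16-bit weights fit on neither.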
Interesting, how does that work with multiple GPUs? I'm not familiar with the internal workings of these models. Is there anywhere I can get a brief rundown of how the processing is split? I imagine there can't be much swapping between GPUs, as that seems prohibitively slow. How is the model split so that multiple GPUs can work on it in parallel without being bottlenecked by I/O?
For large LMs, people usually use tensor parallelism (TP) or pipeline parallelism (PP). TP involves lots of communication, but keeps all GPUs busy 100% of the time and is faster. PP requires much less communication, but may leave some GPUs idle while they wait for data from others.
Usually, TP is used when you have good communication channels between GPUs (e.g., they are in one data center and connected with NVLink), while PP is used when communication is a bottleneck (like in Petals, where the data is sent over the Internet, which is much slower than NVLink).
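A toy sketch of the difference, in plain Python with lists standing in for GPUs. All function names here are made up for illustration, and the idle-time formula assumes a simple fill-and-drain (GPipe-style) schedule, which is not necessarily what Petals does.

```python
# Toy illustration only: "devices" are plain Python lists, not real GPUs.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def tensor_parallel_layer(weights, x, n_devices):
    """TP: split ONE layer's rows across devices, then gather the pieces."""
    chunk = (len(weights) + n_devices - 1) // n_devices
    shards = [weights[i:i + chunk] for i in range(0, len(weights), chunk)]
    partial = [matvec(shard, x) for shard in shards]  # concurrent on real hardware
    # Gathering the pieces is the communication step, and it happens every layer.
    return [y for part in partial for y in part]

def pipeline_parallel(layers, x, n_stages):
    """PP: each stage owns a contiguous block of layers."""
    per_stage = (len(layers) + n_stages - 1) // n_stages
    stages = [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]
    for stage in stages:  # activations cross a device boundary once per stage
        for weights in stage:
            x = matvec(weights, x)
    return x

def pipeline_idle_fraction(n_stages, n_microbatches):
    """Idle share of a simple fill-and-drain pipeline schedule."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)

layers = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
x = [1, 1]
print(tensor_parallel_layer(layers[0], x, n_devices=2))
print(pipeline_parallel(layers, x, n_stages=2))
print(pipeline_idle_fraction(n_stages=4, n_microbatches=1))
print(pipeline_idle_fraction(n_stages=4, n_microbatches=12))
```

Both splits give the same result as running on one device. What differs is how often data has to move: TP exchanges data inside every layer, PP only at stage boundaries, at the cost of stages waiting on each other when few requests are in flight.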
You can split the model across devices with the Hugging Face Accelerate library.
Check out the infer_auto_device_map function, which works out a placement of the model's layers for your configuration (multiple GPUs, CPU RAM, NVMe), and then run dispatch_model with that device map.
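To give a feel for what such a map looks like, here is a minimal sketch of first-fit placement in plain Python. This is not Accelerate's actual code: assign_layers, the layer names, and the sizes are hypothetical. In Accelerate itself the corresponding calls are infer_auto_device_map and dispatch_model.

```python
def assign_layers(layer_sizes, budgets):
    """First-fit placement of layers onto devices, in order.

    layer_sizes: dict of layer name -> size in GB (hypothetical numbers)
    budgets: dict of device name -> free memory in GB, fastest device first
    """
    device_map = {}
    devices = list(budgets.items())
    idx, used = 0, 0.0
    for name, size in layer_sizes.items():
        # Move on to the next (slower) device once the current one is full.
        while idx < len(devices) and used + size > devices[idx][1]:
            idx, used = idx + 1, 0.0
        if idx == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device_map[name] = devices[idx][0]
        used += size
    return device_map

layers = {"block.0": 4, "block.1": 4, "block.2": 4, "block.3": 4}
budgets = {"gpu:0": 10, "cpu": 6, "disk": 100}
print(assign_layers(layers, budgets))
```

Here the first two blocks land on the GPU, the third spills to CPU RAM, and the last goes to disk. Layers stay in order, so data only moves to a slower device when the faster one is full.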
Clarification: you can also use offloading on Colab, but inference with offloading is at least 10x slower (see other comment threads). So it can't really be used for interactive inference, but it may work for fine-tuning with large batches or long sequences.