Really large (GPT-3-sized) language models have many more parameters than diffus...

thot_experiment · on Jan 2, 2023

Interesting, how does that work with multiple GPUs? I'm not familiar with the internal workings of these models; is there anywhere I can get a brief rundown of how the processing is split? I imagine there can't be much swapping between GPUs, as that seems prohibitively slow. How is the model split so that it can be worked on in parallel by multiple GPUs w/o being bottlenecked by IO?

borzunov · on Jan 2, 2023

I think this is a relevant link for you: https://huggingface.co/transformers/v4.9.0/parallelism.html

For large LMs, people usually use tensor-parallelism (TP) or pipeline-parallelism (PP). TP involves lots of communication, but uses all GPUs 100% of the time and works faster. PP requires much less communication, but may keep some GPUs idle while they are waiting for data from others.

Usually, TP is used when you have good communication channels between GPUs (e.g., they are in one data center and connected with NVLink), while PP is used when communication is a bottleneck (like in Petals, where the data is sent over the Internet, which is much slower than NVLink).
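To make the trade-off concrete, here is a toy sketch in plain Python, not how any real framework implements it. "Devices" are just list slices, every layer is a bare matrix-vector product, and the function names are made up for illustration. The TP variant shards each layer's rows across devices and gathers the partial outputs after every layer; the PP variant gives each device a contiguous block of layers and only passes activations between stages.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(layers, x):
    """Reference: run every layer on a single device."""
    for W in layers:
        x = matvec(W, x)
    return x

def tensor_parallel_forward(layers, x, n_devices):
    """Toy TP: every device holds a row shard of EVERY layer.
    Partial outputs are gathered after each layer, so the number of
    communication rounds grows with the number of layers."""
    comm_rounds = 0
    for W in layers:
        chunk = -(-len(W) // n_devices)  # ceiling division
        shards = [W[d * chunk:(d + 1) * chunk] for d in range(n_devices)]
        partials = [matvec(shard, x) for shard in shards]  # parallel in reality
        x = [v for part in partials for v in part]         # gather step
        comm_rounds += 1
    return x, comm_rounds

def pipeline_parallel_forward(layers, x, n_devices):
    """Toy PP: every device holds a contiguous block of whole layers.
    Activations cross a device boundary only between stages."""
    per_stage = -(-len(layers) // n_devices)
    stages = [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]
    transfers = 0
    for i, stage in enumerate(stages):
        if i > 0:
            transfers += 1  # send activations to the next device
        x = forward(stage, x)
    return x, transfers

layers = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[2, 0], [0, 2]],
    [[1, 1], [1, -1]],
]
x0 = [1, 2]
reference = forward(layers, x0)
tp_out, tp_comm = tensor_parallel_forward(layers, x0, n_devices=2)
pp_out, pp_comm = pipeline_parallel_forward(layers, x0, n_devices=2)
print(reference, tp_out, tp_comm, pp_out, pp_comm)
```

Both variants compute the same result as the single-device version. With 4 layers on 2 devices, the toy TP needs a gather after each of the 4 layers, while the toy PP needs a single hand-off between the two stages. That is why PP tolerates slow links better, at the cost of stages waiting on each other.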

nmitchko · on Jan 2, 2023

You can split the model across devices with the Hugging Face Accelerate library.

Check out the infer_auto_device_map method, which computes a device map for your configuration (multiple GPUs, RAM, NVMe), and then run dispatch_model with that device map.
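For intuition about what such a map looks like, here is a toy sketch in plain Python. It is not Accelerate's actual algorithm: `toy_device_map`, the layer names, and the sizes are all hypothetical. It walks the layers in order and greedily fills each device's memory budget before spilling to the next one.

```python
def toy_device_map(layer_sizes, budgets):
    """Greedily assign layers (in order) to devices (in order).

    layer_sizes: dict of layer name -> size in GiB (hypothetical numbers)
    budgets:     dict of device -> free memory in GiB, fastest device first
    Returns a dict of layer name -> device.
    """
    devices = list(budgets.items())
    idx = 0
    free = devices[0][1]
    device_map = {}
    for name, size in layer_sizes.items():
        # Move on to the next (slower) device once the current one is full.
        while size > free:
            idx += 1
            if idx == len(devices):
                raise ValueError(f"{name} does not fit on any remaining device")
            free = devices[idx][1]
        device_map[name] = devices[idx][0]
        free -= size
    return device_map

layer_sizes = {"embed": 2, "block.0": 4, "block.1": 4, "block.2": 4, "lm_head": 2}
budgets = {0: 6, 1: 5, "cpu": 16}  # two small GPUs, then CPU RAM
device_map = toy_device_map(layer_sizes, budgets)
print(device_map)
```

With these made-up numbers, the first two layers land on GPU 0, the next on GPU 1, and the rest spill to CPU RAM. Layers placed on the CPU (or disk) have to be moved to a GPU when they run, which is where the offloading slowdown comes from.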

zone411 · on Jan 2, 2023

You can read all the gory details here: https://arxiv.org/pdf/2207.00032.pdf

borzunov · on Jan 2, 2023

Clarification: You can also use offloading on Colab, but inference with offloading is at least 10x slower (see other comment threads). So it can't really be used for interactive inference, but may be used for fine-tuning with large batches/sequence lengths.