Data Parallelism
in data parallelism:
- the master gpu holds the model weights, and a copy is sent to every slave gpu
- each slave runs the forward (and backward) pass on its own slice of the batch
- the gradients are sent back to the master, summed, and only the master model is updated
- the slave replicas are then thrown away, so on every iteration the model has to be re-copied to every slave device — this per-step replication is what makes the approach slow (see the sketch below)
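A minimal sketch of that update loop, using NumPy and a one-weight linear model as a stand-in for a real network. Names such as `n_workers`, `replicas`, and `master_weights` are illustrative assumptions, not any framework's API; the point is only to show the replicate → scatter → sum-gradients → update-master cycle described above.

```python
# sketch of the data-parallel loop from the notes above (assumed names, toy model)
import numpy as np

rng = np.random.default_rng(0)
n_workers = 4          # "slave" devices
lr = 0.1

# toy data: y = 3x, model is a single weight w
X = rng.normal(size=(128, 1))
y = 3.0 * X

master_weights = np.zeros((1, 1))   # lives on the "master" gpu

for step in range(100):
    # 1. copy the master weights to every worker (replicas rebuilt each step)
    replicas = [master_weights.copy() for _ in range(n_workers)]

    # 2. split the batch; each worker does forward/backward on its own shard
    shards_x = np.array_split(X, n_workers)
    shards_y = np.array_split(y, n_workers)
    grads = []
    for w, xs, ys in zip(replicas, shards_x, shards_y):
        pred = xs @ w                         # forward pass on this shard
        grad = xs.T @ (pred - ys) / len(xs)   # gradient of 0.5*MSE w.r.t. w
        grads.append(grad)

    # 3. gradients are sent back to the master and combined (averaged here)
    total_grad = sum(grads) / n_workers

    # 4. only the master model is updated; the worker replicas are discarded
    master_weights -= lr * total_grad

print("learned weight:", master_weights.ravel())  # converges to ~3.0
```

Step 1 is the cost the notes point at: the replicas exist only for a single forward/backward pass and must be rebuilt from the master on every iteration.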