Accelerating Training Using Multiple GPUs

Experimental Setup

Training Environment: Athena

Training Data: A subset of 1000 samples randomly selected from the HKUST training dataset.

Network: LAS Model

Primary Network Configuration: NUM_EPOCHS 1, BATCH_SIZE 10

With Horovod + TensorFlow, the training time varies with the number of servers and GPUs. In every run, the training data and network structure are kept the same, and the model is trained for one epoch. The results are shown below:

Training time using `Horovod` + `TensorFlow` (Character)

| Server and GPU number | 1S - 1GPU | 1S - 2GPU | 1S - 3GPU | 1S - 4GPU | 2S - 2GPU | 2S - 4GPU | 2S - 6GPU | 2S - 8GPU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training time (s per epoch) | 121.409 | 83.111 | 61.607 | 54.507 | 82.486 | 49.888 | 33.333 | 28.101 |

Result Analysis

As shown in the table above, training time decreases as more GPUs are used. With four GPUs on one server, training is about 2.2 times faster than with one GPU (54.507 s versus 121.409 s per epoch).
The communication overhead between servers is very small when using Horovod. We trained a model with the same structure using 1 server with 2 GPUs and using 2 servers with 1 GPU each (the 1S - 2GPU and 2S - 2GPU columns). The total training times are almost the same: 83.111 s and 82.486 s.
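
The figures quoted in this analysis can be re-derived from the table with a few lines of Python. This is a minimal sketch, not part of Athena. The helper names `speedup` and `efficiency` are hypothetical, and "parallel efficiency" (speedup divided by GPU count) is an extra metric added here for illustration; it is not reported in the original experiment.

```python
# Training times in seconds per epoch, copied from the table above.
# Keys are (number of servers, total number of GPUs).
times = {
    (1, 1): 121.409,
    (1, 2): 83.111,
    (1, 3): 61.607,
    (1, 4): 54.507,
    (2, 2): 82.486,
    (2, 4): 49.888,
    (2, 6): 33.333,
    (2, 8): 28.101,
}

# Single-server, single-GPU run used as the reference point.
baseline = times[(1, 1)]


def speedup(servers, gpus):
    """Baseline time divided by the time of the given configuration."""
    return baseline / times[(servers, gpus)]


def efficiency(servers, gpus):
    """Speedup per GPU (illustrative metric, not in the original report)."""
    return speedup(servers, gpus) / gpus


for (servers, gpus), seconds in times.items():
    print(
        f"{servers}S - {gpus}GPU: {seconds:8.3f} s  "
        f"speedup {speedup(servers, gpus):.2f}x  "
        f"efficiency {efficiency(servers, gpus):.0%}"
    )

# Relative gap between 2 GPUs on one server and 2 GPUs spread over two servers.
rel_diff = abs(times[(1, 2)] - times[(2, 2)]) / times[(1, 2)]
print(f"1S - 2GPU vs 2S - 2GPU relative difference: {rel_diff:.2%}")
```

The same-GPU-count comparison at the end is what supports the claim about cross-server communication overhead: the two 2-GPU configurations differ by less than one percent.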