Accelerating Training Using Multiple GPUs
Experimental Setup

- Training environment: Athena
- Training data: a randomly selected subset of 1000 samples from the HKUST training dataset
- Network: LAS
- Primary network configuration: NUM_EPOCHS = 1, BATCH_SIZE = 10
We measured how the training time changes with the number of servers and GPUs when using Horovod + TensorFlow. The training data, network structure, and other settings are kept the same, and each configuration trains for one epoch. The experimental results are shown below:
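As an illustration, the multi-server runs in the table can be launched with Horovod's `horovodrun` launcher, where `-np` is the total number of worker processes and `-H` lists host:GPU-slot pairs. The hostnames and the training script name below are placeholders, not taken from the Athena sources.

```shell
# Hypothetical launch commands; "server1"/"server2" and train.py are placeholders.

# 1 server, 4 GPUs (the 1S - 4GPU column): 4 worker processes on one host.
horovodrun -np 4 -H localhost:4 python train.py

# 2 servers, 4 GPUs each (the 2S - 8GPU column): 8 workers across two hosts.
horovodrun -np 8 -H server1:4,server2:4 python train.py
```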
Training time using Horovod + TensorFlow (Character)
Server and GPU number | 1S - 1GPU | 1S - 2GPU | 1S - 3GPU | 1S - 4GPU | 2S - 2GPU | 2S - 4GPU | 2S - 6GPU | 2S - 8GPU |
---|---|---|---|---|---|---|---|---|
Training time (s / epoch) | 121.409 | 83.111 | 61.607 | 54.507 | 82.486 | 49.888 | 33.333 | 28.101 |
Result Analysis
As shown in the table above, the training time gets shorter as more GPUs are used. The speedup with four GPUs is about 2.2x compared to one GPU (121.409 s vs. 54.507 s per epoch).
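The 2.2x figure can be checked directly against the table; a quick sketch:

```python
# Per-epoch training times (seconds) copied from the table above.
times = {
    "1S-1GPU": 121.409, "1S-2GPU": 83.111, "1S-3GPU": 61.607,
    "1S-4GPU": 54.507, "2S-2GPU": 82.486, "2S-4GPU": 49.888,
    "2S-6GPU": 33.333, "2S-8GPU": 28.101,
}

# Speedup of each configuration over the single-GPU baseline.
baseline = times["1S-1GPU"]
speedup = {cfg: round(baseline / t, 2) for cfg, t in times.items()}

print(speedup["1S-4GPU"])  # → 2.23, the ~2.2x reported above
print(speedup["2S-8GPU"])  # → 4.32 with eight GPUs across two servers
```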
The communication overhead between different servers is very small when using Horovod. We trained models with the same structure using 1 server with 2 GPUs and using 2 servers with 1 GPU each; the total training times are almost the same (83.111 s vs. 82.486 s).
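The near-zero inter-server overhead can be read off the table the same way:

```python
# Per-epoch training times (seconds) from the table: one server with two
# GPUs vs. two servers with one GPU each (same total of two workers).
t_one_server = 83.111   # 1S - 2GPU
t_two_servers = 82.486  # 2S - 2GPU

gap = t_one_server - t_two_servers
print(f"{gap:.3f} s")  # → 0.625 s difference over a full one-epoch run
```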