Standard backpropagation: Variation of learning rate

Introduction

A range of learning rates at roughly uniform intervals was examined, in order to give a good idea of which learning rates are most suitable for training a FFNN on the iris dataset using back-propagation. The test set error is also examined to see how well the trained network generalises. This is a fairly broad investigation, however, so although the intention was to find good learning rates, this is not essential, and the following data also show other characteristics resulting from training.

Method

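Learning rates: 0.01, 0.03, 0.06, 0.1, 0.13, 0.16, 0.2, 0.23, 0.26, 0.3, 0.4

Each learning rate was used for five training runs of 100 cycles. The mean and standard deviation of the final M.S.E. over the five runs are reported for both the training set and the test set.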
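To make the role of the learning rate concrete, the following is a minimal sketch of per-pattern (online) gradient descent on a single sigmoid unit, the simplest case of the back-propagation weight update. It is not the network used in these experiments: the architecture is not stated here, and the toy AND data, the zero initial weights and the names `PATTERNS` and `train` are assumptions made purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy data (logical AND), a hypothetical stand-in for the iris patterns.
PATTERNS = [
    ((0.0, 0.0), 0.0),
    ((0.0, 1.0), 0.0),
    ((1.0, 0.0), 0.0),
    ((1.0, 1.0), 1.0),
]

def train(learning_rate, cycles=100):
    """Train one sigmoid unit with per-pattern updates; return the final MSE."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(cycles):
        for inputs, target in PATTERNS:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            # Error signal for squared error passed back through the sigmoid.
            delta = (target - output) * output * (1.0 - output)
            # The learning rate scales every weight change.
            weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias += learning_rate * delta
    errors = [
        (target - sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)) ** 2
        for inputs, target in PATTERNS
    ]
    return sum(errors) / len(errors)

for rate in (0.01, 0.1, 0.3):
    print(rate, round(train(rate), 4))
```

On this toy problem a larger learning rate leaves a lower error after the same 100 cycles, which is the same qualitative effect the experiments below look for on the iris data.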

Learning rate: 0.01


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.3663 + 0.4267 + 0.3835 + 0.3594 + 0.4129) / 5 = 0.3898

Standard Deviation after 100 cycles = 0.0292

Average M.S.E. on test set after 100 cycles = (0.3068 + 0.3670 + 0.3288 + 0.2769 + 0.3190) / 5 = 0.3197

Standard Deviation on test set after 100 cycles = 0.0328
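The averages and standard deviations in these result blocks can be recomputed from the per-run values. The sketch below does this for the training-set figures at learning rate 0.01, using Python's standard `statistics` module. The text does not say which form of the standard deviation was used, so it is an assumption that it is the sample form (dividing by n - 1), although that form does reproduce the value quoted for these five runs.

```python
import statistics

# Final training-set MSE of the five runs at learning rate 0.01 (from the text above).
train_mse = [0.3663, 0.4267, 0.3835, 0.3594, 0.4129]

mean_mse = statistics.mean(train_mse)
sample_sd = statistics.stdev(train_mse)       # divides by n - 1
population_sd = statistics.pstdev(train_mse)  # divides by n

print(round(mean_mse, 4))   # 0.3898
print(round(sample_sd, 4))  # 0.0292
print(round(population_sd, 4))
```

The population form gives a smaller value, so the choice matters with only five runs per learning rate.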

Learning rate: 0.03


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.3490 + 0.2107 + 0.3518 + 0.2074 + 0.2480) / 5 = 0.2734

Standard Deviation after 100 cycles = 0.0721

Average M.S.E. on test set after 100 cycles = (0.3404 + 0.1456 + 0.3412 + 0.1458 + 0.1621) / 5 = 0.2270

Standard Deviation on test set after 100 cycles = 0.1041

Learning rate: 0.06


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.1189 + 0.3354 + 0.080 + 0.1544 + 0.0645) / 5 = 0.1506

Standard Deviation after 100 cycles = 0.1189

Average M.S.E. on test set after 100 cycles = (0.1221 + 0.3638 + 0.0865 + 0.1465 + 0.0824) / 5 = 0.1603

Standard Deviation on test set after 100 cycles = 0.1168

Learning rate: 0.1


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.0642 + 0.0474 + 0.0760 + 0.3028 + 0.0548) / 5 = 0.1090

Standard Deviation after 100 cycles = 0.1088

Average M.S.E. on test set after 100 cycles = (0.0881 + 0.0716 + 0.1169 + 0.4889 + 0.0744) / 5 = 0.1680

Standard Deviation on test set after 100 cycles = 0.1803

Learning rate: 0.13


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.3177 + 0.0530 + 0.0597 + 0.0480 + 0.0512) / 5 = 0.1059

Standard Deviation after 100 cycles = 0.1185

Average M.S.E. on test set after 100 cycles = (0.2092 + 0.1003 + 0.0785 + 0.0834 + 0.1001) / 5 = 0.1143

Standard Deviation on test set after 100 cycles = 0.0539

Learning rate: 0.16


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.0557 + 0.0466 + 0.0500 + 0.0501 + 0.0460) / 5 = 0.0497

Standard Deviation after 100 cycles = 0.0038

Average M.S.E. on test set after 100 cycles = (0.0839 + 0.0895 + 0.0671 + 0.0879 + 0.0856) / 5 = 0.0828

Standard Deviation on test set after 100 cycles = 0.00903

Learning rate: 0.20


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.0450 + 0.0441 + 0.0507 + 0.0514 + 0.0366) / 5 = 0.0456

Standard Deviation after 100 cycles = 0.0060

Average M.S.E. on test set after 100 cycles = (0.1099 + 0.1129 + 0.1181 + 0.1282 + 0.0752) / 5 = 0.1087

Standard Deviation on test set after 100 cycles = 0.0201

Learning rate: 0.23


^ Error graph for training set runs.

Average M.S.E. after 100 cycles = (0.0394 + 0.0365 + 0.0372 + 0.0430 + 0.0429) / 5 = 0.0398

Standard Deviation after 100 cycles = 0.0031

Average M.S.E. on test set after 100 cycles = (0.0779 + 0.0624 + 0.0590 + 0.1065 + 0.0843) / 5 = 0.0780

Standard Deviation on test set after 100 cycles = 0.0191

Learning rate: 0.26

Average M.S.E. after 100 cycles = (0.0484 + 0.0314 + 0.0328 + 0.0408 + 0.0286) / 5 = 0.0364

Standard Deviation after 100 cycles = 0.0081

Average M.S.E. on test set after 100 cycles = (0.107 + 0.1515 + 0.1205 + 0.1407 + 0.1119) / 5 = 0.1263

Standard Deviation on test set after 100 cycles = 0.0191

Learning rate: 0.3

Average M.S.E. after 100 cycles = (0.0346 + 0.0534 + 0.0353 + 0.0316 + 0.0473) / 5 = 0.0404

Standard Deviation after 100 cycles = 0.0094

Average M.S.E. on test set after 100 cycles = (0.1289 + 0.1006 + 0.0958 + 0.1133 + 0.2141) / 5 = 0.1305

Standard Deviation on test set after 100 cycles = 0.0484

Learning rate: 0.4

Average M.S.E. after 100 cycles = (0.0602 + 0.0705 + 0.0283 + 0.0378 + 0.0661) / 5 = 0.0525

Standard Deviation after 100 cycles = 0.0185

Average M.S.E. on test set after 100 cycles = (0.0602 + 0.1078 + 0.2016 + 0.1813 + 0.0516) / 5 = 0.1205

Standard Deviation on test set after 100 cycles = 0.0686

Summary

 
Learning rate   Average MSE (Training)   Average MSE (Testing)   S.D. MSE (Training)   S.D. MSE (Testing)
0.01            0.3898                   0.3197                  0.0292                0.0328
0.03            0.2734                   0.2270                  0.0721                0.1041
0.06            0.1506                   0.1603                  0.1189                0.1168
0.10            0.1090                   0.1680                  0.1088                0.1803
0.13            0.1059                   0.1143                  0.1185                0.0539
0.16            0.0497                   0.0828                  0.0038                0.00903
0.20            0.0456                   0.1087                  0.0060                0.0201
0.23            0.0398                   0.0780                  0.0031                0.0191
0.26            0.0364                   0.1263                  0.0081                0.0191
0.3             0.0404                   0.1305                  0.0094                0.0484
0.4             0.0525                   0.1205                  0.0185                0.0686
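As a quick way to read the summary, the sketch below copies the mean values from the table above into a dictionary and picks out the learning rates with the lowest mean training and test error. The variable names are illustrative only, and the difference between test and training error is used here merely as a rough indicator of generalisation, not as a measure taken from the original experiments.

```python
# Mean MSE after 100 cycles, copied from the table above:
# learning rate -> (mean training MSE, mean test MSE).
summary = {
    0.01: (0.3898, 0.3197),
    0.03: (0.2734, 0.2270),
    0.06: (0.1506, 0.1603),
    0.10: (0.1090, 0.1680),
    0.13: (0.1059, 0.1143),
    0.16: (0.0497, 0.0828),
    0.20: (0.0456, 0.1087),
    0.23: (0.0398, 0.0780),
    0.26: (0.0364, 0.1263),
    0.30: (0.0404, 0.1305),
    0.40: (0.0525, 0.1205),
}

best_train = min(summary, key=lambda rate: summary[rate][0])
best_test = min(summary, key=lambda rate: summary[rate][1])
print(best_train, best_test)  # 0.26 0.23

# Gap between test and training error for each learning rate.
for rate, (train_mse, test_mse) in summary.items():
    print(f"{rate:<5} gap = {test_mse - train_mse:+.4f}")
```

The lowest mean training error and the lowest mean test error occur at different learning rates, so the ranking depends on which set is used.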

Mean MSE on training set (with standard deviations)

Mean MSE on test set (with standard deviations)

Conclusion

At very low learning rates, the error on both the training and test sets is still quite high after 100 cycles. The error tends to converge more quickly as the learning rate is increased, although around a learning rate of 0.1 this also increases the deviation between runs. The results for a learning rate of 0.13 show one run in which a local minimum was reached. At higher learning rates the MS error increases early on in some runs, but by 100 cycles it is relatively stable at around 0.05 on the training set. Towards 0.3, and particularly at 0.4, learning is less stable while converging and the error fluctuates greatly in some runs. However, a t-test analysis shows no statistically significant difference between any of the sets of results obtained at learning rates of 0.16 and above (the probability of the null hypothesis is always greater than 10%).
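The text does not say which form of t-test was used or which sets of results were compared. As an illustration only, the sketch below computes a two-sample Student's t statistic with pooled variance (an assumption) for a single pair, the test-set results at learning rates 0.16 and 0.23, and compares it with the two-sided 10% critical value for 8 degrees of freedom. The function name `pooled_t` is hypothetical, and other pairs of learning rates would need to be checked separately.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Return Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    t_value = (statistics.mean(sample_a) - statistics.mean(sample_b)) / std_err
    return t_value, dof

# Test-set MSE of the five runs at learning rates 0.16 and 0.23 (from the results above).
test_016 = [0.0839, 0.0895, 0.0671, 0.0879, 0.0856]
test_023 = [0.0779, 0.0624, 0.0590, 0.1065, 0.0843]

t_stat, dof = pooled_t(test_016, test_023)

# Two-sided critical value of Student's t at the 10% level for 8 degrees of freedom.
T_CRIT_10PCT_DOF8 = 1.860

print(round(t_stat, 2), dof, abs(t_stat) < T_CRIT_10PCT_DOF8)
```

For this pair the statistic falls well inside the critical value, so the difference between the two means is not significant at the 10% level.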

The results could be made more significant by performing more runs of the network for each learning rate, and by examining more learning rates within particular ranges; these experiments deliberately covered a fairly broad range of learning rates. It would also be interesting to examine quantitatively what happens after 100 cycles. It is assumed that overtraining would occur in many cases and that, with large learning rates, 'thrashing' (oscillation) would be exhibited.
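The oscillation expected at large learning rates can be seen in a much simpler setting than a neural network. The sketch below runs plain gradient descent on a one-dimensional quadratic error surface. The surface, its curvature of 5.0 and the function names are all hypothetical choices for illustration, and say nothing about where oscillation would begin for the iris network.

```python
def descend(learning_rate, steps=20, start=1.0, curvature=5.0):
    """Gradient descent on E(w) = 0.5 * curvature * w**2; return the weight after each step."""
    w = start
    path = []
    for _ in range(steps):
        gradient = curvature * w
        w = w - learning_rate * gradient
        path.append(w)
    return path

def sign_changes(values):
    """Count how often consecutive values have opposite signs."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)

for rate in (0.01, 0.3, 0.5):
    path = descend(rate)
    print(rate, sign_changes(path), abs(path[-1]))
```

With the smallest rate the weight creeps towards the minimum without ever overshooting. With the middle rate it overshoots on every step but still converges, and with the largest rate each overshoot is bigger than the last, so the error grows instead of shrinking.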
