
[Figures: networks with 2, 3, 4, 5, 6 and 7 hidden neurons]

| Number of hidden neurons | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Average MSE (Training) | 0.46 | 0.2061 | 0.3194 | 0.437 | 0.0670 | 0.0890 | 0.07418 |
| Average MSE (Testing) | 0.5637 | 0.1834 | 0.2671 | 0.1215 | 0.1088 | 0.111 | 0.1162 |
| S.D. MSE (Training) | 0.1775 | 0.1995 | 0.0276 | 0.0303 | 0.0114 | 0.0155 | 0.0127 |
| S.D. MSE (Testing) | 0.1392 | 0.1826 | 0.3051 | 0.0349 | 0.0197 | 0.0093 | 0.0157 |
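As a rough illustration of how the average and S.D. rows could be produced, the sketch below summarises the MSE values from several repeated training runs of one network size. The function name `summarise_runs` and the example values are hypothetical, and the use of the sample standard deviation is an assumption; the experiments may have used the population form instead.

```python
from statistics import mean, stdev


def summarise_runs(mse_per_run):
    """Return (mean, sample standard deviation) of MSE values from repeated runs."""
    return mean(mse_per_run), stdev(mse_per_run)


# Placeholder values for illustration only; these are NOT results from the experiments.
example_runs = [0.10, 0.12, 0.11, 0.13, 0.09]

avg, sd = summarise_runs(example_runs)
print(f"mean MSE = {avg:.4f}, s.d. = {sd:.4f}")
```

Calling the function once per network size, separately for the training and testing errors, would give one column of the table.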
Training data graph mean MSE with standard deviation
Testing data graph mean MSE with standard deviation
Conclusion
With only one or two hidden neurons the network is incapable of learning, as the high error rates on both the training and testing sets indicate. With three hidden neurons the network appears able to learn the training data but behaves inconsistently on the test set. A marked improvement appears once four or more neurons are used, which suggests that four is the minimum number of hidden neurons required for successful learning and generalisation by an FFNN on this problem. Statistically, there is no significant difference between the results for networks with four or more hidden neurons. Based on the principle of Occam's Razor, we should therefore favour networks with four hidden neurons when using the Iris dataset. This choice is reinforced by the widely known tendency of networks with a large number of hidden neurons to overfit the training data.
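The Occam's Razor argument can be sketched as a simple selection rule: find the network size with the lowest mean test MSE, then pick the smallest network whose mean test MSE lies within one standard deviation of that best value. This "one standard deviation" heuristic is an assumption made for illustration and is not necessarily the statistical test used to reach the conclusion above. The function name `smallest_adequate_network` is hypothetical; the numbers are copied from the testing rows of the results table.

```python
# Testing-set values copied from the results table, keyed by number of hidden neurons.
test_mean = {1: 0.5637, 2: 0.1834, 3: 0.2671, 4: 0.1215, 5: 0.1088, 6: 0.111, 7: 0.1162}
test_sd = {1: 0.1392, 2: 0.1826, 3: 0.3051, 4: 0.0349, 5: 0.0197, 6: 0.0093, 7: 0.0157}


def smallest_adequate_network(mean, sd):
    """Smallest size whose mean MSE is within one s.d. of the best mean MSE."""
    best = min(mean, key=mean.get)        # size with the lowest mean MSE
    threshold = mean[best] + sd[best]     # tolerance band above the best mean
    return min(n for n in mean if mean[n] <= threshold)


print(smallest_adequate_network(test_mean, test_sd))  # prints 4
```

Under this heuristic the lowest mean test MSE belongs to the five-neuron network, and four neurons is the smallest size that falls inside its tolerance band.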