One of the best parts of working on the analytics team at Locus Energy is its access to one of the world's largest repositories of photovoltaic power generation and solar irradiance data. This gives the analytics team an opportunity to apply machine learning algorithms to a data set that very few people have had an opportunity to use. In this context, the team is occasionally fortunate enough to run into real-world problems that behave slightly differently from what textbooks, papers, and blogs would lead one to expect.

One such situation occurred when discussing potential changes to the statistical model that is used to estimate surface irradiance in the virtual irradiance product. Locus' CTO asked what fraction of the observed performance improvement was due to each of the changes being discussed (i.e., suppose features *f1*, *f2*, and *f3* were added as inputs to the model and the estimation error fell by 10 percent; how much of that would be due to each of *f1*, *f2*, and *f3*?).

Unfortunately, the situation was not quite as simple as might be expected. The team expected that adding the feature would change the estimation error by something like -X percent, but after re-running the numbers the change was actually closer to +X percent. This was both unexpected and worrisome.

Looking a little closer, the team realized that the error with the feature in question looked correct, but that the baseline for comparison was lower than expected. It appeared that re-running the model training was not actually producing consistent test errors. This raised the question of, ‘What does that distribution look like?’
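One way to answer that question is to retrain the same model many times on the same data and collect the resulting errors. The sketch below is a toy illustration only, not the model discussed in this post: it assumes a small polynomial regression trained by stochastic gradient descent on synthetic data, and every name in it (`make_data`, `train`, `sample_errors`) is hypothetical. The only thing that changes between runs is the random seed, which controls the initial weights and the order in which samples are visited.

```python
import random
import statistics


def make_data(n, rng, noise=0.5):
    """Synthetic 1-D regression data: y = 2x + Gaussian noise (assumed toy problem)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def features(x, degree):
    """Polynomial features [1, x, x^2, ..., x^degree]."""
    return [x ** k for k in range(degree + 1)]


def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x, len(weights) - 1)))


def mse(weights, xs, ys):
    """Mean squared error of the model on the given points."""
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, degree, seed, epochs=50, lr=0.05):
    """Plain SGD; the seed sets both the initial weights and the sample order."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(degree + 1)]
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            f = features(xs[i], degree)
            err = sum(w * v for w, v in zip(weights, f)) - ys[i]
            weights = [w - lr * err * v for w, v in zip(weights, f)]
    return weights


def sample_errors(n_runs=20, degree=8):
    """Retrain n_runs times on fixed data; return (train_mse, test_mse) pairs."""
    data_rng = random.Random(0)  # the data is identical for every run
    train_x, train_y = make_data(15, data_rng)
    test_x, test_y = make_data(200, data_rng)
    pairs = []
    for seed in range(n_runs):
        w = train(train_x, train_y, degree, seed)
        pairs.append((mse(w, train_x, train_y), mse(w, test_x, test_y)))
    return pairs


if __name__ == "__main__":
    test_errors = [test for _, test in sample_errors()]
    q1, median, q3 = statistics.quantiles(test_errors, n=4)
    print(f"test MSE: min={min(test_errors):.3f} q1={q1:.3f} "
          f"median={median:.3f} q3={q3:.3f} max={max(test_errors):.3f}")
```

The five numbers printed at the end are the same ones a box-and-whisker plot summarizes, so the list of test errors can be handed to any plotting tool to draw the distribution.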

In this diagram, the box and whisker on the left is the old version of the model, the middle is the model with some of the added features (i.e., *f1* and *f2*), and the right is the full model (i.e., *f1*, *f2*, and *f3*). For the full model, it appears that the error could easily vary by 10-15 percent or more. This is a real problem. But why?

The first explanation that came to mind was local minima: the training step involves an iterative optimization that isn't guaranteed to converge to a global minimum, so different runs could be settling into different local minima. To investigate whether this was going on, the team calculated the test and training errors repeatedly and plotted them in a scatter plot.

It's interesting to notice that the test errors vary much more than the training errors, so the points form a vertically stretched cloud. If local minima were the problem, the plot would instead show a clump of points in the lower left, where the optimization was successful, with outliers up and to the right.
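The same comparison can be made numerically rather than by eye. The following is a minimal sketch, assuming the repeated runs are available as a list of `(training_error, test_error)` pairs; the helper names `relative_spread` and `spread_summary` are hypothetical, and the numbers in the usage example are made up for illustration and are not results from this analysis.

```python
import statistics


def relative_spread(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)


def spread_summary(pairs):
    """Compare run-to-run variation of training errors and test errors.

    pairs: list of (training_error, test_error), one pair per training run.
    A test_to_train ratio well above 1 corresponds to a vertically
    stretched cloud in the scatter plot.
    """
    train = [tr for tr, _ in pairs]
    test = [te for _, te in pairs]
    train_spread = relative_spread(train)
    test_spread = relative_spread(test)
    ratio = test_spread / train_spread if train_spread > 0 else float("inf")
    return {
        "train_spread": train_spread,
        "test_spread": test_spread,
        "test_to_train": ratio,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    runs = [(1.00, 2.1), (1.02, 3.4), (0.99, 2.6), (1.01, 4.0)]
    print(spread_summary(runs))
```

A ratio close to 1 with both spreads large would be more consistent with the local-minima picture, where bad runs are bad on both training and test data.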

At this point, it seemed that poor generalization capability of the model could be an explanation for what was being seen. It was unusual that the possible overfitting occurred inconsistently. Most explanations of overfitting describe it as something that *will* happen if your model is too flexible, not something that *might* happen.

The team decided to temporarily set this mismatch aside and try standard techniques for addressing overfitting. This resulted in the error distribution shown by the rightmost box and whisker in the diagram below.

This is a much improved result, both because the expected error is lower and because the generalization error is much more predictable at this point.

In conclusion, if a training algorithm implicitly uses randomness, test errors can have a significant random component when the model is overly flexible. If the distributions of test or validation errors aren't considered in this scenario, incorrect decisions regarding model flexibility could easily be made. To check whether this issue is affecting a model, train it repeatedly to sample the error distribution in question.
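As a rough illustration of that last point, two model variants can be compared using their sampled error distributions instead of a single run each. The sketch below assumes each variant has already been retrained several times and its test errors collected into a list; the function name `compare_error_samples` and the two summary statistics it reports are choices made for this example, not the procedure the team used.

```python
import statistics


def compare_error_samples(baseline, candidate):
    """Compare two samples of test errors from repeated training runs.

    Returns the relative change in median error (negative means the
    candidate is better) and the fraction of (baseline, candidate) run
    pairs in which the candidate has the lower error.
    """
    base_median = statistics.median(baseline)
    cand_median = statistics.median(candidate)
    wins = sum(1 for b in baseline for c in candidate if c < b)
    return {
        "median_change": (cand_median - base_median) / base_median,
        "candidate_better_fraction": wins / (len(baseline) * len(candidate)),
    }
```

If the candidate wins in only a little over half of the pairs, a single before-and-after comparison could easily come out with either sign, which is the kind of surprise described at the start of this post.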