When developing an AI system, the chances are that you’ll end up in one of two cases: either your model performs well on the test set yet underperforms in the real world, or it fails on the test set itself (and only slightly less so in the real world).
Both cases, despite looking very different, are caused by the same issue, but can be fixed in different ways! Let’s have a look at the Venn diagram below - our train set and test set sit inside the much bigger data generator that is the open world. Of course, there are all kinds of permutations of how these three sets (Train, Test and World) can overlap, but since we can draw similar conclusions from any of them, let’s stick with the example in the image.
Case 1 and Case 2 are also shown as green circles. In Case 1 the test set is not big enough to cover the observations we see in the real world, so even though our model performs well on the test set it underperforms in the real world. In Case 2 the real-world data is much closer to the train set than to the test set, so the model fails on the test set and only slightly less so in the real world (still failing, due to overfitting). In other words, the model cannot generalise well beyond the test set in Case 1 or the train set in Case 2.
This brings us to our own situation: inspired by the recent chess cheating scandals, we rolled up our sleeves and created our own image-to-chess-position detector. For our dataset we chose the Roboflow Chess dataset, and we trained a Faster R-CNN model on it.
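For readers who want to follow along, here is a minimal sketch of that setup. It assumes a COCO-style export of the Roboflow data yielding the standard torchvision detection format, and 12 piece classes plus background; the hyperparameters are illustrative, not our exact configuration.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_model(num_classes: int = 13):  # 12 piece classes + background (assumption)
    # Start from a COCO-pretrained Faster R-CNN and swap the box head so it
    # predicts chess-piece classes instead of the 91 COCO classes.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

def train_one_epoch(model, train_loader, optimizer, device="cuda"):
    # train_loader is assumed to yield (images, targets): a list of CHW float
    # tensors and a list of dicts with "boxes" (xyxy) and "labels".
    model.train()
    model.to(device)
    for images, targets in train_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)   # dict of RPN and box-head losses
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model = build_model()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=5e-4)
```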
What makes the task non-obvious is that the dataset contains a variety of similar-looking classes, which are usually clustered together and differ by only a few pixels. The goal is to detect all of the pieces on the board so that we can later locate their true positions and evaluate every move.
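To make the “locate their true positions” part concrete, here is a hedged sketch of how a detection could later be mapped to a board square once the four board corners are known; the corner ordering and the helper itself are illustrative assumptions, not part of the dataset or of the detector.

```python
import numpy as np
import cv2

def box_to_square(box_xyxy, board_corners_px):
    """Map a detected piece to a square name like 'e4'.

    box_xyxy:         (x1, y1, x2, y2) of the detected piece.
    board_corners_px: four image-space corners of the board, ordered
                      a8, h8, h1, a1 (an assumption for this sketch).
    """
    # Homography from image space onto an 8x8 unit board.
    dst = np.float32([[0, 0], [8, 0], [8, 8], [0, 8]])
    H, _ = cv2.findHomography(np.float32(board_corners_px), dst)

    # Use the bottom-centre of the box: a piece stands on its square.
    x1, y1, x2, y2 = box_xyxy
    base = np.float32([[[(x1 + x2) / 2.0, y2]]])
    bx, by = cv2.perspectiveTransform(base, H)[0, 0]

    file_idx = min(max(int(bx), 0), 7)
    rank_idx = min(max(int(by), 0), 7)
    return "abcdefgh"[file_idx] + str(8 - rank_idx)
```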
Using the dataset, we trained the model to a mAP of 0.960 on the training set, detecting every single piece correctly; on the test set we achieved a mAP of 0.787 and misclassified just 2% of the pieces (if only there were an easy way to examine those).
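The mAP figures above can be reproduced with a standard COCO-style metric; a sketch using torchmetrics could look like this (the loader and device names are assumptions).

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

@torch.no_grad()
def evaluate_map(model, data_loader, device="cuda"):
    # data_loader is assumed to yield targets already on the CPU, as lists of
    # dicts with "boxes" (xyxy) and "labels".
    model.eval().to(device)
    metric = MeanAveragePrecision()
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        preds = model(images)  # list of dicts with "boxes", "scores", "labels"
        preds = [{k: v.cpu() for k, v in p.items()} for p in preds]
        metric.update(preds, targets)
    return float(metric.compute()["map"])
```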
This should result in an amazing model, but when we tested it in the real world - even after matching the scale and viewing angle to mimic the dataset - it performed extremely badly!
Could we have anticipated that the model would not generalise to the real world before we deployed it?
Given the model’s poor performance in the real world, we can see that our data has not been diverse enough to capture a wide enough part of the “world” data.
One course of action is to take the data that our model failed on in the real-world test, annotate it and mix it with the rest of the data. Then we re-sample the train/test/validation sets, making sure we pull samples from both sources, and reiterate the experiment.
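A bare-bones sketch of that re-sampling step is below; the split ratios, the random seed and the sample structure are assumptions for illustration.

```python
import random

def resample_splits(dataset_samples, real_world_samples, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Pool the original data with newly annotated real-world failures and draw
    fresh train/val/test splits so every split contains both sources."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    # Split each source separately so both appear in every split.
    for items in (list(dataset_samples), list(real_world_samples)):
        rng.shuffle(items)
        n_train = int(ratios[0] * len(items))
        n_val = int(ratios[1] * len(items))
        splits["train"] += items[:n_train]
        splits["val"] += items[n_train:n_train + n_val]
        splits["test"] += items[n_train + n_val:]
    return splits
```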
An alternative strategy is to use an ML testing platform like Efemarai and close the feedback loop much earlier - before deploying to the real world - avoiding the possibility of end users interacting with an untested ML system.
The specifications we want to make sure our ML model works under are: (1) a wider viewing angle around what we have already collected, (2) changes to lighting conditions, (3) camera softness and “smart” features, (4) compression artifacts for when we view a streamed game. Let’s build that up as the Full Test Domain.
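To make the specification concrete, here is what it could look like written down as image transformations. This uses albumentations (1.x parameter names) purely as a stand-in, not the Efemarai platform’s actual API, and the parameter ranges are illustrative assumptions.

```python
import albumentations as A

full_test_domain = A.Compose(
    [
        # (1) wider viewing angles around the ones we have already collected
        A.Perspective(scale=(0.02, 0.08), p=0.7),
        A.Rotate(limit=10, p=0.5),
        # (2) changes to lighting conditions
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.7),
        # (3) camera softness and "smart" sharpening features
        A.OneOf([A.GaussianBlur(blur_limit=(3, 7)), A.Sharpen()], p=0.5),
        # (4) compression artifacts from streamed games
        A.ImageCompression(quality_lower=30, quality_upper=90, p=0.5),
    ],
    # Keep the bounding boxes in sync with the transformed image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```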
Importing these specifications into the platform gives us the distribution of failure scores shown in the diagram. While the majority of the datapoints in the test dataset (light blue) have very low failure scores (a proxy for how much the model is underperforming), performing a stress test with the Efemarai platform results in much more varied scores (dark blue). Nearly every sample touched can be altered to make the model misbehave. This replicates our observations from the real world much more closely - some images perform okay-ish, and others are totally off.
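A naive stand-in for such a stress test is to sample a number of points from the domain sketched above for each test image and keep the worst case; the per-image failure score below (1 minus mAP) is only a rough proxy for whatever the platform computes.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

@torch.no_grad()
def failure_score(model, image_np, boxes, labels, domain, n_samples=25, device="cuda"):
    """Worst-case (1 - mAP) over random samples from the test domain.

    image_np: HWC uint8 image; boxes: list of [x1, y1, x2, y2]; labels: list of ints.
    """
    model.eval().to(device)
    worst = 0.0
    for _ in range(n_samples):
        aug = domain(image=image_np, bboxes=boxes, labels=labels)
        if not aug["bboxes"]:          # skip samples where all boxes were cropped away
            continue
        img = torch.from_numpy(aug["image"]).permute(2, 0, 1).float().to(device) / 255.0
        pred = model([img])[0]
        metric = MeanAveragePrecision()
        metric.update(
            [{k: v.cpu() for k, v in pred.items()}],
            [{"boxes": torch.tensor(aug["bboxes"], dtype=torch.float32),
              "labels": torch.tensor(aug["labels"], dtype=torch.int64)}],
        )
        worst = max(worst, 1.0 - float(metric.compute()["map"]))
    return worst
```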
Over the Full Test Domain the mAP shifts away from the dataset-only test, dropping to 0.618, with up to 5% of pieces assigned the wrong class.
Let’s look at the latter first. Inspecting the top failure modes highlighted a number of particular assets and images from the dataset. With them ordered, it was easy to observe that some of the annotations were wrong: (a) king and queen labels had been flipped on a few occasions (even in the test set), and (b) some black and white pieces had their colours swapped. The tightness of the bounding boxes was not considered an issue at this stage.
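A lightweight check in the same spirit is to flag images whose class counts are impossible (or at least very unlikely) in a real game - two white kings usually mean a flipped king/queen or colour label. The class names below are assumptions about the dataset’s label map.

```python
from collections import Counter

# Maximum expected count per class in a legal position; promotions can exceed
# some of these, which is exactly why such images deserve a manual look.
MAX_EXPECTED = {
    "white-king": 1, "black-king": 1,
    "white-queen": 1, "black-queen": 1,
    "white-rook": 2, "black-rook": 2,
    "white-bishop": 2, "black-bishop": 2,
    "white-knight": 2, "black-knight": 2,
    "white-pawn": 8, "black-pawn": 8,
}

def suspicious_annotations(annotations):
    """annotations: dict of image_id -> list of class-name strings."""
    flagged = {}
    for image_id, class_names in annotations.items():
        counts = Counter(class_names)
        excess = {c: n for c, n in counts.items() if n > MAX_EXPECTED.get(c, 8)}
        if excess:
            flagged[image_id] = excess
    return flagged
```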
In order to expand the dataset and allow the model to generalise better within the domain, we can perform this scenario building offline and collect stress-test data through the platform, without needing to deploy the system. We can easily increase the dataset size by an order of magnitude without annotating additional data, knowing that it expands the capabilities of the model.
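As a generic illustration (not the platform’s own scenario export), the offline expansion can be as simple as generating several augmented copies per image from the same domain - the transformed boxes come along for free, so no extra annotation work is needed.

```python
def expand_dataset(samples, domain, copies_per_image=10):
    """samples: iterable of (image_np, boxes, labels) tuples.
    Returns the originals plus `copies_per_image` augmented variants of each,
    growing the dataset roughly an order of magnitude."""
    originals = list(samples)
    expanded = list(originals)
    for image_np, boxes, labels in originals:
        for _ in range(copies_per_image):
            aug = domain(image=image_np, bboxes=boxes, labels=labels)
            if aug["bboxes"]:  # keep only variants where boxes survived the transform
                expanded.append((aug["image"], aug["bboxes"], aug["labels"]))
    return expanded

# Example: expanded = expand_dataset(train_samples, full_test_domain)
```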
Retraining the model gives us a mAP of 0.756, with piece misclassifications down to 3%.
And there we go! Everything looks much better.
Experiment with the chess dataset by registering at https://ci.efemarai.com.