Matthew and Jack interned with Novetta’s Machine Learning Center of Excellence during the summer of 2019. This blog series discusses ADSynth, an app that creates a digital architecture diagram from a photo of a whiteboard sketch.
We began with the backbone of ADSynth: the object detection model that makes diagram synthesis possible, as well as the data necessary to train the model.
Implementation of YOLOv3 Architecture
Based on the research we conducted on object detection, we decided to implement the YOLOv3 architecture. YOLOv3 is a deep neural network comprising 106 layers and almost 63 million parameters. The architecture can handle an input stream of up to 30 frames per second, meaning that inference is lightning fast and would not create a bottleneck in our process. After a quick forward pass, the model outputs a text file with a row for each detected object, listing its predicted class and location within the image (as shown in the figure below).
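As a minimal sketch of how that per-row output might be consumed downstream, the snippet below parses a YOLO-style detection file. The exact column order (class, normalized box coordinates, confidence) is an assumption for illustration, not the precise format our pipeline used:

```python
from typing import Dict, List


def parse_detections(text: str) -> List[Dict]:
    """Parse rows of 'class x_center y_center width height confidence'
    into structured detection records (coordinates assumed normalized)."""
    detections = []
    for line in text.strip().splitlines():
        cls, x, y, w, h, conf = line.split()
        detections.append({
            "class": cls,
            "box": (float(x), float(y), float(w), float(h)),
            "confidence": float(conf),
        })
    return detections


# Hypothetical example rows for two detected diagram components
sample = """lambda 0.42 0.31 0.10 0.12 0.97
s3 0.70 0.55 0.08 0.09 0.91"""
dets = parse_detections(sample)
```

Each record can then be handed to the synthesis step, which replaces the hand-drawn symbol with its digital counterpart at the detected location.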
In our prior machine learning projects, we had access to significant amounts of data, on the order of hundreds of thousands of observations. There is no publicly available dataset containing labeled images of hand-drawn AWS architecture diagrams, so we had to create our own training data.
Creating training data by hand forced us to weigh some important considerations. Our primary focus was to ensure that our training data exhibited as much variance as possible. Without sufficient variance, the model would overfit and fail to approximate the true distribution of architecture diagram drawings. In other words, if all training samples were created using the exact same handwriting and style, the model might only be capable of accurate predictions on drawings with that exact handwriting. Likewise, if our data did not include any shadows or glare on the whiteboard, the model would be ineffective in scenarios with imperfect lighting.
To combat this, we included different handwriting samples, lighting conditions, and diagram layouts in the training data. We eventually created 530 images across the following ten distinct classes:
- Elastic MapReduce
- Relational Database Service
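Some of the lighting variance described above can also be simulated programmatically rather than photographed. The sketch below, an assumption about one simple augmentation rather than our exact pipeline, applies a random brightness shift to a grayscale image represented as rows of 0-255 intensity values:

```python
import random


def jitter_brightness(image, max_delta=40, seed=None):
    """Return a copy of a grayscale image (list of rows of 0-255 ints)
    with a uniform random brightness shift applied to every pixel,
    clamped back into the valid [0, 255] range."""
    rng = random.Random(seed)
    delta = rng.randint(-max_delta, max_delta)
    return [[max(0, min(255, px + delta)) for px in row] for row in image]


# Tiny 2x2 stand-in for a whiteboard photo
img = [[100, 150], [200, 250]]
augmented = jitter_brightness(img, max_delta=40, seed=0)
```

Generating several such variants per drawing stretches a small hand-made dataset further without additional labeling effort.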
However, even after introducing variance into the training data, 530 images were still far too few for the model to learn over 60 million parameters from scratch.
Rather than spending months creating thousands of drawings, we used transfer learning on YOLOv3. We obtained pre-trained weights for the model, trained on the entirety of the COCO dataset. Even though the classes learned by this pre-trained model consisted of people, buses, airplanes, and the like, it provided a much better parameter initialization for learning architecture diagrams than random initialization.
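The core idea of that initialization step can be sketched as follows: copy pre-trained parameters wherever a layer's shape matches, and reinitialize only the layers that differ (typically the final detection head, which must be rebuilt for the new class count). This is a toy illustration using flat weight lists, not the actual Darknet weight-loading code:

```python
def init_from_pretrained(model_shapes, pretrained):
    """Build a parameter dict for a new model.

    Layers whose name and size match the pretrained checkpoint inherit
    its weights; mismatched layers (e.g. a detection head resized for
    ten new classes) fall back to zeros as a stand-in for random init.
    """
    params = {}
    for name, size in model_shapes.items():
        src = pretrained.get(name)
        if src is not None and len(src) == size:
            params[name] = list(src)      # transfer pretrained weights
        else:
            params[name] = [0.0] * size   # reinitialize this layer
    return params


# Hypothetical shapes: backbone layer matches, head was resized
model_shapes = {"conv1": 3, "head": 2}
pretrained = {"conv1": [1.0, 2.0, 3.0], "head": [9.0, 9.0, 9.0]}
params = init_from_pretrained(model_shapes, pretrained)
```

Fine-tuning then starts from these transferred weights, so the backbone's learned edge and shape detectors carry over even though the output classes are entirely new.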
Our highly varied training data had the side effect of making the model's objective function extremely noisy, causing training to converge to a poor local minimum. Attempting to rectify this by tuning the regularization parameters, such as weight decay, made little difference. Through experimentation, we found that increasing the batch size had the most profound effect and resulted in the best performance on our test set, as shown in the figure below, where we achieved an F1 score of 0.962.
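For reference, the F1 score we report is the harmonic mean of precision and recall over the test-set detections. The detection counts below are hypothetical, chosen only to show the arithmetic:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true-positive, false-positive, and
    false-negative detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts where precision = recall = 0.962
score = f1_score(962, 38, 38)
```

Because F1 penalizes both missed symbols (false negatives) and spurious detections (false positives), it is a stricter summary than accuracy for an object detector.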
Using the same training approach, the model can easily be redeployed to solve other object detection problems, even with a small training set. The underlying object detection technology we used is extraordinarily powerful, and we have yet to encounter a category of object it could not be trained to detect.
Read the series!