- A key lesson is that the way these SVMs are structured can significantly affect how much training data is needed. For example, a naive approach would have been as follows:
This approach requires two new SVMs to be trained for every additional sub-category. For example, adding a new ‘Swimwear’ class would require an additional SVM under both Men’s and Women’s, not to mention the potential complexity of adding a ‘Unisex’ class at the top level.
- We were able to avoid a great deal of labelling and training work by flattening our data structures into many sub-trees, like so:
By decoupling our classification structure from the final hierarchy, it is possible to generate the final classification by running each document through the SVM hierarchy and interrogating the results with simple set-based logic such as:
Mens Slim-fit jeans = (Mens and Jeans and Slim Fit) and not Womens
This approach vastly reduces the number of SVMs required to classify documents, as the resultant sets can be intersected to represent the final classification.
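As a rough illustration of the set-based logic above, here is a minimal Python sketch. It assumes each flat SVM has already been run over the documents and has produced the set of document ids it accepted. The `predictions` data and the `classify` helper are hypothetical and exist only for this example.

```python
# Hypothetical output of four independent binary SVMs:
# each label maps to the set of document ids that SVM accepted.
predictions = {
    "Mens": {1, 2, 3, 5},
    "Womens": {4, 5, 6},
    "Jeans": {1, 2, 4, 5},
    "Slim Fit": {1, 4, 5},
}


def classify(predictions, include, exclude=()):
    """Return ids accepted by every `include` label and by no `exclude` label."""
    result = set.intersection(*(predictions[label] for label in include))
    for label in exclude:
        result -= predictions[label]
    return result


# Mens Slim-fit jeans = (Mens and Jeans and Slim Fit) and not Womens
mens_slim_fit_jeans = classify(
    predictions, ["Mens", "Jeans", "Slim Fit"], exclude=["Womens"]
)
print(mens_slim_fit_jeans)
```

Document 5 is accepted by the Mens, Jeans and Slim Fit classifiers, but it is also accepted by Womens, so the `not Womens` clause removes it from the final category.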
- For example, adding a top-level ‘Children’s’ class would immediately allow the creation of an entire dimension of new Children’s categories (children’s jeans, shirts, underwear, etc.) with minimal additional training data (only one additional SVM).
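The saving can be made concrete with a little arithmetic. The sketch below assumes one SVM per node in the nested layout and one SVM per label in the flat layout. The category names and the two counting functions are illustrative only, not taken from the original system.

```python
from itertools import product


def nested_svm_count(top_level, sub_categories):
    # Assumption: one SVM per top-level class, plus one SVM for
    # every sub-category repeated under each top-level class.
    return len(top_level) + len(top_level) * len(sub_categories)


def flat_svm_count(top_level, sub_categories):
    # Assumption: one SVM per label, wherever it ends up in the final hierarchy.
    return len(top_level) + len(sub_categories)


top_level = ["Mens", "Womens"]
sub_categories = ["Jeans", "Shirts", "Underwear"]

before = (
    nested_svm_count(top_level, sub_categories),
    flat_svm_count(top_level, sub_categories),
)

# Add a single new top-level class.
top_level.append("Childrens")

after = (
    nested_svm_count(top_level, sub_categories),
    flat_svm_count(top_level, sub_categories),
)

# Final categories that become expressible by intersecting the new label
# with the existing sub-category labels.
new_categories = list(product(["Childrens"], sub_categories))

print("nested:", before[0], "->", after[0])
print("flat:  ", before[1], "->", after[1])
print(new_categories)
```

Under these assumptions the nested layout grows by four SVMs while the flat layout grows by one, and three new final categories become available.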
Because of the structure we chose, one key insight we were able to leverage was that training data can be re-used via linked data relationships.
- For example, given some basic domain knowledge of the categories, we know for certain that ‘Washing machines’ can never be ‘Carpet cleaners’.
By adding the ability to link ‘Exclude data’, we can heavily bolster the number of ‘Negative’ training examples for the ‘Washing machines’ SVM by adding to it the ‘Positive’ training data from the ‘Carpet cleaners’ SVM.
- This approach has a nice upside: whenever the need arises to add training data to improve the ‘Carpet cleaners’ SVM, it also improves the ‘Washing machines’ class as a side effect, via the linked negative data.
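The linked ‘Exclude data’ idea can be sketched in a few lines of Python. This is a minimal illustration under assumed data structures: the `positives`, `negatives` and `excludes` dictionaries, the sample texts and the `training_set` helper are all hypothetical, and the actual SVM training step is left out.

```python
# Hypothetical labelled examples per class.
positives = {
    "Washing machines": ["front loading washer 8kg", "top loader washing machine"],
    "Carpet cleaners": ["upright carpet shampooer", "carpet steam cleaner"],
}
negatives = {
    "Washing machines": ["garden hose reel"],
    "Carpet cleaners": [],
}

# Exclude links encode domain knowledge: a washing machine is never a carpet cleaner.
excludes = {
    "Washing machines": ["Carpet cleaners"],
    "Carpet cleaners": ["Washing machines"],
}


def training_set(label):
    """Return (positive, negative) examples for one class's SVM.

    Negatives are the class's own negatives plus the positives of
    every class it is linked to via an exclude relationship.
    """
    pos = list(positives[label])
    neg = list(negatives.get(label, []))
    for other in excludes.get(label, []):
        neg.extend(positives[other])
    return pos, neg


pos_before, neg_before = training_set("Washing machines")

# Labelling one more carpet cleaner also enlarges the washing machine negatives.
positives["Carpet cleaners"].append("handheld carpet spot cleaner")
pos_after, neg_after = training_set("Washing machines")

print(len(neg_before), "->", len(neg_after))
```

Because the negatives are assembled at training time rather than copied, any new ‘Carpet cleaners’ example is picked up the next time the ‘Washing machines’ SVM is retrained.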
In many cases, getting enough well-labelled training data is a huge hurdle when developing accurate prediction systems. Here is an innovative approach which uses SVMs to get the most from training data.