Can a deep net see a cat? – Piekniewski’s blog

Can #deeplearning recognise a cat?  The answer is not obvious. #AI #vision

In this post I will explore the capabilities of contemporary deep learning models on the vitally important task of detecting a cat. Not an ordinary cat though, but a sketch of an abstract cat. This task matters because success tells us something about whether a visual system has learned generalization and abstraction, at least on par with a 2-year-old. This post is inspired by my former co-worker Peter O’Connor, who tried similar experiments on LeNet several years ago. It is also a continuation of this blog’s highly popular “Just how close are we to solving vision?”, which to date has amassed nearly 15,000 hits. Let’s begin by introducing my menagerie:

I made these sketches myself, based on a photo of a cat. NOTE: Whenever you test a deep net (or any other machine learning model), always use new data. Anything you find on the Internet is either already in the training set or soon will be.

Now it’s time to have some fun. First I took VGG16 trained on ImageNet. ImageNet does not have an explicit category for a general cat, but there are several kinds of cats (category number in parentheses): tabby (281), tiger cat (282), Persian cat (283), Siamese cat (284) and Egyptian cat (285). I’m not picky: I will count it as a success if any of my cats gets classified as any of these classes (I’m not even sure what kind of cat I have drawn here, probably a tabby). Again, I used Keras with pre-trained models downloaded from the web. Let’s see the results:
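The success criterion above can be written down explicitly. What follows is a minimal sketch, not the code used for this post: `cat_verdict` is a hypothetical helper that takes a 1000-way ImageNet probability vector and reports the best-scoring of the five cat classes, and `vgg16_probs` shows one way such a vector might be obtained, assuming a TensorFlow installation that bundles Keras. The helper names and the 224 by 224 input size are assumptions.

```python
# ImageNet class indices for the five cat categories named above.
CAT_CLASSES = {
    281: "tabby",
    282: "tiger cat",
    283: "Persian cat",
    284: "Siamese cat",
    285: "Egyptian cat",
}


def cat_verdict(probs, top_k=5):
    """Summarise how a 1000-way probability vector treats the cat classes.

    Hypothetical helper: reports the best-scoring cat class, its rank
    (1 = top prediction) and whether it falls within the top_k predictions.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    best = max(CAT_CLASSES, key=lambda i: probs[i])
    rank = ranked.index(best) + 1
    return {
        "label": CAT_CLASSES[best],
        "confidence": probs[best],
        "rank": rank,
        "success": rank <= top_k,
    }


def vgg16_probs(image_path):
    """Assumed tensorflow.keras usage; imported lazily so that
    cat_verdict works without TensorFlow installed."""
    import numpy as np
    from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    model = VGG16(weights="imagenet")  # downloads the weights on first use
    img = load_img(image_path, target_size=(224, 224))
    batch = np.expand_dims(img_to_array(img), axis=0)
    return model.predict(preprocess_input(batch))[0].tolist()
```

With a sketch saved to disk, `cat_verdict(vgg16_probs("best_cat.png"))` would return the label, confidence and rank of the strongest cat class; the file name is a placeholder.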

OK, my first sketch, the “abstract cat”, is not recognised (if anything, it’s considered a “fire screen”). This outcome is entirely expected (or unexpected, depending on your knowledge of how deep nets work). It seems that only humans (aged 2 and above) have the ability to connect rough abstract sketches with a particular animal (my 2-year-old daughter had no doubts).

The subsequent sketches aren’t classified all that well either. Ultimately, my “best cat” sketch is classified as “Egyptian cat”, but with extremely low confidence (0.05), and the model is equally confident that it sees a zebra or a triceratops. Oh well, I would not call this a “superhuman vision” capability. Let’s move on.

VGG16, used above, is an older-generation model, and there has been some improvement since, namely through the use of residual networks. In this case I will use ResNet-50, again pre-trained on ImageNet and downloaded using Keras. Same procedure as above:

Alright, again the “abstract cat” is completely misclassified. Same with the more detailed sketches. The “best cat” is quite confidently classified as a dalmatian (0.74) and only with 0.04 confidence as the “Egyptian cat”. Interestingly, the inverted “best cat” is recognised with medium-low confidence (0.28) as a jaguar.
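Here “inverted” means the negative of the drawing: white lines on a black background instead of black lines on white. A minimal sketch of that transformation, assuming an 8-bit grayscale image stored as a nested list of intensities (the function name and the tiny test image are made up for illustration):

```python
def invert_grayscale(image, max_value=255):
    """Return the photographic negative of a grayscale image.

    `image` is a list of rows of integer intensities in [0, max_value];
    a plain nested list stands in for whatever array type the real
    pipeline uses.
    """
    return [[max_value - pixel for pixel in row] for row in image]


# Tiny 3x4 "drawing": a black stroke (0) on white paper (255).
sketch = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255, 255, 255, 255],
]
negative = invert_grayscale(sketch)
```

For real image files, Pillow’s `ImageOps.invert` performs the same operation on image objects.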

One might argue that I’m doing something wrong: my networks are somehow messed up, or I have a bug and my code transposes the picture somewhere, or whatnot. So, as a sanity check, I tried the same experiment with clarifai.com (through their demo site), which is IMHO the best deep-net-based image recognition engine out there. Results are below:

Much like VGG16 and ResNet-50, ClarifAI is unable to correctly classify the abstract cat, nor is it able to classify the rough sketches. It does classify my best cat as a cat (success!), but only as the seventh-ranked keyword returned. And all it takes is to invert the picture (a white cat drawing on a black background) for the ClarifAI engine to get confused again. BTW: since I uploaded these cat images to ClarifAI, they may become part of their training set, so future results from them might differ from those presented here.

So overall, three different deep nets, and none of them is able to deal confidently with any of these sketches. Is this surprising?

Why is it so?

These results are actually not very surprising. Deep nets do not see the world the same way humans do (as I argue in many posts on this blog). To a deep net, a canonical (Egyptian) cat is this, as can be determined by probing the network:

Figure 5. An adversarial example constructed from the category “Egyptian cat” classified with 98% confidence.
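One common way to probe a network for such an example is gradient ascent on the input: start from some image and repeatedly nudge the pixels in the direction that raises the target class probability. The post does not say which procedure produced Figure 5, so the sketch below only illustrates the general idea on a toy linear softmax classifier. The weights, step size, iteration count and function names are all made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    """Class probabilities of a toy linear classifier: softmax(W x)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def ascend_to_class(weights, x, target, step=0.1, iters=300):
    """Gradient ascent on the input to maximise log p(target | x).

    For a linear softmax model the gradient with respect to x is
    W[target] - sum_k p_k * W[k]. Inputs are clipped to [0, 1] to
    mimic valid pixel intensities.
    """
    x = list(x)
    for _ in range(iters):
        p = predict(weights, x)
        for j in range(len(x)):
            expected = sum(p[k] * weights[k][j] for k in range(len(weights)))
            grad = weights[target][j] - expected
            x[j] = min(1.0, max(0.0, x[j] + step * grad))
    return x


# Made-up 3-class, 4-"pixel" model; class 2 plays the role of "Egyptian cat".
W = [
    [2.0, -1.0, 0.5, 0.0],
    [-1.0, 2.0, 0.0, 0.5],
    [0.5, 0.0, 3.0, -2.0],
]
start = [0.5, 0.5, 0.5, 0.5]
probe = ascend_to_class(W, start, target=2)
print(predict(W, start)[2], "->", predict(W, probe)[2])
```

The resulting input is whatever maximises the class score under the model, which need not resemble anything a human would call a cat. With a deep net the gradient would come from backpropagation rather than a closed-form expression.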

Human vision generalises in completely different directions than deep learning models. We are able to recognise objects in rough sketches even without shading – somehow it is often sufficient for us to see a contour. When shading is available, then we see things very confidently.

In contrast, deep nets need textures, and frankly this is the only thing they care about. See the result in the following figure:

A disjoint collection of textures will be more convincing to a deep net than even the best contour (as shown in Figure 6 above). A cat chopped into pieces and displaced is still a 91% cat to a deep net, while even the best sketch never exceeded 5% confidence. This is, in my opinion, a substantial warning sign that our contemporary approach to vision (even though it is celebrating many successes, particularly compared to earlier solutions) is likely fundamentally flawed. My proposed direction for vision is the predictive vision model (PVM). Although I cannot yet show the sort of generalisation required for the above experiment to succeed with PVM, I’m certain that PVM generalises in different directions than feedforward deep nets.
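One simple way to produce a “chopped and displaced” image like the one described above is to cut the picture into a grid of square tiles and permute them. The post does not specify how Figure 6 was made, so this is only a sketch under that assumption; `shuffle_tiles`, the tile size and the seed are hypothetical.

```python
import random


def shuffle_tiles(image, tile, seed=0):
    """Chop an image into tile x tile squares and permute them at random.

    `image` is a list of equal-length rows whose height and width are
    multiples of `tile`. Pixels inside each tile keep their arrangement
    (local texture survives) while the global outline is destroyed.
    """
    height, width = len(image), len(image[0])
    if height % tile or width % tile:
        raise ValueError("image dimensions must be multiples of tile")

    # Top-left corners of every tile, in row-major order.
    corners = [(r, c) for r in range(0, height, tile)
               for c in range(0, width, tile)]
    sources = corners[:]
    random.Random(seed).shuffle(sources)

    out = [[None] * width for _ in range(height)]
    for (dr, dc), (sr, sc) in zip(corners, sources):
        for i in range(tile):
            for j in range(tile):
                out[dr + i][dc + j] = image[sr + i][sc + j]
    return out


# 4x4 image with distinct pixel values, cut into four 2x2 tiles.
image = [[4 * r + c for c in range(4)] for r in range(4)]
scrambled = shuffle_tiles(image, tile=2, seed=1)
```

Feeding both the original photo and its scrambled version to the same classifier, and comparing the cat-class confidence, would reproduce the kind of comparison made here.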

In summary, the thesis of this post is that visual perception is NOT a solved problem and requires more work, likely a fundamental shift. PVM is one step in a new direction, and likely more steps need to be taken.
