In this image, the two figures can be classified as Derek Zhao and Loc Tran. |
Traditionally, machine learning consists of ingesting a large body of data and recognizing the patterns that exist within it. From these patterns, predictions can be made on new data. An example is spam classification in emails. Given a large body of emails that have already been labeled as spam or not spam, a machine learning algorithm will learn patterns and key words that indicate spam or not spam. Once trained, the algorithm can be used to accurately predict whether a new, unlabeled email is spam.
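As a toy illustration of this idea (not the authors' code, and far simpler than a real spam filter), a classifier can learn which words appear more often in labeled spam than in labeled ham, then score new emails by those counts:

```python
from collections import Counter

# Tiny hypothetical training set of (email text, label) pairs.
labeled = [
    ("win a free prize now", 1),   # 1 = spam
    ("meeting moved to 3pm", 0),   # 0 = not spam
    ("free money click now", 1),
    ("lunch tomorrow maybe", 0),
]

# "Training": count how often each word appears in spam vs. ham.
spam_counts, ham_counts = Counter(), Counter()
for text, is_spam in labeled:
    (spam_counts if is_spam else ham_counts).update(text.split())

def spam_score(text):
    """Score a new email: spam-word hits minus ham-word hits."""
    words = text.split()
    return sum(spam_counts[w] for w in words) - sum(ham_counts[w] for w in words)

print(spam_score("claim your free prize"))  # positive score suggests spam
```

Real systems use probabilistic or neural models rather than raw counts, but the pattern-from-labels idea is the same.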
In the context of image classification, this involves using a learning algorithm to identify, for example, whether an image is of a person.
Computer vision tasks typically generate numerical or symbolic information that exists in the form of a decision. These symbols are a means of sorting and classifying real-world data. One of the most common algorithms used for computer vision tasks and image classification is the neural network, though it comes with a significant drawback.
"They're pretty black box-y," says Derek. "There are a series of computations that work well, but there are so many computations going on under the hood, you end up having a lot of trouble developing an intuition for what is actually happening and determining what the network is actually learning. There are hundreds of thousands of pixel values and activations being passed through and transformed."
Loc is Derek's mentor at the Autonomy Incubator during his internship. |
Derek and Loc are trying to peek under that hood of neural networks using an algorithm called the variational autoencoder (VAE).
In image processing, an image is defined by all of its pixel values, which are unspooled into a tall, skinny vector. This vector is fed into a neural network and gradually compressed until it produces a code indicating whether that image is of a person or some other entity. For example, 0 could mean "not person" and 1 could mean "person."
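The unspooling step can be sketched in a few lines. This is a minimal illustration with a random stand-in image, assuming a 64x64 grayscale input (the shapes are hypothetical, chosen to match the roughly 4,000-pixel vectors mentioned below):

```python
import numpy as np

# Hypothetical 64x64 grayscale image; a real one would come from a photo.
image = np.random.rand(64, 64)

# "Unspool" the 2-D pixel grid into one tall, skinny vector.
x = image.reshape(-1)   # shape: (4096,)
print(x.shape)

# A classifier network would then compress x down to a single code:
# 1 for "person", 0 for "not person".
```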
"You normally just take a big vector and keep compressing it down until you get a code out.
An autoencoder, however, is different, in that you take this big, long vector of around 4,000 pixels, and compress it down into an intermediate form, which is typically about thirty-two numbers. Then, you try to reconstruct the image that you put in by recreating the same pixels.
At this point, it isn't image classification anymore, it's image reconstruction, and it already has a lot of applications in just compressing and decompressing images!"
Derek described his coding display as being "kind of like a bowtie architecture." |
An autoencoder consists of two components: the first is called an encoder network, which compresses an input into a latent representation. The second component, the decoder network, decompresses the latent representation and attempts to reconstruct the original image.
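The bowtie shape Derek describes can be sketched with plain matrix multiplies. This is a hedged, untrained stand-in (random weights, no learning loop) just to show the 4096 → 32 → 4096 flow of data; a real model learns the weights by minimizing reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in weights for the bowtie: 4096 pixels -> 32 latents -> 4096 pixels.
W_enc = rng.normal(size=(32, 4096)) * 0.01
W_dec = rng.normal(size=(4096, 32)) * 0.01

def encoder(x):
    # Compress the flattened image into a 32-number latent representation.
    return np.tanh(W_enc @ x)

def decoder(z):
    # Attempt to reconstruct the original 4096 pixel values.
    return W_dec @ z

x = rng.random(4096)   # flattened input image
z = encoder(x)         # latent representation, shape (32,)
x_hat = decoder(z)     # reconstruction, shape (4096,)
print(z.shape, x_hat.shape)
```

Training would adjust `W_enc` and `W_dec` so that `x_hat` matches `x` as closely as possible.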
In Loc and Derek's current model, this latent representation consists of thirty-two numbers, and one of the most commonly asked questions is simply: what do these thirty-two numbers mean?
"It turns out that, under certain conditions, if you fix the latent representation for an image and take just one of these components and move it around, the reconstructions that you get back out change only on one facet of the image, like hair or chin width," as you can see in the images below.
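That traversal procedure is simple to express in code. The sketch below uses a random stand-in decoder and an arbitrarily chosen component index, so it only demonstrates the mechanics of fixing a latent vector and sweeping one component; in a trained, disentangled model each decoded frame would change along a single facet of the face:

```python
import numpy as np

rng = np.random.default_rng(1)
W_dec = rng.normal(size=(4096, 32)) * 0.01   # stand-in decoder weights

def decode(z):
    return W_dec @ z

z = rng.normal(size=32)   # fixed latent representation for one image

# Sweep one latent component while holding the other 31 fixed.
for value in np.linspace(-3, 3, 5):
    z_swept = z.copy()
    z_swept[7] = value        # component 7 is an arbitrary choice here
    frame = decode(z_swept)   # one reconstructed frame per swept value
```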
For his research with faces, Derek has been using celebrities as examples. |
Steve Carell |
David Schwimmer |
In this gif, the hair changes, as well as facial rotation. |
Unfortunately, the latent representations of traditional autoencoders don't mean all that much. Changing one component results in reconstructions that morph wildly from one face to another. By modifying the architecture slightly into a VAE instead, they found that they could disentangle some latent dimensions from the others, so that changing one component changes only one aspect of the image instead of many, as the plain autoencoder would have done.
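The "slight modification" is that a VAE's encoder outputs a mean and log-variance for each latent dimension, and training adds a KL-divergence penalty pulling each dimension toward a standard normal, which is what encourages the latent space to behave. The KL term has a closed form for diagonal Gaussians; this sketch computes it for hypothetical encoder outputs:

```python
import numpy as np

# Example encoder outputs for a 3-dimensional latent space
# (hypothetical numbers, just to exercise the formula).
mu = np.array([0.5, -0.2, 0.1])       # per-dimension means
logvar = np.array([-0.1, 0.3, 0.0])   # per-dimension log-variances

# KL divergence from N(mu, sigma^2) to the standard normal N(0, 1).
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))
print(kl)
```

The total VAE training loss is this KL term added to the usual reconstruction error.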
Their research examines many cutting-edge variations of the autoencoder. There are several variants, including the VAE and the beta-VAE, but their research has recently focused on the TC-VAE for faces, such as those in the CelebA dataset.
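The variants differ mainly in how they weight the VAE's regularizer. A hedged sketch, with placeholder numbers standing in for losses a real model would compute:

```python
# Placeholder values standing in for actual computed losses.
recon_loss, kl = 120.0, 8.0

beta = 4.0                             # beta-VAE hyperparameter (assumed value)
loss_vae = recon_loss + kl             # standard VAE objective
loss_beta = recon_loss + beta * kl     # beta-VAE: up-weights the whole KL term
print(loss_vae, loss_beta)
```

The TC-VAE goes one step further: it decomposes the KL term and up-weights only its total-correlation part, which is the piece that penalizes statistical dependence between latent dimensions.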
Each of these grids of faces shows results from the TC-VAE. |
Have you ever wondered what you'd look like with a bowl cut? |
The TC-VAE can even change the gender of the person in the photograph. |
Their immediate goal is to build an app that performs live reconstructions on a webcam stream and offers sliders for manually exploring the latent space. It offers an easy way to gain a better intuition about what exactly the TC-VAE is learning.
As Derek told me, in the end, "if you have a neural network that can compress your images into latent dimensions that are human-interpretable, then you suddenly have a means of telling people what exactly is in an image, or even why a more classical neural network is making the decisions that it is."
Beta-VAE - https://openreview.net/forum?id=Sy2fzU9gl
Total Correlation VAE - https://arxiv.org/abs/1802.04942