Is it possible to teach artificial intelligence to joke?

Machines have recently won a number of convincing victories over people: they already play Go, chess, and even Dota 2 better than we do. Algorithms compose music and write poetry. Scientists and entrepreneurs around the world predict a future in which artificial intelligence will greatly surpass humans. In a few decades, we will most likely live in a world where robots not only drive cars and work in factories, but also entertain us. Humor is an important part of our lives, and it is commonly believed that only a person can come up with jokes. Despite this, many scientists, engineers, and even ordinary people are wondering: is it possible to teach a computer to joke?

Gentleminds, a developer of machine learning and computer vision systems, together with FunCorp tried to create a generator of funny picture captions using the iFunny meme database. Since the application is in English and is used primarily in the United States, the captions are in English.

Unlike composing music, which follows the laws of harmony, the nature of what makes us laugh is very difficult to describe. Sometimes we can hardly explain ourselves what made us laugh. Many researchers believe that a sense of humor is one of the last frontiers artificial intelligence must cross in order to come as close to a person as possible. Research shows that the sense of humor developed in humans over a long period under the influence of sexual selection. This can be explained by the positive correlation between intelligence and sense of humor. Even now, we treat humor as a good marker of human intelligence. The ability to joke involves such complex elements as a strong command of language and a broad outlook. Indeed, language proficiency is important for some types of humor (for example, British humor), which rely heavily on puns. In general, teaching an algorithm to joke is not an easy task.

Researchers from all over the world have tried to teach computers to make jokes. For example, Janelle Shane created a neural network that writes knock-knock jokes (“Knock knock! Who's there?”). The network was trained on a data set of 200 knock-knock jokes. On the one hand, this is a fairly simple task for AI, since all the jokes in the set have the same structure. On the other hand, the neural network simply finds associations between words in a small set of input data and attaches no meaning to those words. The result is jokes cut from the same template, most of which can hardly be called funny.

Researchers from the University of Edinburgh, in turn, presented a successful method of teaching a computer to produce jokes of the form “I like my X like I like my Y, Z”. The main contribution of this work is the first fully unsupervised humor generation system. The resulting model significantly outperforms the baseline and generates jokes that people rate as funny in 16% of cases. The authors use only a large amount of unannotated data, which suggests that generating a joke does not always require deep semantic understanding.

Scientists from the University of Washington created a system that can make risqué jokes following the “that's what she said” (TWSS) template. “That's what she said” is a well-known family of jokes that became popular again thanks to The Office. The TWSS problem has two distinctive characteristics: first, the use of nouns that are euphemisms for sexually explicit nouns, and second, ambiguity. To solve it, the authors used a method called Double Entendre via Noun Transfer (DEviaNT). As a result, in 72% of cases the DEviaNT system knew when to say “that's what she said”, an excellent achievement for a natural language program of this type.

The authors of another paper present a neural-network model for generating jokes. The model can generate a short joke on a predetermined topic. It uses an encoder to represent the topic supplied by the user and an RNN decoder to generate the joke. The model was trained on short jokes by Conan O'Brien with the help of a part-of-speech (POS) tagger. Quality was rated by five English speakers. On average, this model outperforms the probabilistic model trained to write jokes of a fixed structure (the University of Edinburgh approach described above).

Microsoft researchers also tried to teach a computer to joke. Using The New Yorker magazine's cartoon caption contest as training data, they developed an algorithm for selecting the funniest of the thousands of captions submitted by readers.

As all the examples above show, teaching a machine to joke is not an easy task. Moreover, the task has no universal quality metric, since different people can perceive the same joke in different ways. The very wording “come up with a funny joke” is not specific either.

In our experiment, we decided to make the task a little easier by adding context: an image. The system had to come up with a funny caption for it. On the other hand, this also made the task a little harder, because one more feature space was added and the algorithm had to learn to relate the text to the picture.
The task of creating a funny caption for a picture can be reduced either to choosing a suitable caption from an existing database or to generating a new caption by some method. In this experiment, we tried both approaches.

We relied on the database provided by iFunny. It contained 17,000 memes, each of which we split into two components: a picture and a caption. We used only memes in which the text was placed strictly above the picture.

We tried two approaches:

  • caption generation (in one case with a Markov chain, in the other with a recurrent neural network);
  • selection of the caption most suitable for the image from the database, based on the visual component. In the first variant, we searched for a caption for the image inside clusters built from the memes. In the second, based on Word2VisualVec and called Membedding in this article, we tried to map images and text into a single vector space in which a relevant caption would lie close to the image.

These approaches are described in more detail below.

Base analysis

Any research in machine learning starts with data analysis. First of all, we wanted to understand what kinds of images the database contains. Using a pre-trained classification network, we obtained for each image a vector of scores for each category and clustered the images based on these vectors. Five large groups emerged:

  1. People.
  2. Food.
  3. Animals.
  4. Cars.
  5. Animation.

The clustering results were later used to build the baseline solution.
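
The clustering step described above can be illustrated with a minimal k-means sketch in plain Python. It is not the authors' code: the function names, the fixed number of iterations, and the tiny two-dimensional score vectors are assumptions made for illustration, whereas the real vectors hold one score per category of the classification network.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Toy category-score vectors: two "food-like" and two "animal-like" images.
scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centroids, labels = kmeans(scores, k=2)
print(labels)
```

In practice a library implementation of k-means would normally be used; the loop is spelled out here only to show what is being computed.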

To assess the quality of the experiments, we manually collected a test set of 50 images covering the main categories. Quality was judged by an “expert” panel, which decided whether each result was funny or not.

Cluster Search

The approach was to determine the cluster closest to the picture and then choose a caption from that cluster at random. The image descriptor was computed with the classification neural network. We used the 5 clusters identified earlier with the k-means algorithm: people, food, animals, animation, cars.
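
A minimal sketch of this lookup, assuming the cluster centroids and the captions grouped by cluster are already available. The names `nearest_cluster`, `pick_caption`, and the toy data are hypothetical and only illustrate the idea.

```python
import random


def nearest_cluster(descriptor, centroids):
    """Name of the cluster whose centroid is closest to the image descriptor."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: squared_distance(descriptor, centroids[name]))


def pick_caption(descriptor, centroids, captions_by_cluster, rng=random):
    """Find the closest cluster and draw one of its captions at random."""
    cluster = nearest_cluster(descriptor, centroids)
    return cluster, rng.choice(captions_by_cluster[cluster])


# Toy data (assumed): two clusters with two-dimensional centroids.
centroids = {"food": [1.0, 0.0], "animals": [0.0, 1.0]}
captions_by_cluster = {
    "food": ["Me buying food vs me buying textbooks"],
    "animals": ["This cat looks just like Kylo Ren from Star Wars"],
}
print(pick_caption([0.9, 0.1], centroids, captions_by_cluster))
```

Because the caption is drawn uniformly from the whole cluster, its quality depends entirely on how homogeneous the cluster is.
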
Examples of the results are given below. Since the clusters were quite large and their content could still vary greatly in meaning, the ratio of captions that fit the picture to those that did not was about 1 to 5. It might seem that this was because there were only 5 clusters, but in fact, even when the cluster is determined correctly, a large number of unsuitable captions remain inside it.
Me buying food vs me buying textbooks
Boss: It says here that you love science
Guy: Ya, I love to experiment
Boss: What do you experiment with?
Guy: Mostly just drugs and alcohol
Cop: Did you get a good look at the suspect?
Guy: Yes
Cop: Was it a man or a woman?
Guy: I don't know I didn't ask them
hillary: why didn't you tell me they were
reopening the investigation?
obama: bitch, we emailed you

"Ma'am do you have a permit for this
Girl: does it look like I'm selling fucking donuts ?!
A swarm of 20,000 bees once followed a car for two days because their queen was trapped inside the car.
I found the guy in those math problems with all the watermelons ...
So that's what those orange cones were for

Visual Similarity Search

The attempt at clustering suggested that we should narrow the search space. If the clusters remained very diverse internally, then searching for the pictures most similar to the incoming one might give a better result. In this experiment we again used the neural network trained on 7,880 categories. In the first stage, we passed all the images through the network and saved the 5 best-scoring categories, as well as the values from the penultimate layer (which holds both visual and category information). To find a caption for a picture, we obtained its 5 best categories and searched the whole database for images with the most similar categories. Of these, we took the 10 closest and chose a caption at random from this set. We also ran the search using the values from the penultimate layer of the network. The results of the two methods were similar: on average, there were 1-2 successful captions for every 5 unsuccessful ones. This may be because, for visually similar photographs of people, the caption depended heavily on the emotions shown in the photo and on the situation itself. Examples are given below.
Me buying food vs me buying textbooks
Don't Act Like You Know
Politics If You Don't
Know Who This Is Q
when u tell a joke and no one else laughs
When good looking people have no sense of humor

Assholes, meet your king.
Free my boy he didn't do nothing
I guess they didn't read the license
When someone starts telling you how
to drive from the backseat
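
The category-based search described in this section could look roughly like the sketch below. The text does not say how similarity between category sets was measured, so counting shared top categories is an assumption, as are the function names and the toy database.

```python
import random


def top_categories(scores, n=5):
    """Indices of the n highest-scoring categories."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:n])


def caption_by_similar_images(query_scores, database, n_categories=5,
                              n_neighbours=10, rng=random):
    """database is a list of (category_scores, caption) pairs.

    Images are ranked by how many top categories they share with the query
    (an assumed similarity measure); a caption is then drawn at random
    from the n_neighbours best matches.
    """
    query_top = top_categories(query_scores, n_categories)
    ranked = sorted(
        database,
        key=lambda item: len(query_top & top_categories(item[0], n_categories)),
        reverse=True,
    )
    return rng.choice(ranked[:n_neighbours])[1]


# Toy example with 4 categories, 2 top categories, and 1 neighbour.
database = [
    ([0.7, 0.9, 0.0, 0.1], "Free my boy he didn't do nothing"),
    ([0.0, 0.1, 0.9, 0.8], "I guess they didn't read the license"),
]
print(caption_by_similar_images([0.9, 0.8, 0.1, 0.0], database,
                                n_categories=2, n_neighbours=1))
```

The variant that uses the penultimate layer would replace the overlap count with a distance between the two descriptor vectors.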

Membedding, or finding the most suitable caption by mapping the image descriptor into the vector space of text descriptors

The goal of Membedding is to build a space in which the vectors we are interested in are “close” to each other. We tried the approach from the Word2VisualVec paper.

We have pictures and their captions. We want to find a text that is “close” to the image. To solve this problem, we need to:

  1. build a vector describing the image;
  2. build a vector that describes the text;
  3. build a vector space with the desired properties (the text vector is “close” to the image vector).

To build a vector describing the image, we use a neural network pre-trained on 6000+ classes. As the vector, we take the output of the penultimate layer of this network, which has dimension 2048.
Two approaches were used to vectorize the text: Bag of Words and Word2Vec. Word2Vec was trained on the words of all the image captions. A caption was transformed as follows: each word of the text was converted to a vector with Word2Vec, and the vectors were then averaged (arithmetic mean) into a single vector. The neural network thus took an image as input and predicted the averaged vector as output. To bring the image descriptors and the text vectors into a common space, a three-layer fully connected neural network was used.

Using the trained neural network, we compute the vectors for the caption database.
Then, to find a caption for an image, we compute the image descriptor with the convolutional neural network and look for the caption vector closest to it in cosine distance. You can take the single closest caption or choose at random among the n closest.
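
The averaging and the cosine search can be sketched as follows. The word vectors here are tiny hand-made stand-ins for a trained Word2Vec model, and the image vector is assumed to have already been mapped into the shared space; the trained network that performs this mapping is outside the scope of the sketch. All function names are hypothetical.

```python
import math


def average_vector(text, word_vectors):
    """Arithmetic mean of the vectors of the known words in the text."""
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return None  # no known words, nothing to average
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_captions(image_vector, caption_vectors, n=1):
    """The n captions whose vectors are most similar to the image vector."""
    ranked = sorted(
        caption_vectors,
        key=lambda caption: cosine_similarity(image_vector, caption_vectors[caption]),
        reverse=True,
    )
    return ranked[:n]


# Toy stand-in for a trained Word2Vec model (assumed values).
word_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
caption_vectors = {c: average_vector(c, word_vectors) for c in ["cat", "dog", "cat dog"]}
print(closest_captions([0.9, 0.1], caption_vectors, n=2))
```

Choosing at random among the n closest captions, rather than always taking the first, keeps the output from repeating the same caption for similar images.
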
Good examples Bad examples
How you show up to your ex's funeral
Being a teacher in 2018 summed up in one image.
You shall not pass me
Me: Be gentle closing the door Passenger:

To build a text vector with the Bag of Words method, we proceed as follows: we count the frequency of three-letter combinations in the captions, discard those that occur fewer than three times, and compile a dictionary from the remaining combinations.
To convert a text to a vector, we count the occurrences in the text of each three-letter combination from the dictionary. This gives a vector of dimension 5322.
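
A minimal sketch of this vectorization, assuming that the three-letter combinations are overlapping character trigrams taken from lowercased text with spaces included. The text does not specify these details, and the function names are hypothetical.

```python
from collections import Counter


def trigrams(text):
    """All overlapping three-character substrings of the lowercased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_vocabulary(captions, min_count=3):
    """Trigrams that occur at least min_count times across all captions."""
    counts = Counter(t for caption in captions for t in trigrams(caption))
    return sorted(t for t, n in counts.items() if n >= min_count)


def vectorize(text, vocabulary):
    """Count vector: one component per trigram of the vocabulary."""
    counts = Counter(trigrams(text))
    return [counts[t] for t in vocabulary]


captions = ["cat cat", "cat"]
vocabulary = build_vocabulary(captions)
print(vocabulary)
print(vectorize("cat cat", vocabulary))
```

With the real caption database, the vocabulary built this way would have the 5322 entries mentioned above.
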
Result (the 5 “closest” captions):
When she sends nudes but you wanted a better America
when ur enjoying the warm
weather in december but deep
down u know it's because of
global warming
The stress of not winning an Oscar is
beginning to take its toll on Leo
Dear God, please make our the next
American president as strong as this
yellow button. Amen.

My laptop is set up to take a picture after 3
incorrect password attempts.
My cat isn't thrilled with his new bird saving bib ...
tell your cat "he" sa fucking pussy "
This cat looks just like Kylo Ren from Star Wars

Ugly guys winning bruh QC
Single mom dresses as dad so her
son wouldn't miss "Donuts With Dad"
day at school
Steak man gotta relax ....
My friend went to prom with two dates.
It didn't go as planned ...

For similar images, the captions are almost the same:
My girlfriend can take beautiful photos
of our cat. I seemingly can't ...
My laptop is set up to take a picture after 3
incorrect password attempts.
This cat looks just like Kylo Ren from Star Wars

Cats constantly look at you like
you just asked them for a ride to
the airport
My laptop is set up to take a picture after 3
incorrect password attempts.
Here's my cat, sitting on the best wedding gift we
received, a blanket with a picture of his face on it ...

As a result, the ratio of successful to unsuccessful examples was approximately 1 to 10. This is most likely explained by the small number of universal captions, as well as by the large share of memes in the training sample whose caption only makes sense if the viewer has some prior knowledge.

Caption Generation: WordRNN Approach

This method is based on a two-layer recurrent neural network in which each layer is an LSTM. The main property of such networks is the ability to extrapolate sequences in which the next value depends on the previous ones. A caption is exactly such a sequence.

The network was trained to predict each next word in the text. The entire corpus of captions was used as the training sample. We assumed that such a network would be able to learn to generate captions that were meaningful, or at least funny in some way.
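
To make the training objective concrete, here is a sketch of how next-word training pairs can be built from captions. It covers only data preparation; the LSTM itself is omitted, and the helper names and whitespace tokenization are assumptions.

```python
def build_word_index(captions):
    """Map every distinct lowercased word to an integer id."""
    words = sorted({w for caption in captions for w in caption.lower().split()})
    return {word: i for i, word in enumerate(words)}


def next_word_pairs(captions, word_index):
    """For each caption, pair every prefix of word ids with the id that follows it."""
    pairs = []
    for caption in captions:
        ids = [word_index[w] for w in caption.lower().split()]
        for i in range(1, len(ids)):
            pairs.append((ids[:i], ids[i]))
    return pairs


captions = ["me buying food"]
word_index = build_word_index(captions)
for prefix, target in next_word_pairs(captions, word_index):
    print(prefix, "->", target)
```

A recurrent network trained on such pairs receives the prefix and learns a probability distribution over the next word; generation then repeats this prediction one word at a time.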

Only the first word was supplied; the rest was generated. The results are as follows:
Trump : trump cats "got friends almost scared about the only thing in the universe
Obama: obama LAUGHING dropping FAVORITE 4rd FAVORITE 4rd fucking long
Asian: asian RR II look looks Me: much before you think u got gray candy technology that wore it
Cat: cat That when you only giving but the waiter said * hears feeling a cake with his bun
Car: Car Crispy "Emma: please" BUS 89% Starter be disappointed my mom being this out of pizza penises?
Teacher: teacher it'll and and felt not to get out because ppl keep not like: he failed so sweet my girl: has

Contrary to expectations, the resulting captions were little more than a jumble of words, although in places the sentence structure was imitated fairly well and individual fragments were meaningful.

Caption generation using Markov chains

Markov chains are a popular approach to modeling natural language. To build a Markov chain, the text corpus is split into tokens, for example words. Groups of consecutive tokens serve as states, and the probabilities of moving from each state to the next word in the corpus are computed. During generation, the next word is chosen by sampling from the probability distribution obtained from the corpus.
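
The idea can be sketched in a few lines of plain Python. This is an illustration rather than the implementation used in the experiment: the start and end markers, the function names, and the whitespace tokenization are assumptions.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # assumed boundary markers


def build_chain(captions, state_size=2):
    """Map each state (a tuple of state_size tokens) to the list of tokens seen after it."""
    chain = defaultdict(list)
    for caption in captions:
        tokens = [START] * state_size + caption.split() + [END]
        for i in range(len(tokens) - state_size):
            state = tuple(tokens[i:i + state_size])
            chain[state].append(tokens[i + state_size])
    return chain


def generate(chain, state_size=2, rng=random, max_words=30):
    """Walk the chain from the start state until the end marker or max_words."""
    state = (START,) * state_size
    words = []
    while len(words) < max_words:
        # Successors are stored with repetitions, so a uniform choice
        # samples each word in proportion to its frequency in the corpus.
        next_word = rng.choice(chain[state])
        if next_word == END:
            break
        words.append(next_word)
        state = state[1:] + (next_word,)
    return " ".join(words)


captions = [
    "when u ugly so u gotta get creative",
    "when u tell a joke and no one else laughs",
]
chain = build_chain(captions, state_size=2)
print(generate(chain, state_size=2))
```

A larger `state_size` makes the output follow the training captions more closely, which matches the difference between the two-word and three-word results below.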

For the implementation we used an existing library, and as the training corpus we used the captions with dialogues removed, one caption per line.

Result (state - 2 words):
when your homies told you m 90 to

dwayne johnson & the rock are twins. like if they own the turtle has a good can of beer & a patriotic flag tan.

this guy shows up on you, but you tryna get the joke your parents vs me as much as accidentally '
getting ready to go to work in 5 mins to figure out if your party isn't this lit. please don't be sewing supplies ...
justin hanging with his legos
when ya mom finds the report card in her mind is smoking 9h and calling weed
texting a girl that can save meek

Result (state - 3 words):
when you graduate but you don 't like asks for the answers for the homework
when u ugly so u gotta get creative

my dog is gonna die one day vs when you sit down and your thighs do the thing
when you hear a christmas song on the radio but it's ending
your girl goes out and you actually cleaned like she asked you to
chuck voted blair for prom queen 150 times and you decide to start making healthier choices.

when you think you finished washing the dishes and turn around and there are more on the stove
when you see the same memes so many times it asks for your passcode trust nobody not even yourself

With a three-word state, the text is more meaningful than with a two-word state, but it is still hardly suitable for direct use. It could probably be used to generate captions that are then moderated by a human.

Instead of a conclusion

Teaching an algorithm to write jokes is an incredibly difficult but very interesting task. Solving it would make intelligent assistants more “human”. As an example, imagine the robot from the movie Interstellar, whose humor level could be adjusted and whose jokes were unpredictable, unlike those of today's assistants.

In general, after all of these experiments, the following conclusions can be drawn:

  1. The caption-generation approach requires very complex and time-consuming work on the text corpus, the training method, and the model architecture; the result of this approach is also very hard to predict.
  2. Selecting captions from an existing database gives more predictable results, but it has difficulties of its own:

    • memes whose meaning can only be understood with prior knowledge. Such memes are hard to separate from the rest, and if they get into the database, the quality of the jokes drops;
    • memes where you need to understand what is happening in the picture: what the action is, what the situation is. Such memes, again, reduce quality if they get into the database.
  3. From an engineering point of view, a suitable solution at this stage seems to be a careful selection of phrases by an editorial team for the most popular categories. These are selfies (since people usually test the system on themselves or on photos of friends and acquaintances), photos of celebrities (Trump, Putin, Kim Kardashian, etc.), pets, cars, food, and nature. You can also add an “other” category and keep prepared jokes for cases when the system does not recognize what is shown in the picture.

In general, artificial intelligence today is not yet able to come up with jokes (although not every person can either), but it can do quite well at choosing a suitable one. We will follow further developments and take part in them!