Multiclass classification of Google queries using a neural network in Python

Quite some time has passed since I published my first article on natural language processing. I have kept actively exploring the topic, discovering something new every day.

Today I would like to talk about one way to classify search queries into categories using a Keras neural network. The queries come from the automotive domain.

The dataset contains about 32,000 search queries labeled with 14 classes: Auto History, Auto Insurance, VU (driver's license), Complaints, Registration in the traffic police, Registration in MADI, Registration for medical examination, Violations and penalties, Appeals to MADI and AMPP, Title, Registration, Registration Status, Taxis, Evacuation.

The dataset itself is a .csv file with one query per row: the query text is in the запрос column and its class label in the класс column.

Preparing the dataset


Before building the neural network model, we need to prepare the dataset, namely remove all stop words and special characters. And since in queries like “look up a Camry 2.4 by VIN number online” the numbers carry no meaning, we will remove them as well.

We take the stop words from the NLTK package and extend the list with punctuation symbols.
Here is the result:

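from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

# Russian stop words from NLTK (needs nltk.download('stopwords')) plus symbols
stop = set(stopwords.words('russian'))
stop.update(['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}', '#', '№'])

def clean_csv(df):
    # Assign the cleaned column back: rows yielded by iterrows() are copies
    df['запрос'] = df['запрос'].apply(lambda q: remove_stop_words(q.lower()).rstrip())
    return df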

A query received as input for classification also needs to be prepared. Let's write a function that "cleans" the query:

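def remove_stop_words(query):
    # Lowercase first so that capitalized stop words are matched too
    tokens = wordpunct_tokenize(query.lower())
    # Drop stop words, listed symbols and purely numeric tokens
    return ' '.join(t for t in tokens if t not in stop and not t.isdigit())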
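
To see what this cleaning step does without installing NLTK, here is a minimal sketch using only the standard library. It is an illustration under assumptions: the tiny English stop list, the names `STOP`, `tokenize` and `clean_query`, and the sample query are hypothetical, and the regular expression only approximates the behavior of `wordpunct_tokenize` (runs of word characters or runs of punctuation).

```python
import re

# Hypothetical hand-picked stop list, for illustration only; the article
# uses NLTK's Russian stop words plus a list of punctuation symbols.
STOP = {"by", "the", "a", "in", ".", ",", "?", "!"}

def tokenize(text):
    # Split into runs of word characters or runs of punctuation
    return re.findall(r"\w+|[^\w\s]+", text)

def clean_query(query):
    tokens = tokenize(query.lower())
    # Keep tokens that are neither stop words nor purely numeric
    kept = [t for t in tokens if t not in STOP and not t.isdigit()]
    return " ".join(kept)

cleaned = clean_query("Check Camry 2.4 by the VIN number online")
print(cleaned)
```

Note that "2.4" is split into "2", "." and "4", so all three pieces are removed: the digits by the `isdigit()` check and the dot by the stop list.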

Data formalization


You can’t simply feed raw words into a neural network, let alone Russian ones! Before training, the queries must be converted into numeric matrices, and the classes into vectors of size N, where N is the number of classes. To transform the queries we use the Keras Tokenizer class, which assigns each word its own index and can then convert queries (sentences) into arrays of indices. Since queries differ in length, those arrays would differ in length too, which is unacceptable for a neural network. To solve this, every query is converted into a row of the same fixed length, so that the whole dataset becomes a two-dimensional array; the texts_to_matrix method used below does this by producing one row of max_words entries per query, marking which vocabulary words occur in it. The output (the class vector) is simpler: it contains a one in the position of the class the query belongs to and zeros everywhere else.

Here is the result:

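import pandas as pd
import keras
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import LabelEncoder

# Read the cleaned dataset from CSV
# (columns: 'запрос' = query, 'класс' = class)
df = pd.read_csv('cleaned_dataset.csv', delimiter=';', encoding="utf-8").astype(str)
num_classes = len(df['класс'].drop_duplicates())
X_raw = df['запрос'].values
Y_raw = df['класс'].values

# Transform the query texts into a matrix
# (max_words, the vocabulary size, must be defined beforehand)
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_raw)
x_train = tokenizer.texts_to_matrix(X_raw)

# Transform the class labels into one-hot vectors
encoder = LabelEncoder()
encoder.fit(Y_raw)
encoded_Y = encoder.transform(Y_raw)
y_train = keras.utils.to_categorical(encoded_Y, num_classes)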
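
To make the two transformations concrete, here is a minimal pure-Python sketch of roughly what a binary word-presence matrix and one-hot class vectors look like. It is an illustration, not Keras's implementation: `build_word_index`, `texts_to_binary_matrix` and `one_hot` are hypothetical helper names, and the three queries with their labels are invented for the example.

```python
from collections import Counter

def build_word_index(texts, num_words):
    # More frequent words get lower indices; index 0 stays unused,
    # so at most num_words - 1 words fit into the vocabulary
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: i + 1 for i, word in enumerate(ranked)}

def texts_to_binary_matrix(texts, word_index, num_words):
    # One fixed-length row per text: 1 where a vocabulary word occurs
    matrix = []
    for text in texts:
        row = [0] * num_words
        for word in text.split():
            idx = word_index.get(word)
            if idx is not None:
                row[idx] = 1
        matrix.append(row)
    return matrix

def one_hot(labels):
    # Sort the class names, then put a single 1 at each label's position
    classes = sorted(set(labels))
    position = {c: i for i, c in enumerate(classes)}
    vectors = []
    for label in labels:
        vec = [0] * len(classes)
        vec[position[label]] = 1
        vectors.append(vec)
    return classes, vectors

NUM_WORDS = 16  # assumed vocabulary size for this toy example
queries = ["check camry vin online", "taxi license check", "tow truck fine"]
labels = ["Auto History", "Taxis", "Evacuation"]

word_index = build_word_index(queries, NUM_WORDS)
x = texts_to_binary_matrix(queries, word_index, NUM_WORDS)
classes, y = one_hot(labels)

for row in x:
    print(len(row), row)
print(classes, y)
```

Every row has the same length regardless of how many words the query contains, which is what lets the network accept queries of different lengths. The price of this encoding is that word order is discarded.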

Building and compiling a model


We initialize the model, add several layers, and compile it with the “categorical_crossentropy” loss function, since we have more than two classes (the task is not binary). Then we train the model and save it to a file. See the code below:

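from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

# One hidden layer of 512 units with dropout, then a softmax over the classes
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# batch_size and epochs must be defined beforehand
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1)

# Save the trained model to a file
model.save('classifier.h5')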

By the way, accuracy during training reached 97%, which is quite a good result, although it is measured on the same data the model was trained on.
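
For readers wondering what the softmax output layer and the categorical cross-entropy loss actually compute, here is a small standard-library sketch. It is a simplified single-example illustration rather than Keras's code; the function names mirror the Keras terms, but the logit values and the `eps` clipping constant are assumptions chosen for the example.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability, then normalize
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # With a one-hot y_true this is minus the log of the probability
    # assigned to the correct class (clipped to avoid log(0))
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

logits = [2.0, 0.5, -1.0]  # assumed raw scores for three classes
probs = softmax(logits)

loss_correct = categorical_crossentropy([1, 0, 0], probs)  # true class is the top one
loss_wrong = categorical_crossentropy([0, 0, 1], probs)    # true class is the least likely one

print(probs)
print(loss_correct, loss_wrong)
```

The loss is small when the network puts most of the probability on the true class and grows as that probability shrinks, which is why it suits a problem with more than two mutually exclusive classes.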

Model testing


Now we’ll write a small command-line script that takes a search query as an argument and outputs the class the query most likely belongs to, in the opinion of the model we created earlier. I will not go into the details of the code in this section; the full source is on GitHub. Let's get to the point: run the script on the command line and start typing in queries:
Figure 1 - Example of using the classifier
The result is quite clear: the classifier accurately recognizes the queries we enter, which means the work was not done in vain!
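
Since the script itself is not shown here, the following is only a rough sketch of how such a command-line wrapper could be organized, not the author's actual code. Everything in it is an assumption: `top_class`, `main`, `stub_predict` and the three-class list are hypothetical. In a real script, the prediction function would clean the query, encode it with the fitted tokenizer, and call the model loaded from classifier.h5; here a stub stands in so that the sketch runs without Keras.

```python
import argparse

def top_class(probabilities, class_names):
    # Pick the class whose predicted probability is the largest
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_names[best], probabilities[best]

def main(argv, predict, class_names):
    parser = argparse.ArgumentParser(description="Classify a search query")
    parser.add_argument("query", help="search query text")
    args = parser.parse_args(argv)
    name, p = top_class(predict(args.query), class_names)
    print(f"{name} ({p:.2f})")
    return name

# Hypothetical subset of the article's classes, for illustration only
CLASSES = ["Auto History", "Taxis", "Evacuation"]

def stub_predict(query):
    # Stand-in for the trained model: returns made-up probabilities
    return [0.1, 0.7, 0.2] if "taxi" in query else [0.8, 0.1, 0.1]

result = main(["how to get a taxi license"], stub_predict, CLASSES)
```

Passing the prediction function in as a parameter keeps argument parsing and label decoding separate from the model, so the wrapper can be tried out before the trained model is available.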

Conclusion


The neural network coped with the task very well, and this can be seen with the naked eye. One practical application of this model is public services, where citizens submit all kinds of statements, complaints, and so on. Automating the intake of all these “pieces of paper” with intelligent classification could significantly speed up the work of government agencies.

I look forward to your suggestions for practical applications, as well as your opinion of the article, in the comments!