Recommender Systems: LDA
 Tutorial
Last time I talked about Bayes theorem and gave a simple example  a naive Bayes classifier. This time we will move on to a more complex topic, which develops and continues the work of naive bayes: we will learn how to highlight topics using the LDA (latent Dirichlet allocation) model, and also apply this to recommendation systems.
The LDA model solves the classic task of text analysis: to create a probabilistic model of a large collection of texts (for example, for information retrieval or classification). We already know the naive approach: the theme is a hidden variable, and words are obtained with a fixed topic independently in a discrete distribution. Clusteringbased approaches work similarly. Let's complicate the model a bit.
Obviously, a single document may have several topics; approaches that cluster documents by topic do not take this into account. LDA is a hierarchical Bayesian model consisting of two levels:
Here is the graph of the model (the picture is taken from Wikipedia ):
Complex models are often easiest to understand like this  let's see how the model will generate a new document:
For simplicity, we fix the number of topics k and assume that β is just a set of parameters that need to be estimated, and we will not worry about the distribution on N. The joint distribution then looks like this:
Unlike ordinary clustering with an a priori Dirichlet distribution or an ordinary naive bayes, we do not select a cluster here once, and then we throw words from this cluster, and for each word we first select a topic by distribution θ, and only then we outline this word on this topic.
At the output, after training the LDA model, we obtain θ vectors showing how topics are distributed in each document, and β distributions showing which words are more likely in certain topics. Thus, from the results of the LDA, it is easy to obtain for each document a list of topics found in it, and for each topic a list of words characteristic of it, i.e. actually a description of the topic. Note: all this training without a teacher (unsupervised learning), the dataset from the texts does not need to be marked out!
As for how this all works, I admit honestly  I don’t think that in a brief hub post you can tell about how the conclusion in the original work about LDA was organized; it was based on the search for variational approximations to the distribution of the model (we simplify the structure of the model by breaking ties, but adding free variables that we optimize for). However, much simpler approaches based on Gibbs sampling were soon developed; maybe someday I will return to this topic when I will talk about sampling, but now the fields are too narrow for this. Let me just leave here a link to MALLET  the most popular and, as far as I can tell, the best readymade implementation of LDA. MALLET is enough for you for almost all occasions, except if you want to isolate topics from the entire Internet as a whole  on the MALLET cluster, it seems, it does not know how to work.
And I will tell you about how you can apply LDA in the recommendation system; this method is especially suitable for situations where the “ratings” are not set on a long scale of asterisks, but simply a binary “expression of approval”, like like in Surfingbird. The idea is quite simple: let's consider the user as a document consisting of the products he liked . At the same time, products become “words” for LDA, users become “documents”, and as a result, “themes” of user preferences stand out. In addition, you can evaluate which products are most characteristic of a particular topic  that is, select a group of products that are as relevant as possible for the corresponding user group, and also enter distances from both the users and the products from the distributions on topics.
We applied this analysis to Surfingbird.ru database . and got a lot of interesting topics  it turned out that in almost all cases really stand out groups of sites, united by one topic and quite similar to each other. In the pictures below, I drew statistics on words that are often found on the pages of some of the topics obtained using the LDA (while the LDA itself did not work with page texts, but only with user preferences!); links to the pages themselves, I still cut out just in case.
The LDA model solves the classic task of text analysis: to create a probabilistic model of a large collection of texts (for example, for information retrieval or classification). We already know the naive approach: the theme is a hidden variable, and words are obtained with a fixed topic independently in a discrete distribution. Clusteringbased approaches work similarly. Let's complicate the model a bit.
Obviously, a single document may have several topics; approaches that cluster documents by topic do not take this into account. LDA is a hierarchical Bayesian model consisting of two levels:
 at the first level  a mixture whose components correspond to the "themes";
 at the second level, a multinomial variable with an a priori Dirichlet distribution that defines the “distribution of topics” in the document.
Here is the graph of the model (the picture is taken from Wikipedia ):
Complex models are often easiest to understand like this  let's see how the model will generate a new document:
 select the document length N (this is not drawn on the graph  it’s not that part of the model);
 choose a vector  a vector of the “degree of expression” of each topic in this document;

for each of
N
words
w
:
 choose a topic by distribution ;
 choose a word with the probabilities given in β.
For simplicity, we fix the number of topics k and assume that β is just a set of parameters that need to be estimated, and we will not worry about the distribution on N. The joint distribution then looks like this:
Unlike ordinary clustering with an a priori Dirichlet distribution or an ordinary naive bayes, we do not select a cluster here once, and then we throw words from this cluster, and for each word we first select a topic by distribution θ, and only then we outline this word on this topic.
At the output, after training the LDA model, we obtain θ vectors showing how topics are distributed in each document, and β distributions showing which words are more likely in certain topics. Thus, from the results of the LDA, it is easy to obtain for each document a list of topics found in it, and for each topic a list of words characteristic of it, i.e. actually a description of the topic. Note: all this training without a teacher (unsupervised learning), the dataset from the texts does not need to be marked out!
As for how this all works, I admit honestly  I don’t think that in a brief hub post you can tell about how the conclusion in the original work about LDA was organized; it was based on the search for variational approximations to the distribution of the model (we simplify the structure of the model by breaking ties, but adding free variables that we optimize for). However, much simpler approaches based on Gibbs sampling were soon developed; maybe someday I will return to this topic when I will talk about sampling, but now the fields are too narrow for this. Let me just leave here a link to MALLET  the most popular and, as far as I can tell, the best readymade implementation of LDA. MALLET is enough for you for almost all occasions, except if you want to isolate topics from the entire Internet as a whole  on the MALLET cluster, it seems, it does not know how to work.
And I will tell you about how you can apply LDA in the recommendation system; this method is especially suitable for situations where the “ratings” are not set on a long scale of asterisks, but simply a binary “expression of approval”, like like in Surfingbird. The idea is quite simple: let's consider the user as a document consisting of the products he liked . At the same time, products become “words” for LDA, users become “documents”, and as a result, “themes” of user preferences stand out. In addition, you can evaluate which products are most characteristic of a particular topic  that is, select a group of products that are as relevant as possible for the corresponding user group, and also enter distances from both the users and the products from the distributions on topics.
We applied this analysis to Surfingbird.ru database . and got a lot of interesting topics  it turned out that in almost all cases really stand out groups of sites, united by one topic and quite similar to each other. In the pictures below, I drew statistics on words that are often found on the pages of some of the topics obtained using the LDA (while the LDA itself did not work with page texts, but only with user preferences!); links to the pages themselves, I still cut out just in case.