About machine learning, history and life with Dmitry Vetrov

As part of an open machine learning course, we continue to talk with prominent representatives of this field. Our first interlocutors were Alexander Dyakonov, Konstantin Vorontsov and Evgeny Sokolov; see the videos on the course's YouTube channel. This time we talked with Dmitry Vetrov.

Good day! Today our guest is Dmitry Petrovich Vetrov, a professor at the Faculty of Computer Science and head of the Bayesian methods research group. Dmitry Petrovich continues to give courses at the Department of Mathematical Forecasting Methods of the VMK faculty of Moscow State University, where he himself once studied. It so happens, perhaps by accident, that we have already met and talked with Alexander Dyakonov, Konstantin Vorontsov and Evgeny Sokolov, and now you are the fourth representative of that department. Tell us how you all ended up together, whether you worked together on any project, how life scattered you, and whether you still keep in touch with your colleagues.

Of course, we actively communicate. We all come from the academic school of academician Zhuravlev, who founded the Department of Mathematical Forecasting Methods at VMK in 1997. Konstantin Vyacheslavovich, Alexander Gennadievich, Evgeny Andreevich and your humble servant actively taught there, and some still do. As for joint research, it seems to me that this was my main mistake. For 5 years I held the post of scientific secretary of the department and carried out its day-to-day management, and in all that time we never worked on a single research project together. It seemed that there was plenty of time, that we would get around to it, but it turned out that life began to scatter us. Konstantin Vorontsov concentrated on working at PhysTech, Alexander Dyakonov went into industry, Evgeny Sokolov and I moved our main efforts to the Faculty of Computer Science at the Higher School of Economics. As a result, we worked in one unit for several years, taught and talked together, but did not conduct any joint research. Now I regret it. I learned a lot from my colleagues, but I could have learned more ...

So your collaboration was more like teaching, right?

Yes. And therefore, even though life has scattered us, in my opinion the Department of Mathematical Forecasting Methods is still the strongest at VMK today, and I try to help it however I can. Dyakonov, Sokolov, Vorontsov and Vetrov all still give courses there. That is, even though we lost our affiliation with the department, we continue to participate in its life, although, of course, not as before.

Well, probably, if we talk about theoretical machine learning, the Department of Mathematical Forecasting Methods now provides the best education. If you compare it with the same FCS, where many courses are more practical ... Here one can recall Konstantin Vyacheslavovich's article on MachineLearning.ru about how machine learning should be taught in general. My subjective feeling is that the strongest theoretical base is given precisely at that department.

I would not put it so categorically. Today there are many places where machine learning is taught very well: at the same VMK, at the Faculty of Computer Science with a specialization in Machine Learning and Applications, at ShAD and at PhysTech. It is hard for me to say where the program is more theoretical and where it is more applied.

But you moved from Moscow State University to the Higher School of Economics; how would you compare these two universities in general? HSE, perhaps, is criticized by many for its Westernizing attitude and generally for its orientation toward the Western scientific system and the citation databases Web of Science and Scopus ... In general, HSE plays a double game: on the one hand, there are many government orders for research, and on the other, a race for publications in the best journals and at the best Western conferences. You are right in the middle of all this: you publish in the best journals and go to the top machine learning conferences. How would you answer this rather philosophical question: how do we catch up with the West? Should we focus on their values, on publications in their journals? Or, if we are forever catching up, will we never overtake them?

Look, firstly, I would correct you. You say: "Western, Western ...". This has not been Western for a long time - it is the global trend in the development of science. Secondly, the specifics of the development of science in our country, for both objective and subjective reasons, were such that in many fields science was isolated from world trends. And machine learning, alas, was among them. It seems to me that any form of scientific isolation is harmful for the community that isolates itself from the rest of the world. Therefore, I fully advocate maximum integration. But not with the Western community, I repeat, but with the world. There are many world-class researchers from China, India, Japan ... If we want to achieve something in world science, to be at the forefront, then, of course, we need to follow international conferences and journals and, of course, publish there. In my opinion, integration into the global scientific community will make it possible to reach these advanced positions and, possibly, even become leaders in certain areas. It has now become obvious to everyone that the Russian scientific community in the field of machine learning is 10-20 years behind world trends. This is very sad. In effect, it means that this scientific field must be rebuilt from scratch. And the main reason for the lag was self-isolation from the global scientific community. We need to catch up with it - there is simply no other choice. And yes, mankind has not yet come up with anything better than following world scientific standards for conducting research (strict adherence to the scientific method, competent design of experiments, anonymous peer review, continuous reading of scientific articles to stay "in the loop", etc.). Any attempt to set something against these standards leads to lag and gradual degradation.
At the same time, we have our competitive advantages: a high level of mathematical training among applicants and students, and a number of industry initiatives aimed at teaching modern machine learning methods. New projects have appeared, such as school and student olympiads in data analysis. These are very good developments that give grounds for cautious optimism. It is unfortunate that all these undertakings happen not thanks to, but often in spite of, the Russian Academy of Sciences, which, it would seem, should have led this trend. Therefore, I believe that science in the field of artificial intelligence in Russia must be rebuilt from scratch. There are places that will turn you into an intelligent specialist and a solver of applied problems, but there are practically no places that will turn you into a developer of new machine learning technologies. Forever reimplementing the technologies developed at Google, as many companies do, is boring, and I have a feeling that we can do more.
As for my publishing at leading conferences ... I believe that I publish too little, and I am dissatisfied with my current publication activity. I want to do this much more intensively, and we are actively working on it.

And yet, nowadays it often happens that even a scientist's salary and scientific reputation depend on citations, in particular in Web of Science and Scopus. It seems that this system has the same kind of drawbacks as the Unified State Exam. Despite the shortcomings, should one still focus on publications and indexing in scientific citation databases?

Could you clarify, please ...

It seems to me that the scientific community will soon learn to evaluate the contribution of scientists better. Say, with something based on the PageRank algorithm. Indeed, right now the context of a citation and its emotional coloring are not taken into account at all. Suppose I quote you now but say that I disagree with what you wrote and that it is all nonsense. Under the current system this still counts as +1 to the citation count of your articles. What are your ideas for improving the system for assessing a researcher's contribution?

Even if you quote me with negative emotions, the mere fact of the citation means that my research has somehow influenced your work. Citation measures a very simple thing: someone needs what a person has done, someone uses it, even with negative emotions. This is better than no citations at all. That is the first point. The salary of HSE employees does not depend on the number of citations. It is determined by the level of the venue in which your work is published. That is the second point. With citations you can do anything, for example, engage in self-citation. But raising the level of the venue in which you published an article is impossible in principle. Not for any money, not through connections ... You cannot "ask" to be published at a leading conference - there you have to break through a strict review and selection system. By the way, the fact that a scientist's salary is determined by the level of the venues in which he publishes is not unique to HSE; the same is true at Moscow State University and at PhysTech. The next question is how to determine which venues are considered good and which are bad. The question is critical. Any errors here lead to researchers chasing the wrong goals. For example, instead of growing professionally and publishing at increasingly prestigious conferences, they start chasing an ever greater number of publications in junk journals. And, for example, I have questions about the criteria introduced at Moscow State University. They hardly encourage the professional growth of a scientist; rather, they encourage its imitation. I can see that the system can be gamed, for example, by producing low-quality publications in order to get a big bonus. And this happens all the time.
Gaming the HSE system is much more difficult, precisely because it is built around venue ratings, although I admit that it is also possible.

Speaking of international conferences of the ICML and IJCAI level: one of your group's papers on Bayesian sparsification of deep neural networks ("Variational Dropout Sparsifies Deep Neural Networks", arxiv), published at ICML, received a lot of feedback from the scientific community. Can you talk about it - is it a small gradient step in the development of science, or is it a revolutionary thing? How can it help the development of deep learning, theoretically and practically? And in general, could you talk about Bayesian methods in deep learning. Or in depth :)

Let's not talk about a revolutionary contribution. Truly revolutionary articles can be counted on one hand. We have taken a step in the right direction, and this direction is, in my opinion, technically important, with significant prospects. This is what our group is trying to do - cross the Bayesian approach to machine learning with deep neural networks. And the work you mention did indeed arouse considerable interest in the scientific community. We took the well-known procedure for regularizing neural networks - dropout - and, building on the work of our colleagues from the University of Amsterdam, who showed that dropout can be viewed as a Bayesian procedure, we proposed a generalization of it. It includes ordinary dropout as a special case, but also allows variational Bayesian inference to be used to tune the dropout rates automatically. That is, the probability with which each weight or each neuron is dropped in our network is selected not by eye or by cross-validation, but automatically. Once we learned to do this automatically, it became possible to introduce an individual dropout rate for each weight in the neural network and optimize the objective over all these parameters. Such a procedure leads to amazing results. It turns out that over 99% of the weights can simply be removed from the network (i.e., their dropout rate becomes equal to one) without the quality on the test sample sagging. That is, we maintain high generalization ability and low test error, while the neural network can be compressed by a factor of 100 or even 200.
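To make the pruning step concrete, here is a minimal illustrative sketch. It is not the group's actual code: the 0.95 threshold and the dropout-rate values are assumptions for demonstration only. The idea is that once per-weight dropout rates have been learned by variational inference, weights whose rate is close to one can simply be removed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are trained weights together with their learned
# per-weight dropout rates p_i (in the paper they are parametrised
# via log-alpha; the values below are made up for illustration).
weights = rng.normal(size=1000)
dropout_rates = np.full(1000, 0.999)  # most rates are driven to ~1
dropout_rates[:10] = 0.1              # a few weights stay useful

# Weights whose dropout rate is ~1 carry no information and can be
# zeroed out without hurting test quality.
keep = dropout_rates < 0.95           # hypothetical pruning threshold
sparse_weights = np.where(keep, weights, 0.0)

compression = weights.size / keep.sum()
print(f"kept {keep.sum()} of {weights.size} weights, "
      f"compression ~{compression:.0f}x")
```

With 990 of 1000 rates near one, the toy network compresses a hundredfold, mirroring the ">99% of weights removed" effect described above.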

Does this mean the dropout rate can even be selected analytically?

Not analytically, of course - it is ordinary optimization. The objective here is strictly defined and arises naturally from the Bayesian inference procedure. Our result suggests that we are moving in the right direction. It is known that modern neural networks are very redundant, but it was not clear how to eliminate this redundancy. There were attempts, of course - for example, simply taking a smaller network - but the quality sagged. So it now seems that the more correct way is to take a redundant neural network, train it, and then eliminate the redundancy using the Bayesian dropout procedure.

I see. And here is a more general question. How do you see the prospects for the development of Bayesian methods in relation to deep learning? What problems are there?

Modern deep neural networks are trained, in essence, by the method of maximum likelihood, which statistics tells us is the best method under certain conditions. The whole problem is that the situation arising in the training of deep neural networks does not satisfy the conditions that guarantee the optimality of the maximum likelihood method. The conditions are very simple: the number of training examples from which the parameters of the machine learning algorithm are tuned must be much larger than the number of those parameters. In modern deep networks this is not so. The maximum likelihood method can still be used, but at your own peril and risk, without any guarantees. It turns out that in this situation, when the number of weights is comparable to or even greater than the size of the training sample, the frequentist approach with its classical estimation methods gives way to Bayesian statistics. Bayesian methods can be used for any sample size, down to zero. It can be shown that as the sample size relative to the number of estimated parameters tends to infinity, the Bayesian approach turns into the method of maximum likelihood. That is, the classical and Bayesian approaches do not contradict each other. On the contrary, Bayesian statistics can be considered a generalization of classical statistics to a wider class of problems. Applying the Bayesian approach to deep learning gives the neural network a number of additional advantages.
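This convergence is easy to see in the simplest conjugate case. Below is a toy illustration of my own (not from the interview): for a Gaussian likelihood with known variance and a Gaussian prior on the mean, the posterior mean is a weighted average of the prior mean and the maximum-likelihood estimate, and the weight on the MLE tends to one as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean, sigma = 3.0, 1.0       # data-generating parameters
prior_mean, prior_var = 0.0, 1.0  # conjugate Gaussian prior on the mean

def posterior_and_mle(x):
    """Standard conjugate update for the mean of a Gaussian."""
    n = len(x)
    mle = x.mean()
    w = n * prior_var / (n * prior_var + sigma**2)  # weight on the MLE
    return w * mle + (1 - w) * prior_mean, mle

for n in (1, 10, 10_000):
    x = rng.normal(true_mean, sigma, size=n)
    post, mle = posterior_and_mle(x)
    print(f"n={n:6d}  MLE={mle:6.3f}  posterior mean={post:6.3f}")
```

For tiny samples the prior pulls the estimate toward the prior mean; for large samples the two estimates are practically indistinguishable, which is the sense in which the Bayesian approach "turns into" maximum likelihood.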

Firstly, it becomes possible to work with missing data, that is, when the values of some attributes are not specified for some examples in the training sample. The most suitable way of working in such a situation is a Bayesian probabilistic model.

Secondly, the training of a Bayesian neural network can and should be viewed as inferring a distribution over the space of all possible networks, to which the ensembling technique can be applied. That is, we get the opportunity to average the predictions of many neural networks drawn from the posterior distribution in weight space. Such an ensemble, in full accordance with Bayesian statistics, improves quality compared to using a single (even the best) neural network.
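A rough sketch of this ensembling idea (my own illustration; MC-dropout-style sampling stands in here for full posterior inference): each stochastic forward pass samples one network from the distribution, and the prediction is averaged over many such samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy one-hidden-layer network with fixed random weights.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 1))
x = rng.normal(size=(1, 4))

def stochastic_forward(x, p_drop=0.5):
    """One forward pass = one network sampled from the distribution."""
    h = np.maximum(x @ W1, 0.0)            # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop    # random dropout mask
    return (h * mask / (1 - p_drop)) @ W2  # inverted-dropout scaling

# Average predictions over many sampled networks; the spread of the
# samples doubles as a crude uncertainty estimate.
samples = np.concatenate([stochastic_forward(x) for _ in range(1000)])
print(f"ensemble mean={samples.mean():.3f}, predictive std={samples.std():.3f}")
```

The averaged prediction is the ensemble forecast described above, and the sample spread gives a (crude) measure of the network's uncertainty.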

Thirdly, Bayesian neural networks are much more resistant to overfitting. Overfitting is now one of the most acute problems of machine learning, and publications from 2016-17 show that modern neural network architectures are catastrophically prone to it. Bayesian neural networks, by contrast, practically do not overfit. It is especially noteworthy how our ideas about regularization change with the development of Bayesian methods. Classical regularization is simply the addition of an extra term, the regularizer, to the optimized objective. For example, it may be a norm of the tunable parameters. The regularizer shifts the optimum point and partially helps to cope with overfitting. Now we understand that regularization can (and should) be carried out differently: by adding noise to the optimization process, which prevents stochastic optimization methods from converging to the exact optimum. The most successful regularization methods to date, such as dropout or batch normalization, work exactly this way. This is not the addition of a regularizer to the loss function, but a controlled injection of noise into the problem. It is a completely different view of the regularization of machine learning algorithms! But what should the intensity of this noise be, and where should it be added? This question can be answered correctly by applying stochastic variational inference in a Bayesian model of a neural network.
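The contrast between the two styles of regularization can be sketched on a toy linear model (my own illustration, not the group's code): the classical route adds a penalty term to the gradient; the "noise injection" route perturbs the optimization steps so that SGD never settles at the exact optimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression problem: 50 examples, 10 features, sparse true weights.
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=50)

def sgd(noise_std=0.0, l2=0.0, lr=0.01, steps=2000):
    w = np.zeros(10)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y) + l2 * w          # penalty term
        w -= lr * (grad + noise_std * rng.normal(size=10))  # injected noise
    return w

w_penalty = sgd(l2=0.1)        # classical: shift the optimum
w_noisy = sgd(noise_std=0.05)  # modern: never converge exactly
print("penalty:", np.round(w_penalty[:3], 2))
print("noisy:  ", np.round(w_noisy[:3], 2))
```

Both runs recover the true weights approximately, but for very different reasons; dropout and batch normalization are, of course, far more structured kinds of noise than the isotropic Gaussian used here.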

Fourth, potential resistance to what are called adversarial attacks, when we artificially craft examples that mislead the neural network. One network can be fooled, ten networks can be fooled, but it is not so easy to fool the continuum of neural networks that Bayesian inference yields from the learning process. I think the combination of neural networks and Bayesian approaches is extremely promising. There is beautiful mathematics, amazing effects and good practical results. So far we lack the Bayesian tools to conduct efficient Bayesian inference at this scale, but the scalable methods of approximate Bayesian inference needed for this are now being actively developed around the world.

And here, just a clarification. Is it true that dropout can be viewed as a transition to a distribution over neural networks, and that the result of training with dropout is then an ensemble of neural networks?

Yes. In the original formulation of dropout we also arrive at an ensemble of neural networks, but it is unclear where that ensemble comes from. If we reformulate dropout in terms of Bayesian inference, everything falls into place. It becomes clear how to configure it and how to select the dropout rate automatically. Moreover, we immediately get a number of opportunities for generalizing and modifying the original dropout model.

But can Bayesian methods offer some understanding of what actually happens when training neural networks? In particular, tuning network hyperparameters is currently a heuristic, trial-and-error procedure: we somehow understand that in one situation you need to add BatchNorm and in another to tweak the dropout rate. That is, we are still far from a theoretical understanding of how the numerous hyperparameters affect the training of neural networks. Can Bayesian methods offer a new perspective?

Let me clarify. Is the question about our understanding of how neural networks make decisions, or about how they solve an optimization problem? This is an important difference.

Firstly, what the hyperparameters are responsible for and how they affect training. That is our first gap in understanding. The second: are any theoretical guarantees on the generalization error still possible in the case of neural networks? As far as I know, computational learning theory is applicable to perceptrons and networks with one hidden layer, but is powerless as soon as we move on to deep neural networks. In particular, the same adversarial attacks show that we still do not understand how neural networks are capable of generalization. They literally change one pixel, and now the neural network says that the penguin is not a penguin but a tractor. This is a disaster, if you think about it! Even despite the excellent results of convolutional networks on ImageNet. Can Bayesian methods offer something here?

A lot of questions; let's take them in order. I already spoke about resistance to adversarial examples: Bayesian neural networks are more resistant to such attacks, although problems remain. The cause of the problem is actually understandable. All adversarial examples are extremely atypical from the point of view of the general population (to which the neural network is fitted, whether Bayesian or not). The fact that we do not visually see differences from the original image does not mean there are none. And on atypical objects, the answer of any machine learning algorithm can be arbitrary. This logically suggests a way to deal with adversarial examples, but that is a completely different story ...

As for statistical learning theory and guarantees on generalization ability: the situation really is such that the results of the theory do not transfer to modern neural networks. Everyone understands this, so specialists in statistical learning theory are actively working to make new methods applicable to deep neural networks as well. I hope we will see this in the coming years. Can network architecture be determined using Bayesian methods? Hypothetically, yes; in practice, the first steps are only now being taken around the world. Bayesian sparsification can also be viewed as a choice of neural network architecture. To answer this question more fully, new tools are needed; in particular, other regularization methods, such as batch normalization, must be translated into Bayesian language. The need for this is obvious, as is the desire. Such work is underway, but so far without success. I hope it is a matter of time.

And in fact, the main advantage of the Bayesian approach is the automatic tuning of hyperparameters. The more neural network construction procedures we move onto Bayesian rails, the more opportunities appear for automatic selection of the network topology. Now, the last question, about why a neural network makes this or that decision ... This is a question to which we are unlikely to get an exhaustive answer in the near future. From my point of view, one of the most promising techniques for understanding what is happening inside neural networks is what is called the attention mechanism. Part of this mechanism is also based on Bayesian principles, but these methods are still quite crude. I hope that in the near future it will be possible to reach a level at which it becomes clear what is happening inside neural networks. Meanwhile, a series of indirect experiments, including some conducted in our group, testifies that the computer understands the meaning of the data much better than is commonly believed. In some cases you can make the computer express its understanding in human language. I will talk about one such model, and the incredible effects we observed in it, at my next public lecture. I think this may be one way to understand the logic of a neural network's operation - it must itself generate an explanation of why a decision was made.

Okay, but do Bayesian methods somehow draw inspiration from observations of the human brain? In particular, not all neural connections in our brain are active at once, and this could have served as motivation for the dropout technique. Do you know of cases when research in neurophysiology served as a source of new ideas in Bayesian statistics?

Well, firstly, let me immediately dispel the popular misconception that artificial neural networks supposedly simulate the functioning of the human brain. No. This is not true. They have nothing to do with the human brain. More precisely, earlier, when artificial neural networks had just appeared, they were associated with the human brain. But now we understand much more in both machine learning and neurophysiology, and we can safely say that these are different mechanisms. An artificial neural network is a separate model that has no more in common with a biological brain than, say, a decision tree. On the other hand, there are many psychological studies showing that the human brain works to a large extent according to Bayesian principles. I am not ready to comment on this in detail, but such an opinion exists.

Well, let me steer the conversation into another area. At school I studied, of course, mathematics, physics and various sciences, and I quickly realized that formulas stuck in my head immediately, once and for all. If I once learned what momentum is, I never again need to recall whether it is mass times velocity or mass times velocity squared. With history it was different, although we had amazing lecturers both at school and at university. For example, at PhysTech it could happen that 15 people attended a lecture in our specialty, computer architecture, while the next lecture, on history, was packed to capacity. The reason, of course, was the lecturer, who was also a brilliant actor; people practically came to his lectures with popcorn - each one was a performance. But, unfortunately, historical information stuck with me very poorly. In one ear and out the other. I went through both Russian and world history about three times, from the Rurikids to the Romanovs, but it all flew out of me instantly. I know that you have given lectures on history, both at Moscow State University and at the FCS. Tell us how you realized that history and applied mathematics can coexist in one head, and how you maintain this interest in history now.

Well, firstly, it is important to understand that the history lectures I give are purely amateur and claim nothing more than to arouse interest in studying a particular topic independently. I do not practice history as a science. My interest in history appeared back in school. One of the reasons for it is that a person who knows history, in a sense, expands his training set. He sees that many of the problems humanity faces are not new - other people, generations and states have faced them before. He sees how people acted and what it led to. That is, he enriches his experience a little.

Where did the history lectures come from? The answer is simple. From communicating with my students I realized that they know practically nothing of history, and it seemed to me they would be interested if I told them a little about it. At the very least so that people can distinguish facts from the nonsense that some media feed us. On the one hand, history can be studied and taught so as not to repeat the mistakes of the past. On the other hand, there is an opinion that history should be taught to instill patriotism. From my point of view, for students of elite universities history is needed precisely for the first reason, not in order to instill patriotism. There are many other ways to foster the latter, for example, sports. But the price we pay for fostering the patriotism of the intellectual elite through the teaching of "patriotic" history is too high. In that case historical events and facts are distorted to fit the patriotic paradigm, and citizens form a consciousness of their own superiority over other nations and countries, a feeling that "everyone owes us". Never and nowhere has this ended well.

Medinsky's dissertation, say ...

Well, that is a clinical example that needs no comment. I have nothing to add to the opinion of the VAK experts who recommended stripping him of his degree ... So in my lectures I try to show that history should be looked at without patriotic blinkers, and then it becomes much more interesting and multifaceted. There are objective historical processes in which there are no right and guilty; history is not black and white - it is gray. All the actors pursue certain self-interests. And declaring that these are right and those are not is just an attempt to fit history to an ideology. It is in the hope of at least partially removing these blinkers that I give my lectures to young people. But this is not a presentation of some objective truth; rather, it is an attempt to arouse students' interest and the desire to turn to historical sources and figure everything out for themselves.

And do you think there is any mathematical order in history? No, of course, I know that, despite some criticism, history is a science, and the methods used in it are entirely scientific. But, nevertheless, is it possible to establish any historical laws? Or will history remain a description of life, more like a chronicle? As an example, I will cite the historian Grigory Kvasha. He identifies historical cycles of development - 12 years, 36, 144 ... For example, if you look at the 20th century, quite interesting historical events take place once every 12 years: 1905 - revolution, 1917 - revolution, 1929 - crisis and the NEP, 1941 - the beginning of the war, 1953 - the death of Stalin, 1965 - the USSR sent in tanks, it seems ... Well, I see ... from my point of view, this is a typical fit; facts can always be imposed on any desired pattern. But what do you think: can such attempts to find certain laws in history, as in mathematics, succeed?

Well, as a person familiar with the scientific method, I cannot seriously comment on numerology :) I will only note that every time there is some beautiful theory, there is a great temptation to fit the facts to it. Given the desire, one could find 13- and 14-year cycles too. Nevertheless, this does not cancel my dream and hope that, over time, a formalism will begin to develop in history that will allow us to formulate fundamental laws of historical development. Not in numerological terms, of course. This is a strategic dream of mine ... Historians overwhelmingly disagree, at least those with whom I have spoken. They regard history as a description and systematization of past events, not as a means of deriving universal laws. They believe there are no universal laws. It seems to me that such laws exist, because I see similar events regularly occurring in history, and similar actions leading to similar results. This suggests that there are universal laws of the development of society, and that knowing them would allow us not only to describe the events of the past better, but also to predict the development of the future. Here I am inspired by the series of novels by Isaac Asimov in which one of the main characters - a mathematician, by the way - managed to derive such fundamental laws, predict the future and even suggest ways to correct it in order to reduce the damage from inevitable wars and social disasters. This, of course, is a fairy tale. But a very beautiful one. A fairy tale I really want to believe in. But to try to establish these laws in real life, the methodology of historical research would have to be significantly revised. So far, our colleagues the historians cannot be persuaded to apply even the simplest methods of semantic analysis and topic modeling to historical texts and chronicles.
It seems that here we suffer from a peculiar disdain that representatives of many humanities hold toward mathematics, and from their unwillingness to master modern methods of automated processing of large volumes of information. Unfortunately, this disdain continues to be cultivated in humanities communities. This is a profound mistake. No one is going to apply mathematical methods of information processing in the humanities for the humanities scholars; they themselves must understand the limits of applicability of these methods and use them wisely. To this end, at HSE we launched a special cycle of educational courses, Data Culture, to teach humanities students modern mathematical methods of working with data.

Recently a research group at Skoltech received a megagrant from the Government, and your Bayesian methods group participated in it in some way; you have also received a grant from the Russian Science Foundation. Tell us why obtaining grants is so difficult for us in Russia. In the West this system is well developed: there are many such megagrants, project budgets run into the millions, and when you work on a large multi-year project you can properly recruit graduate students and postdocs. We, of course, have grants from the Russian Foundation for Basic Research, the Russian Science Foundation... but that is next to nothing. And here is a concrete example where your research team received a megagrant. Tell us how you did it and how you plan to develop within the project.

Let me correct you right away. You say that grants are hard to obtain in Russia. The issue is not how they are obtained, but their quantity and the quality of the review process by which they are distributed. The question of quantity should be addressed to the Ministry of Education and Science. There are problems with the review process as well. Officials cannot carry it out competently because they lack an understanding of what distinguishes good research from poor research, and many domestic scientists cannot carry it out competently for the same reason. Today the best scientific peer review takes place at leading international journals and leading conferences (I emphasize: leading, not merely any from the Web of Science list). It seems to me that the problems with review could be addressed by formal criteria based on the team's publications. But this, too, lies within the competence of the Ministry of Education and Science.

Next, let me correct you regarding the megagrant as well. The megagrant was awarded at Skoltech, the Skolkovo Institute of Science and Technology, to a team of several research groups, and we are only one of many participants, and not the main one.

How does one receive grants? It is hard for me to say; I do not have much experience here. We received that same RSF grant only on our third or fourth attempt. But my recipe is simple. You need a research plan, and the better systematized and more consistent it is, the greater the chances. You must convince the reviewer who reads your application that you understand what you are writing about, that you know what you will do, and how the different points of the plan relate to each other. This is far from always easy: it is not always clear what results a particular sub-study will yield, what you will be able to rely on in the next steps, and what you will not. Nevertheless, you need to hold some integral picture in your head. It also seems to me that a good overview of what is happening in the world greatly increases the chance of getting a grant. A good literature review shows that the team understands the current state of its field, positions its research clearly within it, and, at the very least, reads the scientific literature. Our applications always contain about 30-40 references. And do not forget your own publications: a publication in a good venue is also an indicator of your level as a researcher.

It seems you recently spoke with Yoshua Bengio. What did you talk about, and how did the meeting go?

Communicating with Yoshua Bengio is far from a trivial task. In a way I was lucky: I was invited to a forum on artificial intelligence organized by Samsung, where Bengio was also a speaker. A stroke for the portrait: Bengio arrived at the conference in the morning, gave a talk, sat in the hall for a while, attended a small reception, and then had to leave for the airport to fly to another venue where he would speak the next day. The man practically lives on an airplane. And I used to think I had a busy schedule when I had to fly to St. Petersburg or Kaliningrad and back in a single day... but such is the life of one of the world's leading scientists. I spoke with him briefly about his talk. In a nutshell, the essence of Bengio's talk was this: there is an opinion that we are one step away from creating artificial intelligence, but that is not the case. Artificial intelligence will not be created on the basis of modern deep neural networks! To the question of when artificial intelligence will be created, the answer was: "We do not know exactly when it will be created, but it certainly will be." I asked him: if not neural networks, then what? He replied: "I do not know yet; something else. The technological paradigm must change once again." This made me reconsider my position somewhat. Whereas I used to tell students in lectures that we are one step away from creating AI based on neural networks, now I am more cautious, since one of the leading researchers is so skeptical.

What surprised me in his point of view was that Bengio believes artificial intelligence will have consciousness. I think otherwise: if artificial intelligence is created in the foreseeable future, artificial consciousness will not be. And that is not even a bad thing. Unconscious artificial intelligence sounds like something safe. But artificial intelligence as an entity aware of itself as alive could potentially wonder whether it needs people. Therefore, in place of the governments thinking about restricting the development of artificial intelligence, I would start by monitoring research into the nature and origin of consciousness. Although for now, it seems to me, we are far from this. From Yoshua Bengio's point of view, however, the situation is exactly the opposite: artificial consciousness will appear before artificial intelligence.

A very interesting topic. I immediately want to ask what consciousness is and what artificial consciousness would be, but it is time to let Dmitry Petrovich go.

Oh yes, you can talk about consciousness ad infinitum.

Perhaps a final question and a wish for the audience. We run an open machine learning course; it is more practical, albeit with mathematics too, and we will get to Bayesian methods as well. Could you give a little advice: what is the best way to start exploring the Bayesian perspective on machine learning? It seems that Bishop's book follows exactly the Bayesian approach, and does so sequentially, from simple to complex. Perhaps you would advise something else, especially for people with different levels of mathematical preparation?

I do not recommend studying Bayesian methods without mathematical preparation. So the following advice is for those who have, relatively speaking, at least a solid B in a standard university course on probability theory and statistics. I really like Bishop's book Pattern Recognition and Machine Learning; it is indeed a consistent introduction to Bayesian machine learning. The only caveat: the book was written over 10 years ago and is, of course, somewhat dated. It describes things that happened before the "deep revolution" in machine learning. Therefore I would also recommend Murphy's book Machine Learning: A Probabilistic Perspective. It is more modern, includes examples of deep learning and of applying Bayesian methods to neural networks, and is well written. In short: if you just want to learn Bayesian methods, Bishop's book will do. If you are interested in their application in modern machine learning, Murphy's book is better. And if you want to learn how Bayesian methods are used in deep neural networks, read papers from the leading scientific conferences; there are no books on that yet.
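For readers starting out, the flavor of the Bayesian approach both books open with can be shown in a few lines: a conjugate Beta prior over a coin's bias updated by observed flips into a Beta posterior. The prior parameters and data below are illustrative choices, not taken from either book.

```python
# Conjugate Beta-Binomial update: the "hello world" of Bayesian inference.
alpha, beta = 1.0, 1.0   # Beta(1, 1): uniform prior over the coin bias theta
heads, tails = 7, 3      # observed data

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior,
# obtained by simply adding the observed counts to the prior parameters.
post_a, post_b = alpha + heads, beta + tails

posterior_mean = post_a / (post_a + post_b)  # E[theta | data] = 8/12
mle = heads / (heads + tails)                # frequentist point estimate = 0.7

print(posterior_mean, mle)
```

Note how the prior pulls the posterior mean (about 0.667) toward 0.5 relative to the maximum-likelihood estimate (0.7); with more data the two estimates converge, which is exactly the prior-versus-evidence trade-off that runs through the whole Bayesian perspective.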

PS. Dmitry Petrovich is open to your questions here in the comments.