Intel Perceptual Computing Challenge - An Inside Look


On September 23 at 23:59 GMT, submissions for the Intel Perceptual Computing Challenge closed. Now we, the contestants (including several Habr users, for example ithabr), are waiting for the results. And the results will probably not come soon: mid-October at best. So, in order not to go crazy while waiting, I decided to write this post about how the contest went and how development proceeded, from the point of view of our entry. After all, as the saying goes, if you don't know what to do, gnaw on your staff.

I hope other contestants will add their impressions in the comments, or even, who knows, in posts of their own.



Where possible I will try to refrain from advertising/PR for our project, but since it so happened that I took part in developing my own project and not anyone else's, its evolution is the one I can tell the most about. I hope someone finds it interesting, and maybe even useful.

Just in case, let me clarify: this is not a success story. Nobody knows (I think even the judges don't know yet) what place our project will take, or whether it will place at all. It is quite possible our application simply won't start for them because of some trifle we failed to foresee. It won't start, and they will simply move on to the next entry. And that, in my opinion, is exactly why it is better to write this article now than to later write yet another proud success story, or to silently not write a failure story.

Start


After the hackathon we knew that there would also be a big Intel contest based on the same PerC SDK, and we already knew we would take part. Finally, on May 6, the contest was announced. Applications with a description of the idea had to be submitted by June 26. The organizers were to select the 750 best ideas from the list, and only those would be allowed to continue in the contest.

The idea itself was basically ready. It had been formulated back at the hackathon: our project Virtualens (a virtual lens, translated from the foreign tongue). The idea was to solve one simple problem that comes up during video calls (in Skype, for example): sometimes there is something (or someone) in the background that you would rather not show to the other party. Socks scattered on the couch, a mountain of unwashed dishes, or simply another member of the household, say a girlfriend who does not want her hair to be accidentally seen in the background in full detail, or, if you are the girl, a boyfriend in boxer shorts whom it is better not to show to your mother in fine detail during a video call.

Since a camera that works with the PerC SDK (the Creative Senz3D, with other models promised) provides not only a picture but also depth information (a depth map), the solution suggested itself: emulate a shallow depth of field, so that only what lies within a given distance from the camera stays sharp, while everything further away is blurred. (In reality whatever is closer gets blurred too, and we even implemented that, but then dropped it: it only creates extra inconvenience for the user.)
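The core of the idea fits in a few dozen lines. Below is a minimal sketch of it, not our actual filter code: it assumes the depth map is already mapped onto the RGB frame, uses a naive per-pixel box blur, and the kernel growth and thresholds are purely illustrative (the real thing has to run in real time, so it is organized quite differently).

```cpp
// Sketch of the core Virtualens idea: blur pixels whose depth lies beyond a
// user-chosen focus distance, keep the rest sharp. Naive per-pixel box blur,
// purely for illustration; the shipped filter is organized very differently.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Frame {
    int width = 0, height = 0;
    std::vector<uint8_t> rgb;    // width * height * 3, depth already mapped onto it
    std::vector<float>   depth;  // width * height, metres; <= 0 means "unknown"
};

// Blur radius grows with how far the pixel is behind the focus distance.
static int blurRadius(float depth, float focusDist, float maxRadius) {
    if (depth <= 0.0f) return static_cast<int>(maxRadius); // unknown: safer to blur
    float behind = depth - focusDist;
    if (behind <= 0.0f) return 0;                          // in focus (or closer)
    return static_cast<int>(std::min(maxRadius, behind * maxRadius));
}

Frame applyVirtualLens(const Frame& in, float focusDist, float maxRadius) {
    Frame out = in;
    for (int y = 0; y < in.height; ++y) {
        for (int x = 0; x < in.width; ++x) {
            int r = blurRadius(in.depth[y * in.width + x], focusDist, maxRadius);
            if (r == 0) continue;                          // keep the pixel sharp
            int sum[3] = {0, 0, 0}, n = 0;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    int yy = std::clamp(y + dy, 0, in.height - 1);
                    int xx = std::clamp(x + dx, 0, in.width - 1);
                    const uint8_t* p = &in.rgb[(yy * in.width + xx) * 3];
                    sum[0] += p[0]; sum[1] += p[1]; sum[2] += p[2];
                    ++n;
                }
            }
            uint8_t* q = &out.rgb[(y * in.width + x) * 3];
            q[0] = static_cast<uint8_t>(sum[0] / n);
            q[1] = static_cast<uint8_t>(sum[1] / n);
            q[2] = static_cast<uint8_t>(sum[2] / n);
        }
    }
    return out;
}
```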

Algorithms

We even had a prototype of sorts, the one we threw together at the hackathon. And we understood that a lot of work remained. So I took the whole of June off, and spent the entire vacation on the algorithms. There was plenty to do. For example, the holes in the depth map after mapping it onto the RGB picture (see the picture at the top of the post: the green areas are there for a reason, those are all the points for which depth information is missing).

Points with unknown depth appear for several reasons:
  • Some points are simply too far away: the camera's "IR flashlight" cannot reach them
  • Other points, on the contrary, are too close, or the material is shiny, or the sun is shining from that side; either way we get a flare and, again, no depth information for the point
  • In the picture you can see the shadow cast on the curtain by the hand. It appears because the "IR flashlight" and the camera lens are not at the same point. You can observe exactly the same effect when shooting with a phone that has a built-in flash
  • And finally, the main and most evil one: the artifacts of stretching the depth map onto the RGB image. The RGB camera and the depth camera have different fields of view, so when you try to combine the pictures you get roughly what happens when you try to wrap a ball in a flat sheet of paper: it will not wrap, you have to cut it into pieces and glue them together. That is how the picture ends up "cracked" (see the sketch after this list). Of course this could be done more accurately, with interpolation and so on, but the point is that it has to be done very fast: if the PerC SDK replaced the current algorithm with some other wonderful one that did it perfectly but loaded the CPU and delivered 10 fps, nobody would need it.
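To make the "cracks" concrete: for each depth pixel the SDK essentially provides the coordinates of the colour pixel it lands on (a UV map). The simplest mapping is a nearest-neighbour scatter, and every colour pixel that no depth pixel happens to land on stays unknown; those are the cracks. A schematic version follows; the names and the "unknown" convention are my own, not the SDK's.

```cpp
// Schematic nearest-neighbour mapping of the depth image onto the RGB image.
// uv[i] holds, for depth pixel i, its normalized (u,v) position in the colour
// frame (the kind of per-pixel UV map the PerC SDK exposes). Colour pixels
// that no depth pixel lands on keep the "unknown" value -1: the cracks that
// later have to be filled. Names and conventions are illustrative.
#include <vector>

struct UV { float u, v; };   // normalized [0..1) coordinates, < 0 if invalid

std::vector<float> mapDepthToColor(const std::vector<float>& depth,  // depth frame
                                   const std::vector<UV>& uv,        // same size
                                   int colorW, int colorH) {
    std::vector<float> out(static_cast<size_t>(colorW) * colorH, -1.0f); // unknown
    for (size_t i = 0; i < depth.size(); ++i) {
        if (uv[i].u < 0.0f || uv[i].v < 0.0f) continue;   // no valid mapping
        int cx = static_cast<int>(uv[i].u * colorW);
        int cy = static_cast<int>(uv[i].v * colorH);
        if (cx < 0 || cx >= colorW || cy < 0 || cy >= colorH) continue;
        out[static_cast<size_t>(cy) * colorW + cx] = depth[i];
    }
    return out;
}
```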


I read all kinds of research on the topic and looked at the classic inpainting algorithms. The Telea algorithm, say, was completely unsuitable for this task because of its performance. In the end we managed to write a heuristic algorithm that fits our task well (but is not a solution in the general case).
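Our actual heuristic is project-specific, so I will not pretend the snippet below is it. It only illustrates the general trade-off we went for: something cheap, like propagating the nearest known depth in a couple of scanline sweeps, rather than full inpainting.

```cpp
// Crude illustration of cheap depth-hole filling: two scanline sweeps that
// propagate the nearest known depth into unknown pixels. This is NOT the
// heuristic we shipped, just the "fast and good enough for this task" idea,
// as opposed to full inpainting (e.g. Telea), which was too slow for a
// real-time camera filter.
#include <vector>

void fillDepthHoles(std::vector<float>& depth, int width, int height) {
    // Forward sweep: carry the last known value to the right and downwards.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& d = depth[y * width + x];
            if (d > 0.0f) continue;                        // known pixel, keep it
            if (x > 0 && depth[y * width + x - 1] > 0.0f)
                d = depth[y * width + x - 1];
            else if (y > 0 && depth[(y - 1) * width + x] > 0.0f)
                d = depth[(y - 1) * width + x];
        }
    }
    // Backward sweep: fill whatever is still unknown from the right/bottom.
    for (int y = height - 1; y >= 0; --y) {
        for (int x = width - 1; x >= 0; --x) {
            float& d = depth[y * width + x];
            if (d > 0.0f) continue;
            if (x + 1 < width && depth[y * width + x + 1] > 0.0f)
                d = depth[y * width + x + 1];
            else if (y + 1 < height && depth[(y + 1) * width + x] > 0.0f)
                d = depth[(y + 1) * width + x];
        }
    }
}
```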

Filing an application

At the last moment the contest organizers sprang a surprise. A pleasant one. The application deadline was pushed back by a week, to July 1. So not only did I have time to polish the algorithms, we also managed to bring the whole thing to the state of a tech demo and even shoot and edit a demonstration video, albeit without sound. It came out like this:


Here we still had the version that also blurs whatever is too close to the camera.

Videos from other participants' applications started appearing on YouTube. Here, for example, is an idea/demo I found really cool:


Finalists


The results of the application selection became known only on July 12, that is, twelve days after submission. The finalists were to be sent cameras. We already had one camera (from the hackathon), but we needed another: there were two of us and only one camera. Besides, a second unit was important in order to assess how much calibration varies between cameras. The thing is, the camera we had was flawed: after mapping depth onto RGB, the result was shifted (see the picture: the green dots are the depth data corresponding to the hand). And it was unclear whether all cameras were like this, or whether every unit would behave differently. For our task this was critical, though for many others it was not.

While waiting for the camera, we solved two more problems.

Firstly, it was not enough to blur the image where needed and leave it sharp where not; we also had to deliver the resulting video stream to the target application (say, Skype). In other words, we had to pretend to be a camera. After some experiments we settled on writing a DirectShow source filter. Alas, Microsoft has cut DirectShow out of Metro applications; there is Media Foundation now, but in Media Foundation you can no longer pretend to be a camera for someone else's application. To support Metro you have to write a full-fledged driver, which is far more involved. So we decided to drop Metro support in the contest version of the program; that gave us a better chance of getting the application into shape by the deadline.
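For those who have not dealt with this: to look like a webcam to Skype, the filter has to be a DirectShow source with one output video pin. Below is a bare-bones skeleton in the spirit of the well-known "VCam" sample, built on the DirectShow base classes (streams.h). It is only a sketch: COM/registry registration, full media-type negotiation and the actual frame source (our blurred stream) are omitted, and the class names and the CLSID are made up for illustration.

```cpp
// Bare-bones skeleton of a DirectShow virtual-camera source filter, in the
// spirit of the classic "VCam" sample, using the DirectShow base classes.
// Registration, proper negotiation and the real frame source are omitted;
// names and the CLSID are illustrative.
#include <streams.h>
#include <initguid.h>

// Placeholder CLSID -- a real filter needs its own generated GUID.
DEFINE_GUID(CLSID_VirtualLensCam,
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x01);

class CVirtualLensStream : public CSourceStream {
public:
    CVirtualLensStream(HRESULT* phr, CSource* pFilter)
        : CSourceStream(NAME("VirtualLens out"), phr, pFilter, L"Out") {}

    // Offer a single fixed format: 640x480 RGB24 at 30 fps.
    HRESULT GetMediaType(CMediaType* pmt) override {
        VIDEOINFOHEADER* vih = reinterpret_cast<VIDEOINFOHEADER*>(
            pmt->AllocFormatBuffer(sizeof(VIDEOINFOHEADER)));
        ZeroMemory(vih, sizeof(VIDEOINFOHEADER));
        vih->bmiHeader.biSize      = sizeof(BITMAPINFOHEADER);
        vih->bmiHeader.biWidth     = 640;
        vih->bmiHeader.biHeight    = 480;
        vih->bmiHeader.biPlanes    = 1;
        vih->bmiHeader.biBitCount  = 24;
        vih->bmiHeader.biSizeImage = 640 * 480 * 3;
        vih->AvgTimePerFrame       = 10000000 / 30;       // 100 ns units
        pmt->SetType(&MEDIATYPE_Video);
        pmt->SetSubtype(&MEDIASUBTYPE_RGB24);
        pmt->SetFormatType(&FORMAT_VideoInfo);
        pmt->SetSampleSize(vih->bmiHeader.biSizeImage);
        return S_OK;
    }

    HRESULT DecideBufferSize(IMemAllocator* pAlloc,
                             ALLOCATOR_PROPERTIES* pReq) override {
        pReq->cBuffers = 1;
        pReq->cbBuffer = 640 * 480 * 3;
        ALLOCATOR_PROPERTIES actual;
        return pAlloc->SetProperties(pReq, &actual);
    }

    // Called by the base class for every frame the client application pulls.
    HRESULT FillBuffer(IMediaSample* pSample) override {
        BYTE* pData = nullptr;
        pSample->GetPointer(&pData);
        // The real filter copies the current blurred frame produced by the
        // Virtualens pipeline here; the sketch just outputs black.
        ZeroMemory(pData, pSample->GetSize());
        return S_OK;
    }
};

class CVirtualLensCam : public CSource {
public:
    static CUnknown* WINAPI CreateInstance(LPUNKNOWN pUnk, HRESULT* phr) {
        return new CVirtualLensCam(pUnk, phr);
    }
private:
    CVirtualLensCam(LPUNKNOWN pUnk, HRESULT* phr)
        : CSource(NAME("VirtualLens"), pUnk, CLSID_VirtualLensCam) {
        new CVirtualLensStream(phr, this);   // registers itself with CSource
    }
};
```

On top of this, the filter still has to be registered in the "Video Capture Sources" category so that Skype lists it among the cameras; that part is pure boilerplate and is left out here.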

Secondly, we were looking for someone who was not a programmer, but quite the opposite: a designer and creative person, someone who understands humans rather than compilers. We needed icons, we needed a minimal but neat interface, we had to shoot a video, and so on. Oh yes, that person also had to speak intelligible English in order to narrate the video. After ten days of searching we found him!

So our team grew to three people:
  • Nadia, a GUI programmer, a C# expert and, in general, a clever beauty
  • Lech, our friend and creative inspirer; he helped with the UI design, drew pictures and icons, shot and edited the video, and it is his voice you will hear in all our English-language videos
  • Yours truly, valexey: algorithms and everything system-level, plus design and ideology


At the same time, the long-awaited camera arrived.

Development


Overjoyed at finally having a person who knows which end of Photoshop to hold, we started coming up with icon options for the application (the most important thing in the program, yeah, especially with a month left to the deadline and nothing ready; although it is only clear to me now that nothing was ready, back then it seemed everything was almost done!). We tried probably ten variants and spent about a week on it. Then we put the icon into the system tray (where it belongs: while our virtual camera is in use, an application lives in the tray through which you can quickly tweak things if you don't want to use gestures) and realized it did not look good there at all. So we had to redo it.

Then we started designing the program's interface. Obviously, the user will mostly not see it: they work with Skype, or Google Hangouts, or whatever, and configure Virtualens mainly with gestures. But sometimes, for example the very first time, they will open the settings dialog, where gestures can be switched on and off and where you can adjust the distance beyond which to blur, and the blur strength (although these two parameters are more convenient to set with gestures). Besides settings, this dialog was also supposed to play a teaching role: to show which gestures work and how to perform them. The gestures are simple, but people are not used to them yet, and at least the first time it helps the user to see how someone does them right.

We also had an idea of how to show the blur strength and how it depends on distance. A GUI component was invented and even implemented for it, which simultaneously plotted the dependence of the convolution-kernel size (blur strength) on the distance from the camera and an example of how exactly the picture would be blurred with that combination. But after several people interpreted what it showed in completely different ways, we decided to drop it. It turned out a bit too... mathematical.

In the end we decided to simplify the GUI as much as possible and make it look as native (we used WinForms for the GUI; by default it does not look native, so we had to tinker a bit) and as lightweight as possible. And for teaching the user we borrowed an idea from Apple, from the trackpad settings dialog: playing video demonstrations of gestures when hovering over the corresponding item:


And here is what we ended up with:


The "play" buttons were not added just for the sake of it: they are for people using a touchscreen, because in that case the user will not be hovering with a mouse and would most likely never guess that you can tap there to play the video. By default this video area shows a preview from the camera (already blurred where and as needed).

Camera calibration

The camera that arrived made us sad. It had no offset (or almost none). That meant either this particular unit was lucky or all the new cameras were fine; in any case it meant we did not know what the user (the judge) would have: which camera, and with what offset of the depth data relative to RGB. We had to think about calibration, and apparently one that would be as simple and user-friendly as possible. That was hard. Very hard, and it threatened to bury the whole project. The hard part is not doing the calibration as such, but making it so that anyone far removed from technology can do it.

Fortunately, on August 9 a Q&A session on the Perceptual Computing Challenge took place. Yes, in Russian, with Russian developers from Intel (before that there had also been an English-language webinar, which seemed rather useless to me). At that session I asked about calibration and about the problem. They answered that yes, the problem is known, that the poorly fastened sensor in the old cameras can shift during transport, and that they would soon release a calibration tool. In other words, they took the problem upon themselves. Phew! We exhaled. Thank you, Intel.

Shooting video

The deadline was approaching and we needed to shoot a video about the project and about ourselves. To do that we armed ourselves with a Man with a Camera and went to a studio that, very inexpensively, agreed to tolerate our presence for one evening.

I never thought it was so hard to talk to a camera. I have given interviews before (including on camera), but that is not the same thing at all. To record a video with a coherent story you need many, many takes. Doing it as a pair is especially hard: while one is speaking, the other is sure to be looking the wrong way.

But we shot it. And then we threw it all out and re-shot it at another time, in another place, in a different format, and this time entirely on our own. That was already in September.

Postponement

On August 10, sixteen days before the Final Great Deadline, the organizers announced that everything was being rescheduled again, to September 23. In effect they gave us another month, but I would not say that month went very well for us. We relaxed a bit. We rested a little. Then the weather suddenly turned colder and we got a bit sick. I, for example, tried, unsuccessfully, to split myself between my day job and the project, and there was not enough strength for either. It was impossible to completely drop one or the other. As a result, progress in both was more than modest. Alas.

Still, we did get something done in the second half of August and the beginning of September. It was not so much programming as looking around to see what similar things already exist in the world. And there was something to see. For example, Google Hangouts already has built-in background blur and replacement. There is just one nuance: when you leave the frame, your transparent outline stays behind, and through it everything happening in the room is visible.

And we discovered an even scarier competitor: Personify for Skype. It removes the background (rather than blurring it), it ships together with the camera, and Intel is friendly with them. They are already where we only dream of being! It is implemented as a Skype plugin (so that is what Intel people meant when, in conversations, they kept assuming our project was also a Skype plugin! I was still surprised, because doing this as a plugin is, well, somewhat strange, and, in theory, knowledge of the subtleties of Skype's plugin machinery is outside the competence of the Intel employees working on the PerC SDK). However, installing this application revealed a strange thing: it is not designed very well. And there was a bug (I did not figure out how they achieved this): while their plugin is running (and it starts right after Skype launches, grabs the camera and loads my laptop's CPU to 70%, that is, without any call in progress), it is impossible to use other PerC SDK applications that use depth and RGB at the same time.

In general, in the end we were not discouraged, quite the opposite: we got a boost of optimism. We can do better!

But the bulk of the work, of course, as always, fell on the last week (who would have doubted it!).

The last week


Suddenly there was a week left before the deadline. We still had to re-shoot the video, get the various components living in different address spaces talking to each other, make the quick-settings dialog, make an installer, and sort out a thousand other vital details. The core functionality worked, but without the small things it would not fly! And the small things were anything but trivial; for example, a small memory leak turned up in the GUI...
That week we worked twenty hours a day.

We reworked the gestures. Initially we assumed the user would clap their hands in front of the camera lens to enter the settings mode. It turned out this is very unpleasant for the person on the other end of the call. So we switched to a combination of gestures: the user first makes a V gesture (as in the picture at the beginning of the post), the Virtualens icon appears on the preview, confirming that the gesture was recognized, and after that it is enough to wave a hand. It turned out very intuitive: first you address Virtualens by showing the first letter of its name on your fingers, then you wave to it. Exiting the settings is the same V, only this time without the wave.
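The "V, then wave" combination is essentially a tiny state machine on top of whatever gesture events the SDK reports. A schematic version is below; the Gesture enum stands in for the SDK's gesture labels, and the timeout value is illustrative, not what we actually shipped.

```cpp
// Schematic state machine for the "V, then wave" activation combo. The
// Gesture enum stands in for whatever events the gesture module reports;
// the 3-second timeout is an illustrative value.
#include <chrono>

enum class Gesture { None, VSign, Wave };

class ComboRecognizer {
public:
    // Feed every recognized gesture in here; returns true when the
    // full "V then wave" combination has just completed.
    bool onGesture(Gesture g) {
        using clock = std::chrono::steady_clock;
        auto now = clock::now();
        if (g == Gesture::VSign) {
            armed_ = true;               // show the Virtualens icon on the preview
            armedAt_ = now;
            return false;
        }
        if (g == Gesture::Wave && armed_ &&
            now - armedAt_ < std::chrono::seconds(3)) {
            armed_ = false;              // combo completed: enter settings mode
            return true;
        }
        return false;
    }
    bool armed() const { return armed_; } // whether the icon should be shown

private:
    bool armed_ = false;
    std::chrono::steady_clock::time_point armedAt_{};
};
```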

We also made the camera image mute when the user covers the lens with a hand: instead of hunting for the video-mute button in Skype, you just cover the lens and go deal with whatever came up.
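Detecting "the lens is covered by a hand" does not really need gesture recognition: a glance at the depth statistics is enough. Roughly like this (the thresholds are illustrative, not our production values):

```cpp
// Rough "lens covered by a hand" check: if most depth pixels are either
// extremely close or unknown (a hand right on the lens blinds the IR camera),
// the outgoing video should be muted. Thresholds are illustrative.
#include <vector>

bool lensCovered(const std::vector<float>& depth,   // metres, <= 0 == unknown
                 float nearLimit = 0.15f,            // "right at the lens"
                 float coveredFraction = 0.8f) {
    if (depth.empty()) return false;
    size_t covered = 0;
    for (float d : depth)
        if (d <= 0.0f || d < nearLimit) ++covered;
    return covered >= depth.size() * coveredFraction;
}
```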

We shot the video ourselves this time, on the street, and edited it. We caught and squashed a frightening swarm of bugs (how many had we introduced, coding at such a pace?!). We shot a video demonstrating the gestures. Nadia got sick from the street shoot: she caught a cold. The day before the deadline she said, "I won't fall asleep until morning anyway, so I'll finish the project!" And finish it she did: that night she solved two very vicious problems that suddenly surfaced. The next day she was admitted to hospital.

My sister helped me a great deal with the installer: she did the main work, I only had to fix a few details.

Two hours before the deadline everything seemed ready to me (in fact it later turned out that not everything was). We were only waiting for the final video to be edited with the English voice-over and rendered. I wait. An hour left to the deadline. I wait. Thirty minutes left: I start feverishly overlaying English subtitles on the video myself. Fifteen minutes before the deadline I call Alexei to ask how things are going. It turns out he had started the render, which takes about twenty minutes... and then passed out on the spot: he had fallen asleep! People have a breaking point too. But we made it. Four minutes before the deadline everything was sent.

We made it in time.

Summary


As a result, I forgot to write in the readme, in the project description, that Skype has to be restarted after installation. I added it to the video description on YouTube, but there is no guarantee anyone will notice it. Now a nightmare torments me: the judge runs the installer, the Virtualens item does not appear among Skype's cameras, they uninstall it and move on to the next project.

I don't know, we don't know, what the results of the contest will be. We can't even guess. But we know it has already paid off: paid off in experience. I did not know I could work so productively (even if only for a few days). This contest served as a benchmark of our abilities. The project will not end with the contest, though. We have now handed out our cameras to several people to test our program, and Personify for Skype as well. We need to understand what we got right, what we got wrong, and what needs fixing.

Project Videos


Our main video:

I don't think anyone here is interested in a video about installing something and then restarting Skype. But the settings dialog is, I think, worth seeing live:

And here is how to configure Virtualens if your client has no webcam-options button:


Other projects: in general, you can go here and watch all the videos that are no more than a month old; those will be the final contest entries.

Also, one participant has already combed YouTube and collected all the projects he could find (about 110 of them): software.intel.com/en-us/forums/topic/474069
And here is what I personally liked:
I liked ithabr's project, but he asked that the link to his video not be shared, so I won't.

This one is also from Russia, from Samara to be precise:


And this one is simply magic. In the literal sense: spells in the game are cast through the PerC SDK. You can feel like a wizard to the fullest, leveling up not abstract stats but your own dexterity and smoothness of hand:


Thanks to everyone who read this wall of text to the end.