JD Humanoid Robot and Microsoft Cognitive Services

Original author: Marek Lani
Today we want to tell you about an interesting project. It uses Microsoft Cognitive Services, which make it easy to apply artificial intelligence by calling a REST API (no model training is needed). And all of this is demonstrated on the cute JD Humanoid robot. More details under the cut!



Cognitive Services also provides client libraries for various programming languages, which simplifies their use even further. We decided to integrate these services into an application that controls the JD Humanoid robot from EZ-Robot (hereinafter simply "EZ robot"). We chose this particular humanoid robot because it is quite simple to assemble. In addition, it ships with .NET, UWP and even Mono SDKs, which opens up broad possibilities for custom implementations.

Also included is the EZ Builder application, which allows you to control the EZ robot and implement specific scenarios based on built-in functional plug-ins and blocks. It is mainly intended for educational purposes, but it also has a movement creator, which lets you design robot movements and export them for use in applications built with the SDK. This application is a great starting point for setting up and getting started with any EZ robot.

Our application is based on an existing Windows Forms project that uses the EZ robot's own capabilities and already implements work with the camera. Taking this application as a basis, we extended it and connected it to the following Cognitive Services: Face API, Emotion API, Speech API, Voice Recognition API, Language Understanding Intelligent Service, Speaker Recognition API, Computer Vision API and Custom Vision API. Thanks to this, the robot gained new features:

  • Voice command recognition - in addition to the buttons of the WinForms application that trigger certain robot actions, we added speech recognition and natural language understanding so that our EZ robot understands commands spoken aloud.
  • Face recognition and identification - the EZ robot is able to detect faces and several of their attributes, as well as identify people by their faces.
  • Emotion recognition - while recognizing faces, the EZ robot also detects emotions.
  • Speaker recognition - the EZ robot is able to recognize people by their voice.
  • Computer vision - the EZ robot can describe its surroundings.
  • Custom object recognition - the EZ robot is able to recognize specific objects placed in its field of view.

The cognitive capabilities of the robot are demonstrated in this video.

Working with the EZ Robot SDK


Later in this article we will briefly describe how to work with the EZ Robot SDK and then explain in detail the scenarios implemented with Cognitive Services.

Connect to EZ Robot


The main prerequisite for the application is the ability to connect to the robot from our code, invoke movements and receive images from its camera. The application runs on the developer's computer, because the EZ robot has no runtime or storage of its own where the application could be deployed and launched.

Thus, the application runs on a computer that connects to the EZ robot over WiFi, either directly through an access point exposed by the robot itself (AP mode) or through a WiFi network created by a router (client mode). We chose the second option, because it keeps an Internet connection available while developing and running the application. Network setup for the EZ robot is described in detail here. Once connected to the WiFi network, the EZ robot is assigned an IP address, which is then used to connect to the robot from the application. With the SDK, the procedure is as follows:

using EZ_B;

//Connect to the EZ robot using the SDK
this.ezb = new EZB();
this.ezb.Connect("robotIPAddress");

Since the EZ robot camera is a separate network device, we also need to connect to it before we can use it.

this.camera = new Camera(this.ezb);
this.camera.StartCamera(new ValuePair("EZB://" + "robotIPAddress"), CameraWidth, CameraHeight);

The official documentation for the EZ Robot SDK provides detailed examples of invoking special functions of the EZ Robot that become available after connecting to it.

Creating Robot Movements


The SDK allows you to interact with the robot's servomotors and control its movements. To create a movement, you must define frames (specific positions) and actions consisting of a sequence of such frames. Doing this by hand in code is not easy, but in the EZ Builder application you can define frames, combine them into actions, and then export them as code for use in your application. To do this, create a new project in EZ Builder, add the Auto Position plugin and click the gear button.



Figure 1. Auto Position plugin
You can then create new frames on the Frames panel by changing the angles of the robot's servomotors. The next step is to assemble the required movements from the existing frames on the Action panel.



Figure 2. Auto Position frames
Figure 3. Auto Position actions
The created action can be exported through the import / export toolbar.



Figure 4. Exporting the source code of the Auto Position plugin
After exporting an action, you can copy the code and paste it into your application. If we use several different positions, we should rename the generated AutoPositions class to a name that accurately reflects the type of movement. It can then be used in code as follows:

//The WavePositions class was generated as AutoPositions and then renamed
private WavePositions wavePosition;

//Handler for EZ robot connection state changes
private void EzbOnConnectionChange(bool isConnected)
{
    this.ezbConnectionStatusChangedWaitHandle.Set();

    if (isConnected)
    {
      //Once connected to the robot, a WavePositions instance is created
      wavePosition = new WavePositions(ezb);
    }
}

//Method invoking the Wave action
private async void Wave()
{
    wavePosition.StartAction_Wave();
    //The robot repeats the Wave action for 5 seconds
    await Task.Delay(5000);
    wavePosition.Stop();
  
    //Return the robot to its initial position
    ezb.Servo.ReleaseAllServos();
}

Receiving a camera image


Since the application uses images from the robot's camera as input for Cognitive Services calls, we need a way to obtain these images. It is done like this:

var currentBitmap = camera.GetCurrentBitmap;
MemoryStream memoryStream = new MemoryStream();
currentBitmap.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Jpeg);
memoryStream.Seek(0, SeekOrigin.Begin);
//We now have an in-memory stream that can be sent to Cognitive Services

Voice functions of the robot


As already mentioned, the application runs on the developer's computer and simply sends commands to the robot; the robot has no runtime of its own. If you need to synthesize speech (in other words, you want the robot to say a few phrases), you can choose one of two options offered by the SDK. The first option uses the default audio device: the sound is played by the developer's computer, not by the robot's speaker. This is convenient when you want the robot to speak through the computer's speakers, for example during presentations. In most cases, however, it is preferable for the sound to come from the robot itself. Both ways of invoking the audio function through the SDK are shown below:

//Using the default audio device on the developer's computer
ezb.SpeechSynth.Say("Text to speak");

//Speech synthesis played directly through the robot's built-in speaker
ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Text to speak"));

Cognitive Services Integration


In this section, we will look at specific scenarios of cognitive functions implemented in a robot control application.

Voice recognition


You can control the robot by giving voice commands through a microphone connected to the developer's computer, since the EZ robot itself has no microphone. At first we used the EZ robot SDK for speech recognition. As it turned out, recognition was not accurate enough, and the robot performed wrong actions based on misunderstood commands. To improve recognition accuracy and allow more freedom in how commands are phrased, we decided to use the Microsoft Speech API, which converts speech to text, together with the Language Understanding Intelligent Service (LUIS), which recognizes the action requested by a particular command. Follow the links to these products for more information and to get started.

First, we create a LUIS application in which each command is bound to the action we need. The process of creating a LUIS application is described in this getting-started guide. LUIS offers a web interface where you can easily create an application and define the necessary intents. If needed, you can also create entities that the LUIS application will extract from the commands sent to the service. The export of our LUIS application is contained in the LUIS Model folder of this repository.

After preparing the LUIS application, we implement the following logic: wait for a voice command, call the Microsoft Speech API, and then call the LUIS recognition service. As a basis for this functionality, we used the following sample.

It contains the logic for recognizing long and short phrases from a microphone or from a .wav file, with subsequent intent recognition with or without LUIS.

We used the MicrophoneRecognitionClientWithIntent class, which waits for commands from the microphone, recognizes the speech and determines the required action. The short-phrase listening function is started in the SayCommandButton_Click handler.

using Microsoft.CognitiveServices.SpeechRecognition;

private void SayCommandButton_Click(object sender, EventArgs e)
{
  WriteDebug("---Начало записи через микрофон с распознаванием действия ----");

  this.micClient =
    SpeechRecognitionServiceFactory.CreateMicrophoneClientWithIntentUsingEndpointUrl(
    this.DefaultLocale,
    Settings.Instance.SpeechRecognitionApiKey,
    Settings.Instance.LuisEndpoint);
  this.micClient.AuthenticationUri = "";
  
  //Handler invoked when the required intent is recognized
  this.micClient.OnIntent += this.OnIntentHandler;
  this.micClient.OnMicrophoneStatus += this.OnMicrophoneStatus;

  //Event handlers for speech recognition results
  this.micClient.OnPartialResponseReceived += this.OnPartialResponseReceivedHandler;
  this.micClient.OnResponseReceived += this.OnMicShortPhraseResponseReceivedHandler;
  this.micClient.OnConversationError += this.OnConversationErrorHandler;

  //Start speech recognition from the microphone
  this.micClient.StartMicAndRecognition();
}

The command invocation logic lives in the OnIntentHandler handler, where we analyze the response received from the LUIS service.

private async void OnIntentHandler(object sender, SpeechIntentEventArgs e)
{
  WriteDebug("---Получение действия обработчиком OnIntentHandler () ---");
  dynamic intenIdentificationResult = JObject.Parse(e.Payload);
  var res = intenIdentificationResult["topScoringIntent"];
  var intent = Convert.ToString(res["intent"]);

  switch (intent)
  {
    case "TrackFace":
      {
        //Start face tracking and recognition
        ToggleFaceRecognitionEvent?.Invoke(this, null);
        break;
      }

    case "ComputerVision":
      {
        var currentBitmap = camera.GetCurrentBitmap;
        var cvc = new CustomVisionCommunicator(Settings.Instance.PredictionKey, Settings.Instance.VisionApiKey, Settings.Instance.VisionApiProjectId, Settings.Instance.VisionApiIterationId);
        var description = await cvc.RecognizeObjectsInImage(currentBitmap);
        ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream(description));
        break;
      }
      
    //... other actions/commands
      
    default: break;

  }
}

Recognition and identification of faces and emotions


To implement face and emotion recognition, we used the Face API and Emotion API services. Follow the links above to learn more about these services, how to get started with them, how to create an API key and how to integrate them into your application.

The EZ robot is able to detect faces on its own using the SDK, without calling Cognitive Services. However, this is only basic face detection without additional attributes (such as age, gender, facial hair and so on). We still use this local detection: it tells us whether there is a face in the image at all, and only then do we call the Cognitive Services Face API to get additional face attributes. This avoids unnecessary API calls.

The EZ robot's face detection also provides the location of the face in the picture. We use this so that the robot turns its head and points its camera directly at the face. We borrowed this code from the Windows Forms application that serves as the basis of our project, and added a sensitivity parameter that determines how quickly the position of the robot's head is adjusted.

For the Face API to identify specific people, we need to create a person group, register the people in it and train the recognition model. We were able to do this effortlessly using the Intelligent Kiosk Sample application, which can be downloaded from GitHub. Remember to use the same Face API key for the Intelligent Kiosk app and the robot app.

For better identification accuracy, it is advisable to train the model on image samples captured by the camera that will be used later (the model is then trained on images of the same quality, which improves the results of the Face Identification API). To do this, we implemented simple logic that saves images from the robot's camera; these images are later used to train the Cognitive Services models:

//Saving images for the training data set
var currentBitmap = camera.GetCurrentBitmap;
currentBitmap.Save(Guid.NewGuid().ToString() + ".jpg", ImageFormat.Jpeg);
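
Instead of the Intelligent Kiosk, the person group could also be prepared in code. The following is a minimal sketch, assuming the same Microsoft.ProjectOxford.Face client used later in this article; the group id, person name and file name are hypothetical placeholders.

using System.IO;
using System.Threading.Tasks;
using Microsoft.ProjectOxford.Face;
using Microsoft.ProjectOxford.Face.Contract;

public static class FaceRegistrationSample
{
    //Sketch only: creates a person group, registers one person with one face image
    //captured by the robot camera, and trains the group. All names are hypothetical.
    public static async Task RegisterPersonAsync(string faceApiKey)
    {
        var fsc = new FaceServiceClient(faceApiKey, "https://westeurope.api.cognitive.microsoft.com/face/v1.0/");

        const string groupId = "robot-friends";
        await fsc.CreatePersonGroupAsync(groupId, "Robot friends");

        //Create the person and attach a face image saved earlier from the robot camera
        var person = await fsc.CreatePersonAsync(groupId, "Marek");
        using (var imageStream = File.OpenRead("marek-sample.jpg"))
        {
            await fsc.AddPersonFaceAsync(groupId, person.PersonId, imageStream);
        }

        //Train the group; identification only works once training has finished
        await fsc.TrainPersonGroupAsync(groupId);
        var trainingStatus = await fsc.GetPersonGroupTrainingStatusAsync(groupId);
        while (trainingStatus.Status != Status.Succeeded && trainingStatus.Status != Status.Failed)
        {
            await Task.Delay(1000);
            trainingStatus = await fsc.GetPersonGroupTrainingStatusAsync(groupId);
        }
    }
}

After training succeeds, identification calls such as the one in FaceApiCommunicator below can match detected faces against this group.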

Next, we launch the HeadTracking method, which handles face tracking, detection and identification. In short, this method first checks whether a face is in front of the robot camera. If so, the position of the robot's head is adjusted accordingly (face tracking). Then FaceApiCommunicator is called, which in turn calls the Face API (face detection and identification) and the Emotion API. The last part of the method processes the results obtained from the Cognitive Services APIs.

If the face is identified, the robot says "Hello!" and adds the person's name, unless it detects a sad expression on the face (using the Emotion API), in which case it tells a joke to cheer the person up. If the person could not be identified, the robot simply says hello; it distinguishes between men and women and chooses the phrase accordingly. Based on the Face API results, the robot also makes a guess about the person's age.

private async void HeadTracking()
{

  if (!this.headTrackingActive)
  {
    return;
  }

  var faceLocations = this.camera.CameraFaceDetection.GetFaceDetection(32, 1000, 1);
  if (faceLocations.Length > 0)
  {
    //Face detection is reported only once per second
    if (this.fpsCounter == 1)
    {
      foreach (var objectLocation in faceLocations)
      {
        this.WriteDebug(string.Format("Face detected at H:{0} V:{1}", objectLocation.HorizontalLocation, objectLocation.VerticalLocation));
      }
    }
  }

  //Return if no face was detected
  if (faceLocations.Length == 0)
  {
    return;
  }

  //Take the first detected face location (ONLY ONE)
  var faceLocation = faceLocations.First();

  var servoVerticalPosition = this.ezb.Servo.GetServoPosition(HeadServoVerticalPort);
  var servoHorizontalPosition = this.ezb.Servo.GetServoPosition(HeadServoHorizontalPort);

  //Track face
  var yDiff = faceLocation.CenterY - CameraHeight / 2;
  if (Math.Abs(yDiff) > YDiffMargin)
  {
    if (yDiff < -1 * RobotSettings.sensitivity)
    {
      if (servoVerticalPosition - ServoStepValue >= mapPortToServoLimits[HeadServoVerticalPort].MinPosition)
      {
        servoVerticalPosition -= ServoStepValue;
      }
    }
    else if (yDiff > RobotSettings.sensitivity)
    {
      if (servoVerticalPosition + ServoStepValue <= mapPortToServoLimits[HeadServoVerticalPort].MaxPosition)
      {
        servoVerticalPosition += ServoStepValue;
      }
    }
  }

  var xDiff = faceLocation.CenterX - CameraWidth / 2;
  if (Math.Abs(xDiff) > XDiffMargin)
  {
    if (xDiff > RobotSettings.sensitivity)
    {
      if (servoHorizontalPosition - ServoStepValue >= mapPortToServoLimits[HeadServoHorizontalPort].MinPosition)
      {
        servoHorizontalPosition -= ServoStepValue;
      }
    }
    else if (xDiff < -1 * RobotSettings.sensitivity)
    {
      if (servoHorizontalPosition + ServoStepValue <= mapPortToServoLimits[HeadServoHorizontalPort].MaxPosition)
      {
        servoHorizontalPosition += ServoStepValue;
      }
    }
  }

  this.ezb.Servo.SetServoPosition(HeadServoVerticalPort, servoVerticalPosition);
  this.ezb.Servo.SetServoPosition(HeadServoHorizontalPort, servoHorizontalPosition);

  //FACE detection
  //Face recognition using the API
  var currentBitmap = camera.GetCurrentBitmap;

  (var faces, var person, var emotions) = await FaceApiCommunicator.DetectAndIdentifyFace(currentBitmap);

  //If a person was identified and the robot is not currently speaking
  if (person != null && !ezb.SoundV4.IsPlaying)
  {
    //If the person looks sad
    if (emotions[0].Scores.Sadness > 0.02)
    {
      ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("You look sad, but I will try to cheer you up with a joke. Here it is. My dog kept chasing people on bicycles. In the end I had to take its bike away."));
      //Wait until the robot finishes speaking
      Thread.Sleep(25000);
    }
    else
    {
      ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Hello " + person.Name));
      Wave();
    }             
  }
  //If faces were detected but could not be identified
  else if (faces != null && faces.Any() && !ezb.SoundV4.IsPlaying)
  {
    if (faces[0].FaceAttributes.Gender == "male")
      ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Hello, unknown sir! You look like you are " + faces[0].FaceAttributes.Age));
    else
      ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Hello, unknown madam! You look like you are " + faces[0].FaceAttributes.Age));
    Wave();
  }
}

The following is the FaceApiCommunicator code, which contains the communication logic for the Face API and Emotion API.

using Microsoft.ProjectOxford.Common.Contract;
using Microsoft.ProjectOxford.Emotion;
using Microsoft.ProjectOxford.Face;
using Microsoft.ProjectOxford.Face.Contract;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace EZFormApplication.CognitiveServicesCommunicators
{
    public class FaceApiCommunicator
    {
        private const string FaceApiEndpoint = "https://westeurope.api.cognitive.microsoft.com/face/v1.0/";
        private static List<FaceResult> personResults = new List<FaceResult>();
        private static DateTime lastFaceDetectTime = DateTime.MinValue;


        public static async Task<(Face[] faces, Person person, Emotion[] emotions)> DetectAndIdentifyFace(Bitmap image)
        {
            FaceServiceClient fsc = new FaceServiceClient(Settings.Instance.FaceApiKey, FaceApiEndpoint);
            EmotionServiceClient esc = new EmotionServiceClient(Settings.Instance.EmotionApiKey);

            //FACE detection
            //The interval between detection calls could be passed in as a parameter
            Emotion[] emotions = null;
            Person person = null;
            Face[] faces = null;


            //Detection is performed at most once every 10 seconds
            if (lastFaceDetectTime.AddSeconds(10) < DateTime.Now)
            {
                lastFaceDetectTime = DateTime.Now;

                MemoryStream memoryStream = new MemoryStream();
                image.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Jpeg);

                //Rewind the stream to the beginning
                memoryStream.Seek(0, SeekOrigin.Begin);
                faces = await fsc.DetectAsync(memoryStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age, FaceAttributeType.Gender });

                if (faces.Any())
                {

                    var rec = new Microsoft.ProjectOxford.Common.Rectangle[] { faces.First().FaceRectangle.ToRectangle() };
                    //Emotion recognition

                    //Rewind again; because of concurrent access issues we create a new memory stream
                    memoryStream = new MemoryStream();
                    image.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Jpeg);
                    memoryStream.Seek(0, SeekOrigin.Begin);

                    //Call the Emotion API and pass in the face rectangle information,
                    //since that makes the call simpler - the Emotion API does not have to detect the face itself
                    emotions = await esc.RecognizeAsync(memoryStream, rec);


                    //Person identification
                    var groups = await fsc.ListPersonGroupsAsync();
                    var groupId = groups.First().PersonGroupId;

                    //We are only interested in the first matching candidate
                    var identifyResult = await fsc.IdentifyAsync(groupId, new Guid[] { faces.First().FaceId }, 1);
                    var candidate = identifyResult?.FirstOrDefault()?.Candidates?.FirstOrDefault();

                    if (candidate != null)
                    {
                        person = await fsc.GetPersonAsync(groupId, candidate.PersonId);
                    }

                }
            }
            return (faces, person, emotions);
        }
    }

    public class FaceResult
    {
        public string Name { get; set; }
        public DateTime IdentifiedAt { get; set; }
    }
}

Speaker Recognition


The robot application can identify a person not only by their face, but also by their voice. For this, the voice recording is sent to the Speaker Recognition API. Follow the link for more information about this API.

As with the Face API, the Speaker Recognition service needs a recognition model trained on voice samples of the speakers to be recognized. First you need to record enrollment audio in .wav format; the code below is suitable for this. Having created the audio samples, we used this sample application to create profiles of the people our robot should recognize by voice.
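
Profiles can also be created and enrolled directly from code instead of through the sample application. Below is a minimal sketch, assuming the same Microsoft.ProjectOxford.SpeakerRecognition client used later in this section; the locale and file name are hypothetical placeholders.

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.ProjectOxford.SpeakerRecognition;
using Microsoft.ProjectOxford.SpeakerRecognition.Contract.Identification;

public static class SpeakerEnrollmentSample
{
    //Sketch only: creates an identification profile and enrolls a single .wav recording.
    public static async Task<Guid> CreateAndEnrollProfileAsync(string speakerRecognitionApiKey)
    {
        var client = new SpeakerIdentificationServiceClient(speakerRecognitionApiKey);

        //Create a new identification profile (en-US locale assumed)
        var profile = await client.CreateProfileAsync("en-US");

        //Send the enrollment recording; the service processes it asynchronously
        OperationLocation location;
        using (Stream audio = File.OpenRead("enrollment.wav"))
        {
            location = await client.EnrollAsync(audio, profile.ProfileId);
        }

        //Poll until the enrollment operation succeeds or fails
        for (int retries = 10; retries > 0; retries--)
        {
            await Task.Delay(TimeSpan.FromSeconds(5));
            var enrollment = await client.CheckEnrollmentStatusAsync(location);
            if (enrollment.Status == Status.Succeeded || enrollment.Status == Status.Failed)
            {
                break;
            }
        }

        //Store the returned ProfileId together with the person's name in the application
        return profile.ProfileId;
    }
}

The ProfileId returned here is exactly the value that has to be paired with a name, as shown in the list below.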

Keep in mind that the created profiles do not have a user name field. This means you need to store each generated ProfileId together with the corresponding name somewhere. In the application, we store these pairs as entries in a static list:

 public static List<Speaker> ListOfSpeakers = new List<Speaker>() { new Speaker() { Name = "Marek", ProfileId = "d64ff595-162e-42ef-9402-9aa0ef72d7fb" } };

Now we can create a .wav recording and send it to the Speaker Recognition service (to enroll or identify people). We implemented logic that records voice data into a .wav file. To achieve this from a .NET application, we use the winmm.dll interop assembly:

class WavRecording
{
  [DllImport("winmm.dll", EntryPoint = "mciSendStringA", ExactSpelling = true, CharSet = CharSet.Ansi, SetLastError = true)]
  private static extern int Record(string lpstrCommand, string lpstrReturnString, int uReturnLength, int hwndCallback);

  public string StartRecording()
  {
    //MCIErrors is a custom enumeration of MCI interop error codes
    var result = (MCIErrors)Record("open new Type waveaudio Alias recsound", "", 0, 0);
    if (result != MCIErrors.NO_ERROR)
    {
      return "Error code: " + result.ToString();

    }
    //Configure the .wav output format to meet the Speaker Recognition service requirements
    result = (MCIErrors)Record("set recsound time format ms alignment 2 bitspersample 16 samplespersec 16000 channels 1 bytespersec 88200", "", 0, 0);
    if (result != MCIErrors.NO_ERROR)
    {
      return "Error code: " + result.ToString();
    }

    result = (MCIErrors)Record("record recsound", "", 0, 0);
    if (result != MCIErrors.NO_ERROR)
    {
      return "Error code: " + result.ToString();
    }
    return "1";
  }

  public string StopRecording()
  {
    var result = (MCIErrors)Record("save recsound result.wav", "", 0, 0);
    if (result != MCIErrors.NO_ERROR)
    {
      return "Error code: " + result.ToString();
    }
    result = (MCIErrors)Record("close recsound ", "", 0, 0);
    if (result != MCIErrors.NO_ERROR)
    {
      return "Error code: " + result.ToString();
    }

    return "1";
  }
}

Next, we will create the SpeakerRecognitionCommunicator component that is responsible for communicating with the Speaker Recognition API:

using Microsoft.ProjectOxford.SpeakerRecognition;
using Microsoft.ProjectOxford.SpeakerRecognition.Contract.Identification;
...
class SpeakerRecognitionCommunicator
{
  public async Task<IdentificationOperation> RecognizeSpeaker(string recordingFileName)
  {
    var srsc = new SpeakerIdentificationServiceClient(Settings.Instance.SpeakerRecognitionApiKeyValue);
    var profiles = await srsc.GetProfilesAsync();

    //First, select the set of profiles the voice sample will be compared against
    Guid[] testProfileIds = new Guid[profiles.Length];
    for (int i = 0; i < testProfileIds.Length; i++)
    {
      testProfileIds[i] = profiles[i].ProfileId;
    }

    //IdentifyAsync does not return the identification result directly, so we need to implement a result polling mechanism
    OperationLocation processPollingLocation;
    using (Stream audioStream = File.OpenRead(recordingFileName))
    {
      processPollingLocation = await srsc.IdentifyAsync(audioStream, testProfileIds, true);
    }

    IdentificationOperation identificationResponse = null;
    int numOfRetries = 10;
    TimeSpan timeBetweenRetries = TimeSpan.FromSeconds(5.0);

    //Poll the service until the identification operation completes or retries are exhausted
    while (numOfRetries > 0)
    {
      await Task.Delay(timeBetweenRetries);
      identificationResponse = await srsc.CheckIdentificationStatusAsync(processPollingLocation);

      if (identificationResponse.Status == Microsoft.ProjectOxford.SpeakerRecognition.Contract.Identification.Status.Succeeded)
      {
        break;
      }
      else if (identificationResponse.Status == Microsoft.ProjectOxford.SpeakerRecognition.Contract.Identification.Status.Failed)
      {
        throw new IdentificationException(identificationResponse.Message);
      }
      numOfRetries--;
    }
    if (numOfRetries <= 0)
    {
      throw new IdentificationException("Срок операции идентификации истек");
    }
    return identificationResponse;
  }
}

Finally, we integrated the two pieces described above into the ListenButton_Click handler. The first click starts recording the voice sample; the second click stops the recording and sends it to the Speaker Recognition service. Once again, note that the EZ robot is not equipped with a microphone, so the voice sample is captured by a microphone connected to (or built into) the developer's computer.

private async void ListenButton_Click(object sender, EventArgs e)
{
  var vr = new WavRecording();

  if (!isRecording)
  {
    var r = vr.StartRecording();
    //if successful
    if (r == "1")
    {
      isRecording = true;
      ListenButton.Text = "Stop listening";
    }
    else
      WriteDebug(r);
  }
  else
  {
    var r = vr.StopRecording();
    if (r == "1")
      try
      {
        var sr = new SpeakerRecognitionCommunicator();
        var identificationResponse = await sr.RecognizeSpeaker("result.wav");

        WriteDebug("Identification completed");
        wavePosition.StartAction_Wave();

        var name = Speakers.ListOfSpeakers.Where(s => s.ProfileId == identificationResponse.ProcessingResult.IdentifiedProfileId.ToString()).First().Name;
        ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Hello " + name));

        await Task.Delay(5000);
        wavePosition.Stop();
        ezb.Servo.ReleaseAllServos();
      }
    catch (IdentificationException ex)
    {
      WriteDebug("Speaker Identification Error: " + ex.Message);
      wavePosition.StartAction_Wave();

      ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream("Hello, stranger"));

      //Wait until the robot finishes speaking
      await Task.Delay(5000);
      wavePosition.Stop();
      ezb.Servo.ReleaseAllServos();
    }
    catch (Exception ex)
    {
      WriteDebug("Ошибка: " + сообщение);
    }
    else
      WriteDebug(r);

    isRecording = false;
    ListenButton.Text = "Распознавание голоса";

  }
}

Computer vision


The robot is capable of describing in natural language what it "sees", that is, the objects that fall within its camera's field of view. For this we use the Computer Vision API. We created a helper method responsible for communicating with the Computer Vision API.

using Microsoft.ProjectOxford.Vision;
...
public async Task<string> RecognizeObjectsInImage(Bitmap image)
{
  //Use the westeurope endpoint
  var vsc = new VisionServiceClient(visionApiKey, "https://westeurope.api.cognitive.microsoft.com/vision/v1.0");
  MemoryStream memoryStream = new MemoryStream();
  image.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Jpeg);
  memoryStream.Seek(0, SeekOrigin.Begin);
  var result = await vsc.AnalyzeImageAsync(memoryStream,new List<VisualFeature>() { VisualFeature.Description });
  return result.Description.Captions[0].Text;
}

We call this method when command recognition determines that the ComputerVision intent should be executed.

case "ComputerVision":
{
  var currentBitmap = camera.GetCurrentBitmap;
  var cvc = new CustomVisionCommunicator();
  var description = await cvc.RecognizeObjectsInImage(currentBitmap);
  ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream(description));
  break;
}

Custom object recognition


The robot also uses the Custom Vision API, which allows specific objects to be recognized very accurately. This service also requires a trained model, which means we need to upload a set of images of the objects to be recognized, with tags assigned to them. We use images taken by the robot's camera so that the model learns from images of the same quality as those the recognition service will later work with. The Custom Vision API provides a web interface through which you can upload images, tag them and train the model. After creating and training the model, we implement CustomVisionCommunicator, which connects our robot to the published model:

using Microsoft.Cognitive.CustomVision;
using Microsoft.Cognitive.CustomVision.Models;
...

 class CustomVisionCommunicator
 {
     private string predictionKey;
     private string visionApiKey;
     private Guid projectId;
     private Guid iterationId;

     PredictionEndpoint endpoint;
     VisionServiceClient vsc; 

     public CustomVisionCommunicator()
     {
       this.visionApiKey = Settings.Instance.VisionApiKey;
       this.predictionKey = Settings.Instance.PredictionKey;
       this.projectId = new Guid(Settings.Instance.VisionApiProjectId);

       //changes every time the model is retrained
       this.iterationId = new Guid(Settings.Instance.VisionApiIterationId);
       PredictionEndpointCredentials predictionEndpointCredentials = new PredictionEndpointCredentials(predictionKey);

       //Create the prediction endpoint, passing in the credentials object that holds the prediction key
       endpoint = new PredictionEndpoint(predictionEndpointCredentials);
       vsc   = new VisionServiceClient(visionApiKey, "https://westeurope.api.cognitive.microsoft.com/vision/v1.0");
     }

     public  List<ImageTagPrediction> RecognizeObject(Bitmap image)
     {
       MemoryStream memoryStream = new MemoryStream();
       image.Save(memoryStream, System.Drawing.Imaging.ImageFormat.Jpeg);

       //Rewind the stream to the beginning
       memoryStream.Seek(0, SeekOrigin.Begin);
       var result = endpoint.PredictImage(projectId, memoryStream,iterationId);
       return result.Predictions.ToList();
     }  
}
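
We trained the model through the Custom Vision web portal, but the same steps could in principle be automated with the training SDK of that time. The sketch below is only an assumption based on the Microsoft.Cognitive.CustomVision.Training package (method names differ in newer SDKs); the project name, tag and file name are hypothetical placeholders.

using System.Collections.Generic;
using System.IO;
using Microsoft.Cognitive.CustomVision.Training;
using Microsoft.Cognitive.CustomVision.Training.Models;

public static class CustomVisionTrainingSample
{
    //Sketch only: creates a project, uploads one tagged image captured by the robot camera
    //and trains an iteration. All names and paths are hypothetical.
    public static void TrainOilBottleModel(string trainingKey)
    {
        var trainingApi = new TrainingApi { ApiKey = trainingKey };

        var project = trainingApi.CreateProject("JD Robot Objects");
        var oilTag = trainingApi.CreateTag(project.Id, "oil");

        //Upload an image saved from the robot camera and assign the tag to it
        using (var imageStream = File.OpenRead("oil-bottle.jpg"))
        {
            trainingApi.CreateImagesFromData(project.Id, imageStream,
                new List<string> { oilTag.Id.ToString() });
        }

        //Train the model; a new iteration is created each time
        var iteration = trainingApi.TrainProject(project.Id);
    }
}

The id of the trained iteration is the value that goes into Settings.Instance.VisionApiIterationId used by CustomVisionCommunicator above.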

In our demo application, the Custom Vision API is used in scenarios where the robot tries to find a screwdriver, or where we ask whether it is hungry and the robot has to decide whether what we are offering it is a bottle of oil. The following example calls the Custom Vision API in the scenario where we offer the robot a bottle of oil:

case "Голоден":
{
  ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream(("Да, я всегда хочу есть! Сейчас бы масла!")));
  var currentBitmap = camera.GetCurrentBitmap;

  var cvc = new CustomVisionCommunicator();
  var predictions = cvc.RecognizeObject(currentBitmap);
  //"oil" is the tag assigned to the training images in the Custom Vision model
  if (predictions.Any(p => p.Tag == "oil"))
  {
    //Takes the bottle of oil and drinks it
    grabPosition.StartAction_Takefood();
    await Task.Delay(1000);
    ezb.SoundV4.PlayData(ezb.SpeechSynth.SayToStream(("Tasty oil")));
  }
  else
    ezb.SpeechSynth.Say("Я не ем это");
  break;
}

Conclusion


In this article, we described an application that gives a humanoid robot cognitive abilities. The implemented scenarios are intended as a public demonstration of what Cognitive Services can do, but fragments of this code are suitable for any other project where the artificial intelligence of Cognitive Services is useful. The code can also be useful to anyone who works with the EZ robot and plans to use its SDK in combination with Cognitive Services.

Lessons learned


During the development process, we came to several conclusions regarding the EZ robot and Cognitive Services.

  • Models for the Face API or Custom Vision must be trained on images of the same quality that will be used later. If lighting conditions can change and affect image quality, train the models on images taken under different lighting conditions. For the Custom Vision service, it also makes sense to experiment with the background and orientation of the object.
  • The Speaker Recognition API requires voice data in .wav format with specific settings (see the section on the Speaker Recognition service above). The profile has no name field, so names have to be stored separately by the application.
  • Recording to .wav is done through winmm.dll interop - we created a module that records sound and saves it as a .wav file. Searching the Internet showed this to be the most convenient way to implement this functionality.
  • Duration of Speaker Recognition calls - if the set of compared profiles and audio fragments is large, the API call can take a long time. Try to limit the number of compared profiles if your scenario allows it.
  • The EZ robot's built-in speech recognition is not suitable for distinguishing many different phrases - we ran into situations where the EZ robot performed actions based on misunderstood phrases. We solved this with the more advanced Microsoft Speech API and the Language Understanding Intelligent Service (LUIS).
  • The EZ JD robot cannot move forward precisely and has no ultrasonic sensor, which kept us from implementing the scenario where the robot walks up to an object and picks it up off the floor. However, considering its price, its development capabilities are quite extensive, and the SDK is an excellent addition for implementing complex robot use cases.

Useful links


Cognitive Services:

EZ Robot:

Future improvements


The most realistic improvement would be to register faces and/or people and their associated voice samples, and to call the model training and publishing functions of the Face API and Speaker Recognition API, directly from the application. This would make the application self-contained and independent of the sample applications we currently use for this purpose.

About the author


Marek Lani is a Microsoft technology evangelist in Slovakia.
«As a technology evangelist at Microsoft, I do have an opportunity to learn and work with the newest technologies and subsequently help developers with adoption of these with ultimate goal of making their project/business even more successful. My area of focus is Azure Cloud in general, and especially services related to topics such as Micro Services, Internet of Things, Chat Bots, Artificial Intelligence. I like to spend my free time with technology, working on interesting projects, but I can also enjoy time outside of IT. I like to take in hand, put on or kick to almost any sports equipment and I really enjoy time when I can change the urban grey for the forest green.»

If you have questions about the project, you can ask the author directly on Facebook. Keep in mind that if you do not speak Slovak, it is better to write in English.