We study parallel computing with OpenMPI and a supercomputer as an example of hacking neighbor WiFi

  • Tutorial

At the time of writing the thesis, one of the areas of research was the parallelization of searches in the state space on computing clusters. I had access to a computing cluster, but had no programming practice for clusters (or HPC - High Performance Computing). Therefore, before moving on to the combat mission, I wanted to practice something simple. But I am not a fan of abstract hello world without real practical problems, so such a task was quickly found.



Everyone knows that exhaustive search is the most low-efficient way to select passwords. However, with the advent of supercomputers, it became possible to significantly accelerate this process, since, as a rule, sorting is parallel with virtually no overhead. Therefore, theoretically, a process with a linear coefficient can be accelerated on a cluster, i.e. having 100 cores - speed up the process by 1000 * k times (where 0.0

Therefore, as a training exercise, the task was to verify this. The practical task was the organization of enumeration of passwords for WPA2 on a computing cluster. Therefore, in this article I will describe the methodology used, the results of experiments and lay out the source codes written for this.


It should be understood that the practical benefit of the task tends to zero, since no one will really drive huge clusters, burning electricity worth hundreds of times the cost of a router and an annual subscription fee for the Internet, in order to find the password for this miserable router. But if you close your eyes to this and look at the rating of supercomputers , you can see that top clusters include up to 6 million cores. This means that if a password is selected for 10 years on a single-core machine, then such a cluster in the case of a spherical horse in a vacuum will figure it out in 10 365 24 60 60/6000000 = 53 seconds.


How WPA2 authentication works


There are enough resources on this subject in the vastness of the network, so I will explain very briefly. Suppose there is an access point and a client device located nearby. If the client wants to establish a connection, he initiates a sequence of packet exchanges, during which he gives the password to the access point, and the point decides whether to grant it access or not. If they "agree", then the connection is considered established.


In order for an attacker to obtain an authorization password, it is necessary to fully listen to the exchange of messages between the access point and the client, and then pick up the hash from a specific field of a specific package (the so-called handshake). It is in this package that the client reports its password.


But how to catch the authorization process? If we assume that we are breaking neighboring WiFi, we must either wait until the neighbor leaves the phone and returns, or force the connection to be disconnected. And there is such an opportunity. To do this, send a request to disconnect the connection to the air, which the device will eventually respond to and reconnect. For the user, this process will be invisible, and we will stop the authorization process.


Instruments


The packages themselves can be caught by "special" WiFi points, drivers for which are in the specialized Kali Linux distribution , as well as all the necessary tools. On ebay or aliexpress you can find hundreds of suitable points, and compatibility should be checked on the OS website in advance (compatibility should be checked with the chip on which the point is based).


However, for this work, the most interesting are the tools for handling the handshake package and password selection. The most famous and advanced tool is aircrack-ng , which also has open source code. We will still need it, but this is later.


However, they are all designed to use a local processor or video card. I did not find such a tool to run on a supercomputer, which means we do not invent a bicycle and it's time to write it ourselves.


Alphabetical search


Before you parallelize something, you need to make a semantic part - a brute force mechanism. To do this, we introduce the concept of "alphabet", on the basis of which search is carried out. The alphabet is a non-repeating set of all characters from which the password can be composed. Therefore, if there are N characters in the alphabet, then any password can be represented as a number in the N-th number system, where each digit corresponds to a character with an identical serial number.


алфавит: abcdefgh
(8 символов, восьмеричная система счисления)

00000 => aaaaa
01070 => abaha
11136 => bbbdg

Therefore, we will create the key_manager class, which will be initialized in the string alphabet.


class key_manager {
    using key_t = std::string;
    using internal_key_t = numeric_key_t;

    key_manager(const key_t & _alphabet);

    //По числовому представлению получить пароль
    key_id_t get_key_id(const internal_key_t & key) const;

    //Из пароля получить внутреннее представление
    void to_internal(const key_t & alpha_key, internal_key_t & int_key) const;

    //Из пароля получить его порядковый целочисленный идентификатор
    key_id_t get_key_id(const key_t & alpha_key) const;
};

It will need methods for converting from internal representation (in numerical form) to string form and in the opposite direction. It will be necessary to convert it in the opposite direction, if it becomes necessary to start with some kind of a given password, and not zero, for example, if you need to continue the search, and not start over.


Moreover, the numbers themselves will be stored in a special class, the so-called internal representation. I don’t want to lose readability and do it competently, so we’ll make it in the form of a vector, where each element corresponds to a “digit”.


struct numeric_key_t {
    ...
    std::vector<uint8_t> data;
};

The ordinal identifier will be an integer, take the 128-bit unsigned int as an example.


    using key_id_t = __uint128_t;

Of course, if you do it in a completely clever way, you should write your own big_integer class, but as part of my experiments, all passwords fit into a 128-bit integer, so we will not waste time on something that will never be needed.


Search Engine Architecture


The task of the search engine is to sort through the range of keys, tell if the correct one is found, and if so, return the found key. But the engine doesn’t care why pick passwords - for WiFi or md5, so we hide the implementation details inside the template.


template<typename C>
class brute_forcer : public key_manager {
    bool operator()(unsigned key_length, uint_t first, uint_t last, uint_t & correctKeyId) const;
}

Inside this method, we simply write a loop that will linearly go from first to last, and if there is a suitable key, it will return true and write the identifier found in correctKeyId.


    using checker_t = C;
    ...
private:
    checker_t checker;
    ...
    if(checker(first, str_key, correctVal, correctKeyId))
    return true;

Thus, it becomes clear what the interface should be for a class that already “knows” what password guessing is for. In my version, I debugged this class on md5 before switching to wpa2, so in the repository you can find classes for both tasks. Next, let's do the checker itself.


Verify Password


Let's start with a simple version for password selection for md5 hashes. The corresponding class will be as simple as possible, it needs only one method, in which the check actually takes place:


struct md5_crypter{
    using value_t = std::string;

    bool operator()(typename key_manager::key_id_t cur_id, const std::string & code, const std::string & correct_res, typename key_manager::key_id_t & correctId) const {
        bool res = (md5(code) == correct_res);
        if(res)
        correctId = cur_id;
        return res;
    }
};

To do this, use the std :: string md5 (std :: string) function, which returns its md5 based on the given string. Everything is as simple as possible, so now we’ll start connecting aircrack-ng fragments.


We connect aircrack


The first difficulties come in getting a file with a handshake package. Here I find it difficult to remember exactly how I received it, but it seems to patch airodump or aircrack so that it saves the desired fragment. And perhaps this was done by regular means. In any case, the repository contains an example of such a package.


Having discarded all unnecessary, for work we need the following structure:


struct ap_data_t {
    char essid[36];
    unsigned char bssid[6];
    WPA_hdsk hs;
};

Which of the corresponding file can be considered as reading several fragments of the handshake file:


    FILE * tmpFile = fopen(file_name.c_str(), "rb");
    ap_data_t ap_data;
    fread(ap_data.essid, 36, 1, tmpFile);
    fread(ap_data.bssid, 6, 1, tmpFile);
    fread(&ap_data.hs, sizeof(struct WPA_hdsk), 1, tmpFile);
    fclose(tmpFile);

Next, you need to perform some preprocessing of this data so as not to repeat the calculations for each password (in the wpa2_crypter constructor), but here you can not really think about it, but simply transfer it from aircrack. The most interesting is in aircrack / sha1-sse2.h, which has the functions calc_pmk and calc_4pmk, which perform useful calculations.


Moreover, calc_4pmk is a version of calc_pmk, which, if there is SSE2 in the processor, allows you to count as many as 4 keys in one step, correspondingly accelerating the process by 4 times. Considering that now there is such an extension on almost all processors, it must be used, although this adds a little complexity to the implementation.


To do this, we add buffering to our wpa2_crypter class - since brute_forcer will request one key each, calculations will only be run every 4th time. And the data processing logic, again, is carefully transferred from aircrack, without changing anything. As a result, the verification function is obtained as follows:


value_t operator()(typename key_manager::key_id_t cur_id, const std::string & code, bool , typename key_manager::key_id_t & correctId) const {
    cache[localId].key_id = cur_id;
    memcpy(cache[localId].key, code.data(), code.size());
    ++localId;

    if(localId == 4) {  //Запускаем проверку только на каждый 4й шаг
        //Собственно смысловая функция
    calc_4pmk((char*)cache[0].key, (char*)cache[1].key, (char*)cache[2].key, (char*)cache[3].key, (char*)apData.essid, cache[0].pmk, cache[1].pmk, cache[2].pmk, cache[3].pmk);

    for(unsigned j = 0; j < localId; ++j) {
        for (unsigned i = 0; i < 4; i++){
            pke[99] = i;
            HMAC(EVP_sha1(), cache[j].pmk, 32, pke, 100, cache[j].ptk + i * 20, NULL);
        }

        HMAC(EVP_sha1(), cache[j].ptk, 16, apData.hs.eapol, apData.hs.eapol_size, cache[j].mic, NULL);

        if(memcmp(cache[j].mic, apData.hs.keymic, 16) == 0) {   //Проверяем, что ключ подошел
            correctId = cur_id;
            return true;
        }
    }

    localId = 0;
    }

    return false;
}

For all repeated 4m requests - we say that the key is not suitable. And at 4m, if the key is found, then we return both true and the key itself. The buffer is accumulated in an array of 4 elements of the following type:


struct cache_t {
    mutable char key[128];
    unsigned char pmk[128];
    unsigned char ptk[80];
    unsigned char mic[20];
    typename key_manager::key_id_t key_id;
};

In fact, the buster is ready. In order to use it, we perform the following steps:


//Читаем сюда данные из handshake-пакета
ap_data_t ap_data;

//Создаем проверочный класс, в нем будет производиться кэширование
wpa2_crypter crypter(ap_data);

//Создаем объект переборщика и передаем алфавит
brute_forcer<wpa2_crypter> bforcer(crypter, true, "abcdefg123455...");

//Проверим первые 1000 ключей
std::string correct_key;
bforcer(8, 0, 1000, correct_key);

However, aircrack itself can count in one thread, but we don’t need this?


Concurrency architecture


After studying the existing frameworks and software for organizing parallel computing installed on a cluster to which I had access, I decided to focus on Open MPI . Work with him begins with the lines:


//Инициализация и fork
MPI::Init();

//Получение идентификатора процесса
int rank = MPI::COMM_WORLD.Get_rank();
int size = MPI::COMM_WORLD.Get_size();

//Смысловая часть
...

//Завершаем работу
MPI::Finalize();

The init () call will separate processes, after which rank will be the serial number of the calculator, and size will be the total number of calculators. Processes can be run on different machines that make up the cluster, so you will have to forget about shared memory in an easy way - communication between processes is carried out using two functions:


MPI :: COMM_WORLD.Recv (& res_task, sizeof (res_task), MPI :: CHAR, MPI_ANY_SOURCE, 0, status);


//Отправить данные конкретному вычислителю
MPI_Send(
    void* data,
    int count,
    MPI_Datatype datatype,
    int destination,
    int tag,
    MPI_Comm communicator)

//Получить данные от конкретного вычислителя
MPI_Recv(
    void* data,
    int count,
    MPI_Datatype datatype,
    int source,
    int tag,
    MPI_Comm communicator,
    MPI_Status* status)

Read more about the interaction here . And now we’ll come up with a parallel search engine architecture.


Good, given the specifics of the task, it would be worthwhile to make the following architecture. If there are N cores in the cluster, then each i-th core should check the keys with identifiers i + N * k, k = 0,1,2 ... without interacting with each other. Then the performance will be maximum. But, as I said at the very beginning, the main task is to master the technology, so it is necessary to master communication between computers.


Therefore, I chose a different architecture, where the processes are divided into two types, and here I will describe the most understandable option, the option is implemented in the repository a little bit more complicated and faster, but still this is not the fastest option.


To do this, we conventionally divide the processes into 2 types - control and working. The manager will send tasks to the workers, and the workers will, in fact, count the hashes. For simplicity, we will exchange POD structs between processes. Schematically, you can imagine the processes in the form of the following figure:



Create the controller and searcher classes, respectively, which we instantiate after identifying the process:


if(rank == 0) {
    controller ctrl(this, start_key);
    ctrl.run(size - 1);
} else {
    searcher srch(this, rank);
    srch.run();
}

Searcher objects will wait for job messages, process them, and send report messages back. The controller will only deal with the distribution of tasks and verification of results. Seriously, the controller can also do calculations during downtime, but for our example, we will not complicate this architecture.


Moreover, it is necessary to avoid situations when the workflow has counted the task, and did not manage to get a new one and is idle. This is achieved through the use of queues, and the separation of the transport and computing stream in the simplest case with mutexes, although you should do conditional_variable for good. Therefore, schematically, the exchange of data between the controller and the workflow can be represented as follows:



Instead of citing code fragments with synchronization here, I will refer to my own repository. And we move on to the final part - experiments.


Experiments


It would seem that this is the most expected part of the article, but due to the simplicity of the problem being solved, the results completely coincide with the expected ones.


For experiments, a handshake package was taken, which I took from my point, as well as a couple of neighbors. By the way, the process is not very pleasant and deterministic, and it took more than an hour.


On my packages, I was convinced of the correct operation of the developed software, and on neighboring ones I already tried the technology in real conditions.


In the given sources I made a periodic output of debugging information about the current speed, the number of keys scanned and the expected time to check the keys of the current length.


At my disposal was a small supercomputer with 128 cores in total. In it, of course, there was SSE, although for the purity of the experiment I turned it off - and got a speed 4 times less.


The dynamics of the work itself is also quite expected - it takes several seconds to accelerate and collect statistics, after which the engine shows a stable search speed. Depending on the number of cores, an approximately linear increase in speed is achieved (which is obvious), however, a simple controller core and careless synchronization of flows make themselves felt - the growth constant turned out in the region of 0.8-0.9.


But the most interesting thing was waiting for me when I started the engine on a neighbor’s key - before I had time to overclock all the kernels, the password was already found ... it turned out to be the date of birth of someone from the neighbor’s family. On the one hand, I was disappointed, on the other - I still solved the initial problem.


I don’t see the point of bringing absolute values ​​of speeds, because you can try it on your available machines - all sources are provided at the end of the article. And knowing the coefficient of parallelism and the characteristics of the machine, you can accurately calculate what speed can be achieved.


Source code


Sources for the described implementation can be found in my github repository . The code was completely working; it was built under Linux and Mac OS. But ... more than 2 years have passed, I don’t know how much the API of the same Open MPI has changed.


In the test / folder you can find an example of a handshake package compiled for the format used.


The code itself is rather dirty, but due to the lack of practical value of the problem being solved, combing it did not make sense either. The project also does not develop due to meaninglessness, but if someone liked the idea, or he saw why it can be used ... then take it and use it.


The start line shows the handshake file and optionally the start key, if you want to start from a non-zero element.


brute_wpa2 k aaa123aaab ap_Dom.hshk

The parameter parsing itself is located in src / brute_wpa2.cpp, from it you can also understand how to set the identifier of the first key by a number, and also set the size of the chunk.


Conclusion


This article gives a small excursion into parallel programming using the simplest practical problem and its rather simple solution as an example. I completed the task - not only having mastered modern technologies of parallel programming, I got the skills to implement the combat mission in the dissertation, but also picked up the password from the neighbor’s WiFi. True, I used it only once - to check the correctness, but it was still nice.


But returning to the practical benefits of the work done, in connection with the latest events, I would like to note the hype around Bitcoins. Given that the basis of WPA2 hacking is used to calculate SHA hashes, as with bitcoin mining, a new direction is opening up for the development of this work. Since special equipment was developed for mining that can only count the necessary hashes, it’s interesting to check how adaptable and applicable it is for hacking WPA2. Of course, such chips are much more efficient than the latest general-purpose processors in the selection of hashes, so, perhaps, here you can achieve interesting results.