Code coverage at Badoo

A few months ago we sped up code coverage generation from 70 hours to 2.5. We implemented it as an additional export/import format for coverage data, and recently our pull requests landed in the official phpunit, phpcov and php-code-coverage repositories.

We have said more than once at conferences and in articles that we run tens of thousands of unit tests in a short time. The main effect is achieved, as you might guess, through multithreading. And all would be well, but one of the important testing metrics is code coverage.
Today we will explain how to calculate it in a multi-threaded environment, aggregate it, and do all of that very quickly. Without our optimizations, calculating coverage took more than 70 hours for the unit tests alone. After the optimizations, we spend only 2.5 hours calculating coverage for all unit tests plus two sets of integration tests, more than 30 thousand tests in total.

At Badoo we write tests in PHP and use Sebastian Bergmann's PHPUnit framework (phpunit.de).
Coverage in this framework, as in many others, is calculated using the Xdebug extension via simple calls:

xdebug_start_code_coverage();
// … the code being measured runs here …
$codeCoverage = xdebug_get_code_coverage();
xdebug_stop_code_coverage();

The output is a nested array containing the files that were executed while coverage was being collected and, for each file, the line numbers with special flags: whether the line was executed, was not executed, or should never have been executed at all. Details on how Xdebug code coverage works can be found on the project website.
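For illustration, the returned structure looks roughly like this (a sketch with made-up paths and line numbers; the -1 and -2 flags only appear when coverage is started with the XDEBUG_CC_UNUSED and XDEBUG_CC_DEAD_CODE options):

$codeCoverage = array(
    '/local/www/app/User.php' => array(
        10 => 1,   //  1: the line was executed
        11 => -1,  // -1: the line is executable but was not executed
        15 => -2,  // -2: dead code, the line can never be executed
    ),
);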

Sebastian Bergmann has a PHP_CodeCoverage library which is responsible for collecting, processing and rendering coverage in various formats. The library is convenient and extensible, and it suits us well. It also has a console frontend, phpcov.
For convenience, though, coverage calculation and output in different formats are already built into the PHPUnit invocation itself:

 --coverage-clover <file>  Generate code coverage report in Clover XML format.
 --coverage-html <dir>     Generate code coverage report in HTML format.
 --coverage-php <file>     Serialize PHP_CodeCoverage object to file.
 --coverage-text=<file>    Generate code coverage report in text format.

The --coverage-php option is exactly what we need for a multi-threaded run: each thread calculates its own coverage and exports it to a separate *.cov file. Aggregation and rendering into a nice HTML report can then be done by calling phpcov with the --merge flag.

--merge                 Merges PHP_CodeCoverage objects stored in .cov files.

It all seems to fit together nicely and should work out of the box. But apparently not many people use this mechanism, the library's author included, otherwise the inefficiency of the export/import mechanism used in PHP_CodeCoverage would have surfaced long ago. Let's go through what the problem is, step by step.

Export to the *.cov format is handled by a dedicated reporter class, PHP_CodeCoverage_Report_PHP, whose interface is very simple: a process() method that takes a PHP_CodeCoverage object as input and serializes it with the serialize() function.

The result is written to a file (if a file path is passed) or returned as the result of the method.

class PHP_CodeCoverage_Report_PHP
{
    /**
     * @param  PHP_CodeCoverage $coverage
     * @param  string           $target
     * @return string
     */
    public function process(PHP_CodeCoverage $coverage, $target = NULL)
    {
        $coverage = serialize($coverage);

        if ($target !== NULL) {
            return file_put_contents($target, $coverage);
        } else {
            return $coverage;
        }
    }
}

Importing with the phpcov utility works the other way around: it takes every file with the *.cov extension in a directory and unserializes each one back into an object. The object is then passed to the merge() method of the PHP_CodeCoverage instance into which the coverage is aggregated.

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $coverage = new PHP_CodeCoverage;

        $finder = new FinderFacade(
            array($input->getArgument('directory')), array(), array('*.cov')
        );

        foreach ($finder->findFiles() as $file) {
            $coverage->merge(unserialize(file_get_contents($file)));
        }

        $this->handleReports($coverage, $input, $output);
    }

The merge itself is very simple: essentially an array_merge() of the arrays, with small nuances such as skipping what has already been imported or what was excluded by the filter parameters passed to phpcov (--blacklist and --whitelist).

     /**
     * Merges the data from another instance of PHP_CodeCoverage.
     *
     * @param PHP_CodeCoverage $that
     */
    public function merge(PHP_CodeCoverage $that)
    {
        foreach ($that->data as $file => $lines) {
            if (!isset($this->data[$file])) {
                if (!$this->filter->isFiltered($file)) {
                    $this->data[$file] = $lines;
                }

                continue;
            }

            foreach ($lines as $line => $data) {
                if ($data !== NULL) {
                    if (!isset($this->data[$file][$line])) {
                        $this->data[$file][$line] = $data;
                    } else {
                        $this->data[$file][$line] = array_unique(
                          array_merge($this->data[$file][$line], $data)
                        );
                    }
                }
            }
        }

        $this->tests = array_merge($this->tests, $that->getTests());
    }

It was precisely this serialize/unserialize approach that turned out to be the problem preventing us from generating coverage quickly. The community has discussed the performance of PHP's serialize() and unserialize() functions more than once:
http://stackoverflow.com/questions/1256949/serialize-a-large-array-in-php
http://habrahabr.ru/post/104069
and so on.

For our small project, whose PHP repository contains more than 35 thousand files, the coverage files weigh a lot: several hundred megabytes each. The combined file merged from the different threads weighs almost 2 gigabytes. On data volumes like these, unserialize() showed itself in all its glory: we waited several days for coverage to be generated.

So we decided to try the most obvious optimization: var_export() and a subsequent include of the resulting files.
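To see the difference, a micro-benchmark along these lines is enough (a sketch with a hypothetical fixture file, not the actual script we used):

<?php
// Hypothetical fixture: a large nested coverage array, file => [line => flag].
$coverage = include 'coverage_fixture.php';

// Round trip through serialize()/unserialize(), as the *.cov format does.
$start = microtime(true);
$blob  = serialize($coverage);
unserialize($blob);
printf("serialize/unserialize: %.2f sec, %d bytes\n", microtime(true) - $start, strlen($blob));

// Round trip through var_export()/include, the approach behind the new format.
$start = microtime(true);
file_put_contents('/tmp/coverage.smart', '<?php return ' . var_export($coverage, true) . ';');
include '/tmp/coverage.smart';
printf("var_export/include:    %.2f sec\n", microtime(true) - $start);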

To do this, we added a new reporter class to the php-code-coverage repository, which exports in the new format via var_export():

class PHP_CodeCoverage_Report_PHPSmart
{
    /**
     * @param  PHP_CodeCoverage $coverage
     * @param  string           $target
     * @return string
     */
    public function process(PHP_CodeCoverage $coverage, $target = NULL)
    {
        $output = '<?php $filter = new PHP_CodeCoverage_Filter();'
            . '$filter->setBlacklistedFiles(' . var_export($coverage->filter()->getBlacklistedFiles(), 1) . ');'
            . '$filter->setWhitelistedFiles(' . var_export($coverage->filter()->getWhitelistedFiles(), 1) . ');'
            . '$object = new PHP_CodeCoverage(new PHP_CodeCoverage_Driver_Xdebug(), $filter); $object->setData('
            . var_export($coverage->getData(), 1) . '); $object->setTests('
            . var_export($coverage->getTests(), 1) . '); return $object;';

        if ($target !== NULL) {
            return file_put_contents($target, $output);
        } else {
            return $output;
        }
    }
}

We modestly named the file format PHPSmart. Files in this format use the *.smart extension.

To allow a PHP_CodeCoverage object to be exported to and imported from the new format, setters and getters for its properties were added.
A few more changes in the phpunit and phpcov repositories taught them to work with such objects, and our coverage started being collected in just two and a half hours.
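The accessors involved are trivial; roughly like this (a simplified sketch, the real class keeps more logic around these properties):

class PHP_CodeCoverage
{
    // ...

    /**
     * Returns the raw coverage data: [file => [line => flags]].
     */
    public function getData()
    {
        return $this->data;
    }

    public function setData(array $data)
    {
        $this->data = $data;
    }

    public function getTests()
    {
        return $this->tests;
    }

    public function setTests(array $tests)
    {
        $this->tests = $tests;
    }
}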
Here is the import:

    foreach ($finder->findFiles() as $file) {
        $extension = pathinfo($file, PATHINFO_EXTENSION);
        switch ($extension) {
            case 'smart':
                $object = include($file);
                $coverage->merge($object);
                unset($object);
                break;
            default:
                $coverage->merge(unserialize(file_get_contents($file)));
        }
    }

You can find our changes on GitHub and try this approach on your own project:
github.com/uyga/php-code-coverage
github.com/uyga/phpcov
github.com/uyga/phpunit
We sent Sebastian Bergmann pull requests with our changes, hoping to see them in his official repositories soon:
github.com/sebastianbergmann/phpunit/pull/988
github.com/sebastianbergmann/phpcov/pull/7
github.com/sebastianbergmann/php-code-coverage/pull/185
But he closed them, saying that he did not want an additional format: he wanted ours to replace the existing one.

Which we happily did. Our changes have now made it into his official repositories, replacing the format previously used in *.cov files:
github.com/sebastianbergmann/php-code-coverage/pull/186
github.com/sebastianbergmann/phpcov/pull/8
github.com/sebastianbergmann/phpunit/pull/989
These small optimizations helped us speed up coverage collection by almost 30 (!) times. That allowed us to calculate coverage not only for unit tests but also to add two sets of integration tests, without noticeably affecting the time spent on export, import and merging of the results.

P.S.:


Ilya Ageev,
QA Lead