CleanTalk Malware Scanner - heuristic code analysis

We already talked about launching the Security service for WordPress in a previous article. Today we want to talk about the launch of heuristic analysis for detecting malicious code.

Malicious code on a site can lead to a ban from search results, or to a warning in search results that the site is infected. Search engines do this to protect users from potentially dangerous content.

You can look for malicious code yourself, but this is a lot of work, and most WordPress users do not have the skills needed to find and remove the offending lines of code.

Authors of malicious code often disguise it, which makes it hard to detect by signature. The malicious code can be located anywhere on the site: for example, obfuscated PHP code may be hidden in a logo.png file and invoked by a single inconspicuous line in index.php. For this reason, it is preferable to use a plugin to search for malicious code.

When scanning for the first time, CleanTalk scans all WordPress core files, plugins and themes. With repeated scans, only those files that have been modified since the last scan are scanned. This saves resources and increases scanning speed.
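
The article does not say how modified files are detected, so the following is only a minimal sketch of one possible approach: comparing content hashes saved from the previous scan. The function name files_to_rescan and the shape of the $knownHashes array are assumptions made for illustration, not part of the plugin.

```php
<?php
// Hypothetical helper: returns [path => new hash] for every file whose
// content differs from the hash recorded during the previous scan.
function files_to_rescan(array $paths, array $knownHashes): array
{
    $changed = [];
    foreach ($paths as $path) {
        $hash = hash_file('sha256', $path);
        // A file that was never scanned has no stored hash and is always included.
        if (($knownHashes[$path] ?? null) !== $hash) {
            $changed[$path] = $hash;
        }
    }
    return $changed;
}

// First scan: nothing is known yet, so this file is reported.
$hashes = files_to_rescan([__FILE__], []);
// Repeated scan: the file has not changed, so nothing is reported.
echo count(files_to_rescan([__FILE__], $hashes)), PHP_EOL;
```

A modification time check would be cheaper than hashing, but a content hash is harder for an attacker to fake, which is why it is used in this sketch.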

How heuristic analysis works


One of the main drawbacks of heuristic analysis is that it is quite slow, so we use it only when it is really needed. First, we break the source code into tokens (the smallest language constructs) and discard everything unnecessary:

  1. Whitespace characters.
  2. Comments of all kinds.
  3. Non-PHP code (anything outside the PHP tags).
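
The filtering step above can be sketched with PHP's built-in tokenizer. This is an illustration under assumptions, not the plugin's actual code: the helper name strip_noise is hypothetical, and only the token types named in the list are dropped.

```php
<?php
// Hypothetical helper: tokenizes $source and drops whitespace, comments,
// and everything outside the PHP tags (T_INLINE_HTML).
function strip_noise(string $source): array
{
    $skip = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_INLINE_HTML];
    $kept = [];
    foreach (token_get_all($source) as $token) {
        // Multi-character tokens are arrays [id, text, line];
        // single characters such as ";" are plain strings and are always kept.
        if (is_array($token) && in_array($token[0], $skip, true)) {
            continue;
        }
        $kept[] = $token;
    }
    return $kept;
}

$source = "<h1>Title</h1>\n<?php /* note */ echo 'hi'; ?>\n<p>footer</p>";
foreach (strip_noise($source) as $token) {
    echo is_array($token) ? token_name($token[0]) : $token, PHP_EOL;
}
```

For this sample, only the opening tag, the echo keyword, the string literal, the semicolon, and the closing tag remain.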

Next, we recursively simplify the code until no "complex constructions" remain:

  1. Concatenating strings.
  2. Substituting variables into variables.
  3. And so on.
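
To illustrate the first simplification, here is a minimal sketch that folds concatenated string literals at the token level and repeats until nothing changes. The names is_simple_literal and fold_concatenation are hypothetical. The sketch also assumes literals without backslash escapes and ignores operator precedence, both of which a real implementation would have to handle.

```php
<?php
// True for a quoted literal without backslashes, so that stripping
// the surrounding quotes yields its value (a simplifying assumption).
function is_simple_literal($token): bool
{
    return is_array($token)
        && $token[0] === T_CONSTANT_ENCAPSED_STRING
        && strpos($token[1], '\\') === false;
}

// Hypothetical helper: merges 'a' . 'b' into 'ab' until no pair is left.
function fold_concatenation(array $tokens): array
{
    do {
        $changed = false;
        $out = [];
        $count = count($tokens);
        for ($i = 0; $i < $count; $i++) {
            if ($i + 2 < $count
                && is_simple_literal($tokens[$i])
                && $tokens[$i + 1] === '.'
                && is_simple_literal($tokens[$i + 2])
            ) {
                $merged = substr($tokens[$i][1], 1, -1) . substr($tokens[$i + 2][1], 1, -1);
                // var_export() re-quotes the merged value as a valid PHP literal.
                $out[] = [T_CONSTANT_ENCAPSED_STRING, var_export($merged, true), $tokens[$i][2]];
                $i += 2; // skip the "." and the right-hand literal
                $changed = true;
            } else {
                $out[] = $tokens[$i];
            }
        }
        $tokens = $out;
    } while ($changed);
    return $tokens;
}

// Whitespace is removed first so the operands sit directly next to the "." operator.
$tokens = array_values(array_filter(
    token_get_all("<?php \$f = 'ev' . 'a' . 'l';"),
    fn($t) => !is_array($t) || $t[0] !== T_WHITESPACE
));
foreach (fold_concatenation($tokens) as $token) {
    echo is_array($token) ? $token[1] : $token;
}
echo PHP_EOL;
```

After two passes, the three fragments collapse into the single literal 'eval', which later checks can then inspect as one token.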

While simplifying the code, we also track the origin of variables, among other things.

As a result, we get clean code that can be analyzed. It is very important that we get the code not as a string but as tokens. This way we know which token is a string containing the text we are looking for and which token is a function.

When searching for a "bad construct" such as eval, this distinction matters to us:

<?php echo 'eval("echo \"some\"")'; ?>

- in this case there will be no T_EVAL token; instead there will be a T_CONSTANT_ENCAPSED_STRING token 'eval("echo \"some\"")'

<?php eval('echo "some"'); ?>

- and here there will be a T_EVAL token. This is exactly the case we detect.
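
The difference between the two samples can be checked with the tokenizer directly. This is a small sketch and the helper name count_eval_tokens is hypothetical.

```php
<?php
// Hypothetical helper: counts real eval constructs (T_EVAL tokens),
// ignoring the word "eval" when it only appears inside a string.
function count_eval_tokens(string $source): int
{
    $count = 0;
    foreach (token_get_all($source) as $token) {
        if (is_array($token) && $token[0] === T_EVAL) {
            $count++;
        }
    }
    return $count;
}

// Nowdoc syntax keeps the samples exactly as written, without escaping.
$insideString = <<<'PHP'
<?php echo 'eval("echo \"some\"")'; ?>
PHP;

$realCall = <<<'PHP'
<?php eval('echo "some"'); ?>
PHP;

echo count_eval_tokens($insideString), PHP_EOL; // prints 0
echo count_eval_tokens($realCall), PHP_EOL;     // prints 1
```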

We look for such constructs and sort them by severity:

  1. Critical:
    • eval
    • include* and require*
      • with a bad file extension
      • of nonexistent files (this check will be removed in the next version)
      • of remote files
  2. Dangerous:
    • system
    • passthru
    • proc_open
    • exec
    • include* and require*
      • with the error suppression operator (this check will be removed in the next version)
      • with variables that depend on POST or GET
  3. Suspicious:
    • base64_encode
    • str_rot13
    • syslog
  4. Other.
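
A simplified sketch of how part of this classification could be applied to a token stream follows. It covers only eval and the plain function calls from the list. The include* and require* checks are left out because they need the simplified arguments. The names SEVERITY_BY_FUNCTION and classify_constructs, and the shape of the returned findings, are assumptions for illustration.

```php
<?php
// Function names and severities taken from the list above.
const SEVERITY_BY_FUNCTION = [
    'system'        => 'dangerous',
    'passthru'      => 'dangerous',
    'proc_open'     => 'dangerous',
    'exec'          => 'dangerous',
    'base64_encode' => 'suspicious',
    'str_rot13'     => 'suspicious',
    'syslog'        => 'suspicious',
];

// Hypothetical helper: returns a list of findings with severity, construct, and line.
function classify_constructs(string $source): array
{
    // Drop whitespace and comments so a call is always NAME followed directly by "(".
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn($t) => !is_array($t) || !in_array($t[0], [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT], true)
    ));

    $findings = [];
    foreach ($tokens as $i => $token) {
        if (!is_array($token)) {
            continue;
        }
        [$id, $text, $line] = $token;
        if ($id === T_EVAL) {
            $findings[] = ['severity' => 'critical', 'construct' => 'eval', 'line' => $line];
            continue;
        }
        $name = strtolower($text); // PHP function names are case-insensitive
        $severity = SEVERITY_BY_FUNCTION[$name] ?? null;
        $isCall = $id === T_STRING && ($tokens[$i + 1] ?? null) === '(';
        // Skip method calls and declarations such as $db->exec() or function exec().
        $prev = $tokens[$i - 1] ?? null;
        $isMember = is_array($prev)
            && in_array($prev[0], [T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_FUNCTION], true);
        if ($severity !== null && $isCall && !$isMember) {
            $findings[] = ['severity' => $severity, 'construct' => $name, 'line' => $line];
        }
    }
    return $findings;
}

$sample = "<?php eval(\$a);\nsystem('ls');\n\$db->exec('x');\necho str_rot13('a');";
foreach (classify_constructs($sample) as $finding) {
    echo $finding['line'], ': ', $finding['construct'], ' (', $finding['severity'], ')', PHP_EOL;
}
```

Because the check works on tokens rather than on raw text, a method call such as $db->exec() is not reported as the exec function. Namespaced calls like \system() produce different token types in PHP 8 and are not handled in this sketch.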

We are constantly improving this analysis: we add new constructs to search for, reduce the number of false positives, and optimize code simplification.

We plan to teach it to detect and decode strings encoded with URL encoding, Base64, and other schemes.

The plugin itself is available in the WordPress directory.