Authors:
(1) Simon R. Davies, School of Computing, Edinburgh Napier University, Edinburgh, UK;
(2) Richard Macfarlane, School of Computing, Edinburgh Napier University, Edinburgh, UK;
(3) William J. Buchanan, School of Computing, Edinburgh Napier University, Edinburgh, UK.
This section introduces a collection of tests that could be used in combination to determine whether a process is malicious or benign. Each test has a binary outcome: a failure indicates that the subject of the test is more likely to be malicious, and a pass indicates that it is more likely to be benign. The vote from each test is recorded and contributes to the final overall malice score of the process under investigation. Every test carries the same weighting and thus has the same impact on the final score. After all the tests have been conducted, the process is classified as malicious or benign based on an aggregation of the received votes. A conceptual overview of how the proposed system would be configured is shown in Figure 1.
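As an illustration of the equal-weight voting described above, the following minimal Python sketch aggregates the binary test outcomes into a malice score. The function name `classify`, the test signature and the simple-majority threshold of 0.5 are assumptions made for this example; the text states only that votes are equally weighted and aggregated.

```python
from typing import Callable, Dict

# Each test returns True when the file passes (looks benign) and
# False when it fails (looks malicious). Hypothetical signature.
FileTest = Callable[[str, bytes], bool]


def classify(file_name: str, content: bytes, tests: Dict[str, FileTest],
             threshold: float = 0.5) -> str:
    """Run every test, count equally weighted votes and classify.

    The 0.5 threshold (simple majority) is an assumption for illustration.
    """
    if not tests:
        raise ValueError("at least one test is required")
    # Record one vote per test.
    votes = {name: test(file_name, content) for name, test in tests.items()}
    failures = sum(1 for passed in votes.values() if not passed)
    # Fraction of tests that failed; every test has the same weight.
    malice_score = failures / len(votes)
    return "malicious" if malice_score > threshold else "benign"
```

With three tests of which two fail, the malice score is 2/3 and the sketch returns "malicious"; with only one failure it returns "benign".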
This collection of tests is performed on any output produced by the process. In the majority of cases, this output takes the form of files being written to disk. This behaviour is common for processes such as editors, web downloads, email clients, system logging and compression programs, as well as for crypto-ransomware programs. The tests use both the content of the file being written and metrics derived from the file’s metadata, such as file name and extension.
The NapierOne [16] data set was leveraged in many of the tests that rely on file analysis. This data set is ideally suited for this task as it contains many examples of the most commonly used file types. The data set contains 5,000 example files for each of the prevalent file types shown in Table 1.
Apart from the normal file types found in typical use, the NapierOne data set also contains example files that have been encrypted by the ransomware strains shown in Table 2. The data set contains 5,000 example encrypted files for each of these ransomware strains. (The SHA256 hash values for these ransomware strains are provided in Table 5, which appears in the Appendix.) According to previous work [67, 59], when evaluating the performance of ransomware detectors, using diverse families of ransomware is more important than using a large number of samples from a few families. This is because the core behavioural traits that crypto-ransomware exhibits when encrypting data do not change from one variant to another within a family [67].
The entire dataset used during this research contains 365,000 files covering 73 separate and distinct file types and is publicly accessible at www.napierone.com. The dataset contains 210,000 benign files from the 42 different file types shown in Table 1 and 155,000 encrypted files from the 31 ransomware strains shown in Table 2.
File Magic Number Test. Magic numbers are usually the first few bytes of a file. They are normally unique to a file format and can be used to identify many common types of files [75]. Not all files contain this signature: plain text files such as CSS, CSV, JSON, SVG, TXT and XLST do not, whereas file types such as DOCX, PDF, XLSX and many others do. An extensive search was performed [18, 75, 10, 27, 35, 48, 76] to generate a comprehensive list of commonly used file types and, where possible, the corresponding magic number and typical file extension for each type. This research resulted in a reference list of more than 600 documented magic numbers and corresponding file extensions.
This test determines the magic number of the file under investigation and compares it with the file name’s extension to confirm that they correlate. As plain text files do not have a magic number, they were excluded from this test. The test was then applied to all the remaining files within the test dataset. For a file under test, if its magic number matched the expected file extension, the test passed and the file was considered benign. Otherwise, the test failed and the file was considered a possible consequence of malicious activity.
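The comparison between magic number and extension can be sketched in Python as follows. The `MAGIC_NUMBERS` table is a tiny illustrative subset of well-known signatures and is an assumption of this example, not the reference list of more than 600 entries compiled in this research; the function name is likewise hypothetical.

```python
import os

# Tiny illustrative subset of well-known signatures (assumption);
# the reference list described in the text has more than 600 entries.
MAGIC_NUMBERS = {
    b"%PDF": {".pdf"},
    b"PK\x03\x04": {".docx", ".xlsx", ".pptx", ".zip"},
    b"\x89PNG\r\n\x1a\n": {".png"},
    b"\xff\xd8\xff": {".jpg", ".jpeg"},
    b"GIF8": {".gif"},
}


def magic_number_test(file_name: str, header: bytes) -> bool:
    """Return True (pass) when the magic number agrees with the extension."""
    extension = os.path.splitext(file_name)[1].lower()
    for magic, extensions in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            # Known signature: pass only if the extension is an expected one.
            return extension in extensions
    # No known signature found at the start of the file: the test fails.
    return False
```

For example, `magic_number_test("report.pdf", b"%PDF-1.7\n")` passes, whereas the same file name with a header of arbitrary bytes fails, as would a PDF header stored under a `.png` extension.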
Printable Characters Test. This is a complementary test and is only run on files that do not usually contain a magic number. As these are plain text files, the majority of their contents should be printable ASCII characters. Examples of files of this type are markup files such as HTML and plain text documents such as TXT. Printable characters are defined here as characters that have an ASCII value between 32 and 126, together with the format control characters, which have ASCII values between 9 and 13. An analysis of the nearly 50,000 plain text files in the NapierOne dataset found that, on average, plain text files contain at least 98% printable ASCII content.
The test was then applied to all the plain text files within the test dataset. For a file under test, if its printable ASCII content was above 98%, then the test passed and the file was considered benign, otherwise, the test failed and the file was considered a product of malicious activity.
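A minimal Python sketch of this check is given below, using the character ranges and the 98% threshold stated above. The function names are hypothetical, and treating an empty file as fully printable is an assumption of this example.

```python
def printable_ratio(content: bytes) -> float:
    """Fraction of bytes that are printable ASCII (32-126) or
    format control characters (9-13)."""
    if not content:
        return 1.0  # assumption: an empty file is treated as fully printable
    printable = sum(1 for b in content if 32 <= b <= 126 or 9 <= b <= 13)
    return printable / len(content)


def printable_characters_test(content: bytes, threshold: float = 0.98) -> bool:
    """Return True (pass) when the printable content is above the threshold."""
    return printable_ratio(content) > threshold
```

A small HTML fragment consists entirely of printable characters and passes, whereas a buffer containing every byte value once has only 100 printable bytes out of 256 and fails.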
File Entropy Test. A recurring theme within many crypto-ransomware detection techniques is the concept of randomness and file entropy. Researchers assert [56, 24, 25, 38] that a good indicator of crypto-ransomware activity is the generation of files whose contents appear to be random and contain no distinguishable structure. It is generally agreed that well-encrypted data should be indistinguishable from random data [12]. Traditionally, researchers in crypto-ransomware detection have used Shannon entropy [69] to calculate this metric; in this research, however, the chi-square [22] method was used instead, based on the findings of Davies [17].
The test was then applied to all the files within the test dataset. For a file under test, if its chi-square probability value was less than 0.01 [74], then the test passed and the file was considered benign. Otherwise, the test failed and the file was considered the product of malicious activity.
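One way such a check could be sketched in Python is shown below. It computes the chi-square statistic of the byte histogram against a uniform distribution over 256 values and converts it to an upper-tail probability. The function names are hypothetical, and the use of the Wilson-Hilferty normal approximation (chosen here to avoid a dependency on a statistics library) is an assumption of this example rather than the calculation used in this research.

```python
import math
from collections import Counter


def chi_square_statistic(data: bytes) -> float:
    """Chi-square statistic of the byte histogram against a uniform
    distribution over the 256 possible byte values."""
    if not data:
        raise ValueError("cannot test an empty file")
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


def chi_square_p_value(statistic: float, dof: int = 255) -> float:
    """Approximate upper-tail probability using the Wilson-Hilferty
    normal approximation (an assumption made for this sketch)."""
    spread = 2.0 / (9.0 * dof)
    z = ((statistic / dof) ** (1.0 / 3.0) - (1.0 - spread)) / math.sqrt(spread)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def file_entropy_test(data: bytes, threshold: float = 0.01) -> bool:
    """Return True (pass) when the probability value is below the threshold,
    meaning the byte distribution is clearly not uniform."""
    return chi_square_p_value(chi_square_statistic(data)) < threshold
```

Highly structured content such as repeated text yields a very large statistic and a probability close to zero, so it passes. A buffer with a perfectly flat byte histogram yields a statistic of zero and a probability of one, so it fails.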
BitByte Value Test. This test is based on the method described by Davies [15], which successfully distinguished between encrypted files and all other file types. The method is particularly effective at differentiating between encrypted and compressed files, a separation that has previously proven difficult to achieve with a reasonable level of accuracy. Essentially, the test profiles the entropy distribution of the first few hundred bytes of the file under examination and compares this profile with the entropy distribution of a control file. The difference between the two entropy profiles is then calculated to produce a value known as the BitByte value. Files that produce lower BitByte values have a higher probability of being encrypted. The research [15] identified that any BitByte value below 56 indicates, with high probability, that the file is encrypted and thus possibly a consequence of a ransomware infection.
The test was then applied to all the files within the test dataset. For a file under test, if its BitByte value was greater than 56, then the test passed and the file was considered benign. Otherwise, the test failed and the file was considered a product of malicious activity.
Ransom Note Creation Test. During a crypto-ransomware attack, one action often performed by the malicious process is to generate a ransom note file. The purpose of this file is two-fold. Firstly, it informs the user that their files have been encrypted and that they are the victim of a ransomware attack. Secondly, its contents usually provide the victim with instructions on how they can recover from the attack and retrieve their files. The ransom note normally explains how the victim should transfer a specific amount of crypto-currency to the perpetrator of the attack in exchange for help in recovering the affected files. Several characteristics of a ransom note file can normally be used to distinguish it from other files. The file is normally below one KB in size, is plain text and usually contains specific keywords such as: encrypted, ransom, tor, onion, recover, wallet, bitcoin [47]. In this test, the file name is also analysed for typical ransom note file name strings such as: decrypt, readme, restore and helpme. It has been identified that these ransom note files are often created prior to the actual encryption of the target files, so identifying the creation of a ransom note would be a good predictor of impending file encryption. This approach was leveraged in the HelDroid [6] ransomware detection system, which utilised a text classifier that applies linguistic features to detect threatening text.
The test was then applied to all the files within the test dataset. For a file under test, if it was of limited size and its contents contained one or more of the trigger keywords, then the test failed and the file was considered malicious. Otherwise, the test passed and the file was considered benign.
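The following minimal Python sketch illustrates a check of this kind, using the keywords, file name strings and one KB size limit mentioned above. The function name is hypothetical, and two design choices are assumptions of this example: keywords are matched as whole words, so that a short keyword such as "tor" does not match inside "history", and a file name match is treated as an additional trigger alongside the content keywords. The sketch does not verify that the file is plain text.

```python
import re

CONTENT_KEYWORDS = {"encrypted", "ransom", "tor", "onion",
                    "recover", "wallet", "bitcoin"}
NAME_HINTS = ("decrypt", "readme", "restore", "helpme")
MAX_NOTE_SIZE = 1024  # one KB, the size limit mentioned in the text


def ransom_note_test(file_name: str, content: bytes) -> bool:
    """Return True (pass, benign) unless the file looks like a ransom note."""
    if len(content) >= MAX_NOTE_SIZE:
        return True  # too large to match the ransom note profile
    text = content.decode("ascii", errors="ignore").lower()
    # Whole-word matching (assumption) avoids hits inside longer words.
    words = set(re.findall(r"[a-z]+", text))
    keyword_hit = bool(words & CONTENT_KEYWORDS)
    # Treating a file name match as a trigger is an assumption of this sketch.
    name_hit = any(hint in file_name.lower() for hint in NAME_HINTS)
    return not (keyword_hit or name_hit)
```

A short file containing "Your files have been encrypted. Send bitcoin to recover them." fails the test, whereas a short meeting note with no trigger words, or any file of one KB or more, passes.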
This paper is available on arXiv under the CC BY 4.0 DEED license.