Dataset Formats Supported by leakiEst

# leakiEst

## Datasets

leakiEst estimates the information leakage from the secret information to the public information in a system, based on a dataset containing both types of information recorded during previous executions of that system. The secret and public information from each execution can be presented in several ways in the dataset.

### Supported dataset formats

leakiEst supports three dataset formats, described below. Throughout, S refers to the secret information, and O to the public output, that occur during a single execution of the system.

#### Observation files

This is the simplest format. Each execution of the system is represented by a single line in the file, and every line has the form (S,O).

For example, the following observation file describes the execution of a simple system that reads a secret integer between 1 and 9 and outputs that integer modulo 3:

```
(1,1)
(7,1)
(5,2)
(4,1)
(9,0)
(8,2)
(3,0)
(4,1)
(7,1)
(8,2)
(2,2)
(9,0)
(6,0)
```
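As a minimal sketch, an observation file for this example system could be generated with a few lines of Python. The function names `run_system` and `make_observations` are hypothetical, and the uniform choice of secrets is an assumption; the text does not say how the secrets in the example file were chosen.

```python
import random


def run_system(secret):
    """The example system: output the secret modulo 3."""
    return secret % 3


def make_observations(count, seed=None):
    """Return `count` observation lines of the form (S,O).

    Assumption: secrets are drawn uniformly from 1..9.
    """
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        secret = rng.randint(1, 9)  # randint includes both endpoints
        lines.append(f"({secret},{run_system(secret)})")
    return lines


if __name__ == "__main__":
    # One execution per line, ready to be saved as an observation file.
    print("\n".join(make_observations(13, seed=0)))
```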

#### Channel files

This format encodes an information-theoretic channel in tabular form, with secret information as the input to the channel and public output as the output from the channel. The top-left cell in the table has the form (m,n), where m is the number of unique inputs to the channel (i.e., the number of rows in the table) and n is the number of unique outputs from the channel (i.e., the number of columns in the table). The cell corresponding to secret S and output O contains the probability of the system outputting O given that the secret was S. The probabilities in each row of the table should therefore sum to 1.

The following channel file describes the behaviour of the example modulo-3 system described earlier:

```
(9,3) | 0        | 1        | 2
1     | 0.000000 | 1.000000 | 0.000000
2     | 0.000000 | 0.000000 | 1.000000
3     | 1.000000 | 0.000000 | 0.000000
4     | 0.000000 | 1.000000 | 0.000000
5     | 0.000000 | 0.000000 | 1.000000
6     | 1.000000 | 0.000000 | 0.000000
7     | 0.000000 | 1.000000 | 0.000000
8     | 0.000000 | 0.000000 | 1.000000
9     | 1.000000 | 0.000000 | 0.000000
```
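The relationship between the two formats can be illustrated by tabulating relative frequencies from the observations. The sketch below is not a description of how leakiEst computes its estimates; `estimate_channel` and `format_channel` are hypothetical helpers, and the printed layout only imitates the example above, without the column padding.

```python
from collections import Counter, defaultdict

# (secret, output) pairs taken from the observation file example.
PAIRS = [(1, 1), (7, 1), (5, 2), (4, 1), (9, 0), (8, 2), (3, 0),
         (4, 1), (7, 1), (8, 2), (2, 2), (9, 0), (6, 0)]


def estimate_channel(pairs):
    """Return (outputs, channel), where channel[s][o] is the relative
    frequency of output o among the executions with secret s."""
    counts = defaultdict(Counter)
    for secret, output in pairs:
        counts[secret][output] += 1
    outputs = sorted({output for _, output in pairs})
    channel = {}
    for secret in sorted(counts):
        total = sum(counts[secret].values())
        channel[secret] = {o: counts[secret][o] / total for o in outputs}
    return outputs, channel


def format_channel(outputs, channel):
    """Render the table with the (m,n) cell in the top-left corner."""
    rows = [f"({len(channel)},{len(outputs)}) | "
            + " | ".join(str(o) for o in outputs)]
    for secret, row in channel.items():
        rows.append(f"{secret} | "
                    + " | ".join(f"{row[o]:.6f}" for o in outputs))
    return "\n".join(rows)


if __name__ == "__main__":
    outputs, channel = estimate_channel(PAIRS)
    # Each row is a conditional distribution, so it should sum to 1.
    assert all(abs(sum(r.values()) - 1.0) < 1e-9 for r in channel.values())
    print(format_channel(outputs, channel))
```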

#### ARFF files

leakiEst can also read files encoded in the Attribute-Relation File Format (ARFF) used by Weka, a machine learning tool. ARFF datasets describe the relationship between different features, or attributes, of a system.

leakiEst treats each line in the data section of the ARFF dataset as a single execution of the system. ARFF files may contain an arbitrary number of attributes, not all of which are necessarily intended for processing by leakiEst. The attributes (or groups of attributes) containing secret and public information must therefore be identified using the -high and -low command line options respectively.

If represented as an ARFF dataset, the behaviour of the example modulo-3 system described earlier may look like this:

```
@relation modulo3

@attribute secret integer
@attribute modulus integer

@data
1,1
7,1
5,2
4,1
9,0
8,2
3,0
4,1
7,1
8,2
2,2
9,0
6,0
```

Using the -high and -low command line options, the secret attribute would be identified as containing the secret information and the modulus attribute would be identified as containing the public information.
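For illustration, (S,O) pairs such as those in the observation file example could be written out in this ARFF layout as follows. The helper `to_arff` and its parameter names are hypothetical; the attribute names and the `integer` type are simply those used in the example above.

```python
def to_arff(pairs, relation="modulo3", high_name="secret", low_name="modulus"):
    """Return ARFF text with one data line per (secret, output) pair."""
    lines = [
        f"@relation {relation}",
        "",
        f"@attribute {high_name} integer",
        f"@attribute {low_name} integer",
        "",
        "@data",
    ]
    # Each execution becomes one comma-separated line in the data section.
    lines.extend(f"{secret},{output}" for secret, output in pairs)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_arff([(1, 1), (7, 1), (5, 2)]), end="")
```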

### Randomising execution data in datasets

For leakiEst's -t and -csv command line options to produce meaningful output when processing observation or ARFF files, the order of the lines representing the execution data in those file types may need to be randomised. For observation files, this can be performed with any utility that randomises the order of lines in text files, such as shuf(1) on Linux. A similar procedure can be used for ARFF files, although only the lines in the data section should be reordered, so that the header section is not corrupted.
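One way to shuffle an ARFF file without touching its header is sketched below. The helper `shuffle_arff_data` is hypothetical, and it assumes the simple layout shown earlier: a single `@data` line followed by one execution per line, with no comment lines in the data section.

```python
import random


def shuffle_arff_data(text, seed=None):
    """Return ARFF text with the data lines shuffled and the header intact."""
    lines = text.splitlines()
    # Locate the @data marker; ARFF keywords are case-insensitive.
    for index, line in enumerate(lines):
        if line.strip().lower() == "@data":
            break
    else:
        raise ValueError("no @data line found")
    header = lines[: index + 1]
    # Assumption: every non-blank line after @data is one execution.
    rows = [line for line in lines[index + 1:] if line.strip()]
    random.Random(seed).shuffle(rows)
    return "\n".join(header + rows) + "\n"
```

An observation file has no header, so the same idea reduces to calling `random.shuffle` on the full list of lines.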