GP-Fileprints: File Types Detection Using Genetic Programming

  abstract =     "We propose a novel application of Genetic Programming
                 (GP): the identification of file types via the analysis
                 of raw binary streams (i.e., without the use of
                 metadata). GP evolves programs with multiple components.
                 One component analyses statistical features extracted
                 from the raw byte-series to divide the data into
                 blocks. These blocks are then analysed via another
                 component to obtain a signature for each file in a
                 training set. These signatures are then projected onto
                 a two-dimensional Euclidean space via two further
                 (evolved) program components. K-means clustering is
                 applied to group similar signatures. Each cluster is
                 then labelled according to the dominant label for its
                 members. Once a program that achieves good
                 classification is evolved, it can be used on unseen data
                 without requiring any further evolution. Experimental
                 results show that GP compares very well with
                 established file classification algorithms (i.e.,
                 Neural Networks, Bayes Networks and J48 Decision
