This page collects the output of the ongoing study of Corpus Derived Semantic Representations by John Bullinaria and Joe Levy.
Currently it provides links to our published research papers in this area, and some of the key word sets and semantic vectors discussed in those papers.
Bullinaria, J.A. & Levy, J.P. (2012). Extracting Semantic Representations from Word Co-occurrence Statistics: Stop-lists, Stemming and SVD. Behavior Research Methods, 44, ?-?. (pdf)
Levy, J.P. & Bullinaria, J.A. (2012). Using Enriched Semantic Representations in Predictions of Human Brain Activity. In: E.J. Davelaar (Ed.), Connectionist Models of Neurocognition and Emergent Behavior: From Theory to Applications, 292-308. Singapore: World Scientific. (pdf)
Bullinaria, J.A. (2008). Semantic Categorization Using Simple Word Co-occurrence Statistics. In: M. Baroni, S. Evert & A. Lenci (Eds.), Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 1-8. Hamburg, Germany: ESSLLI. (pdf)
Bullinaria, J.A. & Levy, J.P. (2007). Extracting Semantic Representations from Word Co-occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. (pdf)
Levy, J.P. & Bullinaria, J.A. (2001). Learning Lexical Properties from Word Usage Patterns: Which Context Words Should be Used? In: R.M. French & J.P. Sougne (Eds.), Connectionist Models of Learning, Development and Evolution: Proceedings of the Sixth Neural Computation and Psychology Workshop, 273-282. London: Springer. (pdf)
Levy, J.P., Bullinaria, J.A. & Patel, M. (1998). Explorations in the Derivation of Semantic Representations from Word Co-occurrence Statistics. South Pacific Journal of Psychology, 10, 99-111. (pdf)
Patel, M., Bullinaria, J.A. & Levy, J.P. (1997). Extracting Semantic Representations from Large Text Corpora. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 199-212. London: Springer. (pdf)
Bullinaria, J.A. & Huckle, C.C. (1997). Modelling Lexical Decision Using Corpus Derived Semantic Representations in a Connectionist Network. In: J.A. Bullinaria, D.W. Glasspool & G. Houghton (Eds.), Fourth Neural Computation and Psychology Workshop: Connectionist Representations, 213-226. London: Springer. (pdf)
To facilitate further research in this area, the key word sets and semantic vectors discussed in Bullinaria & Levy (2012) are made available here. There are four semantic tasks, each with a word list in a plain text file and three corresponding sets of vectors in MATLAB-formatted binary files (MAT-files). Each set of vectors is computed as described in the paper, using an L+R word co-occurrence window of size 1.
| Task | Word set | Vectors |
|---|---|---|
| TOEFL | 400 words | PPMI - PC - Caron |
| Distance | 400 words | PPMI - PC - Caron |
| Sem.Cat. | 530 words | PPMI - PC - Caron |
| Purity | 60 words | PPMI - PC - Caron |
PPMI = Positive Pointwise Mutual Information, 10000 components ordered by context word frequency
PC = Principal Components (US from SVD), 10000 components ordered by singular value (50k starting matrix)
Caron = Caron approach vectors (US^0.25 from SVD), 10000 components ordered by singular value (50k starting matrix)
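As a rough illustration of what the PPMI components in the legend above contain, the following minimal sketch converts a small table of raw co-occurrence counts into PPMI values. It is not the code used to produce the downloadable vectors: the function name `ppmi_vectors`, the toy counts, and the use of the natural logarithm are all assumptions made for the example. The SVD-based PC and Caron variants are not shown.

```python
import math


def ppmi_vectors(counts):
    """Turn raw co-occurrence counts into PPMI vectors.

    counts[t][c] is how often context word c occurred in the window
    around target word t. (Hypothetical toy input, not ukWaC counts.)
    """
    total = sum(sum(row) for row in counts)
    target_totals = [sum(row) for row in counts]
    context_totals = [sum(col) for col in zip(*counts)]

    vectors = []
    for t, row in enumerate(counts):
        vec = []
        for c, n in enumerate(row):
            if n == 0:
                # log(0) is undefined; the positive variant stores 0 here.
                vec.append(0.0)
                continue
            # PMI = log( p(t, c) / (p(t) * p(c)) ), written with raw counts.
            pmi = math.log(n * total / (target_totals[t] * context_totals[c]))
            # Keep only positive associations.
            vec.append(max(0.0, pmi))
        vectors.append(vec)
    return vectors


# Two target words (rows) by three context words (columns).
toy_counts = [
    [8, 2, 0],
    [1, 6, 3],
]
toy_ppmi = ppmi_vectors(toy_counts)
```

Each row of `toy_ppmi` plays the role of one word's semantic vector. Negative and undefined PMI values both become zero, so only context words that occur more often than chance with the target carry weight.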
The TOEFL task was first used by Tom Landauer & Susan Dumais (1997), A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge, Psychological Review, 104, 211-240.
The Purity task uses the word set of Tom Mitchell et al. (2008), Predicting human brain activity associated with the meanings of nouns, Science, 320, 1191-1195.
All the vectors were generated using the ukWaC corpus that is available from WaCky.
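For readers who want to see what a co-occurrence window of size 1 means in practice, here is a minimal sketch of counting over a tokenised text. It assumes that "L+R" means the counts from the left and right neighbours are pooled into a single count per context word. The function name `cooccurrence_counts` and the toy sentence are hypothetical, and the preprocessing applied to ukWaC for the actual vectors is not reproduced here.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(tokens, window=1):
    """Count context words within `window` positions on either side.

    Left and right neighbours are pooled into one count per context
    word (an assumed reading of the "L+R" window type).
    """
    counts = defaultdict(Counter)
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[target][tokens[j]] += 1
    return counts


# Toy corpus standing in for a real tokenised text.
toy_tokens = "the cat sat on the mat".split()
window_counts = cooccurrence_counts(toy_tokens)
```

Here `window_counts["the"]` holds one count each for "cat", "on" and "mat", because "the" occurs twice and its immediate neighbours are pooled across both occurrences.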