In our paper "Genre classification on German novels" we used a data set of 1682 novels, some of which are labeled with genre information. Lukas Weimer, student assistant at the Chair of Computer Philology, assigned these novels to one of two subgenres, educational or social. All novels are freely available and were collected from TextGrid, DTA and Gutenberg.
All novels were written in or translated into German, and their dates of origin range from the 16th to the 20th century. Authors include Charles Dickens, Theodor Fontane, Karl May, Sir Walter Scott and Émile Zola. Text lengths range from 4,000 to over one million words, with an average of 100,000 words.
For the TIR15 Genre corpus we divided the novels into five disjoint folders:
- labeled educational: 37 novels
- labeled social: 63 novels
- prototype educational: 21 novels
- prototype social: 11 novels
- unlabeled: 1550 novels
In our paper we used the two prototype folders (32 novels) as one data set and the prototype and labeled folders together (132 novels) as a second. The entire set of 1682 novels was used for Latent Dirichlet Allocation; the stopword list used for it is provided as well.
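The two data sets described above can be assembled from the folders roughly as follows. This is only a minimal sketch: the folder names, the `.txt` file extension and the helper names `collect`, `build_datasets` and `expected_size` are assumptions made for illustration, and the actual layout of the download may differ.

```python
from pathlib import Path

# Folder sizes as stated in the corpus description.
# The folder names themselves are hypothetical.
FOLDER_SIZES = {
    "labeled_educational": 37,
    "labeled_social": 63,
    "prototype_educational": 21,
    "prototype_social": 11,
    "unlabeled": 1550,
}

# Mapping from (hypothetical) folder name to genre label.
PROTOTYPE = {
    "prototype_educational": "educational",
    "prototype_social": "social",
}
LABELED = {
    "labeled_educational": "educational",
    "labeled_social": "social",
}


def collect(root, folders):
    """Return (path, genre) pairs for every .txt file in the given folders."""
    root = Path(root)
    pairs = []
    for folder, genre in folders.items():
        for path in sorted((root / folder).glob("*.txt")):
            pairs.append((path, genre))
    return pairs


def build_datasets(root):
    """Build the prototype-only set and the prototype-plus-labeled set."""
    prototype_set = collect(root, PROTOTYPE)
    combined_set = prototype_set + collect(root, LABELED)
    return prototype_set, combined_set


def expected_size(folders):
    """Number of novels the description states for the given folders."""
    return sum(FOLDER_SIZES[name] for name in folders)
```

Comparing `len(prototype_set)` and `len(combined_set)` with `expected_size(PROTOTYPE)` and `expected_size({**PROTOTYPE, **LABELED})` gives a quick check that a download is complete.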
You can download the corpus using this link. If you would like to refer to the TIR15 Genre corpus, please cite:
- Hettinger, L., Becker, M., Reger, I., Jannidis, F., Hotho, A.: Genre classification on German novels. Proceedings of the 12th International Workshop on Text-based Information Retrieval (2015).