This repository contains the implementation of a classification model that distinguishes between fiction and non-fiction texts using linguistic features derived from part-of-speech (POS) tagging. Inspired by the research paper "A Simple Approach to Classify Fictional and Non-Fictional Genres", we replicate its results with one modification: the NLTK POS tagger replaces the tagger used in the paper. The results are comparable to those of the original study, which suggests its findings are robust to the choice of tagger. We further explore additional POS-based features for genre classification.
The primary goal of this project is to classify text as fiction or non-fiction based on POS-based features. Initially, the study focuses on two key features:
- Adverb-to-Adjective Ratio
- Adjective-to-Pronoun Ratio
The classification is done using a logistic regression model. Additional POS-based features are also explored to test their efficacy in genre classification. Read the original paper for more details.
- Adverb-to-Adjective Ratio: Measures the prevalence of descriptive adverbs relative to adjectives.
- Adjective-to-Pronoun Ratio: Measures the descriptive richness of text in relation to pronouns.
- Custom POS-Based Features: Additional features derived from linguistic analysis are being evaluated for performance improvement.
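The two ratios above can be sketched as follows. This is a minimal illustration, not the repository's `extract_two_features` function: the names `pos_ratios` and `ratios_from_text` are hypothetical, and the tag sets assume the Penn Treebank tags returned by NLTK's default English tagger. The original paper may group tags differently.

```python
from collections import Counter

# Assumed Penn Treebank tag groups; the paper's exact grouping may differ.
ADVERB_TAGS = {"RB", "RBR", "RBS"}
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}
PRONOUN_TAGS = {"PRP", "PRP$"}


def pos_ratios(tagged_tokens):
    """Return (adverb/adjective, adjective/pronoun) for (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    adverbs = sum(counts[t] for t in ADVERB_TAGS)
    adjectives = sum(counts[t] for t in ADJECTIVE_TAGS)
    pronouns = sum(counts[t] for t in PRONOUN_TAGS)
    # Fall back to 0.0 when a denominator is empty (an arbitrary choice).
    adv_adj = adverbs / adjectives if adjectives else 0.0
    adj_pro = adjectives / pronouns if pronouns else 0.0
    return adv_adj, adj_pro


def ratios_from_text(text):
    """Tag raw text with NLTK, then compute both ratios."""
    # Imported here so pos_ratios works without NLTK installed.
    # Needs NLTK's tokenizer and tagger resources (see nltk.download).
    import nltk

    return pos_ratios(nltk.pos_tag(nltk.word_tokenize(text)))


# Hand-tagged sample: 1 adverb, 2 adjectives, 1 pronoun.
sample = [
    ("She", "PRP"), ("quickly", "RB"), ("opened", "VBD"),
    ("the", "DT"), ("old", "JJ"), ("red", "JJ"), ("door", "NN"),
]
adv_adj, adj_pro = pos_ratios(sample)
print(adv_adj, adj_pro)  # 0.5 2.0
```

Keeping the counting separate from the tagging makes the ratios easy to check on hand-tagged input before running them over a full corpus.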
- Brown Corpus: A collection of texts categorized into fiction and non-fiction, provided by the NLTK library.
- Baby BNC (British National Corpus): Fictional and non-fictional texts, sourced from the `baby_bnc.csv` file in the repository.
Ensure the following are installed:
- Python 3.7+
- NLTK
- pandas
- scikit-learn
Install dependencies using `pip install -r requirements.txt`.
- Prepare the Data:
  - Place the `baby_bnc.csv` file in the repository root.
  - The Brown Corpus is automatically loaded from NLTK.
- Run the Notebook:
  - Open and execute the Jupyter Notebook `similar_results.ipynb` to reproduce results or experiment with additional features.
- Generate Features:
  - Modify the feature extraction logic in the `extract_two_features` function or extend it to include new features.
- Train and Test:
  - Execute the classification pipeline in the notebook to test the logistic regression model with various feature combinations.
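
The train-and-test step can be sketched roughly as below, assuming scikit-learn's `LogisticRegression` with default settings. The feature rows are invented purely for illustration and are not taken from the Brown Corpus or Baby BNC; the notebook's actual pipeline, splits, and hyperparameters may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented rows: [adverb/adjective ratio, adjective/pronoun ratio].
# These numbers are placeholders, not measurements from either corpus.
X = np.array([
    [0.9, 0.5], [1.1, 0.7], [0.8, 0.6],   # labeled as fiction
    [0.3, 3.0], [0.4, 2.5], [0.2, 3.5],   # labeled as non-fiction
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = fiction, 0 = non-fiction

# Fit a logistic regression model on the two ratio features.
model = LogisticRegression()
model.fit(X, y)

# Classify two unseen (also invented) documents.
new_docs = np.array([[1.0, 0.6], [0.3, 3.2]])
preds = model.predict(new_docs)
print(preds)
```

To try other feature combinations, add or drop columns in `X`; the rest of the sketch stays the same.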
- Using the NLTK POS tagger, the model achieves results comparable to the original study, validating its robustness.
- Preliminary experiments with additional POS-based features show promising directions for improving classification accuracy.
- Exploring additional POS-based ratios to improve classification accuracy.
- Testing the model on a broader set of corpora.
- Applying other machine learning algorithms to evaluate performance enhancements.
- Mohammed Rameez Qureshi, Sidharth Ranjan, Rajakrishnan P. Rajkumar, and Kushal Shah. "A Simple Approach to Classify Fictional and Non-Fictional Genres". Proceedings of the Second Storytelling Workshop, Florence, Italy, August 1, 2019.
- NLTK documentation: https://www.nltk.org/
- Scikit-learn documentation: https://scikit-learn.org/