-
Notifications
You must be signed in to change notification settings - Fork 16
Datasets
QuaPy makes available several datasets that have been used in quantification literature, as well as an interface to allow anyone import their custom datasets.
Three datasets of reviews about Kindle devices, Harry Potter's series, and the well-known IMDb movie reviews can be fetched using a unified interface. For example:
import quapy as qp
data = qp.datasets.fetch_reviews('kindle')These datasets have been used in:
Esuli, A., Moreo, A., & Sebastiani, F. (2018, October).
A recurrent neural network for sentiment quantification.
In Proceedings of the 27th ACM International Conference on
Information and Knowledge Management (pp. 1775-1778).
The list of reviews ids is available in the list:
qp.datasets.REVIEWS_SENTIMENT_DATASETSSome statistics of the fhe available datasets are summarized below:
| Dataset | classes | train size | test size | train prev | test prev | type |
|---|---|---|---|---|---|---|
| hp | 2 | 9533 | 18399 | [0.018, 0.982] | [0.065, 0.935] | text |
| kindle | 2 | 3821 | 21591 | [0.081, 0.919] | [0.063, 0.937] | text |
| imdb | 2 | 25000 | 25000 | [0.500, 0.500] | [0.500, 0.500] | text |
11 Twitter datasets for sentiment analysis. Text is not accessible, and the documents were made available on tf-idf format. Each dataset presents two splits: a train/val split for model selection purposes, and a train+val/test split for model evaluation. The following code exemplifies how to load a twitter dataset for model selection.
import quapy as qp
data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)The dataset was used in:
Gao, W., & Sebastiani, F. (2015, August).
Tweet sentiment: From classification to quantification.
In 2015 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.
Three of the datasets (semeval13, semeval14, and semeval15) share the same training set (semeval), meaning that the training split one would get when requesting any of them is the same. The dataset "semeval" can only be requested with "for_model_selection=True". The lists of the Twitter dataset's ids can be consulted in:
# a list of 11 dataset ids that can be used for model selection or model evaluation
qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST
# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN Some details can be found below:
| Dataset | classes | train size | test size | features | train prev | test prev | type |
|---|---|---|---|---|---|---|---|
| gasp | 3 | 8788 | 3765 | 694582 | [0.421, 0.496, 0.082] | [0.407, 0.507, 0.086] | sparse |
| hcr | 3 | 1594 | 798 | 222046 | [0.546, 0.211, 0.243] | [0.640, 0.167, 0.193] | sparse |
| omd | 3 | 1839 | 787 | 199151 | [0.463, 0.271, 0.266] | [0.437, 0.283, 0.280] | sparse |
| sanders | 3 | 2155 | 923 | 229399 | [0.161, 0.691, 0.148] | [0.164, 0.688, 0.148] | sparse |
| semeval13 | 3 | 11338 | 3813 | 1215742 | [0.159, 0.470, 0.372] | [0.158, 0.430, 0.412] | sparse |
| semeval14 | 3 | 11338 | 1853 | 1215742 | [0.159, 0.470, 0.372] | [0.109, 0.361, 0.530] | sparse |
| semeval15 | 3 | 11338 | 2390 | 1215742 | [0.159, 0.470, 0.372] | [0.153, 0.413, 0.434] | sparse |
| semeval16 | 3 | 8000 | 2000 | 889504 | [0.157, 0.351, 0.492] | [0.163, 0.341, 0.497] | sparse |
| sst | 3 | 2971 | 1271 | 376132 | [0.261, 0.452, 0.288] | [0.207, 0.481, 0.312] | sparse |
| wa | 3 | 2184 | 936 | 248563 | [0.305, 0.414, 0.281] | [0.282, 0.446, 0.272] | sparse |
| wb | 3 | 4259 | 1823 | 404333 | [0.270, 0.392, 0.337] | [0.274, 0.392, 0.335] | sparse |
A set of 32 datasets from the UCI Machine Learning repository used in:
Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes
in data distribution: A case study on quantification.
Information Fusion, 34, 87-100.
The list does not exactly coincide with that used in Pérez-Gállego et al. 2017 since we were unable to find the datasets with ids "diabetes" and "phoneme".
These dataset can be loaded by calling, e.g.:
import quapy as qp
data = qp.datasets.fetch_UCIDataset('yeast', verbose=True)This call will return a Dataset object in which the training and test splits are randomly drawn, in a stratified manner, from the whole collection at 70% and 30%, respectively. The verbose=True option indicates that the dataset description should be printed in standard output. The original data is not split, and some papers submit the entire collection to a kFCV validation. In order to accommodate with these practices, one could first instantiate the entire collection, and then creating a generator that will return one training+test dataset at a time, following a kFCV protocol:
import quapy as qp
collection = qp.datasets.fetch_UCILabelledCollection("yeast")
for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
...Above code will allow to conduct a 2x5FCV evaluation on the "yeast" dataset.
All datasets come in numerical form (dense matrices); some statistics are summarized below.
| Dataset | classes | instances | features | prev | type |
|---|---|---|---|---|---|
| acute.a | 2 | 120 | 6 | [0.508, 0.492] | dense |
| acute.b | 2 | 120 | 6 | [0.583, 0.417] | dense |
| balance.1 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| balance.2 | 2 | 625 | 4 | [0.922, 0.078] | dense |
| balance.3 | 2 | 625 | 4 | [0.539, 0.461] | dense |
| breast-cancer | 2 | 683 | 9 | [0.350, 0.650] | dense |
| cmc.1 | 2 | 1473 | 9 | [0.573, 0.427] | dense |
| cmc.2 | 2 | 1473 | 9 | [0.774, 0.226] | dense |
| cmc.3 | 2 | 1473 | 9 | [0.653, 0.347] | dense |
| ctg.1 | 2 | 2126 | 22 | [0.222, 0.778] | dense |
| ctg.2 | 2 | 2126 | 22 | [0.861, 0.139] | dense |
| ctg.3 | 2 | 2126 | 22 | [0.917, 0.083] | dense |
| german | 2 | 1000 | 24 | [0.300, 0.700] | dense |
| haberman | 2 | 306 | 3 | [0.735, 0.265] | dense |
| ionosphere | 2 | 351 | 34 | [0.641, 0.359] | dense |
| iris.1 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.2 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| iris.3 | 2 | 150 | 4 | [0.667, 0.333] | dense |
| mammographic | 2 | 830 | 5 | [0.514, 0.486] | dense |
| pageblocks.5 | 2 | 5473 | 10 | [0.979, 0.021] | dense |
| semeion | 2 | 1593 | 256 | [0.901, 0.099] | dense |
| sonar | 2 | 208 | 60 | [0.534, 0.466] | dense |
| spambase | 2 | 4601 | 57 | [0.606, 0.394] | dense |
| spectf | 2 | 267 | 44 | [0.794, 0.206] | dense |
| tictactoe | 2 | 958 | 9 | [0.653, 0.347] | dense |
| transfusion | 2 | 748 | 4 | [0.762, 0.238] | dense |
| wdbc | 2 | 569 | 30 | [0.627, 0.373] | dense |
| wine.1 | 2 | 178 | 13 | [0.669, 0.331] | dense |
| wine.2 | 2 | 178 | 13 | [0.601, 0.399] | dense |
| wine.3 | 2 | 178 | 13 | [0.730, 0.270] | dense |
| wine-q-red | 2 | 1599 | 11 | [0.465, 0.535] | dense |
| wine-q-white | 2 | 4898 | 11 | [0.335, 0.665] | dense |
| yeast | 2 | 1484 | 8 | [0.711, 0.289] | dense |
All datasets will be downloaded automatically the first time they are requested, and stored in the quapy_data folder for faster further reuse. However, some datasets require special actions that at the moment are not fully automated.
- Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (Cardiotocography Data Set) load an Excel file, which requires the user to install the xlrd Python module in order to open it.
- The dataset with id "pageblocks.5" (Page Blocks Classification (5)) needs to open a "unix compressed file" (extension .Z), which is not directly doable with standard Pythons packages like gzip or zip. This file would need to be uncompressed using OS-dependent software manually. Information on how to do it will be printed the first time the dataset is invoked.
QuaPy provides data loaders for simple formats dealing with text, following the format:
class-id \t first document's pre-processed text \n
class-id \t second document's pre-processed text \n
...
and sparse representations of the form:
{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...
The code in charge in loading a LabelledCollection is:
@classmethod
def load(cls, path:str, loader_func:callable):
return LabelledCollection(*loader_func(path))indicating that any loader_func (e.g., a user-defined one) which returns valid arguments for initializing a LabelledCollection object will allow to load any collection. In particular, the LabelledCollection receives as arguments the instances (as an iterable) and the labels (as an iterable) and, additionally, the number of classes can be specified (it would otherwise be inferred from the labels, but that requires at least one positive example for all classes to be present in the collection).
The same loader_func can be passed to a Dataset, along with two pathds, in order to create a training and test pair of LabelledCollections. E.g.:
import quapy as qp
train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'
def my_custom_loader(path):
with open(path, 'rb') as fin:
...
return instances, labels
data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)QuaPy implements a number of preprocessing functions in the package qp.data.preprocessing, including:
- tfidf vectorization
- reducing the number of columns based on term frequency
- standardization
- indexing (i.e., transforming textual tokens into lists of numeric ids)