Stability dataset clustering data loss #10

@franzigeiger

Description

Hi FLIP authors,
I have been working with the data split routine you applied to the meltome atlas data and found some irregularities. You create the train and test splits based on clusters from MMseqs2, but the notebook routine seems off (in collect_flip/2_meltome_atlas.ipynb).
When creating the mixed dataset based on the clusters, you remove a cluster center's key from the set as soon as you encounter it in the full protein list, which I think makes the output datasets incorrect:

Cell 30, last 20 LOC

            if key in train:  # <-- current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'train'
                })
                train.remove(key)  # <-- HERE
            elif key in test:  # <-- current datapoint is a cluster center
                clustered_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })

                mixed_set.append({
                    'sequence': protein.get('sequence'),
                    'target': protein.get('meltingPoint'),
                    'set': 'test'
                })
                test.remove(key)  # <-- HERE

While removing the keys is fine for the test set (only the cluster-center datapoints are used there anyway), for the training set it holds out every sequence of that cluster that the loop processes after the cluster center.
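To illustrate the drop with a toy example (hypothetical data and keys, not the actual meltome records): once the center's key is removed from `train`, every later member of the same cluster fails the `key in train` check and is silently excluded.

```python
# Toy reproduction of the bug (hypothetical data, not the real notebook).
# 'train' holds cluster keys; all members of a cluster share the same key.
train = {"clusterA"}
proteins = [
    {"key": "clusterA", "sequence": "MKV"},    # cluster center
    {"key": "clusterA", "sequence": "MKVL"},   # member, processed later
    {"key": "clusterA", "sequence": "MKVLI"},  # member, processed later
]

clustered_set = []
for protein in proteins:
    key = protein["key"]
    if key in train:
        clustered_set.append({"sequence": protein["sequence"], "set": "train"})
        train.remove(key)  # <-- the bug: later members of clusterA are skipped

print(len(clustered_set))  # only 1 of the 3 cluster members survives
```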
After fixing this I get a training set of 67361 datapoints plus 3134 test datapoints (compared to the 24817 training datapoints reported in the paper).

Am I misunderstanding something here? 67361 is also 80% of the full clustered dataset (84030 entries), so this would make more sense given the setup. The mixed set should in the end contain 80% of all data in train plus only cluster centers for test, which is obviously far less than 20% of all data.
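A minimal sketch of what I mean by the fix, under the same assumptions as the toy above (shared cluster keys, the cluster center appearing first for its key; `assign_splits` is a hypothetical helper, not the FLIP code): leave the train key set untouched so whole clusters stay in train, and keep the removal only on the test side so test gets one datapoint per cluster.

```python
def assign_splits(proteins, train, test):
    """Assign proteins to splits: train keeps all cluster members,
    test keeps only the first (center) datapoint per cluster key.
    Hypothetical helper mirroring the notebook loop, not the FLIP code."""
    clustered = []
    test_remaining = set(test)  # copy so only one datapoint per test cluster is emitted
    for protein in proteins:
        key = protein["key"]
        if key in train:  # no removal: all members of a train cluster are kept
            clustered.append({"sequence": protein["sequence"], "set": "train"})
        elif key in test_remaining:
            clustered.append({"sequence": protein["sequence"], "set": "test"})
            test_remaining.remove(key)  # test side: cluster center only
    return clustered

rows = assign_splits(
    [{"key": "A", "sequence": "M1"}, {"key": "A", "sequence": "M2"},
     {"key": "B", "sequence": "N1"}, {"key": "B", "sequence": "N2"}],
    train={"A"}, test={"B"},
)
print([r["set"] for r in rows])  # ['train', 'train', 'test']
```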

I haven't checked whether the same error occurs in the other datasets, but I would recommend doing so.
