Skip to content

Conversation

@ahl27
Copy link
Collaborator

@ahl27 ahl27 commented Aug 18, 2025

Adapted from Bioconductor/XVector#6

Scope of this fix has been narrowed to just operating on XString objects (and their comparison(s) with character) rather than all of XVector.

The buggy behavior previously mentioned in the Biostrings TODO has been fixed to the following behavior:

> DNAString("A") != "A"
[1] FALSE
> DNAString("A") <= "A"
[1] TRUE
> DNAString("A") >= "A"
[1] TRUE
> DNAString("A") < "A"
[1] FALSE
> DNAString("A") > "A"
[1] FALSE
> DNAString("AAA") == "AAA"
[1] TRUE
> DNAString("AAA") <= "AAA"
[1] TRUE
> BString("AAA") <= "AAA"
[1] TRUE
> DNAString("A") > DNAString("A")
[1] FALSE
> DNAString("C") > DNA_ALPHABET
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE  TRUE FALSE  TRUE  TRUE  TRUE

Note that there's a bit of a decision to be made with something like RNAString("U") < "T" and RNAString("U") == "T". I chose to coerce to BString to use existing previously written methods, since it has consistency with prior work and consistency across inequality and equality comparisons. You could make the argument that RNAString("U") == "T" if we're always interpreting letters as bases, but then we get into tricky territory with the inequality operators (i.e., who is to say if DNAString("A") < DNAString("C")? Do bases have an inherent ordering?). This solution seems to be the most internally consistent and closest to the behavior I'd expect as a user. It also results in the fewest confusing error messages (handling cases like DNAString("C") < "E" is more challenging).

Unit tests have been updated. They're a bit verbose so that we could cover a lot of the possible input variable arrangements.

@ahl27
Copy link
Collaborator Author

ahl27 commented Aug 18, 2025

workflow runs are broken until new bioc release, I can't get it to pull in Seqinfo correctly.

@hpages
Copy link
Contributor

hpages commented Nov 3, 2025

Thanks @ahl27.

I was looking at this and realized that we had a long-standing inconsistency with our comparison methods:

BStringSet("AAT") == DNAStringSet("AAT")
# [1] TRUE

BString("AAT") == DNAString("AAT")
# Error in BString("AAT") == DNAString("AAT") : 
#   comparison between a "BString" instance and a "DNAString" instance is not supported

This is because the logic that decides whether things are comparable or not is implemented twice, with 2 different semantics. One instance is in Biostrings:::.coerce_to:

if (seqtype1 != seqtype2) {
if ((seqtype1 != "B" && seqtype2 == "AA")
|| (seqtype2 != "B" && seqtype1 == "AA"))
stop("comparison between a \"", class(x), "\" instance ",
"and a \"", class(y), "\" instance\n",
" is not supported")
if (seqtype1 == "B" && seqtype2 != "AA")
seqtype1 <- seqtype2
if (seqtype2 == "B" && seqtype1 != "AA")
seqtype2 <- seqtype1
}

and the other one in Biostrings:::comparable_seqtypes:

Biostrings/R/seqtype.R

Lines 132 to 137 in 778d5f9

comparable_seqtypes <- function(seqtype1, seqtype2)
{
is_nucleo1 <- seqtype1 %in% c("DNA", "RNA")
is_nucleo2 <- seqtype2 %in% c("DNA", "RNA")
is_nucleo1 == is_nucleo2
}

Another good example of why the DRY principle is such an important one, and also a good reminder that failing to adhere to it always leads to bad things. I'm trying to be a lot more careful with this these days but it seems like I was not paying attention enough 12-13 years ago when all this was implemented 🤕

Anyways, I'll try to merge this PR (I expect some conflicts, nope, no conflicts! 😃 ) and will make additional tweaks to try and reconciliate comparability between XStringSet derivatives with comparability between XString derivatives.

Thanks for the PR.

H.

@hpages hpages merged commit ab26c60 into Bioconductor:devel Nov 3, 2025
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants