As everyone probably recalls, the TG allows for a couple of easy solutions called "shorthand". Back at the start of the project we used to think that some or all of these would be automatically replaced with the proper implementations in time. Since then, Arlo and I have discussed shorthand a couple of times and were inclined to think that most (all?) shorthand solutions would have to be replaced by the encoder who used them. Now that I'm revising the TG along with the EGD, I think we need to make some final decisions on these, and I need quite a bit of collaborative effort for that, in particular @michaelnmmeyer to detect what shorthand has been used in our corpora and devise solutions for bulk replacement, and @manufrancis and @arlogriffiths to tell how strongly they insist on keeping the easy way around for phenomena relevant to their specific fields.
I've thought long about how our shorthand solutions fit into an ordered world, and here's what I think now.
There is a basic dichotomy between transliteration shorthand and markup shorthand.
Transliteration shorthand means that we permit easy-to-access characters in place of difficult-to-access ones. Markup shorthand means that we permit text-based markup (special characters) in a function that should ideally be done in XML markup, for saying something about the text (i.e. interpretation or normalisation, rather than saying what the text is, which, strictly speaking, should be transliteration's only job).
And there is a different kind of distinction that is more vague, but may be classified along these lines. Essential shorthand is stuff that we want to keep for good and have no intention to convert to a "proper" solution. Optional shorthand may be useful on the one hand for our individual and team working process (easier entry, easier legibility, possibility of adding basic markup to a non-XML file), and it may also be useful in case someone wants to use it in print, or in another environment where XML markup is not applicable. But all optional shorthand should be replaced, one way or another, with the proper solution in our XML files. Within optional shorthand, we need to distinguish private shorthand and public shorthand, the difference being that the latter will be automatically replaced in our files, while for the former, it's always the encoder who uses it that will have to make sure it is replaced with the proper solution. ***but see addendum below
In these terms, then, most shorthand solutions were presented in the old TG as public shorthand, but it is not feasible to auto-replace all of these, and I also think it is not good practice to permit alternatives, so that a given phenomenon is represented by shorthand in some files and by XML encoding in others. So all in all, we should eliminate public shorthand, or at least reduce it drastically, declaring various shorthand solutions to be either essential or private. Once we have declared a particular shorthand to be private, any existing instances should be replaced by the people in charge of the files where they occur, probably with some help from @michaelnmmeyer who can do corpus-wide searches to spot instances, and may be able to suggest good XPath formulations for batch search and replace. (In EGD-type editions, these would probably occur mostly within the edition div [aside from any children of that div which have an explicit `@xml:lang`], but probably also in apparatus readings and lemmas, and possibly also elsewhere in the files, but hopefully only within `<foreign>` tags.)
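To make the scoping idea in the parenthesis concrete, here is a minimal sketch of a corpus search that counts candidate shorthand characters only inside the edition div and skips any descendant carrying an explicit `@xml:lang`. It uses only the Python standard library; the function names and the assumption that the edition div is `<div type="edition">` in the TEI namespace are mine, and apparatus or `<foreign>` contexts are not covered.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def source_text_nodes(elem, in_scope=True):
    """Yield text chunks under elem, skipping subtrees that declare xml:lang."""
    if in_scope and elem.text:
        yield elem.text
    for child in elem:
        # A child with its own xml:lang (and everything inside it) is out of scope.
        child_scope = in_scope and XML_LANG not in child.attrib
        yield from source_text_nodes(child, child_scope)
        # The tail follows the child's end tag, so it belongs to the parent's scope.
        if in_scope and child.tail:
            yield child.tail


def count_shorthand(xml_string, chars):
    """Count each character of `chars` in source-language text of edition divs."""
    root = ET.fromstring(xml_string)
    counts = {c: 0 for c in chars}
    for div in root.iter(TEI + "div"):
        if div.get("type") != "edition":  # assumption: EGD-style edition div
            continue
        for chunk in source_text_nodes(div):
            for c in chars:
                counts[c] += chunk.count(c)
    return counts
```

Calling `count_shorthand(xml_string, "'*")` on a file would give a rough per-file tally of plain apostrophes and asterisks in source-language context, which is only a starting point for deciding who has to replace what.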
A. Transliteration shorthand examples that I know of are the following:
- plain apostrophe ' instead of the right single quote ’ for representing avagraha and Tamil elision sandhi
- underdotted ṛ, ṝ and ḷ for the vocalic liquids, instead of r̥, r̥̄ and l̥
- ĕ instead of ə for transliterating Indonesian pepet
- asterisk * instead of middle dot · for representing virāma and sisters
Of these, the last three should be quite straightforward: if they have been used at all, these characters will in all probability not occur in files of the same corpus with a different function. I think all of these should be explicitly declared private henceforth.
QUESTION 1: OK on making these private? I need a clear yes, or if no, then some explanation and an alternative approach suggestion.
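For whoever ends up doing the replacement, the three straightforward cases could be handled along these lines. This is only a sketch: the mapping table and the function name are my own, it handles lowercase only, and it replaces blindly, so it assumes the text passed in is source-language transliteration in a corpus where these characters have no other function (ḷ, for instance, has a different value in Tamil transliteration).

```python
import re

# Shorthand -> proper character. Both precomposed and decomposed inputs are listed.
REPLACEMENTS = {
    "\u1e5d": "r\u0325\u0304",  # ṝ -> r + ring below + macron
    "\u1e5b": "r\u0325",        # ṛ -> r + ring below
    "\u1e37": "l\u0325",        # ḷ -> l + ring below
    "r\u0323": "r\u0325",       # decomposed r + dot below (a following macron is kept)
    "l\u0323": "l\u0325",       # decomposed l + dot below
    "\u0115": "\u0259",         # ĕ -> ə
    "e\u0306": "\u0259",        # decomposed e + breve -> ə
    "*": "\u00b7",              # asterisk -> middle dot
}

# Longest keys first, so two-character sequences win over any shorter overlap.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REPLACEMENTS, key=len, reverse=True))
)


def replace_transliteration_shorthand(text):
    """Replace the private transliteration shorthand characters in `text`."""
    return _PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text)
```

The function does no Unicode normalisation of its own, so the rest of the text is passed through untouched.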
The apostrophe seems more problematic with hindsight, since it can also occur in English and French (etc.). Constraining search-and-replace with XPath (as above) may help, but I'm wondering if we should instead declare this to be essential transliteration shorthand. In that case, we would either have to live with some inconsistency in the shape of the apostrophes in our editions, or devise a display solution, along the lines of printer's quotes, that will always show the right single quote (never the opposite) when a plain apostrophe appears in a source-language context.
QUESTION 2: What shall we do about the plain apostrophe? Any input is welcome.
B. Essential markup shorthand will have to remain to some extent, since there are some features we have no intention of converting to markup even though we could and, to be completely fastidious, should. These are:
- editorial spaces
- editorial hyphens
- the left and right ceiling characters we use for physically broken akṣaras
- (the disambiguation colon sort of fits in here too, but it says something about our transliterated text [and not about the source text], so I'm treating that separately)
QUESTION 3: Do all agree that we want to keep these essential? I expect (and strongly suggest) yes, but would like explicit confirmation.
C. Optional markup shorthand is where things get muddy, since we have quite a lot of it, and of various kinds and complexities: some are subcorpus-specific, others are not; some are easy to auto-replace, others are not. Here are the examples I am aware of.
- elided final u in Tamil (and afaik also Kannada; perhaps other Dravidian languages too?) represented by an apostrophe, as in arit’ eṉṟu அரிதென்று
  - note that this issue is not the same as that about the shape of the apostrophe in A
  - we probably want to declare this essential, which means that any apostrophe in Tamil (etc.) text is understood to be editorial, representing this kind of sandhi analysis
  - the alternative is to treat it like the editorial avagraha, i.e. use `<supplied reason="subaudible">’</supplied>` as its markup alternative
    - in that case we need to decide whether it is public or private.
  - QUESTION 4: Make this essential?
- editorial avagraha represented in XML by `<supplied reason="subaudible">’</supplied>` and in shorthand simply by ’ (with the rationale that this can be auto-replaced to the proper markup), and original avagraha represented in XML by ’ without markup and in shorthand by !’
  - I would prefer to make this private, i.e. insist on XML encoding for all editorial avagrahas and no markup for original avagrahas
    - rationale: if we want the DHARMA standard to be universal, sooner or later people are going to start encoding texts which do have original avagrahas, and it's really confusing that in shorthand the editorial avagraha is unmarked and the original one comes with markup, while it's the other way around in encoding
  - a feasible alternative might be to make the blanket declaration that all avagrahas shown in our editions are by default editorial, and to wrap any original avagrahas in `<orig>` (instead of wrapping editorial avagrahas in `<supplied>`). In that case, !’ would still be private shorthand that encoders have to change to encoding, but at least the markedness of the avagrahas is the same in shorthand and encoding
    - the major drawback of this is that existing files where editorial avagrahas are already correctly marked up will need to have that markup removed, while any existing files which have original avagrahas correctly without markup will need markup added
  - QUESTION 5: What shall the fate of editorial and original avagrahas be?
- distinction of long ē and ō in Dravidian languages, now assumed to be always editorial, i.e. based on the premise that this distinction never occurs in the source texts
  - however, if we want our standard to be universal, then we or someone are going to have source texts which do employ different graphemes for short and long e and o, and then we are shot
  - possibilities I see for going ahead:
    - a, require encoding along the lines of `<choice><orig>e</orig><reg>ē</reg></choice>` in situations where the editor wants to add this distinction
      - in this case, original short e and original distinguished long ē can remain without markup
      - the editorial use of ē could be declared private shorthand, resulting in a lot of code clutter once it is replaced with encoding
    - b, leave without markup and change the blanket declaration, along the lines of "ē and ō without markup are always editorial in Tamil (etc.) texts; for any text that does make this distinction, this must be explicitly noted in the palaeographic description, and in the encoding of such texts, ē and ō without markup are understood to be original, and any editorial ē or ō must be marked up with choice-orig-reg"
    - c, invert the encoding: ē and ō in these languages are always understood to be editorial when not marked up, and any original instances must be wrapped in `<orig>`
  - QUESTION 6: What to do about ē and ō?
- short vowels in Indonesian texts, where a long vowel is expected, optionally transliterated as ă, ĭ or ŭ, especially in Sanskrit loanwords
  - the old TG says this will be auto-converted into encoding with `<orig>`, i.e. that it is public shorthand; if we want to keep this so, then we need an algorithm that will make that conversion, which should not be too problematic
    - I would still rather avoid public shorthand as much as possible, and prefer to make this private
    - alternatively, we could make it essential
  - QUESTION 7: What to do about these short vowels?
- the = sign used in unusual akṣara formations, especially Tamil ligatures and Indonesian superscript repha behaviour, but also some more exotic uses
  - the basic idea is that an = sign is put between two target graphemes which are in the source part of the same akṣara, even though by the standard rules of the applicable writing system they would be in separate akṣaras
  - the encoding equivalent would therefore involve `<seg type="aksara">`, but batch automated replacement is out of the question, because only a human can tell exactly what goes inside that akṣara
    - in other words, this cannot be a public optional shorthand, so we have to declare it either
      - essential, in which case we can go on as before, but we have some inconsistency because the encoding with `<seg type="aksara">` is also available
      - private, in which case I'll add explicit markup instructions to the new EGD for all the situations where = is now prescribed or suggested in the TG, and any encoder who has used = in their files will have to make the change to the encoding
  - QUESTION 8: What shall be the fate of the = sign?
- the + sign used for numeric signs other than decimal digits (i.e. Brāhmī ciphers for tens, hundreds, etc.; Khmer numbers in vertical strokes; and fraction signs)
  - we currently have the option of either using a + sign (e.g. 10+ for the Brāhmī sign 10) or `<g type="numeral">` encoding, and the old TG promises that this will be auto-converted to markup
    - I would strongly prefer to declare this private shorthand, since auto-conversion would be pretty problematic
  - QUESTION 9: What shall be the fate of the + sign?
- the _ underscore for original space
  - recommended in the TG with the implication that it is private shorthand
  - I would like to make it explicitly private
  - QUESTION 10: Agree to make the _ private shorthand?
- the use of §abc and $abc for space fillers and miscellaneous symbols, to be auto-converted to encoding where “abc” (any sequence of letters, followed by a space) will be converted into a symbol token in the XML
  - automation should not be very problematic, but the whole symbol encoding system still awaits revision (and simplification)
  - my preference would be to make this, too, private shorthand, and in XML editions, permit only proper encoding, whatever shape that ultimately takes
  - QUESTION 11: What shall be the fate of these solutions?
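If most of the items in C end up private, encoders will first need to find their own instances. Below is a rough sketch of how the markup shorthands listed above could be flagged in a piece of transliterated text, plus a conversion for the avagraha case under my preferred option (editorial ’ gets `<supplied>`, shorthand !’ becomes a bare ’). The pattern names, the regular expressions and both function names are hypothetical, nothing here is decided, and the functions are meant to run on text content rather than raw XML. The avagraha conversion must not be applied to Tamil (etc.) text, where ’ marks the elided u.

```python
import re

# Hypothetical detectors for the optional markup shorthands discussed in C.
PATTERNS = {
    "equals": re.compile(r"(?<=\w)=(?=\w)"),    # = between two graphemes of one akṣara
    "plus": re.compile(r"\d+\+"),               # e.g. 10+ for a non-decimal numeric sign
    "underscore": re.compile(r"_"),             # original space
    "symbol": re.compile(r"[§$][A-Za-z]+"),     # §abc / $abc symbol tokens
    "orig_avagraha": re.compile(r"!’"),         # shorthand for an original avagraha
}


def flag_shorthand(text):
    """Return (name, matched string, offset) for each hit, ordered by offset."""
    hits = [
        (name, m.group(0), m.start())
        for name, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda hit: hit[2])


def expand_avagraha(text):
    """Convert avagraha shorthand: !’ -> ’ (original), bare ’ -> <supplied> (editorial)."""
    def repl(match):
        if match.group(1):  # preceded by "!": original avagraha, no markup
            return "’"
        return '<supplied reason="subaudible">’</supplied>'
    return re.sub(r"(!?)’", repl, text)
```

The flagging part deliberately does no replacement: for = and + in particular, only a human can decide what the encoding should contain.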
Whew... I think that's all. Thanks for reading; please do comment.
P.S. Added 16 June 2025
I now think optional shorthand should be classified as follows:
- private shorthand: defined as above (encoder is responsible for replacing it with markup)
- optional shorthand proper: corresponding to the above definition of public shorthand (there is an authorised XML markup alternative, and we may want to auto-convert the shorthand to the XML)
- public shorthand: there is an authorised XML markup alternative, to which the encoder must convert the shorthand, but we endorse the use of the shorthand in non-XML contexts (e.g. publications, where we endorse using arbitrary dingbats for particular symbols)
- there is also public transliteration shorthand, for use in print publications where the required Unicode characters are not available, so compromises must be made, e.g. by using underdots instead of undercircles