INTERNATIONAL DATA BASES FOR MEDIAEVAL MANUSCRIPT STUDIES
Nijmegen, 15-17 September 1987
Section 2: Definition and Selection of Textological Data

Tito ORLANDI, Definition of Textological Data for Coptic Texts.

When I received the kind invitation to this congress, for a while I was in doubt as to where I should place my paper. In fact, the subjects of all the sections are more or less interrelated, and the scholars interested in mediaeval manuscripts and in their automatic treatment work more or less in all of them. In the end I decided for this section, because it seems to me that the "definition of textological data" corresponds to what the research group in Rome with which I collaborate, namely "Informatica e discipline umanistiche", calls "codifica" (encoding), and we believe that it represents the root of the relation between our disciplines and Information Technology ("informatica"). On it seems to depend every possibility of obtaining the many different results sought in research on manuscripts and on the texts which they contain.

On the other hand, the process of encoding any material whatever, be it textological, codicological, or any other group of data, is the simplest thing to define and organize from the technical point of view. It consists exclusively in the accurate application of the well-known principle of the CORRISPONDENZA BIUNIVOCA (one-to-one correspondence), viz. that each phenomenon in the set of phenomena subject to the encoding (in other words, the part of the "world" which the encoder takes into consideration) must have one and only one symbol to express it, and vice versa, that symbol must have no other meaning. (A sketch at the end of this introduction shows how mechanically such a property may be verified.)

It is true that more often than not scholars in the human sciences tend to forget that principle, or not to apply it consistently, and the philologists have a long record in this respect (both before the spread of computers and after), possibly owing to the fact that language and writing are two notable examples of very imperfect coding systems. It is also true that some minor problems would remain, e.g. the use of a symbol to modify the meaning of other symbols (which, in my opinion, should be avoided wherever possible). But, on the whole, provided the principle is recognized and good will is devoted to applying it correctly, no important technical problems remain to be discussed.

Somebody might draw attention, in this regard, to the relation and interference between different keyboards, video displays, and printers, and to the various systems incessantly proposed to obtain comfortable ways of inputting texts and reading them. We all know such problems well, and their solution is probably a matter for technological progress and the skill of engineers, not for particularly brilliant ideas from scholars in the humanities. Therefore I would not take them too seriously, though I realize that we must always try to improve the machines with which we work.

So we are left entirely on the other side of the question. The scholar has to identify very carefully the phenomena to encode (what is properly called, in the Call for Papers, the discretization of continua) within the material which forms the subject of his study. A first consequence is that all discussion of standards for the different languages and purposes should centre on this matter, not on the choice of the symbols (which, in any case, may easily be translated by means of elementary programs converting one code into another).
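The checking of the principle stated above, unlike its application, is mechanical; the following minimal sketch (in Python; the code table is an invented fragment, not that of any existing system) shows how a proposed table may be verified:

    # Sketch: verify that a proposed code table is biunivocal, i.e. that
    # every phenomenon has exactly one symbol and no symbol is reused.
    # The phenomena and the symbols are an invented fragment.
    code_table = {
        "Coptic letter alpha": "A",
        "Coptic letter alpha with superlinear stroke": "a",
        "change of line": "|",
        "illegible letter": "%",
    }

    def is_biunivocal(table):
        """True if the phenomenon-to-symbol mapping is one-to-one."""
        symbols = list(table.values())
        return len(symbols) == len(set(symbols))

    assert is_biunivocal(code_table)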
More important is the fact that scholars do not seem to realize how difficult it is to carry out properly the task of identifying the phenomena to encode. In a sense, all scholars have devoted themselves to this task since the beginning of their sciences; but this is all the more deceptive, because the stringent consistency of the machine has shown how different the treatment of data to be communicated by means of natural language in a monograph is from the treatment devised so that information (in a broad sense) may be retrieved by means of a computer.

But there is more, especially if we turn to the particular branch of codicology and textology. The first idea to be firmly kept in mind is that the manuscript, on the one side, and the text, on the other, are two entirely different things, having a sort of dialectical relationship with one another. With this relationship we shall deal later; for now we start from the manuscript, considering it as a material artifact, and noting that the phenomena in it which might interest scholars, either at the time when they are working or later, are infinite. Every particularity in its construction, every sign or mark put in by the scribes (and correctors), and their relative positions, may eventually prove important.

From this point of view, the only satisfactory way to store all this in a magnetic memory is analogical, not digital, reproduction, that is, a continuous and not a discrete one (photo, videotape, videodisk, etc.). Or, to be more precise, we do not see today a way to store such information so that its discrete elements (for discrete they are, after all, even in a photo or on a videodisk) may be treated as logical elements for information retrieval.

The "real digital" storage is obtained, just as we said, through a first step taken by the scholar, never by the machine, consisting in the identification of the elements to encode. It is the objective logical and factual relations between these single elements which will permit the use of a logical information retrieval language, whichever it may be. (In this sense, a concordance program or a lexical analyzer also qualifies.)

The pitfall in this operation (as I have happened to notice in many cases) is that scholars tend to confuse encoding with transcoding. I call encoding just the operation alluded to above; transcoding is more simply an encoding performed on already encoded material. In this case, we have simply to substitute each sign of an alphabet (in a broad sense) with that of another one, employed because it may be "written" on a different support (e.g. the Morse alphabet). If we keep this in mind, it is easy to understand that the "text" has simply to be transcoded, because it is already encoded in the written alphabet or script (though here too problems arise, owing to the imperfection of alphabets as such). On the contrary, the visual organization of the text and the material organization of the codex are the object of a true, "primitive", encoding process. Therefore the problems pertaining to the two operations should be kept carefully distinct, even if the result should be unitary, viz. the production of ONE file of encoded information, because in this paper it is assumed that the interest of the scholars is ultimately centered on the text.

We shall consider first the transcoding of the text.
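Transcoding in this strict sense is, by itself, a trivial operation; the following minimal sketch (in Python, with only a fragment of the Morse alphabet, for illustration) shows everything it involves:

    # Sketch of transcoding: each sign of one alphabet is substituted
    # by the corresponding sign of another, and nothing more happens.
    # Only a fragment of the Morse alphabet is given here.
    MORSE = {"s": "...", "o": "---", "e": "."}

    def transcode(text, table):
        """Substitute sign for sign. A sign outside the table is an
        error, since a transcoding must cover the whole source alphabet
        and nothing else."""
        return " ".join(table[sign] for sign in text)

    print(transcode("sos", MORSE))  # ... --- ...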
The problems here arise from the fact that we find the text already encoded in the manuscript, but by means of a peculiar alphabet, in the sense that its signs (the "letters" and other marks) are to be recognized only in part from their material form or substance, and in part from the relation between that form and (a) the general meaning of the text; (b) the position on the page (cp. the page numbers, the titles, the glosses, etc.).

It is well known to every palaeographer that the single letter, with its different forms, due to the skill of the scribe but also to the various conditions in which he works, is recognized only in part because of its form, and in part from the fact that, given the context, that letter "must" be that letter. (Attention! I do not allude to the possibility of confusion between two letters; that is a different problem, which cannot be solved by the context. I allude to the often peculiar forms of one single letter.)

There is another problem. Each letter has different meanings, that is, it refers to more than one "single phenomenon", and those meanings depend sometimes on its form (capitals, etc.), sometimes on its position (numeration, etc.). The scholar should decide whether (a) to propose a true and simple transcodification, by which the new signs acquire the same duplicity as the old ones; or (b) to propose a kind of improved transcodification, in which the same sign is transcoded in different ways according to its different meanings, thereby correcting (as far as possible) the imperfect encoding of the ancient, traditional scribes.

In any case, we want to stress that in the operation of transcoding the written text, too, elements of subjective interpretation by the scholar are present, and he (like any text editor) must take responsibility for his choices, and of course declare them as clearly as possible.

Much more subjective is the operation of encoding the "graphic organization" of the manuscript. As we have said, the primary interest here also lies in the text; therefore what the scholar should do first is to discern the relation between the graphic organization and the meaning of (parts of) the text, and choose those phenomena to be encoded which are "meaningful" in a very broad sense. One way to do this is, e.g., to extrapolate a "regular" graphic organization for that particular manuscript (columns, lines, margins, etc.), within which the "normal" parts of the text are obviously placed. (By "normal" parts we simply mean those letters which are kept within the boundaries of the "regular" graphic organization.) The encoder will signal with appropriate codes wherever some group of letters is outside its "regular" place, or some part of the "regular" place is not filled with letters.

As the graphic organization which one assumes at the beginning is somewhat imaginary, that is, an imaginary regularization of the actual graphic organization of each page of the manuscript, the alterations which it suffers from the actual position of the script in the manuscript are infinite. For instance, it is often impossible to establish from its physical position alone whether a group of letters is interlinear, because a scribe may write some letters a little higher merely by chance.
This is why the operation of encoding presupposes the subjective interpretation of the editor, who will act on the basis of his apprehension of the relation between the graphic organization and the meaning of the text (including the groups of letters and marks accompanying the text proper).

Finally, we would add some remarks about the correctness of the encoding, that is, about how we can judge whether a manuscript is encoded correctly or not. From this point of view one should consider that the aim of encoding a manuscript is not simply that of later obtaining a faithful reproduction on a different support (e.g. a screen) or by means of electronic processing (e.g. output from a computer printer). The aim is also that of making possible various kinds of textual or codicological analysis. Therefore the correctness of the encoding (that is, of the choice of the phenomena to be encoded) depends on the final product which is to be obtained, where this product should not be seen only as a book or a traditional edition of the text. Thus we may judge the correctness by the possibility that the encoding gives of reaching the aims which we want, or which others may want in the future. Nothing else is required, because the choice of the signs within the code is in itself irrelevant, on account of the possibility of reshaping it automatically.

We come now to the practical application of this theory in the field of Coptic manuscripts and texts, beginning with some historical information from outside the work of our Corpus dei Manoscritti Copti Letterari. We mention first the enterprise of the Nag Hammadi project, carried out at the Institute for Antiquity and Christianity of the Claremont (California) Graduate School with the well-known Ibycus system. The Ibycus was first conceived by David Packard for Latin and Greek texts. Its encoding system permits some degree of textual analysis and, above all, output on a photocomposer, which produces a very good printed text. Its main drawback is that it is too "printing-oriented": it provides, e.g., a code to put missing or uncertain letters within brackets, rather than signalling what really IS in the manuscript. Furthermore, it provides a code to signal that a letter has a superlinear stroke or some other such peculiarity, instead of a code individuating the "letter with superlinear stroke" (if this is considered an individual phenomenon), or else a code indicating the superlinear stroke and its position. (This distinction is illustrated schematically below.) It also tends not to distinguish textual from codicological phenomena (cp. above). What the Claremont enterprise has produced up to now are beautiful printed editions (some Nag Hammadi texts and also a dictionary), but no other results. On the same line we may place the Princeton enterprise for the Old Testament, which uses the same Ibycus system but has not yet yielded practical results.

On a different level we should mention the tools which simply provide the possibility of printing Coptic fonts. Some of them are meant to accompany a word processor installed on a Personal Computer, instructing the screen and the printer (Academic Font; Toolbox for Languages; Lettrix; etc.); others provide the fonts for photo-typesetting and are professionally oriented towards typography. All this is outside our scope here.
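To make the distinction mentioned above concrete, here is a schematic sketch (in Python; the codes are invented for the purpose and belong neither to Ibycus nor to our Corpus): from a phenomenon-oriented encoding, a printing-oriented form can always be derived automatically, while the reverse derivation loses information about what is in the manuscript.

    # Hypothetical phenomenon-oriented codes: a trailing "=" marks the
    # single phenomenon "letter with superlinear stroke"; a trailing "?"
    # marks "letter read as uncertain".
    def to_print_form(encoded):
        """Derive a printing-oriented form (uncertain letters within
        brackets, strokes set over the letters) from the encoding."""
        out = []
        for token in encoded.split():
            if token.endswith("?"):
                out.append("[" + token[:-1] + "]")
            elif token.endswith("="):
                out.append(token[:-1] + "\u0305")  # combining overline
            else:
                out.append(token)
        return " ".join(out)

    print(to_print_form("n= t? o"))  # prints: n with stroke, [t], o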
For our system in the Corpus dei Manoscritti Copti Letterari we have worked on the basis of the following principles:
- to encourage the collaboration of scholars;
- to have files fully portable across all kinds of machines, great and small, supporting all kinds of printers;
- to use a delimited range of program types, which may be individually different: editors, word processors, text formatters, databases, concordance programs.

The manuscripts are encoded in one file, which contains all the indications that may be useful in the future; these are then selected, at different phases, in order to be submitted to different procedures:

Phase I: diplomatic reproduction.
Phase II: edition of the text, normalized in orthography and divided into paragraphs according to modern taste.
Phase III: semantic analysis (concordance, translation, etc.).

In the first phase the manuscript is encoded in the most faithful way. This phase is the fundamental one, and the editor must limit himself to the operation of encoding, without any intervention of an interpretative, editorial, or explanatory kind; this, of course, only as far as possible, because every such operation involves in part a subjective intervention. The results should ideally substitute for the manuscript and make recourse to it unnecessary, except for verifying errors of reading. In this phase the eventual kind of printing, communication, or analysis should remain out of consideration: the editor has only to choose what is to be encoded and the way to encode it. Faithfulness to the manuscript is only partially related to future reproduction, because in this phase we do not yet take into consideration the problems of printing.

We have made our proposals as regards the elements to encode. We have acted according to the practical possibilities offered by the keyboards on the market, in order to obtain an easy input operation and to produce as a result a standard ASCII file, excluding the use of function codes and of double codes, and relying on the standard USASCII keyboard. The number of elements that we can encode is therefore 95, including the space. With this number it is possible to obtain a good encoding of a normal manuscript. The elements that we have singled out are as follows (a sketch of such a code table is given at the end of this phase's description):
- the letters of the Coptic alphabet, each with or without superlinear stroke (33 + 33 = 66 signs);
- the letters "iota" and "ypsilon" with the diaeresis;
- 5 punctuation signs;
- change of page, line, column;
- capital letter in the margin;
- 5 signs for special superlinear strokes;
- illegible letter; letter in a physical lacuna;
- remarks by the editor;
- original page numeration.

PROBLEMS. The punctuation cannot be encoded according to the physical appearance of the signs, because the scribes tend to be inconsistent. The editor should declare what seems to be the system of the manuscript, and then interpret the signs according to the intention of the scribe.

We have eliminated the category of letters "not quite readable, but presumable", because such interventions by the editor are better placed in the second phase (cp. below). Likewise, the editor should refrain from filling any lacuna in this phase.

In principle, the editor should not even encode spaces between words, unless they are in the manuscript. But it is not harmful to add such spaces, and it is advisable to do so, in order to save time in the second phase, when the division between words (or rather between grammatical entities) must be made.
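By way of illustration only, the available repertoire and a fragment of such a code table might look as follows (in Python; the pairings are invented, and the actual assignments of the Corpus are not reproduced here):

    import string

    # The repertoire available to a phase-I file: the 95 printable
    # USASCII characters, that is, 94 visible signs plus the space.
    printable = sorted(set(string.printable) - set("\t\n\r\x0b\x0c"))
    assert len(printable) == 95

    # An invented fragment of a phase-I code table:
    phase1_codes = {
        "A": "Coptic alpha",
        "a": "Coptic alpha with superlinear stroke",
        "|": "change of line",
        "#": "change of column",
        "@": "change of page",
        "%": "illegible letter",
    }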
PHASE II. In this phase the editor leaves aside the point of view dedicated to the manuscript and assumes that of the text itself as an entity. Therefore another file is formed, derived from the fundamental one, in which the codes for the elements proper to the manuscript (line division, interspersed blanks, column and page division, abbreviations, etc.) are eliminated, normally by automatic processing.

We are also in favour of eliminating the code for the superlinear stroke (except for special texts), because its use had to do with ancient reading habits rather than with the meaning of the text, and it can be substituted by the separation of words.

The editor will now fill in (wherever possible) the lacunae; will change or insert punctuation, in order to normalize the text according to logical paragraphs and sentences; and will also normalize the orthography, though this last point is still very debatable. From this file the editor will obtain, through automatic processing, a formatted edition and the concordance.

PHASE III. This phase cannot be defined as precisely as the previous ones. It is carried out when more than one manuscript of a text exists, each of which has previously been treated as stated above. The aim is to produce the critical edition, through the comparison of the different readings of the manuscripts, and to carry out the lexical and semantic analysis, for which appropriate programs may be prepared. Also some kind of automatic or semi-automatic translation might be envisaged.
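As a minimal sketch of the kind of comparison of readings envisaged for this phase (in Python; the witness sigla and readings are invented, and the alignment uses a generic longest-match routine, not a program of the Corpus):

    import difflib

    # Phase-II texts of two hypothetical witnesses, divided into words.
    witnesses = {
        "A": "NAI DE NE NSHAJE".split(),
        "B": "NAI DE NSHAJE".split(),
    }

    def collate(base, other):
        """Report the points at which two witnesses diverge."""
        matcher = difflib.SequenceMatcher(a=base, b=other)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                yield tag, base[i1:i2], other[j1:j2]

    for divergence in collate(witnesses["A"], witnesses["B"]):
        print(divergence)  # e.g. ('delete', ['NE'], [])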