Dictionaries
for Text Processing and Language Teaching:
Use and Compilation
A.F. Gelbukh, I.A. Bolshakov, Sofia N. Galicia-Haro,
M.A. Alexandrov, P.P. Makagonov, P.J. Cassidy
Contents
Preface
A Bilingual Thesaurus of Related Concepts
English
Russian
Spanish
A Word Combination Dictionary
What is CrossLexica™
Syntactic disambiguation with CrossLexica™
Lexical disambiguation with CrossLexica™
Referential disambiguation with CrossLexica™
Lexical Functions in Language Acquisition
Lexical function Magn for verbs, adjectives, and adverbs
More about lexical function Oper1
Lexical functions Caus and Liqu
Lexical functions for semantic derivates
Lexical functions for semantic conversives
Several other lexical functions
An Extended Subcategorization Frames Dictionary
Government patterns
ASFs for languages with relaxed word order constraints
Representation of ASFs
Compilation of Domain Oriented Dictionaries
Domain-oriented dictionary and document image
Statistic methods for text analysis
Methods of computer graphics for detail text analysis
Selection of homogeneous set of texts for constructing dictionaries
Further development of dictionary-constructor
A.F. Gelbukh
This collection presents the materials of the colloquium Dictionaries for Non-Native Language Teaching: Use and Compilation at CAAL-99, the Annual Conference of the Canadian Association for Applied Linguistics, held jointly with the Congress of the Social Sciences and Humanities, Canada, 1999. Since the material was collected from different sources not originally intended for publication, we do not present consistent views on the topics of discussion. Some fragments of the articles are borrowed from other publications of the same authors.
Dictionaries play a key role in non-native language teaching, as well as in automatic text processing and other linguistic applications. To reach a real command of a language, the student must know not only the translations of individual words but also which words combine with each other. At the intermediate level of language knowledge, the most common problems are connected with word combinations. For example, native English speakers speaking French frequently make mistakes like *payer attention (for prêter attention) or *se marier Marie (for se marier avec Marie), while native French speakers say in English *to lend attention or *to marry with Mary. Another kind of frequent error is related to the use of terms in specific subject domains, e.g., scientific or technical ones. In this collection, several types of dictionaries useful for non-native language teaching are discussed, as well as the practical and theoretical issues of manual or semi-automatic compilation of such dictionaries from text corpora. English, Russian, Spanish, and French examples are used. Three papers describe working systems, two of which are based on very large language resources of a completely new type: an English-Russian parallel thesaurus and a Russian dictionary of word combinations.
P.J. Cassidy,
I.A. Bolshakov,
A.F. Gelbukh
For a good command of a language, it is important to organize the knowledge in the student’s memory in associative fields, i.e., by semantic relationship. Dictionaries organized by semantic affinity are called ideographic dictionaries, of which the most famous is the Roget thesaurus. However, we are not aware of any existing bilingual ideographic dictionary. The paper describes a large English-Russian parallel thesaurus, FACTOTUM ThesauRuss, based on the English FACTOTUM SemNet semantic network originally derived from the Roget thesaurus. It contains approx. 80,000 words and phrases and 15,000 semantic links of about 100 types. Such a dictionary proves very useful in both learning and translation practice, since it greatly facilitates searching for and memorizing synonyms and related words. It lists the concepts in a deep hierarchical form as groups of quasi-synonymous words, arranged side by side for the two languages. It also provides semantic relationships between the concepts. The dictionary is intended to be symmetric, i.e., the grouping of the words and the relationships between the concepts should look equally logical in both languages. The task is not easy because of the fuzzy nature of semantic fields and the differences in their boundaries in different languages. As an example, the concept of youngster is discussed.
Here we present, as an example, the youngster concept in three languages: English, Russian, and Spanish:
Child: prototypical non-adult.
Kid (colloquial): a child.
Lad: a boy. Sexual attribution: male.
Youth (masc.), stripling, lass, lassy, maid, maiden, damsel (all fem.): practically adult, but the speaker’s evaluation on the basis of his/her appearance (first of all, face) is that s/he is not able to take care of family. Sexual attribution.
Preteen, preadolescent: a child who is not yet teen or adolescent but close to it.
Adolescent, teen, teenager (the inner form is related to the numerals ending in -teen, from 13 to 19): a child whose size is evaluated as more or less standard for an adult.
Infant, baby: a child whose size is small and who cannot take care of him/herself.
Neonate: a baby who was born recently.
Boy (masc.), girl (fem.): a child; sexual attribution; may be baby.
Toddler: a child who is starting to walk but does not yet do it very well.
Tyke, tot: a child, with emphasis on small size.
Moppet: a child, with emphasis on small size. Expression of certain fondness.
Scout, schoolboy (masc.), schoolgirl (fem.), kindergartner, preschooler, cadet (masc.), coed (fem.): a child belonging to the respective organizations.
Minor: a child under the juridical fixed norm.
Wench: a country lass (girl).
Orphan: a child who has no parents.
Foundling: a child that lives in a family and is a part of it but was found.
Waif, gamin: a child that lives without a family and has no home.
Firstborn: a child in a family that was born first.
Pickaninny: a black child.
Papoose: a North American Indian child.
Crybaby: a child that tends to cry frequently and for insignificant reasons.
Chit, minx: a girl whose behavior is pert. Sexual attribution: female.
Urchin: a boy whose behavior is mischievous. Sexual attribution: male.
Whippersnapper (pejorative): a child who is unimportant but whose behavior is offensively presumptuous.
Tomboy, hoyden: a girl whose behavior is typical for a boy. Sexual attribution: female.
Romp: a child who tends to play lively or boisterous games.
Brat: a child whose behavior is annoying and impolite.
Hobbledehoy: a youth who is awkward, ungainly.
Nymph: a beautiful girl with some features of sexuality. Sexual attribution: female.
Youngster: a child with the accent on his/her physical youth.
Rebjonok, ditja (obsolete)‘child’: prototypical non-adult.
Mladenec ‘baby’: rebjonok whose size is very small and who absolutely cannot take care of him/herself; cf. metaphoric usage “he plays chess like a baby.”
Grudnichok ‘baby’: mladenec who is fed at the breast.
Novorozhdenny ‘neonate’: mladenec who was born recently.
Otrok (obsolete)‘child’: rebjonok or podrostok (see below) who can take care of himself to some extent and has certain duties in his family (take care of others to some extent). So, he is not a small child. Sexual attribution: male.
Podrostok ‘teenager’: prototypical non-adult whose size is evaluated to be close to standard for adult and who has a specific angular constitution.
Junosha, paren’ (masc.), devushka (fem.) ‘teenager’: practically adult, but in the speaker’s evaluation based on his/her appearance (mainly, face) is unable to take care of a family. Sexual attribution.
Malchik ‘boy’, devochka ‘girl’: rebjonok with sexual attribution.
Junec (pejorative)‘callow youth’: junosha who is mentally a child (negation of his adult mental abilities). Thus, he is adult only at sight. Sexual attribution: male.
Malysh (masc.), malyshka (fem.)‘little child’: rebjonok with emphasis on small size. Sexual attribution.
Malenkij (masc.), Malenkaja (fem.) (substantive)‘little child’: rebjonok with emphasis on small size. Sexual attribution.
Maljutka ‘little child’: rebjonok with emphasis on small size. Expression of certain fondness.
Kroshka, kroha ‘little child’: rebjonok with special evaluation of it as very small. Expression of a certain fondness.
Karapuz ‘little child’: rebjonok of small size; especially stressed specific childish constitution.
Butuz ‘dumpy little child’: rebjonok with emphasis on small size and very solid constitution (but not fat).
Oktjabrjonok ‘octobrist’.
Pioner ‘pioneer’, komsomolec ‘komsomol-member’, shkolnik ‘school-boy’, student ‘student’, peteushnik ‘student of PTU’, detsadovec ‘kindergartner’, doshkolnik ‘preschooler’; sexual attribution: male.
Pionerka, komsomolka, shkolnica, studentka, peteushnica, detsadovka, doshkolnica – same as above, but sexual attribution is female.
Juridical fixed norm
Nesovershennoletnij (masc.), nesovershennoletnjaja (fem.) (substantive)‘minor’: a child under the juridical fixed norm (in Russia the norm is 16 years); sexual attribution.
Barchonok (masc.), barchuk (masc.), baryshnja (fem.)(all obsolete): a child of a Russian landlord; sexual attribution.
Kazachok (obsolete): child for service, usually like a batman[1].
Sirota ‘orphan’: a child that has no parents[2].
Pervenec ‘firstborn’: a child that was born first in the family.
Posledysh (pejorative)‘last-born’: a child that was born last in the family.
Najdjonysh ‘foundling’: a child that lives in a family as a part of it but was found rather than born in this family.
Podkidysh ‘foundling’: a child that lives in a family as a part of it but was found rather than born in this family, because someone intentionally left him in a place where it would be found (probably) by a member of this family.
Besprisornik (masc.), besprisornica (fem.)‘gamin’: a child that lives without a family and has no home. Sexual attribution.
Kitajchonok[3], tatarchonok, zhidjonok (pejorative), turchonok, arapchonok (obsolete), negritjonok ‘Chinese, Tatar, Jewish, Turkish, Arab, black child’, respectively; sexual attribution: male.
Plaksa[4], rjova ‘crybaby’: rebjonok that tends to cry frequently and for insignificant reasons.
Pisklja ‘crybaby’: rebjonok that tends to cry or complain in an especially high-pitched voice.
Zaznajka, zadavaka (both pejorative)‘whippersnapper’: rebjonok who is not important in fact but whose behavior is offensively presumptuous, who thinks that he is better than others in some aspect.
Krivljaka (pejorative)‘~ mugger’: rebjonok who tries to express his feelings using his body or face in a bad manner.
Shalun (masc.), shalunja (fem.) ‘frolic child’: rebjonok who does something to entertain him/herself that can annoy others. Sexual attribution.
Ozornik (masc.), ozornica (fem.) ‘prankster’: rebjonok or podrostok who does something to entertain him/herself that can harm others. Sexual attribution.
Balovnik (masc.), balovnica (fem.) ‘prankster’: rebjonok or podrostok who does something useless to entertain him/herself. Sexual attribution.
Prokaznik (masc.), prokaznica (fem.) ‘prankster’: rebjonok or podrostok who does something to entertain him/herself that can annoy others. Sexual attribution.
Sorvanec ‘romp’: rebjonok or podrostok who demonstrates behavior (usually, negative) typical for a boy. Sexual attribution: male.
Neposeda ‘fidget’: rebjonok who is always doing something and cannot stay still even for a moment.
Poprygunja: rebjonok who is always changing her activities without accomplishing them. Sexual attribution: female.
Postrel, postreljonok: rebjonok who manages to do everything very well and quickly or on time. Sexual attribution: male.
Pochemuchka: rebjonok who asks the questions starting with “why” too frequently.
Vunderkind ‘child prodigy’: rebjonok with brilliant (usually intellectual) capabilities.
Pererostok (pejorative)‘overgrown child’: rebjonok whose physical conditions are too developed as compared with a norm for a collective, status or conditions in which s/he is found.
Akselerat (masc.), akseleratka (fem.)‘rapidly growing adolescent’: podrostok whose physical conditions are more developed than is expected for his/her age.
We will not specifically mention sexual attribution for Spanish words that follow the regular -o/-a alternation rule, e.g., niño (masc.) versus niña (fem.).
Niño ‘child’: prototypical non-adult.
Infante, bebé, nene, inocente ‘baby’: niño who absolutely needs an adult to take care of.
Lactante, criatura, rorro ‘newly-born infant’: newly born child who is fed at the breast.
Crío, chamaquillo, chavalillo, muchachito ‘child, preteen’: niño who can take care of himself in restricted sense and is of a small size.
Adolescente, muchacho, chamaco, jovencito ‘teenager’: niño whose size is evaluated as more or less standard for an adult.
Preadolescente ‘initial teenager’: adolescente, with emphasis on that s/he shows only some initial adolescence characteristics.
Joven, chico, chavo ‘rather matured teenager’: practically adult, but the speaker’s evaluation is that s/he could not take care of family.
Niño ‘boy’, niña ‘girl’: child, with sexual attribution.
Chiquitín, pequeñito ‘baby’: bebé who starts moving on his knees, but not walking.
Size and constitution
Chiquito, pequeño, pequeñuelo, chamaquito ‘little child’: niño, with emphasis on small size.
Comino, varoncito, hombrecito (all masc.), mujercita (fem.)‘little child’: niño, with emphasis on small size, but emphasizing his abilities or characteristics similar to those of older children.
Chicuelo, escuincle ‘teenager’: chamaco or adolescente, with emphasis on small size, sometimes pejorative.
Boy-scout (masc.), niña guía (fem.)‘scout’.
Acólito ‘church helper’: a boy who helps Catholic priests at Mass.
Cadete ‘cadet’: a boy or male teenager taking military instruction.
Colegial (masc.) ‘schoolboy’, colegiala (fem.) ‘schoolgirl’, estudiante ‘student’, escolar ‘schoolboy or schoolgirl’.
Párvulo, preescolar, niño de kinder ‘kindergartner’.
Menor ‘minor’: a child under the juridical fixed norm; in Mexico the norm is 18 years.
Señorito (masc.), amita (fem.) (obsolete): a child of rich land owners. Sexual attribution.
Niño bien (phraseological unit) ‘rich teenager’: teenager with high economic position and a special behavior.
Primogénito ‘firstborn’: a firstborn child.
Benjamín ‘last-born’: a last-born child.
Expósito ‘abandoned child’: a baby that was abandoned.
Huérfano ‘orphan’: a child who has no parents.
Adoptado ‘foundling’: a child that was legally adopted as a part of family.
Recogido ‘foundling’: a child that lives in a family as a part of it but s/he was found rather than born in the family.
Boshito: Yucatan child.
Chillón, llorón ‘crybaby’: niño that tends to cry frequently.
Malcriado, mimado, consentido, mocoso ‘spoiled child’: niño who is ill mannered because of the lack of good parental education.
Marimacha: niña (girl) that has a behavior like a boy. Sexual attribution: female.
Travieso ‘mischievous child’: niño causing annoyance, harm, or trouble.
I.A. Bolshakov,
A.F. Gelbukh
Errors in free word combinations (we do not consider idioms here) are the most frequent errors that students make in a non-native language. The obvious reason is the great number of such combinations in a language: pay attention, high wind; such familiar English expressions are by no means obvious to people studying English. A dictionary listing, for each headword, all the words normally (or frequently) used with it would be a great help for students. As a case study, the paper describes CrossLexica, a dictionary of a new type that combines the features of a thesaurus and a dictionary of collocations of the general Russian lexicon. For each word the dictionary presents the semantically related words, the derivatives of the word in other parts of speech, and the words that can form collocations with it. Also presented are inflectional paradigms, government patterns (subcategorization frames) for verbs and nouns, and some usage marks. To our knowledge, it is the largest available dictionary of Russian collocations. In a hypertext-style environment, it provides 600,000 bilateral semantic links and 700,000 collocations for about 100,000 words. A unique feature of this dictionary is its ability to suggest hypothetical semantic links and collocations, which effectively adds millions of highly probable collocations to the dictionary; these are useful for people with a passive knowledge of the language. Russian-to-English and English-to-Russian translations of individual words and some collocations are provided.
The second type of dictionary useful for disambiguation tasks is the dictionary of word combinations. Such dictionaries contain information about the occurrences of pairs of words in texts, e.g., good idea (but not *green idea), deep sleep (but not *furiously sleep), idea works (but not *idea sleeps)[5].
The simplest type of such information is just binary: whether or not a given pair of words occurs in texts, or at least occurs with a sufficient frequency. In more detailed computer dictionaries, the frequency of occurrences is given. Such dictionaries are used in so-called lexical attraction parsing models.
The dictionaries of this type provide more detailed information than the government patterns dictionaries, since for a given word they list not only the prepositions, but all the words that can appear together with the given one in a text, or, more precisely, that can be syntactically connected with the given one in a phrase.
On the other hand, such a dictionary is simpler than a thesaurus or a lexical functions dictionary (we describe these in the next sections), since it does not indicate the type of the relationship between the two words. Thus such a dictionary is much easier to compile than a thesaurus. In fact, it can be extracted from a large enough corpus of texts.
As an example of a word combinations dictionary, we will consider the CrossLexica™ dictionary compiled by some of the authors for Russian language; now we are planning to develop a similar dictionary for Spanish, see Fig. 1.
Fig. 1. A fragment of the CrossLexica™ word combinations dictionary.
The dictionary consists of two parts. The main part consists of word combinations. Additionally, the dictionary contains a simple thesaurus that includes synonyms, more general words (Paris is-a city), parts (book has-a-part page), etc. The thesaurus is used to automatically generate further, hypothetical word combinations not included in the dictionary directly, e.g., to walk around Paris (from to walk around a city), page of a book, etc.
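The generation of hypothetical combinations through is-a links can be illustrated with a minimal Python sketch. The pair representation, the function name `hypothetical_combinations`, and the toy data are assumptions made for illustration only; the text does not describe the internal format of CrossLexica™.

```python
def hypothetical_combinations(combinations, is_a):
    """Derive new word pairs by replacing a general word with a more specific one.

    combinations: set of (head, dependent) pairs listed in the dictionary.
    is_a: mapping from a specific word to its more general word (hypothetical format).
    """
    derived = set()
    for head, dependent in combinations:
        for specific, general in is_a.items():
            if dependent == general:
                derived.add((head, specific))
            if head == general:
                derived.add((specific, dependent))
    # Only pairs not already listed in the dictionary count as hypothetical.
    return derived - set(combinations)


# Toy data taken from the example in the text.
combinations = {("walk around", "city")}
is_a = {"Paris": "city"}
print(hypothetical_combinations(combinations, is_a))
```

Under these assumptions, the pair ("walk around", "Paris") is derived from ("walk around", "city") through the link Paris is-a city.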
Here are some examples of Spanish word combinations: junta de gobierno, poner atención (not *pagar atención), café cargado (not *café duro), gato peludo, ver con telescopio.
The Russian dictionary now contains more than half a million word combinations and about half a million thesaurus relationships, which can generate about 20 million hypothetical word combinations. We expect similar figures for the Spanish version.
Though most of our experience with this type of dictionaries is related to Russian, we will describe the examples from English, since there is nothing language-specific in the examples. Let us consider again the phrase The man sees a cat with a telescope.
To resolve the ambiguity, it is enough to note that the combination see with a telescope is present in the word combinations dictionary, while *cat with a telescope is not. This observation gives more chances to the first variant of the analysis: ‘The man uses the telescope to see the cat’ rather than ‘The man sees a cat that has a telescope’.
If the variants differ in several links, each link in question is checked against the dictionary, and the variant with the greater number of links present in the dictionary is chosen.
In lexical attraction parsing schemes, i.e., if the dictionary provides also the frequencies of the combinations, the variants are assigned weights according to the frequencies of the combinations rather than simple yes-or-no judgments. If both combinations see with a telescope and cat with a telescope are present in the dictionary but the frequency of the first one is greater than that of the second one, then the first variant of the analysis is assigned a better weight.
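The two scoring schemes described above, counting the links present in the dictionary and summing their frequencies, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the representation of a link as a pair of strings, and the frequency values 40 and 2 are all assumptions.

```python
def score(links, frequencies):
    """Sum the dictionary frequencies of the links; absent links count as 0."""
    return sum(frequencies.get(link, 0) for link in links)


def choose_variant(variants, frequencies):
    """Return the name of the variant whose distinguishing links score highest."""
    return max(variants, key=lambda name: score(variants[name], frequencies))


# Only the links that differ between the two analyses are listed.
variants = {
    "instrument": [("see", "with telescope")],  # the man uses the telescope
    "attribute": [("cat", "with telescope")],   # the cat has the telescope
}

# Binary dictionary: a link is either present (1) or absent.
binary = {("see", "with telescope"): 1}

# Lexical attraction dictionary with made-up frequencies.
weighted = {("see", "with telescope"): 40, ("cat", "with telescope"): 2}

print(choose_variant(variants, binary))
print(choose_variant(variants, weighted))
```

With a binary dictionary the score is simply the number of links found, so the same function covers both the yes-or-no and the frequency-based case.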
We suppose that the homonyms in the word combinations dictionary are distinguished, e.g., for any occurrence of the Spanish word gato it is indicated which of its senses is meant, like gato_cat or gato_jack.
Let us consider again the example Veo un gato peludo, with the ambiguous lexeme gato: ‘cat’ or ‘jack’. The word combinations in which the word in question participates are checked against the dictionary. Since the combination gato peludo is present in the dictionary with the sense mark gato_cat, the first variant of analysis should be chosen. If the dictionary contains several combinations with the given word, the decision is made by choosing the variant that agrees with the dictionary in the greater number of combinations. In the case of a lexical attraction dictionary, their frequencies are combined.
Other Spanish examples: té cargado, el jefe cargado del trabajo. The dictionary contains the combinations té cargado_strong and cargado_responsible del trabajo.
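Sense selection against a sense-marked dictionary might look like the sketch below. It assumes that each entry maps a (word, sense) pair to the set of words it combines with; this layout, the function name `choose_sense`, and the entry for the ‘jack’ sense are hypothetical and serve only to illustrate the counting procedure.

```python
def choose_sense(word, context, sense_combinations):
    """Pick the sense of `word` that combines with the most context words.

    Returns None when no sense of the word matches any context word.
    """
    counts = {}
    for (entry_word, sense), partners in sense_combinations.items():
        if entry_word == word:
            counts[sense] = sum(1 for c in context if c in partners)
    if not counts or max(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


SENSE_COMBINATIONS = {
    ("gato", "cat"): {"peludo"},
    ("gato", "jack"): {"hidráulico"},          # hypothetical entry
    ("cargado", "strong"): {"té", "café"},
    ("cargado", "responsible"): {"trabajo"},
}

print(choose_sense("gato", ["veo", "un", "peludo"], SENSE_COMBINATIONS))
print(choose_sense("cargado", ["jefe", "trabajo"], SENSE_COMBINATIONS))
```

For a lexical attraction dictionary, the count of matching words would be replaced by a sum of frequencies.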
As in the section on government patterns, referential ambiguity can be resolved in some cases by substituting each candidate for the pronoun in question and checking the resulting combinations against the dictionary.
Let us consider an example: Mi hermano puso un café ligero en el escritorio y uno cargado. The possible candidates for uno are hermano, café, and escritorio, which gives the possible combinations hermano cargado, café cargado, and escritorio cargado. Only the combination café cargado is present in the dictionary of word combinations. In a lexical attraction dictionary the other combinations might be present as well, but the frequency of café cargado would be greater than those of the other combinations.
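The substitution test can be sketched in a few lines of Python. The function name `resolve_pronoun` and the dictionary layout (a mapping from word pairs to frequencies, with 1 standing for mere presence) are assumptions for illustration.

```python
def resolve_pronoun(candidates, modifier, frequencies):
    """Substitute each candidate for the pronoun and keep the best-attested one.

    frequencies maps (noun, modifier) pairs to counts; a binary dictionary
    can be modeled by storing 1 for every listed pair.
    Returns None when no resulting combination is in the dictionary.
    """
    scored = {c: frequencies.get((c, modifier), 0) for c in candidates}
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None


candidates = ["hermano", "café", "escritorio"]
dictionary = {("café", "cargado"): 1}
print(resolve_pronoun(candidates, "cargado", dictionary))
```

Here only café cargado is attested, so café is chosen as the antecedent of uno.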
I.A. Bolshakov,
A.F. Gelbukh
Lexical functions, introduced by I. Mel’čuk of the University of Montreal, specify the standard, and usually the only acceptable, way of expressing certain general meanings with specific words. For example, the meaning ‘of high degree’ is expressed differently with different English words: strong wind, high temperature, badly wounded; the meaning ‘to act’ is expressed as to pay attention, to drive a car, to deliver a speech. The choice of the corresponding words in another language is frequently not obvious: Fr. grièvement (*mauvaisment) blessé, prêter (*payer) attention. The paper describes the different types of lexical functions useful in language acquisition on the basis of examples taken from Spanish, French, English, and Russian. Detailed lists of Spanish words expressing the most important functions are presented. Also discussed are some formal rules of synonymic transformation of sentences based on the use of lexical functions. As a way of computer-aided compilation of a lexical functions dictionary, it is proposed to mark up the lexical function names in a word combination dictionary such as the CrossLexica dictionary described in another paper of ours within this publication. Below we present some examples of lexical functions.
In the previous section, the function Magn was already defined, but only for nouns. However, it is applicable to many verbs, adjectives, and adverbs as well. Its argument should then have some main feature that can be qualified in grades, and correspondingly, Magn expresses the idea of the great degree of this feature, i.e., the meaning of ‘very’ or ‘intensely’.
The Appendix gives a few examples of Magn for some Spanish verbs, adjectives, and adverbs. For all of them the values are adverbs.
Notice that in any natural language there is some "standard", or most usual, such adverb for adjectives and adverbs: muy in Spanish, very in English, très in French, ochen’ in Russian. On the other hand, a non-native speaker should use these values very cautiously, since many adjectives and verbs require quite different words in this meaning. That is why the correct usage of Magn, as well as of other LFs, is so important for deep language competence.
Table 1 shows some examples of the function Oper1 for Spanish nouns and their English equivalents in parallel.
Sp. argum. | Oper1 | Glued verb | Eng. equival. | Oper1 | Glued verb
atención | prestar, dedicar | – | attention | pay, focus, devote | –
ayuda | prestar, venir en | ayudar | aid | render, give, offer, provide, come to | aid
auxilio | prestar | auxiliar | help | give, offer, provide | help
clases | dar | – | lessons | teach | teach
concierto | dar | – | concert | give | –
confianza | tener | creer | confidence | have | trust
cooperación | prestar | cooperar | cooperation | give | cooperate
dolor | aguantar | – | pain | feel, have | suffer
grito | dar | gritar | cry | let out | cry out
frutas | dar, apostar | frutar | fruits | yield, bear | fruit
necesidad | tener | necesitar | necessity | be under | need
pregunta | hacer | preguntar | question | ask | ask
victoria | conseguir, alcanzar | vencer | victory | win, gain, achieve | conquer
Table 1. Function Oper1.
It is easy to see that, in both languages, the values of Oper1 do not have any autonomous meaning for any of its arguments. Instead, they are merely tools for incorporating the substantive argument into the syntactic structure. Their role in the text is similar to that of auxiliary words or, let us say, suffixes. For example, the word dar when used with the word grito has the same “meaning” as the suffix ‑ar: dar un grito = gritar. Namely, it has no meaning at all but instead performs the function of converting a noun into a verb.
The proof of the absence of any meaning in the values of Oper1 is that most textual combinations of Oper1 (S) and S can be replaced with a single verb (see columns 3 and 6 of Table 1) which is equal in its meaning to the noun[6] S: prestar ayuda = ayudar, hacer una pregunta = preguntar, etc. Such transformations are called paraphrasing. They play a significant role in the theory and practice of linguistics.
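The paraphrasing rule Oper1 (S) + S = glued verb can be sketched as a simple lookup. The two mappings below are filled with a few Spanish rows from Table 1; the names `OPER1`, `GLUED`, and `paraphrase` are hypothetical, and a real system would also have to handle inflection and articles, which this sketch ignores.

```python
# Values of Oper1 for some Spanish nouns (taken from Table 1).
OPER1 = {
    "atención": {"prestar", "dedicar"},
    "ayuda": {"prestar", "venir en"},
    "grito": {"dar"},
    "pregunta": {"hacer"},
}

# Single ("glued") verbs equal in meaning to the noun (taken from Table 1).
GLUED = {
    "ayuda": "ayudar",
    "grito": "gritar",
    "pregunta": "preguntar",
}


def paraphrase(verb, noun):
    """Return the glued verb for `verb` + `noun` if verb is a value of Oper1(noun).

    Returns None when the verb is not a value of Oper1 for this noun or when
    the table lists no glued verb for the noun.
    """
    if verb in OPER1.get(noun, set()):
        return GLUED.get(noun)
    return None


print(paraphrase("prestar", "ayuda"))
print(paraphrase("hacer", "pregunta"))
```

The noun atención illustrates the case marked with a dash in Table 1: the combination prestar atención is valid, but no glued verb is listed, so no paraphrase is returned.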
The meaning of the LF Caus is ‘to cause’, ‘to make the situation exist’. The grammatical subject of the verb V = Caus (L) should be different from L, while the complement should be L: 1st actant of V ≠ L; 2nd actant of V = L. E.g., to cause the situation of somebody paying attention, it is necessary to attract the attention of that person. Informally, Caus (L) answers the question: What does one do with L to cause it?
The meaning of the LF Liqu is ‘to eliminate’, ‘to make the situation cease to exist’. Informally, Liqu ≈ Caus not; Liqu (L) answers the question: What does one do with L to eliminate it? As for Caus, the grammatical subject of V = Liqu (L) should be different from L, while the complement should be L: 1st actant of V ≠ L; 2nd actant of V = L. For example, to eliminate the situation of paying attention, it is necessary to distract this attention.
Argument | Caus | Liqu | Eng. equiv. | Caus | Liqu
atención | atraer | distraer | attention | attract | distract, divert
descanso | conceder | privar | rest | give | deprive
derechos | discernir, conceder | privar | rights | grant | deprive, concede
dolor | causar | tranquilizar | pain | give | calm
palabra2 | conceder, conferir, dar | retirar | floor2 | give | deny
parlamento | convocar | dar receso, disolver | parliament | convene, convoke | disband, dissolve
posibilidad | dar | privar | opportunity | raise, afford | exclude, rule out
visa | dar, conceder | privar | visa | grant, issue | deny, cancel
Table 2. Functions Caus and Liqu
Table 2 illustrates both lexical functions, for Spanish and English in parallel.
The LF A0 is defined on nouns, verbs, and adverbs and gives the meaning equal to that of its argument, but expressed by an adjective: A0 (ciudad) = urbano, A0 (hermano) = fraternal, A0 (mover) = movedizo, A0 (bien) = bueno. These are examples of semantic derivation in Spanish. One can see that, in contrast to morphological word derivation, semantic derivation preserves the meaning, though the derivate may have quite a different stem.
The reader may wonder what the numerical indices of the function names such as A0 or Oper1 stand for. The indices 1, 2, …, refer to the different actants of the argument lexeme. For example, Oper1 (L) is a verb that describes the action of the first (not second, etc.) actant of L. By convention, the number 0 refers to the lexeme L itself rather than to any of its actants. As you might have guessed, there exist also such functions as Oper2, or A1, A2, etc.
The LFs A1, A2, …, express the typical adjective qualifiers for the first, second, etc., actants of L:
x | A1 (x) | A2 (x)
sorprender | sorprendente | sorprendido
simpatizar | simpatizante | simpático
x | S0 (x) | S1 (x) | S2 (x) | S3 (x) | S4 (x)
llover | lluvia | – | – | – | –
vivir | vida | – | – | – | –
nadar | natación | nadador(a) | – | – | –
poseer | posesión | poseedor(a) | propiedad | – | –
amar | amor | amante | amado(a) | – | –
expedir | expedición | expedidor(a) | mensaje | destinatario(a) | –
comprar | adquisición, compra | comprador(a) | mercancía | vendedor(a) | precio
Table 3. Noun derivates.
Similarly, S0, S1, …, express the substantive semantic derivates, i.e., nouns. The function S0 gives a noun with the same meaning as its argument, for example[7] S0 (real1) = realidad, S0 (defender) = defensa. The functions S1, S2, S3, …, give the standard names of the first, second, third, etc., actants of the argument. Table 3 presents some examples for the verbs with different number of actants.
There are several substantive LFs with the meaning of standard circumstances: Sloc denotes a standard place, Smod denotes a standard manner, etc.: Sloc (vivir) = vivienda / habitación / morada; Smod (vivir) = modo, Smod (escribir) = estilo, Smod (hablar) = pronunciación / dicción.
In the same way, verbal semantic derivates are expressed by the function V0: V0 (teléfono) = telefonear, V0 (diploma) = diplomar.
Adverbial semantic derivates are expressed by the function Adv0: Adv0 (intenso) = intensamente, Adv0 (cierto) = ciertamente / de cierto.
There exist LFs rather similar to the derivates. The function Pred of a noun or an adjective gives the standard predicative verb for its argument: Pred (maestro) = enseñar, Pred (unido) = unirse / aunarse / juntarse. The function Copul of a noun gives the standard copulative verb for this noun: Copul (maestro) = ser: José es maestro; Copul (ejemplo) = servir (de): Esta palabra sirve de ejemplo de la regla; Copul (camarado) = resultar: José resultó un buen camarado. These functions are used in the transformation rules. Informally, both functions answer the question “how to be an L,” but Copul requires the word L as a complement, while Pred gives the complete answer by itself.
As we have seen, there are functions related to different actants of the argument; the number of the actant is indicated as an index of the function name. Similarly, there are functions related to two different actants of the argument; their names are given two indices.
A conversive of a lexeme L is a lexeme that denotes the same situation from another “point of view,” i.e., with some permutation of the actants. The application of LF Convij(L) means that the actant i of L becomes the actant j of the conversive and the actant j of L becomes the actant i of the conversive, i.e., the actants are swapped. In general, permutations are possible for any number of actants; the number of possible options grows with the number of available actants, and the function name may accordingly have three or more indices, e.g., Conv123.
The simplest examples of conversives are the words with two actants: Conv12 (padres) = hijos, Conv12 (hijos) = padres. The same situation can be expressed by the phrases Juan y María son los padres de estos niños or Estos niños son los hijos de Juan y María. The situation is the same, but it is presented from the “point of view” either of Juan y María or of the children. In some cases the conversives are antonymous: Juan es más fuerte que José vs. José es menos fuerte que Juan.
x | S0 (x) | S1 (x) | S2 (x) | S3 (x) | S4 (x)
comprar | compra | comprador | mercancía | vendedor | precio
vender | venta | vendedor | mercancía | comprador | precio
Table 4. Conversives.
More complicated examples are verbs comprar and vender with four actants each: Conv31 (comprar) = vender, Conv31 (vender) = comprar, see Table 4. Note that the standard names of the two corresponding actants of vender are those permuted for comprar. Again, the same situation is presented from the “point of view” of the seller or buyer.
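The swap of actants performed by Convij can be illustrated with a minimal sketch. It assumes that a lexeme is represented simply by the list of the standard names of its actants (S1, S2, ...), as in Table 4; the function name conv and the list encoding are illustrative assumptions, not part of the original formalism.

```python
def conv(actants, i, j):
    """Return the actant names with actants i and j swapped (1-based indices)."""
    swapped = list(actants)
    swapped[i - 1], swapped[j - 1] = swapped[j - 1], swapped[i - 1]
    return swapped


# Standard names S1..S4 of the actants, copied from Table 4.
COMPRAR = ["comprador", "mercancía", "vendedor", "precio"]
VENDER = ["vendedor", "mercancía", "comprador", "precio"]

# Swapping actants 3 and 1 of comprar yields the actant names of vender.
print(conv(COMPRAR, 3, 1) == VENDER)
```

Applying the same swap twice restores the original list, which mirrors the fact that Conv31 (vender) = comprar.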
There are many other lexical functions of interest, of which we will mention here the following:
· Synonyms are expressed through LF Syn. There are various degrees, or types, of synonymy, so there are several options for this LF: an absolute synonym Syn, a narrower synonym Syn⊂, a broader synonym Syn⊃, and an intersecting synonym Syn∩. Here are some examples: Syn (cómputo) = computación, Syn⊂ (respetar) = considerar, Syn⊃ (asesinato) = masacre, Syn∩ (evitar) = salvarse / esquivar / eludir.
· The contrastive terms, or antonyms, are expressed through LF Contr: Contr (cima) = fondo, Contr (bueno) = malo, Contr (más) = menos.
· The generic terms, or hypernyms, are expressed through LF Gener: Gener (ira) = sentido, Gener (televisión) = medios masivos de comunicación.
· The standard terms for collectivity are expressed through LF Mult: Mult (nave) = flota, Mult (borrego) = rebaño, Mult (ganso) = bandada.
· The standard terms for singleness are expressed through LF Sing: Sing (lluvia) = gota, Sing (flota) = nave, Sing (banda) = bandido.
· The ability to be the i-th actant in a situation is expressed through LF Ablei: Able1 (mudar) = mudable, Able1 (combar) = flexible, Able1 (comer) = hambriento, Able2 (comer) = comestible, Able1 (excusar) = indulgente, Able2 (excusar) = excusable.
A.F. Gelbukh
Sofia N. Galicia-Haro
The information on syntactic government, or subcategorization, is an integral part of a language lexicon. The lack of this information leads to such errors as *to marry with Mary that can be said by a French-speaking person, or, say, *to marry on Mary or *to marry behind John that can be said by a Russian-speaking person; clearly, for an English-speaking person it would be difficult to choose the correct preposition when speaking in these languages: *se marier Marie. The intuition based on one’s native language often does him or her an ill service when speaking in another, even closely cognate language. Unfortunately, in the teaching practice and in the existing manuals little attention is paid to clear and systematic consideration of syntactic government. In the paper, a Spanish dictionary of extended subcategorization frames is presented. Such a frame (a government pattern) for a word lists the means of expression of its valencies as well as gives the information on the compatibility of these valencies or specific means of their expression (e.g., Spanish *mover de ... hasta ...). A statistical algorithm of compilation of such a dictionary from a large unprepared text corpus is discussed. The algorithm is non-supervised and produces a list of prepositions (or grammatical cases) used with each word; at the same time the algorithm resolves the syntactic ambiguity in the corpus. Another, supervised algorithm is intended for computer-aided compilation of the human-oriented dictionary.
In the Meaning ⇔ Text theory, the syntactic zone of a dictionary entry describes the correspondence between the semantic and syntactic valences of the headword, all ways of realization of the syntactic valences, and the indication of obligatoriness of presence for each actant (if necessary). For this, the dictionary presents a government pattern (GP) table and the GP restrictions; examples are usually given as well. The restrictions considered in a GP can be of any type (semantic, syntactic, or morphological). The compatibility between syntactic valences is considered among those restrictions. The examples cover all possibilities: examples for each actant, examples of all possible combinations of actants, and finally examples of impossible or undesirable combinations.
The main part of a GP is the list of syntactic valences of the headword. These are listed in a rather arbitrary order, though the order of growing obliqueness is preferred: subject, then direct object, then indirect object, etc. Each headword usually imposes certain order on its actants. For example, an active entity (subject) occupies the first place, then the principal object of the action (direct complement) follows, then another (indirect) complement (if it exists), etc. Also the way of expression for headword meaning influences this order. For example, the expression for Spanish acusar: person X accuses person Y of action Z before a person W.
Other obligatory information at each syntactic valence is the list of all possible ways of expressing this valence in texts. The order of the options for a given valence is arbitrary, though the most frequent options usually go first. The options are expressed with symbols of parts of speech or specific words.
Following the notation of GP table and list of examples, we can describe the Spanish verb acusar as:
1 = X | 2 = Y | 3 = Z | 4 = W
1. NP | 2. a NP | 1. de NP | 4. ante NP
 | | 2. de INF |
obligatory | obligatory | |
C.1 + C.2 Juan acusa a María.
C.1 + C.2 + C.3.1 Juan acusa a María de robo.
C.1 + C.2 + C.3.1 + C.4 Juan acusa a María de robo ante los demás estudiantes.
Prohibited:
C.1 + C.3.1 *Juan acusa de robo.
C.3.1 + C.4 *Acusa de robo ante los demás estudiantes.
Although in a complete dictionary the syntactic zone lists all possible examples after the GP table, our example shows only some of the possible combinations of actants and some of the impossible or undesirable combinations. Obligatoriness was the only criterion used for the examples of impossible combinations. We omitted the list of examples for each actant because they are implicit in the examples above.
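The obligatoriness check used for the prohibited examples can be sketched as follows. This is only an illustration under stated assumptions: the dictionary GP_ACUSAR is a hypothetical encoding of the GP table for acusar given above, and is_allowed tests obligatoriness only, not any other GP restriction.

```python
# Hypothetical encoding of the GP table for "acusar" shown above:
# valence number -> ways of expression and obligatoriness flag.
GP_ACUSAR = {
    1: {"options": ["NP"], "obligatory": True},
    2: {"options": ["a NP"], "obligatory": True},
    3: {"options": ["de NP", "de INF"], "obligatory": False},
    4: {"options": ["ante NP"], "obligatory": False},
}


def is_allowed(gp, present):
    """Accept a combination of valences only if every obligatory one is present."""
    present = set(present)
    unknown = present - set(gp)
    if unknown:
        raise ValueError(f"valences not in the GP: {sorted(unknown)}")
    return all(v in present for v, entry in gp.items() if entry["obligatory"])


print(is_allowed(GP_ACUSAR, {1, 2, 3}))  # Juan acusa a María de robo.
print(is_allowed(GP_ACUSAR, {1, 3}))     # *Juan acusa de robo.
```

A fuller model would also record which option of each valence is used (e.g., C.3.1 vs. C.3.2) and the compatibility restrictions between valences.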
This example shows how preposition usage allows for identification of the prepositional phrases that realize the valences of a given verb. As another example, the meaning of the Spanish verb expresar ‘express’ is related not only to telling something to someone, but also to the manner in which the subject does this. Spanish speakers use the preposition a to introduce an animate actant to whom something is expressed, and several prepositions (en, de, con, mediante) to say by what means he or she expresses something. Preposition usage also helps to distinguish between verb meanings. Many verbs have different meanings distinguishable by the corresponding prepositions, for example, lanzar: lanzar1 ‘throw’, lanzar2 ‘throw out’, etc.; these homonyms use different prepositions.
ASF stands for Advanced (or Extended) Subcategorization Frame. For such languages, describing complements as usual SFs do (with fixed complement order, i.e., without order preferences) could lead to errors or at least problems. For example, the Spanish phrase le acusas ‘you accuse him/her’ uses le as a pronoun for the accused person, in spite of the “unusual” word order. Another example: ellos quieren acusarle ‘they want to accuse you’ where the clitic represents the person accused. The SF number 6 from the example in section 2 considers acusar as an intransitive verb; it represents sentences that seem to end with the verb without specifying the person that is accused, though in fact this information is located before the verb or just within the verb form.
Missing description of subject in subcategorization information could result in erroneous recognition of complements. For example, distinction between inanimate and animate noun groups permits subject inversion detection for some verbs; however, if we can not make such distinction, then we recognize a wrong SF with erroneous valences, as with the verb acusar in the phrase:
... se [PPR] acusa [V] el atleta [NP] ...
(PPR stands for a personal pronoun). The subject inversion here looks like an NP type complement. In section 3, the valences for this verb were described, and the only valence with NP type realization was the subject. The SF of NP type, appearing in section 2, represents complements for another meaning of the verb acusar (related to reveal). This confusion between subject and complement results in a wrong structure or wrong meaning assignment to the verb in such sentences.
There is semantic information detectable in syntactic analysis that must be considered in ASFs: for example, detecting syntactic valences that are linked to semantic valences in deeper analysis levels and distinguishing between complements of the verb and circumstances described by the same SF. The identification of syntactic valences of the verb is accomplished through detecting introductory complement words. For some verbs, only one introductory word is used to realize each valence; for other verbs, there are several such words related to the same syntactic valence. For example, for acusar, de NP is referred to the action of which somebody is accused, but for the verb expresarse (reflexive ‘express’), prepositions en, de, con and mediante (en NP, de NP, con NP, mediante NP) refer to the way somebody expresses something.
For some verbs, a single SF is enough to describe both the complements (or valences) of the verb and the circumstances. For example, locative verbs such as colocar ‘place’ require complements related to place, with prepositions and noun phrases of locative meaning, like en ‘in’ and espacio ‘space’. The frame en NP describes both place and time complements for the verb colocar, for example: coloca en este espacio ‘places in this space’ and coloca en este momento ‘places at this moment’. The former case is related to a valence of the verb, while the latter is a circumstance and thus is not connected with any valence.
These examples suggest selecting of subcategorization information on the basis of GP, including subject description, specifying introductory words for valences, and describing valence permutations and possible and impossible valence combinations. In addition, it suggests including animity and type of the complement (locative, etc) features in ASF. Thus, the subcategorization information proposed for ASFs is: subject description, differentiation of valences and circumstances, statistical information for each type of valence realization and valence combinations, and specification of animity and locativeness in valence realization.
Animity has some peculiarities in Spanish. For many European languages, the direct object is connected to the verb without prepositions. However, in Spanish animate entities are connected by preposition a, which is substitutable by pronominal forms: le, les, etc. Inanimate entities are connected directly, i.e., without any prepositions. Animity can be considered as a personification: for example, government in Spanish is addressed with preposition a: acusaron al Gobierno. Besides persons, the category of animity is considered covering groups of persons; animals; and some abstract entities: political parties, organizations, etc.
The use of the preposition a for animity in fact covers several different uses, but here we consider only its use with the direct object. One such use is to distinguish between the meanings of a given verb: for example, querer algo ‘to have the desire to obtain something’ and querer a alguien ‘to love or to esteem somebody’. The principal use that we differentiate is the distinction between animate and inanimate nouns: for example, veo una casa ‘I see a house’ and veo a mi vecina ‘I see my neighbor’. Thus, in Spanish animity is evidently a syntactic feature, though it has a clear semantic motivation. This feature is included in ASF mainly in connection with the preposition a.
Spanish has a freer word order (though not totally free) and, as a consequence, the possible combinations are limited. Impossible combinations could be defined on the basis of experience, though in this case they would not reflect changes of language style and the preferences of particular domains. Considering obligatory presence, we can list some impossible combinations, but not all. Therefore, statistical corpus-based weights were used to acquire such information. If a valence occurs in all sentences extracted from the corpus for a given verb, in one or several different realizations, this is considered evidence of obligatoriness, represented in our notation (see below) by 100%. This weight will be used by a parser to find even long-distance links for valences. For example, the verb acusar requires the presence of a direct object; with this specification, the parser should try to find this piece of information somewhere before or after the verb.
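The corpus-based obligatoriness weight can be sketched as follows. The sketch assumes that each sentence extracted for a verb has already been reduced to the set of valences found in it; the function name valence_weights and the toy data are invented for illustration and are not taken from LEXESP.

```python
def valence_weights(sentences):
    """Percentage of sentences in which each valence occurs.

    `sentences` is a list of sets of valence labels observed with one verb.
    """
    total = len(sentences)
    labels = set().union(*sentences) if sentences else set()
    return {
        label: 100.0 * sum(label in s for s in sentences) / total
        for label in sorted(labels)
    }


# Toy data (invented): four sentences for one verb.
toy_sentences = [{"V", "W", "X"}, {"W", "X"}, {"V", "W"}, {"W"}]
weights = valence_weights(toy_sentences)

# A valence found in every sentence gets 100% and is treated as obligatory.
obligatory = [label for label, weight in weights.items() if weight == 100.0]
print(weights, obligatory)
```

In this toy sample only W reaches 100%, so a parser using these weights would keep looking for W before or after the verb.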
To build an ASF dictionary, first, for possible valence realizations, the weights for each valence related to a specific introductory word or preposition are calculated. Then, the weights for each valence in different positions related to the verb are obtained for possible valence combinations. This statistical information gives a way to evaluate specific type descriptions of valences and valence combinations to increase the efficiency of ambiguity resolution during parsing.
The ASFs are considered specific for each verb. Homonymous verbs are distinguished by numbers, for example: acusar1, acusar2, the order of numbering being quite arbitrary. Each pair of homonyms should have at least one different element in their ASFs. An ASF is represented in the following way:
Where:
+ denotes one or more elements
* denotes zero or more elements
~ denotes the verb
The information necessary to fill in these frames is the information of usual SFs plus syntactic valence information of the verb, semantic features (animity and type of complement: locative, etc.), and statistics of common use. We describe the form to acquire this information first from linguistic sources (Fig. 1; x stands for numbers that are not important for the present discussion; ~ stands for the verb itself) and then from corpus.
For Spanish, there are no dictionaries with complete subcategorization information. There is some scattered information given by several authors that we have taken into account for the ASFs. For example, the verb acusar was considered by Penadés [1994] among the 145 verbs analyzed, with the following syntactic-semantic scheme:
alguien ‘somebody’ |
Acusa ‘accuse’ |
a alguien ‘to somebody’ |
de algo ‘of something’ |
pure internal direct |
direct intrinsic |
specific affected |
specification |
Other authors, e.g., Alonso [1960], showed some usage examples: a alguno al, ante el juez, de haber robado, de los pecados (for reflexive verb), de lo mal que se ha portado (for reflexive verb). Nañez [1995] presents preposition usage in alphabetical order for syntactic constructions; according to him, acusar employs three prepositions: a, ante, de; some examples are given in the same form as Alonso's. Figure 1 shows the ASF for the verb acusar built using only the information from the mentioned authors.
Figure 1.
Obligatory occurrence of valences is marked with 100% weight. From such syntactic-semantic scheme, all valences could be considered obligatory. The last valence is described as de NP, but Alonso considers also de V_INF (de haber robado), so the weights for X valence and the weights for all non-obligatory valences or unknown frequencies are marked with x%.
Taking into account the information proposed in the previous section, and considering only the sentences for acusar from the LEXESP corpus, the corresponding ASFs can be built as shown in Fig. 2 and 3. Fig. 1 presented a formal example, while Fig. 2 and 3 use a more practical form just as a shorthand.
Figure 2.
Figure 3.
The two examples for acusar show the difference in subcategorization information for the two verbs: acusar1 and acusar2. For the second one, the complement NP makes the distinction in meaning if we discriminate between common nouns and animate nouns. DEUM [1996] shows three different meanings for this verb: acusar1, specified as by Penadés; acusar2, with an NP as direct object; and acusar3 recibo ‘action of receipt’. In LEXESP, acusar1 represents 86.5% of the sentences for the verb acusar, and acusar2 represents 13.5%. Of these two cases, the former is more complex; we discuss its statistics in the next paragraphs. The probabilities of valence combinations for acusar1 appear in the Combinations section, the universe being the total number of sentences for the given verb.
Some combinations are represented by a small number of sentences. The combinations [VW~YX, 0.44%], [VW~XY, 0.44%], [WV~Y, 0.44%] show statistics of combinations involving the valence Y, which represents the entity before which the subject accuses. Although in LEXESP these cases are rare (one sentence for each case), in legal texts this valence appears more frequently. This is an indication of corpus specificity. It obliged us to work in two different directions: obtaining more texts on diverse topics and compiling domain-specific dictionaries.
The combinations [V~WX, 40.97%], [V~W, 7.05%], [VW~X, 27.75%], [WV~X, 10.13%], [VW~, 5.28%], represent the most common use. The last three ones represent almost half of the total combinations showing the presence of an obligatory valence before the verb in different positions.
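The reading of these combination patterns can be made concrete with a short sketch. It assumes that each pattern is a string in which ~ marks the verb and letters mark valences, and it uses only the five percentages quoted above; the dictionary name and the helper share_before_verb are hypothetical.

```python
# The five most common combinations quoted above (pattern -> % of sentences).
COMBINATIONS = {
    "V~WX": 40.97,
    "V~W": 7.05,
    "VW~X": 27.75,
    "WV~X": 10.13,
    "VW~": 5.28,
}


def share_before_verb(combinations, valence):
    """Sum the percentages of the patterns where `valence` precedes the verb (~)."""
    total = 0.0
    for pattern, percentage in combinations.items():
        before_verb, _, _after_verb = pattern.partition("~")
        if valence in before_verb:
            total += percentage
    return total


print(round(share_before_verb(COMBINATIONS, "W"), 2))
```

For W the sum covers the last three patterns, 27.75 + 10.13 + 5.28 = 43.16%. Since only the five listed combinations are included, the result describes those combinations and not the whole corpus.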
The valence W indicates two types of realizations: with the preposition a or with personal pronoun. We consider that this distinction must be described, because even if in common use any personal noun could be substituted by a personal pronoun, for this verb the valence occurrences are almost equal before and after the verb. In LEXESP, 47.56% of total sentences have this valence before the verb, 11% use a as the introductory word for W, and the remainder use PPR. The total percentage is not 100% because of three sentences with direct speech.
The use of personal pronoun requires more analysis for automatic acquisition because of the difficult-to-analyze pronouns, especially se. Cano [1987] mentioned that se is used almost in 25% of total verb phrases but its use varies from reflexive to reflexive passive and impersonal form. Also he mentioned that other authors considered it as a lexical addition with repercussion on verb meaning. We found that 36.56% of total sentences in LEXESP use personal pronoun for W valence before the verb, and 10.13% use it as a part of the verb form. We formally considered the latter case as V~W.
Finally, we will discuss the V valence, which appears in only 58.13% of the sentences despite being obligatory. There are mainly two reasons for this. One reason is the use of reflexive pronouns expressing the same person as subject and direct object. This case was represented by very few sentences, and agreement between the reflexive and the verbal form helps to resolve this lack of subject. The second reason, accounting for 39.2% of the total sentences, is the absence of the subject, which some theories treat as a so-called zero subject. The use of the word se as an impersonal pronoun and the use of a zero subject that refers to a subject in a distant position require deep syntactic analysis, capable of finding specific subjects outside the selected text window or even in previous sentences. The same discussion applies to the V valence for acusar2.
M.A. Alexandrov
P.P. Makagonov
The use and translation of words as special terms are often reflected neither by the general language lexicons nor by the existing specialized dictionaries, even if the latter are available for a given subject domain. In many cases the only available information on the domain is a rather small set of texts. The paper concerns the computer-aided compilation of domain-oriented dictionaries. For a given domain area, by a domain-oriented dictionary we mean a list of representative special words or expressions in the form of so-called normalized codes, or stems; the latter concerns morphological normalization that is especially important for languages with sophisticated morphology, such as French, Spanish, or Russian. Such dictionaries are also useful for automatic text processing since they allow for search of information on the given domain in a large text archive as well as for detecting information patterns and hidden regularities in the specialized text. The compilation of a high-quality dictionary is a multistage procedure consisting of three major stages: (1) selection and evaluation of the texts to train the dictionary, (2) statistical analysis, detection of the normalized codes, and compilation of the frequency lists, and (3) analysis and correction of the results by the human experts. Algorithms of detection of the normalized codes, the quantitative parameters for selection of the most informative part of the frequency list, and so-called coefficients of importance for keywords are discussed. A practical system supporting supervised compilation of domain-oriented dictionaries is presented.
A domain dictionary (DD) is a dictionary containing key expressions (words or word combinations) and their appropriate coefficients of importance for the given domain. Hereafter, we will use a more familiar term keyword to refer to any key expression, which can be one word or word combination.
A DD is constructed by an expert or the user according to the following principles:
- A DD does not contain the keywords that are used with the same (or close) frequency in general-topic texts or in some other domain-oriented documents.
- A DD does not contain keywords that have very non-homogeneous distribution in the set of domain-oriented documents.
In our Text Recognizer system, keywords are listed in the DD not literally but in the form of a pattern (we call it a key pattern) describing a group of words with equivalent meaning. In such a pattern the inflection for time, person, gender, number, etc., as well as part of speech distinction, suffixes, etc., are ignored: obligation, obligations, obligatory, oblige → oblig-. In some cases a group of words cannot be represented just with a common substring because of “false alarms”: admission, admit, admits → admi- causes a “false alarm” on administration; in such cases two alternatives in the pattern, or simply two individual entries, can be used: admis- and admit-. A description of our regular expression-like language for key patterns goes beyond the scope of the present article.
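The “false alarm” effect can be reproduced with ordinary regular expressions. The sketch below is not the key pattern language of Text Recognizer, which is not described here; it is a stand-in that assumes a key pattern is just a set of alternative word-initial stems, and the helper compile_key_pattern is hypothetical.

```python
import re


def compile_key_pattern(*stems):
    """Build a regex matching any word that starts with one of the stems."""
    alternatives = "|".join(re.escape(stem) for stem in stems)
    return re.compile(rf"\b(?:{alternatives})\w*", re.IGNORECASE)


words = "admission admit admits administration"

loose = compile_key_pattern("admi")             # too short: false alarm
strict = compile_key_pattern("admis", "admit")  # two alternatives

print(loose.findall(words))
print(strict.findall(words))
```

The loose pattern also picks up administration, while the pattern with two alternatives matches only admission, admit, and admits.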
The coefficient of importance reflects the fuzzy nature of the relationship between the keywords and the selected domain, i.e., a DD is a fuzzy set of the keywords. Such a coefficient is a number between 0 and 1. These coefficients are determined based on expert's or user’s intuition. The general recommendations for assignment of such weights are: the keyword that is essential for the given domain is assigned the weight 0.9, important 0.7, regular 0.5, important for this and also for some other domains 0.3, typical for many domains 0.1. If a domain dictionary does not contain these coefficients, they all are considered to be 1.
To give an example of a DD, we could include to the dictionary on the domain “Text Mining” the following keywords taken from the Information Letter about IJCAI-99:
Association Rule Discovery | 0.5
Case-Based Reasoning | 0.5
Computer | 0.1
Distribution Analysis | 0.3
Document Cluster- | 0.9
Information Retrieval | 0.7
Machine Learning | 0.7
Natural-Language Understanding | 0.7
Text Mining | 1.0
Trend Analysis | 0.3
Unstructured Text | 0.7
etc. |
Fig. 1. A domain dictionary.
If we have a DD, for every document we can build its so-called image relative to the domain defined by the given DD. Such an image is a list of the domain keywords with their corresponding numbers of occurrences in this document. If we have several DDs, we can build several images for a document, one for each domain.
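A document image can be sketched as a simple counting procedure. The sketch assumes, for simplicity, that every DD entry is a single-word stem matched at the beginning of a word; multiword key expressions and the real key pattern language are left out, and the function document_image and the sample text are invented for illustration.

```python
import re
from collections import Counter


def document_image(text, dd_stems):
    """Count, for every stem of the DD, the words of the text that start with it."""
    image = Counter({stem: 0 for stem in dd_stems})
    for word in re.findall(r"\w+", text.lower()):
        for stem in dd_stems:
            if word.startswith(stem):
                image[stem] += 1
    return dict(image)


sample = "Clustering of documents: each cluster groups documents on one topic."
print(document_image(sample, ["cluster", "document", "retriev"]))
```

Stems that do not occur in the document stay in the image with a zero count, so images of different documents built with the same DD have the same keys and can be compared directly.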
Fig. 2. A document image relative to a specific domain.
Fig. 1 and 2 show screenshots of our Text Recognizer system: one of the DDs and the corresponding document image.
These methods should help to select texts for building high-quality domain-oriented dictionaries. These dictionaries, in turn, should support the selection of texts for training a neural network.
Creating a dictionary is an iterative process. First, it is necessary to define the domain and to form an initial dictionary of keywords and key expressions for this domain. The initial dictionary is created by an expert or a group of experts in the selected area. Then this dictionary is checked against standard texts. After that, the initial dictionary is extended on the basis of discussion with the expert(s).
Note that texts are considered as sets of fragments. The fragments are the minimal elements of information used for building the dictionary and training the neural network.
The decision to include a new lexical unit (keyword or key expression) in the dictionary, or to exclude it, relies on some special parameters:
Kw - the minimal critical number of dictionary words that is considered sufficient for including a fragment (or text) in the set of good (productive) ones.
Kt-max - the maximal critical share of fragments or texts: the largest relative part of independent texts or fragments containing a given dictionary word at which this word can still be used for evaluating the informational closeness of one fragment to the others in the fixed set of texts and fragments (for example, Kt-max=0.9).
Kt-min - the minimal critical share of fragments or texts: the smallest relative part of independent texts or fragments containing a given dictionary word at which this word can still be used for evaluating the informational closeness of one fragment to the others in the fixed set of texts and fragments.
These parameters take into account the fact that words occurring in almost all texts, or in only a few texts, are not useful for structuring texts.
These parameters must be assigned by the super-expert who compiles the domain-oriented dictionary on the basis of the text corpus.
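The role of the three parameters can be shown with a minimal sketch. It assumes that each fragment is reduced to the set of its words, that the Kt bounds are inclusive, and that Kw counts distinct dictionary words; the value 0.3 for Kt-min and the toy fragments are invented for illustration only.

```python
def document_share(word, fragments):
    """Relative part of the fragments that contain the word."""
    return sum(word in fragment for fragment in fragments) / len(fragments)


def informative_words(candidates, fragments, kt_min, kt_max):
    """Keep the words whose share of fragments lies between Kt-min and Kt-max."""
    return [
        word for word in candidates
        if kt_min <= document_share(word, fragments) <= kt_max
    ]


def productive_fragments(fragments, dictionary, kw):
    """Indices of the fragments that contain at least Kw dictionary words."""
    return [
        index for index, fragment in enumerate(fragments)
        if len(fragment & dictionary) >= kw
    ]


# Toy fragments (invented), each reduced to a set of words.
fragments = [
    {"text", "mining", "the"},
    {"text", "cluster", "the"},
    {"the", "rare"},
    {"the", "mining", "cluster"},
]
candidates = ["cluster", "mining", "rare", "text", "the"]

kept = informative_words(candidates, fragments, kt_min=0.3, kt_max=0.9)
print(kept)
print(productive_fragments(fragments, set(kept), kw=2))
```

Here the word the occurs in every fragment and rare in only one of four, so both fall outside the bounds, and the third fragment contains too few dictionary words to count as productive.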
In many cases it is useful to check the fragments and texts intended for building dictionaries for complexity and harmonicity.
Text complexity is the value C = M*Log N, where M is the average length of a word (in characters) and N is the average length of a phrase (in words). Of course, M and N are interrelated, but this criterion works well.
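The complexity measure can be computed directly from raw text. The sketch below makes several assumptions that the formula leaves open: the logarithm is natural, a phrase is a stretch of text delimited by ., ! or ?, and a word is a maximal run of letters or digits. The function name text_complexity is hypothetical.

```python
import math
import re


def text_complexity(text):
    """C = M * log(N): M is the mean word length, N the mean phrase length in words."""
    phrases = [re.findall(r"\w+", part) for part in re.split(r"[.!?]+", text)]
    phrases = [words for words in phrases if words]  # drop empty phrases
    all_words = [word for words in phrases for word in words]
    if not all_words:
        return 0.0
    mean_word_length = sum(len(word) for word in all_words) / len(all_words)
    mean_phrase_length = len(all_words) / len(phrases)
    return mean_word_length * math.log(mean_phrase_length)


print(text_complexity("The parser finds valences. The dictionary stores weights."))
```

Calling the same function on each phrase or window of a text gives the series of values needed for a complexity curve along the text. Note that a single one-word phrase always yields zero under this formula, because log 1 = 0.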
Long texts must be checked for harmonicity: the distribution of words in the frequency list of the set of texts must conform to the Zipf law: F(r)=1-exp(-c(r)**k), where r is the rank of a word and c and k are some constants. Deviation from this law reflects non-uniform word content of the text: it was written by various authors or on various themes.
When analyzing the various parts of the frequency list of words, it is useful to compare the information capacity of the high-frequency, middle-frequency, and low-frequency parts. For this, it is possible to use the entropy H = -SUM(p(i)*Log[p(i)]), where p(i) are the frequencies of the words (lexical units) in the corpus of texts.
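The entropy of a part of the frequency list can be computed as follows. The sketch assumes that p(i) are relative frequencies within the part being analyzed; the base of the logarithm is not specified in the text, so it is left as a parameter (natural logarithm by default). The function name entropy is chosen for illustration.

```python
import math
from collections import Counter


def entropy(words, base=math.e):
    """H = -sum(p(i) * log p(i)) over the relative frequencies of the words."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum(
        (count / total) * math.log(count / total, base)
        for count in counts.values()
    )


# Four equally frequent words carry 2 bits of entropy.
print(entropy(["text", "mining", "cluster", "corpus"], base=2))
```

To compare the high-, middle-, and low-frequency parts, the same function can be applied separately to the word occurrences that belong to each part.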
Computer graphics allow us to make effective use of the subjective opinion of the expert (user) in evaluating the properties of fragments (texts) and the relations between them.
In particular, it is useful to analyze the complexity of texts by means of an 'oscillogram' of complexity along the text (see Fig. 3). Here the complexity is calculated for every phrase.
Fig.3 Presentation of 'oscillogram' of text complexity
In addition, it is useful to observe the harmonicity of all fragments using an 'oscillogram' of the concordance coefficient between the word content of the fragments and the Zipf law (see Fig. 4). Here the concordance coefficient is calculated for every fragment.
Fig.4 Presentation of 'oscillogram' of text harmonicity.
These graphical tools can help to select the most interesting fragments of text for further analysis and to exclude the 'bad' fragments.
To evaluate the relations between fragments, it is first necessary to prepare the images of all fragments (texts). They look like the document image described above.
The relations between texts can now be determined using a formal procedure of comparison of text images. For analysis, it is very convenient to present these relations in the form of a circle graph. We present here the work of Text Classifier (Fig. 5). Color gives the expert additional possibilities for text analysis. It is possible to break weak links and to select groups of strongly linked fragments or texts.
Fig.5 Graph of links between fragments (texts).
Clearly, the texts selected for building the dictionary of one fixed domain must have approximately equal degrees of linkage to one another. When we have hundreds of texts, such a graph can help the expert to evaluate the homogeneity of the corpus very quickly.
The following methodology is proposed:
1. A frequency list of words is built for every text, together with an integral frequency list for all texts selected on the basis of the initial dictionary (see above).
2. The expert in the selected domain considers the words that appear in a share of texts greater than Kt-max or less than Kt-min and decides whether to include these words in the dictionary. Here the expert can use additional characteristics, for example entropy, to evaluate the high-frequency and low-frequency parts of the frequency list.
3. As a result, we have the words from the most informative part of the dictionary, i.e., the words that lie in some neighborhood of the 50% level with respect to all fragments (texts). These words are declared the new standard dictionary for the analysis of new texts.
4. If the standard dictionary is poor, the expert may extend it with infrequent and frequent words.
In this way we obtain informational compatibility between the standard domain-oriented dictionary and the standard texts.
The created dictionary can now be used for text selection. At the end of the analysis, the selected set of texts is checked for homogeneity. This can be done by means of the special graph of links (see Fig. 5).
In future it is supposed to use domain - oriented dictionaries in following applications, connected with classification problems:
· Selection main reporters for the conference, meeting, council. This operation uses analysis of text, submitted by these reporters. It is obviously that main reporters are those ones who have the most informative text on one hand and those ones who have the least degree of link between themselves on the other hand. This selection may be completed on the graphs, where it is reflected the relations between texts. Of course, graph is constructed on the basis of images of texts (see Fig.5)
· Preparation of a new document from fragments. This problem is typical when a new text must be prepared from the proposals of a group of experts. In this case, the most informative fragments must be selected from all the texts. This set of fragments may then be presented to the decision maker, who can define the order of the fragments using a special graph of the "dendrite" type.
· Analysis of information on the Internet. The quantity of information circulating on the Internet is growing catastrophically. Our tools could help to analyze it. First, it is necessary to select the domain and the domain-oriented dictionary and to define the criteria of closeness of documents to the selected domain. Once the texts are filtered, the resulting set can be analyzed for decision-making: is the set homogeneous, or can several sub-themes be found within the selected theme? This analysis may be carried out by means of the graph of links (see Fig. 5). Here the expert must be able to isolate some links and emphasize others.
· Consulting on the basis of laws and normative acts. The main problem is matching the description of a situation with the appropriate law or normative act, which makes it possible to reach a justified decision in that situation. This procedure is typical for an expert system providing legal assistance to lawyers and administrators. It can be implemented if we have the image of the situation and the images of the laws. The image of a law is a parametrized form of that law. We may use elements of fuzzy logic to describe the relation between the situation and the appropriate law.
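The first application above, choosing the main speakers (reporters) whose texts are the most informative and the least linked to one another, can be illustrated with a greedy procedure. The sketch is hypothetical throughout: informativeness is approximated by the relative size of a text's image (its set of dictionary words), the link between two texts by the Jaccard coefficient, and `penalty` is an invented weight that trades informativeness against redundancy.

```python
def jaccard(a, b):
    # Assumed measure of the link between two text images.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_main_texts(images, k, penalty=1.0):
    # Greedily pick up to k texts, preferring large images that are
    # weakly linked to the texts already selected.
    if not images:
        return []
    largest = max(len(img) for img in images) or 1
    info = [len(img) / largest for img in images]  # assumed informativeness
    selected = []
    candidates = set(range(len(images)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (jaccard(images[i], images[j]) for j in selected),
                default=0.0,
            )
            return info[i] - penalty * redundancy
        # Sorting makes ties resolve in favor of the lowest index.
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


images = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"e", "f", "g"},
]
chosen = select_main_texts(images, k=2)
```

Here the first text is chosen for its size, and the third is preferred over the second because it shares no words with the first. The same procedure could serve for selecting the most informative fragments when a new document is assembled, again only under the assumptions stated above.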
[1] This word also has other meanings.
[2] This is important for a child, because a child cannot take care of himself. When he grows up, he is formally still an orphan, but this is no longer so important.
[3] It is interesting that the plural is used only for “kitajchonok” and “negritjonok”. Perhaps this is connected with the fact that they are used for denoting the corresponding races (see the Discussion section).
[4] The concepts from plaksa to poprygunja may also denote adult persons, though such behavior is typical mainly for children.
[5] We used the example by N. Chomsky “Colorless green ideas furiously sleep.”
[6] In linguistics, the part of speech is considered a purely grammatical category that cannot change the core meaning of the word.
[7] The index 1 of the word real denotes the first of the homonyms of this word: real1 refers to reality, real2 refers to a king, etc.