Igor Bolshakov, Alexander
Gelbukh, and Sofia Galicia-Haro. A
Simple Method to Detect and Correct Spanish Accentuation Typos. Proc.
PACLING-99, Pacific Association for Computational Linguistics, University of Waterloo,
Waterloo, Ontario, Canada, August 25-28, 1999, ISBN 0-9685753-0-7, pp. 104-113.
A Simple Method to Detect and Correct Spanish Accentuation Typos
I. A. Bolshakov, A. F. Gelbukh, and Sofía N. Galicia-Haro
Natural Language Laboratory, Center for Computing Research (CIC), National Polytechnic Institute (IPN),
Av. Juan de Dios Bátiz, esq. Mendizabal, Zacatenco, C.P. 07738, Mexico D.F., Mexico.
{igor, gelbukh, sofia}@pollux.cic.ipn.mx
We study a frequent typographic error in Spanish: the omission of a stress mark.
It transforms one existing word into another existing one and therefore cannot be
detected by usual spell-checkers. We propose a simple recovery procedure that
relies on checking whether a specific 4-word context conforms to a noun or adjective,
and on a closed list of word pairs that differ only in the accentuation mark,
one member of each pair being a verb and the other a noun or adjective of a certain
gender and number. When one of the words of such a pair is found in the text,
the algorithm checks its context to detect whether it corresponds to a noun or
adjective; if the condition is not satisfied, the alternative hypothesis of a
verb is assumed. The idea is applicable to numerous nouns or adjectives, like número or máquina, that turn into quasi-homonymous personal
verb forms when they lose their stress marks. An exhaustive list of 300
quasi-homonyms is given.
Key words: natural language processing,
spell-checking, Spanish, accentuation.
Introduction
Usual spell-checkers consider each word out of its context, i.e., independently of
its environment. With such a strategy, only those typographic and orthographic
errors can be detected that change an existing word into a meaningless string of
letters that does not exist in the given language. Errors that convert
one existing word into another existing one go unnoticed and
make the text essentially ungrammatical. Correcting them reliably requires an advanced
grammar checker with a full-scale syntactic parser, which is
not always affordable. To solve this problem, various heuristic methods have been
proposed, among them phonological, morphological, syntactic, statistical, and
knowledge-based ones.
The purpose of applying these methods is principally to formulate, based on the context,
some conditions for the word in question and then check whether the word
satisfies them. In the way the context is analyzed, this task is similar to
lexical disambiguation.
One of the
best-known disambiguation tasks is context-based part of speech tagging. It is
useful for spell checking, especially for wide coverage spelling error
detection and correction. For example, Elmi and Evens (1998) used a combination
of dictionary-based methods and context part of speech analysis for designing a
tutorial system to help medical students to learn the language used in
cardiovascular physiology. A part of speech tagger was used to disambiguate
contexts with adjacent part of speech sequences.
There are different methods for part-of-speech tagging, mainly linguistic, statistical,
and machine learning ones. There are also hybrid methods that combine different
approaches; for example, Tzoukermann et al. (1994) used statistical and
knowledge-based resources. Nevertheless,
though methods of full part-of-speech tagging are well known, they are too
complicated and resource-demanding for applications in spell checking.
In some works
on wide coverage spelling checking, a combination of different methods was used
to increase the detection and correction accuracy. Golding and Schabes (1996)
used a hybrid method based on part of speech trigrams and context features.
Agirre et al. (1998) proposed a
multi-resource method based on syntagmatic, paradigmatic, and statistical
knowledge for context sensitive spelling correction, which proves to work
better as wider context is taken into account.
In this paper,
we suggest a method applicable to a specific kind of errors. In a real
spell-checker, it is to be used in conjunction with or in addition to
well-known methods for dealing with other types of errors, which we here
neither consider nor discuss. The proposed method is a combination of the
method of confusion lists with an extremely simplified context-based part of
speech tagging. Namely, we suggest some heuristics to solve the spell-checking
problem for a specific group of Spanish words.
In Spanish, some errors fully undetectable out of context are connected with accentuation
rules. For example, the phrases este artículo tiene… ‘this article has…’,
el número del adjetivo… ‘the number of the adjective…’,
or las páginas siguientes… ‘the following pages…’ would be considered correct by a
spell-checker even if the words artículo, número, and páginas lost their stress marks:
*este articulo tiene… ‘this I join has…’,
*el numero del adjetivo… ‘the I enumerate of the adjective…’,
*las paginas siguientes… ‘the you paginate following…’.
Indeed, the words articulo, numero, and paginas are correct Spanish
verb forms, though fully unacceptable in the mentioned contexts.
This type of error is quite usual for foreigners and is very common in informal Spanish
writing, especially on the Internet. Thus, it appears in large text corpora
collected from the Internet. One of the authors, who is Russian, made more
than 60 accentuation errors in his first paper in Spanish, and only half of
them were detected by the spell-checker of Word for Windows, version 6. A newer
version of Word for Windows uses a grammar checker to detect more errors, but
it still missed most of the mentioned errors.
Many other languages also employ diacritic marks extensively, among them Polish, Czech,
Hungarian, and French. In these languages, the omission of a diacritic mark
usually converts a valid word into a meaningless string, as it does in many cases
in Spanish itself, e.g., in the words énfasis, ambigüedad, and niño after replacing é,
ü, and ñ with e, u, and n, respectively. For Polish, Czech, and Hungarian, such cases
prevail and thus can be corrected by a usual spell-checker. Meanwhile, in
French (cf. the noun pose versus the participle posé) and especially in Spanish there are
numerous groups of words that are valid both with and without specific
diacritic marks. Here we are interested not in all possible diacritics, but
only in accentuation marks and, in this article, only in Spanish. With some
other types of diacritics or in some languages other than Spanish, well-known
methods of spelling correction are applicable. In some other cases, especially
with diacritics other than accent marks (e.g., Spanish soñar vs. sonar),
the methods described here are not applicable.
Yarowsky (1994) proposed a method for the restoration of accent marks in arbitrary French or
Spanish words, based on statistically validated contexts. This method requires
very large and strictly well-formed training corpora. Training yields vast
context collections for various word forms. Even after compressing these
collections by means of a morphological analyzer and a semantic generalizer
supported by a dictionary, the tool used for accent mark restoration remains somewhat
cumbersome.
To our knowledge, in another accent restoration program Daciuk (1998) used a similar
statistical method, while Cutting et al. (1992) used hidden Markov chains.
All these programs require large tables and
dictionaries. Meanwhile, the methods used by commercial products like the Spanish
version of WordPerfect or the French Le Correcteur of Les Logiciels Machina
Sapiens have not been published.
Our method uses a very restricted set of word endings and a small closed list of relevant words
that can be processed properly. At the same time, our method is applicable only
to pairs opposed in their parts of speech, namely, noun or adjective versus
verb. Hence, it does not claim universality: confusion within
verb pairs like Spanish presento vs. presentó or hable vs. hablé cannot be
resolved by it in a regular way.
To repeat, Spanish accent marks are used in such a way that they often distinguish between
words of different parts of speech. Both the nouns and adjectives on the one hand
and the usual personal verb forms on the other hand have the endings ‑o, ‑a, ‑as, ‑e, ‑es.
Our method employs a kind of part-of-speech tagger that can set only one mark,
“possible adjective or noun,” leaving all other words unmarked.
Part-of-speech taggers usually take into account the linear context for
disambiguating part-of-speech homonymy (Yarowsky 1994). Fortunately, Spanish
presents good conditions for detecting the noun or adjective context: articles
are used in Spanish much more frequently than in English, and there is
grammatical agreement between nouns, adjectives, and articles.
The main group of error-prone Spanish words with stress marks contains those nouns or
adjectives that turn into a personal verb form if the stress mark is omitted. At
the first stage of our study, the quasi-homonymous pairs “noun or adjective
versus verb” were collected through manual search in large Spanish
dictionaries, such as those of the Academy of Madrid (Diccionario de la lengua
española 1996) and the Anaya Group (Diccionario del español contemporáneo 1997).
This gave us about 150 pairs and stimulated
further search. Then a program was written for automatic extraction of such
nouns and adjectives from large electronic dictionaries. Thus, the total number
of pairs reached nearly 300. The accented counterparts of all the gathered
pairs are given in the Appendix. The common features of the pairs are the
following:
· The pairs are Spanish words of middle statistical rank, and many of them are quite common in scientific and technical literature. Note that the noun in each pair is usually more frequent than the corresponding verb.
· An accentuated counterpart is a specific word form of a noun (like cómputo), of an adjective (like legítimas), or of a vocable combining two homonyms, a noun and an adjective (like crítica ‘the criticism’ or feminine ‘critical’).
· A non-accentuated counterpart is a specific personal form of a verb in singular. Therefore, the components of each pair correlate not as lexeme to lexeme, but as one word form to another word form. Hence, all the corresponding pairs should be considered separately.
· Each accentuated form, independently of its part of speech (noun, adjective, or homonymous vocable), is characterized by a specific combination of number and gender, e.g., ánimo (sing, masc), específica (sing, fem), cópulas (plur, fem), partícipes (plur, masc). This combination is fixed for each pair and can be assigned to it beforehand.
Thus, the entire information about a pair may be represented in the computer as a
triple (accentuated counterpart, number, gender). The non-accentuated
counterpart can be easily computed from its accentuated counterpart by
simply replacing the accentuated letters with their non-accentuated
analogs. The structure of the list is described in more detail in the Appendix.
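The triple representation and the computation of the non-accentuated counterpart can be sketched as follows. This is a minimal illustration in Python, not the authors' Pascal code; the names `Entry`, `strip_stress`, and `build_index` are hypothetical, and the three entries are just a sample from the Appendix.

```python
from typing import NamedTuple

# Only acute accents on vowels are removed; ñ and ü are not stress marks
# and are deliberately left untouched.
UNACCENT = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")

class Entry(NamedTuple):
    accented: str  # accentuated counterpart, e.g. "número"
    number: str    # "sing" or "plur"
    gender: str    # "masc", "fem", or "both"

def strip_stress(word: str) -> str:
    """Compute the non-accentuated counterpart of a word form."""
    return word.translate(UNACCENT)

def build_index(entries):
    """Map each non-accentuated (verb-like) form to its list entry."""
    return {strip_stress(e.accented): e for e in entries}

entries = [
    Entry("número", "sing", "masc"),
    Entry("páginas", "plur", "fem"),
    Entry("célebre", "sing", "both"),
]
index = build_index(entries)
print(index["paginas"])
```

Storing only the accentuated form keeps the list small, since the second member of each pair is derived on the fly.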
The main algorithm scans the text word by word and uses a technique known from some
style checkers (Ashmanov 1995): instead of checking the correctness of the
current situation in the text, a hypothesis about a possible error at the
current position is formed and then checked. If the hypothesis looks reasonable, a
possible error is reported to the user.
In our case,
either a verb or a non-verb can be used in a specific context, while the
contexts equally suitable for both cases are rather rare. When a doubtful
verbal context is found, a hypothetical noun or adjective is supposed. If the
context is good for it, then this context should be bad for the original verb.
While scanning, each word is matched against the two forms of each list entry, the accentuated
one and its automatically computed non-accentuated counterpart. The
characteristics of the hypothetical noun or adjective, namely, its gender and
number, are retrieved from the same list.
Let the word under consideration be w0,
its immediate linear context be the sequence w-1, w0,
w1, w2, and the variable ŵ be the hypothetical
noun or adjective. Then the work of the algorithm depends on the form in which the
word was found.
· If the word was found in the accentuated form, it is considered to be a noun or adjective, and the suitability of the immediate context is checked as described in the next section; the variable ŵ is set to w0.
· If the word was found in the non-accentuated form, it is considered to be a verb. Since we cannot check the context for a verb, a hypothesis is considered that the true intended word was the accentuated counterpart of the non-accentuated word w0, i.e., a noun or adjective. The variable ŵ is set to this accentuated counterpart, and the context is checked for this hypothetical word. If the context is suitable for it, the hypothesis is accepted and a possible error is reported.
When a possible
error is reported, the user is asked, in an interactive manner, whether the
doubtful word w0 should be
replaced with its counterpart.
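The scanning loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' program: `PAIRS` is a three-word sample of the list, `context_fits` is a hypothetical stand-in that only looks at the preceding word, and the interactive dialog is replaced by a returned list of reports.

```python
import re

# Acute accents only; ñ and ü are not stress marks and stay untouched.
UNACCENT = str.maketrans("áéíóú", "aeiou")

# Tiny illustrative subset of the pair list: accentuated form -> (gender, number).
PAIRS = {
    "artículo": ("masc", "sing"),
    "número": ("masc", "sing"),
    "páginas": ("fem", "plur"),
}
PLAIN = {form.translate(UNACCENT): form for form in PAIRS}

# Hypothetical, heavily reduced context test (assumption): determiners only.
DETS = {
    ("masc", "sing"): {"el", "un", "este", "del", "al"},
    ("fem", "plur"): {"las", "unas", "estas"},
}

def context_fits(prev, gender, number):
    return prev in DETS.get((gender, number), set())

def scan(text):
    """Return (position, found word, suggested replacement) for each suspect word."""
    words = re.findall(r"\w+", text.lower())
    reports = []
    for i, w0 in enumerate(words):
        if w0 in PAIRS:      # accentuated form: assumed noun or adjective
            hypothesis, other = w0, w0.translate(UNACCENT)
        elif w0 in PLAIN:    # plain form: assumed verb, hypothesis is the noun
            hypothesis, other = PLAIN[w0], PLAIN[w0]
        else:
            continue
        prev = words[i - 1] if i > 0 else ""
        fits = context_fits(prev, *PAIRS[hypothesis])
        # Report when the context contradicts the form actually found.
        if fits != (w0 == hypothesis):
            reports.append((i, w0, other))
    return reports

print(scan("Este articulo tiene dos partes"))
```

In a real tool each report would be shown to the user, who decides whether to accept the replacement.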
For the algorithm, a procedure is necessary to check, for the given word ŵ, which is supposed to be a noun or adjective of
already known gender g and number n, whether a specific 4-word immediate
linear context w-1, w1, w2, such that the word order in the text is w-1, ŵ, w1, w2, is suitable for a noun or
adjective with this gender and number. Here ŵ is either the current
word w0 considered by the
main algorithm or its accentuated counterpart, as described above.
Let Preps be the list of all simple (one-word)
prepositions, Preps = {a, de, con, por, sin, en, sobre, para, …}, and Dets(g,n) be the
list of quasi-determinatives that depends on the gender g and number n of w0 according to the following
table:
| | Singular | Plural |
|---|---|---|
| Masculine | un, el, este, ese, aquel, mi, tu, su, al, del, buen, mal, primer, gran | unos, los, estos, esos, aquellos, mis, tus, sus, buenos, malos, primeros, grandes |
| Feminine | una, la, esta, esa, aquella, mi, tu, su, buena, mala, primera, gran | unas, las, estas, esas, aquellas, mis, tus, sus, buenas, malas, primeras, grandes |
Only the case
of singular masculine has some peculiarities connected with the use of short
forms of adjectives and contracted articles al,
del. We should admit, though, that
the presence of the word la ‘the’ in
the table is questionable because it is homonymous with the accusative pronoun la ‘her’. We decided to include it in
the table because of statistical considerations.
Let us use the
notation u ~ v for grammatical agreement of words u and v in gender and number, i.e., for the fact that the first word form
has or could have (if it is ambiguous) the same gender and number as the second
one. The procedure implementing this check is described in the next section.
Note that in all cases, the characteristics of the second word v are already known.
The hypothetical word ŵ is considered to be a
noun or adjective properly used in the given context, and thus the word w0 ≠ ŵ is considered likely
to be an error, if any of the following four conditions is satisfied:
1. w-1 ∈ Dets(g,n) ∪ Preps, or
2. w-1 ~ ŵ, or
3. w1 ~ ŵ, or
4. w1 ∈ {más, mas, menos} and w2 ~ ŵ.
The tests should be carried out in the given order, which helps to cope with
combinations like el número y el género
gramaticales, where the agreement is more difficult to check. In
condition 4, the word mas (without accent) is tried only
if the accent marks are totally lost in the text; what we really mean here is the word más ‘more’ that has lost its accent rather
than the very rare Spanish word mas ‘but’.
Since the gender and number of the hypothetical word ŵ are known, to check
the agreement in conditions 2 to 4 it is enough to check whether the
corresponding word w-1, w1, or w2 is compatible with the hypothesis about its number
and gender.
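The four conditions can be written down almost literally. The sketch below is in Python and makes two assumptions that are not in the paper: the agreement test `~` is passed in as a function, and the default `crude_agrees` is only a placeholder that looks at the last letters of a word. The Preps set contains only the prepositions listed in the text, which leaves the list open.

```python
# Prepositions named in the text; the paper's list continues ("…").
PREPS = {"a", "de", "con", "por", "sin", "en", "sobre", "para"}

# Quasi-determinatives by (gender, number), copied from the table.
DETS = {
    ("masc", "sing"): "un el este ese aquel mi tu su al del buen mal primer gran".split(),
    ("masc", "plur"): "unos los estos esos aquellos mis tus sus buenos malos primeros grandes".split(),
    ("fem", "sing"): "una la esta esa aquella mi tu su buena mala primera gran".split(),
    ("fem", "plur"): "unas las estas esas aquellas mis tus sus buenas malas primeras grandes".split(),
}

def crude_agrees(word, gender, number):
    """Placeholder for the agreement test u ~ v (assumption: final letters only)."""
    ending = {("masc", "sing"): "o", ("fem", "sing"): "a",
              ("masc", "plur"): "os", ("fem", "plur"): "as"}[(gender, number)]
    return len(word) > 2 and word.endswith(ending)

def noun_context(prev, nxt1, nxt2, gender, number, agrees=crude_agrees):
    """True if the context w-1, w1, w2 suits a noun or adjective (gender, number)."""
    if prev in DETS[(gender, number)] or prev in PREPS:   # condition 1
        return True
    if agrees(prev, gender, number):                      # condition 2
        return True
    if agrees(nxt1, gender, number):                      # condition 3
        return True
    # condition 4: comparative word followed by an agreeing word
    return nxt1 in {"más", "mas", "menos"} and agrees(nxt2, gender, number)

print(noun_context("este", "tiene", "", "masc", "sing"))
```

The conditions are evaluated in the order given in the text, so the cheap list lookup is tried before any agreement check.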
For the
algorithm described above, a procedure is necessary to check whether the two
given words agree in gender and number, or, more precisely, since the
characteristics of one of the words are already known, it is enough to
determine if the given word has the given characteristics; let us denote this
fact as w ~ (number n, gender g).
The strict
implementation of this procedure implies availability of a morphological
analyzer of Spanish nouns and adjectives based on a morphological dictionary.
However, it turned out that in most cases nearly the same results might be
achieved through a rougher approximation taking into account only short lists
of final substrings of the word. Namely, the following Spanish endings usually
indicate the following characteristics:
| | Singular | Plural |
|---|---|---|
| Masculine | ‑do, ‑to, ‑ismo, ‑ero, ‑rio, ‑je, ‑te, ‑al, ‑il | ‑os, ‑es |
| Feminine | ‑a, ‑ión, ‑ad, ‑te, ‑al, ‑il | ‑as, ‑es |
The lists of endings intersect for masculine and feminine, both in singular (‑te, ‑al, ‑il) and
plural (‑es); however, just as
with the use of the possessive pronouns mi,
tu, su, this does not create any problems. It is more dangerous that there
exist nouns of masculine gender ending in -a,
such as el problema, el/la deportista, and a few nouns of feminine gender ending in -o, such as la mano, la
foto, la
radio. However, such situations
are rather rare and can be handled by the corresponding lists of exceptions.
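The ending-based approximation with an exception list might look like the following Python sketch. The endings come from the table above; the `EXCEPTIONS` dictionary holds only the five example words named in the text, so a real exception list would be longer. The function name `agrees` is an assumption.

```python
# Endings that usually indicate the given (gender, number), from the table.
ENDINGS = {
    ("masc", "sing"): ("do", "to", "ismo", "ero", "rio", "je", "te", "al", "il"),
    ("masc", "plur"): ("os", "es"),
    ("fem", "sing"): ("a", "ión", "ad", "te", "al", "il"),
    ("fem", "plur"): ("as", "es"),
}

# Exceptions: only the examples mentioned in the text (illustrative, incomplete).
EXCEPTIONS = {
    "problema": {("masc", "sing")},
    "deportista": {("masc", "sing"), ("fem", "sing")},
    "mano": {("fem", "sing")},
    "foto": {("fem", "sing")},
    "radio": {("fem", "sing")},
}

def agrees(word, gender, number):
    """Approximate test w ~ (number, gender) using final substrings only."""
    word = word.lower()
    if word in EXCEPTIONS:
        return (gender, number) in EXCEPTIONS[word]
    # str.endswith accepts a tuple of alternative suffixes.
    return word.endswith(ENDINGS[(gender, number)])

print(agrees("gramatical", "masc", "sing"))
```

Because the ending lists overlap, an ambiguous word such as one ending in ‑es is accepted for both genders, which matches the "has or could have" reading of the agreement relation.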
There are other
error-prone words related to the accent, which are not processed by the general
mechanism described above. Namely, these are the words of the parts of speech
different from verbs, nouns, or adjectives. For each group of such words, a
special case of the algorithm is necessary. Here we describe two heuristics
that work for one pair of words each.
· Tú. If w0 = tu, then w1 must have the characteristic “singular”, which is checked by the condition w1 ~ (masculine, singular) or w1 ~ (feminine, singular). Otherwise, a possible error should be reported.
· Mí. If w0 = mi, then w1 ∉ Preps and w1 must have the characteristic “singular”; otherwise, a possible error should be reported.
Though these groups contain only one pair each, the words in these pairs are much more
frequent in texts than other quasi-homonyms, which justifies the introduction of
these special cases into the algorithm. There are other similar words, such as sólo, él, más, cómo, sí, but their handling is more difficult.
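The two special cases for tu and mi can be sketched as a small Python function. The singular test reuses the singular endings from the table of endings; the function names and the shortened Preps set are assumptions made for the example.

```python
# Prepositions named in the text; the paper's list is longer.
PREPS = {"a", "de", "con", "por", "sin", "en", "sobre", "para"}

# Singular endings (masculine and feminine) from the table of endings.
SINGULAR_ENDINGS = ("do", "to", "ismo", "ero", "rio", "je", "te", "al", "il",
                    "a", "ión", "ad")

def looks_singular(word):
    """w ~ (masculine, singular) or w ~ (feminine, singular), by ending only."""
    return word.endswith(SINGULAR_ENDINGS)

def check_tu_mi(w0, w1):
    """Return the suggested accentuated form, or None if no error is suspected."""
    if w0 == "tu" and not looks_singular(w1):
        return "tú"
    if w0 == "mi" and (w1 in PREPS or not looks_singular(w1)):
        return "mí"
    return None

print(check_tu_mi("mi", "me"))
```

The explicit test against Preps matters in this approximation because prepositions such as para end in ‑a and would otherwise pass the singular test.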
Another type of error is the case of pronouns that require an accent in
interrogative sentences, such as ¿Qué…, ¿Cómo…, ¿Cuándo…, etc.
There is a very small closed
list of such words, and errors of this type can be detected by rather simple
heuristics. For example, after a ¿
sign, words like qué always require an accent. We do not
discuss these heuristics here.
The algorithm was implemented as a complete program consisting of 27 subprograms in Pascal. It
includes a scanner of the input text and a dialog module
for interactive correction of the reported errors in the text.
We conducted
two experiments, one with a small text and full manual analysis of the results,
and another with a large text corpus, with semi-automatic statistical analysis
of the results.
In the first
experiment, a real unprepared text of scientific genre written by a foreigner
and consisting of 9 pages was analyzed[1].
After application of the Microsoft Word spell-checker, as many as 35 such
errors remained undetected in it. The program detected 32 of them, i.e., 91.4%
of the errors of the type under consideration not detected by the commercial
spell-checker. Only one of the missed errors corresponded to quasi-homonyms
from the list: *no es
practica, the correct
form being no es práctica; two other ones were connected with the
word sólo, e.g.: *el
modelo solo dificulta, the correct form being el modelo sólo dificulta.
In the second experiment, we used a large Spanish corpus. Since there are no special Spanish
corpora against which to compare the results produced by our method, we took as our corpus a
collection of unprepared technical texts from Gaceta
UNAM (Mexico) and from
the Internet, and then employed the following procedure:
1. The entire corpus was passed through the Microsoft Word spell-checker to eliminate orthographic and typographic errors. Among the orthographic errors, mainly unknown words were found. When correction variants were suggested, the first one was automatically accepted. Our experiment was then conducted with this corpus.
2. All stress marks throughout the corpus were removed.
3. Once again, the full corpus was passed through the Microsoft Word 97 spell-checker. We disabled the Grammar check option since in many cases it reported false alarms, and the first experiment had shown that this spell-checker rarely detects errors of the type under consideration even with grammar checking (see below). For each reported orthographic error, the first suggested correction was automatically accepted.
4. Finally, the entire corpus was processed by our program and all its suggestions were automatically accepted.
The corpus resulting from pass 4 was not quite identical to the one obtained after
pass 1; thus, some errors of the type under consideration remained undetected.
Namely, with passes 2 to 4 we achieved about 92% error detection,
whereas the Microsoft Word tests alone gave approximately 82%.
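The evaluation procedure (remove all stress marks, restore, compare with the clean corpus) can be sketched in Python as below. This is only an illustration of how such a rate could be computed: the sentence, the `toy_restore` function, and the resulting figure are made up for the example and have nothing to do with the percentages reported above.

```python
# Acute accents only; ñ and ü are left untouched.
UNACCENT = str.maketrans("áéíóú", "aeiou")

def detection_rate(reference, restore):
    """Share of words damaged by accent removal that `restore` brings back."""
    damaged = [w.translate(UNACCENT) for w in reference]   # pass 2
    restored = restore(damaged)                            # passes 3 and 4
    lost = [i for i, w in enumerate(reference) if w != damaged[i]]
    if not lost:
        return 1.0
    recovered = sum(restored[i] == reference[i] for i in lost)
    return recovered / len(lost)

def toy_restore(words):
    """Hypothetical restorer that knows just two words, regardless of context."""
    known = {"articulo": "artículo", "paginas": "páginas"}
    return [known.get(w, w) for w in words]

reference = "este artículo tiene dos páginas válidas".split()
rate = detection_rate(reference, toy_restore)
print(f"{rate:.1%}")
```

The same ratio applied to the first experiment gives 32 / 35, i.e., about 91.4%.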
Most of the remaining errors, detected neither by the Word spell-checker nor by our program,
are related to the más/mas and sólo/solo cases, and some to the qué/que type
discussed in the previous section. Here is an example of a missed error: *consideran totalmente validas, the correct form being válidas. In some cases our program reported false alarms:
for example, it marked the word participe in the phrase cada uno participe en la educación as a possible error, while in this
context it was a true verb form.
There was only one difference found between the performance of Microsoft Word version 6 and
version 97, which includes syntax-based grammar checking. When the article el appears immediately before an error of the type under consideration,
version 97 reports a possible error in this article and proposes to change it
to the pronoun él. If this proposal is accepted, then
with phrases like el articulo a new, induced error is reported: after
changing it to él articulo ‘he I join’, the spell-checker suggests changing it to él articula ‘he joins’, which has a meaning totally different
from the intended el artículo ‘the article’.
Our algorithm
of accent restoration has the following advantages:
· It is hypothesis-driven, i.e., based on an active search for errors rather than on a passive check of the grammaticality of the given text.
· It uses small closed word lists and a very small list of endings.
· Implementation of the algorithm is very simple; it can be implemented independently and used after a regular spell-checker.
An idea similar to the one described here can be used to detect other types of errors
related to agreement in number and gender in Spanish, between adjacent
nouns, adjectives, adjectival pronouns, ordinal numerals, articles, etc.
Appendix. The list of quasi-homonyms
The entries of
the quasi-homonyms dictionary have the following format:
| Word | Gender | Number |
|---|---|---|
| específico | masculine | singular |
| específica | feminine | singular |
| específicas | feminine | plural |
| célebre | both | singular |
| célebres | both | plural |
No difference is made between adjectives and nouns; when one of the forms is a noun while the
others are adjectives, this is not marked since it does not affect the gender
and number attribution. Masculine nouns are given in only one form, since
only this form can be confused with a verb.
In the following list, the characteristics are not given since they can be easily
restored by the reader. When the value of some characteristic is “both”, two
hypotheses should be tried by the algorithm. Word forms with a common stem,
and usually of the same lexeme, are grouped together: específico, -a, -as; célebre,
-es. Here is the complete list:
acídulo, -a, -as
adúltero, -a, -as
ágora, -as
álabe, -es
alígero, -a, -as
ánimo, -a, -as
apócope, -es
ápodo, -a, -as
apóstata, -as
apóstrofe, -es
apóstrofo
árbitro, -a, -as
artículo
auténtico, -a, -as
báscula, -as
beatífico, -a, -as
cálamos
cálculo
cántara, -as
capítulo
cápsula, -as
catálogo
célebre, -es
centrífugo, -a, -as
círculo
cláusula, -as
coágulo
cómputo
cópula, -as
crítico, -a, -as
cronómetro
décimos
decrépito, -a, -as
décuplo, -a, -as
depósito
desánimo
diagnóstico, -a, -as
diálogo
doméstico, -a, -as
dómine, -es
ejército
émbolo
émulo, -a, -as
epílogo
equívoco, -a, -as
específico, -a, -as
espontáneo, -a, -as
estímulo
estípula, -as
estómago
estrépito
fábrica, -as
filósofo, -a, -as
fórmula, -as
gárrulo, -a, -as
género
gráfico, -a, -as
gránulo
hábito
hidrógeno
homólogo, -a, -as
idólatra, -as
ilegítimo, -a, -as
ímprobo, -a, -as
incómodo, -a, -as
íncubo
íntegro, -a, -as
interlínea, -as
intérprete, -es
íntimo, -a, -as
inválido, -a, -as
júbilo
lágrima, -as
lámina, -as
lápida, -as
lástima, -as
légamos
legítimo, -a, -as
letífico
líbero
lícito, -a, -as
límite, -es
línea, -as
líquido, -a, -as
lúbrico, -a, -as
mácula, -as
magnífico, -a, -as
máquina, -as
matrícula, -as
módulo
monólogo
náufrago, -a, -as
nómina, -as
núcleo
número
ópera, -as
óptimo, -a, -as
órbita, -as
óvalo
óvulo
óxido
oxígeno
pacífico, -a, -as
página, -as
pálpito
páramos
partícipe, -es
pátina, -as
pérdida, -as
petróleo
plática, -as
práctico, -a, -as
prédica, -as
préstamos
pródigo, -a, -as
prólogo
pronóstico
prórroga, -as
próspero, -a, -as
público, -a, -as
púrpura, -as
purpúreo, -a, -as
recíproco, -a, -as
réplica, -as
réprobo, -a, -as
reválida, -as
rótula, -as
rúbrica, -as
simultáneo, -a, -as
síncopa, -as
síncope, -es
síndico
solícito, -a, -as
sólido, -a, -as
subtítulo
súplica, -as
tálamos
témpano
témpera, -as
término
térreo, -a, -as
título
tráfago
tráfico
trámite, -es
tránsito
trépano
triángulo
úlcera, -as
último, -a, -as
válido, -a, -as
vínculo
vómito
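The grouped notation used in the list above can be expanded into full word forms mechanically. The Python sketch below assumes that each suffix replaces the final vowel of the headword, which holds for the entries shown here; the function name `expand` is hypothetical.

```python
def expand(entry):
    """Expand a grouped entry such as 'específico, -a, -as' into word forms."""
    head, *suffixes = [part.strip() for part in entry.split(",")]
    forms = [head]
    for suffix in suffixes:
        # Drop the final vowel of the headword and append the suffix;
        # both the ASCII hyphen and the non-breaking hyphen are accepted.
        forms.append(head[:-1] + suffix.lstrip("-\u2011"))
    return forms

print(expand("específico, -a, -as"))
print(expand("célebre, -es"))
```

Gender and number are not derived here; as stated above, they are assigned to each form separately.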
ACKNOWLEDGMENTS
This work was done under partial support of CONACyT grant 26424-A,
REDII-CONACyT, DEPI-IPN, and SNI, Mexico.
References
Agirre, E. et al. 1998. Towards a single proposal in spelling correction. In Proceedings of the 36th ACL Meeting and 17th COLING Conference. Montreal, Canada.
Ashmanov, I. 1995. Grammar and style corrector for Russian texts (in Russian). International Workshop on Computational Linguistics and its Applications, Dialogue-95, Kazan, Russia.
Cutting, D., et al. 1992. A practical part-of-speech tagger. Proc. of the 3rd Conference on Applied Natural Language Processing. Trento, Italy, pp. 133-140.
Daciuk, J. 1998. Incremental Construction of Finite-State Automata and Transducers, and their Use in the Natural Language Processing. Ph.D. dissertation, Technical University of Gdańsk, Poland; http: //www.pg.gda.pl/ elka/ csapp/ jd/ fsa.html.
Diccionario de la lengua española. 1992. Edition 21. Real Academia Española, Madrid.
Diccionario del español contemporáneo. 1997. Grupo ANAYA, http: //www. anaya. es.
Elmi, M. A. and Evens, M. 1998. Spelling Correction Using Context. In Proceedings of the 36th ACL Meeting and 17th COLING Conference. Montreal, Canada.
Golding, A. and Schabes, Y. 1996. Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th ACL Meeting, Santa Cruz, CA.
Yarowsky, D. 1994. Decision Lists for Lexical Ambiguity Resolution: Application to Accent Restoration in Spanish and French. Proc. 32nd Annual Meeting of ACL. New Mexico, USA, pp. 88-95.
[1] For the experiment, the initial draft
of the following text was used: Bolshakov, I. A. El modelo morfológico
formal para sustantivos y adjetivos en el español. Computación y Sistemas, No. 1, 1996, p. 27-35.