In this article, I want to present the minor subject I had at
university: computational linguistics (CL). Many UniLangers are
interested in CL, since it combines languages with information
technology and mathematical methods.
Because of this unusual combination, I will examine CL from three
sides: for language freaks, linguists and computer programmers. A
general approach written by a CL professor can be found at
http://www.coli.uni-sb.de/~hansu/what_is_cl.html
A brochure describing the study programs at my university says:
Computational Linguistics examines human language. It uses
formal models that can be implemented on the computer. In this way, it
gains knowledge about the phonetic, syntactic and semantic structure
of languages and about the way humans understand, produce and learn
language.
Aims of CL
Computational linguistics is used to create computer programs for
natural language processing, natural language generation, machine
translation and information retrieval.
- A natural language processing (NLP) system tries to
understand spoken or written language. There are car navigation
systems and word processors that react to spoken input. But the
technology is still far from what you can see in Star Trek.
- The other direction is natural language generation (NLG),
which deals with producing an acceptable text (or even speech) in a
natural language. NLP and NLG are often combined in dialog systems,
but they are two distinct modules that work in quite different
ways.
- Machine translation is more than looking up words in a
dictionary. It is very difficult to translate a text you can’t
understand. Therefore the best machine translation systems are those
that use both NLP and NLG, but computer scientists have also developed
simpler methods that work quite well in many cases.
- Information retrieval tries to find relevant information in
texts. With the incredible size of the World Wide Web today, it is
impossible for anyone to read every single website when looking
for information. CL can help to improve search engines by
making them ‘understand’ the texts on the websites.
CL for Language Freaks
Forget about learning foreign languages
If you are a language freak, you want to learn as many languages as
possible, because you like communicating with people from all over the
world. You don’t need that in CL. Most of the examples you will study
will be in English and/or the language of instruction. Sometimes you
will see sentences in exotic languages, but you needn’t understand
them.
Computational linguists use exotic languages to show properties
unusual in well-known languages like English. In a course on
grammar theory, I once had an example from Warlpiri, a
Pama-Nyungan language spoken in Australia’s Northern Territory:
Kurdu-jarra-rlu | wita-jarra-rlu | ka-pala   | maliki  | wajilipi-nyi.
child-DUAL-ERG  | small-DUAL-ERG | PRES-DUAL | dog-ABS | chase-NONPAST
"The two small children are chasing the dog."
You will always get those literal translations annotated with
grammatical functions; no need to actually learn Warlpiri…
Fall in love with grammar
What do you love in a language? Its sound, the beauty of the writing
system, proverbs, metaphors, poetry? That is mostly irrelevant to
CL. You will deal with language on the following levels:
- Lexical semantics deals with word meanings, which can
become difficult the deeper you get into the topic.
- Morphology deals with the way words change when they are
inflected. There is little variation in English (speak, speaks, spoke,
spoken), none in Chinese and lots of different forms in French
(parler, parlez, parlerions, parlées, parlant,
parlèrent, …).
- Syntax deals with the way words combine to form phrases and
sentences. It’s about finding the subject of the sentence, the
predicate, the temporal adverbial, the relative clause and so on.
- Semantics deals with the meaning of sentences. Once the
computer knows the grammatical structure of an utterance like
"John loves his cousin" and the meaning of the individual
words, it can infer who is loved by John.
- Pragmatics deals with the meaning of a sentence in a
context. If someone says ‘I am cold’, is s/he informing the
listener about an unpleasant room temperature, worrying that s/he
might get ill, or asking the listener to close a window?
If you want to know how CL deals with those levels, please read the
following section, which is written from the linguist’s point of view.
CL for Linguists
A practical approach with computers
Computational linguistics combines several linguistic disciplines in a
practical way. CL is not idle talk or philosophical discussion. Its
aim is to make computers process or generate language, so you should
know a little about computer programming.
Computers are stupid. They can only perform a task if it has been
explained in sufficient detail (either by you or by the programmer).
If you ask people how to say ‘apple’ in Finnish, they will use either
their own knowledge or a dictionary to tell you the answer. A computer
would have to execute the following program:
    if English-Finnish dictionary database is available then
        return dictionary_entry(database, English-Finnish, "apple")
    else
        check which dictionaries are on the bookshelf
        if book_found(bookshelf, English-Finnish dictionary) then
            search_entry(that book, "apple")
            if entry found then
                return read_entry(that book, found entry)
            else
                return error_message("Word not found in dictionary")
            end if
        else
            return error_message("Dictionary not found on bookshelf")
        end if
    end if
And this is only a simplified sketch; real computer programs are much
more complicated!
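For comparison, here is roughly how the database branch of that sketch might look in Python. It is only a minimal illustration, not a real lexicon: the `DICTIONARIES` table and the `translate` function are names made up for this example, and the single entry stands in for a full dictionary.

```python
# Toy stand-in for a dictionary database
# (assumption: one tiny English-Finnish lexicon with a single entry).
DICTIONARIES = {
    ("English", "Finnish"): {"apple": "omena"},
}


def translate(word, source, target, dictionaries=DICTIONARIES):
    """Look up `word` in the source-to-target dictionary, if there is one."""
    dictionary = dictionaries.get((source, target))
    if dictionary is None:
        return "Dictionary not found"
    # dict.get returns the fallback message when the word has no entry.
    return dictionary.get(word, "Word not found in dictionary")


print(translate("apple", "English", "Finnish"))
print(translate("pear", "English", "Finnish"))
print(translate("apple", "English", "Swedish"))
```

Even in this tiny version, every case a human would handle without thinking (no dictionary, no entry) has to be spelled out explicitly.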
The language of mathematics
Computational linguists use strict formalisms like mathematicians and
computer scientists. The example below shows a formalization of the
sentence ‘I want to put a book on the table’.
For computational linguistics, it is important to express
everything related to a language (phonetics, phonology,
morphology, syntax, semantics, pragmatics) in such a formal way. No
philosophical reasoning about metaphors, just plain logical
thinking. Why? Because CL is about teaching language to computers. And
the only thing computers understand is formal logic.
CL for Computer Programmers
As a programmer and computer freak, you won’t have much of a problem
with the points mentioned in the two previous sections. But wait a
moment, CL still stands for computational linguistics…
CL is about human languages
There are huge differences between programming languages and real
human languages. How would you react if someone told you
for (i = 0; i < life.length; i++) if (life[i].location == NYC) return 1; return 0;
instead of plain English: "Have you ever been to New York"?
Of course you use programming languages, or at least stuff like
PROLOG or LISP. But that’s not what you’ll implement. Parsing natural
languages is much more difficult:
Have | you | ever | been | to | New York
have (present, not (3rd sg)) | pronoun (2nd sg/pl, subj) | ever | be (past participle) | preposition (to) | geographical entity (state, ny, us) or geographical entity (city, nyc, us_ny)
verb or auxiliary | personal pronoun | adverb | past participle | preposition | noun
auxiliary | noun_phrase | adverb | main verb | prepositional phrase
Now your parser should know that the order Aux-NP-V indicates a
question in English (with that noun phrase as its subject), and that
the present tense of the auxiliary ‘to have’ together with
the past participle of a verb puts that verb into present perfect.
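The two rules just described can be sketched in a few lines of Python. This is a toy under strong assumptions, not a real parser: the tokens are annotated by hand instead of being computed, and the `Token` type, the role labels and the `analyse` function are invented for this illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    role: str  # syntactic role, e.g. "auxiliary" or "noun_phrase"
    form: str  # morphological form, e.g. "present" or "past_participle"


# Hand-annotated input; a real system would have to compute all of this.
SENTENCE = [
    Token("Have", "have", "auxiliary", "present"),
    Token("you", "you", "noun_phrase", ""),
    Token("ever", "ever", "adverb", ""),
    Token("been", "be", "main_verb", "past_participle"),
    Token("to New York", "to", "prepositional_phrase", ""),
]


def analyse(tokens):
    """Apply the two rules: Aux-NP-V means question, have + participle means perfect."""
    # Adverbs do not disturb the Aux-NP-V pattern, so drop them first.
    core = [t for t in tokens if t.role != "adverb"]
    roles = [t.role for t in core]
    is_question = roles[:3] == ["auxiliary", "noun_phrase", "main_verb"]
    subject = core[1].word if is_question else None

    aux = next((t for t in core if t.role == "auxiliary"), None)
    verb = next((t for t in core if t.role == "main_verb"), None)
    present_perfect = (
        aux is not None
        and verb is not None
        and aux.lemma == "have"
        and aux.form == "present"
        and verb.form == "past_participle"
    )
    return {
        "question": is_question,
        "subject": subject,
        "tense": "present perfect" if present_perfect else "unknown",
    }


result = analyse(SENTENCE)
print(result)
```

Note how much work is hidden in the hand-written annotations: deciding that ‘Have’ is an auxiliary here and that ‘New York’ belongs to the prepositional phrase is exactly the hard part.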
CL is a branch of linguistics
A program always has a purpose, even if it’s fun writing it. When you
write natural language processing or generation programs, you need to
know how natural languages work.
Do you know what an adjunct is? Ever heard of equi and raising
phenomena? Are you able to recognize the finite verb in a long
sentence? You should be. In my CL studies, I was confronted with tons
of words like that, and all that was really explained by the teachers
was unification, which is not much more than a simple intersection of
two sets.
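To give an idea of what unification does, here is a minimal Python sketch for flat feature structures written as dictionaries. It is an assumption-laden simplification: real grammar formalisms use nested structures, and the `unify` function and the feature names are made up for this example. The merged structure carries the information of both inputs; the set of words it can describe is the intersection of what each input describes.

```python
def unify(a, b):
    """Unify two flat feature structures given as dicts.

    Returns the merged dict, or None if a shared feature has conflicting values.
    """
    result = dict(a)
    for feature, value in b.items():
        if feature in result and result[feature] != value:
            return None  # clash: the two descriptions are incompatible
        result[feature] = value
    return result


she = {"person": 3, "number": "sg", "gender": "f"}
you = {"person": 2}
has = {"person": 3, "number": "sg"}  # what the verb form "has" demands of its subject

print(unify(she, has))  # compatible: "she has"
print(unify(you, has))  # clash on "person": "you has" is rejected
```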
The main problem of CL is the complexity of natural language,
especially ambiguities. A word can have several meanings (the
bank of a river vs. the Bank of England), and so can a complete
sentence ("John follows a gangster in a sports
car" – Who drives the sports car?). But even if there is no
ambiguity, there are difficulties. The sentence ‘I am cold’
can be an expression about temperature, an indication of a disease or
a request to close the window.
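A small Python sketch shows how quickly word-level ambiguity alone multiplies the possible readings. The sense inventory below is invented for this illustration (including the second sense of ‘follows’), and the `readings` function is a hypothetical helper; structural ambiguities such as the sports car example would multiply the count further.

```python
from itertools import product

# Toy sense inventory; the sense labels are made up for this example.
SENSES = {
    "bank": ["river_bank", "financial_institution"],
    "follows": ["goes_behind", "understands"],
}


def readings(words):
    """Return every combination of word senses for a list of words."""
    # Words without an entry are treated as unambiguous.
    options = [SENSES.get(word, [word]) for word in words]
    return [tuple(combo) for combo in product(*options)]


all_readings = readings(["john", "follows", "the", "bank"])
print(len(all_readings))
for reading in all_readings:
    print(reading)
```

Two ambiguous words with two senses each already give four candidate readings, and the program has no way of knowing which one is meant unless it looks at the context.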
CL for You
Are you interested in expressing natural language in a formal way?
Interested in programming software that can process and generate
natural language? Why don’t you start computational linguistics? :-)
CL methods can be used in various applications such as car navigation
systems, spell checkers for word processors or better search engines
on the Web.
CL is a rather new science that still has a lot to discover. Although
computers are getting faster and faster, we’re far from universal
translator devices like those in Star Trek. Still, we can do better than
Altavista’s Babelfish, as the example of Verbmobil shows.
To learn more, check out the Association for Computational Linguistics
and The Linguist List, or ask UniLangers who have experience with CL. :-)
By the way, I was told the best universities to study
computational linguistics are the following three: