Natural language interfaces to databases

June 28, 2017 | Author: Ann Copestake | Category: Cognitive Science, Knowledge Engineering, Natural language interface

NATURAL LANGUAGE INTERFACES TO DATABASES

Arati K. Deshpande
Lecturer, Department of Information Technology
JSPM's Jayawantrao Sawant College of Engg., Pune, Maharashtra
e-mail: [email protected]

Prof. P.R. Devale
Assistant Professor, Department of Information Technology
Bharati Vidyapeeth University College of Engg., Pune, Maharashtra
e-mail: [email protected]

Abstract- Natural Language Processing is becoming one of the most active
areas in Human Computer Interaction. An important issue in database
management is providing a high-level interface for non-technical users.
Database NLP may be one of the most important applications of Natural
Language Processing. Asking questions of a database in natural language is
a very convenient and easy method of data access, especially for casual
users who do not understand complicated database query languages such as
SQL. This paper gives an introduction to natural language interfaces to
databases (NLIDBs). Some advantages and disadvantages of NLIDBs are
discussed. An introduction to some of the linguistic problems NLIDBs have
to confront follows, for the benefit of readers less familiar with
computational linguistics. A brief overview of earlier NLIDB systems is
also given.

Keywords- Natural Language Processing (NLP), NLIDB, Structured Query
Language (SQL)

I. Introduction
There has been much work on Natural Language Processing (NLP) recently,
but the area has been around for a relatively long time in the computing
world. The main aim of NLP research is to create a better interface to the
computer. Spoken language is the most natural interface available for
humans to use, but computers are still unable to come close to the rich
communication humans can achieve with each other. Database NLP may be one
of the most important successes in NLP since it began. Asking questions of
a database in natural language is a very convenient and easy method of
data access, especially for casual users who do not understand complicated
database query languages such as SQL. Databases usually provide small
enough domains that ambiguity problems in natural language can be resolved
successfully.
In this paper we discuss what a Natural Language Interface to a Database
(NLIDB) is, the problems that arise while designing an NLIDB system, the
advantages and disadvantages of NLIDBs, and some earlier natural language
interfaces to databases.
Natural language interfaces to databases (NLIDBs) are systems that
translate a natural language sentence into a database query
(Androutsopoulos et al., 1995). NLIDB can be considered a classical
problem in the field of natural language processing [8]. Although the
earliest research began in the late sixties [1], NLIDB remains an open
research problem. Several NLIDB systems have been built for commercial
use; nevertheless, NLIDB systems are not widely used and are not a
standard option for interfacing to a database. This lack of acceptance is
mainly due to the large number of deficiencies that NLIDB systems still
have in understanding natural language. A complete NLIDB system would
benefit us in many ways. The work that now requires a database expert
could be delegated to the NLIDB system; thus, anyone would be able to
gather the information he or she wants from a database. Additionally, it
may change our perception of the information in a database.
Traditionally, people are used to working with a form; their expectations
depend heavily on the capabilities of the form. An NLIDB makes the entire
approach more flexible and therefore maximizes the use of a database.
There are many applications that can take advantage of an NLIDB. In PDA
and cell phone environments, the display screen is not as wide as that of
a desktop computer or a laptop. Filling in a form that has many fields can
be tedious: one may have to navigate through the screen, scroll, look up
the scroll box values, etc. With an NLIDB, the only work that needs to be
done is to type the question, much like writing an SMS (Short Message
Service) message.
As another example, consider a travel agent who may have many customers
from various backgrounds; their needs may or may not be the same. A client
may ask, "What is the most popular vacation destination in the US?" This
type of question, along with the other most frequent questions, can be
answered by building a non-NLIDB system. However, many questions are not
in the most frequent category, such as, "Can you recommend a place that is
not too crowded but also not too quiet?" Of course, we can extend the
system to cover more questions, but for a large database it may not be
possible to build such a system. Another problem is that as the system
covers more questions, its complexity increases; at a certain level the
system may no longer be user-friendly. In addition, after the first
question, a client may ask follow-up questions such as "What are the good
places to visit there?", "How many restaurants are there?", "Is it full
during the weekend?", "Do we have to book in advance?", and many others.
Currently a travel agent handles this situation by providing operators to
help with the questions. Alternatively, a complete NLIDB system could
answer such questions directly.

II. Linguistic Problems
"Understanding and communicating in natural language is one of the
defining problems of AI" (Mooney, 2006). In order to fully understand a
natural language, a great deal of knowledge is required, covering
morphology, syntax, semantics, pragmatics, and discourse. Because of the
difficulties in understanding a natural language, "Ideally an AI
(Artificial Intelligence) system would be able to learn language like a
human child" (Mooney, 2006). This section discusses specific problems
related to the NLIDB domain. The problems described below are not
exclusive to NLIDBs, as they also appear in other fields of NLP, nor are
they a complete list of all the problems in NLIDBs. They are the problems
encountered while building an NLIDB system.
A. Ambiguity
The most common problem in the area of NLP is ambiguity. Inputs are
considered ambiguous if there are multiple alternative structures that can
be built for them [7]. There are many types of ambiguity, such as
part-of-speech ambiguity, word sense ambiguity, syntactic ambiguity, and
many others. While a human can often immediately understand the correct
meaning of ambiguous terms in a sentence, a computer system needs an
explicit method to handle these terms. Consider the example: "Through
which states does the Mississippi traverse?" For a computer, the term
"Mississippi" here is ambiguous since it can be the name of a state or a
river. In contrast, a human will immediately know that "Mississippi" here
refers to the river because a state cannot "traverse". More severe cases
of ambiguity also occur, in which even a human cannot determine the
correct meaning because the context itself is ambiguous. Consider the
example: "What is the population of New York?" The term "New York" here
could be interpreted either as a city or as a state. Unless the questioner
specifies the intended meaning, both interpretations are correct.
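The two cases above can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the lexicon, the table of
selectional constraints, and the function names `candidate_types` and
`resolve` are hypothetical and do not come from any of the systems
discussed in this paper.

```python
# Hypothetical lexicon: each term maps to every entity type it may denote.
LEXICON = {
    "mississippi": {"state", "river"},
    "new york": {"city", "state"},
}

# Hypothetical selectional constraints: the entity types each relation accepts.
RELATION_ACCEPTS = {
    "traverse": {"river"},            # only rivers traverse states
    "population": {"city", "state"},  # cities and states both have populations
}


def candidate_types(term: str, relation: str) -> set:
    """Return the readings of `term` that are compatible with `relation`."""
    readings = LEXICON.get(term.lower(), set())
    return readings & RELATION_ACCEPTS.get(relation, set())


def resolve(term: str, relation: str) -> str:
    """Pick the single compatible reading, or report why that is impossible."""
    readings = candidate_types(term, relation)
    if len(readings) == 1:
        return next(iter(readings))
    if not readings:
        return "unknown"
    return "ambiguous: " + " or ".join(sorted(readings))


print(resolve("Mississippi", "traverse"))   # the verb rules out the state reading
print(resolve("New York", "population"))    # both readings survive
```

In the first call the constraint on "traverse" leaves a single reading; in
the second both readings remain, so a system built this way would have to
ask the user or return both answers.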
B. Nominal Compound Problem
"A noun phrase can be viewed as revolving around the central noun"
(Jurafsky et al., 2000). In English, a noun phrase can contain both
pre-nominal and post-nominal modifiers, and thus the meaning of a noun
phrase can sometimes be hard to predict. For example, the noun phrase "the
states bordering Texas" can easily be interpreted by following the words
sequentially. However, consider the noun phrase "major river". A "major
river" may be a river that traverses at least several states, a river of a
certain length or width, or something else. Noun phrases of this kind are
therefore difficult to interpret.
C. Grammatical Correctness
In daily life, even though we often say something that is grammatically
incorrect, other people may still understand what we are trying to say.
Consider the example: "states bordering Iowa". In English, a sentence
should contain at least a subject and a predicate; thus, the above
fragment is not a grammatical sentence. However, it can still be
translated into a correct SQL query. Other examples of incorrect grammar
are sentences where the subject and verb do not agree, incorrect articles,
capitalization errors, punctuation errors, etc.
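As a minimal sketch of how a fragment such as "states bordering Iowa" can
still yield a valid query, the following Python example uses the standard
sqlite3 and re modules. The `border` table, its rows, the regular
expression and the function `states_bordering` are assumptions made for
illustration; real systems use far richer grammars.

```python
import re
import sqlite3

# Illustrative data only: a hypothetical table of (state, neighbor) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE border (state TEXT, neighbor TEXT)")
conn.executemany(
    "INSERT INTO border VALUES (?, ?)",
    [("Minnesota", "Iowa"), ("Missouri", "Iowa"), ("Nebraska", "Iowa"),
     ("Texas", "Oklahoma")],
)

# Accepts full questions and bare fragments alike: the leading question
# word and the trailing question mark are both optional.
PATTERN = re.compile(
    r"^(?:(?:which|what)\s+)?states\s+(?:bordering|border)\s+"
    r"(?P<name>[A-Za-z ]+?)\??$",
    re.IGNORECASE,
)


def states_bordering(question):
    """Translate a (possibly ungrammatical) request into a parameterized query."""
    match = PATTERN.match(question.strip())
    if match is None:
        raise ValueError(f"unsupported question: {question!r}")
    rows = conn.execute(
        "SELECT state FROM border WHERE neighbor = ? COLLATE NOCASE "
        "ORDER BY state",
        (match.group("name").strip(),),
    )
    return [state for (state,) in rows]


print(states_bordering("states bordering Iowa"))
print(states_bordering("Which states border Iowa?"))
```

Both inputs map to the same SQL statement, which is the sense in which the
fragment is "translatable" even though it is not a complete sentence.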
D. Conjunction and Disjunction
In logic, the meaning of conjunction (denoted by AND) is clear: the
output is true only if both inputs are true, while disjunction (denoted by
OR) means the output is true if at least one of the inputs is true. This
rule does not always apply in natural language. Consider the example:
"Name all the cities in Texas and Oklahoma." The term "and" here does not
denote a conjunction, because a city can be located in only one state.
Instead, it denotes a disjunction, where every city located in Texas or in
Oklahoma should be listed. A conjunction in English is a part of speech
that connects words, phrases, or clauses; it can consist of a single word
("and", "or", "nor", "yet", "while", etc.) or multiple words ("either ...
or", "not only ... but also", "both ... and", etc.). The meaning may range
from stating that both inputs are true, to stating that both inputs are
false, to a contradiction in which one of the inputs is true and the other
is false. In order to interpret a conjunction or a disjunction correctly,
more extensive knowledge about the sentence structure is required.
Moreover, some conjunctions or disjunctions cannot be translated into a
SQL query without using a sub-query.
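The Texas and Oklahoma example can be checked directly against a small
database. The sketch below uses Python's standard sqlite3 module; the
`city` table, its rows and the helper `run` are hypothetical and serve
only to contrast the literal reading of "and" with the intended one.

```python
import sqlite3

# Illustrative data only: a hypothetical table of cities and their states.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("Austin", "Texas"), ("Dallas", "Texas"), ("Tulsa", "Oklahoma"),
     ("Denver", "Colorado")],
)

# Literal reading: English "and" mapped to logical AND.
LITERAL_AND = (
    "SELECT name FROM city "
    "WHERE state = 'Texas' AND state = 'Oklahoma' ORDER BY name"
)

# Intended reading: English "and" mapped to logical OR.
INTENDED_OR = (
    "SELECT name FROM city "
    "WHERE state = 'Texas' OR state = 'Oklahoma' ORDER BY name"
)


def run(query):
    """Execute a query and return the city names as a list."""
    return [name for (name,) in conn.execute(query)]


print(run(LITERAL_AND))   # no row can satisfy both conditions
print(run(INTENDED_OR))   # cities in either state
```

The literal translation returns nothing because no single row has two
different values in the `state` column, while the disjunctive translation
lists the cities of both states.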

III. Advantages and Disadvantages of NLIDB
A. Advantages of NLIDB
No learning required
The main problem for most people who attempt to retrieve information from
a database is that they have to learn a computer language, which can
sometimes be difficult. On the other hand, they have been exposed to a
natural language since an early age and have used it in daily
communication; therefore, we can say that a natural language is already
mastered by the user.
Simple, easy to use
Consider a database with a query language or a form designed to display
the query. While an NLIDB system only requires a single input, a
form-based interface may contain multiple inputs (fields, scroll boxes,
combo boxes, radio buttons, etc.) depending on the capability of the form.
In the case of a query language, a question may need to be expressed using
multiple statements which contain one or more sub-queries with some join
operations as the connector.
Fault tolerance
Most NLIDB systems tolerate minor grammatical errors. In a conventional
computer system, by contrast, the lexicon must usually be exactly as
defined and the syntax must follow the rules strictly; any error causes
the input to be rejected automatically. Most conventional systems also
provide no support for incomplete sentences.
B. Disadvantages of NLIDB
Linguistic coverage is not obvious
Currently all NLIDB systems can handle only some subset of a natural
language, and it is not easy to define this subset. Some NLIDB systems
cannot even answer certain questions that belong to their own subset. This
is not the case in a formal language: its coverage is obvious, and any
statement that follows the given rules is guaranteed to give the
corresponding answer.
Linguistic vs. conceptual failures
When an NLIDB system fails, it often does not provide any explanation of
what caused the failure. Some users may try to rephrase the question or
just leave the question unanswered. Most of the time, it is up to the
users to determine what caused the error.
False expectations
People can be misled by an NLIDB system's ability to process a natural
language: they may assume that the system is intelligent. Therefore,
rather than asking precise questions of a database, they may be tempted to
ask questions that involve complex ideas, judgments, reasoning
capabilities, etc., for which an NLIDB system cannot be relied upon.

IV. Earlier Systems
A. Natural Language Interface for Structured Data:
Using natural language to communicate between a database system and its
human users has become very important since database systems have become
widespread. To make full use of a database system, it should be
accessible to non-expert users. Some of these systems are discussed below.
LUNAR SYSTEM: LUNAR [5] is a system that answers questions about rock
samples brought back from the moon, from which it takes its name. The
system was introduced in 1971. LUNAR uses two databases to accomplish its
function: one for chemical analyses and the other for literature
references. The program used an Augmented Transition Network (ATN) parser
and Woods' Procedural Semantics. Its performance was quite impressive: it
managed to handle 78% of requests without error, a figure that rose to 90%
when dictionary errors were corrected. Nevertheless, a scientist who used
it to extract information for everyday work would soon have found that he
wanted to make requests beyond the linguistic ability of the system. ATN
parsers are useful because they are very efficient, even for large
grammars; however, they do not handle ungrammatical sentences well and
they are not very flexible.
LIFER/LADDER: LIFER/LADDER was one of the notable database NLP systems. It
is a natural language interface to a database of information about US Navy
ships. This system, as described in a paper by Hendrix (1978), used a
semantic grammar to parse questions and query a distributed database.
Questions are answered by parsing the input and mapping the parse tree to
a database query. LADDER is based on a three-layered architecture. The
first component is Informal Natural Language Access to Navy Data (INLAND),
which accepts questions in natural language and produces a query to the
database. The queries from INLAND are directed to Intelligent Data Access
(IDA), the second component of LADDER. According to [6], the INLAND
component builds a fragment of a query to IDA for each lower-level
syntactic unit in the English input, and these fragments are then combined
as higher-level syntactic units are recognized. At the sentence level, the
combined fragments are sent as a command to IDA. IDA composes an answer
that is relevant to the user's original query, in addition to planning the
correct sequence of file queries. The third component of LADDER is the
File Access Manager (FAM). The task of FAM is to find the location of the
generic files and manage access to them in the distributed database.
LADDER was implemented in LISP. At the time of its creation, LADDER was
able to process a database equivalent to a relational database with 14
tables and 100 attributes.
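The division of labour between the three layers can be sketched as three
small functions. This is not the LADDER implementation (which was written
in LISP): the question pattern, the `SITES` store, the ship names and
lengths, and the function names `inland`, `ida`, `fam` and `ladder` are
all hypothetical and chosen only to mirror the roles described above.

```python
import re

# Hypothetical distributed store: site -> generic file -> records.
# All names and values are illustrative, not real Navy data.
SITES = {
    "site_a": {"ships": [{"name": "Alpha", "length_ft": 500},
                         {"name": "Bravo", "length_ft": 820}]},
    "site_b": {"ports": [{"name": "Harbor One", "country": "US"}]},
}

QUESTION = re.compile(
    r"^what is the length of the (?P<ship>\w+)\??$", re.IGNORECASE
)


def inland(question):
    """INLAND role: translate an English question into a query for IDA."""
    match = QUESTION.match(question.strip())
    if match is None:
        raise ValueError(f"unsupported question: {question!r}")
    return {"file": "ships", "field": "length_ft", "name": match.group("ship")}


def fam(file_name):
    """FAM role: locate a generic file somewhere in the distributed store."""
    for files in SITES.values():
        if file_name in files:
            return files[file_name]
    raise LookupError(f"no site holds file {file_name!r}")


def ida(query):
    """IDA role: plan the file access and compose the answer."""
    records = fam(query["file"])
    wanted = query["name"].lower()
    return [r[query["field"]] for r in records if r["name"].lower() == wanted]


def ladder(question):
    """Run the three layers in sequence."""
    return ida(inland(question))


print(ladder("What is the length of the Bravo?"))
```

The point of the layering is that only `fam` knows where a file lives, so
the translation layer never has to mention sites at all.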
CHAT-80: CHAT-80 [7] is one of the most referenced NLP systems of the
eighties. The system was implemented in Prolog. CHAT-80 was an impressive,
efficient and sophisticated system. Its database consists of facts (i.e.
oceans, major seas, major rivers and major cities) about 150 of the
world's countries and a small English vocabulary that is sufficient for
querying the database. CHAT-80 processes an English question in three
stages, as depicted in Figure-1.

[Figure-1. CHAT-80 Processing System: English -> (Translation) -> Logic
-> (Planning) -> Prolog -> (Execution) -> Answer]

The system translates the English question into a logical form in
which:
1. Words are represented by logical constants.
2. Verbs, nouns, and adjectives with their associated prepositions are
represented by predicates. The predicates can have one or more arguments.
3. Complex phrases or sentences are represented by conjunctions of
predicates.
The translation is carried out by three serial and complementary
functions: parsing, interpretation and scoping. The parsing module
determines the grammatical structure of a sentence, while interpretation
and scoping consist of various translation rules expressed directly as
Prolog clauses. The basic method followed by CHAT-80 is to attach some
extra control information to the logical form of a query in order to make
it an efficient Prolog program that can be executed directly to produce
the answer. According to [7], the generated control information takes two
forms:
1. An ordering of the predications of a query, which determines the order
in which Prolog will attempt to satisfy them.
2. A separation of the overall program into a number of independent
sub-problems, which limits the amount of backtracking performed by Prolog.
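The effect of the first kind of control information, the ordering of
predications, can be illustrated with a small sketch. It is written in
Python rather than Prolog and is not Warren and Pereira's planning
algorithm: the fact base, the helpers `unify`, `solve` and `plan`, and the
cost estimate (counting the facts that match a goal) are assumptions made
for illustration only.

```python
# Hypothetical fact base: predicate name -> set of argument tuples.
FACTS = {
    "country": {("france",), ("germany",), ("italy",), ("portugal",), ("spain",)},
    "borders": {
        ("france", "spain"), ("portugal", "spain"), ("france", "italy"),
        ("france", "germany"), ("spain", "france"), ("spain", "portugal"),
        ("italy", "france"), ("germany", "france"),
    },
}


def is_variable(term):
    """Variables start with an upper-case letter, as in Prolog."""
    return term[:1].isupper()


def unify(args, fact, binding):
    """Extend `binding` so that `args` matches `fact`, or return None."""
    extended = dict(binding)
    for arg, value in zip(args, fact):
        if is_variable(arg):
            if extended.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return extended


def solve(goals, binding, stats):
    """Satisfy the goals left to right with backtracking, counting facts tried."""
    if not goals:
        yield binding
        return
    (predicate, args), rest = goals[0], goals[1:]
    for fact in sorted(FACTS[predicate]):
        stats["tried"] += 1
        extended = unify(args, fact, binding)
        if extended is not None:
            yield from solve(rest, extended, stats)


def plan(goals):
    """Crude stand-in for query planning: put the most selective goal first."""
    def cost(goal):
        predicate, args = goal
        return sum(unify(args, fact, {}) is not None for fact in FACTS[predicate])
    return sorted(goals, key=cost)


# "Which countries border Spain?" as a conjunction of two predicates.
query = [("country", ("X",)), ("borders", ("X", "spain"))]

naive, planned = {"tried": 0}, {"tried": 0}
naive_answers = [b["X"] for b in solve(query, {}, naive)]
planned_answers = [b["X"] for b in solve(plan(query), {}, planned)]
print(naive_answers, naive["tried"])
print(planned_answers, planned["tried"])
```

Both orderings return the same answers, but the reordered query examines
fewer facts because the goal with a constant argument is satisfied first
and binds the variable before the less selective goal is tried.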
Wenhua (1992) designed an NLP system for Computer Integrated
Manufacturing (CIM) databases, which are large and diverse and represent
completely different concepts, from accounting to computer-aided design
and schedule planning. Four different databases were used, and one NLP
system was designed to access all of them, as necessary, to answer the
user's questions. This system uses a Definite Clause Grammar (DCG) and a
semantic interpreter to translate the English question into a database
query. It is much larger than the earlier database NLP systems, and
advanced queries are possible across the different databases and data
representations.
B. Natural Language Interface for Unstructured Data:
These systems do not restrict themselves to interacting with data in
database tables only. Data from various sources can be used and
accumulated. Some of these systems are discussed below.
ELIZA - by Joseph Weizenbaum (1966). This program is a natural language
interface that plays the role of a psychiatrist. It used pattern-matching
rules that were triggered by key words found in the user's input. ELIZA
reused literal text from the user's input to formulate its questions.
There was no 'understanding' of what was being said; ELIZA simply gave
back the questions that seemed most relevant to the last user input.
Weizenbaum reported that some subjects were convinced that ELIZA was a
real person. He notes, "The human speaker will contribute much to clothe
ELIZA's responses in vestments of plausibility."
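The keyword-triggered pattern matching described above can be sketched as
follows. The rules, the default reply and the function `respond` are
hypothetical and are not taken from Weizenbaum's original script; the
sketch only shows how literal text from the input is reused in the reply
without any understanding of it.

```python
import re

# Hypothetical rules: (pattern triggered by a key phrase, reply template).
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]
DEFAULT = "Please go on."


def respond(utterance):
    """Return the reply of the first rule whose key phrase occurs in the input."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Reuse the matched literal text inside the canned question.
            return template.format(*match.groups())
    return DEFAULT


print(respond("I am unhappy."))
print(respond("It is about my mother"))
print(respond("Hello"))
```

When no key phrase is found, the program falls back to a content-free
prompt, which is one reason the conversation can appear to continue
plausibly.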
SHRDLU - by Terry Winograd (1973). This is one of the first programs that
could carry out tasks and respond in natural language well. It operated
within an artificial blocks world of coloured bricks and pyramids. SHRDLU
was able to perform tasks such as moving objects around within this
limited world when directed to do so in English. The program used a
procedural representation of semantics: each English predicate or term was
associated with a procedure that conveyed the meaning (or semantics) of
the term. The problem with procedural semantics is that it does not scale
up to large domains.
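The idea that the meaning of a term is a procedure can be sketched in a
few lines. This is not Winograd's program: the tiny blocks world, the
command pattern and the procedures `put_on` and `is_on` are hypothetical
and cover only two verbs.

```python
import re

# Each verb is associated with a procedure that carries its meaning.
def put_on(world, obj, target):
    """Meaning of "put X on Y": change what X rests on, if that is allowed."""
    if obj not in world or (target != "table" and target not in world):
        return "I don't know that object."
    if "pyramid" in target:
        return "I can't put anything on a pyramid."
    world[obj] = target
    return "OK."


def is_on(world, obj, target):
    """Meaning of "is X on Y": test the current state of the world."""
    return "Yes." if world.get(obj) == target else "No."


PROCEDURES = {"put": put_on, "is": is_on}

COMMAND = re.compile(
    r"^(?P<verb>put|is) the (?P<obj>.+?) on the (?P<target>.+?)[.?]?$",
    re.IGNORECASE,
)


def interpret(world, sentence):
    """Look up the procedure for the verb and run it on the named objects."""
    match = COMMAND.match(sentence.strip())
    if match is None:
        return "I don't understand."
    procedure = PROCEDURES[match.group("verb").lower()]
    return procedure(world, match.group("obj").lower(),
                     match.group("target").lower())


# Hypothetical blocks world: object name -> what it currently rests on.
world = {"red block": "table", "green pyramid": "table", "blue block": "table"}
print(interpret(world, "Put the green pyramid on the red block."))
print(interpret(world, "Is the green pyramid on the red block?"))
print(interpret(world, "Put the blue block on the green pyramid."))
```

Every new verb or concept needs its own hand-written procedure, which
illustrates why this approach is hard to scale to large domains.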

V. Conclusion

Natural Language Processing can bring powerful enhancements to any
computer program interface. This paper surveyed several natural language
interfaces; studying them will help in implementing new, more advanced
interfaces that handle more complex queries.

References

[1] I. Androutsopoulos, G.D. Ritchie, and P. Thanisch, Natural Language
Interfaces to Databases – An Introduction, Journal of Natural Language
Engineering 1 Part 1 (1995), 29–81.
[2] Huangi, Guiang Zangi, Phillip C-Y Sheu, "A Natural language database
Interface based on probabilistic context free grammar", IEEE
International Workshop on Semantic Computing and Systems, 2008.
[3] ELF Software CO. Natural-Language Database Interfaces from ELF Software
Co, cited November 1999, available from Internet:
[4] Linguistic Technology. English Wizard – Dictionary Administrator's
Guide. Linguistic Technology Corp., Littleton, MA, USA, 1997.
[5] Woods, W., Kaplan, R. and Webber, B. (1972). The Lunar Sciences Natural
Language Information System. Bolt Beranek and Newman Inc., Cambridge,
Massachusetts. Final Report. B. B. N. Report No 2378.
[6] Hendrix, G. (1977). The LIFER manual: A guide to building practical
natural language interfaces. SRI Artificial Intelligence Center, Menlo
Park, Calif. Tech. Note 138.
[7] Warren, D., Pereira, F. (1982). An efficient and easily adaptable
system for interpreting natural language queries. Computational
Linguistics, Volume 8, Numbers 3-4.
[8] Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and
Alexander Yates, Modern Natural Language Interfaces to Databases: Composing
Statistical Parsing with Semantic Tractability, COLING (2004).

