Collaborative Multilingual
Knowledge Engineering Based on
Controlled Natural Language
Kaarel Kaljurand
Institute of Computational Linguistics, University of Zurich
College of Graduate Studies and the Academy of African Languages and Science,
University of South Africa
2013-11-29

Presenter Notes

About myself

  • studied at the University of Tartu, Estonia
  • worked 2002-2013 at the University of Zurich, Switzerland
  • mainly in projects related to controlled natural languages
    • REWERSE (2004-2008): using Attempto Controlled English (ACE) as a semantic web language
    • MOLTO (2012-2013): building a multilingual semantic wiki based on ACE and Grammatical Framework (GF)
    • joint work with: Norbert E. Fuchs (group leader) and Tobias Kuhn
  • other interests:
    • grammar-based speech recognition applications
    • Estonian grammar in GF
    • biomedical text mining
  • see also: http://google.com/+KaarelKaljurand

Presenter Notes

  • PhD on mapping between ACE and Semantic Web languages
  • recently interested in GF, applied to ACE, speech recognition, and Estonian

Overview of the tutorial

  • semantic web (SW)
    • knowledge representation languages (OWL, SWRL) and reasoning
    • user interfaces for the SW (Protégé, Semantic MediaWiki)
  • controlled natural languages (CNLs)
    • Attempto Controlled English (ACE)
    • its construction and interpretation rules, discourse representation
  • ACE as a user interface language for the SW
    • translating between ACE and OWL
    • end-user tools: ACE View, AceWiki
  • Grammatical Framework (GF)
    • GF as a framework for defining multilingual CNLs
    • resource grammar library (RGL)
  • ACE-in-GF: ACE in ~20 languages
  • AceWiki-GF: multilingual collaborative knowledge editor

Presenter Notes

Questions are welcome during the tutorial.

Semantic Web

Presenter Notes

Semantic Web

  • original vision by Berners-Lee, Hendler, Lassila (2001)
    • smart personal agents that book flights and negotiate hotel prices
    • concepts are denoted by URIs
    • ontologies relate concepts to each other
  • simple knowledge representation language Resource Description Framework (RDF)
    • knowledge as a set of URI triples <subject,predicate,object>, e.g.
    • <john,like,mary>, <john,rdf:type,man>, ...
  • highly-expressive ontology and rule languages
    • OWL (1st version from 2004, 2nd in 2009), SWRL (2004), RIF (2010)
    • fragments of first-order logic, OWL based on description logics (DLs)
    • core reasoning tasks have decidable algorithms
    • focus on reasoning efficiency (=> define efficient OWL profiles: EL, QL, RL)
  • ~2008: web of linked open data ("little semantics goes a long way")
    • publish data as RDF and access it using the query language SPARQL
    • use very little OWL reasoning
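
The triple idea can be sketched in a few lines of Python. This is only an illustration: plain strings stand in for URIs, and the `match` helper is a hypothetical name that imitates what a simple SPARQL pattern does, not an API of any RDF library.

```python
# Toy triple store: plain strings stand in for URIs (an assumption for brevity).
triples = {
    ("john", "like", "mary"),
    ("john", "rdf:type", "man"),
}

def match(pattern, store):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )

# Roughly what a SPARQL pattern with a fixed subject asks for.
about_john = match(("john", None, None), triples)
```

Each wildcard position plays the role of a SPARQL variable; a real store would also index the triples instead of scanning them.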

Presenter Notes

Systems like Siri are not too far from fulfilling the original vision. SW languages allow one to state various terminological and rule-like statements. Multilinguality is not mentioned.

Semantic Web stack

Presenter Notes

OWL and SWRL extend the taxonomy language RDFS. It is actually hard to say what relation the alignment of the blocks denotes, maybe simply "was developed before".

Expressive SW languages: OWL

  • formal language for defining terms (URIs) via other terms
  • entities: classes (aka concepts), individuals, object and data properties (roles, relations)
  • simple and complex classes and properties
    • uri://.../animal # denotes the class of animals
    • uri://.../eats # denotes the relation of eating
    • (uri://.../animal AND uri://.../African) # "African animal"
    • InverseOf(uri://.../eats) # the relation of being eaten by something
  • complex classes can use: negation, conjunction, disjunction, existential and universal restrictions, cardinality restrictions
    • (african AND animal) OR (NOT (InverseOf(eat) SOME animal))
  • axioms (ontology is a set of axioms)
    • lion SubClassOf animal
    • (african AND animal) SubClassOf (InverseOf(eat) SOME lion)

Presenter Notes

  • formal = well-defined syntax and semantics (unlike English)
  • semantics based on the description logic (DL) SROIQ (a fragment of first-order logic)
  • annotations for describing the meaning of entities and axioms informally in natural language(s)

The structure of OWL (2009)

Presenter Notes

Syntax of OWL

  • many different syntaxes
    • RDF/XML
    • OWL/XML
    • Turtle
    • functional-style syntax
    • Manchester syntax
  • some based on the RDF idea of triples (RDF/XML, Turtle)
  • some based on description logic style syntax (functional-style, Manchester)
  • in this tutorial: functional-style and Manchester syntax, e.g.
    • SubClassOf(:country ObjectSomeValuesFrom(:border :country))
    • country SubClassOf border SOME country
  • also in this tutorial: ACE-based syntax, e.g.
    • Every country borders a country.
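
That the syntaxes are just different renderings of the same structure can be sketched as follows. The nested-tuple encoding and the function names `functional` and `manchester` are assumptions made for this illustration; only the two constructors used on this slide are covered.

```python
# One axiom as a nested tuple (the encoding is an illustrative assumption).
axiom = ("SubClassOf", "country", ("Some", "border", "country"))

def functional(e):
    """Render in OWL functional-style syntax."""
    if isinstance(e, str):
        return ":" + e
    if e[0] == "Some":
        return f"ObjectSomeValuesFrom({functional(e[1])} {functional(e[2])})"
    if e[0] == "SubClassOf":
        return f"SubClassOf({functional(e[1])} {functional(e[2])})"
    raise ValueError(f"unsupported constructor: {e[0]}")

def manchester(e):
    """Render in Manchester-like syntax (no parentheses for nested classes)."""
    if isinstance(e, str):
        return e
    if e[0] == "Some":
        return f"{manchester(e[1])} SOME {manchester(e[2])}"
    if e[0] == "SubClassOf":
        return f"{manchester(e[1])} SubClassOf {manchester(e[2])}"
    raise ValueError(f"unsupported constructor: {e[0]}")
```

Here `functional(axiom)` and `manchester(axiom)` reproduce the two example lines above. A fuller renderer would have to add parentheses around nested class expressions in the Manchester output.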

Presenter Notes

Semantics of OWL

The meaning of OWL elements is defined via set theory.

OWL | set theory | example
individual | instance of some set | Switzerland
class | set of instances | country
object property | set of pairs <instance1, instance2> | bordering
data property | set of pairs <instance, dataitem> | population
C1 AND C2 | intersection of C1 and C2 | EU-country and NATO-country
C1 OR C2 | union of C1 and C2 | EU-country or NATO-country
NOT C | complement set of C | not an EU-country
P SOME C | set of things that have a P-property with a member of class C | something that flows into a lake
... | ... | ...
DifferentIndividuals(I1 I2) | I1 and I2 stand for different instances | Switzerland is not Germany
ClassAssertion(C I) | I is an instance of C | Switzerland is a country
SubClassOf(C1 C2) | C1 is a subset of C2 | every EU-country is a country
Domain(C P) | the subject of P is in C | everything that borders something is a country
SubPropertyOf(P1 P2) | P1 is a subset of P2 | if X borders Y then X neighbors Y
... | ... | ...
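
The set-theoretic reading can be tried out directly with Python sets. This is a minimal sketch over a tiny made-up interpretation: the individuals, the class memberships, and the helper names `some`, `only`, and `subclass_of` are all assumptions chosen for illustration.

```python
# A tiny made-up interpretation; every name below is illustrative only.
domain = {"Switzerland", "Germany", "Norway", "Rhine", "Lake_Constance"}
eu_country = {"Germany"}
nato_country = {"Germany", "Norway"}
lake = {"Lake_Constance"}
flows_into = {("Rhine", "Lake_Constance")}  # object property = set of pairs

both = eu_country & nato_country    # C1 AND C2: intersection
either = eu_country | nato_country  # C1 OR C2: union
not_eu = domain - eu_country        # NOT C: complement relative to the domain

def some(prop, cls):
    """P SOME C: things with at least one P-link to a member of C."""
    return {x for (x, y) in prop if y in cls}

def only(prop, cls, dom):
    """P ONLY C: things all of whose P-links lead into C (true if there are none)."""
    return {x for x in dom if all(y in cls for (x2, y) in prop if x2 == x)}

def subclass_of(c1, c2):
    """SubClassOf(C1 C2) holds in this interpretation if C1 is a subset of C2."""
    return c1 <= c2
```

Note that this checks statements in one fixed interpretation, whereas an OWL reasoner considers all interpretations that satisfy the ontology.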

Presenter Notes

P only C | set of things all of whose P-properties lead to members of class C | something that flows only into a lake

Axioms relate classes, properties, and individuals to each other in various ways.

Semantics of OWL: UNA and OWA

Unique Name Assumption (UNA): differently named individuals are necessarily different instances of the domain. OWL does not make the UNA.

Open World Assumption (OWA): unstated facts are not assumed to be false. OWL makes the OWA.

Thus, OWL is different from some database/knowledgebase languages (SQL, Prolog) where a missing assertion means that it is false rather than unknown.
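
The difference between the two assumptions can be sketched as two lookup functions. This is a deliberately simplified illustration: the stated facts are made up, no reasoning is performed, and `closed_world` and `open_world` are hypothetical names.

```python
# Explicitly stated positive and negative facts (made up for illustration).
facts = {("Germany", "borders", "Switzerland")}
negated = {("Switzerland", "borders", "Iceland")}

def closed_world(query):
    """Closed world (SQL/Prolog style): whatever is not stated is false."""
    return query in facts

def open_world(query):
    """Open world (OWL style): True, False, or None for unknown."""
    if query in facts:
        return True
    if query in negated:
        return False
    return None
```

For the unstated triple `("Germany", "borders", "France")` the closed-world lookup answers False, while the open-world lookup answers None, i.e. unknown.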

Presenter Notes

T-Box and A-Box

T-Box

  • terminological statements / universal statements (universal quantification in first-order logic), e.g.
    • Every EU-country is a country.
    • If X borders Y then Y borders X.
  • conceptually complex (domain, range, reflexivity, inheritance, ...)
  • usually small and static

A-Box

  • instance data, e.g.
    • Switzerland is not an EU-country.
    • Switzerland borders Germany.
  • conceptually much simpler
  • usually very large and often changing

Presenter Notes

Expressive SW languages: SWRL

  • rule-like statements with explicit variables
    • OWL: eu-country SubClassOf country
    • SWRL: IF eu-country(?X) THEN country(?X)
  • more expressive extension of OWL
    • OWL does not support unrestricted variable sharing
    • SWRL: IF man(?X) AND car(?Y) AND own(?X, ?Y) THEN like(?X, ?Y)
    • English: Every man that owns a car likes the car.
    • OWL approx.: (man AND own SOME car) SubClassOf like SOME car
    • SWRL representation of OWL approx.: IF man(?X) AND car(?Y) AND own(?X, ?Y) THEN like(?X, ?Z) AND car(?Z)
  • still a fragment of first-order logic, but reasoning is not decidable
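
The variable sharing in the SWRL rule can be illustrated by applying it to a few facts. The instance data and the function name `apply_rule` are assumptions for this sketch; it hard-codes the single rule from the slide rather than implementing a SWRL engine.

```python
# Made-up instance data.
man = {"john"}
car = {"beetle"}
own = {("john", "beetle"), ("john", "house1")}

def apply_rule(man, car, own):
    """IF man(?X) AND car(?Y) AND own(?X, ?Y) THEN like(?X, ?Y)."""
    # The same ?Y appears in the body and the head: the liked car is the owned car.
    return {(x, y) for (x, y) in own if x in man and y in car}

like = apply_rule(man, car, own)
```

The OWL approximation would only conclude that john likes some car, not necessarily the one he owns.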

Presenter Notes

Automatic reasoning with OWL

  • reasoning helps to detect modeling errors during development time
    • e.g. necessarily empty classes should be avoided
  • reasoning can be used during runtime to classify new instances
  • reasoning tasks
    • is the ontology consistent?
    • which axioms does the ontology entail? (e.g. classification = entailment of simple SubClassOf-axioms)
    • is a class satisfiable (i.e. can it contain individuals)?
    • which (smallest) subsets of the ontology cause a class to be unsatisfiable?
    • which classes and individuals does the given class contain? (DL Query)
  • reasoning is decidable, i.e. will always solve the task in finite time (although sometimes very slowly)
  • many off-the-shelf reasoners: Pellet, HermiT, FaCT++, ...
  • integrated into end-user tools (Protégé, Semantic MediaWiki, AceWiki, ...)

Presenter Notes

Reasoning example

(written in ACE for easier readability)

Ontology

Every country is a territory.
If X borders Y then Y borders X.
If X borders something then X is a country.
Germany borders Switzerland.

Entailments

Switzerland borders Germany.
Switzerland is a country.
Switzerland is a territory.
Germany is a country.
Germany is a territory.
It is false that Germany does not border Switzerland.
...

Unknown truth value

Switzerland is not Germany. # no UNA in OWL
Switzerland borders exactly 1 territory.
Switzerland borders itself.
...
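
The entailments above can be reproduced with naive forward chaining. This is a sketch that hard-codes the three rules of the example ontology in Python; the function name `closure` and the pair-based encoding are assumptions, and real OWL reasoners work very differently.

```python
def closure(types, borders):
    """Apply the three rules until nothing new can be derived."""
    types, borders = set(types), set(borders)
    while True:
        # If X borders Y then Y borders X.
        new_borders = borders | {(y, x) for (x, y) in borders}
        # If X borders something then X is a country.
        new_types = types | {(x, "country") for (x, _) in new_borders}
        # Every country is a territory.
        new_types |= {(x, "territory") for (x, c) in new_types if c == "country"}
        if new_borders == borders and new_types == types:
            return types, borders
        types, borders = new_types, new_borders

# Germany borders Switzerland.
types, borders = closure(set(), {("Germany", "Switzerland")})
```

A statement such as "Switzerland borders itself" is simply not derived here; under the open world assumption that means unknown, not false.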

Presenter Notes

Entailments are true axioms which are not explicitly part of the ontology.

OWL: Usage examples

  • SNOMED Clinical Terms
    • large (300k concepts) collection of medical terms, with NL labels and formal relations
    • multilingual labels on concepts, currently English (~1m labels), Spanish, Danish, Swedish
    • EL fragment of OWL (excludes disjunction, negation, cardinality constraints, ...)
    • Common-cold SubClassOf causative-agent SOME Virus
    • used in >50 countries
  • content management with semantic wikis
    • e.g. Semantic MediaWiki, OntoWiki
    • small subset of OWL considered for reasoning
  • DBpedia
    • structured part of Wikipedia (infoboxes)
    • background ontology (SubClassOf, domain, range)
    • multilingual

Presenter Notes

See also:

http://www.slideshare.net/micheldumontier/owlbased-applications

http://www.reportinghub.org/

User interfaces for the Semantic Web

  • semantic web is for both machines (agents) and humans
  • humans require a user-friendly language to interact with the SW
  • RDF's triple-based data model is very simple
    • does not directly allow for a user-friendly syntax for complex structures
    • originally was written in RDF/XML (unnecessarily verbose)
  • OWL (version 2009) syntax is much more elegant (based on Description Logics)
    • still remote from natural language
    • lots of syntactic sugar for various types of SubClassOf-constructs
  • separate ontology, rule, and query languages
  • little focus in the SW community on UI issues, see e.g. D. Karger ESWC 2013 keynote "A Semantic Web for End Users"

Presenter Notes

Two examples of SW tools

Presenter Notes

Protégé

  • open-source and widely used OWL editor (actually predates OWL)
  • user interface
    • subclass hierarchy tree with drag-and-drop
    • otherwise reflects the OWL axiom types 1-to-1
    • integrates Manchester OWL Syntax (MOS) for expressing classes and axioms
  • focus on T-Box editing
  • integrates reasoners
    • query answering
    • entailment explanation
  • WebProtégé: simpler web-based editor with focus on collaborative editing (e.g. revision history support)

Presenter Notes

Protégé: DL Query

Presenter Notes

Protégé: Entailment explanation

Smallest subset that explains why mad-cow is an empty class, indicating that the ontology contains a modeling error.


Presenter Notes

Wikis and Semantic wikis

Wiki

  • user-friendly collaborative environment for knowledge management
  • content typically unconstrained natural language, therefore not easily automatically processable
  • powered by software, e.g. MediaWiki, best known example: Wikipedia

Semantic wiki = wiki + formal semantics

  • provides a richer query language + improves consistency by reducing redundancy: the same information stated only once
  • content typically in natural language enriched with typed links (i.e. RDF triples)
  • user interface
    • wiki-style: plain text with embedded meta information
    • forms for filling templates
  • software: Semantic MediaWiki, Freebase, ...

Presenter Notes

Shortcomings: cannot copy content from one language to the other, cannot ask questions, cannot check that the different versions of an article in different languages are about the same thing.

Semantic MediaWiki

  • extension of MediaWiki / Wikipedia
  • adds "semantic annotations": structured formats for
    • attaching attributes to concepts
    • representing lists, tables, etc.
    • representing maps, timelines, calendars, etc.
  • queries that automatically generate wiki pages
  • supports subset of OWL for query answering
    • subclass, subproperty, inverse property, equality
    • excludes e.g.: transitivity, domain/range, cardinality restrictions
  • focus on A-Box editing

Presenter Notes

Semantic MediaWiki

Example: sentence in an article Berlin is entered as

Berlin is the capital of [[Is capital of::Germany]]
and has a population of [[Has population::3,292,365|3.3 million]].

and is rendered as

Berlin is the capital of Germany
and has a population of 3.3 million.

and adds two triples to the underlying model

<Berlin, Is_capital_of, Germany>
<Berlin, Has_population, 3292365>
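
How such annotated text could be turned into rendered text plus triples is sketched below. This is not Semantic MediaWiki's implementation: the regular expression, the function name `process`, and the rule for turning comma-grouped digits into a number are assumptions made for this example.

```python
import re

# [[Property::value]] or [[Property::value|display text]]
ANNOTATION = re.compile(r"\[\[([^:\]|]+)::([^\]|]+)(?:\|([^\]]+))?\]\]")

def process(page, text):
    """Return (rendered text, triples) for the wiki text of a page."""
    triples = []

    def render(m):
        prop, value, display = m.group(1), m.group(2), m.group(3)
        shown = display if display is not None else value
        # Assumed normalization: comma-grouped digits become an integer.
        if re.fullmatch(r"\d{1,3}(,\d{3})*", value):
            value = int(value.replace(",", ""))
        triples.append((page, prop.replace(" ", "_"), value))
        return shown

    return ANNOTATION.sub(render, text), triples

source = ("Berlin is the capital of [[Is capital of::Germany]]\n"
          "and has a population of [[Has population::3,292,365|3.3 million]].")
rendered, triples = process("Berlin", source)
```

The subject of each triple is the page on which the annotation occurs, which is why the page name is passed in separately.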

Presenter Notes

Semantic MediaWiki: Query

Which countries are located in Africa?

{{#ask: [[Category:Country]] [[located in::Africa]]
 | ?Area#km²
 | ?Name
 | ?Population
 | sort = population
 | order = descending
}}

The query is rendered as a sortable table listing links to the pages of category Country (or some of its subcategories) that contain the property located in (or some of its subproperties) with the value Africa.
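
What the query computes can be approximated in Python as a filter followed by a sort. The page records, the placeholder population values, and the function name `ask` are made up for this sketch, and subcategories and subproperties are ignored.

```python
# Made-up page data; the population values are placeholders, not statistics.
pages = [
    {"name": "Page_A", "category": "Country", "located in": "Africa", "population": 10},
    {"name": "Page_B", "category": "Country", "located in": "Europe", "population": 20},
    {"name": "Page_C", "category": "Country", "located in": "Africa", "population": 30},
    {"name": "Page_D", "category": "City", "located in": "Africa", "population": 5},
]

def ask(pages, category, prop, value, sort, descending=True):
    """Select pages by category and property value, then sort them."""
    hits = [p for p in pages if p["category"] == category and p.get(prop) == value]
    return sorted(hits, key=lambda p: p[sort], reverse=descending)

result = ask(pages, "Country", "located in", "Africa", sort="population")
```

The printout columns of the real query (Area, Name, Population) would correspond to choosing which keys of each hit to display.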

Presenter Notes

The value is that one can easily insert this table (possibly with some modifications/customizations) to various pages without having to type it in from scratch.

Multilingual Semantic Web

Presenter Notes

In general not much work on it.

Multilingual SW: Ontology verbalization

  • RDF, OWL, ... keywords are in English
  • most ontologies are written with English entity names
  • recommendation (in biomed ontologies) to name entities in a language-neutral way (with ID codes)
  • entity and axiom annotations provide a way to include free-form natural language in the ontologies
    • e.g. via rdfs:label with an ISO language tag
  • The Lexicon Model for Ontologies (lemon)
    • new proposal for annotating ontologies with linguistic information
    • each entity is mapped to its corresponding lexical entries specifying the word forms and part of speech categories (for different languages)

Presenter Notes

Multilingual SW: Semantic wikis

Presenter Notes

Links

Presenter Notes

CNLs and ACE

Presenter Notes

Controlled Natural Languages

  • motivation
    • natural language with tool support, e.g. auto-completion, paraphrasing, consistency checking, question answering, translation
    • wide-coverage parsers achieve only ~90% accuracy, and only on simpler tasks (e.g. detecting subjects and objects)
  • formal syntax, but fragment of a natural language
  • formal semantics
    • ambiguity handling
    • translations to other (natural or formal) languages
    • reasoning, paraphrasing, etc.
  • many languages with different goals and formal properties
    • see Tobias Kuhn. A Survey and Classification of Controlled Natural Languages, Computational Linguistics
  • well-known example: Attempto Controlled English (ACE)
  • frameworks: Grammatical Framework (GF)

Presenter Notes

usage: UI for formal logics in cases where domain experts lack formal background

Attempto Controlled English (ACE)

  • subset of English
  • controls ambiguity and synonymy
    • A dog hates a cat. == There is a dog that hates a cat.
    • A dog hates a cat. =/= Every dog hates a cat.
    • Every dog hates a cat. == If there is a dog then the dog hates a cat.
  • end-user documentation describes ACE as a set of restrictions on English
    • construction rules, interpretation rules, ...
    • easy to learn
  • well-defined translations to first-order logic, OWL, ...
  • tool support for end-users
  • developed in the Attempto project at the University of Zurich, led by Norbert E. Fuchs

Presenter Notes

ACE: Language overview

  • subset of natural English
    • conjunction (and), disjunction (or), negation (no, not, it is false that), if-then, ...
    • anaphoric references: pronouns (he, it), definite noun phrases (the man), variables (X, Y123)
    • quantifiers: every, no, at least 3, ...
    • content words: proper names, common nouns, verbs, adjectives, adverbs
  • grammar is fixed, but users can change content words
  • deterministic ambiguity handling
    • anaphora resolution (France borders Spain and it borders Portugal.)
    • quantifier scope (Every country borders a country.)
    • attachment (Every EU-country borders a country that is an EU-country and is a NATO-country.)

Presenter Notes

Example: Syntactic restrictions

English: Koalas eat eucalyptus leaves.

Violates the ACE construction rules:

  • every noun must have a determiner ('every', 'a', 'at least 2', ...)
  • multi-word nouns are not allowed

Correct ACE:

  • Every koala eats an eucalyptus-leaf.
  • If a koala eats something X then X is an eucalyptus-leaf.
  • ...

Presenter Notes

Example: Semantic restrictions

English: Every EU-country borders a country that is an EU-country and is a NATO-country.

The attachment differences must be expressed in a lexically different way in ACE:

  • Every EU-country {borders a country that is an EU-country} and {is a NATO-country}.
  • Every EU-country borders a country {{that is an EU-country} and {that is a NATO-country}}.

(The {curly brackets} are not part of ACE. They are used here to denote the scopes.)

Presenter Notes

Discourse Representation Structure

The_Mediterranean_Sea is a sea. Every country that does not border a sea is a landlocked-country. Switzerland does not border the sea.

  • DRS
    • interpreted as a first-order logic formula
    • ACE content words map to atomic conditions
    • ACE function words (and, or, not, if-then) introduce various complex DRS "boxes"
    • variables identify content words
    • variables denote anaphoric references to nouns, the DRS box structure reflects accessibility constraints on references

Presenter Notes

Syntactic sugar

It is possible to express the same meaning (the same DRS) in syntactically different ways, e.g. every = if-then:

  • Every koala eats an eucalyptus-leaf. # M1
  • If there is a koala then the koala eats an eucalyptus-leaf. # M1
  • Everything that a koala eats is an eucalyptus-leaf. # M2
  • If a koala eats something X then X is an eucalyptus-leaf. # M2

relative clause = coordination

  • Every mammal that is endemic-to Australia is a marsupial or is a .... # M3
  • If a mammal is endemic-to Australia then the mammal is a marsupial or is a .... # M3
  • If there is a mammal and the mammal is endemic-to Australia then the mammal is a marsupial or is a .... # M3

(M{1,2,3} indicate sentences with the same DRS.)

Presenter Notes

Content word lexicon

Content words are user-defined. They are classified by their word class and have one or more surface forms.

  • nouns: 2 forms: singular: man, plural: men
  • proper names: 1 form: singular: John
  • verbs
    • intransitive (wait), transitive (eat), ditransitive (give)
    • 3 forms: 3rd person singular finite: eats, infinitive: eat, past participle: eaten
  • adjectives
    • intransitive (rich), transitive (fond-of)
    • 3 comparison forms (rich, richer, richest)
  • adverbs (3 comparison forms)

The lexicon allows the definition of aliases but does not otherwise provide any semantics, e.g. statements like "Every man is a human." can be made only at the level of ACE.

Presenter Notes

Tools

  • ACE parser (APE): maps ACE to DRS
  • DRS verbalizer: maps DRS to ACE
    • the roundtrip: ACE -parser-> DRS -verbalizer-> ACE usually results in a syntactically different ACE text thanks to syntactic sugar
  • DRS translators: map DRS to
    • OWL/SWRL
    • TPTP
    • ...
  • look-ahead editor for a subset of ACE
  • native ACE reasoner: RACE
  • OWL verbalizer: maps OWL to ACE
  • end-user ACE / SW ontology editors:
    • ACE View
    • AceWiki

Presenter Notes

Attempto Parsing Engine (APE)

  • tokenizes and parses the given ACE text
  • resolves anaphora
  • translates to one or more logical forms (TPTP, OWL/SWRL, ...) via DRS
  • paraphrases the input in ACE via DRS
  • accessible from Prolog, Java, HTTP, and the command line

Command-line example: parse a sentence into its DRS form.

$ echo "No dog is a cat." | ape.exe -solo drspp
[]
   [A]
   object(A,dog,countable,na,eq,1)-1/2
   =>
   []
      NOT
      [B,C]
      object(B,cat,countable,na,eq,1)-1/5
      predicate(C,be,A,B)-1/3

Presenter Notes

ACE as a semantic web language

Presenter Notes

Motivation

  • user-friendly language because based on English
    • standard English instead of various formal notations
    • easy to read (and read out loud)
    • easy to write (?)
  • single language instead of 3 different languages for ontologies, rules and queries
  • different (more natural) motivation for the syntactic sugar
  • usage:
    • verbalizing existing ontologies
    • writing new and modifying existing ontologies
    • querying existing ontologies
    • presenting entailments and entailment explanations

Presenter Notes

Mapping between ACE and OWL/SWRL

ACE | OWL/SWRL
proper name | individual
noun | named class
intransitive adjective | named class
noun phrase (with relative clause) | complex class
transitive verb with NP object | property
transitive verb with data object | data property
transitive adjective | property
sentence | OWL axiom or SWRL rule (SWRL rule as a fallback)
question (with 1 query word) | DL-Query
text | ontology

Presenter Notes

  • parse ACE into DRS (FOL-formula)
  • translate DRS to OWL, or if this fails then to SWRL, or return an error message if this also fails

ACE sentence and question in OWL

Every country that does not border a sea is a landlocked-country.

SubClassOf(
   ObjectIntersectionOf(
      :country
      ObjectComplementOf(
         ObjectSomeValuesFrom(
            :border
            :sea
         )
      )
   )
   :landlocked-country
)

Which country is a landlocked-country?

ObjectIntersectionOf(
    :country
    :landlocked-country
)

Presenter Notes

Verbalizing OWL ontologies as ACE texts

OWL contains many shorthand constructs (DisjointClasses, ObjectPropertyDomain) with no direct ACE counterpart, e.g.

  DisjointClasses(
    :country
    ObjectHasSelf( :border )
  )

Directly reflecting the keywords and axiom structure would give something like:

Countries and self-borderers are disjoint.

ACE-style verbalization gives a more natural:

No country borders itself.

Presenter Notes

OWL Verbalizer

Rewrite an OWL axiom, e.g.

  ObjectPropertyDomain( write human )

into something with a more suitable structure (but preserving meaning):

  SubClassOf(
    ObjectIntersectionOf(
      owl:Thing
      ObjectSomeValuesFrom(
        write
        owl:Thing
      )
    )
    human
  )

then directly map it to ACE (with a Prolog DCG grammar):

  Everything
    that
    writes something
  is a human.
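
The two steps (rewrite, then map to ACE) can be sketched in Python for this one axiom shape. This is not the OWL verbalizer itself, which uses a Prolog DCG: the tuple encoding, the function names, and the naive morphology (appending "s", choosing "a"/"an" by the first letter) are assumptions made for illustration.

```python
def rewrite_domain(prop, cls):
    """ObjectPropertyDomain(prop cls) as an equivalent SubClassOf axiom."""
    return ("SubClassOf",
            ("And", "owl:Thing", ("Some", prop, "owl:Thing")),
            cls)

def third_person(verb):
    # Naive morphology (assumption); a real tool takes word forms from a lexicon.
    return verb + "s"

def article(noun):
    # Naive choice of the indefinite article (assumption).
    return ("an " if noun[0] in "aeiou" else "a ") + noun

def verbalize(axiom):
    """Map the axiom shape produced by rewrite_domain to an ACE sentence."""
    _, (_, _, (_, prop, _)), cls = axiom
    return f"Everything that {third_person(prop)} something is {article(cls)}."

sentence = verbalize(rewrite_domain("write", "human"))
```

Only the pattern from this slide is handled; other axiom types would need their own rewriting rules and sentence templates.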

Presenter Notes

OWL Verbalizer: Issues

  • OWL naming conventions for entities
    • usually: nouns for classes, verbs for properties
    • sometimes contain complex structure: nonNormal, MaleOrFemalePatient
  • simple one-to-one mapping of axioms to sentences
    • other approaches also reorder/combine axioms

Presenter Notes

Evaluation: Comparison to MOS

  • Manchester OWL Syntax (MOS)
    • concise notation for OWL
    • part of the OWL (2009) standard
    • used in editors like Protégé
  • Tobias Kuhn. The Understandability of OWL Statements in Controlled English (2013)
    • compares the ACE and MOS representations of a wide variety of OWL axiom types
    • evaluation with 64 users (students, but not of computer science)
    • ask subjects to look at diagrams and evaluate corresponding ACE and MOS statements as true or false
    • result: ACE is easier to learn and understand
    • no evaluation of "writability"

Presenter Notes

ACE as a SW language: Issues

  • need to support word-forms (e.g. in the lexicon editor):
    • 2 for nouns: man, men
    • 3 for verbs: eats, eat, eaten by
  • English-specific
    • can be misleading (if interpreted as English, and not ACE)
    • non-English speakers might prefer a more language-neutral notation
  • harder to implement tool support (parser, verbalizer, etc.) for ACE than for MOS
    • covering a wide variety of programming languages

Presenter Notes

Shortcoming shared with MOS: possibly harder to write, needs a look-ahead editor. ACE tool support exists for Prolog, but not for other languages.

Two ACE tools for the SW

Presenter Notes

ACE View

  • ACE-based OWL/SWRL editor, viewer, query interface
  • plug-in for Protégé 4+
  • provides set of ACE "views" to the ontology
  • integration with other Protégé views
    • changes done via standard Protégé views are reflected in the ACE views
    • motivation: sometimes graphical UI is simpler to use than plain text
  • integrates both ACE->OWL/SWRL and OWL->ACE
  • ACE-based entailment explanation

Presenter Notes

ACE View: Snippets

Presenting the complete ontology as a sortable/searchable set of ACE sentences.

Presenter Notes

ACE View: Entailments

Explaining an entailment by a set of ACE sentences.

Presenter Notes

ACE View: Question answering

Counting and presenting DL Query results.

Presenter Notes

ACE View vs Protégé

Property axioms presented in ACE vs checkboxes+labels.

Presenter Notes

AceWiki

  • semantic wiki engine
  • focus on user-friendliness and high expressivity of formal content
  • uses ACE for the content
    • OWL-compatible fragment of ACE (e.g. excludes prepositional phrases and adverbs)
    • user is syntactically assisted by a look-ahead editor
  • uses OWL for the reasoning language
    • supports different OWL profiles
    • more expressive than usually in semantic wikis
    • completely hidden from the users

Presenter Notes

Hiding the formal language enables one to use a more expressive fragment of it.

AceWiki: Screenshot

AceWiki article about landlocked country and the look-ahead editor.

Presenter Notes

AceWiki: Main features

  • model
    • content word = article (like in traditional wikis)
    • article = collection of statements
    • statement = declarative sentence or query or comment
  • declarative sentences and queries are written in an OWL-compatible subset of ACE
    • Codeco grammar which formally describes this subset and
    • ... makes it available via a look-ahead editor
    • UI prevents syntactically and semantically unsupported inputs
  • automatic feedback
    • inconsistent sentences are marked in red
    • queries are answered by automatically populating the list of matching entities
  • collaborative editing of
    • lexicon
    • articles

Presenter Notes

AceWiki: Evaluation

Two small usability experiments:

  • altogether 26 untrained participants
  • task: collaborative creation of a knowledge base
  • results:
    • 78%-81% of the sentences were correct and sensible
    • 61%-70% of them were complex (containing negations, implications, disjunctions or number restrictions)
    • creation of a correct sentence every 5-6 minutes
    • definition of a new word every 5-7 minutes

=> Even untrained users can effectively use AceWiki

Presenter Notes

AceWiki: Other usages

AceWiki (or parts of it) has been used in various research projects, e.g.

  • Coral: a CNL-based query interface for annotated text corpora (UZH, Zurich)
    • uses the AceWiki sentence editor
    • does not use ACE, but a new CNL that maps to the ANNIS Query Language
    • no collaborative editing
    • source code: https://github.com/tkuhn/Coral
  • AceCAPTCHA (AGH UST, Kraków)
    • CAPTCHA = Completely Automated Public Turing test to tell Computers and Humans Apart
    • AceWiki preloaded with terminological knowledge (entered by experts)
    • presents end-users with ACE questions about the wiki content to (1) distinguish them from computers (CAPTCHA) and to (2) populate the wiki with instance data (collaborative editing)
  • ACE for source code documentation (University of Chile)
    • AceWiki populated with ACE statements about software structure (classes, methods, inheritance, ...)
    • users can ask ACE questions about module dependencies etc.

Presenter Notes

Sometimes parts of AceWiki are reused to build new systems, rather than using the complete AceWiki as it is.

Mention that AceWiki-GF is another application, discussed later.

AceWiki: Small case study

Article on Fynbos containing the original definition (in full English), its formal content (in ACE), and automatically generated upper concepts.

Presenter Notes

AceWiki: Small case study

Agriculture glossary (= set of term-definition pairs) as a CNL-based semantic wiki

  • convert an existing glossary into the wiki format (wiki article for every term, every term considered to be a common noun)
  • represent the core information in the definition as a set of ACE sentences
  • purpose
    • educational: definition easier to understand for non-English speakers
    • sentences are formal: automatic semantic feedback
    • collaborative development of course content
    • sentences are easier to translate (future work)
  • results
    • several new terms (e.g. relations/verbs) were added
    • some ambiguity/vagueness was identified and resolved
    • correct and natural formulation is sometimes difficult to find

Presenter Notes

Links

Presenter Notes

Grammatical Framework
as a CNL framework

Presenter Notes

Other CNLs (in comparison to ACE)

ACE is general-purpose, English-based (and difficult to port to other NLs), with fixed grammar and largely fixed interpretation.

Other CNLs:

  • more domain specific
    • several SW-oriented CNLs: Rabbit, SOS, CLOnE
    • optimized for querying (not authoring), e.g. Coral, NL-interfaces to databases
  • no underlying formal semantics
  • relaxed syntactic constraints
  • relaxed or different ambiguity handling
  • speech-oriented: voice commands for mobile devices
  • other goals:
    • automatic translation into other natural languages

Presenter Notes

Components of a CNL framework

  • programming language with built-in support for describing natural language grammars
  • CNL design guidelines and best practices
  • language-independent parsing engine
  • library that contains the linguistic knowledge of several NLs, ideally with a language-independent API
  • general enough to support a variety of use cases

Presenter Notes

Grammatical Framework (GF)

  • framework for multilingual grammar engineering
    • functional programming language optimized to handle natural languages
    • resource grammar library implementing common morphological and syntactic structures
  • a GF program (aka grammar) consists of
    • language-neutral abstract syntax
    • multiple concrete syntaxes that implement the abstract functions and categories, specifying words, word order, agreement, etc.
  • main operations
    • parsing: map a string in some language to abstract tree(s)
    • linearization: linearize tree(s) as strings in some language
    • translation = parse a string in language A to tree(s) + linearize these tree(s) as strings in language B
  • various tools + bindings to Python, Java, JavaScript, Prolog, ...
  • developed at the University of Gothenburg, led by Aarne Ranta

Presenter Notes

parsing (translation, look-ahead, ...) is based on Parallel Multiple Context-Free Grammars (PMCFG), a mildly context-sensitive formalism

GF Editor/Translator for Foods.pgf


Presenter Notes

GF example: Abstract module

abstract Unitconv = {
  flags startcat = Unitconv ;
  cat Unit ; Unitconv ;
  fun
    unitconv : Unit -> Unit -> Unitconv ;
    land_mile, nautical_mile : Unit ;
}
  • declares the categories (Unit, Unitconv)
  • declares the functions (unitconv, land_mile, nautical_mile) and their types (Unit->Unit->Unitconv, Unit)
  • trees that can be obtained from the start category:
    • unitconv land_mile nautical_mile
    • unitconv land_mile land_mile
    • ...
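
The full set of trees can be enumerated mechanically from the function types. The following is a minimal Python sketch (not GF itself) that assumes the abstract syntax is given as a table of signatures; the names FUNS, trees and show are illustrative only. It terminates here because the grammar is not recursive.

```python
from itertools import product

# Function signatures of the Unitconv abstract syntax:
# name -> (argument categories, result category)
FUNS = {
    "unitconv": (("Unit", "Unit"), "Unitconv"),
    "land_mile": ((), "Unit"),
    "nautical_mile": ((), "Unit"),
}

def trees(cat):
    """Return all trees of category `cat` as nested tuples (function, *subtrees)."""
    result = []
    for name, (arg_cats, result_cat) in FUNS.items():
        if result_cat != cat:
            continue
        # Combine every choice of subtree for every argument position.
        for args in product(*(trees(c) for c in arg_cats)):
            result.append((name, *args))
    return result

def show(tree, top=True):
    """Render a tree in prefix notation, bracketing complex arguments."""
    name, *args = tree
    text = " ".join([name] + [show(a, top=False) for a in args])
    return text if top or not args else f"({text})"

for t in trees("Unitconv"):
    print(show(t))
```

With two units and a binary function this prints four trees, including the two listed above.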

Presenter Notes

GF example: Concrete modules

concrete UnitconvEng of Unitconv = {
  lincat Unit, Unitconv = {s : Str} ;
  lin
    unitconv x y = {s = "how much is" ++ x.s ++ "in" ++ y.s ++ "?"} ;
    land_mile = {s = "mile"} ;
    nautical_mile = {s = "nautical mile" | "mile"} ;
}

concrete UnitconvWolfram of Unitconv = {
  lincat Unit, Unitconv = {s : Str} ;
  lin
    unitconv x y = {s = "convert" ++ x.s ++ "to" ++ y.s} ;
    land_mile = {s = "mile"} ;
    nautical_mile = {s = "nmi"} ;
}
  • define how trees are linearized in a concrete language
  • assign linearization structures to categories (e.g. "record containing a string": { s : Str })
  • define how functions are linearized, e.g. as strings or concatenation of strings
    • e.g. (x.s ++ "in") concatenates the string of the argument (x.s) with the string "in"

Presenter Notes

GF parsing and linearizing

Parsing, i.e. converting a string "how much is nautical mile in mile ?" to tree(s)

Unitconv> parse -lang=Eng "how much is nautical mile in mile ?"

unitconv nautical_mile land_mile
unitconv nautical_mile nautical_mile

Linearization, i.e. converting a tree (unitconv nautical_mile land_mile) to string(s)

Unitconv> linearize -treebank -list (unitconv nautical_mile land_mile)

UnitconvEng: how much is nautical mile in mile ?, , how much is mile in mile ?
UnitconvWolfram: convert nmi to mile

Translation, i.e. parse + linearize

Unitconv> parse -lang=Eng "how much is nautical mile in mile ?" | l -lang=Wolfram

convert nmi to mile
convert nmi to nmi
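
The three operations can be mimicked in a few lines of Python. This is only a sketch under simplifying assumptions: parsing is done by brute force (linearize every tree and compare), which stands in for GF's real parser, and translation keeps just the first variant of each tree. ENG, WOLFRAM, TREES and the function names are illustrative, not part of GF.

```python
from itertools import product

# Each concrete syntax maps a function name to a rule that takes the
# argument strings and returns the list of variant strings.
ENG = {
    "unitconv": lambda x, y: [f"how much is {x} in {y} ?"],
    "land_mile": lambda: ["mile"],
    "nautical_mile": lambda: ["nautical mile", "mile"],  # two variants
}
WOLFRAM = {
    "unitconv": lambda x, y: [f"convert {x} to {y}"],
    "land_mile": lambda: ["mile"],
    "nautical_mile": lambda: ["nmi"],
}
UNITS = ["land_mile", "nautical_mile"]
# All trees of the start category, as nested tuples.
TREES = [("unitconv", (x,), (y,)) for x, y in product(UNITS, repeat=2)]

def linearize(tree, concrete):
    """Return every string that `tree` can be linearized to."""
    name, *args = tree
    strings = []
    for arg_strings in product(*(linearize(a, concrete) for a in args)):
        strings.extend(concrete[name](*arg_strings))
    return strings

def parse(string, concrete):
    """Return the trees that have `string` among their linearizations."""
    return [t for t in TREES if string in linearize(t, concrete)]

def translate(string, source, target):
    """Parse in the source language, then linearize each tree in the target.

    Only the first variant of each tree is kept.
    """
    return [linearize(t, target)[0] for t in parse(string, source)]

print(translate("how much is nautical mile in mile ?", ENG, WOLFRAM))
```

The ambiguity of "mile" in the English concrete syntax is what produces two trees, and therefore two translations.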

Presenter Notes

GF example: More complex example

abstract Geography = {
  cat
    Country ; Relation ;
  fun
    germany : Country ;
    switzerland : Country ;
    border : Country -> Country -> Relation ;
}

concrete GeographyGer of Geography = {
  param
    CaseGer = Nom | Acc | Dat ;
  lincat
    Country = CaseGer => Str ; Relation = Str ;
  lin
    germany      = table { _ => "Deutschland" } ;
    switzerland  = table { Dat => "der Schweiz" ; _ => "die Schweiz" } ;
    border c1 c2 = c1 ! Nom ++ "grenzt an" ++ c2 ! Acc ;
}
  • German cases require a more flexible linearization structure (table of strings)
  • fortunately this (and other types of) complexity can be hidden in a library, which simplifies the code to something like:
    • border c1 c2 = c1 ! Nom ++ (mkVP3rdPers "grenzen" c2)
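
The idea of a table of strings selected by case can be illustrated outside GF as well. The Python sketch below mirrors the GeographyGer module with dictionaries; the helper table and the dictionary COUNTRY are illustrative names, and only the forms given in the module above are used.

```python
# Inflection tables: country -> {case: string}, mirroring `CaseGer => Str`.
CASES = ("Nom", "Acc", "Dat")

def table(default, **overrides):
    """Build a full case table from a default form plus exceptions."""
    return {case: overrides.get(case, default) for case in CASES}

COUNTRY = {
    "germany": table("Deutschland"),
    "switzerland": table("die Schweiz", Dat="der Schweiz"),
}

def border(c1, c2):
    """Linearize `border c1 c2`: subject in nominative, object in accusative."""
    return f"{COUNTRY[c1]['Nom']} grenzt an {COUNTRY[c2]['Acc']}"

print(border("germany", "switzerland"))
print(border("switzerland", "germany"))
```

The selection `COUNTRY[c]['Acc']` plays the role of `c ! Acc` in the GF code.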

Presenter Notes

GF Resource Grammar Library (RGL)

  • morphology and syntax for ~30 languages via language-neutral API
  • developers do not need detailed knowledge of the languages that they want to support in their application

Presenter Notes

RGL: Content words

  • some languages have many content word forms, e.g. Estonian has ~28 noun forms (2 numbers with 14 cases each)
  • smart paradigm
    • construct (all) word forms on the basis of as few base forms (nominative, genitive, ...) as possible
    • other input information: gender (for some languages), verb valency (for verbs, transitive adjectives, ...)
  • benefit: keeps the lexicon equally simple across languages
  • lexicon editors are likely to be familiar with the base forms (rather than language-specific morph. type systems)
  • examples
    • mkN "dog" constructs all the noun forms from the singular nominative "dog": dog, dogs, dog's, dogs'
    • mkN "man" "men" (two forms needed for irregular English words)
  • some languages in the RGL also have large (~50k entries) lexicons (with mapping to WordNet senses)
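
To make the smart-paradigm idea concrete, here is a toy Python stand-in for an English noun paradigm. It is a sketch only: the function mk_n and its spelling heuristics are assumptions made for illustration and are much cruder than the actual RGL rules.

```python
VOWELS = "aeiou"

def mk_n(sg, pl=None):
    """Guess the four English noun forms from one base form (or two)."""
    if pl is None:
        # Heuristic plural formation; irregular nouns must pass `pl` explicitly.
        if sg.endswith(("s", "x", "z", "ch", "sh")):
            pl = sg + "es"
        elif len(sg) > 1 and sg.endswith("y") and sg[-2] not in VOWELS:
            pl = sg[:-1] + "ies"
        else:
            pl = sg + "s"
    # Genitive: plurals ending in -s take only an apostrophe.
    pl_gen = pl + "'" if pl.endswith("s") else pl + "'s"
    return {
        ("Sg", "Nom"): sg,
        ("Pl", "Nom"): pl,
        ("Sg", "Gen"): sg + "'s",
        ("Pl", "Gen"): pl_gen,
    }

print(list(mk_n("dog").values()))
print(list(mk_n("man", "men").values()))
```

The lexicon entry stays a single call in both cases; the second argument is needed only when the heuristic would guess wrongly.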

Presenter Notes

RGL: Syntax

  • language-independent API for many syntactic functions
    • mkNP : Numeral -> N -> NP : five men
    • mkCl : NP -> V -> Cl : five men sleep
  • library handles word order, agreement, choice of function words, etc.
    • "red apple" vs "pomme rouge"
  • language-specific constructions can be included in the Extra-modules
    • e.g. verb phrase coordination

Example: constructing a clause:

$ gf
> i -retain alltenses/TryEng.gfo
> cc -one
     mkS pastTense
         (mkCl (mkNP (mkNumeral "1") (mkN "dog")) (mkV "sleep" "slept" "slept"))
one dog slept

> cc -one
     mkS (mkCl (mkNP (mkNumeral "2") (mkN "dog")) (mkV "sleep" "slept" "slept"))
two dogs sleep

Presenter Notes

RGL-based GF programs

incomplete concrete FoodsI of Foods = open Syntax, LexFoods in {
  lincat
    Phrase = Cl ;
    Item = NP ;
    ...
  lin
    is x y = mkCl x y ;
    ...
    wine = mkCN wine_N ;
    ...
}

concrete FoodsGer of Foods = FoodsI with
  (Syntax = SyntaxGer),
  (LexFoods = LexFoodsGer) ;
  • best practice for multilingual grammars: share most of the code via a functor
  • functor = "incomplete concrete" module that is parametrized by language-specific resources
  • the functor references the RGL via its language-neutral API (Cl, NP, mkCl, mkCN)
  • the concrete languages import the functor and plug in language-specific resources
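
The functor pattern is not specific to GF: shared code is written once against an interface, and each language supplies its own implementation of that interface. The Python sketch below illustrates this under strong simplifications. Resource, make_foods, the fields this, copula and delicious, and the gender handling are hypothetical; only is and wine appear in the module above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Resource:
    """Language-specific resources plugged into the shared code (hypothetical)."""
    this: dict       # gender -> demonstrative determiner
    copula: str
    wine: tuple      # (noun, gender)
    delicious: str

def make_foods(res):
    """Shared code (the 'functor'): written once against the Resource interface."""
    def this(kind):
        noun, gender = kind
        return f"{res.this[gender]} {noun}"

    def is_(item, quality):
        return f"{item} {res.copula} {quality}"

    return {"this": this, "is": is_, "wine": res.wine, "delicious": res.delicious}

# Instantiating the functor: one line of language-specific data per language.
ENG = make_foods(Resource({"n": "this"}, "is", ("wine", "n"), "delicious"))
GER = make_foods(Resource({"m": "dieser", "f": "diese", "n": "dieses"},
                          "ist", ("Wein", "m"), "köstlich"))

def example(foods):
    """Linearize the tree `is (this wine) delicious` with the given grammar."""
    return foods["is"](foods["this"](foods["wine"]), foods["delicious"])

print(example(ENG))
print(example(GER))
```

Adding a language means adding one more Resource value; make_foods itself is left untouched.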

Presenter Notes

Benefits of the GF framework

  • support for multilinguality
  • support for both parsing and generation
  • grammar engineering tools and guidelines
  • parser/translator with look-ahead support
  • library of pre-implemented linguistic knowledge for ~30 languages
  • large lexicons (for some languages)

Presenter Notes

ACE-in-GF

Presenter Notes

Motivation

  • implement ACE in a multilingual way
    • ACE-based modeling and ACE tools become available to speakers of other natural languages
    • test how English-specific ACE is
  • implement ACE in a different and more general formalism
  • gain new tools
    • tree-based sentence editor
    • access to other programming languages (C, Python, Javascript, ...)

Presenter Notes

  • the existing ACE implementation (APE) in Prolog
  • the existing AceWiki look-ahead parser in Codeco

Multilingual ACE

An ACE grammar implemented in GF adds multiple natural languages as front-ends to ACE. As a result, these languages can be mapped to and from various formal languages already supported by ACE.

Presenter Notes

German <-> ACE <-> OWL

German

Jedes Land, das nicht an ein Meer grenzt, ist ein Binnenland.

ACE-in-GF tree

baseText (sText (s (vpS (everyNP (relCN (cn_as_VarCN country_CN)
  (neg_predRS which_RP (v2VP border_V2 (thereNP_as_NP
   (aNP (cn_as_VarCN sea_CN))))))) (npVP (thereNP_as_NP
    (aNP (cn_as_VarCN landlocked_country_CN)))))))

ACE

Every country that does not border a sea is a landlocked-country.

OWL

SubClassOf(
   ObjectIntersectionOf(
      :country
      ObjectComplementOf(
         ObjectSomeValuesFrom( :border :sea )
      )
   )
   :landlocked-country
)

Presenter Notes

Implementation of ACE-in-GF

  • extension of Angelov and Ranta, "Implementing Controlled Languages in GF" (CNL 2009)
  • implementation of the ACE syntax
    • focus on the subset of ACE that can be mapped to OWL
    • about 100 syntactic functions reflecting the reference implementation of ACE (APE)
    • no direct generation of discourse representation structures (DRS)
    • almost 100% coverage at almost 0% ambiguity (formally tested for ACE)
  • multilinguality
    • support most RGL languages: Bulgarian, Catalan, Chinese, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Italian, Latvian, Maltese, Norwegian, Polish, Romanian, Russian, Spanish, Swedish, Thai, Urdu
    • RGL-based design provides automatic increase in quality and language-coverage over time
  • lexicon
    • user-specified lexicon as needed by ACE applications
    • rely on RGL smart paradigms and large lexicons

Presenter Notes

ACE-in-GF translation example

ACE: every person that speaks a language X does not forget X .
Bul: всеки човек който говори език X не забравя X .
Cat: cada persona que parla una llengua X no oblida X .
Chi: 说 一 种 X 语 言 的 每 个 人 没 忘 X 。
Dan: hver person , som taler et sprog X glemmer ikke X .
Dut: elke persoon , dat een taal X spreekt vergeet niet X .
Est: iga inimene , kes räägib keelt X ei unusta X .
Fin: jokainen henkilö , joka puhuu kieltä X ei unohda X:ää .
Fre: chaque personne qui parle une langue X n' oublie pas X .
Ger: jede Person , die eine Sprache X spricht vergißt X nicht .
Gre: κάθε πρόσωπο που μιλά μία γλώσσα τον X δεν ξεχνά τον X .
Hin: हर [person_CN] , जो [language_CN] X बोलता है X नहीं भूलता है .
Ita: ogni persona che parla una lingua X non dimentica X .
Mlt: kull persuna , li jkellem lingwa X ma jinsix X .
Lav: ikviena persona , kas saka valodu X neaizmirst X .
Nor: hver person , som snakker et språk X glemmer ikke X .
Pol: każda osoba , która rozmawia z językiem X nie zapomina X .
Ron: orice persoană care vorbeşte o limbă X nu îl uită pe X .
Rus: каждый лицo , который говорит на языке X не забывает X .
Spa: cada persona que habla una lengua X no olvida X .
Swe: varje person , som talar ett språk X glömmer inte X .
Tha: บุคคล ทุก คน ที่ พูด ภาษา X ไม่ ลืม X
Urd: ہر شخص , جو زبان X بولتا ہے X نہیں بھولتا ہے

Presenter Notes

RGL-based implementation

Import most of the implementation from the RGL using a functor.

  • adding a new language is easy
  • profit from continuous improvements in the RGL

Presenter Notes

ACE-in-GF: Issues

  • precision problems (over-generation), which are visible in the look-ahead editor
    • anaphoric references do not obey DRS accessibility constraints (e.g. Every man likes the woman.)
    • some language-independent features (e.g. ACE NP types) are hard to describe in the abstract syntax
  • coverage problems in some languages
    • e.g. verb phrase coordination not available in the core RGL
    • can be handled in the tool (e.g. AceWiki-GF) using paraphrasing, etc.
  • ambiguity problems in some languages
    • e.g. missing determiners in Finnish
    • can be handled in the tool (e.g. AceWiki-GF) using disambiguation dialogues
  • by default, sentences use the same structure in every language
    • extra work and linguistic competence is needed to override this (possibly using Extra-modules)
  • lack of smart paradigms and/or large lexicons for some languages

Presenter Notes

More development effort has gone into German, Spanish and Finnish. The other implementations have holes in their coverage of those ACE constructs that the RGL does not provide. TODO: lexical ambiguity: French river example. ACE NP types could be modeled with dependent types, but this is not compatible with the look-ahead algorithm. ACE has an open vocabulary (unlike the MOLTO Phrasebook), and it is not easy for the user to create correct lexical entries.

Evaluation of ACE-in-GF: Design

  • picked 10 languages which had a complete implementation
  • generated ~100 ACE sentences/questions and automatically translated them to all the languages, ensuring
    • full coverage of all the grammar functions
    • large coverage of OWL axiom structures (subclass, range, domain, transitivity, ...)
  • measured translation accuracy from ACE to other languages
  • used Google Translate as the baseline
  • ... and 20 human evaluators (2 per language) as the gold standard

Presenter Notes

Evaluation of ACE-in-GF: Results

  • participants preferred ACE-in-GF translations to Google translations and post-edited them less
  • many edits were just stylistic
    • e.g. users preferred elliptical sentences but these are not allowed in ACE
  • some languages performed clearly better, e.g. Finnish, German, Dutch

Presenter Notes

Evaluation of ACE-in-GF: Results


Presenter Notes

AceWiki-GF
A Multilingual CNL-based Semantic Wiki

Presenter Notes

Multilingual CNL-based Semantic Wiki

  • multiple languages
    • natural: English, German, ...
    • formal: first-order logic, OWL, ...
    • languages for content vs user interface
  • CNL-based
    • backed by formal grammar(s)
    • formal languages are hidden
  • semantic
    • content automatically kept in sync via precise translation
    • consistency checking, question answering, ... (depending on the domain)
  • wiki
    • user-friendly
    • collaborative

Presenter Notes

Motivation

for adding GF to AceWiki

  • increase user-friendliness for non-English speakers
  • experiment with a more general CNL setting
    • content not necessarily based on ACE and OWL
  • experiment with more aspects of collaboration
    • ambiguity handling
    • full grammar editing

Presenter Notes

Possible use cases

  • multilingual ontology editor
    • e.g. environment where users agree on the content and multilingual vocabulary of an OWL-style ontology (for a certain domain, e.g. geography)
    • like AceWiki, but multilingual
  • catalog of museum objects
    • each object (e.g. painter, painting) on its own wiki page
    • rich queries (e.g. "which Dutch painter painted which French painter?")
    • like previous, but more focus on instance data and multimedia content
  • tourist phrasebook
    • book structure (chapters and sections)
    • multilingual content presented in parallel (at least 2 languages)
    • e.g. based on the MOLTO Phrasebook grammar
    • like AceWiki, but different UI and no reasoning
  • other
    • collection of math exercises
    • ...

Presenter Notes

AceWiki integration with GF

  • wiki content is based on a (single) GF grammar
    • provided by GF Webservice / Cloud service
  • a wiki entry is a set of GF abstract trees
    • viewed via linearization(s)
    • can represent ambiguity
  • multilingual viewing and editing of wiki content
    • grammar-based look-ahead editing that shows next possible tokens
    • ambiguity resolution via another concrete language
  • grammar integrated into the wiki
    • GF grammars are very modular
    • grammar modules as wiki articles (wiki-linking of grammar and content)
    • grammar can be changed while editing the wiki
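
What a look-ahead editor computes can be sketched as follows. This Python example assumes a small finite language, so it can simply enumerate all sentences; GF's predictive parser instead works incrementally on the grammar. SENTENCES and next_tokens are illustrative names.

```python
# All sentences of a small finite language (the English unit-conversion
# grammar used earlier in the tutorial), already tokenized by spaces.
SENTENCES = [
    "how much is mile in mile ?",
    "how much is mile in nautical mile ?",
    "how much is nautical mile in mile ?",
    "how much is nautical mile in nautical mile ?",
]

def next_tokens(prefix, sentences=SENTENCES):
    """Return the sorted tokens that can follow the token list `prefix`."""
    n = len(prefix)
    return sorted({
        tokens[n]
        for tokens in (s.split() for s in sentences)
        if tokens[:n] == prefix and len(tokens) > n
    })

print(next_tokens("how much is".split()))
print(next_tokens("how much is mile".split()))
```

The editor would offer exactly these tokens as the next possible choices, so the user can never leave the language defined by the grammar.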

Presenter Notes

AceWiki-GF user interface

  • UI not (yet) customizable for any specific application (e.g. MOLTO Phrasebook)
    • i.e. most suitable for ACE-based wikis
  • language-switching menu
    • displays the wiki articles and sentence editor in the given language
    • user interface labels reflect the content language
  • grammar editor
    • table-based editor for lexical functions
    • basic full grammar editor
  • ambiguity resolution dialog

Presenter Notes

ACE-based geography wiki

  • main use case developed in the MOLTO project
  • uses ACE-in-GF as the underlying grammar

Presenter Notes

ACE-based geography article

Presenter Notes

Depicted are the ACE version and the German version (containing the look-ahead editor).

Note that the UI is language dependent.

Ambiguity resolution

A disambiguation dialog is shown only if the entry was added in another language and the trees of the ambiguous entry have different linearizations in the viewed language.
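
The condition above can be written down as a small predicate. The Python sketch below is an assumption about how such a check might look, not the AceWiki-GF code; Entry, needs_dialog and the toy linearizer (which reuses the unit-conversion example and keeps only the first variant) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    """A wiki entry: the language it was written in plus its set of trees."""
    added_in: str
    trees: tuple

def needs_dialog(entry, viewed, linearize):
    """Decide whether to show the disambiguation dialog.

    `linearize(tree, language)` is a caller-supplied function returning a string.
    """
    if entry.added_in == viewed:
        return False  # the entry is shown in the language it was written in
    # Ambiguity only matters if the trees read differently in the viewed language.
    return len({linearize(t, viewed) for t in entry.trees}) > 1

# Toy linearizer for the unit-conversion grammar (first variant only).
WORDS = {
    "Eng": {"land_mile": "mile", "nautical_mile": "nautical mile"},
    "Wolfram": {"land_mile": "mile", "nautical_mile": "nmi"},
}
TEMPLATE = {"Eng": "how much is {} in {} ?", "Wolfram": "convert {} to {}"}

def linearize(tree, language):
    _, x, y = tree
    return TEMPLATE[language].format(WORDS[language][x], WORDS[language][y])

entry = Entry("Eng", (("unitconv", "nautical_mile", "land_mile"),
                      ("unitconv", "nautical_mile", "nautical_mile")))
print(needs_dialog(entry, "Wolfram", linearize))
print(needs_dialog(entry, "Eng", linearize))
```

If all trees happen to share one linearization in the viewed language, the reader cannot observe the ambiguity there, so no dialog is needed.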

Presenter Notes

Ambiguity between object and subject relative clause. Occurs in German and Dutch. The wiki users can choose the correct tree by looking at the tree set in a language other than German, e.g. DisambGer (if it exists).

Lexicon modules as a table

Presenter Notes

Grammar module page

Presenter Notes

GF source editing is available in the GF Cloud Service. AceWiki-GF just reflects that. Some types of errors can be pinpointed.

Automatic question answering


Presenter Notes

AceWiki-GF: Technologies

  • web application written in Java
  • Echo Web Framework
  • OWL API (managing the ontology and reasoners)
  • ACE Parser (conversion to OWL)
  • GF Webservice (predictive parsing, translation, grammar compilation)

Presenter Notes

AceWiki-GF: Dependencies

Presenter Notes

Evaluation of AceWiki-GF: Design

  • developed a 500-word geography lexicon
    • 3 languages: English, German and Spanish
    • 3 authors (incl. native speakers of German and Spanish, and a GF engineer)
    • avoid lexical ambiguity
  • asked users of different languages to supply the wiki with sentences and tag each as true or false
  • then asked them to evaluate others' sentences as true or false
  • measured the user (dis)agreement and how much it is influenced by the automatic translation
    • Hypothesis: A group of users reaches almost the same level of agreement on the content of an article presented to them in different languages as when the article is presented to all of them in the same language.
  • asked them for general feedback via a questionnaire

Presenter Notes

Evaluation of AceWiki-GF: Results

  • 30 participants entered 316 sentences
  • almost all the syntactic functions were used (i.e. they were discoverable via the look-ahead editor)
  • agreement level was ~83% with no significant influence from the translation
  • AceWiki-GF user interface was found to be easy to use
  • 80% reported that the CNL did not let them express everything that they wanted
    • missing content words
    • some grammar restrictions were confusing or presented confusingly in the look-ahead editor

Presenter Notes

Not used: negated object relative clause: "... that a country does not border", variable in apposition "a country X".

Future work

  • generalize to handle other types of grammars and reasoning
  • tree-based (language-neutral) sentence construction
  • optimize the user interface to other types of content
  • improve collaborative grammar editing features
  • improve ambiguity management (e.g. automatic reasoning-based ambiguity resolution)
  • use the wiki content to automatically generate documentation, grammar fragments, look-ahead editor customizations, etc. for novice users
  • more evaluation needed

Presenter Notes

Get involved!

Presenter Notes

Links

Presenter Notes

Thank You!
Questions?

Presenter Notes