Collaborative Multilingual
Knowledge Engineering Based on
Controlled Natural Language
Kaarel Kaljurand
Institute of Computational Linguistics, University of Zurich
College of Graduate Studies and the Academy of African Languages and Science,
University of South Africa

--- # About myself - studied at the University of Tartu, Estonia - worked 2002-2013 at the University of Zurich, Switzerland - mainly in projects related to controlled natural languages - REWERSE (2004-2008): using Attempto Controlled English (ACE) as a semantic web language - MOLTO (2012-2013): building a multilingual semantic wiki based on ACE and Grammatical Framework (GF) - joint work with: Norbert E. Fuchs (group leader) and Tobias Kuhn - other interests: - grammar-based speech recognition applications - Estonian grammar in GF - biomedical text mining - see also: ## Presenter Notes - PhD on mapping between ACE and Semantic Web languages - recently interested in GF, applied to ACE, speech recognition, and Estonian --- # Overview of the tutorial - semantic web (SW) - knowledge representation languages (OWL, SWRL) and reasoning - user interfaces for the SW (Protégé, Semantic MediaWiki) - controlled natural languages (CNLs) - Attempto Controlled English (ACE) - its construction and interpretation rules, discourse representation - ACE as a user interface language for the SW - translating between ACE and OWL - end-user tools: ACE View, AceWiki - Grammatical Framework (GF) - GF as a framework for defining multilingual CNLs - resource grammar library (RGL) - ACE-in-GF: ACE in ~20 languages - AceWiki-GF: multilingual collaborative knowledge editor ### Presenter Notes Questions are welcome during the tutorial. --- # Semantic Web --- # Semantic Web - original vision by Berners-Lee, Hendler, Lassila (2001) - smart personal agents that book flights and negotiate hotel prices - concepts are denoted by URIs - ontologies relate concepts to each other - simple knowledge representation language Resource Description Framework (RDF) - knowledge as a set of URI triples ``, e.g. - ``, ``, ... 
- highly-expressive ontology and rule languages - OWL (1st version from 2004, 2nd in 2009), SWRL (2004), RIF (2010) - fragments of first-order logic, OWL based on description logics (DLs) - core reasoning tasks have decidable algorithms - focus on reasoning efficiency (=> define efficient OWL _profiles_: EL, QL, RL) - ~2008: web of linked open data ("little semantics goes a long way") - publish data as RDF and access it using the query language SPARQL - use very little OWL reasoning ## Presenter Notes Systems like Siri are not too far from fulfilling the original vision. SW languages allow one to state various terminological and rule-like statements Multilinguality is not mentioned. --- # Semantic Web stack ## Presenter Notes OWL and SWRL as an extension to the taxonomy language of RDFS. Actually hard to say what relation does the alignment of blocks denote, maybe simply "was developed before". --- # Expressive SW languages: OWL - formal language for defining terms (URIs) via other terms - entities: classes (aka concepts), individuals, object and data properties (roles, relations) - simple and complex classes and properties - `uri://.../animal` # denotes the class of animals - `uri://.../eats` # denotes the relation of eating - `(uri://.../animal AND uri://.../African)` # "African animal" - `InverseOf(uri://.../eats)` # the relation of being eaten by something - complex classes can use: negation, conjunction, disjunction, existential and universal restrictions, cardinality restrictions - `(african AND animal) OR (NOT (InverseOf(eat) SOME animal))` - axioms (ontology is a set of axioms) - `lion SubClassOf animal` - `(african AND animal) SubClassOf (InverseOf(eat) SOME lion)` ## Presenter Notes - format = well-defined syntax and semantics (unlike English) - semantically description logic (DL) SROIQ (fragment of first-order logic) - annotations for describing the meaning of entities and axioms informally in natural language(s) --- # The structure of OWL (2009) --- # 
Syntax of OWL - many different syntaxes - RDF/XML - OWL/XML - Turtle - functional syntax - Manchester syntax - some based on the RDF idea of triples (RDF/XML, Turtle) - some based on description logic style syntax (functional-style, Manchester) - in this tutorial: functional-style and Manchester syntax, e.g. - `SubClassOf(:country ObjectSomeValuesFrom(:border :country))` - `country SubClassOf border SOME country` - also in this tutorial: ACE-based syntax, e.g. - `Every country borders a country.` --- # Semantics of OWL The meaning of OWL elements is defined via set theory. OWL | set theory | example --- | ---------- | ------- individual | instance of some set | Switzerland class | set of instances | country object property | set of pairs `` | bordering data property | set of pairs `` | population `C1 AND C2` | intersection of `C1` and `C2` | EU-country and NATO-country `C1 OR C2` | union of `C1` and `C2` | EU-country or NATO-country `NOT C` | complement set of `C` | not an EU-country `P SOME C` | set of things that have a `P`-property with a member of class `C` | something that flows into a lake ... | ... | ... `DifferentIndividuals(I1 I2)` | `I1` and `I2` stand for different instances | Switzerland is not Germany `ClassAssertion(C I)` | `I` is an instance of `C` | Switzerland is a country `SubClassOf(C1 C2)` | `C1` is a subset of `C2` | every EU-country is a country `Domain(C P)` | the subject of `P` is in `C` | everything that borders something is a country `SubPropertyOf(P1 P2)` | `P1` is a subset of `P2` | if X borders Y then X neighbors Y. ... | ... | ... ### Presenter Notes `P only C` | set of things whose all `P`-properties lead to a member of class `C` | something that flows only into a lake Axioms related classes, properties, individuals with each other in various ways. --- # Semantics of OWL --- # Semantics of OWL: UNA and OWA __Unique Name Assumption (UNA)__: differently named individuals are necessarily different instances of the domain. 
OWL __does not__ make the UNA. __Open World Assumption (OWA)__: unstated facts are not assumed to be false. OWL makes the OWA. Thus, OWL is different from some database/knowledge base languages (SQL, Prolog) where a missing assertion means that it is false rather than unknown. --- # T-Box and A-Box __T-Box__ - terminological statements / universal statements (universal quantification in first-order logic), e.g. - Every EU-country is a country. - If X borders Y then Y borders X. - conceptually complex (domain, range, reflexivity, inheritance, ...) - usually small and static __A-Box__ - instance data, e.g. - Switzerland is not a EU-country. - Switzerland borders Germany. - conceptually much simpler - usually very large and often changing --- # Expressive SW languages: SWRL - rule-like statements with explicit variables - OWL: `eu-country SubClassOf country` - SWRL: `IF eu-country(?X) THEN country(?X)` - more expressive extension of OWL - OWL does not support unrestricted variable sharing - SWRL: `IF man(?X) AND car(?Y) AND own(?X, ?Y) THEN like(?X, ?Y)` - English: Every man that owns a car likes __the__ car. - OWL approx.: `(man AND own SOME car) SubClassOf like SOME car` - SWRL representation of OWL approx.: `IF man(?X) AND car(?Y) AND own(?X, ?Y) THEN like(?X, ?Z) AND car(?Z)` - still a fragment of first-order logic, but reasoning is not decidable --- # Automatic reasoning with OWL - reasoning helps to detect modeling errors during development time - e.g. necessarily empty classes should be avoided - reasoning can be used during runtime to classify new instances - reasoning tasks - is the ontology consistent? - which axioms does the ontology entail? (e.g. classification = entailment of simple `SubClassOf`-axioms) - is a class satisfiable (i.e. can it contain individuals)? - which (smallest) subsets of the ontology cause a class to be unsatisfiable? - which classes and individuals does the given class contain? (DL Query) - reasoning is decidable, i.e.
will always solve the task in finite time (although sometimes very slowly) - many off-the-shelf reasoners: Pellet, HermiT, FaCT++, ... - integrated into end-user tools (Protégé, Semantic MediaWiki, AceWiki, ...) --- # Reasoning example (written in ACE for easier readability) _Ontology_ Every country is a territory. If X borders Y then Y borders X. If X borders something then X is a country. Germany borders Switzerland. _Entailments_ Switzerland borders Germany. Switzerland is a country. Switzerland is a territory. Germany is a country. Germany is a territory. It is false that Germany does not border Switzerland. ... _Unknown truth value_ Switzerland is not Germany. # no UNA in OWL Switzerland borders exactly 1 territory. Switzerland borders itself. ... ## Presenter Notes Entailments are axioms that follow from the ontology but are not explicitly part of it. --- # OWL: Usage examples - SNOMED Clinical Terms - large (300k concepts) collection of medical terms, with NL labels and formal relations - multilingual labels on concepts, currently English (~1m labels), Spanish, Danish, Swedish - EL fragment of OWL (excludes disjunction, negation, cardinality constraints, ...) - `Common-cold SubClassOf causative-agent SOME Virus` - used in >50 countries - content management with semantic wikis - e.g.
Semantic MediaWiki, OntoWiki - small subset of OWL considered for reasoning - DBpedia - structured part of Wikipedia (infoboxes) - background ontology (SubClassOf, domain, range) - multilingual ### Presenter Notes See also: --- # User interfaces for the Semantic Web - semantic web is for both machines (agents) and humans - humans require a user-friendly language to interact with the SW - RDF's triple-based data model is very simple - does not directly allow for a user-friendly syntax for complex structures - originally was written in RDF/XML (unnecessarily verbose) - OWL (version 2009) syntax is much more elegant (based on Description Logics) - still remote from natural language - lots of syntactic sugar for various types of SubClassOf-constructs - separate ontology, rule, and query languages - little focus in the SW community on UI issues, see e.g. D. Karger ESWC 2013 keynote "A Semantic Web for End Users" - --- # Two examples of SW tools --- # Protégé - open-source and widely used OWL editor (actually predates OWL) - user interface - sub class tree hierarchy with drag-and-drop - otherwise reflects the OWL axiom types 1-to-1 - integrates Manchester OWL Syntax (MOS) for expressing classes and axioms - focus on T-Box editing - integrates reasoners - query answering - entailment explanation - WebProtégé: simpler web-based editor with focus on collaborative editing (e.g. revision history support) --- # Protégé --- # Protégé: DL Query --- # Protégé: Entailment explanation Smallest subset that explains why `mad-cow` is an empty class, indicating that the ontology contains a modeling error.
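The idea of a smallest explaining subset can be sketched in Python for a very small fragment of OWL (only `SubClassOf` between named classes and pairwise disjointness). The axioms below are hypothetical and are not taken from the ontology shown in Protégé; real reasoners use far more refined algorithms than this brute-force search.

```python
from itertools import combinations

# Hypothetical toy ontology: ("sub", A, B) means A SubClassOf B,
# ("disjoint", A, B) means A and B share no instances.
AXIOMS = [
    ("sub", "mad-cow", "cow"),
    ("sub", "cow", "vegetarian"),
    ("sub", "mad-cow", "sheep-eater"),
    ("sub", "sheep-eater", "meat-eater"),
    ("disjoint", "vegetarian", "meat-eater"),
    ("sub", "cow", "animal"),
]


def superclasses(cls, axioms):
    """All classes that cls is (transitively) a subclass of, including itself."""
    seen = {cls}
    changed = True
    while changed:
        changed = False
        for kind, a, b in axioms:
            if kind == "sub" and a in seen and b not in seen:
                seen.add(b)
                changed = True
    return seen


def is_unsatisfiable(cls, axioms):
    """cls is necessarily empty if two of its superclasses are disjoint."""
    supers = superclasses(cls, axioms)
    return any(
        kind == "disjoint" and a in supers and b in supers
        for kind, a, b in axioms
    )


def smallest_explanation(cls, axioms):
    """Try subsets by increasing size; return the first that makes cls empty."""
    for size in range(1, len(axioms) + 1):
        for subset in combinations(axioms, size):
            if is_unsatisfiable(cls, subset):
                return list(subset)
    return None


explanation = smallest_explanation("mad-cow", AXIOMS)
for axiom in explanation:
    print(axiom)
```

In this toy ontology the explanation leaves out the `cow SubClassOf animal` axiom, since it plays no role in making `mad-cow` empty.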
--- # Wikis and Semantic wikis __Wiki__ - user-friendly collaborative environment for knowledge management - content typically unconstrained natural language, therefore not easily automatically processable - powered by software, e.g. MediaWiki, best known example: Wikipedia __Semantic wiki__ = wiki + formal semantics - provides a richer query language + improves consistency by reducing redundancy: the same information stated only once - content typically in natural language enriched with typed links (i.e. RDF triples) - user interface - wiki-style: plain text with embedded meta information - forms for filling templates - software: Semantic MediaWiki, Freebase, ... ## Presenter Notes Shortcomings: cannot copy content from one language to the other, cannot ask questions, cannot check that the different versions of an article in different languages are about the same thing. --- # Semantic MediaWiki - extension of MediaWiki / Wikipedia - add: "semantic annotations" structured formats for - attaching attributes to concepts - representing lists, tables, etc. - representing maps, timelines, calendars, etc. - queries that automatically generate wiki pages - supports subset of OWL for query answering - subclass, subproperty, inverse property, equality - excludes e.g.: transitivity, domain/range, cardinality restrictions - focus on A-Box editing --- # Semantic MediaWiki Example: sentence in an article _Berlin_ is entered as Berlin is the capital of [[Is capital of::Germany]] and has a population of [[Has population::3,292,365|3.3 million]]. and is rendered as Berlin is the capital of Germany and has a population of 3.3 million. and adds two triples to the underlying model --- # Semantic MediaWiki: Query Which countries are located in Africa? 
!bash {{#ask: [[Category:Country]] [[located in::Africa]] | ?Area#km² | ?Name | ?Population | sort = population | order = descending }} This is rendered as a sortable table listing links to the pages of category _Country_ (or some of its subcategories) that contain the property _located in_ (or some of its subproperties) with the value _Africa_. ## Presenter Notes The value is that one can easily insert this table (possibly with some modifications/customizations) into various pages without having to type it in from scratch. --- # Multilingual Semantic Web ## Presenter Notes In general, there has not been much work on it. --- # Multilingual SW: Ontology verbalization - RDF, OWL, ... keywords are in English - most ontologies are written with English entity names - recommendation (in biomed ontologies) to name entities in a language-neutral way (with ID codes) - entity and axiom annotations provide a way to include free-form natural language in the ontologies - e.g. via `rdfs:label` with the ISO language tag - The Lexicon Model for Ontologies (lemon) - new proposal for annotating ontologies with linguistic information - each entity is mapped to its corresponding lexical entries specifying the word forms and part-of-speech categories (for different languages) --- # Multilingual SW: Semantic wikis - wikis: multilingual articles but interlinked only at the document level - Wikimedia: proposal for multilingual Wikipedia - - AceWiki-GF: a prototype for creating ontologies in a multilingual way --- # Links - the original semweb paper - The Semantic Web. T Berners-Lee, J Hendler, O Lassila. Scientific American (2001) - OWL specification: - Protégé: - Semantic MediaWiki: - Dagstuhl Seminar 12362 (2012): The Multilingual Semantic Web - - lemon: --- # CNLs and ACE --- # Controlled Natural Languages - motivation - natural language with tool support, e.g.
auto-completion, paraphrasing, consistency checking, question answering, translation - wide-coverage parsers achieve only ~90% accuracy, and only on simpler tasks (e.g. detecting subjects and objects) - formal syntax, but fragment of a natural language - formal semantics - ambiguity handling - translations to other (natural or formal) languages - reasoning, paraphrasing, etc. - many languages with different goals and formal properties - see _Tobias Kuhn. A Survey and Classification of Controlled Natural Languages, Computational Linguistics_ - well-known example: Attempto Controlled English (ACE) - frameworks: Grammatical Framework (GF) ## Presenter Notes usage: UI for formal logics in cases where domain experts lack a formal background --- # Attempto Controlled English (ACE) - subset of English - controls ambiguity and synonymy - A dog hates a cat. == There is a dog that hates a cat. - A dog hates a cat. =/= Every dog hates a cat. - Every dog hates a cat. == If there is a dog then the dog hates a cat. - end-user documentation: construction and interpretation rules, as restrictions of English - construction rules, interpretation rules, ... - easy to learn - well-defined translations to first-order logic, OWL, ... - tool support for end-users - developed in the Attempto project at the University of Zurich, led by Norbert E. Fuchs --- # ACE: Language overview - subset of natural English - conjunction (_and_), disjunction (_or_), negation (_no_, _not_, _it is false that_), _if-then_, ... - anaphoric references: pronouns (_he_, _it_), definite noun phrases (_the man_), variables (_X_, _Y123_) - quantifiers: _every_, _no_, _at least 3_, ...
- content words: proper names, common nouns, verbs, adjectives, adverbs - grammar is fixed, but users can change content words - deterministic ambiguity handling - anaphora resolution (`France borders Spain and it borders Portugal.`) - quantifier scope (`Every country borders a country.`) - attachment (`Every EU-country borders a country that is a EU-country and is a NATO-country.`) --- # Example: Syntactic restrictions English: _Koalas eat eucalyptus leaves._ Violates the ACE construction rules: - every noun must have a determiner ('every', 'a', 'at least 2', ...) - multi-word nouns are not allowed Correct ACE: - Every koala eats an eucalyptus-leaf. - If a koala eats something X then X is an eucalyptus-leaf. - ... --- # Example: Semantic restrictions English: _Every EU-country borders a country that is a EU-country and is a NATO-country._ The attachment differences must be expressed in a lexically different way in ACE: - Every EU-country {borders a country that is a EU-country} and {is a NATO-country}. - Every EU-country borders a country {{that is a EU-country} and {**that** is a NATO-country}}. (The {curly brackets} are not part of ACE. They are used here to denote the scopes.) --- # Discourse Representation Structure _The_Mediterranean_Sea is a sea. Every country that does not border a sea is a landlocked-country. Switzerland does not border the sea._ - DRS - interpreted as a first-order logic formula - ACE content words map to atomic conditions - ACE function words (and, or, not, if-then) introduce various complex DRS "boxes" - variables identify content words - variables denote anaphoric references to nouns, the DRS box structure reflects accessibility constraints on references --- # Syntactic sugar It is possible to express the same meaning (the same DRS) in syntactically different ways, e.g. _every_ = _if-then_: - Every koala eats an eucalyptus-leaf. # M1 - If there is a koala then the koala eats an eucalyptus-leaf. 
# M1 - Everything that a koala eats is an eucalyptus-leaf. # M2 - If a koala eats something X then X is an eucalyptus-leaf. # M2 _relative clause_ = _coordination_ - Every mammal that is endemic-to Australia is a marsupial or is a .... # M3 - If a mammal is endemic-to Australia then the mammal is a marsupial or is a .... # M3 - If there is a mammal and the mammal is endemic-to Australia then the mammal is a marsupial or is a .... # M3 (`M{1,2,3}` indicate sentences with the same DRS.) --- # Content word lexicon Content words are user-defined. They are classified by their word class and have one or more surface forms. - nouns: 2 forms: singular: man, plural: men - proper names: 1 form: singular: John - verbs - intransitive (wait), transitive (eat), ditransitive (give) - 3 forms: 3rd person finite: eats, infinitive: eat, past participle: eaten - adjectives - intransitive (rich), transitive (fond-of) - 3 comparison forms (rich, richer, richest) - adverbs (3 comparison forms) The lexicon allows the definition of aliases but does not otherwise provide any semantics, e.g. statements like "Every man is a human." can be made only at the level of ACE. --- # Tools - ACE parser (APE): maps ACE to DRS - DRS verbalizer: maps DRS to ACE - the roundtrip: ACE -parser-> DRS -verbalizer-> ACE usually results in a syntactically different ACE text because of syntactic sugar - DRS translators: map DRS to - OWL/SWRL - TPTP - ... - look-ahead editor for a subset of ACE - native ACE reasoner: RACE - OWL verbalizer: maps OWL to ACE - end-user ACE / SW ontology editors: - ACE View - AceWiki --- # Attempto Parsing Engine (APE) - tokenizes and parses the given ACE text - resolves anaphora - translates to one or more logical forms (TPTP, OWL/SWRL, ...) via DRS - paraphrases the input in ACE via DRS - accessible from Prolog, Java, HTTP, command line __Command-line example.__ Parse a sentence into the DRS form. !bash $ echo "No dog is a cat."
| ape.exe -solo drspp [] [A] object(A,dog,countable,na,eq,1)-1/2 => [] NOT [B,C] object(B,cat,countable,na,eq,1)-1/5 predicate(C,be,A,B)-1/3 --- # ACE as a semantic web language --- # Motivation - user-friendly language because it is based on English - standard English instead of various formal notations - easy to read (and read out loud) - easy to write (?) - single language instead of 3 different languages for ontologies, rules and queries - different (more natural) motivation for the syntactic sugar - usage: - verbalizing existing ontologies - writing new and modifying existing ontologies - querying existing ontologies - presenting entailments and entailment explanations --- # Mapping between ACE and OWL/SWRL ACE | OWL/SWRL --- | -------- proper name | individual noun | named class intransitive adjective | named class noun phrase (with relative clause) | complex class transitive verb with NP object | property transitive verb with data object | data property transitive adjective | property sentence | OWL axiom or SWRL rule (SWRL rule as a fallback) question (with 1 query word) | DL-Query text | ontology ## Presenter Notes - parse ACE into DRS (FOL-formula) - translate DRS to OWL, or if this fails then to SWRL, or return an error message if this fails --- # ACE sentence and question in OWL _Every country that does not border a sea is a landlocked-country._ !haskell SubClassOf( ObjectIntersectionOf( :country ObjectComplementOf( ObjectSomeValuesFrom( :border :sea ) ) ) :landlocked-country ) _Which country is a landlocked-country?_ !haskell ObjectIntersectionOf( :country :landlocked-country ) --- # Verbalizing OWL ontologies as ACE texts OWL contains many shorthand constructs (`DisjointClasses`, `ObjectPropertyDomain`) with no direct ACE counterpart !haskell DisjointClasses( :country ObjectHasSelf( :border ) ) Directly reflecting the keywords and axiom structure would give something like: Countries and self-borderers are disjoint.
ACE-style verbalization gives a more natural: No country borders itself. --- # OWL Verbalizer Rewrite an OWL axiom, e.g. !haskell ObjectPropertyDomain( write human ) into something with a more suitable structure (but preserving meaning): !haskell SubClassOf( ObjectIntersectionOf( owl:Thing ObjectSomeValuesFrom( write owl:Thing ) ) human ) then directly map it to ACE (with a Prolog DCG grammar): Everything that writes something is a human. --- # OWL Verbalizer: Issues - OWL naming conventions for entities - usually: nouns for classes, verbs for properties - sometimes contain complex structure: `nonNormal`, `MaleOrFemalePatient` - simple one-to-one mapping of axioms to sentences - other approaches also reorder/combine axioms --- # Evaluation: Comparison to MOS - Manchester OWL Syntax (MOS) - concise notation for OWL - part of the OWL (2009) standard - used in editors like Protégé - _Tobias Kuhn. The Understandability of OWL Statements in Controlled English (2013)_ - compare the ACE and MOS representation of wide variety of OWL axiom types - evaluation with 64 users (students, but not of computer science) - ask subjects to look at diagrams and evaluate corresponding ACE and MOS statements as true or false - result: ACE is easier to learn and understand - no evaluation of "writability" --- # Evaluation: Comparison to MOS --- # Evaluation: Comparison to MOS --- # ACE as a SW language: Issues - need to support word-forms (e.g. in the lexicon editor): - 2 for nouns: man, men - 3 for verbs: eats, eat, eaten by - English-specific - can be misleading (if interpreted as English, and not ACE) - non English speakers might prefer a more language-neutral notation - harder to implement tool support (parser, verbalizer, etc.) for ACE than for MOS - covering a wide variety of programming languages ### Presenter Notes short-coming shared with MOS: possibly harder to write, needs a look-ahead editor. ACE tool support exists for Prolog, but not for other languages. 
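Why the lexicon needs several word forms per entry can be illustrated with a toy verbalizer in Python. This is only a sketch: the entry layout, the helper names (`Noun`, `Verb`, `with_article`, `verbalize`) and the sample words are assumptions made for illustration, and the actual OWL verbalizer is a Prolog DCG that covers far more axiom types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    sg: str  # singular, e.g. "man"
    pl: str  # plural, e.g. "men" (needed for e.g. "at least 2 men")


@dataclass(frozen=True)
class Verb:
    fin: str  # 3rd person singular finite, e.g. "eats"
    inf: str  # infinitive, e.g. "eat" (needed for "does not eat")
    pp: str   # past participle, e.g. "eaten" (needed for "is eaten by")


# Hypothetical lexicon; the keys stand for OWL entity names.
LEXICON = {
    "human": Noun("human", "humans"),
    "author": Noun("author", "authors"),
    "book": Noun("book", "books"),
    "write": Verb("writes", "write", "written"),
}


def with_article(word):
    """Naive choice of the indefinite article (first letter only)."""
    return ("an " if word[0] in "aeiou" else "a ") + word


def verbalize(axiom):
    """Map a few OWL axiom types to ACE-like sentences."""
    kind, first, second = axiom
    if kind == "SubClassOf":
        sub, sup = LEXICON[first], LEXICON[second]
        return f"Every {sub.sg} is {with_article(sup.sg)}."
    if kind == "ObjectPropertyDomain":
        prop, cls = LEXICON[first], LEXICON[second]
        return f"Everything that {prop.fin} something is {with_article(cls.sg)}."
    if kind == "ObjectPropertyRange":
        prop, cls = LEXICON[first], LEXICON[second]
        return f"Everything that is {prop.pp} by something is {with_article(cls.sg)}."
    raise ValueError(f"unsupported axiom type: {kind}")


print(verbalize(("ObjectPropertyDomain", "write", "human")))
print(verbalize(("ObjectPropertyRange", "write", "book")))
print(verbalize(("SubClassOf", "author", "human")))
```

The domain axiom uses the finite verb form and the range axiom the past participle, so a lexicon that stored only the entity name `write` could verbalize neither.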
--- # Two ACE tools for the SW --- # ACE View - ACE-based OWL/SWRL editor, viewer, query interface - plug-in for Protégé 4+ - provides set of ACE "views" to the ontology - integration with other Protégé views - changes done via standard Protégé views are reflected in the ACE views - motivation: sometimes graphical UI is simpler to use than plain text - integrates both ACE->OWL/SWRL and OWL->ACE - ACE-based entailment explanation --- # ACE View: Snippets Presenting the complete ontology as a sortable/searchable set of ACE sentences. --- # ACE View: Entailments Explaining an entailment by a set of ACE sentences. --- # ACE View: Question answering Counting and presenting DL Query results. --- # ACE View vs Protégé Property axioms presented in ACE vs checkboxes+labels. --- # AceWiki
- semantic wiki engine - focus on user-friendliness and high expressivity of formal content - uses ACE for the content - OWL-compatible fragment of ACE (e.g. excludes prepositional phrases and adverbs) - user is syntactically assisted by a look-ahead editor - uses OWL for the reasoning language - supports different OWL profiles - more expressive than usually in semantic wikis - completely hidden from the users ### Presenter Notes Hiding the formal language enables one to use a more expressive fragment of it. --- # AceWiki: Screenshot AceWiki article about _landlocked country_ and the look-ahead editor. --- # AceWiki: Main features - model - content word = article (like in traditional wikis) - article = collection of statements - statement = declarative sentence or query or comment - declarative sentences and queries are written in an OWL-compatible subset of ACE - Codeco grammar which formally describes this subset and - ... makes it available via a look-ahead editor - UI for avoiding syntactically and semantically not supported inputs - automatic feedback - inconsistent sentences are tagged by red color - queries are answered by automatically populating the list of matching entities - collaborative editing of - lexicon - articles --- # AceWiki: Evaluation Two small usability experiments: - altogether 26 untrained participants - task: collaborative creation of a knowledge base - results: - 78%-81% of the sentences were correct and sensible - 61%-70% of them were complex (containing negations, implications, disjunctions or number restrictions) - creation of a correct sentence every 5-6 minutes - definition of a new word every 5-7 minutes => Even untrained users can effectively use AceWiki --- # AceWiki: Other usages AceWiki (or parts of it) have been used in various research projects, e.g. 
- Coral: a CNL-based query interface for annotated text corpora (UZH, Zurich) - uses the AceWiki sentence editor - does not use ACE, but a new CNL that maps to the ANNIS Query Language - no collaborative editing - source code: - AceCAPTCHA (AGH UST, Kraków) - CAPTCHA = Completely Automated Public Turing test to tell Computers and Humans Apart - AceWiki preloaded with terminological knowledge (entered by experts) - presents end-users with ACE questions about the wiki content to (1) distinguish them from computers (CAPTCHA) and to (2) populate the wiki with instance data (collaborative editing) - ACE for source code documentation (University of Chile) - AceWiki populated with ACE statements about software structure (classes, methods, inheritance, ...) - users can ask ACE questions about module dependencies etc. ### Presenter Notes Sometimes parts of AceWiki are reused to build new systems, rather than using the complete AceWiki as it is. Mention that AceWiki-GF is another application, discussed later. --- # AceWiki: Small case study Article on _Fynbos_ containing the original definition (in full English), its formal content (in ACE), and automatically generated upper concepts. --- # AceWiki: Small case study Agriculture glossary (= set of term-definition pairs) as a CNL-based semantic wiki - convert an existing glossary into the wiki format (wiki article for every term, every term considered to be a common noun) - represent the core information in the definition as a set of ACE sentences - purpose - educational: definition easier to understand for non-English speakers - sentences are formal: automatic semantic feedback - collaborative development of course content - sentences are easier to translate (future work) - results - several new terms (e.g. relations/verbs) were added - some ambiguity/vagueness was identified and resolved - correct and natural formulation is sometimes difficult to find --- # Links - some overview papers about CNL for SW - _Paul Smart.
Controlled natural languages and the semantic web. (2008)_ - _Schwitter et al. A comparison of three controlled natural languages for OWL 1.1. (2008)_ - Attempto project: - documentation, tools, demos, etc. - ACE-based SW tools - ACE View () - AceWiki () - other CNL-based editors for OWL - ROO () - Fluent Editor () - CLOnE --- # Grammatical Framework
as a CNL framework --- # Other CNLs (in comparison to ACE) ACE is general-purpose, English-based (and difficult to port to other NLs), with a fixed grammar and largely fixed interpretation. Other CNLs: - more domain-specific - several SW-oriented CNLs: Rabbit, SOS, CLOnE - optimized for querying (not authoring), e.g. Coral, NL-interfaces to databases - no underlying formal semantics - relaxed syntactic constraints - relaxed or different ambiguity handling - speech-oriented: voice commands for mobile devices - other goals: - automatic translation into other natural languages --- # Components of a CNL framework - programming language with built-in support for describing natural language grammars - CNL design guidelines and best practices - language-independent parsing engine - library that contains the linguistic knowledge of several NLs, ideally with a language-independent API - general enough to support a variety of use cases --- # Grammatical Framework (GF) - framework for multilingual grammar engineering - functional programming language optimized to handle natural languages - resource grammar library implementing common morphological and syntactic structures - a GF program (aka grammar) consists of - language-neutral abstract syntax - multiple concrete syntaxes that implement the abstract functions and categories, specifying words, word order, agreement, etc. - main operations - parsing: map a string in some language to abstract tree(s) - linearization: linearize tree(s) as strings in some language - translation = parse a string in language A to tree(s) + linearize these tree(s) as strings in language B - various tools + bindings to Python, Java, JavaScript, Prolog, ... - developed at the University of Gothenburg, led by Aarne Ranta
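The three main operations can be mimicked in plain Python for a very small grammar. This is a toy sketch and not GF: the dictionaries `ENG` and `WOLFRAM` and the functions `all_trees`, `linearize`, `parse` and `translate` are hypothetical names, and parsing is done by brute-force enumeration of all trees, which only works because the toy abstract syntax is finite.

```python
from itertools import product

# Abstract syntax: one binary function "unitconv" over two nullary units.
UNITS = ["land_mile", "nautical_mile"]

# Concrete syntaxes: every unit has one or more string variants,
# and "unitconv" is a template over the strings of its arguments.
ENG = {
    "land_mile": ["mile"],
    "nautical_mile": ["nautical mile", "mile"],
    "unitconv": lambda x, y: f"how much is {x} in {y} ?",
}
WOLFRAM = {
    "land_mile": ["mile"],
    "nautical_mile": ["nmi"],
    "unitconv": lambda x, y: f"convert {x} to {y}",
}


def all_trees():
    """Enumerate every abstract tree of the (finite) toy grammar."""
    return [("unitconv", x, y) for x, y in product(UNITS, UNITS)]


def linearize(tree, concrete):
    """Map a tree to all of its strings in the given concrete syntax."""
    _, x, y = tree
    return [
        concrete["unitconv"](xs, ys)
        for xs in concrete[x]
        for ys in concrete[y]
    ]


def parse(string, concrete):
    """Map a string to all trees that can be linearized as that string."""
    return [t for t in all_trees() if string in linearize(t, concrete)]


def translate(string, source, target):
    """Translation = parsing in the source + linearization in the target."""
    return [linearize(t, target)[0] for t in parse(string, source)]


question = "how much is nautical mile in mile ?"
print(parse(question, ENG))
print(translate(question, ENG, WOLFRAM))
```

Because the English word "mile" is ambiguous between the two units, the question has two trees and therefore two translations; the ambiguity is carried through the abstract syntax rather than being resolved silently.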
### Presenter Notes Parsing (translation, look-ahead, ...) is based on Parallel Multiple Context-Free Grammars (PMCFG), a mildly context-sensitive formalism. --- # GF Editor/Translator for Foods.pgf
--- # GF example: Abstract module !haskell abstract Unitconv = { flags startcat = Unitconv ; cat Unit ; Unitconv ; fun unitconv : Unit -> Unit -> Unitconv ; land_mile, nautical_mile : Unit ; } - declares the __categories__ (`Unit`, `Unitconv`) - declares the __functions__ (`unitconv`, `land_mile`, `nautical_mile`) and their __types__ (`Unit->Unit->Unitconv`, `Unit`) - __trees__ that can be obtained from the __start category__: - `unitconv land_mile nautical_mile` - `unitconv land_mile land_mile` - ... --- # GF example: Concrete modules !haskell concrete UnitconvEng of Unitconv = { lincat Unit, Unitconv = {s : Str} ; lin unitconv x y = {s = "how much is" ++ x.s ++ "in" ++ y.s ++ "?"} ; land_mile = {s = "mile"} ; nautical_mile = {s = "nautical mile" | "mile"} ; } concrete UnitconvWolfram of Unitconv = { lincat Unit, Unitconv = {s : Str} ; lin unitconv x y = {s = "convert" ++ x.s ++ "to" ++ y.s} ; land_mile = {s = "mile"} ; nautical_mile = {s = "nmi"} ; } - define how trees are linearized in a concrete language - assign linearization structures to categories (e.g. "record containing a string": { s : Str }) - define how functions are linearized, e.g. as strings or concatenation of strings - e.g. `(x.s ++ "in")` concatenates the string of the argument (`x.s`) with the string `"in"` --- # GF parsing and linearizing Parsing i.e. converting a string __how much is nautical mile in mile ?__ to tree(s) Unitconv> parse -lang=Eng "how much is nautical mile in mile ?" unitconv nautical_mile land_mile unitconv nautical_mile nautical_mile Linearization i.e. converting a tree __unitconv nautical_mile land_mile__ to string(s) Unitconv> linearize -treebank -list (unitconv nautical_mile land_mile) UnitconvEng: how much is nautical mile in mile ?, , how much is mile in mile ? UnitconvWolfram: convert nmi to mile Translation i.e. parse + linearize Unitconv> parse -lang=Eng "how much is nautical mile in mile ?" 
| l -lang=Wolfram convert nmi to mile convert nmi to nmi --- # GF example: More complex example !haskell abstract Geography = { cat Country ; Relation ; fun germany : Country ; switzerland : Country ; border : Country -> Country -> Relation ; } concrete GeographyGer of Geography = { param CaseGer = Nom | Acc | Dat ; lincat Country = CaseGer => Str ; Relation = Str ; lin germany = table { _ => "Deutschland" } ; switzerland = table { Dat => "der Schweiz" ; _ => "die Schweiz" } ; border c1 c2 = c1 ! Nom ++ "grenzt an" ++ c2 ! Acc ; } - German cases require a more flexible linearization structure (table of strings) - fortunately this (and other types of) complexity can be hidden in a library, which simplifies the code to something like: - `border c1 c2 = c1 ! Nom ++ (mkVP3rdPers "grenzen" c2)` --- # GF Resource Grammar Library (RGL) - morphology and syntax for ~30 languages via a language-neutral API - developers do not need detailed knowledge of the languages that they want to support in their application --- # RGL: Content words - some languages have many content word forms, e.g. Estonian has ~28 noun forms (2 numbers with 14 cases each) - smart paradigm - construct (all) word forms on the basis of as few base forms (nominative, genitive, ...) as possible - other input information: gender (for some languages), verb valency (for verbs, transitive adjectives, ...) - benefit: keeps the lexicon equally simple across languages - lexicon editors are likely to be familiar with the base forms (rather than language-specific morph. 
type systems) - examples - `mkN "dog"` constructs all the noun forms from the singular nominative "dog": dog, dogs, dog's, dogs' - `mkN "man" "men"` (two forms needed for irregular English words) - some languages in the RGL also have large (~50k entries) lexicons (with mapping to WordNet senses) --- # RGL: Syntax - language independent API for many syntactic functions - `mkNP : Numeral -> N -> NP` : five men - `mkCl : NP -> V -> Cl` : five men sleep - library handles word order, agreement, choice of function words, etc. - "red apple" vs "pomme rouge" - language-specific constructions can be included in the Extra-modules - e.g. verb phrase coordination Example: constructing a clause: !bash $ gf > i -retain alltenses/TryEng.gfo > cc -one mkS pastTense (mkCl (mkNP (mkNumeral "1") (mkN "dog")) (mkV "sleep" "slept" "slept")) one dog slept > cc -one mkS (mkCl (mkNP (mkNumeral "2") (mkN "dog")) (mkV "sleep" "slept" "slept")) two dogs sleep --- # RGL-based GF programs !haskell incomplete concrete FoodsI of Foods = open Syntax, LexFoods in { lincat Phrase = Cl ; Item = NP ; ... lin is x y = mkCl x y ; ... wine = mkCN wine_N ; ... 
} concrete FoodsGer of Foods = FoodsI with (Syntax = SyntaxGer), (LexFoods = LexFoodsGer) ; - best practice for multilingual grammars: share most of the code via a functor - functor = "incomplete concrete" module that is parametrized by language-specific resources - the functor references the RGL via its language-neutral API (`Cl`, `NP`, `mkCl`, `mkCN`) - the concrete languages import the functor and plug in language-specific resources --- # Benefits of the GF framework - support for multilinguality - support for both parsing and generation - grammar engineering tools and guidelines - parser/translator with look-ahead support - library of pre-implemented linguistic knowledge for ~30 languages - large lexicons (for some languages) --- # ACE-in-GF --- # Motivation - implement ACE in a multilingual way - ACE-based modeling and ACE tools become available to speakers of other natural languages - test how English-specific ACE is - implement ACE in a different and more general formalism - gain new tools - tree-based sentence editor - access to other programming languages (C, Python, JavaScript, ...) ### Presenter Notes - the existing ACE implementation (APE) in Prolog - the existing AceWiki look-ahead parser in Codeco --- # Multilingual ACE An ACE grammar implemented in GF adds multiple natural languages as front-ends to ACE. As a result, these languages can be mapped to and from various formal languages already supported by ACE. --- # German <-> ACE <-> OWL __German__ Jedes Land, das nicht an ein Meer grenzt, ist ein Binnenland. __ACE-in-GF tree__ baseText (sText (s (vpS (everyNP (relCN (cn_as_VarCN country_CN) (neg_predRS which_RP (v2VP border_V2 (thereNP_as_NP (aNP (cn_as_VarCN sea_CN))))))) (npVP (thereNP_as_NP (aNP (cn_as_VarCN landlocked_country_CN))))))) __ACE__ Every country that does not border a sea is a landlocked-country. 
__OWL__ SubClassOf( ObjectIntersectionOf( :country ObjectComplementOf( ObjectSomeValuesFrom( :border :sea ) ) ) :landlocked-country ) --- # Implementation of ACE-in-GF - extension of _Angelov and Ranta. Implementing Controlled Languages in GF (CNL 2009)_ - implementation of the ACE syntax - focus on the subset of ACE that can be mapped to OWL - about 100 syntactic functions reflecting the reference implementation of ACE (APE) - no direct generation of discourse representation structures (DRS) - almost 100% coverage at almost 0% ambiguity (formally tested for ACE) - multilinguality - support most RGL languages: _Bulgarian_, _Catalan_, _Chinese_, _Danish_, _Dutch_, _English_, _Estonian_, _Finnish_, _French_, _German_, _Greek_, _Hindi_, _Italian_, _Latvian_, _Maltese_, _Norwegian_, _Polish_, _Romanian_, _Russian_, _Spanish_, _Swedish_, _Thai_, _Urdu_ - RGL-based design provides automatic increase in quality and language-coverage over time - lexicon - user-specified lexicon as needed by ACE applications - rely on RGL smart paradigms and large lexicons --- # ACE-in-GF translation example
ACE: every person that speaks a language X does not forget X .
Bul: всеки човек който говори език X не забравя X .
Cat: cada persona que parla una llengua X no oblida X .
Chi: 说 一 种 X 语 言 的 每 个 人 没 忘 X 。
Dan: hver person , som taler et sprog X glemmer ikke X .
Dut: elke persoon , dat een taal X spreekt vergeet niet X .
Est: iga inimene , kes räägib keelt X ei unusta X .
Fin: jokainen henkilö , joka puhuu kieltä X ei unohda X:ää .
Fre: chaque personne qui parle une langue X n' oublie pas X .
Ger: jede Person , die eine Sprache X spricht vergißt X nicht .
Gre: κάθε πρόσωπο που μιλά μία γλώσσα τον X δεν ξεχνά τον X .
Hin: हर [person_CN] , जो [language_CN] X बोलता है X नहीं भूलता है .
Ita: ogni persona che parla una lingua X non dimentica X .
Mlt: kull persuna , li jkellem lingwa X ma jinsix X .
Lav: ikviena persona , kas saka valodu X neaizmirst X .
Nor: hver person , som snakker et språk X glemmer ikke X .
Pol: każda osoba , która rozmawia z językiem X nie zapomina X .
Ron: orice persoană care vorbeşte o limbă X nu îl uită pe X .
Rus: каждый лицo , который говорит на языке X не забывает X .
Spa: cada persona que habla una lengua X no olvida X .
Swe: varje person , som talar ett språk X glömmer inte X .
Tha: บุคคล ทุก คน ที่ พูด ภาษา X ไม่ ลืม X
Urd: ہر شخص , جو زبان X بولتا ہے X نہیں بھولتا ہے
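
All of the lines above are linearizations of one shared abstract tree, each produced by a different concrete syntax. The following is a minimal Python sketch of that idea, not the ACE-in-GF grammar or the GF runtime: the function names (`negS`, `everyRel`, `vp`, `aVar`), the lexicon layout, and the `linearize` helper are all hypothetical and chosen only for illustration. It reproduces the ACE and German lines and shows why a verb phrase has to be linearized as a record rather than a plain string: English and German place the verb, the object, and the negation in different positions.

```python
# Toy illustration only: real GF grammars are written in GF and use the RGL.
# An abstract tree is a nested tuple (function, arg1, arg2, ...); leaves are names.
TREE = ("negS",
        ("everyRel", "person", ("vp", "speak", ("aVar", "language", "X"))),
        ("vp", "forget", "X"))


def make_vp(verb, obj):
    """Keep verb and object apart so that each language can order them itself."""
    return {"v": verb, "obj": obj}


# Hypothetical concrete syntax for English (ACE word order).
ENG = {
    "lex": {
        "person": {"s": "person"},
        "language": {"s": "language"},
        "speak": {"fin": "speaks", "inf": "speak"},
        "forget": {"fin": "forgets", "inf": "forget"},
    },
    "aVar": lambda n, x: f"a {n['s']} {x}",
    "vp": make_vp,
    "everyRel": lambda n, vp: f"every {n['s']} that {vp['v']['fin']} {vp['obj']}",
    "negS": lambda np, vp: f"{np} does not {vp['v']['inf']} {vp['obj']} .",
}

# Hypothetical concrete syntax for German: nouns carry gender-dependent
# function words, the relative clause is verb-final, "nicht" comes last.
FEM = {"every": "jede", "a": "eine", "rel": "die"}
GER = {
    "lex": {
        "person": {"s": "Person", **FEM},
        "language": {"s": "Sprache", **FEM},
        "speak": {"fin": "spricht"},
        "forget": {"fin": "vergißt"},
    },
    "aVar": lambda n, x: f"{n['a']} {n['s']} {x}",
    "vp": make_vp,
    "everyRel": lambda n, vp: f"{n['every']} {n['s']} , {n['rel']} {vp['obj']} {vp['v']['fin']}",
    "negS": lambda np, vp: f"{np} {vp['v']['fin']} {vp['obj']} nicht .",
}


def linearize(tree, lang):
    """Recursively apply a language's linearization rules to an abstract tree."""
    if isinstance(tree, str):
        # Lexical item, or a variable such as "X" that is passed through unchanged.
        return lang["lex"].get(tree, tree)
    fun, *args = tree
    return lang[fun](*(linearize(arg, lang) for arg in args))


if __name__ == "__main__":
    for name, lang in (("ACE", ENG), ("Ger", GER)):
        print(f"{name}: {linearize(TREE, lang)}")
```

Adding a language in this sketch means adding one more dictionary, while the tree stays untouched. In ACE-in-GF the per-language part is smaller still, because most rules are shared through the RGL functor.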
--- # RGL-based implementation Import most of the implementation from the RGL using a functor. - adding a new language is easy - profit from continuous improvements in the RGL --- # ACE-in-GF: Issues - precision problems (over-generation), which are visible in the look-ahead editor - anaphoric references do not obey DRS accessibility constraints (e.g. `Every man likes the woman.`) - some language-independent features (e.g. ACE NP types) are hard to describe in the abstract syntax - coverage problems in some languages - e.g. verb phrase coordination not available in the core RGL - can be handled in the tool (e.g. AceWiki-GF) using paraphrasing, etc. - ambiguity problems in some languages - e.g. missing determiners in Finnish - can be handled in the tool (e.g. AceWiki-GF) using disambiguation dialogues - by default, sentences use the same structure in every language - extra work and linguistic competence is needed to override this (possibly using _Extra_-modules) - lack of smart paradigms and/or large lexicons for some languages ## Presenter Notes More development effort has gone into German, Spanish and Finnish. Other implementations have holes in the coverage of ACE constructs that are not provided by the RGL. TODO: lexical ambiguity: French river example. ACE NP types could be done with dependent types but this is not compatible with the look-ahead algorithm. ACE has open vocabulary (i.e. different from MOLTO Phrasebook in that sense) and it is not easy for the user to create correct lexical entries. --- # Evaluation of ACE-in-GF: Design - picked 10 languages which had a complete implementation - generated ~100 ACE sentences/questions and automatically translated them to all the languages, ensuring - full coverage of all the grammar functions - large coverage of OWL axiom structures (subclass, range, domain, transitivity, ...) - measured translation accuracy from ACE to other languages - used Google Translate as the baseline - ... 
and 20 human evaluators (2 per language) as the gold standard --- # Evaluation of ACE-in-GF: Results - participants preferred ACE-in-GF translations to Google translations and post-edited them less - many edits were just stylistic - e.g. users preferred elliptical sentences but these are not allowed in ACE - some languages performed clearly better, e.g. Finnish, German, Dutch --- # Evaluation of ACE-in-GF: Results
--- # AceWiki-GF
A Multilingual CNL-based Semantic Wiki --- # Multilingual CNL-based Semantic Wiki - multiple languages - natural: English, German, ... - formal: first-order logic, OWL, ... - languages for content vs user interface - CNL-based - backed by formal grammar(s) - formal languages are hidden - semantic - content automatically kept in sync via precise translation - consistency checking, question answering, ... (depending on the domain) - wiki - user-friendly - collaborative --- # Motivation for adding GF to AceWiki - increase user-friendliness for non-English speakers - experiment with a more general CNL setting - content not necessarily based on ACE and OWL - experiment with more aspects of collaboration - ambiguity handling - full grammar editing --- # Possible use cases - multilingual ontology editor - e.g. environment where users agree on the content and multilingual vocabulary of an OWL-style ontology (for a certain domain, e.g. geography) - like AceWiki, but multilingual - catalog of museum objects - each object (e.g. painter, painting) on its own wiki page - rich queries (e.g. "which Dutch painter painted which French painter?") - like previous, but more focus on instance data and multimedia content - tourist phrasebook - book structure (chapters and sections) - multilingual content presented in parallel (at least 2 languages) - e.g. based on the MOLTO Phrasebook grammar - like AceWiki, but different UI and no reasoning - other - collection of math exercises - ... 
--- # AceWiki integration with GF - wiki content is based on a (single) GF grammar - provided by GF Webservice / Cloud service - wiki entry is GF abstract tree set - viewed via linearization(s) - can represent ambiguity - multilingual viewing and editing of wiki content - grammar-based look-ahead editing that shows next possible tokens - ambiguity resolution via another concrete language - grammar integrated into the wiki - GF grammars are very modular - grammar modules as wiki articles (wiki-linking of grammar and content) - grammar can be changed while editing the wiki --- # AceWiki-GF user interface - UI not (yet) customizable for any specific application (e.g. MOLTO Phrasebook) - i.e. most suitable for ACE-based wikis - language-switching menu - displays the wiki articles and sentence editor in the given language - user interface labels reflect the content language - grammar editor - table-based editor for lexical functions - basic full grammar editor - ambiguity resolution dialog --- # ACE-based geography wiki - main use case developed in the MOLTO project - uses ACE-in-GF as the underlying grammar --- # ACE-based geography article ## Presenter Notes Depicted are the ACE version and the German version (containing the look-ahead editor). Note that the UI is language dependent. --- # Ambiguity resolution Disambiguation dialog (only) if the entry was added in another language, and the trees of this ambiguity have different linearizations in the viewed language. ## Presenter Notes Ambiguity between object and subject relative clause. Occurs in German and Dutch. The wiki users can choose the correct tree by looking at the tree set in a language other than German, e.g. DisambGer (if it exists). --- # Lexicon modules as a table --- # Grammar module page ## Presenter Notes GF source editing is available in the GF Cloud Service. AceWiki-GF just reflects that. Some types of errors can be pinpointed. --- # Automatic question answering
--- # AceWiki-GF: Technologies - web application written in Java - Echo Web Framework - OWL API (managing the ontology and reasoners) - ACE Parser (conversion to OWL) - GF Webservice (predictive parsing, translation, grammar compilation) --- # AceWiki-GF: Dependencies --- # Evaluation of AceWiki-GF: Design - developed a 500-word geography lexicon - 3 languages: English, German and Spanish - 3 authors (incl. native speakers of German and Spanish, and a GF engineer) - avoid lexical ambiguity - asked users of different languages to supply the wiki with sentences and tag each as _true_ or _false_ - asked them then to evaluate others' sentences as _true_ or _false_ - measured the user (dis)agreement and how much it is influenced by the automatic translation - Hypothesis: A group of users reaches almost the same level of agreement on the content of an article presented to them in different languages as when the article is presented to all of them in the same language. - asked them for general feedback via a questionnaire --- # Evaluation of AceWiki-GF: Results - 30 participants entered 316 sentences - almost all the syntactic functions were used (i.e. they were discoverable via the look-ahead editor) - agreement level was ~83% with no significant influence from the translation - AceWiki-GF user interface was found to be easy to use - 80% reported that the CNL did not let them express everything that they wanted - missing content words - some grammar restrictions were confusing or presented confusingly in the look-ahead editor ### Presenter Notes Not used: negated object relative clause: "... that a country does not border", variable in apposition "a country X". --- # Future work - generalize to handle other types of grammars and reasoning - tree-based (language-neutral) sentence construction - optimize the user interface to other types of content - improve collaborative grammar editing features - improve ambiguity management (e.g. 
automatic reasoning-based ambiguity resolution) - use the wiki content to automatically generate documentation, grammar fragments, look-ahead editor customizations, etc. for novice users - more evaluation needed --- # Get involved! - many ways to extend AceWiki-GF, i.e. good for a thesis project - the source of most projects mentioned today is available on GitHub: - APE: - ACE-in-GF: - OWL Verbalizer: - AceWiki / AceWiki-GF: - GF: - ... i.e. easy to fork and/or contribute to - in case of questions send an email to the Attempto mailing list or --- # Links - ACE-in-GF - project page: - AceWiki-GF - project page (same as for AceWiki): - demos, links, etc.: - MOLTO project: - see the deliverables of Work Package 11 - Grammatical Framework project: --- # Thank You!