ExRel - An Open Architecture for Relation Extraction and Semantic Role Labeling

Overview

ExRel (a not-too-original contraction of Relation Extraction) is a general framework for relation extraction. We have used it for some years as the main component of our Semantic Role Labeling (SRL) architectures for English and Arabic.

Our approach to SRL is characterized by the combination of traditional linear features (e.g. Path, Governing Category or Phrase Type) with ad-hoc engineered structured features that we exploit by means of tree kernel functions.

For more details about our approach and the results that we obtained, please refer to the following publications: [1], [2], [3], [4], [5], [6], [7].

Main features

ExRel is written in Java 6, so it runs on any system for which a Java 6 runtime is available. It consists of the following main components:

  • A collection of abstract classes that define abstract components of a relation-extraction architecture
  • A syntax (XML-based) to define architectures based on these components
  • Concrete classes that implement the modules used by the SRL architecture
  • A feature extractor that can extract any combination of structured and linear features, and generate files with forests of structured features and arrays of feature vectors that can be used as input for SVM-Light-TK (an extension of the SVM-Light optimizer that can handle such feature spaces)
  • The implementation of most of the explicit features described in the literature and of the structured features discussed in our papers.

For learning and classification, the currently implemented models rely on SVM-Light-TK, which is not bundled with ExRel and must be downloaded separately. ExRel comes with facilities for training models and for using them to annotate fresh data. The current distribution of ExRel does not include pre-trained models.

Requirements

ExRel requires Java 6 Standard Edition (it should also work with Java 7, but this has not been tested) and a local installation of SVM-Light-TK.

License

ExRel is distributed under a dual licensing scheme.

For personal, teaching or research uses, the software is available under the GNU Lesser GPL (LGPL) v.3 license. The text of the license is available at http://www.fsf.org/licensing/licenses/lgpl.html.

If you use this software for research, please reference this paper [1] in your publications.

Please note that research uses do NOT include those involving the development of technology to be employed for commercial or any other kind of revenue purposes. These include selling, releasing, or providing commercial services based on the software.

For any other uses, the software is released under a commercial license. The terms of the license are defined on a per-request basis. You can contact me by email for more information.

Download

Click on this link to download the current version (0.9) of ExRel.

Installation

Just decompress the archive somewhere on your hard disk. The executable scripts are in the bin folder and can be run from any directory.

Running ExRel (the simple way)

The ./bin/wrapper.sh script under the installation directory can be used to learn the required models or to classify previously unseen examples. If invoked without arguments, it prints a brief usage message:

Usage: ./bin/wrapper.sh
   <[training|test]> <trees_idx_file> <props_idx_file>
   <svmlighttk_dir> <outputdir>

The arguments for the script are:

  • the string training or test, to learn SRL models or to annotate new sentences
  • an index of parsed sentences (<trees_idx_file>) and the corresponding index of semantic role annotations (<props_idx_file>). The syntax of the index files is detailed in the Data format section
  • the path to the directory where svm_learn and svm_classify executables are located
  • the path to an output directory. If the directory does not exist, the script will create it for you.

When run in training mode, the script will extract structured and explicit features from the training data, generate input data for svm_learn and use these data to learn all the required models. All these files can be found under <outputdir>/training after the execution terminates.

When run in test mode, the previously learned models are used to annotate new sentences. In this case, <outputdir> should point to the same directory that was used for training. The directory <outputdir>/test will be used to store temporary and final results of the evaluation. If everything goes fine, the file <outputdir>/test/heuristicprops.idx will contain valid (and hopefully adequate!) annotations for the test sentences.
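
As an illustration, a training run followed by a test run could be driven from Python roughly as sketched below. The helper names (wrapper_command, run_wrapper, annotations_path) and every path in the usage comment are assumptions made for this example; only the argument order and the location of the final annotations come from the description above.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List

MODES = ("training", "test")


def wrapper_command(exrel_home: str, mode: str, trees_idx: str,
                    props_idx: str, svmlighttk_dir: str,
                    output_dir: str) -> List[str]:
    """Build the argument list for bin/wrapper.sh in the documented order."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    script = PurePosixPath(exrel_home) / "bin" / "wrapper.sh"
    return [str(script), mode, trees_idx, props_idx, svmlighttk_dir, output_dir]


def run_wrapper(*args: str) -> None:
    """Run wrapper.sh; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(wrapper_command(*args), check=True)


def annotations_path(output_dir: str) -> PurePosixPath:
    """Location of the final annotations written by a test run."""
    return PurePosixPath(output_dir) / "test" / "heuristicprops.idx"


# Hypothetical usage; every path below is a placeholder.
# run_wrapper("/opt/exrel", "training", "train.trees.idx", "train.props.idx",
#             "/opt/svm-light-tk", "out")
# run_wrapper("/opt/exrel", "test", "test.trees.idx", "test.props.idx",
#             "/opt/svm-light-tk", "out")  # same output directory as training
# print(annotations_path("out"))
```

Keeping the command construction separate from its execution makes it possible to inspect or log the exact command line before any model is trained.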

Data format

Syntax of tree index files

A tree index file is just a list of parsed sentences, each associated with a unique id:

<tree id><TAB><tree surface>

For example:

2400000 (S1 (S (NP (NP (DT The) (NN economy) (POS 's)) (NN temperature)) (VP (MD will) (VP (AUX be) (VP (VBN taken) (PP (IN from) (NP (JJ several) (NN vantage) (NNS points))) (NP (DT this) (NN week)) (, ,) (PP (IN with) (NP (NP (NNS readings)) (PP (IN on) (NP (NN trade) (, ,) (NN output) (, ,) (NN housing) (CC and) (NN inflation)))))))) (. .)))
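
Before passing a tree index to ExRel, it can be useful to check it with a few lines of code. The sketch below is not part of ExRel: the function names are hypothetical, the id is separated from the tree on the first run of whitespace (so it accepts a tab or spaces), and the bracket check only compares the number of opening and closing brackets.

```python
import re
from typing import Dict, Iterable, List

# A preterminal looks like "(TAG word)": no nested brackets inside.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

# The example record from above, with a tab after the id.
SAMPLE = (
    "2400000\t"
    "(S1 (S (NP (NP (DT The) (NN economy) (POS 's)) (NN temperature)) "
    "(VP (MD will) (VP (AUX be) (VP (VBN taken) "
    "(PP (IN from) (NP (JJ several) (NN vantage) (NNS points))) "
    "(NP (DT this) (NN week)) (, ,) "
    "(PP (IN with) (NP (NP (NNS readings)) (PP (IN on) "
    "(NP (NN trade) (, ,) (NN output) (, ,) (NN housing) (CC and) "
    "(NN inflation)))))))) (. .)))"
)


def read_tree_index(lines: Iterable[str]) -> Dict[str, str]:
    """Map each tree id to its bracketed tree, rejecting malformed records."""
    trees: Dict[str, str] = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        parts = line.strip().split(None, 1)  # id, then the rest of the line
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected '<id> <tree>'")
        tree_id, tree = parts
        if tree.count("(") != tree.count(")"):
            raise ValueError(f"line {number}: unbalanced brackets")
        if tree_id in trees:
            raise ValueError(f"line {number}: duplicate id {tree_id}")
        trees[tree_id] = tree
    return trees


def words(tree: str) -> List[str]:
    """Return the words of a bracketed tree, left to right."""
    return [word for _tag, word in PRETERMINAL.findall(tree)]
```

For the sample record, words(...) returns 25 tokens, and the token at index 6 is "taken".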

Syntax of proposition index files

A proposition index is a collection of all the annotations for each parse tree. Each record looks like:

<tree id><SPACE><predicate offset><SPACE><predicate upsteps><TAB><propbank annotation>

For example:

2400000 6 0     wsj/24/wsj_2400 0 6 gold take.XX ----- 0:2-A1 4:0-AM-MOD 6:0-rel 7:1-A2 11:1-AM-TMP 14:1-AM-ADV

Here, 2400000 is the id of one of the trees in the tree index file, and 6 and 0 are the coordinates of the predicate word in propbank notation. Within the propbank annotation, all fields are ignored except the actual annotation (i.e. the part after "-----" in the example).
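
To make the coordinates concrete, the sketch below parses a record and resolves each <offset>:<upsteps> pair against the parse tree: it starts from the part-of-speech node of the word at the given offset (counting from 0) and moves up the given number of steps. This is an illustration and not ExRel code. All function names are hypothetical, only the simple pointer form shown in the example is handled, and the offset is assumed to count the words of the tree as given (the sample tree contains no empty elements).

```python
import re
from typing import Iterator, List, NamedTuple, Tuple

TREE = (
    "(S1 (S (NP (NP (DT The) (NN economy) (POS 's)) (NN temperature)) "
    "(VP (MD will) (VP (AUX be) (VP (VBN taken) "
    "(PP (IN from) (NP (JJ several) (NN vantage) (NNS points))) "
    "(NP (DT this) (NN week)) (, ,) "
    "(PP (IN with) (NP (NP (NNS readings)) (PP (IN on) "
    "(NP (NN trade) (, ,) (NN output) (, ,) (NN housing) (CC and) "
    "(NN inflation)))))))) (. .)))"
)
RECORD = ("2400000 6 0\twsj/24/wsj_2400 0 6 gold take.XX ----- "
          "0:2-A1 4:0-AM-MOD 6:0-rel 7:1-A2 11:1-AM-TMP 14:1-AM-ADV")


class Argument(NamedTuple):
    terminal: int  # index of the first word covered
    height: int    # number of steps up from that word's POS node
    label: str     # e.g. "A1", "AM-TMP", "rel"


def parse_record(line: str) -> Tuple[str, Tuple[int, int], List[Argument]]:
    """Return (tree id, predicate coordinates, arguments) for one record."""
    head, separator, annotation = line.partition("-----")
    if not separator:
        raise ValueError("record has no '-----' separator")
    tree_id, offset, upsteps = head.split()[:3]
    arguments = []
    for token in annotation.split():
        pointer, label = token.split("-", 1)  # "4:0-AM-MOD" -> "4:0", "AM-MOD"
        terminal, height = pointer.split(":")
        arguments.append(Argument(int(terminal), int(height), label))
    return tree_id, (int(offset), int(upsteps)), arguments


def parse_tree(text: str) -> list:
    """Parse a bracketed tree into nested [label, children...] lists."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    position = 0

    def node() -> list:
        nonlocal position
        position += 1                    # skip "("
        result = [tokens[position]]      # node label
        position += 1
        while tokens[position] != ")":
            if tokens[position] == "(":
                result.append(node())
            else:
                result.append(tokens[position])
                position += 1
        position += 1                    # skip ")"
        return result

    return node()


def preterminal_chains(node: list, above: tuple = ()) -> Iterator[tuple]:
    """For each word, left to right, yield the nodes from root to POS node."""
    chain = above + (node,)
    if len(node) == 2 and isinstance(node[1], str):
        yield chain
    else:
        for child in node[1:]:
            yield from preterminal_chains(child, chain)


def covered_words(tree: list, terminal: int, height: int) -> List[str]:
    """Words spanned by the node `height` steps above word `terminal`."""
    target = list(preterminal_chains(tree))[terminal][-1 - height]
    return [chain[-1][1] for chain in preterminal_chains(target)]
```

With these definitions, covered_words(parse_tree(TREE), 0, 2) yields the words of the A1 argument ("The economy 's temperature"), and the predicate coordinates 6:0 resolve to the single word "taken".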

At test time, if you are working with free text, you are not expected to know the semantic annotation (unless you are carrying out an evaluation). You are still required to provide a proposition index, but it only needs to contain information about the predicates to be targeted. The "test" version of the same example would therefore look something like:

2400000 6 0     wsj/24/wsj_2400 0 6 gold take.XX ----- 6:0-rel
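
When fully annotated records are available, for instance in an evaluation setting, the test version can be derived mechanically. The sketch below uses a hypothetical helper, to_test_record, and assumes that the predicate is the only entry whose label is "rel"; everything before the "-----" separator is copied unchanged.

```python
def to_test_record(record: str) -> str:
    """Drop every argument except the predicate ("rel") from one record."""
    head, separator, annotation = record.partition("-----")
    if not separator:
        raise ValueError("record has no '-----' separator")
    predicate = [t for t in annotation.split() if t.endswith("-rel")]
    if not predicate:
        raise ValueError("record has no 'rel' entry")
    return head + separator + " " + " ".join(predicate)


FULL = ("2400000 6 0     wsj/24/wsj_2400 0 6 gold take.XX ----- "
        "0:2-A1 4:0-AM-MOD 6:0-rel 7:1-A2 11:1-AM-TMP 14:1-AM-ADV")
print(to_test_record(FULL))
```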

You can use your own methods to identify the target predicates, and then generate the proposition index accordingly.


References

  1. Tree Kernels for Semantic Role Labeling, Alessandro Moschitti, Daniele Pighin, and Roberto Basili, Computational Linguistics, Volume 34, Number 2, p. 193–224, (2008)
  2. Generalized Framework for Syntax-based Relation Mining, Bonaventura Coppola, Alessandro Moschitti, and Daniele Pighin, 8th IEEE International Conference on Data Mining (ICDM 2008), December 15-19, Pisa, Italy, (2008)
  3. Semantic Role Labeling Systems for Arabic Language using Kernel Methods, Mona Diab, Alessandro Moschitti, and Daniele Pighin, 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2008: HLT), June 15-20, Columbus, Ohio, USA, (2008)
  4. RTV: Tree Kernels for Thematic Role Classification, Daniele Pighin, Alessandro Moschitti, and Roberto Basili, SemEval-2007 Workshop, co-located with ACL 2007, (2007)
  5. CUNIT: A Semantic Role Labeling System for Modern Standard Arabic, Mona Diab, Alessandro Moschitti, and Daniele Pighin, SemEval-2007 Workshop, co-located with ACL 2007, (2007)
  6. Semantic Role Labeling via Tree Kernel Joint Inference, Alessandro Moschitti, Daniele Pighin, and Roberto Basili, Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), June, New York City, p. 61–68, (2006)
  7. Semantic Tree Kernels to classify Predicate Argument Structures, Alessandro Moschitti, Bonaventura Coppola, Daniele Pighin, and Roberto Basili, 17th Biennial European Conference on Artificial Intelligence (ECAI 2006), (2006)