Mixed Features for Semantic Role Labeling
From this page you can download the data files of mixed structured (AST1m) and linear features that we used in our experiments on Semantic Role Labeling (SRL). The features are extracted from the automatic Charniak parses provided for the CoNLL 2005 shared task on SRL. The task and the extraction process are detailed in this paper.
Format
Each record in the feature files has the following format:
<sentence id> <rel offset> <node offset> <node upsteps> <role label> <features>
+ -- annotation identifier --+
+ ---- identifier of a node in a sentence annotation ---- +
where:
- <sentence id> is the unique identifier of each sentence within the section;
- <rel offset> is the offset of the predicate (relation) word in the sentence, starting with 0. Each pair (<sentence id>, <rel offset>) is a unique identifier for a predicate/proposition within a section;
- <node offset> and <node upsteps> are used to identify a node within a parse tree. Each node is a candidate argument of a proposition. The notation is the same as used for PropBank annotations: the offset is the offset of the first word (leaf) dominated by the node, the upsteps is the number of nodes (starting from the POS) that must be climbed in the hierarchy in order to identify the node.
The tuple (<sentence id>, <rel offset>, <node offset>, <node upsteps>) is a unique identifier for a tree node/candidate argument within an annotation;
- <role label> is the role label of the candidate argument. __NARG means that the corresponding node is not an argument of the proposition. Update 2009-Oct-28: please note that a candidate pair is marked as a positive argument if and only if all the words dominated by the candidate argument node exactly cover the words of some argument. In other words, an argument is defined by its projection on the leaves of the tree, not by the node being exactly labeled as an argument in the annotation. So, if two nodes have the same projection on the leaves, and one of the nodes is marked as an argument in the annotation, then both nodes are considered positive instances of the argument class;
- <features> has the following format:
|BT| <AST1m> |ET| <linear> |EV|
where <AST1m> is the structured feature describing the predicate/candidate argument pair, <linear> is a vector of linear features (represented as attribute:value pairs) describing the same example, and |BT|, |ET| and |EV| are the separators required by SVM-Light-TK.
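As an illustration of the record layout described above, here is a minimal Python sketch of a line parser. It is not part of the distributed data or tools. The names `Record` and `parse_record` are hypothetical, the sample line is made up, and the sketch assumes that fields are whitespace-separated, that the linear features are whitespace-separated attribute:value tokens, and that the version without alignment information carries only the role label before |BT|.

```python
from typing import Dict, NamedTuple, Optional


class Record(NamedTuple):
    """One feature-file record (hypothetical container, not from the original tools)."""
    sentence_id: Optional[str]
    rel_offset: Optional[int]
    node_offset: Optional[int]
    node_upsteps: Optional[int]
    role_label: str
    tree: str                 # the AST1m structured feature, kept as raw text
    linear: Dict[str, str]    # attribute -> value, both kept as strings


def parse_record(line: str) -> Record:
    """Split one record on the |BT|, |ET| and |EV| separators."""
    for sep in ("|BT|", "|ET|", "|EV|"):
        if sep not in line:
            raise ValueError(f"missing separator {sep}")
    head, _, rest = line.partition("|BT|")
    tree, _, rest = rest.partition("|ET|")
    linear_part, _, _ = rest.partition("|EV|")

    fields = head.split()
    if len(fields) == 5:
        # Full version: four alignment fields followed by the role label.
        sid: Optional[str] = fields[0]
        rel: Optional[int] = int(fields[1])
        off: Optional[int] = int(fields[2])
        up: Optional[int] = int(fields[3])
    elif len(fields) == 1:
        # Assumed layout of the version without alignment information.
        sid = rel = off = up = None
    else:
        raise ValueError(f"unexpected number of leading fields: {len(fields)}")

    linear: Dict[str, str] = {}
    for token in linear_part.split():
        # Split on the last colon so that attribute names may contain colons.
        attr, _, value = token.rpartition(":")
        linear[attr] = value

    return Record(sid, rel, off, up, fields[-1], tree.strip(), linear)


# Made-up sample line, for illustration only.
sample = "12 3 0 2 A0 |BT| (S (NP (DT the))) |ET| 1:1.0 7:0.5 |EV|"
record = parse_record(sample)
```

The pair (`record.sentence_id`, `record.rel_offset`) then identifies the proposition, and adding `record.node_offset` and `record.node_upsteps` identifies the candidate node.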
The linear features that we encode are extracted automatically from the automatic parses and are those most commonly used in the literature, namely:
- AscendingPath
- PredicatePOS
- VerbSubCategorization
- ContentWord
- PredicateWord
- ContentWordPOS
- FirstWord
- NoDirectionPath
- FirstWordPOS
- LastWord
- Path
- HeadWordPOS
- HeadWord
- PredicateStem
- VerbSyntacticFrame
- LastWordPOS
- GrammaticalRule
- PhraseType
- PredicateVoice
- PathDistance
- DescendingPath
- GoverningCategory
- CommonAncestor
- Position
For a more in-depth description of these features, please refer to this paper.
Licensing
The data is provided for research purposes only. Published works based on this data should cite the following paper:
@article{MoschittiEtAl08,
author = {Moschitti, Alessandro and Pighin, Daniele and Basili, Roberto},
title = {Tree Kernels for Semantic Role Labeling},
journal = {Computational Linguistics},
volume = {34},
number = {2},
year = {2008},
pages = {193--224},
}
Two versions of the data are available:
- a version without alignment information (the first four fields described above). This version can be used to train and test all the classifiers required to recognize argument boundaries and roles.
- the full version, which contains alignment information. This version is also freely available, but interested users must show that they hold a valid Penn Treebank license, issued by the LDC, before they are allowed to download the data. The alignment information, along with the Charniak parses of the input sentences, is sufficient to carry out the complete SRL task.
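The following Python sketch shows one way the alignment fields could be mapped back onto a parse tree, and how the projection rule for positive examples plays out. It is an illustration under assumptions, not the code used to build the data. The `Node` class and the functions `link`, `preterminals`, `resolve`, `span` and `label_candidate` are hypothetical names, the toy tree and gold span are made up, and upsteps 0 is taken to denote the POS node, following the description of the notation given in the Format section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set only on POS (preterminal) nodes
    parent: Optional["Node"] = field(default=None, repr=False)


def link(node: Node, parent: Optional[Node] = None) -> Node:
    """Fill in parent pointers so that nodes can be climbed from the leaves."""
    node.parent = parent
    for child in node.children:
        link(child, node)
    return node


def preterminals(node: Node) -> List[Node]:
    """POS nodes dominated by `node`, in left-to-right order."""
    if node.word is not None:
        return [node]
    found: List[Node] = []
    for child in node.children:
        found.extend(preterminals(child))
    return found


def resolve(root: Node, offset: int, upsteps: int) -> Node:
    """Start at the POS node of word `offset` and climb `upsteps` parents."""
    node = preterminals(root)[offset]
    for _ in range(upsteps):
        if node.parent is None:
            raise ValueError("upsteps climbs above the root")
        node = node.parent
    return node


def span(root: Node, node: Node) -> Tuple[int, int]:
    """Inclusive (first, last) word offsets covered by `node`."""
    leaves = preterminals(root)
    own = preterminals(node)
    first = next(i for i, leaf in enumerate(leaves) if leaf is own[0])
    return first, first + len(own) - 1


def label_candidate(root: Node, offset: int, upsteps: int,
                    gold_spans: Dict[Tuple[int, int], str]) -> str:
    """A node is positive iff its leaf projection exactly matches a gold argument."""
    return gold_spans.get(span(root, resolve(root, offset, upsteps)), "__NARG")


# Toy tree (S (NP (NNP John)) (VP (VBD slept))) with a made-up gold argument.
tree = link(Node("S", [
    Node("NP", [Node("NNP", word="John")]),
    Node("VP", [Node("VBD", word="slept")]),
]))
gold = {(0, 0): "A0"}  # hypothetical: A0 covers word 0 only
```

In this toy tree the NNP node (offset 0, upsteps 0) and the NP node (offset 0, upsteps 1) project onto the same word, so `label_candidate` returns the same role for both, while the S node (offset 0, upsteps 2) covers more than the argument and is labeled `__NARG`.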
Related work
@article{MarcusEtAl94,
author = {Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary A.},
journal = {Computational Linguistics},
number = {2},
pages = {313--330},
title = {Building a Large Annotated Corpus of English: The Penn Treebank},
volume = {19},
year = {1993}
}
@inproceedings{Charniak00,
author = {Charniak, Eugene},
title = {A maximum-entropy-inspired parser},
booktitle = {Proceedings of NAACL'00},
year = {2000},
pages = {132--139},
}
@inproceedings{CarrerasEtAl05,
author = {Carreras, Xavier and Marquez, Lluis},
booktitle = {Proceedings of CoNLL '05},
title = {Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling},
year = {2005}
}
Downloads
Structured features without alignment information:
You can download them from here.
Structured features with alignment information:
- Please contact daniele dot pighin at gmail dot com.
