TedEval: An Architecture for Cross-Experiment Parse Evaluation


TedEval is a robust, heuristics-free framework for cross-experiment parse evaluation. It can be used to compare parsing results across multiple parsing experiments that adhere to different annotation schemes. TedEval scores provide an objective measure of parsing performance across experiments, computed efficiently and without the need to hand-code idiosyncratic or heuristic rules. More information on the cross-experiment evaluation protocol and the TedEval evaluation measures can be found in Tsarfaty, Nivre and Andersson (2011).

Cross-Experiment Evaluation: A New Evaluation Protocol

TedEval implements the following evaluation protocol:

The implementation uses the tree edit distance (TED) algorithm of Zhang and Shasha (1989).
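
To make the notion of tree edit distance concrete, here is a minimal Python sketch of the textbook forest recursion for ordered labeled trees. It is not the TedEval implementation, and it is not the optimized dynamic program of Zhang and Shasha, although under the same costs it computes the same ordered tree edit distance. The unit costs, the decision to allow relabeling, the tuple encoding of trees, and the names `tree`, `forest_size`, `forest_distance` and `tree_edit_distance` are all assumptions made for illustration.

```python
from functools import lru_cache


def tree(label, *children):
    """Encode a tree as a hashable (label, children) tuple (illustrative encoding)."""
    return (label, tuple(children))


def forest_size(forest):
    """Count the nodes in an ordered forest (a tuple of trees)."""
    return sum(1 + forest_size(kids) for _, kids in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests, assuming unit costs."""
    if not f:
        return forest_size(g)  # insert every remaining node of g
    if not g:
        return forest_size(f)  # delete every remaining node of f
    v_label, v_kids = f[-1]    # rightmost root of f
    w_label, w_kids = g[-1]    # rightmost root of g
    # Delete v: its children take its place in the forest.
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    # Insert w: symmetric to deletion.
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match v with w: compare their child forests and the remaining forests.
    match = (forest_distance(v_kids, w_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))


gold = tree("S", tree("NP", tree("D"), tree("N")), tree("VP", tree("V")))
flat = tree("S", tree("D"), tree("N"), tree("VP", tree("V")))
print(tree_edit_distance(flat, gold))  # 1: a single NP node separates the two trees
```

The memoized recursion is easy to check by hand on small trees, but it is far slower than the Zhang and Shasha algorithm on realistic parse trees and is only meant to show which operations the distance counts.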

TedEval also provides a method for testing the statistical significance of the cross-experiment parsing results. The null hypothesis is that the observed difference between parse hypotheses provided in the two different experiments is random. The implementation uses the stratified shuffling test of Cohen (1995).
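
The idea behind a stratified shuffling test can be sketched in a few lines of Python. This is a generic illustration rather than TedEval's own code: the test statistic (the absolute difference between the mean per-sentence scores of the two experiments), the number of trials, the fixed seed and the names `mean_difference` and `stratified_shuffling` are assumptions.

```python
import random


def mean_difference(pairs):
    """Absolute difference between the mean scores of two paired score lists."""
    total_a = total_b = 0.0
    for a, b in pairs:
        total_a += a
        total_b += b
    return abs(total_a - total_b) / len(pairs)


def stratified_shuffling(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a p-value for the null hypothesis that the difference is random.

    scores_a[i] and scores_b[i] are the scores of the two experiments on
    sentence i. In each trial the two scores of every sentence are swapped
    with probability 0.5, and the statistic is recomputed on the result.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    pairs = list(zip(scores_a, scores_b))
    observed = mean_difference(pairs)
    at_least_as_large = 0
    for _ in range(trials):
        shuffled = [(b, a) if rng.random() < 0.5 else (a, b) for a, b in pairs]
        if mean_difference(shuffled) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (trials + 1)
```

A small p-value means that random reassignment of the outputs rarely produces a difference as large as the observed one, so the null hypothesis can be rejected at the chosen significance level.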

Cross-Experiment Evaluation: Representation Types and Usage Scenarios

The TedEval evaluation software can currently be used with the following representation types:

TedEval can be used in different evaluation scenarios:

The TedEval output files (*.ted) provide the following information:

All measures are presented per sentence and are then averaged over the test set. The evaluation can be for labeled (default) or unlabeled (-unlabeled) trees. The average can be the arithmetic mean of the single-sentence scores (-micro), or it can be derived from the normalized global error (default). Detailed information on the calculation of the TED-based measures and the available options can be found in Tsarfaty, Nivre and Andersson (2011).
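
The difference between the two averaging modes can be illustrated with a short Python sketch. It assumes that each sentence contributes an edit distance and a normalization term, and that a score is one minus the normalized distance; the exact normalization term is defined in the paper and is not reproduced here. The names `sentence_score`, `mean_of_sentence_scores` and `global_score` are hypothetical.

```python
def sentence_score(distance, norm):
    """Score of one sentence: one minus the normalized edit distance."""
    if norm <= 0:
        raise ValueError("normalization term must be positive")
    return 1.0 - distance / norm


def mean_of_sentence_scores(pairs):
    """Arithmetic mean of the per-sentence scores (the -micro style average)."""
    return sum(sentence_score(d, n) for d, n in pairs) / len(pairs)


def global_score(pairs):
    """Score derived from the error normalized over the whole test set."""
    total_distance = sum(d for d, _ in pairs)
    total_norm = sum(n for _, n in pairs)
    return sentence_score(total_distance, total_norm)


# (edit distance, normalization term) per sentence; made-up numbers.
pairs = [(0, 10), (10, 40)]
print(mean_of_sentence_scores(pairs))  # 0.875: every sentence weighs the same
print(global_score(pairs))             # 0.8: long sentences weigh more
```

The two averages agree only when all sentences share the same normalization term, which is why it matters to report which option was used.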

Joint Morpho-Syntactic Evaluation: New Metrics

Both Parseval (phrase-structure evaluation) and LAS (dependency structure evaluation) employ a very strict assumption concerning the gold and parse tree to be evaluated -- that the yield of the tree is known in advance.

It is well known that when parsing morphologically rich languages, the yield of the parse tree is not known in advance. An input token may be segmented into multiple morphemes, each of which carries its own lexical meaning and part-of-speech tag, and the morphological segmentation of a valid input token may be highly ambiguous. Consider, for instance, the Hebrew phrase below. It has multiple legitimate morphological analyses, of which only the first is correct.

When parsing raw sentences, input tokens are segmented either prior to the parsing stage or jointly with it. Errors in segmentation may be introduced in the process, which breaks the aforementioned strict assumption that the yields of the gold and parse trees are identical.

TedEval offers an elegant solution to this situation by measuring the edit distance between one morphosyntactic tree and another. The normalized edit distance is subtracted from one to quantify the level of success on the morphosyntactic analysis task. To evaluate the segmentation and parsing results of a single experiment, simply provide the parse and gold tree files, as well as the parse and gold segmentation files, on the command line, using the following format:

  > java -jar tedeval.jar
     -p parse_file -sp parse_segmentation_file
     -g gold_file -sg gold_segmentation_file
     -o eval_file
     [-format conll]

To compare two experiments whose gold segmentation standards may differ, provide all relevant tree and segmentation files for both experiments, as follows:

  > java -jar tedeval.jar
     -p1 parse_file1 -sp1 parse_segmentation_file1
     -g1 gold_file1 -sg1 gold_segmentation_file1
     -o1 eval_file1
     -p2 parse_file2 -sp2 parse_segmentation_file2
     -g2 gold_file2 -sg2 gold_segmentation_file2
     -o2 eval_file2
     [-format conll]

All TedEval scenarios assume the following file formats:

[-p, -p1, -p2, -g, -g1, -g2]: Each -p*|-g* flag introduces a file containing syntactic trees. A parse tree can be in either WSJ bracketed format (one tree per line) or CoNLL-X format (one word per line, with an empty line as the tree separator). Note:

[-sp, -sp1, -sp2, -sg, -sg1, -sg2]: Each -sp*|-sg* flag introduces a file containing segmentation information. A segmentation file defines the mapping from input tokens to segmented morphemes. Each line introduces an input token and its decomposition. Sentences are separated by a line break. For example:

We assume that each segmentation file has a corresponding parse file, in which the segmented morphemes are the leaves of the tree (whether in bracketed or CoNLL format). For example:

TedEval takes both morphological and syntactic errors into account by editing the parse tree to look like the gold tree and normalizing the error. In this way, it does not over-penalize incorrect segmentation by discarding all dominated trees (as happens with other metrics).

TedEval strictly assumes that the raw tokens in each experiment are the same for both the gold and the parse trees. The segmentations may differ, and TedEval will automatically construct the segmentation lattice for the purposes of morphosyntactic evaluation.

Assuming raw tokens are the same, TedEval allows you to compare results in which the gold segmentation standards are different. For instance:

While this analysis is still correct, it corresponds to a different morphological theory. When gold analyses of this kind are compared against parse hypotheses, TedEval allows you to faithfully compare parsing results obtained under different theories. To ensure that between-theory gaps are taken into account and are not penalized as errors, use the pairwise experiment evaluation scenario described above.
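
When the evaluation is scripted, the single-experiment and pairwise command lines can be assembled programmatically. The Python sketch below only builds the argument list from the flags documented above; the function name `tedeval_args`, the dictionary keys, the default jar path and the file names in the example are hypothetical, and nothing beyond the documented flags is assumed about TedEval's behavior.

```python
def tedeval_args(experiments, jar="tedeval.jar", conll=False):
    """Build the java command line for one experiment or a pair of experiments.

    Each experiment is a dict with the (hypothetical) keys "parse",
    "parse_seg", "gold", "gold_seg" and "out", holding file paths.
    """
    if len(experiments) not in (1, 2):
        raise ValueError("expected one experiment or a pair of experiments")
    # A single experiment uses -p/-sp/-g/-sg/-o; a pair uses the 1 and 2 suffixes.
    suffixes = [""] if len(experiments) == 1 else ["1", "2"]
    args = ["java", "-jar", jar]
    for suffix, exp in zip(suffixes, experiments):
        args += [
            "-p" + suffix, exp["parse"],
            "-sp" + suffix, exp["parse_seg"],
            "-g" + suffix, exp["gold"],
            "-sg" + suffix, exp["gold_seg"],
            "-o" + suffix, exp["out"],
        ]
    if conll:
        args += ["-format", "conll"]
    return args


# Placeholder file names for two experiments with different gold standards.
first = {"parse": "a.parse", "parse_seg": "a.parse.seg",
         "gold": "a.gold", "gold_seg": "a.gold.seg", "out": "a.ted"}
second = {"parse": "b.parse", "parse_seg": "b.parse.seg",
          "gold": "b.gold", "gold_seg": "b.gold.seg", "out": "b.ted"}
print(" ".join(tedeval_args([first, second], conll=True)))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting problems with file names that contain spaces.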

Do It Yourself

TedEval is easy to download, install and use yourself. If you have questions, thoughts or suggestions, we will be happy to hear from you.

You may run a morpho-syntactic evaluation example using the following sample files.