TedEval is a robust and heuristics-free framework for cross-experiment parse evaluation. TedEval can be used to compare parsing results across multiple parsing experiments that adhere to different annotation schemes. The TedEval scores provide an objective measure of parsing performance across experiments, without the need to hand-code idiosyncratic or heuristic rules. More information on the cross-experiment evaluation protocol and the TedEval evaluation measures can be found in Tsarfaty, Nivre and Andersson (2011).
TedEval implements the following evaluation protocol:
TedEval also provides a method for testing the statistical significance of cross-experiment parsing results. The null hypothesis is that the observed difference between the parse hypotheses provided in the two experiments is due to chance. The implementation uses the stratified shuffling test of Cohen (1995).
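For illustration only, a stratified shuffling test over paired per-sentence scores can be sketched as follows (this is our own minimal sketch, not TedEval's internal implementation, and the function names are ours):

import random

def stratified_shuffling_test(scores_a, scores_b, trials=10000, seed=0):
    # Estimate a p-value for the observed difference between two paired
    # lists of per-sentence scores, under the null hypothesis that the
    # assignment of each sentence's two scores to experiments is random.
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    hits = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Stratify by sentence: swap this sentence's pair with p = 1/2.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)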
The TedEval Evaluation Software can currently be used with the following representation types:
TedEval can be used in different evaluation scenarios:
The TedEval output files (*.ted) provide the following info:
All measures are presented per sentence and are then averaged over the test set. The evaluation can be performed on labeled (default) or unlabeled (-unlabeled) trees. The average can be the arithmetic mean of the single-sentence scores (-micro) or it can be derived from the normalized global error (default). Detailed information on the calculation of the TED-based measures and the available options can be found in Tsarfaty, Nivre and Andersson (2011).
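As a rough illustration, the two averaging modes can be sketched as follows, assuming per-sentence edit distances dists[i] and per-sentence worst-case costs worsts[i] (the exact normalization TedEval uses is defined in Tsarfaty, Nivre and Andersson (2011)):

def micro_average(dists, worsts):
    # -micro: arithmetic mean of the single-sentence scores
    return sum(1 - d / w for d, w in zip(dists, worsts)) / len(dists)

def global_average(dists, worsts):
    # default: one minus the normalized global error over the whole test set
    return 1 - sum(dists) / sum(worsts)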
Both Parseval (phrase-structure evaluation) and LAS (dependency-structure evaluation) make a very strict assumption concerning the gold and parse trees to be evaluated: that the yield of the tree is known in advance.
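To see why this assumption matters, note that LAS, for example, simply pairs gold and parse tokens positionally, which is only well-defined when both trees share the same yield. A minimal sketch (ours, for illustration):

def las(gold, parse):
    # gold, parse: one (head_index, deprel) pair per token, in order.
    assert len(gold) == len(parse), "LAS is undefined if the yields differ"
    return sum(g == p for g, p in zip(gold, parse)) / len(gold)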
It is a well-known fact that when parsing morphologically rich languages the yield of the parse tree is not known in advance. A morphologically rich input token may be segmented into multiple morphemes, each of which carries its own lexical meaning and part-of-speech tag, and the morphological segmentation of a given input token may be highly ambiguous. Consider, for instance, the Hebrew phrase below: it has multiple legitimate morphological analyses, of which only the first is correct.
TedEval offers an elegant solution to this situation by measuring the edit distance between one morphosyntactic tree and another. The normalized edit distance is subtracted from unity to quantify the level of success on the morphosyntactic analysis task (a sketch of this score computation follows the usage examples below). To evaluate the segmentation-parsing results of a single experiment, simply provide the parse and gold tree files, as well as the parse and gold segmentation files, on the command line, using the following format:
Usage: > java -jar tedeval.jar -p parse_file -sp parse_segmentation_file -g gold_file -sg gold_segmentation_file -o eval_file [-format conll]

If you compare two experiments for which the gold segmentation standards may vary, provide all relevant tree and segmentation files for both experiments, as follows:
Usage: > java -jar tedeval.jar -p1 parse_file1 -sp1 parse_segmentation_file1 -g1 gold_file1 -sg1 gold_segmentation_file1 -o1 eval_file1 -p2 parse_file2 -sp2 parse_segmentation_file2 -g2 gold_file2 -sg2 gold_segmentation_file2 -o2 eval_file2 [-format conll]
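The score itself can be sketched as follows, assuming a tree-edit-distance function ted() and a node-count function size() (both hypothetical names here; the exact cost model and normalization are defined in Tsarfaty, Nivre and Andersson (2011)):

def tedeval_score(gold_tree, parse_tree, ted, size):
    # Worst case: delete every node of one tree and insert every node of
    # the other, so the normalized error always falls in [0, 1].
    worst_case = size(gold_tree) + size(parse_tree)
    return 1.0 - ted(gold_tree, parse_tree) / worst_case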
All TedEval scenarios assume the following file formats:

[-p, -p1, -p2, -g, -g1, -g2]: Each -p*|-g* flag introduces a file containing syntactic trees. A tree can be in either WSJ bracketed format (one tree per line) or in CoNLL-X format (one word per line, with an empty line as tree separator).
[-sp, -sp1, -sp2, -sg, -sg1, -sg2]: Each -s* flag introduces a file containing segmentation information. The segmentation files provide the mapping from input tokens to segmented morphemes. Each line introduces an input token and its decomposition, and sentences are separated by an empty line. For example, here are two alternative segmentation files for the same input tokens, followed by their corresponding trees (a minimal reader sketch for this format appears after the examples):
BCLM    B:CL:FL:HM
FL      FL
HECIM   H:ECIM

BCLM    BCL:FL:HM
FL      FL
HECIM   HECIM
(TOP (PP (IN B) (NP (NP (NN CL) (PP (POS FL) (PRP HM))) (ADJP (DEF H) (JJ NEIM)))))
(TOP (NP (NP (NN BCL) (PP (POS FL) (PRP HM))) (PP (POS FL) (VB HECIM))))
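A minimal reader for this segmentation format might look as follows, assuming one whitespace-separated token/decomposition pair per line, morphemes joined by ':', and an empty line between sentences:

def read_segmentation(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                # empty line: sentence boundary
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, decomposition = line.split()
            current.append((token, decomposition.split(":")))
    if current:                         # file may lack a trailing empty line
        sentences.append(current)
    return sentences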
TedEval strictly assumes that the raw tokens in each experiment are the same for both the gold and the parse trees. The segmentations may differ, and TedEval will automatically construct the segmentation lattice for the purposes of morphosyntactic evaluation, as sketched below.
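TedEval's lattice construction is internal to the tool, but the underlying idea can be illustrated with a toy sketch: for each raw token, the alternative segmentations form alternative morpheme paths between a shared start node and a shared end node:

def token_lattice(segmentations):
    # Build lattice arcs (from_node, to_node, morpheme) for one raw token.
    # segmentations: alternative analyses, each a list of morphemes.
    arcs = set()
    for path_id, morphemes in enumerate(segmentations):
        prev = "start"
        for i, morph in enumerate(morphemes):
            # Interior nodes are private to each analysis; the shared start
            # and end nodes make the alternatives meet at both ends.
            node = "end" if i == len(morphemes) - 1 else "p%d.%d" % (path_id, i)
            arcs.add((prev, node, morph))
            prev = node
    return arcs

# The ambiguous token BCLM from the example above:
token_lattice([["B", "CL", "FL", "HM"], ["BCL", "FL", "HM"]])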
Assuming raw tokens are the same, TedEval allows you to compare results in which the gold segmentation standards are different. For instance:
BCLM    B:CLM
HNEIM   H:NEIM
FL      FL
HECIM   H:ECIM
(TOP (PP (IN B) (NP (NN CLM) (PP (POS FL) (NP (DEF H) (NN ECIM))))))
You may run a morphosyntactic evaluation example using the following sample files.