Thot is a toolkit for training phrase-based models for statistical machine translation. It estimates the phrase-based models described in (Och, 2002) and (Ortiz et al., 2005). It can also compute the best phrase alignment under a given phrase model, as described in (García-Varea et al., 2005). A description of the toolkit can be found in (Ortiz et al., 2005).
Thot is being developed by Daniel Ortiz at the Pattern Recognition and Human Language Technology (PRHLT) research group of the Universidad Politécnica de Valencia (UPV) and the Intelligent Systems and Data Mining (SIMD) research group of the Universidad de Castilla-La Mancha (UCLM).
The toolkit includes the following functionality:
It is known to compile on Linux and on Windows (using Cygwin). Porting the code to other platforms is planned as future work. If you run into a compilation problem, please first try the newest version of your compiler. Patches to the code are very welcome. Feel free to send me mail asking for help.
It is released under the GNU General Public License (GPL).
Here is a BibTeX entry:
@InCollection{Ortiz2005,
author = {D.~Ortiz-Mart\'{\i}nez and I.~Garc\'{\i}a-Varea and F.~Casacuberta},
title = {Thot: a Toolkit To Train Phrase-based Statistical Translation Models},
booktitle = {Tenth Machine Translation Summit},
publisher = {AAMT},
year = {2005},
address = {Phuket, Thailand},
month = {September},
}
@PhDThesis{och02:phd,
author = {Franz Josef Och},
title = {Statistical Machine Translation: From Single-Word Models to
Alignment Templates},
school = {Computer Science Department, RWTH Aachen},
year = {2002},
month = {October},
address = {Germany},
annote = {machine translation, statistical machine translation,
grammar inference, translation modeling, decoding
algorithms, decoder, search, alignment, alignment models,
word-based alignment, structure-based alignment}
}
@InCollection{GarciaVarea2005,
author = {I. Garc\'{\i}a-Varea and D. Ortiz and F. Nevado and
P.A. G\'{o}mez and F. Casacuberta},
title = {Automatic segmentation of bilingual corpora: A comparison of
different techniques},
booktitle = {Iberian Conference on Pattern Recognition and Image
Analysis},
year = {2005},
address = {Estoril, Portugal},
month = {June},
pages = {614--621},
series = {Lecture Notes in Computer Science},
publisher = {Springer-Verlag},
volume = {3523},
}