THOT

Thot: a toolkit to train phrase-based models for statistical machine translation.

Thot is a toolkit for training phrase-based models for statistical machine translation. It estimates the phrase-based models described in (Och, 2002) and (Ortiz et al. 2005). It can also obtain the best phrase alignment given a phrase model, as described in (García-Varea et al. 2005). A description of the toolkit can be found in (Ortiz et al. 2005).
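Phrase-based models of this kind are usually estimated from word-aligned sentence pairs. The sketch below illustrates the general idea only: it extracts every phrase pair that is consistent with a word alignment and scores the pairs by relative frequency. It is not Thot's code, Thot's actual estimation may differ in detail, and the function names (extract_phrases, relative_frequencies), the max_len parameter and the toy sentence pair are all assumptions made for illustration.

```python
from collections import Counter


def extract_phrases(src, trg, links, max_len=3):
    """Return the phrase pairs that are consistent with the word alignment.

    links is a set of (i, j) pairs meaning "source word i is aligned to
    target word j" (0-based). A pair of spans is consistent when it holds
    at least one link and no link joins a word inside one span to a word
    outside the other span.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(trg)):
                for j2 in range(j1, min(j1 + max_len, len(trg))):
                    inside = any(i1 <= i <= i2 and j1 <= j <= j2
                                 for i, j in links)
                    crossing = any((i1 <= i <= i2) != (j1 <= j <= j2)
                                   for i, j in links)
                    if inside and not crossing:
                        pairs.append((" ".join(src[i1:i2 + 1]),
                                      " ".join(trg[j1:j2 + 1])))
    return pairs


def relative_frequencies(pairs):
    """Estimate p(target phrase | source phrase) by relative frequency."""
    pair_counts = Counter(pairs)
    src_counts = Counter(s for s, _ in pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}


# Toy example (hypothetical data): la-the, casa-house, verde-green.
src = ["la", "casa", "verde"]
trg = ["the", "green", "house"]
links = {(0, 0), (1, 2), (2, 1)}

pairs = extract_phrases(src, trg, links)
probs = relative_frequencies(pairs)
for (s, t), p in sorted(probs.items()):
    print(f"{s} ||| {t} ||| {p:.2f}")
```

In this toy example "casa verde" / "green house" is extracted, while "la casa" has no consistent counterpart because its two links are separated by the link of "verde". The brute-force loops are chosen for readability, not speed.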

Thot is being developed by Daniel Ortiz at the Pattern Recognition and Human Language Technology (PRHLT) research group of the Universidad Politécnica de Valencia (UPV) and the Intelligent Systems and Data Mining (SIMD) research group of the Universidad de Castilla-La Mancha (UCLM).

About Thot

The toolkit includes the following functionality:

In order to compile Thot you may need:

Thot is known to compile on Linux and on Windows (using Cygwin). We plan to port the code to other platforms in the future. If you run into compilation problems, first try upgrading to the newest version of your compiler. Patches to the code are very welcome, and feel free to email me for help.

Thot is released under the GNU General Public License (GPL).

Additional requirements:

Thot estimates phrase-based models from word alignment matrices such as those generated by the publicly available GIZA++ toolkit (the xxxx.A3.final files). You may therefore need to download GIZA++ for the first step of the estimation, unless you have another way to generate the alignment matrices in the required format.
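To make the input format concrete, here is a minimal sketch that reads one record of an A3.final file into a boolean alignment matrix. It assumes the three-line record layout commonly produced by GIZA++ (a "# Sentence pair ..." header, one sentence, then the other sentence with the aligned positions in "({ ... })" groups). The function name parse_a3_record, the labels "source" and "target" as used here, and the sample record are assumptions for illustration and are not taken from Thot.

```python
import re

# One "word ({ p1 p2 ... })" group on the third line of a record.
_GROUP = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_record(header, target_line, alignment_line):
    """Parse one 3-line record; return (source_words, target_words, matrix).

    matrix[i][j] is True when source word i is aligned to target word j
    (both 0-based). The leading NULL group is skipped.
    """
    if not header.startswith("#"):
        raise ValueError("expected a '# Sentence pair ...' header line")
    target_words = target_line.split()
    groups = _GROUP.findall(alignment_line)
    if groups and groups[0][0] == "NULL":
        groups = groups[1:]
    source_words = [word for word, _ in groups]
    matrix = [[False] * len(target_words) for _ in groups]
    for i, (_, positions) in enumerate(groups):
        for pos in positions.split():
            # Positions in the file are 1-based.
            matrix[i][int(pos) - 1] = True
    return source_words, target_words, matrix


# Hypothetical sample record, written by hand for this example.
record = [
    "# Sentence pair (1) source length 3 target length 3 alignment score : 0.001",
    "the green house",
    "NULL ({ }) la ({ 1 }) casa ({ 3 }) verde ({ 2 })",
]
src, trg, matrix = parse_a3_record(*record)
for word, row in zip(src, matrix):
    print(f"{word:>6}", " ".join("x" if cell else "." for cell in row))
```

A real file contains many such records back to back, so a caller would read it three lines at a time and pass each group of lines to the function.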

Citation:

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

  • Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Thot: a Toolkit To Train Phrase-based Statistical Translation Models", Proc. of the Tenth Machine Translation Summit, Phuket, Thailand, September 2005.
  • Here is a BibTeX entry:

    @InCollection{Ortiz2005,
      author =  {D.~Ortiz-Mart\'{\i}nez  and  I.~Garc\'{\i}a-Varea and F.~Casacuberta},
      title =  {Thot: a Toolkit To Train Phrase-based Statistical Translation Models},
      booktitle = {Tenth Machine Translation Summit},
      publisher = {AAMT},
      year = {2005},
      address = {Phuket, Thailand},
      month = {September},
    }
    

Literature:

    @PhDThesis{       och02:phd,
      author        = {Franz Joseph Och},
      title         = {Statistical Machine Translation: From Single-Word Models to
    Alignment Templates},
      school        = {Computer Science Department, RWTH Aachen},
      year          = {2002},
      month         = {October},
      address       = {Germany},
      annote        = {machine translation, statistical machine translation,
                      grammar inference, translation modeling, decoding
                      algorithms, decoder, search, alignment, alignment models,
                      word-based alignment, structure-based alignment}
    }
    

    @InCollection{Ortiz2005,
      author =  {D.~Ortiz-Mart\'{\i}nez  and  I.~Garc\'{\i}a-Varea and F.~Casacuberta},
      title =  {Thot: a Toolkit To Train Phrase-based Statistical Translation Models},
      booktitle = {Tenth Machine Translation Summit},
      publisher = {AAMT},
      year = {2005},
      address = {Phuket, Thailand},
      month = {September},
    }
    

    @InCollection{GarciaVarea2005,
      author =       {I. Garc\'{\i}a-Varea and  D. Ortiz and  F. Nevado and 
                      P.A. Gómez and  F. Casacuberta},
      title =        {Automatic segmentation of bilingual corpora: A comparison of
                      different techniques},
      booktitle =    {Iberian Conference on Pattern Recognition and Image 
                      Analysis},
      year =         {2005},
      address =      {Estoril (Portugal)},
      month =        {June},
      pages =        {614--621},
      series =       {Lecture Notes in Computer Science},
      publisher =    {Springer-Verlag},
      volume = {3523},
    }
    

Acknowledgements

This work has been partially supported by the Spanish project TIC2003-08681-C02-02, the Agencia Valenciana de Ciencia y Tecnología under contract GRUPOS03/031, the Generalitat Valenciana, and the project HERMES (Vicerrectorado de Investigación - UCLM-05).

Last updated: 3 June 2006, danulino@users.sourceforge.net