
Friday, July 3, 2009

GIZA++ Issues

Peace be upon you, and the mercy of ALLAH and His blessings

While I was trying to build GIZA++ for the first time, I encountered a build break, which made me enthusiastic enough to look into the code :) and play with it a little to fix the break.
However, when I later used Cygwin to compile the original, unmodified code, it compiled just fine, and my modification was not needed to build GIZA++ successfully.

1st Issue...

Despite that, I'd like to share the modification I made to get the code to build successfully. I also found that the issue had been reported on the official GIZA++ website, where the author recommended using a different version of the gcc compiler; more info about the issue is available there.

There are two definitions of the same function, and the compiler gets confused about which one to call.
  1. Edit giza-pp\GIZA++-v2\collCounts.cpp and delete this function definition:

    template<class TRANSPAIR>
    double collectCountsOverNeighborhoodForSophisticatedModels(const MoveSwapMatrix<TRANSPAIR>&, LogProb, void*)
    {
        return 0.0;
    }

  2. In the remaining definition:

    template<class TRANSPAIR, class MODEL>
    double collectCountsOverNeighborhoodForSophisticatedModels(const MoveSwapMatrix<TRANSPAIR>& msc, LogProb normalized_ascore, MODEL* d5Table) {...}

    change the last parameter of the following function call
    from:

    _collectCountsOverNeighborhoodForSophisticatedModels(…,…,…,…,d5Table);

    to:

    _collectCountsOverNeighborhoodForSophisticatedModels(…,…,…,…,(d5model*)d5Table);

This is my first step in working with GIZA++. I don't even know whether the changes I made to the source files will cause problems, but they seem logical to me :).

2nd Issue...

If you run "make clean" under Cygwin, the EXEs won't get cleaned, because the Makefiles were written for Linux, where executables have no .exe extension. So you just have to let the Makefiles know about the Cygwin names.
  • Edit \giza-pp\GIZA++-v2\Makefile and, in the line "-rm -f snt2plain.out plain2snt.out snt2cooc.out GIZA++", change "GIZA++" to "GIZA++.exe"
  • Edit \giza-pp\mkcls-v2\Makefile and, in the line "-rm -f *.o mkcls", change "mkcls" to "mkcls.exe"
Now the clean will work fine and delete the EXEs.
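
If you would rather not touch the Makefiles at all, the same cleanup can be done by hand from the Cygwin shell. This is only a small sketch: clean_binaries is a helper name I made up (it is not part of GIZA++), and it simply removes each listed binary both with and without the .exe suffix, so it should behave the same on Linux and Cygwin.

```shell
# Hypothetical helper (not part of the GIZA++ Makefiles).
# Usage: clean_binaries DIR NAME [NAME...]
# Removes DIR/NAME and DIR/NAME.exe for every NAME given.
clean_binaries() {
    dir="$1"
    shift
    for name in "$@"; do
        # rm -f stays quiet when a file does not exist,
        # so asking for both spellings is harmless.
        rm -f "$dir/$name" "$dir/$name.exe"
    done
}
```

For example, "clean_binaries giza-pp/GIZA++-v2 GIZA++" followed by "clean_binaries giza-pp/mkcls-v2 mkcls" would remove the two EXEs mentioned above; adjust the paths to wherever your sources live.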

Hope it was a useful post for you guys.

GIZA++ commands

Peace be upon you, and the mercy of ALLAH and His blessings

Today was my first time running GIZA++, and most of the references on the internet state what I'll state in this post. However, I'll try to talk more about GIZA++ in coming posts as I go further with it, In Sha'a ALLAH.

So let me introduce how to train using GIZA++.
  • Assuming that you have a bin folder that contains all the output files from building GIZA++
  • Assuming the parallel corpus files reside inside that bin folder, e.g. arabic.txt and english.txt
  • Execute the following commands under Cygwin...
  1. Convert the plain text to GIZA++ format
    Run: ./plain2snt.out english.txt arabic.txt
    Output: \bin\arabic.vcb
    \bin\arabic_english.snt
    \bin\english.vcb
    \bin\english_arabic.snt
  2. Generate the word-class files: word to class (classes) and class to words (cats)
    Run: ./mkcls -penglish.txt -Venglish.vcb.classes
    ./mkcls -parabic.txt -Varabic.vcb.classes
    Output: \bin\arabic.vcb.classes
    \bin\arabic.vcb.classes.cats
    \bin\english.vcb.classes
    \bin\english.vcb.classes.cats
    Note: Almost everywhere this step is described as optional. I tried running GIZA++ with and without generating the classes files, and it worked fine both ways.
  3. Finally run GIZA++
    Run: ./GIZA++ -T english.vcb -S arabic.vcb -C english_arabic.snt
    Or
    ./GIZA++ -T english.vcb -S arabic.vcb -C arabic_english.snt
    Output: I found a nice PPT file through Google :) that describes the contents of all the output files and what they are for.
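
To avoid retyping the three steps, they can be wrapped in a small shell function. Treat this as a sketch under a few assumptions: train_commands and the RUN variable are names I made up, the script is meant to be sourced from inside the bin folder, and I picked the arabic_english.snt variant of the GIZA++ command. The second mkcls call writes arabic.vcb.classes, to match the output files listed in step 2.

```shell
# RUN is a hypothetical hook: leave it empty to really run the tools,
# or set RUN=echo to only print the commands (a dry run).
RUN="${RUN:-}"

# Usage: train_commands SOURCE TARGET   (e.g. train_commands arabic english)
# Expects SOURCE.txt and TARGET.txt in the current (bin) folder.
train_commands() {
    src="$1"
    tgt="$2"
    # Step 1: convert the plain text to GIZA++ format (.vcb and .snt files)
    $RUN ./plain2snt.out "$tgt.txt" "$src.txt" &&
    # Step 2 (optional): build the word-class files for both languages
    $RUN ./mkcls "-p$tgt.txt" "-V$tgt.vcb.classes" &&
    $RUN ./mkcls "-p$src.txt" "-V$src.vcb.classes" &&
    # Step 3: run GIZA++ on one of the two generated .snt files
    $RUN ./GIZA++ -T "$tgt.vcb" -S "$src.vcb" -C "${src}_${tgt}.snt"
}
```

Setting RUN=echo and calling "train_commands arabic english" prints the four commands without executing anything, which is a cheap way to check the file names before a long training run. Because the steps are chained with &&, a failing step stops the ones after it.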
That's all for now, see you next post :).