Arabic tokenizer used by Azure Translator
Dear all,
I am writing to ask about the type of tokenizer used for processing Arabic when translating and testing MT models via the Azure Custom Translator Console.
I trained an English-to-Arabic MT model via Azure Custom Translator and got a BLEU score of 58.82, then downloaded the test set to do some further testing with other MT engines. Once I had all the results, I recalculated the automatic scores in Python, and the BLEU score for the Microsoft custom engine I had trained via Azure turned out to be much lower (50 compared to 58.8). That got me thinking about the tokenizer used to process a non-Latin-script language like Arabic.

Could this difference in scores be due to the use of a different tokenizer? What kind of tokenizer are you using for Arabic? Are there different tokenizers for non-Latin scripts? I am also testing Hebrew, Hindi, Russian and Ukrainian, so I am wondering whether I should use a different, perhaps customized, tokenizer for these languages.
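To show why I suspect the tokenizer, here is a minimal sketch of the effect: the same hypothesis/reference pair scored with a simplified sentence-level BLEU (clipped n-gram precisions with add-one smoothing and a brevity penalty), once on whitespace tokens and once on character tokens. This is illustrative only; it is not the exact metric Azure reports, and the Arabic sentences are toy examples I made up.

```python
# Illustrative sketch: how tokenization changes a BLEU-style score.
# NOT the metric Azure Custom Translator uses -- a simplified BLEU
# with add-one smoothing so short toy sentences don't score zero.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams over a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Geometric mean of smoothed clipped n-gram precisions * brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
        overlap = sum((hyp_ng & ref_ng).values())   # clipped matches
        total = max(sum(hyp_ng.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # +1 smoothing
    bp = min(1.0, math.exp(1 - len(ref_tokens) / max(len(hyp_tokens), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "الكتاب على الطاولة"   # toy reference: "the book is on the table"
hyp = "الكتاب فوق الطاولة"   # toy hypothesis with one word changed

word_score = bleu(hyp.split(), ref.split())  # whitespace tokenization
char_score = bleu(list(hyp.replace(" ", "")),
                  list(ref.replace(" ", "")))  # character tokenization

print(f"word-level BLEU: {word_score:.3f}")
print(f"char-level BLEU: {char_score:.3f}")
```

The two tokenizations give clearly different scores for the same translation, which is the kind of gap I am seeing between the console score and my own recalculation.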
Thank you!
Arianna