Arabic tokenizer used by Azure Translator

Arianna Mercuriali
2023-10-11T08:55:47.4533333+00:00

Dear all,

I am writing to ask you about the type of tokenizer used for processing the Arabic language when translating and testing MT models via Azure Custom Translator Console.

I trained an English-to-Arabic MT model via Azure Custom Translator and got a BLEU score of 58.82, then I downloaded the test set to do some further testing with other MT engines. When I had all the results, I recalculated the automatic scores in Python, and the BLEU score for the Microsoft custom engine I had trained via Azure turned out to be much lower (50 compared to 58.8). That got me thinking about the tokenizer used to process a non-Latin-script language like Arabic. Could this difference in scores be due to the use of a different tokenizer? What kind of tokenizer are you using for Arabic? Are there different tokenizers for non-Latin-script languages? I am also testing Hebrew, Hindi, Russian and Ukrainian, so I am wondering whether I should use a different, perhaps customized, tokenizer for these languages.
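
To illustrate the kind of effect I suspect, here is a minimal sketch in plain Python (standard library only) that scores the same hypothesis against the same reference with two different tokenizers. It is not the tokenizer or the BLEU implementation used by Custom Translator, which is exactly what I do not know. The function names, the two tokenization rules and the example sentences are my own assumptions for illustration, and the BLEU here is a simplified single-reference version without smoothing.

```python
import math
import re
from collections import Counter


def tokenize_whitespace(text):
    # Split on whitespace only, so punctuation stays attached to words.
    return text.split()


def tokenize_punct(text):
    # Separate punctuation from words. In Python 3, \w is Unicode-aware,
    # so runs of Arabic letters are kept together as single tokens.
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    # Count all n-grams of order n in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, tokenize, max_n=4):
    # Simplified corpus-level BLEU: one reference per segment, no smoothing.
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = tokenize(hyp), tokenize(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses that are not longer than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


# Hypothetical example segment (not taken from my test set).
references = ["ذهب الولد الى المدرسة صباح اليوم."]
hypotheses = ["ذهب الولد الى المدرسة اليوم."]

score_ws = corpus_bleu(hypotheses, references, tokenize_whitespace)
score_punct = corpus_bleu(hypotheses, references, tokenize_punct)
print("whitespace tokenizer:", round(score_ws, 2))
print("punctuation-splitting tokenizer:", round(score_punct, 2))
```

Even on this single segment the two tokenizers give different scores for identical output, which is why I would like to know which tokenization the console applies before I compare engines.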

Thank you!

Arianna

Azure Translator
An Azure service to easily conduct machine translation with a simple REST API call.