Models of Morphological Analysis of Uzbek Words

Atadjanov J.A.

doi:10.7256/2306-4196.2016.6.20945

This article is published under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

Cybernetics and programming

Cite this article as:

Atadjanov J.A. Models of Morphological Analysis of Uzbek Words // Cybernetics and Programming. 2016. No. 6. P. 70-73. DOI: 10.7256/2306-4196.2016.6.20945 URL: https://nbpublish.com/library_read_article.php?id=20945

Atadjanov Jasur Abdusharibovich

PhD in Technical Science

head of the department at Uzbektelekom

100000, Uzbekistan, Tashkent Region, str. Yunusobod, Yunusobod, ap. 13

j.atadjanov@gmail.com

Date submitted to the editors:

03-11-2016

Publication date:

02-02-2017

Abstract: The subject of the research is the models and algorithms for the morphological analysis of texts, together with the categories of suffixes in the Uzbek language and the rules for using them. The object of the research is the process of identifying the roots of words in Uzbek sentences according to the morphological rules of the language, without using additional dictionaries. The methods and models under development are oriented toward the specific features of Uzbek, its structure and the peculiarities of its word forms, for subsequent comparison, normalization and search for similar texts in databases. The research methods include morphological analysis of texts, abstract programming, the theory of finite state machines and graphs, and mathematical modeling. To build a plagiarism detection program, it is most important to study the specifics of the language in which a text is written. The article presents an approach to the morphological analysis of Uzbek words. The approach is based on analysing words with the finite state machine (FSM) method and on automating the identification of word roots.

Keywords:

natural language analysis, affixes, suffixes, finite state automaton, root identification, word normalization, automata theory, word parser, morphological rules, abstract programming

Introduction.

Morphological analysis, which deals with the subparts of words, is one of the fundamental areas of natural language processing. It depends on the morphological rules of each natural language, which means that a separate analysis method has to be developed for every language. Several word-parsing methods exist and have been implemented for different languages, such as Finnish ^[6], English ^[8], Turkish ^[7] and Japanese ^[5].

The paper “Two-level Description of Turkish Morphology” ^[7] gives a full two-level morphological description of Turkish word structures. The description was implemented in the PC-KIMMO environment and is based on a lexicon of about 23,000 root words. The phonetic rules of contemporary Turkish (as spoken in Turkey) are encoded as 22 two-level rules, while the morphotactics of the agglutinative word structures are encoded as finite-state machines for verbal paradigms, nominal paradigms and other categories. Almost all special cases of, and exceptions to, the phonological and morphological rules are taken into account. The paper presents the rules and the finite-state machines together with examples and a discussion of how the various special cases were handled.

The paper ^[5] presents an analysis of the morpho-syntax of the verb phrase in modern Japanese, together with an example of Kimmo's two-level rules. The analysis and the example form the basis, respectively, for a lexicon and automata for Japanese, which serve as input to Kimmo Koskenniemi's two-level morphological analyzer/generator.

Every natural language has its own morphological rules, and the logic of these rules differs from language to language. The following example compares Uzbek word forms with their English equivalents to show how differently the two languages build words.

Some of the forms derived from the Uzbek noun “ish” (job, work) are listed below:

ish – job, work (noun) – It is my work area

ishla – to work (verb) – I work at the station

ishlama – do not work (verb, negative) – Do not work after 10 o'clock

ishlamagan – did not work (past tense) – He did not work yesterday

ishlamaganlar – those who did not work (noun) – Give me the list of persons who did not work yesterday

At first glance, it seems possible to store all inflected word forms in a lexicon and to process the language without any morphological analysis. This approach can suit morphologically simple languages, but it is untenable for agglutinative ones, where a single word can take hundreds of different forms through the concatenation of affixes.

Similar differences exist between any two languages, which means that a morphological analyzer/generator built for one language cannot be reused for another. Such analyzers have been developed for many languages but not for Uzbek, and there is no general rule or algorithm that can analyse words and sentences across different languages.

Finding the root of each word is essential when indexing large texts, comparing two or more texts, counting the words used in a text, and translating from one language to another.

When building an anti-plagiarism program that checks texts for similarity, the first question is in which languages it can search for similar texts. For Uzbek texts, comparing for similarity requires analysing the morphological structure of the words.

As with any natural (spoken) language, analysing the morphological structure of Uzbek words relies on the ordering constraints and rules of the language itself. Several approaches to identifying word roots exist today, and most of them build on PC-KIMMO or on Porter-style stemming algorithms ^[1]. The affix-stripping analyzer for Turkish by G. Eryigit and E. Adali ^[2] is one example.

Main part.

As in Turkish, word derivation in Uzbek follows a strict and limited set of ordering rules.

Affixes and suffixes in Uzbek do not occur as independent words: they are never used separately and are always attached to roots. Affixes are divided into three groups according to their function and meaning when added to a root ^[4].

a) Derivational suffixes are added to a word to form a new word with a different meaning: suv – suvchi (water – waterer), kuch – kuchli (power – powerful).

b) Word-modifying suffixes connect words to one another within a phrase or sentence. They fall into three types:

1. Case suffixes: -ni, -ning, -ga (-ka, -qa), -da, -dan

2. Possessive suffixes: -im, -ing, -imiz, -ingiz

3. Personal and number suffixes: -i, -(i)m, -(i)ng, -san, -man

c) Form-forming suffixes slightly change the meaning of a word or add a shade of meaning, but cannot form a new word.

Suffixes attach to the root in the following order ^[4]:

Root + derivational suffix + form-forming suffix + word-modifying suffix    (1)

In Uzbek: ўзак + сўз ясовчи қўшимчалар + шакл ясовчи қўшимчалар + сўз ўзгартирувчи қўшимчалар

By structure, suffixes are divided into two types:

1. Simple suffixes cannot be split into other suffixes.

2. Complex suffixes are built by joining at least two suffixes, for example: -chilik, -lash, -lan, -lab, -lay.

Some suffixes change form depending on the last letter of the word. For example, when the suffix -ga is attached to words ending in -q, -g’ or -k, it takes the forms -qa, -g’a and -ka respectively before being joined to the word.
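The variation rule for -ga can be sketched as a small lookup on the final letter of the stem. This is only an illustration of the rule exactly as stated above. The function names `dative_suffix` and `attach_dative` are hypothetical, and accepting several apostrophe characters for the digraph g’ is an assumption about how the input text is encoded.

```python
# Apostrophe characters that may appear in the digraph g' (assumed encodings).
G_DIGRAPHS = ("g'", "g’", "gʻ")


def dative_suffix(stem: str) -> str:
    """Return the surface form of the suffix -ga for the given stem."""
    # The digraph is checked first because it is two characters long.
    if stem.endswith(G_DIGRAPHS):
        return stem[-2:] + "a"
    if stem.endswith("q"):
        return "qa"
    if stem.endswith("k"):
        return "ka"
    return "ga"


def attach_dative(stem: str) -> str:
    """Join the stem with the variant of -ga selected by its last letter."""
    return stem + dative_suffix(stem)


print(attach_dative("eshik"))    # stem ends in k
print(attach_dative("qishloq"))  # stem ends in q
print(attach_dative("maktab"))   # default form -ga
```

An analyzer that strips suffixes would use the same table in reverse: all listed variants are treated as one suffix class when matching the end of a word.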

Uzbek adjectives have three degrees, but only the comparative degree has its own suffix, -roq. Numerals take the following suffixes: collective -ov, -ala; counting (unit) -ta; approximate quantity -lab, -larcha; distributive -tadan; ordinal -(i)nchi.

Derived adverbs are formed with the suffixes -cha, -lab, -larcha, -ona, -an, -chasiga. Others are formed with the case suffixes -(n)iki, -dek, -day, which are used with pronouns.

Some suffixes express tense, person and number, or negation. For example, the word ishlamadim (I did not work) contains the negative suffix -ma, the tense suffix -di, and the person and number suffix -(i)m.

Past tense: -di, -gan, -ibdi;

Present tense: -moqda, -yotir, -yapti

Future tense: -adi, -ajak, -moqchi, -a.

Voice in verbs expresses how the subject takes part in the action of the sentence.

a. Active voice – the action is performed by the subject;

b. Reflexive voice – the action is directed back at the subject;

c. Passive voice – the action is performed by an unnamed agent: -(i)n, -(i)l;

d. Causative voice – the action is performed by another person at the instigation of the subject: -t, -dir (-tir), -giz (-qiz), -gaz. The variant used depends on the last letter of the stem;

e. Reciprocal (cooperative) voice – the action is performed jointly by several people: -sh, -ish. For example: ishlashdi, kelishdi, boshlashdi (they worked, they came, they began).

Non-finite forms of the verb

a) Infinitive (verbal noun) – a verb form that behaves like a noun: -(i)sh, -(u)v, -moq. For example: ishlov, bormoq (treatment, to go).

b) Participle – a verb form that behaves like an adjective: -gan (-qan, -kan), -iydigan (-adigan), -(a)yotgan, -(a)r. For example: oqar daryo, ketayotgan odam (flowing river, departing person). The negative form of the participle is built with the suffix -ma: ishlayotgan – ishlamayotgan (working – not working). If the participle is formed with the suffix -(a)r, the negative is built with -mas, for example: kelar – kelmas.

c) Gerund – a verb form that behaves like an adverb: -(i)b, -(a)y, -gancha (-kancha, -qancha), -gach (-kach, -qach), -guncha (-quncha, -kuncha), -gani (-qani, -kani).

Derivational verbs

The following suffixes form verbs from other parts of speech:

-(a)y, -a – from nouns, adjectives and interjections;

-(a)y, -i, -sira, -sa – from nouns and adjectives;

-(a)r – from adjectives;

-illa, -ira – from imitative (onomatopoeic) words;

-sira – from pronouns.

Moods

a) Subjunctive and imperative mood: -(a)y, -(a)yin, -gin (-kin, -qin), -(i)ng, -sin, -(a)ylik, -(i)nglar;

b) Conditional mood: -sa (borsa, kelsa: if he goes, if he comes), -saydi (borsaydi, kelsaydi);

Several kinds of suffixes have been described above. Uzbek also has words whose endings coincide with suffixes. For example, olma can mean “apple” or “do not take”, where -ma is the negative suffix. If a case or possessive suffix is attached to olma, the word denotes the fruit. If a tense, mood or gerund suffix is attached, -ma is the negative suffix. In such cases the suffix must be separated correctly, which requires either semantic analysis of the text or a definite ordering of the suffixes. The permitted order of the suffixes can be expressed with the following graph (Graph 1).

Graph 1

Code	Suffix type	May follow (codes)	Example
0	Root
Word-derivational suffixes
1	Noun-derivational	0	-chi, -uvchi, -la
2	Verb-derivational	0	-la
3	Adjective-derivational	0
Form-forming suffixes
11	Plural suffix	0, 1, 3, 14	-lar
12	Negative suffix	0, 2	-ma (-mas)
13	Adjective degree suffix	0, 3	-roq
14	Verb tense suffix	0, 2, 12	-di, -moqchi
Word-modifying suffixes
101	Case suffix	0, 1, 11, 103	-ni, -ga, -ning
102	Person and number suffix	0, 1, 14	-(i)k, -(i)m
103	Possessive suffix	0, 1, 11	-(i)m, -(i)miz
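As a rough illustration, the third column of Graph 1 can be stored as a map from each suffix code to the set of codes it may follow, and a sequence of suffix codes can then be checked against it. The names `ALLOWED_PREDECESSORS` and `is_valid_sequence` are hypothetical. Treating -gan as a tense suffix (code 14) follows the past tense list given earlier and is an assumption about how the table is meant to be applied.

```python
# Code of each suffix class -> codes it may directly follow (from Graph 1).
ALLOWED_PREDECESSORS = {
    1: {0},
    2: {0},
    3: {0},
    11: {0, 1, 3, 14},
    12: {0, 2},
    13: {0, 3},
    14: {0, 2, 12},
    101: {0, 1, 11, 103},
    102: {0, 1, 14},
    103: {0, 1, 11},
}

ROOT = 0


def is_valid_sequence(suffix_codes):
    """Check that every suffix code may follow the code before it.

    The sequence lists the suffix classes from left to right,
    starting right after the root (code 0).
    """
    previous = ROOT
    for code in suffix_codes:
        if previous not in ALLOWED_PREDECESSORS.get(code, set()):
            return False
        previous = code
    return True


# ish + la (2) + ma (12) + gan (14) + lar (11) + ni (101)
print(is_valid_sequence([2, 12, 14, 11, 101]))
# A plural suffix placed after a case suffix is not allowed by the table.
print(is_valid_sequence([101, 11]))
```

The first call accepts the suffix chain of ishlamaganlarni and the second rejects a plural suffix after a case suffix, since 101 is not listed among the predecessors of 11.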

Distinguishing suffixes according to Graph 1 helps to separate the true root of a word from its suffixes when identifying the root morpheme.
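The olma case described above can be sketched as a lookup on what follows the string olma. This is a minimal sketch rather than a full analyzer. The function name `read_olma` is hypothetical, and the two suffix sets are small illustrative subsets of the case, possessive, tense, mood and gerund suffixes listed earlier (with -m and -miz as the vowel-stem forms of -(i)m and -(i)miz).

```python
# Illustrative subsets only (assumption): case and possessive suffixes.
NOMINAL_SUFFIXES = {"ni", "ning", "ga", "da", "dan", "m", "miz"}
# Illustrative subsets only (assumption): tense, mood and gerund suffixes.
VERBAL_SUFFIXES = {"di", "gan", "sa", "sin", "y", "gach"}


def read_olma(word):
    """Decide how to segment a word form that starts with 'olma'.

    Returns (reading, root, suffixes). With no following suffix, or an
    unknown one, the form stays ambiguous and needs context to resolve.
    """
    if not word.startswith("olma"):
        raise ValueError("expected a word form starting with 'olma'")
    tail = word[len("olma"):]
    if tail in NOMINAL_SUFFIXES:
        # Case or possessive suffix: 'olma' is the noun (apple).
        return ("noun", "olma", [tail])
    if tail in VERBAL_SUFFIXES:
        # Tense, mood or gerund suffix: -ma is the negative suffix on 'ol'.
        return ("verb", "ol", ["ma", tail])
    return ("ambiguous", "olma", [tail] if tail else [])


print(read_olma("olmani"))
print(read_olma("olmadi"))
print(read_olma("olma"))
```

The bare form olma stays ambiguous, which matches the remark that such cases need either semantic analysis or a definite suffix ordering.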

2. FSM for parsing text.

As is well known, an FSM consists of the following parts:

· Set of states – Q (a finite set).

· Input alphabet – Σ (a finite set of symbols).

· Transition function – δ (the function that takes the machine from one state to another).

· Initial state – q0 ∈ Q.

· Set of final states – F (a subset of Q).

Let us analyse this finite automaton using the example word “ishlamaganlarni”:

2.1. Set of states: the set of word forms obtained by removing the suffixes one at a time.

Q = {ishlamaganlarni, ishlamaganlar, ishlamagan, ishlama, ishla, ish}

Following the suffix order given above, the states Qi used to identify the root of an Uzbek word have the following structure:

a) Input word form – the given word, “ishlamaganlarni”;

b) Without word-modifying suffixes – “ishlamaganlar”;

c) Without form-forming suffixes – “ishla”;

d) Without derivational suffixes – “ish”;

e) The root is the unchangeable part of the word.

2.2. Input alphabet: the symbols that take the word from one form to another. In Uzbek this role is played by suffixes.

Σ = {ni, lar, gan, ma, la}, with elements denoted Wi

2.3. Transition function: takes the word from one form to another by means of a given suffix.

For q0 = ish, q1 = ishla and the input suffix -la:

q1 = δ(q0, la) = ishla

Words are built from left to right (root + suffix1 + suffix2 + …). For the analysis to run from right to left, δ must be applied in reverse order.

Hence, for Q0 = ishla and Q1 = ish:

Q1 = δ(Q0, la) = ish
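The two directions of δ can be sketched as a pair of string operations. The function names `delta_forward` and `delta_reverse` are hypothetical and chosen only for this illustration. States are represented directly by the word forms, as in the set Q above.

```python
def delta_forward(state: str, suffix: str) -> str:
    """Generation direction: append a suffix to the current word form."""
    return state + suffix


def delta_reverse(state: str, suffix: str) -> str:
    """Analysis direction: remove a suffix from the end of the word form."""
    if not suffix or not state.endswith(suffix) or len(state) <= len(suffix):
        raise ValueError(f"cannot strip -{suffix} from {state!r}")
    return state[: -len(suffix)]


# Left to right: ish + la
print(delta_forward("ish", "la"))

# Right to left: apply the reverse transition for each suffix in turn.
form = "ishlamaganlarni"
for suffix in ["ni", "lar", "gan", "ma", "la"]:
    form = delta_reverse(form, suffix)
print(form)
```

Applying the reverse transition with the suffixes ni, lar, gan, ma, la in that order walks through the states of Q and ends at the root ish.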

2.4. Set of final states: consists of the root of the word and contains only one element in this procedure. The root must contain at least one vowel.

As mentioned above, an Uzbek word has the following structure:

Root + derivational suffix + form-forming suffix + word-modifying suffix

The analysis starts from the end of the word, and each suffix is removed before the suffixes that precede it. In other words, the word is analysed from right to left.

Pic 1. The FSM operating from right to left.

0 – word (initial form)

1 – the state without word-modifying suffixes

2 – the state without form-forming suffixes

3 – the state without derivational suffixes

A – root
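The right-to-left machine can be sketched as a staged suffix stripper that passes through states 0 to 3 and stops at the root. This is a minimal sketch under stated assumptions, not the author's implementation. The suffix lists contain only the symbols of the example alphabet Σ, the names `STAGES`, `strip_class` and `analyse` are hypothetical, and the vowel check implements the requirement that the remaining root contain at least one vowel.

```python
VOWELS = set("aeiou")

# Suffix classes in the order they are removed (right to left).
# Only the suffixes of the example alphabet are listed (assumption).
STAGES = [
    ("1", ("ni",)),               # word-modifying suffixes
    ("2", ("lar", "gan", "ma")),  # form-forming suffixes
    ("3", ("la",)),               # derivational suffixes
]


def has_vowel(text):
    return any(ch in VOWELS for ch in text)


def strip_class(form, suffixes):
    """Remove suffixes of one class from the end until none matches."""
    stripped = True
    while stripped:
        stripped = False
        for suffix in suffixes:
            rest = form[: len(form) - len(suffix)]
            # Strip only if what remains can still be a root.
            if form.endswith(suffix) and has_vowel(rest):
                form = rest
                stripped = True
                break
    return form


def analyse(word):
    """Return the word form reached in each state, from state 0 to state 3."""
    trace = [("0", word)]
    form = word
    for state, suffixes in STAGES:
        form = strip_class(form, suffixes)
        trace.append((state, form))
    return trace


for state, form in analyse("ishlamaganlarni"):
    print(state, form)
```

For ishlamaganlarni the trace reaches ishlamaganlar, ishla and ish in states 1, 2 and 3, matching items b) to d) above. Because no lexicon is used, a word such as olma would also lose its final -ma, which is the ambiguity discussed earlier.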

Conclusion.

This paper presents some design decisions for a morphological analyzer of Uzbek texts, intended for building an anti-plagiarism program. The scientific novelties of this work are:

· A method is proposed for analysing Uzbek words with an affix-stripping approach and without using any lexicon.

· The rule-based, agglutinative structure of the language allows Uzbek to be modeled with finite state machines (FSMs).

· The steps of this new methodology are described, including the classification of the suffixes, the generation of an FSM for each suffix class, and their unification into a main machine that coordinates the analysis.

The paper also describes how affixes are divided into groups and how the root of a word is identified with the help of an FSM.

Identifying the root of each word is the main factor in determining the occurrences of each word in a text. This is very useful for extracting keywords automatically and for checking the similarity of texts (in different languages). As a continuation of this work, an algorithm for identifying the roots of Uzbek words will be developed in Snowball.

Bibliography

1. Porter, M. F., 'Snowball: A Language for Stemming Algorithms', October 2011. http://snowball.tartarus.org/texts/introduction.html
2. Eryigit, G., Adali, E., 'An Affix Stripping Morphological Analyzer for Turkish', February 16-18, 2004, Innsbruck, Austria, 6 p.
3. Hopcroft, J. E., Motwani, R., Ullman, J. D., 'Introduction to Automata Theory, Languages, and Computation' (3rd Edition), 2006, 750 p.
4. Hamroyev, M. A., 'O‘zbek tilidan ma’ruzalar majmuasi', 2005, 153 p.
5. Alam, Y. S., 'A Two-level Morphological Analysis of Japanese', Texas Linguistic Forum, Vol. 22, pp. 229-252, 1983.
6. Koskenniemi, K., 'Two-level Morphology: A General Computational Model for Word-form Recognition and Production', Publications, Vol. 11, University of Helsinki, Department of General Linguistics, Helsinki, 1983.
7. Oflazer, K., 'Two-Level Description of Turkish Morphology', Literary and Linguistic Computing, Vol. 9, pp. 137-148, 1994.
8. Russell, G. J., Pulman, S. G., Ritchie, G. D., Black, A. W., 'A Dictionary and Morphological Analyser for English', COLING '86, pp. 277-279, 1986.
