Vector space model and the usage patterns of Indonesian denominal verbs

Published on 2019-03-18 - 01:32 by Gede Primahadi Wijaya Rajeg

This project proposes a computational, distributional semantic approach (DSM) based on a Vector Space Model (VSM) to explore semantic clusters and (dis)similarity among a set of Indonesian denominal verbs formed with three verbal morphological schemas (i.e., me-, me- -kan, and me- -i). The VSM approach captures the semantics of the verbs from their co-occurrence properties in texts (i.e., from the words co-occurring on either side of a given denominal verb in the large collection of texts in the Indonesian Leipzig Corpora). In particular, we use "word2vec" (developed by Tomas Mikolov and colleagues) to create the VSM from the Indonesian Leipzig Corpora. We are interested in how the verbs cluster given their distributional properties (e.g., whether we find verb clusters of a certain semantic type [e.g., PSYCH and MOTION verbs], or whether there is a split between verbs of the same root but different morphological schemas).

The additional layer of morphology in the analysis is relevant to the description of the suffixes -i and -kan in an Indonesian grammar textbook (Sneddon et al., 2010). One view is that there is a set of verbs (of the same root) occurring with both -i and -kan whose semantics are clearly different; our study detects those -i and -kan verbs that are split across clusters (which we plot as a dendrogram based on Hierarchical Cluster Analysis), such as membuahi vs. membuahkan, mengatai vs. mengatakan, and melangkahi vs. melangkahkan. Their semantic differences are characterised by differences in the semantic domain of their co-occurring words. Another view is that there are -i and -kan verbs of the same root that differ only in the arrangement of their arguments (rather than in the semantic domain the verbs refer to).
We found such -i and -kan verbs falling within the same cluster, reflecting their similar co-occurrence profiles (e.g., mewariskan and mewarisi cluster together; mendasar, mendasari, and mendasarkan also cluster together; a similar case is the cluster of menempati, menempatkan, and menempat).

The paper we are working on (under review for a special issue of the journal NUSA since 15 March 2019) also addresses a number of issues regarding orthographic relics in the input corpora (which are raw, unannotated texts) that affect the automatic morphological parser (MorphInd), which provides the input verbs for the VSM. We demonstrate how the distance relations captured in the VSM (i.e., using the nearest-neighbours technique) can be used to resolve orthographic anomalies in our data. Watch this space for the data and the R Notebook for the analyses in the paper.
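
To make the two techniques mentioned above more concrete (nearest neighbours and hierarchical clustering over word vectors), here is a minimal Python sketch. It is only an illustration under stated assumptions: the three-dimensional vectors are invented rather than taken from a trained word2vec model, the function names are hypothetical, and average linkage over cosine distance is one possible clustering choice, not necessarily the setting used in the study (whose analyses are in R).

```python
import math

# Toy 3-dimensional vectors, invented for illustration only (assumption).
# Real word2vec vectors trained on a corpus have far more dimensions.
TOY_VECTORS = {
    "mewarisi":   [0.90, 0.10, 0.05],
    "mewariskan": [0.85, 0.15, 0.10],
    "membuahi":   [0.10, 0.90, 0.10],
    "membuahkan": [0.10, 0.20, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, vectors, k=1):
    """Return the k words whose vectors are most similar to that of `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:k]]


def average_linkage_clusters(vectors, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Distance between clusters is the mean cosine distance (1 - similarity)
    over all cross-cluster word pairs, i.e. average linkage.
    """
    clusters = [[word] for word in vectors]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(1 - cosine_similarity(vectors[a], vectors[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the indices of the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    print(nearest_neighbours("mewarisi", TOY_VECTORS))
    print(average_linkage_clusters(TOY_VECTORS, 3))
```

With these toy vectors, the nearest neighbour of mewarisi is mewariskan, and cutting the hierarchy at three clusters keeps that pair together while membuahi and membuahkan stay apart. This mirrors, on invented data only, the kind of pattern described above.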