Creating an Online Dictionary for the Waray Language

At its most basic, an online dictionary is simply a search form that checks if a user-provided word is in a database. If the word is there, the website displays whatever information the database has about the word (word meaning, regional origin, part of speech, sample sentence usage, etc.).

But the morphology of Waray makes word searches more complicated. Unlike English, where word roots are modified primarily through suffixes (ex. buy --> buys, buying), Waray also uses prefixes and infixes (ex. palit can take the forms ginpalit, iginpalit, ipalit, ipinalit, makapalit, pagpalit, paliton, papaliton, pumalit, etc.).

This creates a challenge for a dictionary-maker: a user might type "pumalit", "pagpalit", "napalit", or another variation to find the word root "palit". The online dictionary needs to know that all of these entries refer to the same word, and it needs to know how to find the root within whatever word a user types.

Removing prefixes, infixes, suffixes

We started by defining the common affixes in Waray (see the source code below). If a word starts with a prefix such as "mag", "pag", or "na" (ex. napalit), the program strips it from the beginning of the word. If the infix "um" or "in" begins the word or directly follows the first letter (ex. pinalit), it is also removed. If a word ends with the suffix "a" or "i" (ex. palita), that final letter is removed.
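As a rough illustration of the stripping described above, the sketch below wraps the idea in a single function. The function name strip_affixes is hypothetical, and unlike the full source it removes at most one prefix. The affix lists are the ones given in the source code for this article.

```php
<?php
// Hypothetical helper: strips one prefix, one infix, and one suffix from a word.
function strip_affixes(string $word): string
{
    // Longest prefixes come first so that "nag" is tried before "na".
    $prefixes = ['igin', 'nag', 'gin', 'pag', 'mag', 'tag', 'ma', 'na', 'ka', 'pa'];
    $infixes  = ['um', 'in'];
    $suffixes = ['a', 'i'];

    // Remove the first matching prefix, if any.
    foreach ($prefixes as $prefix) {
        if (strncmp($word, $prefix, strlen($prefix)) === 0) {
            $word = substr($word, strlen($prefix));
            break;
        }
    }

    // Remove "um" or "in" at the start of the word or right after the first letter.
    if (in_array(substr($word, 0, 2), $infixes, true)) {
        $word = substr($word, 2);
    } elseif (in_array(substr($word, 1, 2), $infixes, true)) {
        $word = substr($word, 0, 1) . substr($word, 3);
    }

    // Remove a trailing "a" or "i".
    if (in_array(substr($word, -1), $suffixes, true)) {
        $word = substr($word, 0, -1);
    }

    return $word;
}

echo strip_affixes('napalit'), "\n"; // palit
echo strip_affixes('pumalit'), "\n"; // palit
echo strip_affixes('pinalit'), "\n"; // palit
```

Because "pa" is on the prefix list, a bare root such as "palit" is itself shortened to "lit". The later search step matches on substrings, so "palit" is still found.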

Removing doubled syllables

Waray also marks changes in form or tense by doubling the initial syllable (ex. palit becomes papaliton). Fortunately, all Waray syllables are either two or three letters long (consonants are one or two letters, vowels are always one letter). The program checks whether the first two syllables are the same and, if so, removes the duplicate. The source code below does this by comparing the first two letters of the word with the next two.
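The check can be extended to three-letter syllables as well. The following is only a sketch under that assumption: the function name remove_doubled_syllable is hypothetical, and trying the three-letter case first is a design choice of this example, not something taken from the dictionary's source.

```php
<?php
// Hypothetical helper: drops a repeated opening syllable of two or three letters.
function remove_doubled_syllable(string $word): string
{
    foreach ([3, 2] as $length) {
        // Skip words too short to hold two syllables of this length.
        if (strlen($word) < 2 * $length) {
            continue;
        }
        $first  = substr($word, 0, $length);
        $second = substr($word, $length, $length);
        if ($first === $second) {
            // Keep a single copy of the syllable.
            return substr($word, $length);
        }
    }
    return $word;
}

echo remove_doubled_syllable('susurat'), "\n";   // surat
echo remove_doubled_syllable('papaliton'), "\n"; // paliton
echo remove_doubled_syllable('palit'), "\n";     // palit
```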

Hyphenated words

Another feature of Waray is that it has glottal stops. The typical way to represent these in writing is with hyphens (gab-i, hin-o), but hyphenation is not standardized. The dictionary therefore checks both forms: if a user searches for "gab-i", it also finds matches for "gabi", and vice versa.
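One way to get this behavior is to compare words with their hyphens removed. The sketch below assumes an in-memory word list rather than the real database, and the function names hyphen_key and find_ignoring_hyphens are hypothetical.

```php
<?php
// Hypothetical helper: reduces a word to a lowercase, hyphen-free key.
function hyphen_key(string $word): string
{
    return str_replace('-', '', strtolower($word));
}

// Returns every word in $words that equals the search once hyphens are ignored.
function find_ignoring_hyphens(string $search, array $words): array
{
    $key = hyphen_key($search);
    $matches = array_filter($words, function (string $word) use ($key): bool {
        return hyphen_key($word) === $key;
    });
    return array_values($matches); // reindex from 0
}

$words = ['gab-i', 'hin-o', 'surat'];
print_r(find_ignoring_hyphens('gabi', $words));  // contains "gab-i"
print_r(find_ignoring_hyphens('hin-o', $words)); // contains "hin-o"
```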

Nonstandard Spelling

Waray is an oral language, and very few texts exist. As a result, many words have no standard spelling. The program uses a "similar text" algorithm, like a spell checker, to find words that might be spelled differently from what the user searched for.
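PHP ships a built-in similar_text() function that reports how alike two strings are as a percentage. Whether the dictionary uses this exact function is an assumption. The sketch below shows one way such a check could work: the function name close_matches, the 75 percent cutoff, and the misspelling "sorat" are all made up for illustration.

```php
<?php
// Hypothetical helper: returns the words whose spelling is close to the search.
// The 75 percent cutoff is an illustrative assumption, not a value from the article.
function close_matches(string $search, array $words, float $minPercent = 75.0): array
{
    $matches = [];
    foreach ($words as $word) {
        // similar_text() writes the similarity, as a percentage, into $percent.
        similar_text($search, $word, $percent);
        if ($percent >= $minPercent) {
            $matches[] = $word;
        }
    }
    return $matches;
}

// "sorat" is an invented misspelling of "surat".
print_r(close_matches('sorat', ['palit', 'surat', 'gab-i'])); // contains "surat"
```

A lower cutoff returns more candidates at the cost of more false matches.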

Putting it all together

Here is a step-by-step example of what the online dictionary does:
1. A user types in the word "nagsusurat"
2. If the database has an exact match, all information about that word is shown, along with a link to all sentences that contain the word.
3. The program finds the prefix "nag" and truncates the word to "susurat".
4. The program finds the doubled syllable "su" and truncates to "surat".
5. The program searches the database for any words that contain "surat".
6. The program adds "in" and "um" as infixes ("sinurat" & "sumurat") and searches for any words containing the modified root.
7. Output: the program finds ginsurat, magsurat, magsusurat, masurat, nagsurat, nagsusurat, nakasurat, pagsurat, pagsusurat, sinurat, sinusurat, surat, and tagsurat.
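Steps 5 and 6 can be sketched as a single search over a word list. This example assumes the root has already been extracted and uses a small in-memory array in place of the database. The function name search_root is hypothetical, and each word is listed once even if it matches more than one form.

```php
<?php
// Hypothetical helper: finds every word containing the root or an infixed form of it.
function search_root(string $root, array $words): array
{
    $first = substr($root, 0, 1);
    $rest  = substr($root, 1);
    // The bare root plus its "um" and "in" forms, e.g. surat, sumurat, sinurat.
    $forms = [$root, $first . 'um' . $rest, $first . 'in' . $rest];

    $found = [];
    foreach ($words as $word) {
        foreach ($forms as $form) {
            if (strpos($word, $form) !== false) {
                $found[] = $word;
                break; // list each word once
            }
        }
    }
    return $found;
}

$words = ['ginsurat', 'nagsusurat', 'sinurat', 'sinusurat', 'surat', 'palit', 'gab-i'];
print_r(search_root('surat', $words)); // every word except "palit" and "gab-i"
```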

Source code for finding Waray word roots

<?php
    $root = $search;
// DEFINE THE COMMON PREFIXES, SUFFIXES, INFIXES
    $prefixfour = array('igin');
    $prefixthree = array('nag','gin','pag','mag','tag');
    $prefixtwo = array('ma','na','ka','pa');
    $infix = array('um','in');
    $suffix = array('a','i');
// CHECK FOR PREFIXES; IF FOUND, REMOVE THEM FROM THE WORD
    $firstfour = substr($root,0,4);
    if (in_array($firstfour,$prefixfour))
        { $root = substr($root,4); }
    $firstthree = substr($root,0,3);
    if (in_array($firstthree,$prefixthree))
        { $root = substr($root,3); }
    else {
        $firsttwo = substr($root,0,2);
        if (in_array($firsttwo,$prefixtwo))
            { $root = substr($root,2); }
    }
// CHECKS FOR INFIXES; IF FOUND, REMOVE THEM FROM THE WORD
    $infixb = substr($root,1,2);
    $infixa = substr($root,0,2);
    if (in_array($infixa,$infix))
        { $root = substr($root,2); }
    if (in_array($infixb,$infix)) {
        $start = substr($root,0,1);
        $end = substr($root,3);
        $root = $start.$end;
    }
// CHECK FOR SUFFIXES; IF FOUND, REMOVE THEM FROM THE WORD
    $suffixtest = substr($root,-1);
    if (in_array($suffixtest,$suffix))
        { $root = substr($root,0,-1); }
// CHECK FOR TENSE: IF THERE ARE DOUBLED SYLLABLES, (ex. nagTI-TI-kang), REMOVE THE FIRST
    $first = substr($root,0,2);
    $second = substr($root,2,2);
    if ($first == $second) { $root = substr($root,2); }
// SEARCH THE DICTIONARY FOR ANY WORDS THAT CONTAIN THE ROOT
// (mysql_query() and related functions require PHP 5; they were removed in PHP 7.0)
    $sql = "SELECT word FROM frequency";
    $result = mysql_query($sql)
        or die(mysql_error());
    $list = array();
    while ($row = mysql_fetch_array($result))
        { $list[] = $row['word']; }
    foreach ($list as $needle)
    {
        $pos = strpos($needle,$root);
        // The original page wrapped each match in a link; that markup is not preserved here.
        if ($pos !== false)
            { echo $needle . " "; }

        // INSERT INFIXES, (EX. palit BECOMES pumalit & pinalit) AND SEARCH FOR MATCHES
        $firstletter = substr($root,0,1);
        $therest = substr($root,1);
        $um = "um";
        $in = "in";
        $modroot = $firstletter.$um.$therest;
        $pos = strpos($needle,$modroot);
        if ($pos !== false)
            { echo $needle . " "; }
        $modroot = $firstletter.$in.$therest;
        $pos = strpos($needle,$modroot);
        if ($pos !== false)
            { echo $needle . " "; }
    }
    echo "\n\n";
?>

Copyright 2012, by Mark Fullmer & Panrehiyong Sentro sa Wikang Filipino-R8, Leyte Normal University