Computational dharma, really?#
The topic of computational dharma can be divided into two areas:
1. Using data science and NLP methods to work with Tibetan text, transforming encodings and representations.
2. The nascent field of neural networks and artificial intelligence, especially the rapidly growing areas of large language models and machine translation.
Currently, this notebook focuses on 1., data science with Python; continue below for a first overview of how to use Python to work with Tibetan texts.
Information on the current state of artificial intelligence methods (e.g. OCR, machine translation) is given in the chapter Tibetan AI.
A digital toolset to work with multiple languages#
This article gives a quick overview of what’s possible with today’s natural language processing tools. All examples use the Python programming language as an additional tool for working with Tibetan texts.
There are two main ways to use this notebook:
either read it as a normal static website and get an overview of what current tools can do,
or click the rocket icon 🚀 on the right of the title bar and select one of the computational modes:
Binder: the notebook is executed on the Binder service; you can change the code and update the results.
Colab: opens this notebook in a (free) Google Colab instance, allowing you to modify and execute the examples without installing any software locally. Unlike Binder, Google Colab also offers access to free AI-accelerator hardware (‘tensor cores’), which will be used in later language projects.
Live-code: a preview version that allows changing the examples and updating the results right within this web page.
With any of the three computational options, press
<Shift><Enter>
to execute the current cell.
This is just a first glimpse; more details and explanations will follow in later articles.
Conversion between transliteration systems#
Conversion between Wylie and Unicode Tibetan#
Let’s start with some conversions between Wylie and Unicode Tibetan, using Esukhia’s pyewts, a port of earlier work by Roger Espel.
%%capture
!pip install pyewts
import pyewts
converter=pyewts.pyewts()
print(converter.toUnicode("oM AHhUM:"))
print(converter.toWylie("སེམས་ཉིད་"))
ཨོཾ་ཨཱཿཧཱུཾ༔
sems nyid
Sanskrit transliteration#
The indic_transliteration project provides libraries to convert between many different Sanskrit (and other) transliteration systems. Let’s look at an example that converts an IAST-encoded word (‘Kālacakra’) into Devanagari:
%%capture
!pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import SchemeMap, SCHEMES, transliterate
transliterate("Kālacakra", sanscript.IAST, sanscript.DEVANAGARI)
'कालचक्र'
Having now explored libraries that convert Tibetan and Sanskrit transliterations, let’s look at an example that combines both.
Transforming glossaries#
Let’s say we found an old glossary, consisting of IAST-encoded Sanskrit vocabulary translated into Wylie-encoded Tibetan, and we want to add Unicode Tibetan and Devanagari to this glossary:
sample_glossary="""Abhiṣeka - dbang bskur
Akaniṣṭha - og min
Avalokiteśvara - spyan ras gzigs
Bhagavān - bcom ldan 'das
Ḍākinī - mkha' 'gro
Jñānasattva - ye shes sems dpa'
Mahāmudrā - phyag rgya chen po
Oḍḍiyāna - o rgyan
Samantabhadrī - kun tu bzang mo
Tathāgata - de bzhin gshegs pa
Yoginī - rnal 'byor ma"""
The first step is to parse this text into a Pandas dataframe. Such a dataframe is similar to a spreadsheet table and can be transformed easily. The result can be saved as a CSV file or in many other formats.
import pandas as pd
# get all the IAST words from the glossary
iast = [x.split('-')[0].strip() for x in sample_glossary.split("\n")]
# get all the Wylie words from the glossary
wylie = [x.split('-')[1].strip()+' ' for x in sample_glossary.split("\n")]
# Combine the two lists (IAST and Wylie words) into a 'dataframe':
df = pd.DataFrame.from_dict({'Wylie': wylie, 'IAST': iast})
# Show the current table, nothing new, just a nice table that can be easily transformed:
df
| | Wylie | IAST |
|---|---|---|
| 0 | dbang bskur | Abhiṣeka |
| 1 | og min | Akaniṣṭha |
| 2 | spyan ras gzigs | Avalokiteśvara |
| 3 | bcom ldan 'das | Bhagavān |
| 4 | mkha' 'gro | Ḍākinī |
| 5 | ye shes sems dpa' | Jñānasattva |
| 6 | phyag rgya chen po | Mahāmudrā |
| 7 | o rgyan | Oḍḍiyāna |
| 8 | kun tu bzang mo | Samantabhadrī |
| 9 | de bzhin gshegs pa | Tathāgata |
| 10 | rnal 'byor ma | Yoginī |
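As mentioned above, such a dataframe can be written out in many formats. Here is a minimal sketch of a CSV round trip, assuming pandas is installed; the filename `glossary.csv` is arbitrary, and only two sample rows are used:

```python
import pandas as pd

# Build a miniature two-row version of the glossary dataframe.
df = pd.DataFrame({"Wylie": ["dbang bskur", "mkha' 'gro"],
                   "IAST": ["Abhiṣeka", "Ḍākinī"]})

# Write the table to a CSV file; pandas writes UTF-8 by default,
# so the diacritics survive the round trip.
df.to_csv("glossary.csv", index=False)

# Reading the file back yields an equivalent table.
df2 = pd.read_csv("glossary.csv")
print(df2.equals(df))  # True
```

Pandas offers similar writers such as `to_json` or `to_markdown` (the latter requires the `tabulate` package).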
We now have a table of IAST-transliterated Sanskrit and Wylie-encoded Tibetan.
Now we can use the converters shown above to add Unicode Tibetan and Devanagari for each entry:
# Add a new column 'Tibetan' that will contain the Unicode Tibetan equivalent of the Wylie word:
df['Tibetan']=[converter.toUnicode(x) for x in df['Wylie']]
# Add a new column 'Devanagari' that will contain the Unicode Devanagari equivalent of the IAST word:
df['Devanagari']=[transliterate(x, sanscript.IAST, sanscript.DEVANAGARI) for x in df['IAST']]
# Show the expanded table that now has the Tibetan and Devanagari columns:
df
| | Wylie | IAST | Tibetan | Devanagari |
|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | डाकिनी |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत |
| 10 | rnal 'byor ma | Yoginī | རྣལ་འབྱོར་མ་ | योगिनी |
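With the glossary in a dataframe, lookups become one-liners via boolean indexing. A small self-contained sketch, rebuilding just two rows of the table above:

```python
import pandas as pd

# Two sample rows of the glossary built above.
df = pd.DataFrame({
    "Wylie": ["dbang bskur ", "mkha' 'gro "],
    "IAST": ["Abhiṣeka", "Ḍākinī"],
    "Tibetan": ["དབང་བསྐུར་", "མཁའ་འགྲོ་"],
})

# Select the row whose IAST headword matches, and read off the Tibetan:
row = df[df["IAST"] == "Ḍākinī"]
print(row["Tibetan"].iloc[0])  # མཁའ་འགྲོ་
```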
The table now has the Unicode renderings for both Tibetan and Devanagari.
Let’s use Wikipedia to explain those terms: we will take the IAST-encoded name (e.g. Akaniṣṭha) and do a Wikipedia lookup. If an entry is available, we extract the summary (usually the first paragraph of the article) as text and add it to our table:
# First we write a little helper that pulls data from a Wikipedia article.
from urllib.request import urlopen
from urllib.parse import quote
import json
# We want to get the 'summary' part for a given search-word from wikipedia as text. We can add this
# summary to each entry in our dataframe as an explanation of the word.
def wikipedia_summary(query, verbose=False):
    '''Get the summary of the Wikipedia article with title `query` and return it as text.'''
    api = "https://en.wikipedia.org/w/api.php"
    esc_query = quote(query)
    query_url = f"{api}?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={esc_query}"
    result = []
    with urlopen(query_url) as f:
        resp = json.load(f)
    if verbose is True:
        print(resp)
    for entry_name in resp["query"]["pages"]:
        entry = resp["query"]["pages"][entry_name]
        if "title" in entry.keys() and 'extract' in entry.keys():
            result.append((entry['title'], entry['extract']))
    if len(result) == 1:
        ans = result[0][1]
        # Skip disambiguation pages, whose extracts read "... may refer to: ..."
        if 'refer' in ans and 'to:' in ans:
            return ""
        else:
            return ans
    else:
        return ""
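As an aside, the `quote` call in the helper above is what makes IAST titles safe to embed in a URL: it percent-encodes the UTF-8 bytes of every non-ASCII character. A minimal standalone sketch (the URL is only constructed here, not fetched):

```python
from urllib.parse import quote

# Percent-encode an IAST headword for use in a URL query string.
title = "Ḍākinī"
escaped = quote(title)
print(escaped)  # each non-ASCII character becomes its percent-encoded UTF-8 bytes

# Embed the escaped title in a Wikipedia API query URL:
api = "https://en.wikipedia.org/w/api.php"
query_url = f"{api}?format=json&action=query&titles={escaped}"
print(query_url)
```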
# Do a Wikipedia lookup for each IAST word in the dataframe, and add the result to a new column 'Definition':
df['Definition']=[wikipedia_summary(x) for x in df['IAST']]
# Show the expanded dataframe with the new column (abbreviated in display):
df
| | Wylie | IAST | Tibetan | Devanagari | Definition |
|---|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक | Abhisheka (Sanskrit: अभिषेक, romanized: Abhiṣe... |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ | In classical Buddhist Cosmology, Akaniṣṭha (Pa... |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर | In Buddhism, Avalokiteśvara (meaning "God look... |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् | The word Bhagavan (Sanskrit: भगवान्, romanized... |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | डाकिनी | A ḍākinī (Sanskrit: डाकिनी; Tibetan: མཁའ་འགྲོ་... |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व | |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा | Mahāmudrā (Sanskrit: महामुद्रा, Tibetan: ཕྱག་ཆ... |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान | Udiana (also: Uḍḍiyāna, Uḍḍāyāna, Udyāna or 'O... |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री | Samantabhadri (Sanskrit; Devanagari: समन्तभद्र... |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत | Tathāgata (Sanskrit: [tɐˈtʰaːɡɐtɐ]) is a Pali ... |
| 10 | rnal 'byor ma | Yoginī | རྣལ་འབྱོར་མ་ | योगिनी | A yogini (Sanskrit: योगिनी, IAST: yoginī) is a... |
Now our table also contains an explanation for most of the entries. Let’s look at row 1 (the entry for Akaniṣṭha):
# The table is too narrow to show the full definitions, so let's look at a single sample:
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)
display(HTML(df.loc[1,'Definition']))
Next steps#
In an upcoming article we will look at:
how to learn more about supporting technologies such as Python and Jupyter notebooks,
how to install the software on a local machine,
how to use cloud services like Google Colab or Binder to generate useful output,
how to generate dictionaries in different formats,
how to generate markdown documents for different applications.
After next steps#
Once the foundations are laid for basic NLP (natural language processing) tasks, we’ll look at deep learning and add TensorFlow to the mix to apply AI methods to Tibetan corpora.