Computational dharma, really?#
The topic of computational dharma can be divided into two areas:
1. Using data science and NLP methods to work with Tibetan text, transforming encodings and representations.
2. The nascent field of neural networks and artificial intelligence, especially the rapidly growing areas of large language models and machine translation.
Currently, this notebook focuses on 1., data science with Python; continue below for a first overview of how to use Python to work with Tibetan texts.
Information on the current state of artificial intelligence methods (e.g. OCR, machine translation) is given in the chapter Tibetan AI.
A digital toolset to work with multiple languages#
This article gives a quick overview of what’s possible with today’s natural language processing tools. All examples use the Python programming language as an additional tool for working with Tibetan texts.
There are two main ways to use this notebook:
either read it as a normal static website and get an overview of what current tools can do,
or click the rocket icon 🚀 on the right of the title bar and select one of the computational modes:
Binder: the notebook is executed on the Binder service; you can change the code and update the results.
Colab: opens this notebook in a (free) Google Colab instance, allowing you to modify and execute the examples without installing any software locally. Unlike Binder, Google Colab also offers access to free AI-accelerator hardware (‘tensor cores’), which will be used in later language projects.
Live-code: a preview version that allows changing the examples and updating the results right within this web page.
With any of the three computational options, press
<Shift><Enter>
to execute the current cell.
This is just a first glimpse; more details and explanations will follow in later articles.
Conversion between transliteration systems#
Conversion between Wylie and Unicode Tibetan#
Let’s start with some conversions between Wylie and Unicode Tibetan, using Esukhia’s pyewts, a port of earlier work by Roger Espel.
%%capture
!pip install pyewts
import pyewts
converter=pyewts.pyewts()
print(converter.toUnicode("oM AHhUM:"))
print(converter.toWylie("སེམས་ཉིད་"))
ཨོཾ་ཨཱཿཧཱུཾ༔
sems nyid
Sanskrit transliteration#
The indic_transliteration project provides libraries to convert between many different Sanskrit (and other) transliteration systems. Let’s look at an example that converts an IAST-encoded word (‘Kālacakra’) into Devanagari:
%%capture
!pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import SchemeMap, SCHEMES, transliterate
transliterate("Kālacakra", sanscript.IAST, sanscript.DEVANAGARI)
'कालचक्र'
Having now explored libraries that convert Tibetan and Sanskrit transliterations, let’s look at an example that combines both.
Transforming glossaries#
Let’s say we found an old glossary, consisting of IAST-encoded Sanskrit vocabulary translated into Wylie-encoded Tibetan, and we want to add Unicode Tibetan and Devanagari to this glossary:
sample_glossary="""Abhiṣeka - dbang bskur
Akaniṣṭha - og min
Avalokiteśvara - spyan ras gzigs
Bhagavān - bcom ldan 'das
Ḍākinī - mkha' 'gro
Jñānasattva - ye shes sems dpa'
Mahāmudrā - phyag rgya chen po
Oḍḍiyāna - o rgyan
Samantabhadrī - kun tu bzang mo
Tathāgata - de bzhin gshegs pa
Yoginī - rnal 'byor ma"""
The first step is to parse this text into a Pandas dataframe. Such a dataframe is similar to a spreadsheet table and can be transformed easily. The result can be saved as a CSV file or in many other formats.
import pandas as pd
# get all the IAST words from the glossary
iast = [x.split('-')[0].strip() for x in sample_glossary.split("\n")]
# get all the Wylie words from the glossary
wylie = [x.split('-')[1].strip()+' ' for x in sample_glossary.split("\n")]
# Combine the two lists (IAST and Wylie words) into a 'dataframe':
df = pd.DataFrame.from_dict({'Wylie': wylie, 'IAST': iast})
# Show the current table, nothing new, just a nice table that can be easily transformed:
df
| | Wylie | IAST |
|---|---|---|
| 0 | dbang bskur | Abhiṣeka |
| 1 | og min | Akaniṣṭha |
| 2 | spyan ras gzigs | Avalokiteśvara |
| 3 | bcom ldan 'das | Bhagavān |
| 4 | mkha' 'gro | Ḍākinī |
| 5 | ye shes sems dpa' | Jñānasattva |
| 6 | phyag rgya chen po | Mahāmudrā |
| 7 | o rgyan | Oḍḍiyāna |
| 8 | kun tu bzang mo | Samantabhadrī |
| 9 | de bzhin gshegs pa | Tathāgata |
| 10 | rnal 'byor ma | Yoginī |
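As mentioned above, such a dataframe can be written out in many formats. Here is a minimal sketch of a CSV round trip, assuming pandas is installed; the filename `glossary.csv` is arbitrary, and only two sample rows are used:

```python
import pandas as pd

# Build a miniature two-row version of the glossary dataframe.
df = pd.DataFrame({"Wylie": ["dbang bskur", "mkha' 'gro"],
                   "IAST": ["Abhiṣeka", "Ḍākinī"]})

# Write the table to a CSV file; pandas writes UTF-8 by default,
# so the diacritics survive the round trip.
df.to_csv("glossary.csv", index=False)

# Reading the file back yields an equivalent table.
df2 = pd.read_csv("glossary.csv")
print(df2.equals(df))  # True
```

Pandas offers similar writers such as `to_json` or `to_markdown` (the latter requires the `tabulate` package).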
We now have a table of IAST-transliterated Sanskrit and Wylie-encoded Tibetan.
Now we can use the converters shown above to add Unicode Tibetan and Devanagari for each entry:
# Add a new column 'Tibetan' that will contain the Unicode Tibetan equivalent of the Wylie word:
df['Tibetan']=[converter.toUnicode(x) for x in df['Wylie']]
# Add a new column 'Devanagari' that will contain the Unicode Devanagari equivalent of the IAST word:
df['Devanagari']=[transliterate(x, sanscript.IAST, sanscript.DEVANAGARI) for x in df['IAST']]
# Show the expanded table that now has the Tibetan and Devanagari columns:
df
| | Wylie | IAST | Tibetan | Devanagari |
|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | डाकिनी |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत |
| 10 | rnal 'byor ma | Yoginī | རྣལ་འབྱོར་མ་ | योगिनी |
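With the glossary in a dataframe, lookups become one-liners via boolean indexing. A small self-contained sketch, rebuilding just two rows of the table above:

```python
import pandas as pd

# Two sample rows of the glossary built above.
df = pd.DataFrame({
    "Wylie": ["dbang bskur ", "mkha' 'gro "],
    "IAST": ["Abhiṣeka", "Ḍākinī"],
    "Tibetan": ["དབང་བསྐུར་", "མཁའ་འགྲོ་"],
})

# Select the row whose IAST headword matches, and read off the Tibetan:
row = df[df["IAST"] == "Ḍākinī"]
print(row["Tibetan"].iloc[0])  # མཁའ་འགྲོ་
```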
The table now has the Unicode renderings for both Tibetan and Devanagari.
Let’s use Wikipedia to explain those terms: we will take the IAST-encoded name (e.g. Akaniṣṭha) and do a Wikipedia lookup. If an entry is available, we extract the summary (usually the first paragraph of the article) as text and add it to our table:
# First we write a little helper that pulls data from a Wikipedia article.
from urllib.request import urlopen
from urllib.parse import quote
import json
# We want to get the 'summary' part for a given search-word from wikipedia as text. We can add this
# summary to each entry in our dataframe as an explanation of the word.
def wikipedia_summary(query, verbose=False):
    '''Get the summary of the Wikipedia article with title `query` and return it as text.'''
    api = "https://en.wikipedia.org/w/api.php"
    esc_query = quote(query)
    query_url = f"{api}?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={esc_query}"
    result = []
    with urlopen(query_url) as f:
        resp = json.load(f)
    if verbose is True:
        print(resp)
    for entry_name in resp["query"]["pages"]:
        entry = resp["query"]["pages"][entry_name]
        if "title" in entry.keys() and 'extract' in entry.keys():
            result.append((entry['title'], entry['extract']))
    if len(result) == 1:
        ans = result[0][1]
        # Skip disambiguation pages, whose extracts read "... may refer to: ..."
        if 'refer' in ans and 'to:' in ans:
            return ""
        else:
            return ans
    else:
        return ""
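As an aside, the `quote` call in the helper above is what makes IAST titles safe to embed in a URL: it percent-encodes the UTF-8 bytes of every non-ASCII character. A minimal standalone sketch (the URL is only constructed here, not fetched):

```python
from urllib.parse import quote

# Percent-encode an IAST headword for use in a URL query string.
title = "Ḍākinī"
escaped = quote(title)
print(escaped)  # each non-ASCII character becomes its percent-encoded UTF-8 bytes

# Embed the escaped title in a Wikipedia API query URL:
api = "https://en.wikipedia.org/w/api.php"
query_url = f"{api}?format=json&action=query&titles={escaped}"
print(query_url)
```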
# Do a Wikipedia lookup for each IAST word in the dataframe, and add the result to a new column 'Definition':
df['Definition']=[wikipedia_summary(x) for x in df['IAST']]
# Show the expanded dataframe with the new column (abbreviated in display):
df
| | Wylie | IAST | Tibetan | Devanagari | Definition |
|---|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक | Abhisheka (Sanskrit: अभिषेक, romanized: Abhiṣe... |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ | In classical Buddhist Cosmology, Akaniṣṭha (Pa... |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर | In Buddhism, Avalokiteśvara (meaning "God look... |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् | The word Bhagavan (Sanskrit: भगवान्, romanized... |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | डाकिनी | A ḍākinī (Sanskrit: डाकिनी; Tibetan: མཁའ་འགྲོ་... |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व | |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा | Mahāmudrā (Sanskrit: महामुद्रा, Tibetan: ཕྱག་ཆ... |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान | Udiana (also: Uḍḍiyāna, Uḍḍāyāna, Udyāna or 'O... |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री | Samantabhadri (Sanskrit; Devanagari: समन्तभद्र... |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत | Tathāgata (Sanskrit: [tɐˈtʰaːɡɐtɐ]) is a Pali ... |
| 10 | rnal 'byor ma | Yoginī | རྣལ་འབྱོར་མ་ | योगिनी | A yogini (Sanskrit: योगिनी, IAST: yoginī) is a... |
Now our table also contains an explanation for most of the entries. Let’s look at row 1 (the entry for Akaniṣṭha):
# The table is too narrow to show the full definitions, so let's look at a single sample:
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)
display(HTML(df.loc[1,'Definition']))
Next steps#
In an upcoming article we will look at:
how to learn more about supporting technologies such as Python and Jupyter notebooks,
how to install the software on a local machine,
how to use cloud services like Google Colab or Binder to generate useful output,
how to generate dictionaries in different formats,
how to generate markdown documents for different applications.
After next steps#
Once the foundations are laid for basic NLP (natural language processing) tasks, we’ll look at deep learning and add TensorFlow to the mix to apply AI methods to Tibetan corpora.