<a href="https://colab.research.google.com/github/DigitalTibetan/DigitalTibetan/blob/main/docs/computational_dharma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computational dharma, really?

The topic of computational dharma can be divided into two areas:

1. Using data science and NLP methods to work with Tibetan text and to transform encodings and representations.
2. The nascent field of using neural networks and artificial intelligence, especially the rapidly growing fields of large language models and machine translation.

- Currently, this notebook focuses on the first area, data science with Python. Continue below for a first overview of how to use Python to work with Tibetan texts.
- Information on the current state of artificial intelligence methods (e.g. OCR, machine translation) is given in the chapter [Tibetan AI](tibetan_ai.md).

(computational_tibetan_transliteration)=
## A digital toolset to work with multiple languages

This article gives a quick overview of what is possible with today's natural language processing tools. All examples use the Python programming language as an additional tool for working with Tibetan texts.

There are two main ways to use this notebook:

- either read it as a normal static website to get an overview of what current tools can do,
- or click the rocket icon 🚀 at the top right of the title bar and select one of the computational modes:
  - Binder: this notebook will be executed on the Binder service. You can change the code and update the results.
  - Colab: opens this notebook in a (free) Google Colab instance, where you can modify and execute the examples without installing any software locally. Beyond what Binder offers, Google Colab gives access to free AI-accelerator hardware ('tensor cores'), which will be used in later language projects.
  - Live-code: a preview version that allows changing the examples and updating the results right within this web page.

> With any of the three computational options, press `<Shift><Enter>` to execute the current cell of the notebook.

This is just a first glimpse; more details and explanations will be added in later articles.

## Conversion between transliteration systems

### Conversion between Wylie and Unicode Tibetan

Let's start with some conversions between Wylie and Unicode Tibetan, using Esukhia's [pyewts](https://github.com/OpenPecha-dev/pyewts), a port of earlier work by [Roger Espel](https://github.com/rogerespel/ewts-js).

In [2]:
%%capture
!pip install pyewts

In [3]:
import pyewts

In [4]:
converter = pyewts.pyewts()
print(converter.toUnicode("oM AHhUM:"))
print(converter.toWylie("སེམས་ཉིད་"))

ཨོཾ་ཨཱཿཧཱུཾ༔
sems nyid 
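
To see what such a conversion produces, it can help to look at the individual code points of the Unicode result. The following is a minimal sketch that uses only Python's standard `unicodedata` module; the variable names are chosen for illustration, and the string is the Tibetan for `sems nyid` from the example above, written with escape sequences so that each code point is visible.

```python
import unicodedata

# Unicode Tibetan for Wylie 'sems nyid ' (the example above), one escape per code point
tibetan = "\u0f66\u0f7a\u0f58\u0f66\u0f0b\u0f49\u0f72\u0f51\u0f0b"
print(tibetan)

# List every code point together with its official Unicode character name
for char in tibetan:
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")

# Syllables are separated by the tsheg mark (U+0F0B); drop the empty tail
TSHEG = "\u0f0b"
syllables = [s for s in tibetan.split(TSHEG) if s]
print(len(syllables), "syllables")
```

Vowel signs are separate code points that follow their consonant, which is why the four Wylie letters of `sems` map to four code points even though they render as a single stacked unit.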


(computational_sanskrit_transliteration)=
### Sanskrit transliteration

The [indic_transliteration project](https://github.com/indic-transliteration/indic_transliteration_py) provides libraries to convert between many different Sanskrit (and other) transliteration systems. Let's look at an example that converts an IAST-encoded word ('Kālacakra') into Devanagari:

In [5]:
%%capture
!pip install indic_transliteration

In [6]:
from indic_transliteration import sanscript
from indic_transliteration.sanscript import SchemeMap, SCHEMES, transliterate

In [7]:
transliterate("Kālacakra", sanscript.IAST, sanscript.DEVANAGARI)

'कालचक्र'

Having now explored libraries that convert Tibetan and Sanskrit transliterations, let's look at an example that combines both.

### Transforming glossaries

Let's say we found an old glossary, consisting of IAST-encoded Sanskrit vocabulary translated into Wylie-encoded Tibetan, and we want to add Unicode Tibetan and Devanagari to this glossary:

In [23]:
sample_glossary="""Abhiṣeka - dbang bskur
Akaniṣṭha - og min
Avalokiteśvara - spyan ras gzigs
Bhagavān - bcom ldan 'das
Ḍākinī - mkha' 'gro
Jñānasattva - ye shes sems dpa'
Mahāmudrā - phyag rgya chen po
Oḍḍiyāna - o rgyan
Samantabhadrī - kun tu bzang mo
Tathāgata - de bzhin gshegs pa
Yoginī - rnal 'byor ma"""

The first step is to parse this piece of text into a pandas DataFrame. Such a DataFrame is similar to a spreadsheet table and can be easily transformed.
The result can be saved as a CSV file or in [many other formats](https://pandas.pydata.org/docs/user_guide/io.html).

In [22]:
import pandas as pd

In [24]:
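# Split each glossary line at the first ' - ' separator into an IAST part and a Wylie part
entries = [line.split(' - ', 1) for line in sample_glossary.split("\n")]
# get all the IAST words from the glossary
iast = [entry[0].strip() for entry in entries]
# get all the Wylie words from the glossary; the trailing space is converted into a closing tsheg later
wylie = [entry[1].strip() + ' ' for entry in entries]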

In [27]:
# Combine the two lists (IAST and Wylie words) into a 'dataframe':
df = pd.DataFrame.from_dict({'Wylie': wylie, 'IAST': iast})
# Show the current table, nothing new, just a nice table that can be easily transformed:
df

|   | Wylie | IAST |
|---|---|---|
| 0 | dbang bskur | Abhiṣeka |
| 1 | og min | Akaniṣṭha |
| 2 | spyan ras gzigs | Avalokiteśvara |
| 3 | bcom ldan 'das | Bhagavān |
| 4 | mkha' 'gro | Ḍākinī |
| 5 | ye shes sems dpa' | Jñānasattva |
| 6 | phyag rgya chen po | Mahāmudrā |
| 7 | o rgyan | Oḍḍiyāna |
| 8 | kun tu bzang mo | Samantabhadrī |
| 9 | de bzhin gshegs pa | Tathāgata |


We now have a table that consists of IAST transliterated Sanskrit and Wylie encoded Tibetan.
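
As mentioned above, such a table can be saved as a CSV file. Inside this notebook, pandas' `df.to_csv(...)` would be the natural choice; the sketch below instead uses only the standard library, so it runs without pandas. The two sample rows and the helper name `glossary_to_csv` are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def glossary_to_csv(rows, header=("Wylie", "IAST")):
    """Serialize (Wylie, IAST) pairs to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

# Two sample entries taken from the glossary above
rows = [("dbang bskur", "Abhiṣeka"), ("mkha' 'gro", "Ḍākinī")]
csv_text = glossary_to_csv(rows)
print(csv_text)

# Reading the text back yields the same rows, so diacritics and apostrophes survive
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[1:] == [list(row) for row in rows])
```

Writing to an in-memory `io.StringIO` buffer keeps the example free of side effects; replacing it with `open("glossary.csv", "w", encoding="utf-8", newline="")` would write an actual file.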

Now we can use the converters shown above to add Unicode Tibetan and Devanagari for each entry:

In [29]:
# Add a new column 'Tibetan' that will contain the Unicode Tibetan equivalent of the Wylie word:
df['Tibetan']=[converter.toUnicode(x) for x in df['Wylie']]
# Add a new column 'Devanagari' that will contain the Unicode Devanagari equivalent of the IAST word:
df['Devanagari']=[transliterate(x, sanscript.IAST, sanscript.DEVANAGARI) for x in df['IAST']]
# Show the expanded table that now has the Tibetan and Devanagari columns:
df

|   | Wylie | IAST | Tibetan | Devanagari |
|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | Ḍआकिनी |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत |


The table now has the Unicode renderings for both Tibetan and Devanagari.
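
In the table above, the Devanagari entry for row 4 still begins with a Latin `Ḍ`: the capital letter with a diacritic was apparently not converted. One possible workaround, sketched here with the standard library only, is to normalize and lowercase the IAST string before handing it to `transliterate`. The helper name `normalize_iast` is hypothetical, and the assumption that lowercase, NFC-normalized input is what the converter handles best has not been verified against the library.

```python
import unicodedata

def normalize_iast(term):
    """Return `term` in NFC form and lowercase (hypothetical helper)."""
    # NFC merges a base letter and a combining diacritic into one code point,
    # e.g. 'a' + U+0304 (combining macron) becomes the single letter U+0101
    composed = unicodedata.normalize("NFC", term)
    # Devanagari has no letter case, so lowercasing loses nothing for conversion
    return composed.lower()

print(normalize_iast("Ḍākinī"))

# The same word typed with combining marks normalizes to the identical string
decomposed = "D\u0323a\u0304kini\u0304"
print(normalize_iast(decomposed) == normalize_iast("Ḍākinī"))
```

In the notebook, the Devanagari column could then be built from `normalize_iast(x)` instead of `x`, while the IAST column keeps its original capitalization for display.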

Let's use Wikipedia to explain those terms: we take the IAST-encoded name (e.g. `Akaniṣṭha`) and do a Wikipedia lookup. If an entry is available, we extract the summary (usually the first paragraph in Wikipedia) as text and add it to our table:

In [35]:
# First we write a little helper that pulls data from a Wikipedia article.
from urllib.request import urlopen
from urllib.parse import quote
import json

In [36]:
# We want to get the 'summary' part for a given search-word from wikipedia as text. We can add this
# summary to each entry in our dataframe as an explanation of the word.
def wikipedia_summary(query, verbose=False):
    '''Get the summary of a wikipedia article with title `query` and return it as text.'''
    api = "https://en.wikipedia.org/w/api.php"
    # Percent-encode the title so that non-ASCII characters are safe inside the URL
    esc_query = quote(query)
    query_url = f"{api}?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={esc_query}"
    result = []
    with urlopen(query_url) as f:
        resp = json.load(f)
        if verbose:
            print(resp)
        # Collect (title, extract) for every returned page that has both fields
        for entry in resp["query"]["pages"].values():
            if "title" in entry and "extract" in entry:
                result.append((entry['title'], entry['extract']))
    if len(result) == 1:
        ans = result[0][1]
        # Heuristic: a text like "... may refer to:" looks like a disambiguation page, so skip it
        if 'refer' in ans and 'to:' in ans:
            return ""
        return ans
    return ""
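
To make the parsing step easier to follow without a network connection, the sketch below runs the same extraction logic on hand-written dictionaries that mimic the shape of the JSON response (`query`, then `pages`, then one entry per page id with `title` and `extract`). The sample data, the page ids and the helper name `extract_summary` are made up for illustration; real responses contain more fields.

```python
from urllib.parse import quote

def extract_summary(resp):
    """Return the single page extract in `resp`, or '' (hypothetical helper)."""
    pages = resp.get("query", {}).get("pages", {})
    extracts = [page["extract"] for page in pages.values()
                if "title" in page and "extract" in page]
    if len(extracts) != 1:
        return ""
    text = extracts[0]
    # Same heuristic as above: texts like "... may refer to:" are skipped
    if "refer" in text and "to:" in text:
        return ""
    return text

# Hand-written sample responses (not real Wikipedia data)
found = {"query": {"pages": {"123": {"title": "Example", "extract": "An example summary."}}}}
missing = {"query": {"pages": {"-1": {"title": "Nowhere", "missing": ""}}}}
ambiguous = {"query": {"pages": {"456": {"title": "X", "extract": "X may refer to:"}}}}

print(extract_summary(found))
print(repr(extract_summary(missing)))
print(repr(extract_summary(ambiguous)))

# Non-ASCII titles are percent-encoded as UTF-8 bytes before they go into the URL
print(quote("Akaniṣṭha"))
```

Returning an empty string for missing or ambiguous entries keeps the `Definition` column uniform, at the cost of not distinguishing the two cases.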

In [39]:
# Do a Wikipedia lookup for each IAST word in the dataframe, and add the result to a new column 'Definition':
df['Definition']=[wikipedia_summary(x) for x in df['IAST']]
# Show the expanded dataframe with the new column (abbreviated in display):
df

|   | Wylie | IAST | Tibetan | Devanagari | Definition |
|---|---|---|---|---|---|
| 0 | dbang bskur | Abhiṣeka | དབང་བསྐུར་ | अभिषेक | Abhisheka (Sanskrit: अभिषेक, romanized: Abhiṣe... |
| 1 | og min | Akaniṣṭha | ཨོག་མིན་ | अकनिष्ठ | In classical Buddhist Cosmology, Akaniṣṭha (Pa... |
| 2 | spyan ras gzigs | Avalokiteśvara | སྤྱན་རས་གཟིགས་ | अवलोकितेश्वर | In Buddhism, Avalokiteśvara ( Sanskrit: अवलोकि... |
| 3 | bcom ldan 'das | Bhagavān | བཅོམ་ལྡན་འདས་ | भगवान् | Bhagavan (Sanskrit: भगवान्, romanized: Bhagavā... |
| 4 | mkha' 'gro | Ḍākinī | མཁའ་འགྲོ་ | Ḍआकिनी | A ḍākinī (Sanskrit: डाकिनी; Tibetan: མཁའ་འགྲོ་... |
| 5 | ye shes sems dpa' | Jñānasattva | ཡེ་ཤེས་སེམས་དཔའ་ | ज्ञानसत्त्व |  |
| 6 | phyag rgya chen po | Mahāmudrā | ཕྱག་རྒྱ་ཆེན་པོ་ | महामुद्रा | Mahāmudrā (Sanskrit: महामुद्रा, Tibetan: ཕྱག་ཆ... |
| 7 | o rgyan | Oḍḍiyāna | ཨོ་རྒྱན་ | ओड्डियान | Oḍḍiyāna (also: Uḍḍiyāna, Uḍḍāyāna or Udyāna, ... |
| 8 | kun tu bzang mo | Samantabhadrī | ཀུན་ཏུ་བཟང་མོ་ | समन्तभद्री | Samantabhadri (Sanskrit; Devanagari: समन्तभद्र... |
| 9 | de bzhin gshegs pa | Tathāgata | དེ་བཞིན་གཤེགས་པ་ | तथागत | Tathāgata (Pali: [tɐˈtʰaːɡɐtɐ]) is a Pali word... |


Now our table also contains an explanation for most of the entries. Let's look at row 1 (the entry for `Akaniṣṭha`):

In [52]:
# The table is too small to show the full definitions, so let's look at a sample:
from IPython.display import display, HTML
pd.set_option('display.max_colwidth', None)
display(HTML(df.loc[1,'Definition']))

In classical Buddhist Cosmology, Akaniṣṭha (Pali: Akaniṭṭha, meaning "Nothing Higher", "Unsurpassed") is the highest of the Pure Abodes, and thus the highest of all the form realms. It is the realm where devas like Maheśvara live.
In Mahayana Buddhism, Akaniṣṭha is also a name for the Pure Land (Buddhafield) of the Buddha Vairocana.

Tibetan Buddhism, Akaniṣṭha (Tib. 'og min) often describes three Akaniṣṭhas:
The Ultimate Akaniṣṭha, the formless state of dharmakaya, the dharmadhatu.
The Densely Arrayed Akaniṣṭha (Tib. 'Og min rgyan stug po bkod pa; Skt. Ghanavyūhakaniṣṭha), or the "Symbolic Akaniṣṭha" which is the realm of sambhogakaya. "Ghanavyūha Akaniṣṭha", refers to the pure Saṃbhogakāya Buddha field out of which emanate all Nirmāṇakāya Buddhas and Buddhafields such as Sukhāvati. It is the supreme Buddhafield in which all Buddhas attain Buddhahood. The Saṃbhogakāya Buddha Vajradhara is said to have taught the Vajrayana in the r

## Next steps

In an upcoming article we will look at:

- how to learn more about supporting technologies such as Python and Jupyter notebooks,
- how to install the software on a local machine,
- how to use cloud services like Google Colab or Binder to generate useful output,
- how to generate dictionaries in different formats,
- how to generate markdown documents for different applications.

## After next steps

- Once the foundations are laid for basic natural language processing (NLP) tasks, we'll look at deep learning and add TensorFlow to the mix to apply some AI methods to Tibetan corpora.