SPIRITS : Supporting Program for Interaction-based Initiative Team Studies  2020-2021 Interdisciplinary type project, in the priority area of humanities and social sciences


The origins of much ancient literature are veiled in mystery. Countless texts are labeled ‘date unknown, author unknown,’ leaving no clue as to when, where or by whom they were written. A better understanding of the context in which such texts were formed would likely yield a deeper understanding of the texts themselves. The subject of this research, the Vedic ritual literature of ancient India, said to have been composed between 1500 and 500 BCE, is one example.

The Vedas collectively refer to a number of classical texts created by religious devotees (the so-called Brahmins) around the faith of the Indo-Aryan peoples, who invaded India beginning in 1500 BCE. Their religion consisted of communing with many gods, chiefly gods of nature, by praising them, making offerings to them in fire, and reciting benedictions for prosperous relations between man and nature. This religion was the origin of Hinduism, the religion that has represented Indian society throughout its history. The Vedic texts have been passed down through an oral tradition, with a variety of versions unique to each of several lineages (or schools of thought). Written transmission also began once writing spread throughout India, but oral transmission has remained at the core of the tradition. The Vedic texts passed down by each school consist of praise to gods and ritual benedictions, which are the oldest of the texts, with commentaries on rituals, philosophical discussions, and ritual details added later, gradually forming a body of literature over a long period of compilation. In essence, the Vedas are multiple texts compiled by various schools that were connected with each other (the horizontal axis) and shaped by the shifting trends of the times (the vertical axis). Thus, we can position the composition and development of the Vedic texts along a vertical axis representing chronology and a horizontal axis representing the geographical changes of these schools or social groups, which were situated in the north of ancient India, as well as the changes in their interconnectivity.

It is believed that Indian society of the time shifted from tribal, nomadic groups to small city-states. The goal of this project is to explore the development of ancient Indian society through the chronological and geographical positioning of the Vedas.

Data science + Ancient Indian literature

There is an existing body of research in the field of Indian Studies (Vedic Studies) that examines the philosophy, rituals or linguistic phenomena appearing in the Vedas and considers the differences across schools of thought and transitions over time. Introducing the methodology of data science to this category of research expands the scale of what can be examined, allows more detailed analyses, and makes it possible to consider more complex changes and relationships. By investigating the complex formation process of the Vedic texts in relation to geographical and chronological features, and from multiple information science perspectives, we expect to obtain various types of analytical results.

Rather than being completed all at once, the Vedas can be taken as comprising multiple linguistic layers, with additional portions gradually added to the central text. This necessitates an analysis of the linguistic layers within each text, which in turn allows the layers to be compared across multiple texts. The cross-textual comparison will cover texts believed to be from the same era, and will also allow us to consider how texts from differing eras flowed into one another.

By developing a system of “visual analytics,” in which various features are visualized interactively, it becomes possible to present an overview of the chronological and geographical features of the ancient Vedic corpus, something that individual analyses fail to achieve. We aim to integrate Indian studies with visual analytics, the science of analytical reasoning facilitated by interactive visual interfaces. While this spatiotemporal mapping of the literature will itself be one result of this research, it will also serve as a departure point for deeper discussion of the development of ancient Indian society.

Research Method and Team

Our research currently rests on two pillars. The first uses an existing index* that records which texts include each mantra used in rituals, in order to elucidate the relationships between the texts. Since old and revered mantras, like those collated in ancient collections, were favored for use in Vedic rituals, one mantra is often cited in various texts. We believe that the citing of one mantra by multiple texts (the co-occurrence relations of a single mantra) can indicate a relationship between those texts.

* BLOOMFIELD, Maurice (1893): A Vedic Concordance. [Harvard Oriental Series 10]. Cambridge – Mass.
The expanded edition (with electronic data) was used for this research: Franceschini, Marco (2007): An updated Vedic concordance : Maurice Bloomfield’s A Vedic concordance enhanced with new material taken from seven Vedic texts. Cambridge: Dept. of Sanskrit and Indian Studies, Harvard University.
The data was organized by Naoko Oshiro of the Faculty of Culture and Information Science, Doshisha University.
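
The idea of reading shared mantras as a sign of a relationship between texts can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the mantra identifiers, the text sigla and the toy index below are hypothetical placeholders rather than entries from the concordance, and the Jaccard measure is just one possible way to normalize the counts.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy index: mantra identifier -> set of texts that cite it.
# These are placeholders, not entries taken from the concordance.
index = {
    "mantra_1": {"RV", "MS", "AVP"},
    "mantra_2": {"RV", "MS"},
    "mantra_3": {"MS", "AVP"},
    "mantra_4": {"RV"},
}

def shared_mantra_counts(index):
    """Count, for every pair of texts, how many mantras both of them cite."""
    counts = defaultdict(int)
    for texts in index.values():
        # Sorting makes the pair key independent of set iteration order.
        for a, b in combinations(sorted(texts), 2):
            counts[(a, b)] += 1
    return dict(counts)

def jaccard(index, a, b):
    """Shared mantras divided by mantras cited in either of the two texts."""
    in_a = {m for m, texts in index.items() if a in texts}
    in_b = {m for m, texts in index.items() if b in texts}
    union = in_a | in_b
    return len(in_a & in_b) / len(union) if union else 0.0

pair_counts = shared_mantra_counts(index)
print(pair_counts)
print(jaccard(index, "RV", "MS"))
```

Raw counts favor long texts that cite many mantras, which is why a normalized measure such as the Jaccard index may be a more useful starting point for comparison.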

Parallel coordinate plot indicating the correspondence between the Atharvaveda (Paippalāda) and the Maitrāyaṇī Samhitā
Scatter plot indicating the correspondence between the Rgveda and the Maitrāyaṇī Samhitā

The second pillar consists of digitizing the contents of the texts and performing text mining (an analysis of how frequently character strings appear, how they co-occur, the trends in their appearance, and their chronological order) to elucidate the linguistic layers in the texts and look for relationships between them. This research is premised on a morphological analysis of the Vedic texts. Analyzing Sanskrit with natural language processing techniques has long been considered extremely difficult; however, programs developed by Oliver Hellwig are gradually making it possible. We aim to carry out various analyses based on the morphologically analyzed data in the future, but we are currently still compiling this data. The data compiled so far is collected in the DCS (Digital Corpus of Sanskrit, http://www.sanskrit-linguistics.org/dcs/index.php).

About Digital Corpus of Sanskrit (from Home on DCS website):
The Digital Corpus of Sanskrit (DCS) is a Sandhi-split corpus of Sanskrit texts with full morphological and lexical analysis.
The DCS is designed for text-historical research in Sanskrit linguistics and philology. Users can search for lexical units (words) and their collocations in a corpus of about 4,600,000 manually tagged words in 650,000 text lines.

The DCS offers two main entry points for research:

  1. Words can be retrieved from the dictionary through a simple query or a dictionary page. For each lexical unit contained in the corpus, DCS provides the complete set of occurrences and a statistical evaluation based on historical principles.
  2. The text interface shows all contained texts along with their interlinear lexical and morphological analysis.

Large parts of the annotations are available for download at github.

The relational analysis of the texts through the co-occurrence of the mantras is being carried out by Hiroaki Natsukawa, who works with Kyoko Amano to examine what can be deciphered from the results of this analysis while providing feedback to aid in more useful visualization of the analytical findings.

Oliver Hellwig, Kyoko Amano and Makoto Fushimi are compiling the morphological analysis data of the Vedic corpus. The data is not yet complete; however, Yuki Kyogoku is running trials to see what kinds of analysis are possible with the data completed so far.

Hiroaki Natsukawa is the lead researcher in the core area of data visualization, which links and aggregates the various types of analytical results onto a temporal-geographical map. The co-occurrence relations among mantras are not limited to the one-to-one relationship shown in the previously mentioned parallel coordinate plot and scatter plot; a given mantra may appear in multiple texts. Visualizing the co-occurrence relationships among mantras not as a simple node-link graph but as a hypergraph, a generalization of ordinary graphs in which one edge can connect any number of nodes, should enable us to capture the global similarities or specific features that appear among many texts. In this project, we will seek to decipher the mystery of the formation process of these complex ancient Indian texts through visual analytics, the combination of data visualization and analytics.
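
As a rough illustration of the hypergraph view, the following Python sketch treats each mantra as one hyperedge that connects all the texts citing it. The data and names are hypothetical, and the sketch covers only the data structure, not the interactive visualization itself.

```python
from collections import defaultdict

# Hypothetical toy data: each mantra is one hyperedge over the texts citing it.
hyperedges = {
    "mantra_1": frozenset({"RV", "MS", "AVP"}),
    "mantra_2": frozenset({"RV", "MS"}),
    "mantra_3": frozenset({"RV", "MS", "AVP"}),
    "mantra_4": frozenset({"MS", "KS"}),
}

def group_by_text_set(hyperedges):
    """Group mantras by the exact set of texts in which they appear."""
    groups = defaultdict(list)
    for mantra, texts in hyperedges.items():
        groups[texts].append(mantra)
    return dict(groups)

def degree(hyperedges):
    """Number of hyperedges (mantras) that touch each text."""
    deg = defaultdict(int)
    for texts in hyperedges.values():
        for text in texts:
            deg[text] += 1
    return dict(deg)

groups = group_by_text_set(hyperedges)
text_degree = degree(hyperedges)

for texts, mantras in groups.items():
    print(sorted(texts), "->", mantras)
```

An ordinary graph would reduce `mantra_1` to three separate pairwise links and lose the fact that all three texts share the same mantra; keeping the full set of texts per mantra preserves that information.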

We are organizing the morphological analysis data of the Vedic corpus as an independent research project under a Grant-in-Aid for Challenging Research (Exploratory). Oliver Hellwig has designed the process and operational framework of the project website, which research team members will use as a shared platform to analyze data, check the analysis results, and register data in the database, as follows:

A web-based interface for the morpho-lexical, AI-powered annotation of Sanskrit texts

Motivation: In order to obtain large scale linguistic data for research in the history of the Vedic corpus, a new framework needs to be built for the collaborative annotation of these data. The SanskritTagger (Hellwig, 2009) can only be run on one machine, as different versions of its database cannot be synchronized, and therefore does not allow for collaborative annotation. The Digital Corpus of Sanskrit (Hellwig, DCS – The Digital Corpus of Sanskrit, 2010-2019) is a static snapshot of the SanskritTagger database and therefore does not allow for annotation at all. The planned web-based system consists of three main components: database, analysis model and user interface.

The database contains the corpus, the lexicon and the linguistic rules (Sandhi, inflection, irregular and verbal forms) for parsing Sanskrit texts. This MySQL database will be extracted from the Access database underlying the SanskritTagger. The SanskritTagger database uses an ASCII based encoding of Sanskrit which needs to be converted into an appropriate Unicode encoding for the MySQL database; special attention needs to be paid to diphthongs and aspirated consonants which count as one phoneme in the linguistic analysis.
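
The encoding conversion can be sketched in Python as follows. The internal ASCII scheme of the SanskritTagger is not specified here, so this sketch assumes the Harvard-Kyoto convention as a stand-in and converts it to IAST in Unicode; the function names are hypothetical. Greedy longest-match tokenization keeps diphthongs and aspirated consonants together as single units, which is a simplification that ignores rare cases where the two letters belong to separate sounds.

```python
# Assumption: Harvard-Kyoto stands in for the tagger's own ASCII encoding.
HK_TO_IAST = {
    # vowels; "ai" and "au" are diphthongs and must stay single units
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "R": "ṛ", "RR": "ṝ", "lR": "ḷ",
    "e": "e", "ai": "ai", "o": "o", "au": "au",
    "M": "ṃ", "H": "ḥ",
    # stops and nasals; aspirates such as "kh" or "bh" are single phonemes
    "k": "k", "kh": "kh", "g": "g", "gh": "gh", "G": "ṅ",
    "c": "c", "ch": "ch", "j": "j", "jh": "jh", "J": "ñ",
    "T": "ṭ", "Th": "ṭh", "D": "ḍ", "Dh": "ḍh", "N": "ṇ",
    "t": "t", "th": "th", "d": "d", "dh": "dh", "n": "n",
    "p": "p", "ph": "ph", "b": "b", "bh": "bh", "m": "m",
    # semivowels, sibilants, h
    "y": "y", "r": "r", "l": "l", "v": "v",
    "z": "ś", "S": "ṣ", "s": "s", "h": "h",
}
MAX_LEN = max(len(key) for key in HK_TO_IAST)

def phonemes(hk):
    """Split a Harvard-Kyoto string into a list of IAST phonemes."""
    out = []
    i = 0
    while i < len(hk):
        # Try the longest possible key first so that "bh" wins over "b".
        for size in range(MAX_LEN, 0, -1):
            chunk = hk[i:i + size]
            if chunk in HK_TO_IAST:
                out.append(HK_TO_IAST[chunk])
                i += len(chunk)
                break
        else:
            # Unknown characters (spaces, punctuation) pass through unchanged.
            out.append(hk[i])
            i += 1
    return out

def to_iast(hk):
    """Convert a Harvard-Kyoto string to IAST."""
    return "".join(phonemes(hk))

print(phonemes("bhUmiH"))
print(to_iast("kRSNa"))
```

Keeping the phoneme list separate from the joined string matters because the linguistic rules operate on phonemes, while the database stores the Unicode text.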

The analysis model takes a line of Sanskrit text as input, produces all possible morpho-lexical analyses of this line and orders these analyses by decreasing linguistic probability using corpus information and machine learning (ML) techniques. Due to Sandhi, the rich morphology and the large vocabulary, even short lines of Sanskrit text can have several thousand morpho-lexical readings, most of which are, however, linguistically meaningless (Hellwig & Nehrdich, 2018). The aim of the ML model is to select the most probable morpho-lexical annotation of a text line and thereby to facilitate and speed up the annotation.

The analysis model itself consists of two layers:

  1. The first deterministic layer creates all possible morpho-lexical analyses of a text line using the phonetic (Sandhi), lexical and morphological information stored in the database, and stores the result in an XML file.
  2. The second probabilistic layer reorders the analyses in the XML file using ML techniques. This step requires a certain amount of research, and it is planned to publish its results at a major NLP conference in 2022 (ACL, EMNLP, COLING). The design of this layer starts with a sequence model (Hellwig, 2016) and a graph-based model (Krishna, et al., 2018), which will be implemented in a development environment using the TensorFlow library. The performance of the developed models will be evaluated using gold data from the SanskritTagger database. The final model will combine the graph-based approach with an energy function of the edges that considers n-grams of linguistic information as well as long-range dependencies by using recurrence (on the character level; Hellwig & Nehrdich, 2018) and attention mechanisms.
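
The division of labor between the two layers can be illustrated with a deliberately simplified Python sketch. It is not the planned model: the lexicon and its frequencies are invented, Sandhi and morphology are ignored, candidates are kept in memory rather than written to XML, and a unigram log-probability stands in for the ML ranking.

```python
import math

# Hypothetical toy lexicon with made-up relative frequencies.
LEXICON = {
    "na": 0.30, "ra": 0.05, "nara": 0.20,
    "deva": 0.25, "de": 0.01, "va": 0.19,
}

def all_segmentations(line, lexicon):
    """Layer 1 (deterministic): every split of `line` into lexicon entries."""
    if not line:
        return [[]]
    results = []
    for end in range(1, len(line) + 1):
        head = line[:end]
        if head in lexicon:
            for rest in all_segmentations(line[end:], lexicon):
                results.append([head] + rest)
    return results

def score(segmentation, lexicon):
    """Sum of log frequencies; a stand-in for the ML model's score."""
    return sum(math.log(lexicon[word]) for word in segmentation)

def ranked_analyses(line, lexicon):
    """Layer 2 (probabilistic): order candidates by decreasing score."""
    candidates = all_segmentations(line, lexicon)
    return sorted(candidates, key=lambda seg: score(seg, lexicon), reverse=True)

ranking = ranked_analyses("naradeva", LEXICON)
for seg in ranking:
    print(seg, round(score(seg, LEXICON), 3))
```

Even this eight-character toy input yields four candidate readings, which gives a sense of why real lines with Sandhi and inflection can produce thousands and why the ranking step matters for annotation speed.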

The user interface (UI) provides functions for managing texts and their analyses, correcting the automatically generated analyses and editing the linguistic information contained in the MySQL database. While the database and the analysis model can (partly) rely on previous work, the UI needs to be built from scratch in PHP/Javascript/Ajax, using a model-view-controller framework. It consists of the following elements:

  1. Views for checking and correcting the analyses provided by the ML model (registered users). These views are central for building the corpus, and care needs to be taken to design them as intuitively as possible. Each user has access to the analyses of their own texts.
  2. A view for exporting linguistic information from the database (CoNLL-U format; all users).
  3. Views for user administration (admin).
  4. Views for importing texts into the corpus for further linguistic annotation (admin).
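
To give an idea of what the export view might produce, the following Python sketch serializes one annotated line in the ten-column, tab-separated CoNLL-U layout. The project itself plans a PHP/JavaScript interface, so this is only an illustration of the output format; the function name, the token dictionaries and the example annotation are hypothetical, and the syntactic columns are left empty.

```python
def to_conllu(sent_id, text, tokens):
    """Serialize one annotated line as a CoNLL-U sentence block.

    `tokens` is a list of dicts with the keys "form", "lemma", "upos"
    and an optional "feats" dict (hypothetical input structure).
    """
    lines = [f"# sent_id = {sent_id}", f"# text = {text}"]
    for i, tok in enumerate(tokens, start=1):
        feats = tok.get("feats", {})
        # Features are written as Name=Value pairs joined by "|".
        feat_str = "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        cols = [str(i), tok["form"], tok["lemma"], tok["upos"],
                "_", feat_str, "_", "_", "_", "_"]
        lines.append("\t".join(cols))
    # A blank line terminates each sentence.
    return "\n".join(lines) + "\n\n"

example = to_conllu(
    "demo-1",
    "devaḥ gacchati",
    [
        {"form": "devaḥ", "lemma": "deva", "upos": "NOUN",
         "feats": {"Case": "Nom", "Gender": "Masc", "Number": "Sing"}},
        {"form": "gacchati", "lemma": "gam", "upos": "VERB",
         "feats": {"Number": "Sing", "Person": "3", "Tense": "Pres"}},
    ],
)
print(example)
```

Unused columns are filled with an underscore, so the same files can later take syntactic annotation without changing the layout.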

Hellwig, O. (2009). SanskritTagger, a stochastic lexical and POS tagger for Sanskrit. Sanskrit Computational Linguistics. First and Second International Symposia (pp. 266-277). Berlin: Springer.

Hellwig, O. (2010-2019). DCS – The Digital Corpus of Sanskrit. From http://www.sanskrit-linguistics.org/dcs/index.php

Hellwig, O. (2016). Improving the Morphological Analysis of Classical Sanskrit. Proceedings of the 6th Workshop on South and Southeast Asian Natural Languages, (pp. 142-151). Osaka.

Hellwig, O., & Nehrdich, S. (2018). Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2754-2763). Brussels: ACL.

Krishna, A., Santra, B., Bandaru, S. P., Sahu, G., Sharma, V. D., Satuluri, P., & Goyal, P. (2018). Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit. Proceedings of the EMNLP (pp. 2550-2561). Brussels: ACL.