The Central Institute of Indian Languages previously coordinated
the development of corpora of more than 45 million words in the
Scheduled Languages under the Technology Development for Indian
Languages (TDIL) scheme of the Ministry of Communication and
Information Technology.
These corpora were created following sampling methodologies and are
therefore balanced. They are available in Indian Standard Code for
Information Interchange, or Indian Script Code for Information
Interchange (ISCII), format. The Institute also intends to enhance
these corpora to the tune of twenty million words in each language.
The Institute, in collaboration with Lancaster University, has
converted the same corpora into Unicode format. The corpora in this
format, in addition to the Lancaster University corpora, are also
available to users.
On its own, the Institute is now developing corpora in languages
recently included in the Eighth Schedule, such as Bodo, Dogri,
Maithili and Santali. All the prose texts available in these
languages are keyed in, since it is not possible to obtain texts
from all the sampling domains.
The corpora in Indian languages thus developed are maintained and
distributed free of cost to scholars by the Institute for academic
purposes. In the near future the Institute will also be able to provide
different kinds of resources in Indian languages against payment or
subscription. This activity, as now being planned, will form part of
the proposal: Linguistic Data Consortium in Indian Languages (LDCIL).