The aim of the ALLOW program is to facilitate communication across differences in language and culture. ALLOW hopes to accomplish this purpose through a number of projects and volunteer efforts with broad international participation.
The grand challenge of the ALLOW program is to support communication and understanding among diverse peoples and cultures in all languages. There are about 7,000 languages in the world, depending on how you count languages that are similar but not quite the same. Therefore, this grand challenge will require many years and the cooperative efforts of volunteers from many countries. In addition, new tools for human and machine learning will be needed to facilitate this massive knowledge acquisition. ALLOW will specifically develop tools in speech recognition, optical character recognition, computer-aided instruction, speech synthesis and machine-facilitated translation.
A major issue in meeting ALLOW's goals is that of LANGUAGE DEATH. Over half the world's languages are in danger of extinction within the next few generations.
There are over 300 languages with at least a million speakers each. For most of these languages, there are plentiful resources in the form of dictionaries, phrase books, newspapers, trade books, textbooks and websites. Meeting the ALLOW goals for them will require substantial efforts in language-specific data collection and documentation, but will build from an existing base. Producing the ALLOW technical tools for these languages will require significant development effort and native speaker expertise, but the development can mostly use the methods that have already been used successfully for about a dozen languages.
Most of the world's languages have fewer language-specific knowledge resources. Most are minority languages, a fact that is obvious from comparing the number of languages to the number of countries. Most minority languages are not taught in schools. Dictionaries, phrase books and other published literature are often available only in the majority language of the region.
Meeting the ALLOW goals in these languages will require the development of new tools and methods. It will also require a more concentrated, targeted data collection effort. It will be too expensive and time-consuming to label and annotate this data using expert human labor, as has been done in most existing language tool development. Therefore, ALLOW will dedicate research efforts to automating the data annotation and analysis, utilizing an extended technique for semi-supervised machine learning.
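The general idea of semi-supervised learning can be illustrated with a toy example. The sketch below uses self-training, which is only one common semi-supervised approach and is not necessarily the extended technique ALLOW has in mind. The function names, the single numeric feature, the labels and the confidence margin are all hypothetical choices made for illustration.

```python
def centroid(values):
    """Mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Toy self-training loop over a single numeric feature.

    labeled:   list of (feature, label) pairs annotated by a human
    unlabeled: list of features with no label
    An unlabeled item is accepted only when its nearest class centroid
    is closer than the second nearest by at least `margin`.
    Returns the enlarged labeled list and the items left unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # Re-estimate one centroid per class from everything labeled so far.
        centroids = {
            lab: centroid([x for x, l in labeled if l == lab])
            for lab in {l for _, l in labeled}
        }
        confident, remaining = [], []
        for x in pool:
            dists = sorted((abs(x - c), lab) for lab, c in centroids.items())
            if len(dists) > 1 and dists[1][0] - dists[0][0] >= margin:
                confident.append((x, dists[0][1]))  # machine-assigned label
            else:
                remaining.append(x)  # too ambiguous; leave for a human
        if not confident:
            break
        labeled.extend(confident)
        pool = remaining
    return labeled, pool


# Two hand-labeled examples and five unlabeled ones (made-up values).
seed = [(1.0, "short"), (9.0, "long")]
labeled, pool = self_train(seed, [1.5, 2.0, 8.0, 8.5, 5.0])
print(len(labeled), "labeled;", pool, "left for human review")
```

In this sketch only the ambiguous item is left for an expert, which is the kind of saving in human annotation effort that the paragraph above describes.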
Over half the world's languages are in an even more precarious state. These languages have not been taught in schools or used in commerce for several generations. Most fluent speakers are elderly. That is, in most families, although the parents may be able to understand the language, they do not speak it fluently or use it at home. The children do not hear or use the language of their heritage at all. A language in such a state is on a clear path to extinction.
For endangered languages, urgent efforts must be made to document and record the language. To save and perpetuate the language, it is critical to develop material and tools to teach the language to younger generations. For knowledge acquisition, more advanced machine learning tools must be developed that can learn the essential elements of a new language from minimal resources.
For the base languages, ALLOW will develop material and tools to enable students to learn and teach each other's native language. These educational materials will be distributed free or at low cost to students who volunteer to help collect and annotate language data.
For minority and endangered languages, the emphasis will be on teaching the language to the younger generations for whom the endangered language is the language of their heritage.
Developing the ALLOW tools in a large number of languages will require training a large number of teams with native speakers in many languages. To do this, ALLOW will design a curriculum of project-based courses in which the students will get hands-on experience developing tools for specific languages. Although the curriculum will not initially be associated with a degree program at any university, volunteer faculty will be recruited from top universities, and the curriculum content will be at least the equivalent of a professional Master's degree.
ALLOW's core research program will discover and develop new techniques of machine learning in the target fields: speech and character recognition and semi-automatic translation. Volunteers will be recruited from research laboratories around the world. Participation in the program will be a joint, cooperative learning experience. It will be aimed at giving participants experience at least the equivalent of a Ph.D. program or of a two-year post-doc.
ALLOW will need text and audio data in nearly 7,000 languages. It will depend primarily on contributions by volunteers. Initially, volunteers will be asked to record word lists and children's stories, which will be used as resources for teaching the language both to humans and to machines. The goal will be to get at least one to ten hours of recorded speech in each language. By way of comparison, Librivox.org has already collected over 15,000 hours of volunteer recordings. However, the current Librivox recordings cover a much smaller number of languages.
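The scale implied by these figures can be worked out directly. The short sketch below uses the round numbers from the text (7,000 languages, one to ten hours each, 15,000 Librivox hours) as approximations; the variable and function names are chosen only for illustration.

```python
# Round figures taken from the text; all are approximations.
NUM_LANGUAGES = 7_000          # "nearly 7,000 languages"
HOURS_PER_LANGUAGE = (1, 10)   # goal of one to ten hours per language
LIBRIVOX_HOURS = 15_000        # "over 15,000 hours" already collected


def total_hours_range(num_languages, hours_range):
    """Return the (low, high) total recording hours across all languages."""
    low, high = hours_range
    return num_languages * low, num_languages * high


low_total, high_total = total_hours_range(NUM_LANGUAGES, HOURS_PER_LANGUAGE)
print(f"Target: {low_total:,} to {high_total:,} hours")
print(f"Relative to Librivox: {low_total / LIBRIVOX_HOURS:.2f}x "
      f"to {high_total / LIBRIVOX_HOURS:.2f}x")
```

Under these assumptions the target is roughly 7,000 to 70,000 hours in total, which is between about half and about five times the volume Librivox has gathered, but spread across far more languages.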
The ALLOW program was originated and is being coordinated by the Center for Innovations in Speech and Language (CISL) at the Language Technologies Institute of Carnegie Mellon University.
Non-profit organizations involved in ALLOW include Librivox.org, the Rosetta Project of the Long Now Foundation and the Internet Archive.
Since many languages do not have a written literature, volunteers will also be sought to translate stories from other languages and to record, transcribe and translate oral histories and folklore in the target language.