Accessing information by voice

Dr. Roger Tucker

Getting up-to-date agricultural information to farmers has always been a challenge, whether because of language, literacy or distance. A new system that can read out text in any language via mobile phone could enable many more rural communities to access the information they need.

A Kenyan farmer growing bananas for the first time is excited when the bunches of fruit appear, ready to be harvested. When he discussed expanding into this new crop, the local agricultural extension officer gave him a telephone number to call. The phone line would provide all the information needed to guide him through each step of the process. As this first batch ripens, the farmer needs advice on how best to harvest and look after the bananas once cut. He decides to call the number. A mistake now could cost him dearly.

The farmer calls using his mobile phone, and a voice offers him the choice of listening in Kiswahili or English. Although neither is his first language, he is more confident in Kiswahili, so he chooses that. After a few minutes the farmer has the information he needs, and he even recognizes the voice on the line as Ken Walibora, one of Kenya’s best known TV anchormen.

He does not know that Ken Walibora has never spoken a single word over the Banana Information Line; the information was generated automatically from text thanks to a ‘text-to-speech’ (TTS) system developed by a team led by Dr Mucemi Gakuru at the University of Nairobi. The advantage of this service is that the information can be kept up to date simply by editing web pages. All Ken Walibora had to do was to spend about 45 minutes reading aloud some carefully selected sentences. This is the power of speech technology.

Local input

The Banana Information Line is a project of the Local Language Speech Technology Initiative (LLSTI), produced in partnership with the National Agriculture and Livestock Extension Programme (NALEP) of the Kenyan Ministry of Agriculture. It ran as a pilot for several months in 2006, investigating the use and possible functionality of a voice information service for Kenyan agriculture.

During the evaluation of the Banana Information Line, the project team consulted with NALEP to build up a picture of the many services a voice information portal could offer farmers in the future:

  • Separate phone lines for different crops.
  • An online database where anyone, such as district or provincial level extension workers, can update data or add new information to the system, and where farmers can add their own information or post questions.
  • An alert service for farmers, offering up-to-date commodity prices and other market information, weather reports, and urgent announcements of disease outbreaks, e.g. bird flu.
  • Local information, such as which crops are most suitable for a specific area, and contact information for agricultural extension officers.
  • Personalized information services. For a particular crop, a farmer could enter the size of his farm and the system could calculate the investment required and potential revenue, etc. It could also provide timely advice on crop management.
  • Each farmer would have a unique caller ID or PIN code so that they would not have to re-enter information.
  • The voice system would allow users to choose to receive information by SMS or email.

LLSTI first looked for a partner in East Africa to help develop a Kiswahili text-to-speech system, and contacted Dr Gakuru in 2004. Interestingly, until then, the team at the University of Nairobi knew little about speech technology. In the following months, LLSTI supplied the tools, training and expertise to enable Dr Gakuru to develop the text-to-speech system used in the Banana Information Line.

This project is typical of the work done by LLSTI. The organization began in 2003 as a global initiative led by Outside Echo, a UK not-for-profit organization which facilitates audio access to information, and works together with partners from India, South Africa, Kenya and Nigeria. LLSTI provides the support that a team with no prior knowledge of speech technology needs to produce usable, natural-sounding voices with a TTS system. The main requirements are a linguist, a software engineer, and a motivated team leader who preferably is an engineer as well. Each team is normally based in the area where the language is spoken, and is part of a university or research institute. This is in contrast to the commercial development of TTS, where a speaker of the language temporarily joins a team of experts in a company’s lab in Europe or the USA.

Why is it important to encourage local involvement in the development of speech and language technology? One reason is motivation – people feel strongly about their own language. Then there is maintenance – language technology needs ongoing development to overcome problems that may arise during its use. Relevance is also very important – academics need to be encouraged to focus on problems that are most relevant to their own communities.

This last point is an important one. Academics are assessed on the basis of their publications in international journals and conference presentations, which all too often focus on small, barely significant, contributions to mainstream research. LLSTI therefore encourages a rigorous approach to local language technology development so that academics can publish their work internationally. So far, the partner organizations have been very successful, and their work in the local languages has generated several quality publications.

LLSTI also prefers to make the software code of a new language system, or at least the initial prototype, available for anyone to use, including for commercial purposes, through a Berkeley Software Distribution (BSD) open source licence. This makes it easier for the technology to be used, and also for new researchers to pick it up and enhance it.

The Meraka Institute in South Africa, one of the early partners of LLSTI, has gone on to develop a number of new local language TTS and related applications. In Botswana, for example, the Institute has set up an English/Setswana AIDS Caregivers’ Helpline, which will be piloted in early 2008. Indeed, the Meraka Institute itself has become a centre of expertise in speech technology, and now offers training and support for researchers from as far away as Nigeria. The Meraka team has also extended the original project to encompass automatic speech recognition (ASR), with the long-term goal of perfecting automated translation. The system they are currently working on, called ‘Lwazi’, is an ambitious phone-based, speech-driven information system commissioned by the South African Department of Arts and Culture. With Lwazi, citizens will be able to access government information and services in any of South Africa's 11 official languages, using either landline or mobile phones.

Difficult languages

All of this may sound easy, but a good TTS system requires some extremely clever software. It requires very specific language knowledge, a lot of hand-annotated text and audio data, and skilled engineering judgements that are specific to the particular language. The ultimate aim is a system whose voice sounds just like a human reader, something already achieved for many European languages. However, a TTS system is good enough to be used as long as any unnaturalness does not significantly affect intelligibility. It turns out that this is easier to achieve for some languages than for others.

For instance, if the script does not include vowels, as in Arabic, how can the system know how to pronounce the word? Or if the language has free (unpredictable) stresses, as in English, how can the system know which syllable of a word should be the strongest? And if a comprehensive pronunciation dictionary is part of the solution to these problems, how can that be built and used if the morphology of the language is complex and allows a large number of variations on any given word stem, as in Russian?

To understand these problems from the very start, the LLSTI project began by conducting a survey of 105 languages to identify all the script and language features in each case that can create complications for a TTS system. The team catalogued these features in a TTS-related multilingual database that would enable them to predict the issues that specific languages would raise. The results are summarized in the TTS development complexity scores (see table).

Difficulty of developing basic and good TTS systems in various languages (0 = easy, 10 = difficult)

Language             Basic TTS   Good TTS
Pashto               9           9.5
Arabic (classical)   7           8.5
Russian              6           9
Tibetan              6           7.5
isiZulu              6           8
Ibibio               5           7
Thai                 5           8
English              4           6
Hindi                2           4
Welsh                1           4
Kiswahili            -           4
Tamil                -           2.5

The ideal language, from a TTS point of view, has a complexity score of 0. This is a language where the text-to-speech process can be defined by a straightforward set of rules that any linguist can write down from their existing knowledge of the language. In practice, such rules can never completely define the process, but there are some languages where they can produce a basic system – one in which phrasing, loan words, abbreviations and other such details may not be perfectly rendered, but the meaning is still quite intelligible.
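As a rough illustration of what such a rule set can look like, the Python sketch below turns spelling into phone symbols with a longest-match lookup. The rule table, the phone symbols and the function name `to_phones` are assumptions made for this example. They are loosely modelled on a language with regular spelling and are not the rules used in the LLSTI systems.

```python
# Illustrative letter-to-sound rules: two-letter spellings are tried first,
# then single letters. The table and phone symbols are assumptions for this
# sketch, not a validated rule set for any language.
DIGRAPHS = {"ch": "tS", "sh": "S", "ny": "J"}


def to_phones(word):
    """Convert a spelled word to a list of phone symbols by longest match."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            # Fallback assumption: the letter is pronounced as written.
            phones.append(word[i])
            i += 1
    return phones


print(to_phones("shamba"))
print(to_phones("ndizi"))
```

A real system would need many more rules, plus handling for numbers, abbreviations and loan words, which is where the gap between a basic and a good TTS system tends to open up.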

In general terms, developing a TTS system involves the following steps:

  • Defining the language characteristics: phone-set (i.e. the sounds used in the language), letter-to-sound rules, the rules of syllabification (the separation of words into syllables), etc.
  • Selecting a set of phonetically balanced sentences from a large database of phonetically transcribed texts, so that the chosen sentences cover all the different phone combinations of the language in as few sentences as possible. This is an automatic process, but some compromise is always needed between the number of sentences and the coverage of rare phone combinations.
  • Selecting a speaker. The choice of voice is the single most important decision in the development of a TTS system. This may appear to be counter-intuitive – after all, doesn’t almost everyone speak clearly enough to be understood? But TTS requires a clear, precise rendition of every word in the database. The system constructs its output by combining small segments of words in the database, so each one has to be exactly right. Besides, the speech needs to be as intelligible as possible to start with, so that any deterioration in quality that may occur in the process of joining up these segments has the least possible impact.
  • Recording the phonetically balanced sentences. For Kiswahili, there were about 400 sentences, which took around 45 minutes to record; for most other languages more spoken sentences are needed.
  • Making phonetic annotations of the recordings by hand. Although this can be done automatically, any resulting errors can create problems for the TTS output.
  • Compiling all the data into a TTS system using Festival, an open source software package developed at the University of Edinburgh.
  • Testing the system.
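The sentence selection step can be pictured as a greedy coverage problem. The sketch below is one plausible way to do it, not necessarily the method LLSTI used: it repeatedly picks the sentence that adds the most phone pairs not yet covered. The function names, the `max_sentences` limit and the toy transcriptions are all assumptions for illustration.

```python
def phone_pairs(phones):
    """Return the set of adjacent phone pairs in one transcribed sentence."""
    return set(zip(phones, phones[1:]))


def select_sentences(corpus, max_sentences):
    """Greedily pick sentences that add the most not-yet-covered phone pairs.

    corpus maps a sentence label to its phone transcription (list of symbols).
    max_sentences expresses the trade-off between recording time and coverage.
    """
    remaining = {label: phone_pairs(p) for label, p in corpus.items()}
    covered = set()
    chosen = []
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda label: len(remaining[label] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Toy transcriptions (hypothetical, not real data).
corpus = {
    "s1": ["a", "b", "a"],
    "s2": ["a", "b", "c", "a"],
    "s3": ["b", "a"],
}
chosen, covered = select_sentences(corpus, max_sentences=3)
print(chosen, len(covered))
```

Lowering `max_sentences` shortens the recording session at the cost of leaving rare phone combinations uncovered, which is the compromise described in the list above.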

If it is not possible to define the rules in the first step, there are a number of data-driven techniques that can be used instead. A data-driven technique takes a large database of annotated textual data, usually from public sources (like the internet) if they exist in that language, and attempts to derive ‘rules’ automatically. Annotating data by hand is laborious and time-consuming, and over the years researchers have tried to minimize the proportion of data requiring accurate annotation, the ultimate aim being acceptable performance with no hand annotation at all. In South Africa, for instance, the Meraka Institute has developed a pronunciation dictionary builder that employs an iterative technique to build the entire dictionary with a minimum of effort.
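To give a feel for how an iterative approach can reduce manual effort, here is a deliberately simplified Python sketch. It is not the Meraka Institute's algorithm: it assumes each letter corresponds to exactly one phone, and the names `learn_rules`, `predict`, `build_dictionary` and the `verify` callback (standing in for a human checker) are hypothetical.

```python
from collections import Counter, defaultdict


def learn_rules(dictionary):
    """Learn the most frequent phone for each letter from verified entries.

    Assumes one phone per letter, which is a strong simplification.
    """
    counts = defaultdict(Counter)
    for word, phones in dictionary.items():
        for letter, phone in zip(word, phones):
            counts[letter][phone] += 1
    return {letter: c.most_common(1)[0][0] for letter, c in counts.items()}


def predict(word, rules):
    """Predict a pronunciation; letters not seen so far are marked '?'."""
    return [rules.get(letter, "?") for letter in word]


def build_dictionary(words, verify):
    """Predict each word, have it verified, relearn, and count corrections."""
    dictionary = {}
    corrections = 0
    for word in words:
        rules = learn_rules(dictionary)
        guess = predict(word, rules)
        correct = verify(word, guess)  # a human check in a real workflow
        if correct != guess:
            corrections += 1
        dictionary[word] = correct
    return dictionary, corrections


# Toy reference pronunciations stand in for the human verifier.
reference = {"kata": list("kata"), "taka": list("taka"), "kaka": list("kaka")}
dictionary, corrections = build_dictionary(
    ["kata", "taka", "kaka"], lambda word, guess: reference[word]
)
print(corrections)
```

In this toy run only the first word needs correcting, because the rules learned from it already predict the other two. Real languages are far less tidy, but the principle is the same: each verified entry improves the next prediction.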

For the majority of languages, even a basic TTS system requires morphological analysis (MA) of each word to derive its part of speech – a process that is language-specific and usually quite complicated, and is the major show-stopper for many languages. Consequently, LLSTI is currently involved in a major research project at the University of Bristol, UK, to develop a machine-learning MA system that can be applied to languages with very little known linguistic data.

Words talk

Back in Kenya, the Banana Information Line was formally evaluated with the help of a carefully selected group of 10 farmers in Kirinyaga district. The evaluation revealed some interesting problems. For example, seven out of the 10 participants chose to listen to the information in English, but then struggled with the British accent. Those who chose Kiswahili loved the voice, but then struggled with the formal Kiswahili grammar used in the translation. All of them said that they liked the voice system and that they preferred it to written material, but it was clear that the accent and translation issues would need to be fixed before it could be put to wider use. Dr Gakuru played some samples from a Kenyan English TTS under development, and the farmers found this even clearer than the original Kiswahili version. That TTS is now almost completed, and will be used in future.

During the pilot project and the evaluation, the LLSTI team consulted NALEP to build up a picture of what a phone-based agricultural information service should offer (see box). Such a service could allow farmers to get the specific information they need, whenever and wherever they need it, and in a language they can understand.

With the introduction of mobile data services across Africa, a future version of the information line becomes possible where the TTS system runs on the phone itself. Only the text data would be transferred, which the phone would then convert to speech. Not all farmers will have a mobile phone that can support this, but for those who do, it could provide a very attractive option, for a number of reasons. First, the cost of using the information line is much lower – mobile calls are still quite expensive in Kenya. Second, users can access the information they need using visual (pictorial or icon) menus, with a text search function for those who are happy to try it. Third, pictures and key numbers/words can be displayed along with the voice, making it easier for users to absorb the information and remember it afterwards.

All of this can be built with today’s technology. The part that is still missing is a TTS system in the local language that people are comfortable using. Some of these languages present difficulties, and there is a lot of interesting and challenging work to do, but once these problems are resolved, the systems will be there to be used. The South African government is already putting resources into this; surely it is time for others to follow.

Dr. Roger Tucker is director of the Local Language Speech Technology Initiative (LLSTI).

10 December 2007

Copyright © 2014, CTA. Technical Centre for Agricultural and Rural Cooperation (ACP-EU)