Introduction
Text mining has been defined as “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” [6].
The problem introduced by text mining is obvious: natural language was developed for humans to communicate with one another and to record information, and computers are a long way from comprehending it. Humans can recognize and apply linguistic patterns in text, and they easily overcome obstacles that computers handle poorly, such as slang, spelling variations, and contextual meaning. However, although our language capabilities allow us to comprehend unstructured data, we lack the computer’s ability to process text in large volumes or at high speeds.
Herein lies the key to text mining: creating technology that combines a human’s linguistic capabilities with the speed and accuracy of a computer.
Figure 1 depicts a generic process model for a text mining application. Starting with a collection of documents, a text mining tool would retrieve a particular document and preprocess it by checking format and character sets. Then it would go through a text analysis phase, sometimes repeating techniques until information is extracted. Three text analysis techniques are shown in the example, but many other combinations of techniques could be used depending on the goals of the organization. The resulting information can be placed in a management information system, yielding an abundant amount of knowledge for the user of that system.
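The process model described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the engine proposed in this paper: the function names (`preprocess`, `tokenize`, `count_terms`, `top_terms`, `mine`), the UTF-8 default, and the three analysis steps are hypothetical stand-ins for whichever techniques Figure 1 shows, and a plain dictionary stands in for the management information system.

```python
import unicodedata
from collections import Counter


def preprocess(raw: bytes, encoding: str = "utf-8") -> str:
    """Check the character set and normalise the text (assumed: UTF-8 input)."""
    # decode() raises UnicodeDecodeError if the bytes do not match the encoding
    text = raw.decode(encoding)
    # NFC normalisation makes equivalent character sequences compare equal
    return unicodedata.normalize("NFC", text).strip()


# Three placeholder analysis techniques; each reads and extends a shared state.
def tokenize(state: dict) -> dict:
    # Whitespace splitting is a crude assumption; real Urdu/Arabic text
    # would need a language-aware tokenizer.
    state["tokens"] = state["text"].lower().split()
    return state


def count_terms(state: dict) -> dict:
    state["counts"] = Counter(state["tokens"])
    return state


def top_terms(state: dict, k: int = 2) -> dict:
    state["keywords"] = [term for term, _ in state["counts"].most_common(k)]
    return state


def mine(collection: dict, techniques: list) -> dict:
    """Retrieve each document, preprocess it, analyse it, and store the result."""
    store = {}  # stand-in for the management information system
    for doc_id, raw in collection.items():
        state = {"text": preprocess(raw)}
        for technique in techniques:  # a technique may appear more than once
            state = technique(state)
        store[doc_id] = state["keywords"]
    return store


collection = {"d1": "text mining text mining text data".encode("utf-8")}
result = mine(collection, [tokenize, count_terms, top_terms])
print(result)  # {'d1': ['text', 'mining']}
```

Passing the techniques as a list mirrors the remark that other combinations could be used depending on the goals of the organization: swapping or repeating an analysis step changes only the list, not the pipeline.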
Currently, no efficient software is available for searching or mining text in Urdu or Arabic. Although a very large number of people speak Urdu or Arabic, they do not have access to a software or web interface through which they can run queries and view documents in these languages. Similarly, translation between these languages and English is still mostly unavailable. With this project we intend to provide a software engine that offers these services.
Why Multi-language Text Mining?
Because almost all developed countries use their national languages for research and progress; China, France, and Germany are examples. Many valuable texts exist in Urdu and Arabic, but they are unavailable to researchers. Likewise, English documents are out of reach for capable researchers who want to use data from them but do not know English, although they could make use of such a document if it were available in a language they understand.
©2005 Jatit