Multilingual Text Mining

 

develop a full understanding, and then write a summary highlighting its main points. Since computers do not yet have the language capabilities of humans, alternative methods must be considered.

 

One of the strategies most widely used by text summarization tools, sentence extraction, selects important sentences from an article by statistically weighting them. Heuristics such as position information are also used. For example, a summarization tool may extract the sentences that follow the key phrase “in conclusion”, which typically introduces the main points of the document. Tools may also search for headings and other markers of subtopics in order to identify the key points of a document. Microsoft Word’s AutoSummarize function is a simple example of text summarization. Many text summarization tools allow the user to choose the percentage of the total text to be extracted as a summary.
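The sentence-extraction idea can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular tool: the function names, the small stopword list, the average-word-frequency score, and the bonus values for the first sentence and for the cue phrase “in conclusion” are all assumptions chosen for the example. The fraction parameter plays the role of the user-chosen percentage of text to extract.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "as", "are", "by", "be", "this", "with"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    # Lowercased alphabetic tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, fraction=0.3, cue_bonus=1.0, lead_bonus=0.5):
    """Return roughly `fraction` of the sentences, in original order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    freq = Counter(content_words(text))  # document-wide word frequencies

    scores = []
    for i, sent in enumerate(sentences):
        tokens = content_words(sent)
        # Statistical weight: average document frequency of the sentence's words.
        score = sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        if i == 0:
            score += lead_bonus  # position heuristic: favor the opening sentence
        if "in conclusion" in sent.lower():
            score += cue_bonus   # cue-phrase heuristic
        scores.append(score)

    keep = max(1, round(fraction * len(sentences)))
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


text = ("Text mining finds patterns in text. The weather was nice. "
        "In conclusion, text mining helps people search text. Cats sleep a lot.")
for sentence in summarize(text, fraction=0.5):
    print(sentence)
```

With fraction=0.5, the sketch keeps two of the four sentences: the opening sentence and the one introduced by “in conclusion”. A real tool would need a better sentence splitter and a fuller stopword list.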

 

In our project, we would present summarized search results so that the user can take a quick look and then decide whether to open the document.

 

 

Clustering

 

Clustering is a technique used to group similar documents, but it differs from categorization in that documents are clustered on the fly instead of through the use of predefined topics. Another benefit of clustering is that documents can appear in multiple subtopics, thus ensuring that a useful document will not be omitted from search results. A basic clustering algorithm creates a vector of topics for each document and measures the weights of how well the document fits into each cluster. For example, if someone goes to www.vivisimo.com and types “Islam and Science” in the search field, the returned topics include Quran, Modern, Medicine, Islam Online, and Islamic History. (See screenshot)
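The two properties described above, clusters formed on the fly and documents allowed in more than one cluster, can be illustrated with a minimal single-pass sketch in Python. It is not the algorithm used by Vivisimo or any other product: the word-count vectors, the cosine similarity measure, the threshold value, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    # Represent a document as a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering with overlapping membership.

    A document joins every existing cluster whose centroid is similar
    enough; if none qualifies, it starts a new cluster. No predefined
    topics are used.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for idx, doc in enumerate(docs):
        vec = vectorize(doc)
        joined = False
        for c in clusters:
            if cosine(vec, c["centroid"]) >= threshold:
                c["members"].append(idx)
                c["centroid"].update(vec)  # fold the document into the centroid
                joined = True
        if not joined:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
    return clusters


docs = ["quran science medicine", "islamic history empire", "quran history"]
for c in cluster(docs):
    print(c["members"])
```

In this toy input, the third document shares a word with each of the first two, so it is placed in both clusters. The result of a single-pass scheme depends on document order and on the threshold, which is one reason practical systems use more elaborate methods.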

 


©2005 Jatit