Bilkent Information Retrieval Group
Computer Engineering Department
Bilkent University, Ankara, Turkey
Date: August 29
Time: 10:00 am -- 11:00 am
Host: Joyce Chai
Topic detection and tracking (TDT) systems aim to organize the temporally ordered stories of a news stream, as they arrive, by the events they describe. Commercial Web applications of this kind include Google News and NewsIsFree. In such environments, news stories are obtained from several Web sources.
Two major problems in TDT are new event detection (NED) and topic tracking (TT). NED aims to find the first story of each previously unseen event; TT aims to find all subsequent stories on a topic that is defined by a small number of initial stories.
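To make the two tasks concrete, here is a minimal sketch, assuming a bag-of-words representation and cosine similarity with a fixed threshold. The function names, the whitespace tokenization, and the threshold value of 0.2 are illustrative assumptions and are not taken from the work presented in the talk.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_new_events(stories, threshold=0.2):
    """NED sketch: flag a story as a first story if it is not
    similar enough to any earlier story in the stream."""
    seen = []
    flags = []
    for text in stories:
        vec = Counter(text.split())
        best = max((cosine(vec, old) for old in seen), default=0.0)
        flags.append(best < threshold)
        seen.append(vec)
    return flags


def track_topic(seed_stories, stream, threshold=0.2):
    """TT sketch: build a topic vector from a few seed stories and
    flag later stories that are similar enough to it."""
    topic = Counter()
    for text in seed_stories:
        topic.update(text.split())
    return [cosine(Counter(text.split()), topic) >= threshold
            for text in stream]


stream = [
    "earthquake hits coastal city",
    "coastal city earthquake damage reported",
    "election results announced today",
]
new_event_flags = detect_new_events(stream)
on_topic_flags = track_topic(stream[:1], stream[1:])
```

In this toy stream the second story shares most of its terms with the first, so it is treated as a follow-up, while the third story has no overlap with either and is flagged as a new event.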
In this work, we introduce the first large-scale TDT test collection for Turkish and investigate the NED and TT problems in this language. We present our generalizable test collection construction approach, which is inspired by the TDT research initiative and can also be used for other languages. We show that, for some similarity measures, a simple word truncation stemming method can compete in Turkish TDT with a sophisticated stemming approach that takes the morphological structure of the language into account.
Our findings show that word stopping and the contents of the associated stopword list are important and can affect performance. We demonstrate that the confidence scores of two different similarity measures can be combined in a straightforward manner to improve effectiveness. We also investigate the influence of several similarity measures on performance.
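As an illustration of what word truncation stemming means, the sketch below keeps only the first few characters of each word. The function name and the prefix length of 5 are assumptions chosen for the example; the announcement does not state which length was used.

```python
def truncate_stem(word, prefix_len=5):
    """Truncation stemming sketch: keep the first prefix_len characters.

    prefix_len=5 is an assumed value for illustration only.
    """
    return word[:prefix_len]


# Inflected forms of the Turkish word "kitap" (book).
words = ["kitaplar", "kitaplarımız", "kitapta"]
stems = [truncate_stem(w) for w in words]
```

Because Turkish is agglutinative and attaches suffixes to a largely stable root, a fixed-length prefix maps all three forms to the same string without any morphological analysis.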
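One simple way to combine two confidence scores is a weighted average. The sketch below assumes both scores are already on a common [0, 1] scale; the function name and the equal default weighting are hypothetical, since the announcement does not specify the combination method used in the work.

```python
def combine_scores(score_a, score_b, weight=0.5):
    """Weighted average of two confidence scores.

    Assumes both scores lie in [0, 1]; weight is the share given
    to score_a. This is an illustrative choice, not the method
    reported in the talk.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * score_a + (1.0 - weight) * score_b


# Two similarity measures disagree on how confident they are.
combined = combine_scores(0.8, 0.4)
```

With equal weights, the combined score sits midway between the two inputs, so a story is accepted only if the measures jointly clear the decision threshold.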
Fazli Can received his PhD degree in Computer Engineering from the Middle East Technical University in Ankara, Turkey, in 1985. During his PhD studies he worked as an RA at Arizona State University and as an engineer at Intel Arizona. From 1986 to 2005 he taught and conducted research at Miami University, OH. He is presently a professor at Bilkent University, Ankara, Turkey. His publications have appeared in journals such as ACM Transactions on Database Systems, ACM Transactions on Information Systems, and Journal of the American Society for Information Science and Technology. He was co-editor of the ACM SIGIR Forum between 1995 and 2002. He is a co-founder of the Bilkent Information Retrieval Group. His current research interests are information retrieval and text mining.