The Research Centre on Linguistics and Language Information Sciences (RCLIS), founded at The Hong Kong Institute of Education in May 2010, draws on its predecessor, the Language Information Sciences Research Centre established at City University of Hong Kong in 1995.
As one of the five institute-level research centres at HKIEd, RCLIS aims to foster interdisciplinary research in linguistics, natural language processing and information science. It provides a forum for experienced researchers and young scholars to work together on problems of language and information technology in Chinese speech communities. It also aims to play a major role in advancing language information sciences globally, to bridge technology with the humanities and social sciences, and to inform the community about relevant research findings, especially within the Chinese context.
The LIVAC (Linguistic Variations in Chinese Speech Communities) synchronous corpus contains texts from representative Chinese newspapers and electronic media of Hong Kong, Taiwan, Beijing, Shanghai, Macau and Singapore. Because the collection of materials from these communities is synchronized, the corpus offers an innovative "Window" approach for a wide range of comparative studies and IT applications.
The LIVAC corpus is analyzed at the level of various linguistic units (e.g. characters, words, sentences) and serves many purposes. In particular, it provides an important database and a means for in-depth investigation of lexical development in contemporary Chinese, including the evolution of new concepts and their expressions.
All corpus texts have undergone automatic segmentation, and the results have been manually verified. A lexical database is derived from the segmented texts. In addition to ordinary words, the database singles out words that express new concepts or are undergoing sense shifts, as well as region-specific words from the six communities. The database is thus a rich resource for research into linguistics, sociolinguistics, and Chinese language and society. Up-to-date quantitative data on the Chinese language are also particularly useful for applications in Information Technology, including the development of search engines and machine translation systems in language engineering.
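As a rough illustration of how a lexical database might be derived from segmented texts, the following minimal Python sketch counts word frequencies per community and picks out words attested in only one community. It is not the Centre's actual procedure: the function names, the input format (pairs of community name and word list) and the sample data are all assumptions made for this example, and "attested in only one community" is a deliberately crude stand-in for region-specific vocabulary.

```python
from collections import Counter, defaultdict


def build_lexical_database(segmented_texts):
    """Count word frequencies per community.

    segmented_texts: iterable of (community, list_of_words) pairs,
    a hypothetical format for already-segmented corpus texts.
    """
    counts = defaultdict(Counter)
    for community, words in segmented_texts:
        counts[community].update(words)
    return counts


def community_specific_words(counts):
    """Return, for each community, the words found only in that community."""
    seen_in = defaultdict(set)
    for community, counter in counts.items():
        for word in counter:
            seen_in[word].add(community)

    result = {community: set() for community in counts}
    for word, communities in seen_in.items():
        if len(communities) == 1:
            (only,) = communities
            result[only].add(word)
    return result


# Invented sample data: three short segmented texts about a taxi strike.
sample = [
    ("Hong Kong", ["的士", "司機", "罷工"]),
    ("Beijing", ["出租车", "司机", "罢工"]),
    ("Taiwan", ["計程車", "司機", "罷工"]),
]

db = build_lexical_database(sample)
unique = community_specific_words(db)

# Number of distinct word types across all communities.
word_types = set().union(*(counter.keys() for counter in db.values()))
```

In this toy data, the three different words for "taxi" are each attested in a single community. The Beijing words for "driver" and "strike" also count as unique here, but only because simplified and traditional character forms are compared as raw strings; a real comparison would need to normalize scripts first.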
Fresh textual materials for the corpus have been collected every four days since July 1995. A 10-year time span was planned for the collection, in order to capture salient changes in the cultural and social fabric of the diverse Chinese speech communities before and after the millennium. As of January 2005, the corpus contained over 150 million Chinese characters and over 720,000 word types, and it is still expanding.
The Centre has launched a bi-weekly Celebrity Roster listing the top 25 celebrities in Beijing, Shanghai, Hong Kong and Taiwan according to their media exposure in Chinese newspapers, and similar indices for place names and common words. Comments and feedback are welcome.