International Publications

Finding Optimal Rank for LSI Models

Abstract: Latent Semantic Indexing (LSI) is a powerful linear algebraic method for dimension reduction. It is also very useful for addressing the synonymy problem in textual corpora. A corpus of documents, each represented as a bag of features, is represented as a Term-Document Matrix (TDM), where a term represents a feature. A TDM can also be viewed as a record of an experiment repeated several times on an unknown system, where a term of the TDM represents an unknown variable of the system and a document of the TDM represents one iteration of the experiment. Using LSI, the large hyperspace of a corpus or a system can be decomposed into three smaller matrices (left singular matrix 'U', right singular matrix 'V', and diagonal matrix of singular values 'S') as a function of the rank 'K', a scalar value. The rank is expected to be small enough that the hyperspace can be represented in a subspace without much data loss. The choice of rank 'K' is critical: if the value is chosen to be smaller than optimal, the derived subspace representation is rendered useless because the data loss could become high. We propose a method to mathematically derive the optimal rank, which ensures the best subspace representation of a large hyperspace TDM in reduced dimension. We demonstrate the efficiency of our method by comparing the accuracy of synonymy measurements made on reduced-dimension subspaces that are cut at different 'K' values.

Index: Accuracy Measurement, Diagonal Matrix, Dimension Reduction, Hyperspace, LSI, Optimal Rank, Singular Matrix, Singular values, Sub-space, Synonymy, Term Document Matrix

Reference: Sudarsun S, Venkatesh Prabhu, "Finding Optimal Rank for LSI Models", Proceedings of ICAET 2010, pp. 2010
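
As an illustration of the rank-'K' decomposition described in the abstract of this paper, the following minimal sketch truncates the SVD of a small TDM and reports how much of the matrix is lost at each rank. It does not reproduce the paper's method for deriving the optimal rank; the toy matrix, the function names, and the use of relative Frobenius-norm error as the "data loss" measure are all assumptions made for illustration.

```python
import numpy as np

def truncate_svd(tdm, k):
    """Return the rank-k factors (U_k, S_k, V_k^T) of a term-document matrix."""
    u, s, vt = np.linalg.svd(tdm, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def relative_error(tdm, k):
    """Frobenius-norm error of the rank-k approximation, relative to the original."""
    u_k, s_k, vt_k = truncate_svd(tdm, k)
    approx = (u_k * s_k) @ vt_k  # same as U_k @ diag(S_k) @ V_k^T
    return np.linalg.norm(tdm - approx) / np.linalg.norm(tdm)

# Toy 4-term x 3-document count matrix (made-up data, not from the paper).
tdm = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0],
])

for k in (1, 2, 3):
    print(k, relative_error(tdm, k))
```

The error shrinks as 'K' grows and vanishes at full rank, which is why a 'K' chosen too small discards too much of the original space.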

Topic Models based Personalized Spam Filter

Abstract: Spam filtering poses a critical problem in text categorization because the features of the text are continuously changing. Spam evolves continuously, which makes it difficult for a filter to classify new, evasive feature patterns. Because most practical applications are based on online user feedback, the task calls for fast, incremental, and robust learning algorithms. This paper presents a system for the automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier that uses two topic models, LSI and PLSA, complemented with a natural language approach based on text pattern matching. By combining these powerful statistical and NLP techniques, we obtained a parallel content-based spam filter that performs the filtration in two stages. In the first stage, each model generates its individual predictions; in the second stage, these are combined by a voting mechanism.

Index: Dimension Reduction, LSA, N-Gram, PCA, PLSA, Spam Filter, Topic Models, Vectorization

Reference: Sudarsun S, Venkatesh Prabhu, Valarmathi B, "Topic Models based Personalized Spam Filter", Proceedings of ISCF 2006, pp. 199-203, 2006
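
The two-stage design described in the abstract of this paper can be sketched as follows. The abstract does not specify the voting rule, so simple majority voting is assumed here; the model names, the labels, and the tie-breaking choice are likewise hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model labels by simple majority.

    `predictions` maps a model name to its predicted label.
    Ties fall back to "ham" (an assumption: prefer not to block mail).
    """
    ranked = Counter(predictions.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "ham"
    return ranked[0][0]

# Stage one: each model has produced its own prediction (made-up values).
stage_one = {"lsi": "spam", "plsa": "spam", "pattern": "ham"}

# Stage two: the predictions are combined by voting.
label = majority_vote(stage_one)
print(label)
```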

Role of Weighting on TDM in Improvising Performance of LSA on Text Data

Abstract: In this paper, we show that the efficiency of LSA is significantly controlled by the choice of weighting algorithm applied. These weighting algorithms allocate relative importance to the document attributes (e.g., keywords) based on their occurrences in the corpus. We evaluated various weighting algorithms to study their effects as measured by precision-recall (P-R) values. Our experiments include applying a weighting function to the TDM (pre-weighting) in order to increase or decrease the relative importance of words based on their occurrence. We also evaluated the application of weighting functions to the projected query (post-weighting). Post-weighted keyword queries were projected onto an LSA model built on a pre-weighted TDM to obtain closely correlated keywords or a document (keyword collection).

Index: Information Retrieval, Weighting Functions, IDF, IWF, WIDF, NDV, LSA, SVD, Precision, Recall, TDM

Reference: Sudarsun S, Venkatesh Prabhu G, Sathishkumar V, "Role of Weighting on TDM in Improvising Performance of LSA on Text Data", Proc of IEEE INDICON 2006, 2006
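
To make the pre-weighting step from the abstract of this paper concrete, here is a minimal sketch that applies one common scheme, inverse document frequency (IDF), to a TDM before any LSA model is built. The paper evaluates several weighting functions; the formula log(N / df) used here is the textbook IDF and may differ from the exact variant the authors used. The function name and the toy matrix are assumptions.

```python
import numpy as np

def idf_weight(tdm):
    """Scale each term row of a term-document matrix by log(N / df).

    N is the number of documents and df is the number of documents
    in which the term occurs. Terms that occur in every document
    receive weight 0.
    """
    n_docs = tdm.shape[1]
    df = np.count_nonzero(tdm, axis=1)
    idf = np.log(n_docs / np.maximum(df, 1))  # guard against df == 0
    return tdm * idf[:, np.newaxis]

# Toy 3-term x 3-document count matrix (made-up data).
tdm = np.array([
    [1.0, 2.0, 1.0],  # occurs in every document
    [0.0, 2.0, 0.0],  # occurs in one document
    [3.0, 0.0, 1.0],  # occurs in two documents
])

weighted = idf_weight(tdm)
print(weighted)
```

Rare terms are boosted relative to common ones, so the subsequent SVD is driven less by words that appear everywhere.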

Adaptive Document Classification

Reference: G.Venkatesh Prabhu and Prof.S.S.Sridhar, "Adaptive Document Classification", Proceedings of the International Conference on "Trends in Computer Science and Engineering", held at Dr.M.G.R. Educational Research Institute, Chennai in May 2004.