Authors: ABDULLAH, K. A.
Keywords: Term similarity
Multiple document sources
Web ontology language
Issue Date: May-2016
Abstract: Term Similarity (TS) in retrieval systems is based on lexical matching, which determines whether query terms are useful and reflect the users’ information needs in related domains. Existing work on TS uses Term Frequency-Inverse Document Frequency (TF-IDF) to determine the occurrence of terms in web documents (snippets), an approach that cannot capture semantic language mismatch. This study was designed to develop a conceptual knowledge model that solves the problem of TS in web document retrieval by amplifying a structured semantic network in Multiple Document Sources (MDSs) to reduce mismatch in retrieval results. Four hundred and forty-two IS-A hierarchy concepts were extracted from the Internet using a web ontology language. These hierarchies were structured in MDSs to determine similarities. The concepts were used to formulate queries with the addition of terms from the knowledge domain. Suffix Tree Clustering (STC) was adapted to cluster and structure the web and to reduce the dimensionality of features. The parent-child relationship of the IS-A hierarchy was incorporated into the STC to select the best cluster, consisting of 100 snippets, four web page counts and WordNet as MDSs. Similarity was estimated on the TF-IDF using Cosine, Euclidean and Radial Basis Function (RBF) measures. Based on STC, TF-IDF was modified to develop a Concept Weighting (CW) estimation on snippets and web page counts. Similarity was estimated for both TF-IDF and the developed Concept Weighting in pairs: Cosine and CW-Cosine, Euclidean and CW-Euclidean, and RBF and CW-RBF. The LIn measure of the semantic network (WordNetSimilarity) was extended with the PAth length of the taxonomy concept to develop LIPA. LIPA was compared with other WordNetSimilarity distance measures, Jiang and Conrath (JCN) and Wu and Palmer (WUP), as well as with LIn and PAth length separately.
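As a rough illustration of the baseline measures named above, the following minimal sketch builds TF-IDF vectors for a few snippets and compares them with cosine similarity, Euclidean distance and an RBF kernel. It is not the thesis implementation: the toy snippets, the smoothed IDF variant, the `gamma` parameter and all function names are assumptions made for this example, and Concept Weighting is not reproduced because the abstract does not give its formula.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one TF-IDF vector per tokenized document.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, is used so
    that terms occurring in every document keep a non-zero weight.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([
            (counts[term] / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term in vocab
        ])
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v (smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rbf_similarity(u, v, gamma=1.0):
    """RBF kernel exp(-gamma * ||u - v||^2); gamma is an assumed parameter."""
    return math.exp(-gamma * euclidean_distance(u, v) ** 2)


# Hypothetical snippets, tokenized on whitespace for illustration only.
snippets = [
    "car is a motor vehicle".split(),
    "automobile is a motor vehicle".split(),
    "banana is a yellow fruit".split(),
]
vectors = tf_idf_vectors(snippets)

print("cosine(car, automobile):", cosine_similarity(vectors[0], vectors[1]))
print("cosine(car, banana):", cosine_similarity(vectors[0], vectors[2]))
print("euclidean(car, banana):", euclidean_distance(vectors[0], vectors[2]))
print("rbf(car, banana):", rbf_similarity(vectors[0], vectors[2]))
```

In this toy data the two vehicle snippets share more weighted terms than the vehicle and fruit snippets, so their cosine similarity is higher. Purely lexical weighting still treats "car" and "automobile" as unrelated terms, which is the kind of mismatch the abstract describes.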
Concept Weighting and WordNetSimilarity scores were combined using machine learning techniques to obtain a robust semantic similarity score, with accuracy measured using Mean Absolute Error (MAE). The RBF and CW-RBF generated inconsistent values (0.9 for null and zero snippets). Similarity estimates obtained with Cosine, Euclidean, CW-Cosine and CW-Euclidean were 0.881, 0.446, 0.950 and 0.964, respectively. The retrieved snippets removed irrelevant features and enhanced precision. WordNetSimilarity JCN, WUP, LIn, PAth and LIPA values were 0.868, 0.953, 0.995, 0.955 and 0.998, respectively. The WordNetSimilarity improved the semantic similarity of concepts. The Concept Weighting and WordNetSimilarity measures (CW-Cosine, CW-Euclidean, JCN, WUP, LIn, PAth and LIPA) were combined to generate similarity coefficient scores of 0.941, 0.944, 0.661, 0.928, 0.996, 0.924 and 0.998, respectively. The MAE values for Cosine, Euclidean, CW-Cosine and CW-Euclidean were 0.058, 0.011, 0.014 and 0.009, respectively, while those for JCN, WUP, LIn, PAth and LIPA were 0.022, 0.004, 0.022, 0.019 and 0.020, respectively. The accuracy values of the combined similarity for JCN, WUP, LIn, PAth, CW-Cosine, CW-Euclidean and LIPA were 0.023, 0.050, 0.008, 0.011, 0.024, 0.015 and 0.009, respectively. The developed conceptual knowledge model improved retrieval of web documents with structured multiple document sources. This improved the precision of the information retrieval system and solved the problem of semantic language mismatch with robust similarity between terms.
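For readers unfamiliar with the accuracy measure used above, MAE is the average absolute difference between predicted and reference scores. The sketch below shows the calculation only. The score lists are hypothetical values chosen for the example and are not taken from the thesis, and the function name is an assumption.

```python
def mean_absolute_error(predicted, reference):
    """Average of |predicted_i - reference_i| over all score pairs."""
    if len(predicted) != len(reference):
        raise ValueError("score lists must have the same length")
    if not predicted:
        raise ValueError("score lists must not be empty")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical similarity scores for three term pairs (illustration only).
predicted_scores = [0.75, 0.5, 0.25]
reference_scores = [1.0, 0.5, 0.0]

print("MAE:", mean_absolute_error(predicted_scores, reference_scores))
```

A lower MAE means the predicted similarity scores lie closer to the reference scores, so a value near zero indicates close agreement.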
Description: A thesis in the Department of COMPUTER SCIENCE Submitted to the Faculty of Science in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY of the UNIVERSITY OF IBADAN
Appears in Collections: Theses

Files in This Item:
File: ui_thesis_abdullah_conceptual_2016.pdf
Description: full text
Size: 5.49 MB
Format: Adobe PDF

Items in UISpace are protected by copyright, with all rights reserved, unless otherwise indicated.