CLARA Algorithm in a Distributed System

Dr. Vo Ngoc Phu


Websites, Facebook, and other social networks have grown extremely quickly and now support commerce, education, and many other activities. Making effective use of the billions of documents on these sites has long been important for commercial applications, personal applications, and research. For these reasons, we propose a new model that uses the CLARA algorithm with Hadoop Map (M)/Reduce (R) for semantic classification of English documents in a distributed system, that is, a parallel network environment. The model is intended to classify billions of English documents in a short time in such a system. We tested the model on a testing data set of 25,000 English reviews (12,500 positive and 12,500 negative) and achieved 60.3% accuracy. The English training data set contains 70,000 English sentences: 35,000 positive and 35,000 negative.
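The abstract names CLARA but does not describe its steps. As a rough illustration only, the following single-machine Python sketch shows the general CLARA idea: draw several random samples, run a PAM-style k-medoids search on each sample, score each resulting medoid set against the full data set, and keep the cheapest one. This is not the authors' implementation. The function names (`pam`, `clara`, `assign`), the parameter defaults, the Euclidean distance, and the two-dimensional toy points standing in for document vectors are all assumptions made for the example. In a Map/Reduce setting, one plausible mapping would be to process each sample in a separate map task and select the lowest-cost medoid set in the reduce step, but the abstract does not confirm that design.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(distance(p, m) for m in medoids) for p in points)


def pam(sample, k, rng):
    """PAM-style k-medoids on a small sample: swap until no swap helps."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in sample:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:  # accept only strict improvements
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA: run PAM on random samples, keep the best medoids overall."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # scored on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost


def assign(point, medoids):
    """Index of the medoid closest to the given point."""
    return min(range(len(medoids)), key=lambda i: distance(point, medoids[i]))


# Toy data (hypothetical): two well-separated 2-D blobs.
_rng = random.Random(1)
points = [(_rng.uniform(-1, 1), _rng.uniform(-1, 1)) for _ in range(50)]
points += [(10 + _rng.uniform(-1, 1), 10 + _rng.uniform(-1, 1)) for _ in range(50)]
medoids, cost = clara(points, k=2)
```

Because PAM runs only on small samples, its cost per sample does not grow with the size of the full data set; only the final scoring pass touches every point, which is the part that parallelizes naturally.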








This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
