Machine learning for text classification in building management systems

Jose Joaquin Mesa-Jiménez; Lee Stokes; QingPing Yang; Valerie N. Livina

doi:10.3846/jcem.2022.16012

Abstract

In building management systems (BMS), a medium-sized building may have between 200 and 1000 sensor points. Their labels need to be translated into a naming standard so that the BMS platform can recognise them automatically. In current industrial practice, these points are often translated into labels manually (the tagging process), which takes around 8 hours for every 100 points. We introduce an AI-based multi-stage text classification method that translates BMS points into formatted BMS labels. After comparing five text classification techniques (logistic regression, random forests, XGBoost, multinomial Naive Bayes and linear support vector classification), we demonstrate that XGBoost is the top performer with 90.29% of true positives, and we use the prediction confidence to filter out false positives. This approach can be applied to sensor networks in various applications where manual pre-processing of free-text data remains cumbersome.
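The idea of using prediction confidence to filter out false positives can be illustrated with a minimal sketch. It is not the authors' pipeline: to stay dependency-free it uses a small multinomial Naive Bayes classifier (one of the five techniques compared) over character trigrams instead of XGBoost, and the point names, the two labels, the `tag_point` helper and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a raw point name into lowercase character n-grams."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def fit(self, texts, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.feature_counts.values() for g in c}
        return self

    def predict_proba(self, text):
        """Return a dict mapping each label to its posterior probability."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[gram] + self.alpha) / denom)
            log_scores[label] = score
        # Normalise the log scores into probabilities (stable softmax).
        top = max(log_scores.values())
        exps = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exps.values())
        return {k: v / norm for k, v in exps.items()}


def tag_point(model, text, threshold=0.8):
    """Return (label, confidence); label is None when confidence is too low.

    Points returned with label None would be routed to manual tagging.
    """
    proba = model.predict_proba(text)
    label = max(proba, key=proba.get)
    confidence = proba[label]
    return (label if confidence >= threshold else None), confidence


# Hypothetical training data: invented point names and invented labels.
train = [
    ("AHU1 Supply Air Temp", "temp"),
    ("Zone 3 Room Temperature", "temp"),
    ("Chiller Return Temp", "temp"),
    ("AHU2 Supply Fan Speed", "speed"),
    ("Pump 1 Speed Cmd", "speed"),
    ("Exhaust Fan Speed", "speed"),
]
model = TinyMultinomialNB().fit([t for t, _ in train], [y for _, y in train])

label, confidence = tag_point(model, "AHU3 Return Air Temp")
print(label, round(confidence, 3))
```

Raising the threshold trades coverage for precision: fewer points are tagged automatically, but those that are tagged are less likely to be false positives, and the remainder fall back to the manual process.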

Keywords: free-text classification, building management systems, Haystack data standard, sensor tagging

How to Cite

Mesa-Jiménez, J. J., Stokes, L., Yang, Q., & Livina, V. N. (2022). Machine learning for text classification in building management systems. Journal of Civil Engineering and Management, 28(5), 408–421. https://doi.org/10.3846/jcem.2022.16012

Published in Issue

May 12, 2022

This work is licensed under a Creative Commons Attribution 4.0 International License.
