Application of Big Data Analytics and Machine Learning to Large-Scale Synchrophasor Datasets: Evaluation of Dataset ''Machine Learning-Readiness''
Philip Hart, Lijun He, Tianyi Wang, Vijay S. Kumar, Kareem Aggour, Arun Subramanian, and Weizhong Yan
IEEE Members: Free
Non-members: FreePages/Slides: 12
This manuscript presents a data quality analysis and holistic 'machine learning-readiness' evaluation of a representative set of large-scale, real-world phasor measurement unit (PMU) datasets provided under the United States Department of Energy-funded FOA 1861 research program. A major focus of this study is to understand the present-day suitability of large-scale, real-world synchrophasor datasets for application of commercially-available, off-the-shelf big data and supervised or semi-supervised machine learning (ML) tools and catalogue any major obstacles to their application. To this end, dataset quality is methodically examined through an interconnect-wide quantifications of basic bad data occurrences, a summary of several harder-to-detect data quality issues that can jeopardize successful application of machine learning, and an evaluation of the adequacy of event log labeling for supervised training of models used for online event classification. A global 'six-point' statistical analyses of several key dataset variables is demonstrated as a means by which to identify additional hard-to-detect data quality issues, also providing an example successful application of big data technology to extract insights regarding reasonable operational bounds of the US power system. Obstacles for application of commercial ML technologies are summarized, with a particular focus on supervised and semi-supervised ML. Lessons-learned are provided regarding challenges associated with present-day event labeling practices, large spatial scope of the dataset, and dataset anonymization. Finally, insight into efficacy of employed mitigation strategies are discussed, and recommendations for future work are made.