Feature Selection Methods with Applications in Electrical Load Forecasting

Utterbäck, Oscar LU (2017) In Master's Theses in Mathematical Sciences FMA820 20161
Mathematics (Faculty of Engineering)
Abstract
The purpose of this thesis is two-fold: implement and evaluate a method, the Fast Correlation-Based Filter (FCBF) by Yu et al., for feature selection applied on a meteorological data set consisting of 19 weather variables from 606 locations in Scandinavia, and investigate whether geography can be exploited in the search for relevant features. Four areas are chosen as target areas where load prediction error is evaluated as a measure of goodness. A subset of the total data set is used to lower the computation time; only Swedish locations were used, and only data from SMHI was used.

The impact of using different subsets of weather features as well as selecting features from several locations is investigated using FCBF and epsilon-Support Vector Regression. A modification to the FCBF algorithm is tested in one of the experiments, using Pearson correlation in place of symmetrical uncertainty. An investigation of how the relationships between features change with distance is performed and the results are then used to motivate a greedy feature selection method.
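
The filter described above can be sketched roughly as follows. This is a minimal illustration of FCBF as published by Yu et al., not the thesis implementation: the equal-width binning in `discretize`, the bin count, and all function names are assumptions made for the example. Symmetrical uncertainty is computed as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def discretize(x, bins=10):
    """Equal-width binning of a continuous variable (bin count is an assumption)."""
    edges = np.histogram_bin_edges(x, bins=bins)
    return np.digitize(x, edges[1:-1])  # integer labels 0 .. bins-1

def symmetrical_uncertainty(x, y):
    """SU for two arrays of non-negative integer labels; result lies in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 0.0
    # Encode each (x, y) pair as a single label to get the joint entropy.
    joint = x.astype(np.int64) * (int(y.max()) + 1) + y
    mutual_info = hx + hy - entropy(joint)
    return 2.0 * mutual_info / (hx + hy)

def fcbf(X, y, threshold=0.0):
    """Return indices of the columns of discrete X kept by the filter."""
    su_target = np.array(
        [symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])]
    )
    # Relevance step: keep features whose SU with the target reaches the
    # threshold, ordered from most to least relevant.
    order = [int(j) for j in np.argsort(-su_target) if su_target[j] >= threshold]
    selected = []
    while order:
        best = order.pop(0)
        selected.append(best)
        # Redundancy step: drop every remaining feature that is at least as
        # correlated with `best` as it is with the target.
        order = [
            j for j in order
            if symmetrical_uncertainty(X[:, best], X[:, j]) < su_target[j]
        ]
    return selected
```

Continuous weather variables would first be passed through `discretize`. The Pearson variant mentioned above would, under the same assumptions, replace `symmetrical_uncertainty` with an absolute correlation coefficient.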

FCBF, even when implemented with the naive approximation of marginal and conditional entropy, filtered the total data set from 3180 to approximately 20 features with a prediction error of less than 1% for three of the target areas and 1.71% for the fourth. Further tests lowered the number of features even further without significantly affecting the prediction error. Using FCBF to rank the weather variables for a single area proved less than optimal, which may be attributed to many of the extremely small intra-feature SU values. Selecting locations based on distance from the target area resulted in prediction errors better than random sampling and comparable to the filter while still keeping the number of features low.

The very best feature selection results were only slightly lower than a base case, suggesting that the present experimental setting may not be enough to draw definitive conclusions regarding the efficacy of the selection methods. Two possible contributing factors are the unoptimized model used, and the choice to investigate the impact on average load over a 24-hour window. Future studies may also wish to extend the geographical investigation to use coordinates or direction in conjunction with distance from the target area, as some indication of latitude-dependent behavior was found, most likely caused by the elongated shape of Sweden.
Popular Abstract
Finding useful information in a large data set to better predict consumption of electricity

Data describing the weather at different places in Scandinavia shows a lot of redundancy which may affect its usefulness in predicting future electricity consumption. This master thesis tests two methods for removing lots of useless or harmful information.

Predicting the consumption of electricity on a city-wide scale allows those who manage equipment, generate and store electricity, and buy and sell energy to better plan the maintenance of their equipment, and to ensure that there are enough electrons flowing through your wall socket when you plug in your new computer. The predictions are done using artificial intelligence methods that look for patterns in data that can be used to determine the magnitude of electric consumption in the future. One of the main problems in performing accurate predictions is finding the right data to use; choosing the wrong variables may lead to poor predictions, which in turn may lead to equipment failure or other costly decisions for the energy providers and utility companies.

The data sets typically used for these kinds of predictions describe different aspects of future weather. Since weather is a natural phenomenon that differs more the farther apart two points are, we may assume that there will be a lot of data showing basically the same thing; the weather in Lund is probably not very different from the weather in Malmö, while the weather in Umeå might differ much more from the other two cities. In this case, we call the data from Lund and Malmö redundant in the light of each other. The goal of this thesis has been to investigate methods that sort through the data set in order to find useful data, which we call relevant, and remove redundant information.

Two approaches are taken. First, we look at the properties of the data itself and measure relevancy and redundancy by seeing if there is a significant similarity between pairs of variables. For this purpose an algorithm called the Fast Correlation-Based Filter is implemented and evaluated. The filter searches through the data set without considering all possible combinations of variables in order to make it faster. Second, we look at the possibility of choosing relevant data based on geographical location. Motivated by the fact that weather data from places close to each other are very similar, it is possible to sort through the data set just by using distance from the city where electrical consumption is being predicted. Both methods show promising results when tested on predicting the daily average electricity consumption for four areas, managing to remove over 99% of the data while still performing accurate predictions. Further tests should investigate the computations performed for the statistical measure used, as well as see how useful the methods are on data of higher resolution.
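
The distance-based idea can be illustrated with a short sketch. It is only an assumption of how such a selection might look, not the method used in the thesis: the great-circle (haversine) distance, the function names, and the approximate city coordinates are all chosen for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_locations(target, stations, k):
    """Return the names of the k stations closest to `target`, nearest first.

    target:   (lat, lon) of the area whose load is being predicted
    stations: dict mapping a station name to its (lat, lon)
    """
    dist = {
        name: float(haversine_km(*target, *coords))
        for name, coords in stations.items()
    }
    return sorted(dist, key=dist.get)[:k]

# Approximate coordinates, for illustration only.
lund = (55.70, 13.19)
stations = {
    "Malmö": (55.60, 13.00),
    "Umeå": (63.83, 20.26),
    "Stockholm": (59.33, 18.07),
}
closest = nearest_locations(lund, stations, k=2)
```

Weather variables from the returned locations would then be the candidate features, with `k` controlling how many locations are kept.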
author
Utterbäck, Oscar LU
supervisor
organization
course
FMA820 20161
year
2017
type
H2 - Master's Degree (Two Years)
subject
keywords
Machine Learning, Data Science, Electrical Load Forecasting, Short Term Load Forecasting, Feature Selection, Filter Methods
publication/series
Master's Theses in Mathematical Sciences
report number
LUTFMA-3315-2017
ISSN
1404-6342
other publication id
2017:E16
language
English
id
8910931
date added to LUP
2017-06-20 14:56:03
date last changed
2017-06-20 14:56:03
@misc{8910931,
  abstract     = {The purpose of this thesis is two-fold: implement and evaluate a method, the Fast Correlation-Based Filter (FCBF) by Yu et al., for feature selection applied on a meteorological data set consisting of 19 weather variables from 606 locations in Scandinavia, and investigate whether geography can be exploited in the search for relevant features. Four areas are chosen as target areas where load prediction error is evaluated as a measure of goodness. A subset of the total data set is used to lower the computation time; only Swedish locations were used, and only data from SMHI was used.

The impact of using different subsets of weather features as well as selecting features from several locations is investigated using FCBF and epsilon-Support Vector Regression. A modification to the FCBF algorithm is tested in one of the experiments, using Pearson correlation in place of symmetrical uncertainty. An investigation of how the relationships between features change with distance is performed and the results are then used to motivate a greedy feature selection method. 

FCBF, even when implemented with the naive approximation of marginal and conditional entropy, filtered the total data set from 3180 to approximately 20 features with a prediction error of less than 1% for three of the target areas and 1.71% for the fourth. Further tests lowered the number of features even further without significantly affecting the prediction error. Using FCBF to rank the weather variables for a single area proved less than optimal, which may be attributed to many of the extremely small intra-feature SU values. Selecting locations based on distance from the target area resulted in prediction errors better than random sampling and comparable to the filter while still keeping the number of features low.

The very best feature selection results were only slightly lower than a base case, suggesting that the present experimental setting may not be enough to draw definitive conclusions regarding the efficacy of the selection methods. Two possible contributing factors are the unoptimized model used, and the choice to investigate the impact on average load over a 24-hour window. Future studies may also wish to extend the geographical investigation to use coordinates or direction in conjunction with distance from the target area, as some indication of latitude-dependent behavior was found, most likely caused by the elongated shape of Sweden.},
  author       = {Utterbäck, Oscar},
  issn         = {1404-6342},
  keyword      = {Machine Learning,Data Science,Electrical Load Forecasting,Short Term Load Forecasting,Feature Selection,Filter Methods},
  language     = {eng},
  note         = {Student Paper},
  series       = {Master's Theses in Mathematical Sciences},
  title        = {Feature Selection Methods with Applications in Electrical Load Forecasting},
  year         = {2017},
}