Anthropogenic environmental pollution is a known and indisputable issue, and the importance of searching for reliable mathematical models that help in understanding the underlying processes is witnessed by the extensive literature on the topic. In this article, we focus on the temporal aspects of the processes that govern the concentration of pollutants using typical explanatory variables, such as meteorological values and traffic flows. We develop a novel technique based on multiobjective optimization and linear regression to find optimal delays for each variable, and then we apply such delays to our data to evaluate the improvement that can be obtained with respect to learning an explanatory model with standard techniques. We found that optimizing delays can, in some cases, improve the accuracy of the final model by up to 15%.

## Introduction

Anthropogenic environmental pollution is a known and indisputable issue. Every day, the human body is exposed to harmful substances in various ways, including by ingestion of food and drink and by absorption through breathing. While it is possible to limit the ingestion of harmful chemicals, it is impossible to choose the air we breathe. Chemicals dangerous to health (or that become so if absorbed over certain amounts) include suspended particulate matter, nitrogen oxides, *CO*, and *SO*_{2}. Exposure to these pollutants adversely affects human health, as confirmed by numerous scientific studies conducted in recent years around the world: Poland,^{1,2} Germany,^{3} Italy,^{4} USA,^{5–7} Canada,^{8} Australia,^{9} Chile,^{10} among many others. While the quality of air is usually regularly monitored, actions leading to the reduction of air pollution are a growing need. Generally, anthropogenic sources of air pollution are known, and, due to the development of civilization, it is impossible to eliminate them completely. Therefore, scientific studies are conducted to determine the impact of factors that modify pollutant concentrations via transformation, retention, or evacuation, and, to this purpose, mathematical models are an essential tool. Recognizing the factors that have the greatest impact on the concentration of pollutants in the air at a certain moment gives the opportunity to build plans for their manipulation, when possible, or, at least, to improve the design of cities, streets, crossroads, and houses, to ensure the fastest possible evacuation of contaminants, shorten the time of exposure to their harmful effects, and reduce the intensity of their action. In general, such models are known as *land use regression models* (*LUR*) (see previous studies,^{11,12} among many others). However, in addition to the identification of the factors themselves, the timing of their effect also plays a significant role. For example, a momentary gust of wind results in a much smaller evacuation of pollutants than a wind with the same speed persisting for several hours. Therefore, it is important to identify not only the factors in play, but also the moment in time in which their influence on the concentrations of pollution is the greatest, to obtain a *temporal* land use regression model (*TLUR*).

Predicting the value of a time series, such as the concentration of a certain pollutant in time, is a very well-known problem. The usual statistical approach to its solution requires the use of *moving average* models, or, more generally, of *Autoregressive Integrated Moving Average* (*ARIMA*) models, that define the problem as an *autoregressive* one (see Box et al^{13} for a comprehensive introduction). The underlying idea is that the past values of the predicted variable, for example, the pollution concentration, can be used to predict future values. The implicit assumption is that the behavior of the predicted variable remains somewhat constant, and such models are designed to identify, and describe, such behavior. A different approach, more typical of machine learning, consists of enhancing a static model learner (eg, linear regression, tree regression, neural network) with *lagged variables*, that is, variables with the past values of the predictors. A classical regression model is designed to extract an (implicit or explicit) function of the (current) predictors that explains the (current) value of the interesting variable; a lagged regression model acts in the same way, but using, as predictors, both the current and the past values of the explanatory variables. The so-called *ARIMAX* models mix these 2 ideas, and are designed to extract functions of both the current and past values of both predicting and predicted variables. The most relevant drawback of ARIMA-type frameworks is that the resulting models are, in a way, implicit. As a matter of fact, in most natural phenomena, the variable to be predicted has a sufficiently slow-changing behavior for the past values to be very good predictors, so good that the real predictors cease to have relevant roles. As a consequence, with ARIMA-type methods, one obtains very good predicting models that offer no explanation of the underlying process, and that therefore cannot serve as a basis for (T)LURs.
Pure lagged models, such as those offered by typical learning packages (see Hall,^{14} for example), bypass this problem when the past values of the predicted variable are not used, but raise other issues. In particular, a typical lagged model (whether explicit or implicit) uses tens of lagged variables for each explanatory parameter; this proliferation of columns makes it very difficult, or impossible, to interpret the underlying process even when explicit functions are used. As a consequence, the most successful lagged models are based on neural networks, which are already implicit. The form of the regressed function is, per se, an additional problem of classical approaches. ARIMA-type models and linear regression are designed to extract a linear function, which is, typically, a good first approximation of the underlying behavior. As we have recalled, lagged models can also be tree-like, that is, layered functions, or even highly nonlinear, such as in neural networks. But these approaches do not allow one to explicitly model the nonlinearity level, so to speak, in search of some simple, albeit nonlinear, behavior.^{15}

In this article, we present a methodology to select, at the same time: (1) predicting variables, (2) the amount of their lag, and, if necessary, (3) their nonlinear contribution. By using a multiobjective optimization algorithm, we produce a set of potential solutions. In some sense, each solution can be thought of as a *temporal convolution vector* that highlights the contribution of each predictor by taking into account the temporal component and its nonlinear contribution. Any such vector can be applied to the original data, to test the effect of the transformation with different regression algorithms. A temporal convolution vector contains an optimal lag and an optimal nonlinear transformation for each variable, and it not only allows the induction of a better explanation model, but it is also interpretable per se, as it shows exactly the delay after which each predictor is most influential. Optimizing lags solves, in a way, both problems of ARIMA-type models and lagged models: the temporal convolution vector allows one to better understand the necessary delay for an explanatory variable to take effect, and, if it is the case, its nonlinear contribution.

To test our methodology, we considered a data set containing *NO*_{2} and *NO*_{x} concentrations measured hourly from 2015 to 2017 by a monitoring station located in Wrocław, Poland, along with a set of meteorological and vehicle traffic data. As we shall see, applying a temporal convolution vector results in a sensible improvement in the performances of the learned model, showing that, in fact, delays and nonlinear contributions can be taken into account without losing the interpretability of the model. We tested our methodology with 3 types of learners: linear regression, random forest, and multilayer perceptron, and in all 3 cases, we found an improvement in the performances of the learned model.

## Background

### Mathematical models for air quality prediction and explanation

The relationships between concentrations of air pollutants in the urban agglomeration and meteorological factors, traffic flow, and other elements of the environment have been described using many different modeling techniques. However, taking into account the interpretative usefulness of the results obtained, they can be categorized into (1) interpretable models, including explicit functions (linear, polynomial, exponential, and other nonlinear functions),^{16–19} clustered models,^{20} and probabilistic models,^{21–23} and (2) noninterpretable models, including those based on neural networks,^{24} forests of decision trees or regression trees,^{25–27} and ensemble models.^{28,29}

The relationships between concentrations of air pollutants, traffic flow, and meteorological conditions based on the same hourly data used in this article were already analyzed. An interpretable simple probabilistic model was built for *NO*_{2} concentrations based on traffic flow and wind speed in 2015 to 2017.^{30} Data from 2015 to 2016 (not including solar radiation) were used to create models based on a random forest for *NO*_{2}, *NO*_{x}, and *PM*_{2.5}.^{21} A total of 9 models were built, each for a different subset of data (warm and cold season, working/nonworking days, and all data for each pollutant). The best results obtained were *R*^{2}=0.57 and 0.52 for *NO*_{2} and *NO*_{x}, respectively. In Kamińska,^{25} the author presents a modified model for *NO*_{2} concentrations, also based on the concept of random forest, using data from 2015 to 2017, obtaining a prediction of *NO*_{2} daily concentration with a determination coefficient *R*^{2}=0.82, and by means of which it was possible to determine the importance of each feature separately for low and high pollution concentrations. However, this result was obtained by a set of models, not a single one; the result for the single model on all data was *R*^{2}=0.60. In all cases, past values of the predictors were not taken into consideration.

### Time series and lagged models

A *time series* is a series of data points labeled with a temporal stamp. Time series arise in multiple contexts; for example, in our case, data from environmental monitoring stations can be seen as time series, in which atmospheric values (eg, pressure, concentration of chemicals) change over time. If each data point contains a single time-dependent value, then the time series is *univariate*; otherwise, it is called *multivariate*. In our case, we refer to a multivariate time series in which precisely 1 variable is dependent; for other applications, it makes sense to consider multivariate time series with multiple dependent variables. There are 2 main problems associated with time series: *time series explanation* and *time series forecasting*; these problems are usually associated with different contexts and approached with different tools. In particular, time series forecasting emerged in the realm of statistical economics and, more recently, has found applications in other contexts. Time series explanation, on the other hand, is related to machine learning processes, and it is not linked to a particular field of application. The different approaches to time series analysis clearly overlap to some extent.

The typical statistical-based time series forecasting approach is based on *autoregressive* models. The simplest univariate forecasting approach is commonly known as the *simple moving average* (SMA) model: an SMA is calculated over the time series by considering its last *n* values, used to perform a smoothing process of the series, and then used to forecast the next value. Although such an approach has some clear limitations, it is still useful to establish a baseline against which to compare more complex solutions.^{13} Based on the observation that the most recent values may be more indicative of a future trend than older ones, *simple exponential smoothing* models consider a weighted average over the last *n* observations, assigning exponentially decreasing weights as they get older.^{13} Other than this first, simple type of smoothing, it is also worth mentioning *Holt Exponential Smoothing* models^{31} and *Holt-Winters Exponential Smoothing*^{32} models. Technically, exponential smoothing belongs to the broader ARIMA family.^{33} Their multivariate counterpart, the ARIMAX models, allows one to study the interaction between *independent* time series and *dependent* ones, in a similar way as lagged machine learning models do.
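As an illustration, the 2 baseline forecasters just described can be sketched in a few lines; this is a toy implementation, not the tooling used in this article, and the series and smoothing factor are invented:

```python
import numpy as np

def sma_forecast(series, n):
    """Forecast the next value as the plain average of the last n observations."""
    return float(np.mean(series[-n:]))

def ses_forecast(series, alpha):
    """Simple exponential smoothing: the level is updated with weight alpha
    for newer observations, so older values decay exponentially; the final
    level is the one-step-ahead forecast."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return float(level)

series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]
print(sma_forecast(series, 3))   # average of the last 3 values: 13.0
print(ses_forecast(series, 0.5))
```

Both forecasters use only past values of the predicted variable itself, which is exactly the autoregressive viewpoint discussed above.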

In the machine learning context, there are 2 influential approaches to time series analysis: *recurrent neural networks*^{34–36} and *lagged models*. In the former case, neural networks are adapted to the specific form of a time series to be trained for forecasting. In the latter case, on the other hand, data are systematically transformed by adding a delay, so that a classical, propositional learning algorithm can then be applied; among the available packages for this purpose, we mention Weka *timeseriesForecasting*.^{14} Lagged models are flexible by nature, as they are not linked to a specific learning schema, and their focus is on time series explanation. In some cases, lagged variables have been used for neural network training, improving their performance. While explicit models can be used for forecasting, it is not their focus, considering that their forecasting horizons are limited to the maximum lag in the model; time series forecasting models, on the other hand, allow long-term predictions, at the expense of the interpretability of the models. A different, yet related, approach is presented in the literature,^{37} in which a modified decision tree learner has been used to model air pollution using lagged versions of the predicting variable in the form of univariate time series. Lagged models, and, more generally, models that take into consideration the past values of the predictors, have been mainly used in dealing with issues related to the analysis of factors affecting health and human life. In studies on the impact of air pollution on mortality, lagged variables are used to consider the duration of the exposure. The effect of exposure to high concentrations of particulate matter has been studied in previous studies,^{8,10,38} among others.
The effect of exposure to ozone, including lagged variables, was analyzed, inter alia, in previous studies.^{39,40} As for concentration of pollution models, lagged variables are considered in forecasting models such as Catalano et al,^{41} but such models are definitely different from those developed in this work in many perspectives. Similarly, multiobjective optimization processes are used to solve various types of optimization problems also in the environmental context, for example, in allocation of sediment resources,^{42} long-term ground water monitoring systems,^{43} calibration of rainfall-runoff model,^{44} optimal location and size of the given number of check dams,^{45} or studying the problem of mitigating climate effects on water resources.^{46}

### Feature selection

*Feature selection* (FS) is defined as the process of eliminating from the data set the features that are irrelevant to the task to be performed.^{47} Feature selection facilitates data understanding, reduces the storage requirements, and lowers the processing time, so that model learning becomes an easier process. Feature selection methods that do not incorporate dependencies between attributes are called *univariate* methods; they consist in applying some criterion to each feature-response pair, and measuring the individual power of a given feature with respect to the response independently from the other features, so that each feature can be ranked accordingly. In *multivariate* FS, on the other hand, the assessment is performed for subsets of features rather than single features. In our case, multivariate FS may be paired with optimal lag searching, so that not only the delayed effect of each variable but also its actual need in the explanation model is taken into account. Relevant to this study are FS techniques based on multiobjective problems, reviewed in the next paragraph.

### Multiobjective optimization

A *multiobjective optimization problem* (see Collette and Siarry^{48}) can be formally defined as the optimization problem of simultaneously minimizing (or maximizing) a set of *k* arbitrary functions:

min *f*_{1}(*x*), min *f*_{2}(*x*), …, min *f*_{k}(*x*)  (1)

where *x* = (*x*_{1},…,*x*_{n}) is a vector of decision variables. A multiobjective optimization problem can be *continuous*, in which we look for real values, or *combinatorial*, in which we look for objects from a countably (in)finite set, typically integers, permutations, or graphs. Maximization and minimization problems can be reduced to each other, so that it is sufficient to consider 1 type only. A solution *u* *dominates* a solution *v* if and only if *u* is better than *v* in at least 1 objective, and it is not worse than *v* in the remaining objectives. We say that a solution is *nondominated* if and only if there is no other solution that dominates it. The set of nondominated solutions is called the *Pareto front*. In general, finding the Pareto optimal front of a multiobjective problem is a computationally hard task. Optimization problems are therefore usually approximated. Among the popular approximation techniques, *multiobjective evolutionary algorithms* are a typical choice.^{49–52}
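The dominance relation and the extraction of the nondominated set can be made concrete with a short sketch; this is illustrative code, not part of the article's implementation, and it assumes all objectives are to be minimized:

```python
def dominates(u, v):
    """True if solution u dominates v (all objectives minimized):
    u is no worse than v in every objective and strictly better in at least 1."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_front(solutions):
    """Return the nondominated subset of a list of objective vectors."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# 4 candidate solutions with 2 objectives each (invented values):
points = [(1, 5), (2, 2), (3, 1), (4, 4)]
print(pareto_front(points))  # (4, 4) is dominated by (2, 2) and drops out
```

Maximized objectives fit the same scheme after negation, which is why it suffices to consider 1 type of problem only.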

Feature selection can be seen as a multiobjective optimization problem, in which the solution encodes the selected features, and the objective(s) are designed to maximize the performance of some classification/regression algorithm in several possible ways; this may entail, for example, instantiating equation (1) as:

max *f*_{1}(*x*) = *performance*(*x*), min *f*_{2}(*x*) = |*x*|  (2)

where *x* represents the chosen features and we maximize the performance of a predetermined classification/regression algorithm (on those features), while minimizing their number. The use of evolutionary algorithms for the selection of features in the design of automatic pattern classifiers was introduced in Siedlecki and Sklansky.^{53} A review of evolutionary techniques for FS can be found in Jiménez et al,^{50} and a very recent survey of multiobjective algorithms for data mining in general can be found in Mukhopadhyay et al.^{52} The first evolutionary approach involving multiobjective optimization for FS was proposed in Ishibuchi and Nakashima,^{54} and a formulation of FS as a multiobjective optimization problem has been presented in Emmanouilidis et al.^{51} The wrapper approach proposed in Liu and Iba^{55} takes into account the misclassification rate of the classifier, the difference in error rate among classes, and the size of the subset using a multiobjective evolutionary algorithm where a niche-based fitness punishing technique is proposed to preserve the diversity of the population, while the one proposed in Pappa et al^{56} minimizes both the error rate and the size of a decision tree. Another wrapper method is proposed in Shi et al,^{57} while in García-Nieto et al,^{58} 2 wrapper methods with 3 and 2 objectives, respectively, applied to cancer diagnosis are compared. Finally, very recent examples of multiobjective FS systems can be found in Jiménez et al.^{59,60} To the best of our knowledge, the only attempt to use a multiobjective optimization process in modeling air quality has been for monitoring system planning in Sarigiannis and Saisana.^{61}

## Data

There is only 1 communication station for measuring the air quality in the city of Wrocław, and it is located within a wide street with 2 lanes in each direction (GPS coordinates: 51.086390 North, 17.012076 East; see Figure 1). The center of one of the largest intersections in the city, with 14 traffic lanes, is located approximately 30 m from the measuring station, and is covered by traffic monitoring. The measurement station is located on the outskirts of the city, at 9.6 km from the airport (the distance between the weather monitoring station and the air quality station is one of the reasons, although not the only one, for which time lags play a role in our model). Pollution data are collected by the Provincial Environment Protection Inspectorate and encompass the hourly *NO*_{2} and *NO*_{x} concentration values during the full 3 years, from 2015 to 2017. The traffic data are provided by the Traffic Public Transport Management Department of the Roads and City Maintenance Board in Wrocław, and include hourly counts of all types of vehicles passing the intersection. Public meteorological data are provided by the Institute of Meteorology and Water Management, and they include air temperature, solar duration, wind speed, relative humidity, and air pressure. The full data set contains 26 304 observations. In the preprocessing phase, each missing value (there were 617 samples, that is, 2.3%, with at least 1 missing value) has been interpolated with the 2 closest known values, immediately before and immediately after the missing one. Some basic statistical indicators on the remaining 25 687 instances are presented in Table 1, along with the symbol used in the tests for each variable.

## Table 1.

Descriptive statistics.
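The gap-filling step described above (each missing value interpolated between its closest known neighbors) can be sketched as follows; the column names and values are invented for illustration, and pandas' linear interpolation stands in for the preprocessing actually performed:

```python
import numpy as np
import pandas as pd

# Hourly observations with gaps; NaN marks a missing value.
frame = pd.DataFrame({
    "wind_speed": [2.0, np.nan, 4.0, 5.0],
    "NO2":        [30.0, 32.0, np.nan, 38.0],
})

# Linear interpolation between the closest known values, immediately
# before and immediately after each gap.
filled = frame.interpolate(method="linear", limit_direction="both")
print(filled["wind_speed"].tolist())  # [2.0, 3.0, 4.0, 5.0]
print(filled["NO2"].tolist())         # [30.0, 32.0, 35.0, 38.0]
```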

## Method

### Lagged regression

Linear regression is the most immediate approach to explicit modeling of a multivariate time series. In our case, for example, one can denote by:

*A* = {*A*_{1}(*t*),…,*A*_{n}(*t*),*B*(*t*)}, for *t* = 1,…,*m*  (3)

the set of preprocessed data, in which *A*_{1},…,*A*_{n} are the independent columns (air temperature, solar duration, wind speed, relative humidity, air pressure, and traffic), *B* is the dependent one (*NO*_{2} or *NO*_{x}, depending on the problem we want to solve), ordered by the time of observation, and use a linear regression algorithm to extract a function of the type:

*B*(*t*) = *c*_{0} + *c*_{1}*A*_{1}(*t*) + … + *c*_{n}*A*_{n}(*t*)  (4)

Solving this problem entails finding *n*+1 optimal *parameters* (or *coefficients*) *c*_{0},*c*_{1},…,*c*_{n} to fit the above equation, which does not take into account past values of any independent parameter; the temporal component is used only implicitly.
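The plain (non-lagged) regression just described can be sketched with ordinary least squares on invented data; this is an illustration, not the Weka-based pipeline used later:

```python
import numpy as np

# Toy data: 2 predictors observed at 6 time points (invented values).
A1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
A2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
B = 2.0 + 3.0 * A1 - 1.0 * A2          # a known linear relation, no noise

# Find c0, c1, c2 such that B(t) = c0 + c1*A1(t) + c2*A2(t):
# prepend a column of ones so the intercept c0 is fitted too.
X = np.column_stack([np.ones_like(A1), A1, A2])
coeffs, *_ = np.linalg.lstsq(X, B, rcond=None)
print(np.round(coeffs, 6))  # recovers [ 2.  3. -1.]
```

Because the toy relation is exactly linear, the fitted coefficients recover the generating ones; on real data, the fit is approximate and its quality is measured with standard indicators.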

*Lagged* (linear) regression consists of solving a more general equation, formulated as:

*B*(*t*) = *c*_{0} + ∑_{i=1..n} ∑_{l=0..p_i} *c*_{i,l}*A*_{i}(*t*−*l*)  (5)

In other words, we use the value of each independent variable *A*_{i} not only at time *t* but also at times *t*−1, *t*−2,…, *t*−*p*_{i}, to explain *B* at time *t*; each *A*_{i}(*t*−*l*) is associated with a coefficient *c*_{i,l}, which must be estimated, along with each *maximum lag* *p*_{i}. There are available techniques, based on standard regression algorithms, that allow one to solve the inverse problem associated with equation (5); unfortunately, the resulting equation may be very difficult to interpret. Equation (5) is very similar to an ARIMAX model, but without the autoregressive part.

Lagged regression can be performed with other, more complicated, model extraction methods that may not allow an explicit representation, such as random forest or neural networks of any type. The underlying idea is the same: enhance the independent data by adding lagged versions of each variable *A*_{i} up to a predetermined maximum lag *p*_{i}.
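The enhancement step can be sketched with a small pandas helper; this is illustrative code with invented column names, not the article's implementation:

```python
import pandas as pd

def add_lags(frame, columns, max_lag):
    """Append shifted copies A_i(t-1), ..., A_i(t-p) of each predictor,
    in the spirit of equation (5); rows without a full history are dropped."""
    out = frame.copy()
    for col in columns:
        for lag in range(1, max_lag + 1):
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    return out.dropna().reset_index(drop=True)

data = pd.DataFrame({"wind": [3.0, 5.0, 4.0, 6.0, 2.0],
                     "NO2":  [40.0, 35.0, 38.0, 30.0, 45.0]})
lagged = add_lags(data, ["wind"], max_lag=2)
print(lagged.columns.tolist())       # ['wind', 'NO2', 'wind_lag1', 'wind_lag2']
print(lagged["wind_lag2"].tolist())  # [3.0, 5.0, 4.0]
```

Note how even 1 predictor and 2 lags already triple the number of wind-related columns; with tens of lags per predictor, this is exactly the column proliferation that hurts interpretability.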

### Optimizing a lagged model

Both standard and lagged linear regression are classical, simple approaches to the problem of explaining the value of *B*(*t*) (in our case, pollution concentrations) using current and past values of *A*_{1},…,*A*_{n}: having learned the best coefficients, the performances of the model are computed using standard measures. As per classical, standard data analysis procedures, the set *A* may be used in k-fold cross-validation mode to extract a linear model and test it at the same time, or separated into training and test subsets. In any case, solving equation (5) entails fixing the value *p*_{i} for each independent variable; each past value of each independent variable may contribute in the same way to the model.

We work under the additional assumption that, for each *i*, there is precisely 1 lag *l*_{i} such that *A*_{i}(*t*−*l*_{i}) influences the output more than any other lag; this may be reasonable in some applications, and less so in others: as we shall see, it fits our case perfectly. Under such an assumption, the model that we are assuming becomes:

*B*(*t*) = *c*_{0} + *c*_{1}*A*_{1}(*t*−*l*_{1}) + … + *c*_{n}*A*_{n}(*t*−*l*_{n})  (6)

Our methodology can be described as follows: (1) we split the original data set *A* into 2 sets that we call *A*_{tr} and *A*_{mo}, respecting the temporal ordering; (2) we optimize the values of *c*_{0},*c*_{1},…,*c*_{n} and *l*_{1},…,*l*_{n} on *A*_{tr}, obtaining a linear model that fits *A*_{tr} well; and (3) we use the obtained lags to *transform* *A*_{mo}, so that a new model can be learned. Splitting into *A*_{tr} and *A*_{mo} is necessary to ensure that finding the optimal lags is computationally affordable: we search for the optimal lags in a small data set (*A*_{tr}), and then we apply them to a bigger one (*A*_{mo}), so that a model can be learned on the latter. The model that is learned on the transformed data can be of any type: linear (or, in general, functional), tree-based, or a neural network. The transformation gives us already some information on the problem itself, as it has optimized the delay after which a certain independent variable has effect; such a learned model, obtained on *A*_{mo}, can be tested using classical testing modes, such as training + test (which would entail further separating *A*_{mo} into 2 sets), or, as we shall do, *k*-fold cross-validation. Observe that *A*_{tr} cannot be considered a *training* set per se, but, instead, a pretraining set, used with the sole purpose of performing the optimization.
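The 2-phase scheme just described can be sketched end to end on synthetic data. Here, a brute-force search over small lags stands in for the multiobjective evolutionary optimization of the following sections, and all names, sizes, and data are invented:

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(0)

def fit_r(X, b):
    """Least-squares fit; returns the correlation between fitted and true values."""
    X1 = np.column_stack([np.ones(len(X)), X])
    c, *_ = np.linalg.lstsq(X1, b, rcond=None)
    return np.corrcoef(X1 @ c, b)[0, 1]

def apply_lags(A, b, lags):
    """Shift each column A[:, i] back by lags[i], dropping the first
    max(lags) rows, as in the one-lag-per-variable model of equation (6)."""
    M = max(lags)
    cols = [A[M - l : len(A) - l, i] for i, l in enumerate(lags)]
    return np.column_stack(cols), b[M:]

# Synthetic series where A1 acts with a 2-hour delay and A2 instantly.
n = 300
A = rng.normal(size=(n, 2))
b = np.empty(n)
b[2:] = 1.5 * A[:-2, 0] + 0.5 * A[2:, 1]
b[:2] = 0.0

# Phase 1: search the lags on a small pretraining slice (our A_tr).
tr = slice(0, 100)
best = max(product(range(4), repeat=2),
           key=lambda lags: fit_r(*apply_lags(A[tr], b[tr], lags)))

# Phase 2: transform the remaining data (our A_mo) with the found lags;
# the final model is then learned and evaluated on this transformed set.
X_mo, b_mo = apply_lags(A[100:], b[100:], best)
print(best, round(fit_r(X_mo, b_mo), 3))
```

On this noise-free toy problem the search recovers the generating lags (2 and 0), and the model learned on the transformed data fits almost perfectly; the real setting replaces the brute force with NSGA-II and the toy learner with Weka regressors.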

Having separated the optimization part from the training part, we can now arbitrarily complicate the model. For example, in some contexts, such as pollution concentration modeling, a super-linear explanation model may fit better than a linear one, while preserving the possibility of an intuitive interpretation. From the mathematical point of view, the inverse problem that corresponds to searching for a super-linear model is a simple generalization of equation (6):

*B*(*t*) = *c*_{0} + *c*_{1}*A*_{1}(*t*−*l*_{1})^{e_1} + … + *c*_{n}*A*_{n}(*t*−*l*_{n})^{e_n}  (7)

Thus, in the optimization part of our methodology, we can optimize coefficients *c*_{i}, lags *l*_{i}, and exponents *e*_{i} for each predictor, and, then, apply this more elaborate transformation to the data. Once again, as a result of the optimization phase, we obtain new information; for example, we may learn that wind influences the amount of nitrogen oxide in the air at a certain moment with 2 hours of delay, and in a way proportional to the square of its strength. Even if we choose, in the second phase, to use a noninterpretable learning model (eg, random forest), we have more information on the underlying process than we would have had using the same learning model without transformation. Our purpose is to prove that such an optimization does increase the performances of the learned model in the second phase, independently from the specific learning model. A representation of this methodology can be found in Figure 2.

### Solving the optimization problem

Deciding the best lag and the best exponent for each variable is an optimization problem. Formally, given a multivariate time series *A*_{1}(*t*),…,*A*_{n}(*t*),*B*(*t*) with *m* distinct observations (such as our data on atmospheric pollution), let us define *P* as the *maximum lag of the problem* (ie, we do not search solutions with lags greater than *P*; observe that in equation (5), we have that *p*_{i}=*P* for each *i*) and *E* as the *maximum exponent of the problem* (ie, we do not search solutions with exponents greater than *E*; observe that in equation (7), we have that *e*_{i}≤*E* for each *i*). Let us fix a vector *x* = (*x*_{1},…,*x*_{n}) of *decision variables* with domain [−1,…,*P*], and let *M* be the maximum of *x*_{1},…,*x*_{n} (called the *maximum actual lag* of *x*). The vector *x* entails a transformation of equation (3) into a new data set with *m*−*M* observations, in which each feature (time series) *A*_{i} is lagged (ie, delayed) by the amount *x*_{i}:

*A*^{x} = {*A*_{1}(*t*−*x*_{1}),…,*A*_{n}(*t*−*x*_{n}),*B*(*t*)}, for *t* = *M*+1,…,*m*  (8)

The transformation equation (8) works for every *x*_{i}≥0. The case *x*_{i}=−1 is interpreted as *excluding* the column *A*_{i} from the problem (entailing an implicit FS method). Similarly, let *y* = (*y*_{1},…,*y*_{n}) be a second vector of decision variables with domain [1,…,*E*]. For each variable *A*_{i}, we interpret *y*_{i} as the exponent to which *A*_{i} is raised in equation (7). So, in conjunction, the pair (*x*,*y*) entails a transformation of equation (3) into a new data set with *m*−*M* observations, in which each feature (time series) *A*_{i} is lagged (ie, delayed) by the amount *x*_{i} and raised to the power of *y*_{i}:

*A*^{x,y} = {*A*_{1}(*t*−*x*_{1})^{y_1},…,*A*_{n}(*t*−*x*_{n})^{y_n},*B*(*t*)}, for *t* = *M*+1,…,*m*  (9)

A simple numeric example of such a transformation can be seen in Figure 3. In this example, we start off with 5 observations and 5 independent variables (the sixth column indicates the time of observation); the left-hand side (that contains the lags) of the decision variables vector contains 0 for the first and the fourth variable, 1 for the second variable, 3 for the fifth variable, and −1 for the third variable (which entails that this variable is not selected); the right-hand side (that contains the exponents) contains 1 (linear behavior) for every variable except the third one (which, by the way, is eliminated) and the fifth one, in which the exponent is 2 (quadratic behavior). The transformed data set, therefore, contains 3 fewer observations than the original one because, in this example, to explain the current value of B, we need to look at the value of A_{5} 3 units of time before. The values of the selected variables, at the selected times, are then raised to the chosen exponents, instantiating equation (9).
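The transformation of equation (9) can be sketched as follows; this is illustrative code on invented data (a smaller variant of the example above, with 3 predictors), where −1 in the lag part excludes a column:

```python
import numpy as np

def transform(A, B, x, y):
    """Apply a decision vector (x, y): lag each selected column A[:, i]
    by x[i] and raise it to the power y[i]; x[i] == -1 drops the column.
    The first M rows are lost, where M is the maximum selected lag."""
    M = max(l for l in x if l >= 0)
    m = len(B)
    cols = [A[M - l : m - l, i] ** e
            for i, (l, e) in enumerate(zip(x, y)) if l >= 0]
    return np.column_stack(cols), B[M:]

# 5 observations, 3 predictors (toy values).
A = np.arange(15, dtype=float).reshape(5, 3)
B = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Keep column 0 with lag 1 (linear), drop column 1, keep column 2
# with lag 0 and a quadratic contribution.
X, b = transform(A, B, x=[1, -1, 0], y=[1, 1, 2])
print(X.shape)  # (4, 2): one observation lost, one column dropped
print(b)        # [20. 30. 40. 50.]
```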

Solving the optimization problem entails evaluating a vector (*x*,*y*) on *A*_{tr}. As per our methodology, during the optimization phase, we use an efficient learner, such as linear regression, and any standard measure of the performances of the extracted model, such as the Pearson correlation between the function values and the actual values of *B*; we denote such a generic function with *f*_{1}(*x*,*y*), that is, the correlation coefficient extracted by a linear regression algorithm run on *A*_{tr} after the transformation entailed by (*x*,*y*); the correlation coefficient should be maximized. In a multiobjective context, however, we can also optimize the characteristics of both *x* and *y*. On one hand, we want to minimize the number of selected variables, using:

*f*_{2}(*x*) = |{*i* : *x*_{i} ≥ 0}|  (10)

and, on the other hand, we want to minimize the complexity of the extracted model, using:

*f*_{3}(*y*) = ∑_{i=1..n} *y*_{i}  (11)

Thus, solving our optimization problem means instantiating equation (1) with:

max *f*_{1}(*x*,*y*), min *f*_{2}(*x*), min *f*_{3}(*y*)  (12)

Clearly, other objective functions can be designed that may or may not improve the quality of the solutions.

### Implementation

*Multiobjective evolutionary algorithms* are known to be particularly suitable to perform multiobjective optimization, as they search for multiple optimal solutions in parallel. In this experiment, we have chosen the well-known Nondominated Sorting Genetic Algorithm II (NSGA-II),^{49} which is available as open source from the suite *jMetal*.^{62} NSGA-II is an elitist Pareto-based multiobjective evolutionary algorithm that employs a strategy with a binary tournament selection and a rank-crowding better function, where the rank of an individual in a population is the nondomination level of the individual in the whole population. As the black-box linear regression algorithm during the optimization phase, we used the class *LinearRegression* from the open-source learning suite *Weka*,^{63} run in 10-fold *cross-validation* mode, with standard parameters and no embedded FS. We have represented each individual solution (*gene*) as an array:

(*x*_{1},…,*x*_{n} | *y*_{1},…,*y*_{n})

with values in [−1,…,*P*] for the *x* side and in [1,…,*E*] for the *y* side.

The initial population has been generated randomly in each execution, and the fitness functions reflect the objective functions of equation (12) in the obvious way; in particular, the Weka implementation of any regression extraction algorithm returns, among other measures, the Pearson correlation coefficient of the extracted model. Mutation and crossover operations have been adapted to the form of the gene. To mutate an individual *I*, we proceed as follows: (1) we randomly choose between the *x*-part and the *y*-part; (2) we randomly choose the index to mutate; and (3) we randomly choose to mutate the corresponding value with an increment (plus 1), a decrement (minus 1), or a random substitution. Increments, decrements, and random substitutions are implemented in the interval [−1,…,*P*] in the *x*-part, and in the interval [1,…,*E*] in the *y*-part. Similarly, crossover between 2 individuals *I*,*J* is implemented as follows: (1) we randomly choose between the *x*-part and the *y*-part; (2) we randomly choose 2 indexes in *I* and 2 indexes in *J*; and (3) we perform the crossover on the selected indexes.
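The operators are described above only in words; the following sketch is one plausible reading on an invented gene encoding (the clipping of increments and decrements to the legal intervals is our assumption, and the actual jMetal-based operators may differ in detail):

```python
import random

P, E = 6, 3          # maximum lag and maximum exponent of the problem
N = 6                # number of predictors

def mutate(gene, rng):
    """Mutate one position: pick a side (x or y), an index, and apply an
    increment, a decrement, or a random substitution, staying inside the
    legal interval of that side ([-1, P] for x, [1, E] for y)."""
    x, y = list(gene[:N]), list(gene[N:])
    side, lo, hi = (x, -1, P) if rng.random() < 0.5 else (y, 1, E)
    i = rng.randrange(N)
    op = rng.choice(["inc", "dec", "rand"])
    if op == "inc":
        side[i] = min(side[i] + 1, hi)
    elif op == "dec":
        side[i] = max(side[i] - 1, lo)
    else:
        side[i] = rng.randint(lo, hi)
    return x + y

def crossover(g1, g2, rng):
    """Copy g1, then take the values of 2 random indexes of a randomly
    chosen side from g2."""
    child = list(g1)
    offset = 0 if rng.random() < 0.5 else N
    for i in rng.sample(range(N), 2):
        child[offset + i] = g2[offset + i]
    return child

rng = random.Random(1)
gene = [0, 1, -1, 2, 0, 3] + [1, 1, 1, 2, 1, 1]   # x-part | y-part
child = mutate(gene, rng)
print(len(child) == 2 * N)  # the gene length is preserved
```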

## Experiment

According to the methodology described in Figure 2, we prepared the data sets A_{tr} (7706 observations, 30% of the initial data) and A_{mo} (17 981 observations, 70% of the initial data) for 2 problems, NO_{2} and NO_{X}, which we kept separated, for 2 different sets of experiments. We then ran the optimization procedure with a random initial population for 10 independent experiments, with seeds from 1 to 10. In the final population of each execution, we applied a simple decision-making strategy to choose 1 individual from the Pareto front, that is, the one with the best correlation coefficient, which is a natural choice. On each execution, we set the maximum lag at 6 hours (L=6) and the maximum exponent at E=3. As it turns out, all selected solutions contain all initial variables, suggesting that all predictors do have a nontrivial role in the problem. The training results are shown in Table 2 for NO_{2}, and in Table 3 for NO_{X}. Both tables are structured in the same way. For each execution, we show the best correlation coefficient that we have obtained during training. Also, we show the lags, the exponents, and the coefficient for each variable, in the original order: temperature, solar radiation, wind, humidity, pressure, and traffic.
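The decision-making step, ie, extracting the nondominated individuals of the final population and keeping the one with the best correlation coefficient, can be sketched as follows (assuming, for illustration, that both objectives are maximized and that the correlation coefficient is the first objective):

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (maximization)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pick_solution(population):
    """From a final population of (gene, objectives) pairs, keep the
    nondominated individuals and return the one whose first objective
    (here: the correlation coefficient) is highest."""
    front = [p for p in population
             if not any(dominates(q[1], p[1]) for q in population)]
    return max(front, key=lambda p: p[1][0])
```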

## Table 2.

Training results for NO_{2}.

## Table 3.

Training results for NO_{X}.

As for the test phase, we have operated as follows: first, we have used the original test data set (A_{mo}), with no transformations, to run 3 different regression algorithms, and, second, we have used the same algorithms over the data transformed with each of the 10 selected solutions, to measure the difference in performance, in 10-fold cross-validation mode. The algorithms that we have used are: (1) the classic linear regression (class *LinearRegression* in Weka—same parameters as in the optimization phase); (2) the random forest algorithm (class *RandomForest* in Weka—parameters: 100 trees, 100% size per bag, minimum 1 instance per leaf, 10^{−3} maximum variance per leaf, unlimited depth per tree, no backfitting); (3) a multilayer perceptron (class *MultilayerPerceptron* in Weka—parameters: 0.3 learning rate, 0.2 momentum rate for backpropagation, 500 epochs, no validation set). Each execution of each learner has been run in 10-fold cross-validation mode, and, again, the column *corr.* denotes the result of the Pearson test of correlation, averaged over the 10 executions. We have also displayed the results of the standard residuals tests: the mean absolute error, the root mean squared error, the relative absolute error, and the root relative squared error. The results of the test phase are shown in Tables 4 and 5.
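As an illustration of the evaluation protocol, the following numpy sketch reproduces a 10-fold cross-validated linear regression together with the 4 residual measures (here the relative errors are computed against the pooled mean predictor, a simplification of Weka's per-fold baseline):

```python
import numpy as np

def weka_style_metrics(y_true, y_pred):
    """Correlation plus the 4 residual measures reported by Weka:
    MAE, RMSE, relative absolute error, root relative squared error."""
    err = y_true - y_pred
    base = y_true - y_true.mean()       # errors of the mean predictor
    return {
        "corr": np.corrcoef(y_true, y_pred)[0, 1],
        "mae": np.abs(err).mean(),
        "rmse": np.sqrt((err ** 2).mean()),
        "rae": np.abs(err).sum() / np.abs(base).sum(),
        "rrse": np.sqrt((err ** 2).sum() / (base ** 2).sum()),
    }

def cv_linear_regression(X, y, folds=10, seed=1):
    """10-fold cross-validated ordinary least squares: fit on 9 folds,
    predict the held-out fold, then score the pooled predictions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    pred = np.empty_like(y)
    for f in np.array_split(idx, folds):
        train = np.setdiff1d(idx, f)
        A = np.column_stack([X[train], np.ones(len(train))])
        w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred[f] = np.column_stack([X[f], np.ones(len(f))]) @ w
    return weka_style_metrics(y, pred)
```

The same scoring function can be reused unchanged for the random forest and multilayer perceptron runs, since it only needs the pooled out-of-fold predictions.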

## Table 4.

Test results for NO_{2}.

## Table 5.

Test results for NO_{X}.

## Discussion

In view of the results that have been obtained, 2 important elements emerge: first, how lags and nonlinear contributions can be explained in terms of the physical process, and, second, how the transformations improve the synthesis of regression models. As far as the first point is concerned, it is important to understand that different executions may give rise to different results and yet similar performances. On one side, some lags *are* consistent in all executions, which indicates that the temporal component of their contribution is stable and clear. On the other side, some lags and some individual coefficients do show certain variability: we believe that, in such cases, more experiments are necessary to fully understand the underlying mechanisms. Notwithstanding this, it is clear that the convolution vectors have a clear positive effect on the cross-validated performances across the entire spectrum of regression algorithms that we have tried. The following considerations can be made.

- Traffic influences the amount of pollutant concentrations in a positive way, and with 1 hour of delay; this delay can be explained by the fact that the effect of exhaust gases needs some time to accumulate (observe, also, that we are limited by the granularity of observations: if sensors had collected data with a granularity of, say, 1 minute, we may have seen shorter delays for this variable).

- Higher air temperatures are associated with a decrease in NO_{2} and NO_{X} concentrations, which is caused by 3 processes: (1) at higher air intake temperatures, less NO_{2} is produced in the process of fuel combustion in the engine; (2) higher air temperatures usually imply more favorable atmospheric conditions, which encourage residents to use alternative means of transportation, which reduces traffic volume^{64}; and (3) NO_{2} is dynamically transformed into NO (and back) at a higher rate with higher temperatures.

- Wind always affects concentration in a negative way, with a constant (across different executions) delay of 2 hours. This is probably due to the distance between the intersection and the meteorological station; the average wind speed of 3 m s^{−1} and the roughness of the terrain explain the amount of the delay.

- Higher humidity may cause a decrease in NO_{2} concentration in exhaust gases^{65,66}; we have found that such an effect takes place with 5 to 6 hours of delay.

- High solar duration and high temperature favor the transformation of nitrogen oxides into secondary pollutants, which include ozone, and this implies a decrease in NO_{X} concentration; even more important is sunlight, which acts as a catalyst, which explains the negative coefficient and the delay for sunlight duration.

As for how the transformations influence the synthesis of good regression models, we can observe what follows. First, not all models show the same improvement, but all models show some improvement. In the case of NO_{2}, the models extracted with linear regression showed a maximum improvement of 0.948 in correlation, those with random forest of 0.544, and those with multilayer perceptron of 0.563; in the case of NO_{X}, the improvements have been of 0.461, 0.398, and 0.415, respectively. Also, observe that there was an improvement in every independent execution, indicating that the results are stable. The fact that linear regression presented the most evident improvement is expected, given that we used the same algorithm for the optimization. Second, the improvement in predicting the NO_{2} concentration is greater than the one in predicting NO_{X}, probably due to the fact that the latter is a more complex problem. Third, linear regression and random forest show greater improvements than multilayer perceptron; the latter, however, shows low results even with nontransformed data, possibly suggesting that the underlying problem is not nonlinear enough to require, and justify, a neural network–based approach.

Comparing our results from the statistical point of view with those that can be obtained by other, well-known methods is quite difficult. Take NO_{2} prediction models, for example. Running simple, atemporal linear regression on the same data yields a correlation coefficient that, on average, is 0.678 in test (single execution). The extracted model is:

While the coefficients are not too dissimilar from those obtained with the transformations, the reduced correlation shows that delays must be taken into account. Running full lagged models, on the other hand, leads, in some cases, to higher correlations. Unfortunately, the resulting models are impossible to interpret from the environmental point of view. For example, the simplest full lagged model that we tried, with 6 hours of maximum delay for each variable (ie, with 36 independent variables), already leads to positive and negative coefficients for the same independent variables at different delays: the variable temperature, for instance, presents positive coefficients for lags 0, 4, and 6, and negative ones at all the other lags. A possible explanation of this phenomenon is that, by artificially increasing the number of independent variables (as is done when several lagged variables are added for each parameter), one allows regression algorithms to find artificially good solutions by using many hyperplanes; in other words, the solution tends to overfit even in the presence of several thousand data instances. Overfitted models are less reliable as explanatory models because they tend not to be associated with physical explanations.
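For reference, the full lagged design matrix discussed above can be built as follows (a sketch; with 6 variables and lags 1 to 6 it yields the 36 independent variables mentioned in the text, and lag 0 can be included by changing the lag range):

```python
import numpy as np

def full_lagged_design(X, lags):
    """Build the full lagged design matrix: one column per
    (variable, lag) pair, with the most-lagged rows dropped so all
    columns stay aligned; the target must be truncated to y[max(lags):]."""
    T, n = X.shape
    m = max(lags)
    cols = [X[m - l:T - l, i] for i in range(n) for l in lags]
    return np.column_stack(cols)
```

Feeding such a matrix to any regressor reproduces the "full lagged model" setting, and makes it easy to see how quickly the number of independent variables grows with the maximum delay.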

## Conclusions

In this article, we have approached the problem of devising an explanation model for NO_{2} and NO_{X} concentrations observed via a monitoring station in the city of Wrocław (Poland). First, we revised the current literature on prediction, forecasting, and explanation models for temporal series. Then, we formulated the problem of finding an explanation model for air pollution as a lagged regression problem, and designed an optimization problem whose solutions are precisely the optimal lags for regression. Finally, we proposed realistic and physically explicable lagged regression models for both NO_{2} and NO_{X} concentrations based on available data from 2015 to 2017, induced with our method. We obtained significant improvements in the coefficient of determination over nonlagged regression models, while retaining (in fact, easing) the interpretability of the resulting equations. Our technology is immediately applicable to the same problems with different data, as well as to similar problems in which the temporal component plays an essential role. Explaining time series is less common in the technical literature than forecasting them. Yet, forecasting is not focused on interpretability, often resulting in (sometimes statistically reliable) black box models which, however, offer no explanation of the underlying phenomena. For future work, we plan to improve our design by allowing our optimization scheme to consider reasonable temporal combinations of lagged variables, while retaining the superior interpretability of lagged models over statistical forecasting.

## AUTHOR CONTRIBUTIONS

GS: experiment conception and design, technical interpretation of the results, and text writing. EL-S: implementation, testing, and text writing. JK: data provision, physical interpretation of the results.

## REFERENCES

NO_{x} air concentrations at national scale: a case study for Poland. Proc Environ Sci. 2011;7:98–103.

PM_{2.5} and ozone levels in three large Chinese cities. Atmos Environ. 2016;147:209–223.

PM_{2.5} air quality forecast model based on nonlinear regression and back-trajectory concentrations. Atmos Environ. 2010;44:3015–3023.

NO_{x} and NO_{2} concentrations in urban air in London. Atmos Environ. 1997;31:4081–4094.

NO_{2} pollution episodes. Environ Pollut. 2017;229:321–328.

NO_{2} concentrations from traffic flow and meteorological conditions. Sci Total Environ. 2019;651:475–483.

NO_{x} are influenced by the background levels, traffic density, and meteorological conditions using boosted regression trees. Atmos Environ. 2016;127:163–175.

PM_{2.5} concentration prediction based on CART and EELM. Sci Total Environ. 2019;651:3043–3052.

PM_{2.5} concentrations in Beijing, China. Sci Total Environ. 2018;635:644–658.