Case Study: Parametric, Non-parametric, and Non-linear Statistical Modelling

The development of a statistical method to relate LiDAR-derived tree height to IKONOS spectral reflectance involved testing three statistical paradigms, namely parametric methods, non-parametric methods, and non-linear artificial neural networks. Parametric statistical procedures, e.g., regression, make certain assumptions regarding the underlying distribution of the data. It is primarily assumed that the data are normally distributed and that the input variables have similar variances (Cohen et al. 2003). However, as with most environmental data, this is not always the case, and certain input variables had to be linearized using statistical transformations (Hudak et al. 2006). Two of the five input variables (the IKONOS green and red bands) displayed non-normal distributions and were transformed using standard logarithmic methods. The five input variables were then used as independent variables in a multiple linear regression, where the four IKONOS bands plus the age of the sample compartment were regressed against maximum LiDAR height. Results from this analysis were then interrogated for outliers. Outliers were identified using Cook’s distance measure (Cook 1977) and removed, and the regression was rerun using the remaining cases. The statistical approach is similar to that employed by Wulder and Seemann (2003); however, we employed only the per-band mean spectral reflectance values as independent variables. The reason for this is that the resultant regression model was applied to the imagery at the pixel level; hence distributional measures would have been unsuitable in this instance (Wulder and Seemann 2003).
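The parametric workflow described above (log-transform two bands, fit a multiple linear regression, screen cases with Cook's distance, refit) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the reflectance ranges, coefficients, noise level, and the 4/n cutoff for Cook's distance are all assumptions made for the example, and the helper names `fit_ols` and `cooks_distance` are hypothetical. The small hand-written solver only keeps the sketch dependency-free; in practice a numerical library would normally be used.

```python
import math
import random


def fit_ols(X, y):
    """Ordinary least squares; returns coefficients and (X'X)^-1."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    # Invert X'X by Gauss-Jordan elimination with partial pivoting.
    aug = [xtx[i] + [float(i == j) for j in range(p)] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    inv = [row[p:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    return beta, inv


def cooks_distance(X, y):
    """Cook's distance for every case of an OLS fit."""
    n, p = len(X), len(X[0])
    beta, inv = fit_ols(X, y)
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    mse = sum(e * e for e in resid) / (n - p)
    distances = []
    for r, e in zip(X, resid):
        # Leverage h_ii = x_i' (X'X)^-1 x_i
        h = sum(r[i] * inv[i][j] * r[j] for i in range(p) for j in range(p))
        distances.append(e * e / (p * mse) * h / (1.0 - h) ** 2)
    return distances


# Synthetic stand-in data (assumed values, not from the study).
# Columns: intercept, blue, log(green), log(red), NIR, compartment age.
random.seed(0)
TRUE_BETA = [2.0, 10.0, -1.0, -1.5, 8.0, 1.2]
X, y = [], []
for _ in range(60):
    blue = random.uniform(0.02, 0.08)
    green = random.uniform(0.03, 0.12)
    red = random.uniform(0.02, 0.10)
    nir = random.uniform(0.20, 0.50)
    age = random.uniform(3.0, 12.0)
    row = [1.0, blue, math.log(green), math.log(red), nir, age]
    X.append(row)
    y.append(sum(b * x for b, x in zip(TRUE_BETA, row)) + random.gauss(0.0, 0.5))
y[0] += 15.0  # inject one gross outlier in maximum height

cooks = cooks_distance(X, y)
cutoff = 4.0 / len(X)  # rule-of-thumb threshold; an assumption, not from the text
flagged = [i for i, d in enumerate(cooks) if d > cutoff]
keep = [i for i in range(len(X)) if i not in flagged]

# Rerun the regression on the remaining cases.
beta_refit, _ = fit_ols([X[i] for i in keep], [y[i] for i in keep])
```

The refit coefficients in `beta_refit` are on the transformed scale, so the green and red bands must be log-transformed in the same way before the model is applied to image pixels.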

The second statistical paradigm employed in this research makes use of non-parametric statistical methods. These approaches are known as “distribution-free” methods and do not rely on the assumption that data are drawn from a given probability distribution. The k-nearest neighbour approach is one such non-parametric method; it imputes forest inventory variables using reference samples and target mapping units (Reese et al. 2002). Reference samples are typically derived from remote sensing spectral reflectance and co-located forest variables of interest; in our study, the variable of interest was maximum LiDAR height within the reference sample plot. The goal of this approach is consistent with our primary objective, namely to evaluate canopy height estimation at locations not sampled by the LiDAR sensor using spectral reflectance and compartment age as predictor variables. Under this approach, each target location is assigned a value based on the weighted Euclidean distance from its k nearest reference plot(s). The k nearest reference plots are typically defined using weighted Euclidean distance calculated in spectral feature space, while the target variable is estimated as the distance-weighted average of the values observed at the k nearest neighbours.

The weighted average distance (in spectral feature space) was calculated in our study using the randomForest algorithm (Breiman 2001). This algorithm differs from the standard Euclidean distance measure in that it does not make use of a weight matrix; instead, classification and regression trees are used to classify reference and target observations: if a target and a reference observation end up in the same terminal node, they are regarded as being similar. The distance measure is computed as one minus the proportion of trees which contain the same variable and in which a target observation is in the same terminal node as a reference observation. Crookston and Finley (2008) identify two advantages of using randomForest as opposed to other distance metrics, namely that variables can be a mixture of continuous and categorical types and that the method is non-parametric. Hudak et al. (2008) compared a range of methods using several different error metrics (e.g., root mean square difference) and concluded that the randomForest method was more robust and flexible than standard distance measures, e.g., Euclidean and Mahalanobis distance (Mahalanobis 1936). Preliminary tests conducted by the authors confirmed this finding, which resulted in the randomForest approach being chosen as the appropriate distance measure.
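The terminal-node distance rule described above can be illustrated with a short sketch. It assumes the forest has already been grown and that the terminal-node (leaf) identifier of each observation in each tree is available; the leaf IDs, plot names, and the function name `rf_distance` below are hypothetical. With scikit-learn, for example, such identifiers could be obtained from a fitted forest's `apply` method, but that is an assumption about tooling, not something stated in the study.

```python
def rf_distance(leaves_a, leaves_b):
    """One minus the proportion of trees in which two observations
    fall in the same terminal node."""
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both observations need one leaf ID per tree")
    shared = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return 1.0 - shared / len(leaves_a)


# Hypothetical leaf IDs for a forest of five trees (one entry per tree).
reference_leaves = {
    "plot_A": [3, 7, 2, 5, 1],
    "plot_B": [3, 6, 4, 2, 1],
    "plot_C": [8, 6, 9, 2, 4],
}
target_leaves = [3, 7, 4, 5, 1]

# Distance from the target observation to every reference plot.
distances = {
    name: rf_distance(target_leaves, leaves)
    for name, leaves in reference_leaves.items()
}
nearest_plot = min(distances, key=distances.get)
```

Because the measure depends only on shared terminal nodes, it needs no weight matrix and is unaffected by whether the predictors are continuous or categorical, which matches the advantages noted above.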