This program provides an alternative for unsupervised classification. It is based on the algorithm developed by P.M. Narendra and M. Goldberg. The clustering algorithm operates upon the histogram and isolates the vectors into clusters that are unimodal in the histogram, with the boundaries between clusters running through the valleys in the histogram. This is a reasonable way to characterize the clusters, which can be of any shape. The number of clusters need not be specified a priori and, moreover, the algorithm is noniterative. The disadvantage of the algorithm is that, due to the use of a hash table, a maximum of only four 8-bit image channels may be used as input. A maximum of 255 classes can be generated in this program.
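The idea of unimodal clusters separated by histogram valleys can be illustrated with a much simplified sketch. The code below is not the published Narendra and Goldberg algorithm and not the NGCLUS implementation: it assumes a one-dimensional histogram, and the function name cluster_1d_histogram is hypothetical. Each non-empty bin is linked to its densest immediate neighbour, and every bin that climbs to the same peak receives the same label.

```python
def cluster_1d_histogram(counts):
    """Label each bin of a 1-D histogram by the peak it climbs to.

    Simplified illustration only (assumption): real input to NGCLUS is a
    sparse histogram of up to four dimensions.
    Label 0 marks empty bins; clusters are numbered from 1.
    """
    n = len(counts)
    parent = list(range(n))
    for i in range(n):
        if counts[i] == 0:
            continue
        best = i
        # Link to the immediate neighbour with the strictly highest count.
        for j in (i - 1, i + 1):
            if 0 <= j < n and counts[j] > counts[best]:
                best = j
        parent[i] = best

    labels = [0] * n
    peak_label = {}
    for i in range(n):
        if counts[i] == 0:
            continue
        j = i
        # Follow the links uphill; counts strictly increase, so this ends.
        while parent[j] != j:
            j = parent[j]
        if j not in peak_label:
            peak_label[j] = len(peak_label) + 1
        labels[i] = peak_label[j]
    return labels


if __name__ == "__main__":
    print(cluster_1d_histogram([1, 4, 2, 1, 3, 6, 2]))
```

In this sketch the number of clusters is not an input: it falls out of the number of peaks, and a single pass over the bins is enough. Flat plateaus are split into separate peaks here, which is one reason smoothing matters in practice.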
Histogram generation is the first step in the histogram clustering procedure. Histogram clustering uses one of several non-parametric histogram-based algorithms for the unsupervised classification of image data. It partitions a four-dimensional histogram into "clusters" or classes of data.
NGCLUS creates a new histogram based on image data stored in up to four database image channels (DBIC) on a specified database file (FILE).
The histogram table length (HTLN) specifies the maximum number of unique 4D histogram entries the table can hold. It defaults to the maximum value, which depends on the amount of available memory. If a histogram table length is specified, the length actually used is the largest prime number less than or equal to the specified value. In either case, the histogram table length used by HGN is displayed.
NBIT controls the number of high-order bits used per 8-bit data value in constructing the histogram file. For example, if NBIT=6, then the two low-order bits of data are dropped. In general, using all 8 bits of data may produce erroneous peaks in the histogram, resulting in too many classes. Also, specifying NBIT < 8 allows HTLN to be smaller.
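The effect of NBIT can be sketched with a right shift that discards the low-order bits of each 8-bit value. The helper names quantize and build_histogram are hypothetical, and the dictionary-based counter merely stands in for the program's hash table.

```python
from collections import Counter


def quantize(value, nbit):
    """Keep the nbit high-order bits of an 8-bit value."""
    if not 1 <= nbit <= 8:
        raise ValueError("nbit must be between 1 and 8")
    return value >> (8 - nbit)


def build_histogram(pixels, nbit):
    """Count occurrences of each quantized vector.

    pixels is an iterable of tuples, one 8-bit value per channel.
    """
    return Counter(tuple(quantize(v, nbit) for v in pixel) for pixel in pixels)


if __name__ == "__main__":
    pixels = [(100, 101, 102, 103), (101, 100, 103, 102), (200, 10, 10, 10)]
    print(len(build_histogram(pixels, 8)))  # unique entries at full resolution
    print(len(build_histogram(pixels, 6)))  # unique entries with NBIT=6
```

Two vectors that differ only in their low-order bits fall into the same histogram entry once those bits are dropped, which is why a smaller NBIT needs fewer table entries.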
SAMPRM is a parameter which allows the user to eliminate clusters with very few samples. If the number of samples in a cluster is less than SAMPRM, each sample inside the cluster will be merged into its neighbouring clusters.
SMOOTH is a parameter which can be used to smooth the histogram by averaging the histogram values over the different neighbourhoods of each vector. The histogram value of each vector is replaced by the new smoothed value. In this program, "adaptive smoothing" is used. Only vectors with histogram values less than SMOOTH will be smoothed. This tends to smooth the histogram only over the low density areas that are prone to noise.
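A rough sketch of the adaptive idea is shown below. It rests on several assumptions: the histogram is one-dimensional, the neighbourhood is the bin itself plus its two immediate neighbours, and the average is an unweighted mean. The neighbourhoods and weights actually used by the program are not specified here, and the function name adaptive_smooth is hypothetical.

```python
def adaptive_smooth(counts, smooth):
    """Smooth only the bins whose count is below the threshold `smooth`.

    Assumptions: 1-D histogram, three-bin neighbourhood, plain mean.
    Averages are computed from the original counts, not from values
    that have already been smoothed.
    """
    result = list(counts)
    for i, count in enumerate(counts):
        if count < smooth:
            window = counts[max(0, i - 1): i + 2]
            result[i] = sum(window) / len(window)
    return result


if __name__ == "__main__":
    print(adaptive_smooth([10, 1, 0, 2, 12], smooth=5))
```

Bins at or above the threshold are left untouched, so well-populated peaks keep their original values while sparse regions are evened out.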
Sampling can be controlled using a rectangular grid (GRID). The sampling grid increment specifies the horizontal and vertical dimensions of a grid which controls the sampling of pixel values. For example, if GRID=3, every 3rd pixel on every 3rd line is sampled. However, if GRID=2,4, every 2nd pixel on every 4th line is sampled. The default for GRID is 4,4 (every fourth pixel on every fourth line). GRID can be used both to shorten the time required to construct the histogram, and to reduce the sample size.
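Grid sampling can be sketched with list slicing. The function name sample_grid is hypothetical, and the sketch assumes that sampling starts at the first pixel of the first line, which the text does not state.

```python
def sample_grid(image, grid_x, grid_y=None):
    """Return every grid_x-th pixel on every grid_y-th line.

    image is a list of lines, each a list of pixel values. A single
    increment applies to both directions, as with GRID=3.
    Assumption: sampling starts at the first pixel of the first line.
    """
    if grid_y is None:
        grid_y = grid_x
    return [line[::grid_x] for line in image[::grid_y]]


if __name__ == "__main__":
    # 8 x 8 test image whose values encode position as line * 10 + pixel.
    image = [[line * 10 + pixel for pixel in range(8)] for line in range(8)]
    print(sample_grid(image, 2, 4))  # like GRID=2,4
    print(sample_grid(image, 3))     # like GRID=3
```

With the default of 4,4, only one pixel in sixteen contributes to the histogram.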
NGCLUS includes the option of generating signatures for each cluster through the parameter SIGGEN. If SIGGEN is "YES", a signature is generated for each cluster. The user can then use these signatures with the MLC (Maximum Likelihood Classifier) program to classify other images.
The MASK parameter specifies the area within the input channel which will be processed. Only the area under the mask will be classified; the rest of the image will not be processed. If a single value is specified, it refers to a bitmap segment which defines the area to be classified. If four values are specified, they define the x,y offsets and x,y dimensions of a rectangular window within the image to be classified. If defaulted, the entire database is processed.
It is quite common for satellite images to have large black-filled areas (with zero gray levels) which should not be included in the classification. To solve this problem, the user can first run the program THR with the minimum and maximum values of TVAL set to 1 and 255, respectively. This creates a bitmap mask that covers only the image area. The user then supplies this bitmap as the MASK parameter in this program.
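The effect of thresholding between 1 and 255 can be sketched as below. This only illustrates the resulting mask for a single channel; it is not the THR program, and the function name threshold_mask is hypothetical.

```python
def threshold_mask(channel, tmin=1, tmax=255):
    """Return a bitmap with 1 where tmin <= value <= tmax, else 0.

    channel is a list of lines, each a list of 8-bit grey levels.
    """
    return [[1 if tmin <= value <= tmax else 0 for value in line]
            for line in channel]


if __name__ == "__main__":
    channel = [
        [0, 0, 12, 40],
        [0, 35, 90, 255],
        [0, 0, 0, 7],
    ]
    for line in threshold_mask(channel):
        print(line)
```

Pixels with grey level 0 are excluded from the mask, so the black fill does not contribute to the histogram or to the classification.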
The result of the clustering is a theme map written to a specified database image channel (DBOC). A theme map encodes each cluster with a unique grey level. For example, cluster 1 is assigned grey level 1, and cluster 2 is assigned grey level 2. Grey level 0 represents unclassified pixels. Therefore, if the theme map is later sent to the display, a pseudo-colour table should be loaded so that each cluster is represented by a different colour. If no value is specified for DBOC, the clustering results will not be saved to a channel.
More details of the Narendra and Goldberg algorithm can be found in the following paper:
Narendra & Goldberg. 1977. "A Non-parametric clustering scheme for Landsat". Pattern Recognition, vol. 9. pp 207-215.
Dr. Goldberg stated that he used a maximum table length of 19993 in his program GTC (Graph Theoretic Classifier). He said that clustering on 8-bit resolution data is not meaningful, because of undersampling, the need for smoothing, and the large number of erroneous peaks in the histogram.