IEEE Systems, Man and Cybernetics Magazine - April 2020 - 28
and reliable statistically supported technique was proposed
in [64], in which Pakhira and Dutta provided a quick solution for extracting k from the VAT RDI.
Extensions of VAT/iVAT in Big Data Applications
Big data is a term coined to describe the exponential growth
of structured and unstructured data, which is difficult to
capture, store, manage, and process with conventional data-management and -analysis techniques. With the rapid
advances in information sensing, IoT devices, remote sensing, software logs, cameras, microphones, radio-frequency
identification readers, wireless sensor networks (WSNs), and
so on, the world's technological per capita capacity to store
information has roughly doubled every 40 months since the
1980s. In 2001, Laney [65] defined the data growth challenge
as having three dimensions: volume, velocity, and variety,
also called the three Vs of big data. (Many more Vs describing various attributes of big data and the associated processing challenges have been added over the years. In 2017, there
were 42 documented Vs of big data [66].)
The heterogeneity, ubiquity, and dynamic nature of the
resources and devices as well as the wide variety of data
make discovering, accessing, processing, integrating, and
interpreting big data a challenging task [67]. Much VAT-inspired research has been performed to tackle different
aspects of big data analytics, to understand a variety of
novel data sources, and to extract actionable knowledge from
them. This research can be broadly classified into three
categories, which address the high-volume, high-velocity
(streaming data), and high-dimensionality aspects of big
data and are discussed next. In the following sections, we
use N and n to denote the sample sizes of "big data" and
"small data," respectively.
Clustering High-Volume Data Sets
Assessing Clustering Tendency Visually
Huband et al. [68] were the first to recognize the limitations of
VAT in handling even moderately sized data sets (with a few
tens of thousands of data points) due to its O(n^2) computational complexity (n being the number of points in the
data set). To address this problem, they developed an alternative way to compute the most critical information in the
ordered image matrix, that is, the boundary between clusters shown by the dark blocks along the diagonal. They
developed the revised VAT (reVAT) algorithm, which
achieves results similar to VAT with less computation.
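The quadratic cost is easy to see in a minimal Python sketch of a VAT-style reordering. It assumes the full n × n dissimilarity matrix D fits in memory, and it follows the usual Prim-like description of VAT, not a reference implementation from the cited works. The function names vat_order and vat_image are illustrative.

```python
import numpy as np

def vat_order(D):
    """Return a VAT-style ordering for a square dissimilarity matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in D.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order = [int(i)]
    remaining = np.ones(n, dtype=bool)
    remaining[i] = False
    # min_dist[j]: smallest dissimilarity from j to any already-ordered object.
    min_dist = D[i].copy()
    for _ in range(n - 1):
        # Next object: the unordered one closest to the ordered set.
        j = int(np.argmin(np.where(remaining, min_dist, np.inf)))
        order.append(j)
        remaining[j] = False
        min_dist = np.minimum(min_dist, D[j])
    return np.array(order)

def vat_image(D):
    """Reorder rows and columns of D; dark diagonal blocks suggest clusters."""
    D = np.asarray(D, dtype=float)
    order = vat_order(D)
    return D[np.ix_(order, order)], order
```

Each of the n steps scans a length-n vector, and D itself holds n^2 entries, so both time and memory grow quadratically with n.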
The reVAT algorithm builds and displays a set of profile graphs of specific rows of a pseudo-ordered dissimilarity matrix, rather than displaying an entire ordered
image matrix. When presented as an ensemble, the suite
of profile graphs visually suggests distinct clusters when
they are present. In this case, each profile graph will have
a unique peak (i.e., the peaks from one profile to the next
do not overlap). However, interpreting a series of profile
graphs is more difficult than the visual assessment of a
VAT image, especially when the number of clusters in the
data set is large; that is, reVAT does not produce a composite picture of possible substructure in the data. Also,
the computational complexity of reVAT is still O(n^2) due
to the computation of the n × n input-distance matrix.
To solve the interpretation problem of reVAT, a new
algorithm called bigVAT [69] was introduced, which combines the quasi-ordering technique used by reVAT with an
image display of the set of profile graphs, so that the
clustering tendency information is presented as a VAT-like image. bigVAT uses random samples from reVAT-generated quasi-ordered profile graphs to create this image and thereby
simplify the interpretation of cluster structure in big data.
Although reVAT and bigVAT address the visualization
challenge of the VAT RDI for moderately sized data sets,
they still suffer from the high memory requirements of
storing an n × n (unordered) distance matrix. To address
this challenge, Hathaway et al. [13] developed a new scalable, sample-based version of VAT called scalable VAT
(sVAT) and its iVAT extension, scalable iVAT (siVAT) (algorithm S6 in "Pseudocode for Various Algorithms Belonging
to the Visual Assessment of Tendency Family"), which is
feasible for arbitrarily large data sets.
In a nutshell, sVAT/siVAT selects a sample of (approximately) size n from the full set of objects O = {o_1, o_2, ..., o_N}
and performs VAT on the distance matrix of the n samples.
The sample is chosen so that it (hopefully) contains a cluster
structure similar to the full set. One of the most important
results in [13] is a (weak) theorem that guarantees that each
cluster in the big data is sampled at least once if it is compact
and well separated in the sense of Dunn's index. The sample is built
by first picking a set of k' distinguished objects using maximin sampling [70], selected to provide a representation of
each of the clusters. Then, the remainder of the sample is
built by choosing additional data near each of the distinguished objects. This sampling scheme is called maximin
random sampling (MMRS). The VAT/iVAT algorithm is then
applied to the MMRS samples. This yields a sample-based
approximation to the cluster heat map of the big data set that
cannot be made with VAT or iVAT.
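The sampling scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are assumed to be feature vectors in an array X with Euclidean distances, each object is grouped with its nearest distinguished object, and each group contributes a share of the sample proportional to its size. The names mmrs_sample, k_prime, and seed are hypothetical.

```python
import numpy as np

def mmrs_sample(X, k_prime, n, seed=0):
    """Sketch of maximin random sampling: return about n row indices of X."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Step 1: maximin selection of k_prime distinguished objects.
    first = int(rng.integers(N))
    distinguished = [first]
    d_min = np.linalg.norm(X - X[first], axis=1)
    for _ in range(k_prime - 1):
        nxt = int(np.argmax(d_min))  # farthest from all chosen so far
        distinguished.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    # Step 2: group every object with its nearest distinguished object.
    # Only an N x k_prime array is formed, never the N x N matrix.
    d_to_dist = np.stack(
        [np.linalg.norm(X - X[m], axis=1) for m in distinguished], axis=1
    )
    group = np.argmin(d_to_dist, axis=1)
    # Step 3: draw randomly from each group in proportion to its size.
    sample = []
    for g in range(k_prime):
        members = np.flatnonzero(group == g)
        n_g = min(len(members), int(np.ceil(n * len(members) / N)))
        if n_g == 0:
            continue
        sample.extend(rng.choice(members, size=n_g, replace=False).tolist())
    return np.array(sample), np.array(distinguished)
```

Because each group's share is rounded up, the returned sample holds between n and n + k_prime indices, which matches the "approximately size n" wording. A VAT or iVAT routine would then be applied to the distance matrix of the rows X[sample].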
To illustrate this, Figure 9(a) is a scatterplot of N =
1,000,000 2D points drawn from a four-component Gaussian mixture, with 250,000 points per cluster. In this case, we cannot
generate a VAT image, indicated by the question mark in
Figure 9(b). However, we can generate sVAT and siVAT
images for this big data set by sampling n = 500 points
(0.05% of the total data set) from O. The sVAT image [Figure 9(c)] weakly suggests four clusters, which are seen with
much better visual acuity in the siVAT image [Figure 9(d)].
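A rough storage estimate helps explain the gap between the two cases. The Python sketch below assumes a dense distance matrix with 8-byte floating-point entries, which is an assumption of this illustration and not a figure from the article. The helper name dense_matrix_bytes is hypothetical.

```python
def dense_matrix_bytes(n_objects, bytes_per_entry=8):
    """Bytes needed for a dense n_objects x n_objects matrix (assumed float64)."""
    return n_objects * n_objects * bytes_per_entry

full = dense_matrix_bytes(1_000_000)   # full data set, N = 1,000,000
sample = dense_matrix_bytes(500)       # sVAT/siVAT sample, n = 500

print(f"N = 1,000,000: {full / 1e12:.1f} TB")   # 8.0 TB
print(f"n = 500:       {sample / 1e6:.1f} MB")  # 2.0 MB
print(f"sample fraction: {500 / 1_000_000:.2%}")  # 0.05%
```

Under this assumption the full matrix needs about 8 TB, while the sampled matrix needs about 2 MB.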
Wang et al. [19] proposed a random-sampling-based
extension to their SpecVAT algorithm to enable VCA for
large data sets. Prasad and Reddy [71] proposed a sample-based VAT (PSVAT), which uses several distinguished features (DFs) [72] to select the best random samples in a
progressive sampling scheme. DFs are selected from the