Computational Intelligence - November 2014 - 65

learning; for example, why and when
some ingredients of current deep learning techniques, e.g., pretraining and
dropout, are helpful and how they can
be more helpful? There have been some
recent efforts in this direction [6], [23],
[52]. Moreover, we might ask if it is
possible to develop a parameter tuning
guide to replace the current almost-exhaustive search?
Third, we need to note that big data
usually contains too many "interests",
and from such data we may be able to
get "anything we want"; in other words,
we can find supporting evidence for
any argument we are in favor of. Thus,
how do we judge/evaluate the "findings"? One important solution is to
turn to statistical hypothesis testing. The
use of statistical tests can help in at least
two respects. First, we need to verify that
what we have done is really what we
wanted to do. Second, we need to verify that what we have attained is not
caused by small perturbations that exist
in the data, particularly those due to non-thorough exploitation of the whole
data. Although statistical tests have been
studied for centuries and have been
used in machine learning for decades,
the design and deployment of adequate
statistical tests are non-trivial, and in fact
there have been misuses of statistical
tests [17]. Moreover, statistical tests suitable for big data analysis, which must address not only
computational efficiency but also
the concern of using only part of
the data, remain an interesting but
under-explored area of research.
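As a hedged illustration of the second use of statistical tests described above (checking that a result is not caused by small perturbations in the data), a permutation test on a difference in means might be sketched as follows. The function name `permutation_test`, the accuracy scores, and the number of resamples are assumptions made for this sketch; it is not the procedure of any work cited here.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the (smoothed) fraction of random relabelings whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random reassignment of the group labels
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical accuracy scores of two models on ten data subsamples.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
model_b = [0.74, 0.76, 0.73, 0.77, 0.75, 0.72, 0.76, 0.74, 0.75, 0.73]

p_value = permutation_test(model_a, model_b, n_resamples=2000)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests that the gap between the two models is unlikely to arise from relabeling noise alone. Because the test only needs repeated passes over the scores, it can be run on subsamples, although how to account for using only part of the data remains the open question raised above.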
Another way to check the validity of
the analysis results is to derive interpretable models. Although many machine
learning models are black boxes, there
have been studies on improving the
comprehensibility of models, such as
rule extraction [62]. Visualization is
another important approach, although it
is often difficult with dimensions higher
than three.
Moreover, big data usually exists in a
distributed manner; that is, different
parts of the data may be held by different owners, and no one holds the entire
data. It is often the case that some
sources are crucial for a given analytics
goal, whereas other sources are
less important. Given that different data owners might grant the
analyst different access rights, can
we leverage the sources without access
to the whole data? What information
must we have for this purpose? Even if
the owners agree to provide some data,
it might be too challenging to transport
the data due to its enormous size. Thus,
can we exploit the data without transporting it? Moreover, data at different places may differ in label quality and may contain significant label noise,
perhaps due to crowdsourcing. Can we
learn from low-quality and/or
even contradictory label information?
Furthermore, we usually assume that the
data is independently and identically distributed (i.i.d.); however, this fundamental
assumption can hardly hold across different data sources. Can we learn effectively and efficiently beyond the i.i.d.
assumption? There are a few preliminary
studies on these important issues for big
data, including [34], [38], [61].
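To make the question of exploiting data without transporting it concrete, here is a minimal sketch for one very simple case: fitting a line by least squares when each owner ships only five summary numbers instead of its raw records. The class `LocalStats`, the helper names, and the toy data are hypothetical, and the sketch deliberately ignores access rights, label noise, and non-i.i.d. sources; it is not the approach of [34], [38], or [61].

```python
from dataclasses import dataclass


@dataclass
class LocalStats:
    """Additive sufficient statistics for fitting y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        # Summaries from different owners combine by simple addition.
        return LocalStats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def summarize(points):
    """Run by each data owner on its own (x, y) pairs."""
    stats = LocalStats()
    for x, y in points:
        stats = stats.merge(LocalStats(1, x, y, x * x, x * y))
    return stats


def fit_line(stats):
    """Run by the analyst on the merged summaries only."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Hypothetical data held by three owners; all points lie on y = 2x + 1.
owner_data = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(5.0, 11.0)],
]

merged = LocalStats()
for points in owner_data:
    merged = merged.merge(summarize(points))  # only five numbers leave each site

slope, intercept = fit_line(merged)
print(slope, intercept)
```

The merged fit equals the fit on the pooled data because the least-squares solution depends on the data only through these sums. Models without such additive summaries need other techniques, which is part of why the questions above remain open.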
In addition, given the same data, different users might have different
demands. For example, in product recommendation, some users might
demand that the most highly recommended items
be good, others might demand
that all the recommended items be
good, and still others might demand
that all the good items be returned.
The computational and storage loads of
big data may prohibit constructing a separate model for each of these demands. Can we build
one model (a "general model" which
can be adapted to other demands with
cheap minor modifications) to satisfy the
various demands? Some efforts have
been reported recently in [35].
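The three demands in the recommendation example correspond roughly to precision among the top-ranked items, precision over all recommended items, and recall. The sketch below evaluates a single set of scores against each demand and adapts to a demand by moving one cut-off. The item scores, the ground-truth set, and the use of a threshold as the "cheap minor modification" are assumptions for illustration, not the method reported in [35].

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are good."""
    top = ranked[:k]
    return sum(item in relevant for item in top) / len(top)


def precision(recommended, relevant):
    """Fraction of all recommended items that are good."""
    return sum(item in relevant for item in recommended) / len(recommended)


def recall(recommended, relevant):
    """Fraction of all good items that were recommended."""
    return sum(item in relevant for item in recommended) / len(relevant)


# Hypothetical scores from a single recommender, and hypothetical ground truth.
scores = {"a": 0.95, "b": 0.90, "c": 0.70, "d": 0.55, "e": 0.40, "f": 0.10}
good_items = {"a", "b", "d", "e"}

ranked = sorted(scores, key=scores.get, reverse=True)


def recommend(threshold):
    """Adapt the same scores to a demand by moving one cut-off."""
    return [item for item in ranked if scores[item] >= threshold]


strict = recommend(0.8)  # few items, for users who want every item to be good
broad = recommend(0.3)   # many items, for users who want all good items returned

print(precision_at_k(ranked, good_items, 2))
print(precision(strict, good_items), recall(strict, good_items))
print(precision(broad, good_items), recall(broad, good_items))
```

With these toy numbers the strict cut-off gives perfect precision but misses half of the good items, while the broad cut-off returns every good item at the cost of one bad recommendation, which is the tension between demands that a general model would have to manage.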
Another long-standing issue is whether, in the "big data era", we
can really avoid violating privacy
[2]. This problem remains open.
3. Data Mining/Science with Big Data

Aspects of big data have been studied
and considered by a number of data
mining researchers over the past decade
and beyond. Mining massive data by
scalable algorithms leveraging parallel
and distributed architectures has been a
focus topic of numerous workshops and
conferences, including [1], [14], [43],
[50], [60]. However, the
Volume aspect of data is only now being
fully embraced, largely because of the
rapid availability of datasets that exceed
terabytes and now petabytes, whether
from scientific simulations and
experiments, business transactional data,
or digital footprints of individuals.
Astronomy, for example, is a fantastic
application of big data, driven by
advances in astronomical instruments. Each pixel captured by the new
instruments can have a few thousand
attributes, which translates quickly into a petascale problem. This rapid growth in
data is creating a new field called Astroinformatics, which is forging partnerships between computer scientists,
statisticians and astronomers. The emergence of big data from various domains,
whether in business or science or
humanities or engineering, is presenting
novel challenges in scale and provenance
of data, requiring a new rigor and interest among the data mining community
to translate their algorithms and frameworks for data-driven discoveries.
A similar caveat applies to the
concept of Veracity of data. The issue of
data quality, or veracity, has been considered by a number of researchers [39],
including data complexity [9], missing
values [19], noise [58], imbalance [13],
and dataset shift [39]. The last of these, dataset
shift, is most profound in the case of big
data, as unseen data may follow a
distribution that is not seen in the training data. This problem is tied to the
problem of Velocity, which presents the
challenge of developing streaming algorithms that are able to cope with shocks
in the distributions of the data. Again,
this is an established area of research in
the data mining community in the
form of learning from streaming data
[3], [48]. The key opportunity here is to
take the academic literature for a test drive in real industry settings where
issues of scale and delivery often supersede the desire for accuracy. Depending
