Psychological Science Demo - 1223

Using Machine Learning to Generate Hypotheses
are not depleted, and so on (for reviews, see Ellemers,
van der Toorn, Paunov, & van Leeuwen, 2019; Gerlach,
Teodorescu, & Hertwig, 2019). However, these interventions to reduce unethical behaviors cannot be easily
implemented in the field. For example, during a stay-at-home order, it would be advisable for people to not
follow the social norm: if there are too many people
outside, it is advisable to stay indoors; if there is no
one outside, then it is safe to go out.
In this research, we sought to identify novel antecedents of unethical behavior by examining existing data
sets that were designed for other purposes (Goldstone
& Lupyan, 2016). Specifically, we used the World Values
Survey (WVS; 2019), which contains measures of unethical behavior. Using this data set, past research has identified a number of predictors of unethical behavior,
including Big Five personality traits (Simha & Parboteeah,
2019); happiness and belief in free will (Martin, Rigoni,
& Vohs, 2017); filial piety and materialism (Cullen,
Parboteeah, & Hoegl, 2004); political orientation, pride
in nation, generalized trust, and satisfaction with household income (Sommer, Bloom, & Arikan, 2013); and
religiosity, risk aversion, interest in politics, and trust
in the political system (Dong & Torgler, 2009). Most of
these are individual-difference variables that cannot
easily be experimentally manipulated and, therefore,
cannot be easily used by policymakers to help arrest
the COVID-19 pandemic. However, given that the WVS
asked respondents hundreds of questions, there are
likely other predictors of unethical behavior in the data
that researchers have not yet examined.
There are many ways to generate novel hypotheses
from large data sets. Researchers could examine which
variables in the WVS data set are most strongly correlated with the variables measuring unethical behavior.
Researchers could run regressions with regularization
methods (e.g., lasso, ridge, and elastic net) to select an
optimum number of predictors (Hastie, Tibshirani, &
Friedman, 2009). However, the large proportion of missing values in the WVS data set limits the use of these
regression-based methods, as they can be run only on
observations without any missing values. Further, linear
regressions require that key assumptions, such as
homoscedasticity and independently, identically, and
normally distributed residuals, are met. Researchers
could also use machine-learning methods, such as random forests, gradient boosting, k-nearest neighbors, support-vector machines, and neural networks (Alpaydin,
2020). These methods do not make any auxiliary
assumptions and can impute even large volumes of
missing data (either in a separate stage prior to modeling or during the process of modeling).
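The simplest of the approaches listed above, screening predictors by their correlation with the outcome, can be sketched as follows. This is a minimal illustration rather than the procedure used in this research: the variable names (`unethical`, `item_a`, `item_b`) and the toy rows are hypothetical, and missing answers are handled by pairwise deletion, which is one assumption among several possible ways of dealing with missing values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear signal
    return sxy / math.sqrt(sxx * syy)

def rank_predictors(rows, outcome, min_pairs=3):
    """Rank predictors by |r| with the outcome, using pairwise deletion."""
    names = [k for k in rows[0] if k != outcome]
    scores = {}
    for name in names:
        # Keep only respondents who answered both this item and the outcome.
        pairs = [(r[name], r[outcome]) for r in rows
                 if r[name] is not None and r[outcome] is not None]
        if len(pairs) < min_pairs:
            continue  # too few complete pairs to estimate a correlation
        xs, ys = zip(*pairs)
        scores[name] = pearson(xs, ys)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy, made-up survey rows (not WVS data); None marks a missing answer.
rows = [
    {"unethical": 1, "item_a": 2, "item_b": 5},
    {"unethical": 2, "item_a": 4, "item_b": None},
    {"unethical": 3, "item_a": 6, "item_b": 4},
    {"unethical": 4, "item_a": None, "item_b": 5},
    {"unethical": 5, "item_a": 10, "item_b": 3},
]
ranking = rank_predictors(rows, "unethical")
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

In this toy example, only three of the five rows are complete on every variable, so a method restricted to complete observations would discard 40% of the respondents, whereas pairwise deletion uses four rows for each predictor.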
Once a machine-learning model is trained, we can
query it to identify the top predictors. Certain challenges emerge, however, when attempting to do so.


Statement of Relevance
This research is likely to be of interest to all researchers in the social-behavioral sciences who work
on hypothesis testing, because it demonstrates a
general method to generate novel hypotheses using
machine-learning techniques. This method can be
applied in any field in which researchers have access
to reasonably large data sets. The present research
significantly expands the scope of machine learning
in psychology, which has been nearly exclusively
focused on prediction until now. The current research
demonstrates that machine-learning methods can
be used simultaneously for prediction and for theory
development. The context in which we tested the
hypothesis generated by the machine-learning
method, unethical behaviors surrounding the
COVID-19 pandemic, is immediately relevant to
policymakers and the general public who wish
people to act in a more ethical manner to arrest
the pandemic. Our experimental materials provide
messages that policymakers and public-interest
organizations can immediately use.

In particular, identifying the top predictors in a large
data set is a nondeterministic-polynomial-time-complete
problem (Karp, 1975), and there is no known closed-form solution to this problem. In practice, only approximate solutions are feasible for large problems of this class, and
any given approximate solution can, in practice, neither be proven
to be the best solution nor be proven to be an inferior
solution. Various regression-based and machine-learning methods merely provide a possible solution;
neither the similarity of solutions provided by different methods nor their difference is guaranteed
(Reyzin, 2019). Thus, researchers can freely choose
any method to identify the top predictors in a large
data set as long as the data meet the assumptions of
the method and researchers have sufficient computing
power.
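One common, model-agnostic way to query a trained model for its top predictors is permutation importance: shuffle one predictor at a time and record how much the prediction error increases. The sketch below illustrates that general idea only; it is not necessarily the procedure used in this research. The function `toy_model` is a hypothetical stand-in for a trained model, and the data are simulated.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each column is shuffled in turn."""
    rng = random.Random(seed)
    base = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            increases.append(mse(model, X_perm, y) - base)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical stand-in for a trained model: it relies strongly on
# feature 0, weakly on feature 1, and not at all on feature 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

imp = permutation_importance(toy_model, X, y)
top = sorted(range(3), key=lambda j: imp[j], reverse=True)
print("features ranked by importance:", top)
```

Because the ranking depends on the model, the error measure, and the random shuffles, a different method may return a different list of top predictors, which is consistent with the point that no single solution is guaranteed.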
In the present research, we chose to use a deep
neural network to generate novel hypotheses about
antecedents of unethical behavior in the WVS. We chose
deep learning because this method has been the source
of recent groundbreaking discoveries in physics (e.g.,
novel particles; Baldi, Sadowski, & Whiteson, 2014),
chemistry (e.g., novel materials; Jha et al., 2018), and
biology (e.g., novel antibiotics; Stokes et al., 2020).
Further, regression-based methods limit the range of
possible predictor variables to those that have a mostly
linear and direct relationship with the dependent variable; in contrast, deep-learning models can capture
nonlinear effects and complex interactions.
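The limitation of linear screening can be illustrated with two constructed cases in which a predictor fully determines the outcome yet has a zero linear correlation with it: a U-shaped effect and a pure interaction. The data below are synthetic and chosen only to make the point; they are not drawn from the WVS.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Case 1: U-shaped effect. y is fully determined by x, but the
# relationship is symmetric around zero, so the linear trend cancels out.
xs = [x / 10 for x in range(-50, 51)]
ys = [x * x for x in xs]
r_linear = pearson(xs, ys)
r_squared_term = pearson([x * x for x in xs], ys)

# Case 2: pure interaction. y depends only on the product of x1 and x2,
# so neither predictor is correlated with y on its own.
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y_int = [a * b for a, b in zip(x1, x2)]
r_x1 = pearson(x1, y_int)
r_x2 = pearson(x2, y_int)
r_product = pearson([a * b for a, b in zip(x1, x2)], y_int)

print(f"U-shape: r(x, y) = {r_linear:.3f}, r(x^2, y) = {r_squared_term:.3f}")
print(f"Interaction: r(x1, y) = {r_x1:.3f}, r(x1*x2, y) = {r_product:.3f}")
```

A linear model recovers these effects only if the researcher specifies the squared or product term in advance; a flexible model can, in principle, learn such terms from the data.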
