Audio Mining: An Author Interview

One of the “Editor’s Choice” articles in the August 2021 issue of JCR introduces “audio mining” to our journal. Consumer researchers have been mining text for many years with natural language processing (NLP) algorithms and tools. Consumers do write a lot, but they also say a lot and there has been surprisingly little research in our journal or in other marketing journals that has studied the audio produced by consumers. Audio mining can help us do this by using machine learning tools to understand not only what consumers say but also how they say it. This new article by Shane Wang, Shijie Lu, X I Li, Mansur Khamitov, and Neil Bendle focuses specifically on vocal tone (i.e., how something is said), and makes an important contribution to our field by introducing audio mining and also helping us better understand how tone of voice plays a role in persuasion.

A general summary of the article can be found here. To get a better understanding of this research, I chatted with the article’s lead author, Shane Wang, from the Ivey Business School at the University of Western Ontario in Canada.

Andrew Stephen: Tell me about your audio mining article.

Shane: A central contribution of our work is to illustrate that online persuasion can be mined at scale and captured using an automatic audio mining method. In doing so, we provide a pioneering account of how computers (as opposed to labor-intensive and/or semi-automated approaches to coding and interpreting voices) can predict an online persuasion attempt’s effectiveness based on extracted vocal tones.

A: That’s fascinating. Why go to this much trouble to understand vocal tone in persuasion? What are some of your key findings?

S: Persuasion success is often related to hard-to-measure characteristics, such as the way the persuader speaks. The profusion of online interactions provides avenues to study this important consumer-relevant phenomenon, but also changes the process. For example, people who make persuasion attempts online cannot be assessed through handshakes or eye contact. In particular, online videos allow receivers to hear the persuader’s vocal tones. These provide cues that receivers can use in a variety of ways to determine their response to the persuasion attempt.

Against this background, the starting point for our research was to suggest that receivers use cues to determine whether the persuaders are likely to deliver what they promise. Specifically, we set out to test whether persuaders’ vocal tones, measured automatically, affect receivers’ decisions to fund a request. We did this because vocal tones are often thought to give insight into a persuader’s competence. That is, we were curious about how a speaker’s vocal tone persuades.

Our core idea was that speakers who sound more focused would be perceived as more competent, and thus, following prior research and intuition, more persuasive. Additionally, our prediction was that speakers who come across as more stressed would be perceived as less competent, and therefore less persuasive. Last but not least, we anticipated that speakers who seemed excessively emotional would be perceived as less competent, and hence less persuasive.

The key outcome we wanted to check was funding success. We looked at whether perceived competence affects whether receivers expect persuaders to achieve their goals. Our thinking was: why fund a project if you do not believe the persuader is competent to deliver it? Therefore, we expected that the perceived competence of persuaders would mediate the effects of persuaders’ vocal tones on funding success.

A: This article combines secondary data and experiments. As you know, JCR encourages multi-method approaches. Can you tell us a bit more about this?

S: Something that is very important to us is the value of combining secondary data and experiments. This makes a lot of sense when investigating persuasion. We are interested in important real-world phenomena but also want to get under the hood, as it were, to better understand what is happening. We would strongly encourage those trying to understand a wide range of consumer-relevant phenomena to look at secondary data to see what is happening in the field. This may require significant data analysis skills but doesn’t always need to. When you combine secondary data analysis with the sort of control that an experiment can bring, there is often considerably more benefit to be gained than from separate investigations that each use a single method. It is certainly fair to say, as you will see in this paper, that we are strong proponents of multi-method research.

A: Taking a step back, why did you and your coauthors embark on this research? What got you all interested in audio?

S: We are now living in the Big Data era. About 80% of the data we see out there is unstructured. My research focuses on unstructured data, which we see every day in a wide variety of contexts. For example, text, which we certainly see every day, is a major source of unstructured data. Text is unstructured because it lacks the sort of clear organization of a database. Just think of how text differs from your Excel spreadsheet with its rows and columns.

Unstructured data is much more than just text. Images are a key source of unstructured data. Think of how much information is contained in an image. Again, the information in images does not come in a neat package. What goes for images can be multiplied when looking at video. There is so much to see and learn, but video is very challenging to interpret.

Finally, think of the information that comes in audio clips. Music has information. Random noises provide information. (Think of any cop show in which the detective hears the vital clue in the background of a voicemail.) Of course, when we speak we convey lots of information through both the words we use and the way we say them. When listening, think of all the information we automatically gain about the emotional state of a speaker. How we can capture that information is a key challenge for understanding consumer behavior.

To see the potential of automated methods, just consider how many audio clips there are out there. We looked at the audio from pitches for funds in a couple of different Kickstarter categories. Knowing what proved successful in the past is vital knowledge for those making, as well as those receiving or resisting, persuasion attempts.

This research is actually a follow-up to my IJRM 2019 paper focusing on video mining. That work uncovered information in videos but lacked the controlled experiments to dig more deeply into what exactly was happening. In this audio mining research we added controlled experiments to our secondary data analyses. These allowed us to look more closely at the persuader’s perceived competence. Together, the two approaches allowed us to provide an explanation of the receiver’s response to the vocal tones. Thus, our research also speaks to the value of combining secondary data and experiments when investigating persuasion.

A: How will this article be useful to non-academic readers of JCR?

S: Entrepreneurs, and other funding seekers, need to watch their vocal tones and carefully consider the signals they send beyond their words. There is potential for those posting audio to better understand how what they say and how they say it will be received. People always say you never know how you sound to other people. Maybe soon a computer will be able to tell us. A successful persuasion attempt is most likely to result from vocal tones denoting (1) focus, (2) low stress, and (3) stable emotions. These three tone dimensions allow listeners to draw conclusions about an entrepreneur’s competence. So the sensible advice is to sound focused, stable, and not too stressed or extremely emotional to help demonstrate your competence. Of course, this may require extensive practice for those who are not natural communicators. These results identify key indicators of persuasion attempt success and suggest a greater role for audio mining in consumer research.

A: Do you have any advice for researchers working with audio? Any lessons learned on this project?

S: We strongly feel that further research on using and validating various audio mining techniques would be extremely valuable. That said, it isn’t always easy. There is so much data out there. Finding the right data to look at can be a challenge that many academic researchers won’t have faced before. Some academics may be used to not having the data; having too much data, and having to find the right data within it, is a bit of a novel challenge. Fascinating new methods can quantify key elements of the human voice. This has promising benefits for numerous fields. Consumer research can take the lead in this area by gaining insight into a speaker’s thoughts, identifying presenters’ styles, and assessing persuasion effectiveness. Still, new methods don’t mean that society is the way we would want it to be. When you deal with people you always need to be concerned about potential inequity. For instance, vocal tones differ between men and women, so does that have an impact on the way receivers perceive competence? One should always consider whether receivers are behaving in a biased way towards some speakers and what we can do to reduce any inequity.

To start this research, it is worth appreciating that there are two big streams of audio/voice-related research. The first stream involves converting the spoken words to text. The skills required to do this work are the same as those needed to text mine, with the additional challenge of getting the text transcribed. The second stream is direct mining of the audio. We chose this second stream. There is less prior work in this stream to draw upon. Fortunately, the company we are collaborating with has a large set of training data on vocal tones; it has collected people’s voices for more than 15 years. As is often the case, finding the right people and organizations to work with is central to effective research.

Note: Some responses have been lightly edited for brevity.

Read the full paper:

Journal of Consumer Research, Volume 48, Issue 2, August 2021, Pages 189–211,