Introduction

Over the past 10 years, there have been ample arguments suggesting that data-driven analysis methods using social media data can supplement or even replace traditional polls and surveys. The affordability and timeliness of data collected from social media are appealing. Some researchers have investigated the political discourse spread through social media and argued that social media data has potential utility as an indicator of political opinion and is comparable to offline surveys (Caldarelli et al., 2014; Chung & Mustafaraj, 2011; DiGrazia et al., 2013; Kalampokis et al., 2013; O’Connor et al., 2010; O’Leary, 2015). However, other scholars have suggested that the majority of social media messages are “pointless babble” and that there is little consensus regarding methodology and evaluation (Boyd & Crawford, 2012; Couper, 2013; Gayo-Avello, 2013; Jungherr et al., 2012; Metaxas et al., 2011; Murphy et al., 2014; Ruths & Pfeffer, 2014; Schober et al., 2016; Schoen et al., 2013; Tufekci, 2014; Weller, 2015; Yu & Kak, 2012). There is, in short, no consensus on the validity of this approach.

Many researchers have tried to overcome these limitations by improving sampling or weighting, or by reducing data noise (Baldwin et al., 2013; Barberá, 2016; Chang et al., 2010; Choy et al., 2011, 2012; Davis, 2017; Diaz et al., 2016; Flemming & Sonner, 1999; Kalampokis et al., 2013; Karimi et al., 2016; Mislove et al., 2011; Tufekci, 2014; W. Wang et al., 2015; Zhang et al., 2016).

In this paper, we review methodologies that have been widely used to predict election results and to evaluate public opinion. Next, we point out the limitations of those methodologies and introduce efforts to overcome them. Finally, we propose a way to measure the direction of public opinion through big data analysis. Specifically, we focus on how social media analysis methods can be used and argue that they are better suited to understanding agendas, and public opinion more broadly, than to forecasting election outcomes.

Review: Previous Work

Previous review papers focused mainly on the accuracy of predictions. For example, some researchers (Gayo-Avello, 2012a, 2012b, 2013; Goldstein & Rainey, 2010; Schoen et al., 2013) reviewed how methodology led to the success or failure of social media predictions, and other researchers (Phillips et al., 2017) reviewed the literature as of 2017 with a broader scope that includes predictions of stock prices, marketing outcomes, and public health. In short, those reviews concentrated on whether predictions succeeded or failed.

For this paper, we first conducted a database search in June 2018 on Google Scholar for articles published since 2010 that include the terms “social media” and “election prediction.” We then screened each paper, retaining those that attempt to use social media data to predict offline political phenomena. Our search yielded 534 articles, of which we selected 69 that matched our research topic.

Methods Used in Predictive Modeling

There are three major predictive modeling methods for election prediction using social media data. Those three methods are used in various research papers and by big data analysis companies.

Counting volume

The oldest and most basic approach to analyzing public opinion with social media data is counting volume, which simply counts how often a specific word appears. In the early days of social media research, most studies used Twitter, the most popular platform at the time. This method has been widely used since a correlation between the volume of social media messages and election results was first reported. Counting the number of tweets that contain a reference to a political party or a politician has been used in many studies (Khatua et al., 2015; Skoric et al., 2012; Tumasjan et al., 2010; Williams & Gulati, 2008). After counting tweets or expressions, the share of each political party or candidate on social media is compared to the actual election results, or the ranking by social media volume is compared to the ranking by survey or by share of voters.

These researchers argue that the relative volume of tweets closely mirrors election results. To quantify the difference, most researchers calculate the Mean Absolute Error (MAE), which has been used to compare the accuracy of political information on social media relative to election polls. However, there is no theoretical explanation of the relationship between the volume of tweets and election choice.
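To make the mechanics concrete, the sketch below is our own illustration (not taken from any cited study): it counts candidate mentions in a small list of tweets, converts them to shares, and computes the MAE against official vote shares. The candidate names, tweets, and vote shares are hypothetical.

```python
# Minimal illustration of the counting-volume approach and the MAE comparison.
# Candidate names, tweets, and vote shares are hypothetical placeholders.
tweets = [
    "Vote for Kim tomorrow!",
    "Lee's debate performance was weak",
    "Kim and Lee both spoke about jobs",
    "I am voting for Lee",
]
candidates = ["Kim", "Lee"]
actual_vote_share = {"Kim": 0.48, "Lee": 0.52}   # official results (hypothetical)

# 1) Count tweets mentioning each candidate.
counts = {c: sum(c.lower() in t.lower() for t in tweets) for c in candidates}

# 2) Convert mention counts to shares of all candidate mentions.
total = sum(counts.values())
tweet_share = {c: counts[c] / total for c in candidates}

# 3) Mean Absolute Error between tweet shares and vote shares.
mae = sum(abs(tweet_share[c] - actual_vote_share[c]) for c in candidates) / len(candidates)
print(tweet_share, round(mae, 3))
```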

Sentiment analysis and machine learning

The basic concept is similar to counting, but this method adds the sentiment of words or location information as variables and then applies regression or machine learning analysis (Ceron et al., 2014) instead of simple counting. Sentiment analysis essentially measures the positivity or negativity of words extracted through morphological analysis. Analysis of sentiments or opinions is assumed to be better than volume counting: it extracts and analyzes opinion-oriented text, recognizing positive and negative opinions and quantifying how positive or negative entities are (L. Chen et al., 2012; Pimenta et al., 2013; Sang et al., 2012). To define which words carry which sentiment, a lexicon-based sentiment dictionary or the LIWC (Linguistic Inquiry and Word Count) tool is usually used. More recently, machine learning methods have been used to detect the sentiment of text.
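As a rough illustration of the lexicon-based variant, the sketch below shows the general idea only; it is not the LIWC tool or any cited study's pipeline, and the tiny lexicon is hypothetical.

```python
# Toy lexicon-based sentiment scoring: each word carries a fixed polarity, and a
# tweet's score is the sum over its tokens. Real dictionaries (e.g., LIWC) are far
# larger and language-specific; this lexicon is purely illustrative.
lexicon = {"great": 1, "win": 1, "hope": 1, "weak": -1, "scandal": -1, "fail": -1}

def sentiment(text):
    tokens = text.lower().split()
    return sum(lexicon.get(tok.strip(".,!?"), 0) for tok in tokens)

print(sentiment("Great debate, real hope for a win!"))        # -> 3 (positive)
print(sentiment("Another scandal, the campaign will fail"))   # -> -2 (negative)
```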

However, sentiment analysis is still far from perfect. Though it seems more sophisticated than measuring volume, the method has weaknesses. The sentiment of a word depends on context, so the same words may carry different sentiments, and computers are not yet able to resolve this reliably. Some researchers have also pointed out that ironic and sarcastic expressions are very frequent and difficult to detect (Clavel & Callejas, 2016; Reyes & Rosso, 2014).

In a more recent approach, researchers use Google Trends, a service that shows how often a specific word is searched for, by region and over time. Unlike social media posts, search keywords are not shared with others, so they can be said to reflect users’ honest interests. Google Trends is relatively easy to use, so not only researchers but also individuals and reporters can readily access and analyze this data. Moreover, analysis need not be limited to the name of a political party or politician; political issues associated with specific politicians can also be analyzed (Stephens-Davidowitz, 2014).
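For readers who want to try this, the sketch below assumes the unofficial pytrends package (its interface may change over time); the search terms and timeframe are illustrative, not drawn from any cited study.

```python
# Sketch: comparing relative search interest for two candidates with Google Trends,
# via the unofficial pytrends package (assumed installed; interface may change).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(
    kw_list=["Trump", "Clinton"],          # illustrative search terms
    timeframe="2016-09-01 2016-11-08",     # run-up to the 2016 US election
    geo="US",
)
interest = pytrends.interest_over_time()   # relative interest on a 0-100 scale
print(interest[["Trump", "Clinton"]].mean())
```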

However, it is difficult to measure opinions toward candidates on the basis of how frequently they are searched for. A search can be read as an expression of interest, but because people also search in order to criticize or simply out of curiosity, search frequency does not translate into support or willingness to vote for a candidate.

Limitations of Predictive Modeling

Many researchers point out that monitoring what users share or search for on social media and on the web has led to greater insights into what people care about or pay attention to at any moment in time. However, social media and search results can be readily manipulated, which is something that has been underappreciated by the press and the general public. We will describe and summarize the limitations of using social media data.

Crawling and platform difference

There are various social media platforms, such as Twitter, Facebook, Instagram, blogs, and Google Plus, and we should pay attention to platform-specific biases. Every social media platform has its own data collection and accessibility policies, and data provision policies for research also differ. For example, Twitter provides an API (Application Programming Interface) for obtaining tweet data, but depending on the particular API and its conditions, researchers might obtain a biased sample from Twitter. Twitter offers a glance into its millions of users and billions of tweets through a “Streaming API,” which provides a sample of all tweets matching parameters preset by the API user. The essential drawback of the Twitter API, however, is the lack of information about what and how much data users receive. This leads researchers to question whether the sampled data is a valid representation of the overall activity on Twitter. For example, some researchers (Ghosh et al., 2013; Morstatter et al., 2013, 2014; O’Connor et al., 2010) obtained different data for the same period because of differences in their collection methods.
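To illustrate what such collection looks like in practice, here is a minimal sketch using the tweepy library roughly as it worked at the time of this review (tweepy 3.x; later versions changed the interface). The credentials and track keywords are placeholders, and the retrieved tweets are still only the sample Twitter chooses to expose, which is exactly the sampling problem discussed above.

```python
# Sketch of keyword-based collection from Twitter's Streaming API with tweepy 3.x.
# Credentials and keywords are placeholders; the stream returns only the sample
# of matching tweets that Twitter decides to deliver.
import tweepy

class CollectListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.id_str, status.text[:80])   # store to a database in real use

    def on_error(self, status_code):
        return status_code != 420                 # stop on rate-limit disconnects

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

stream = tweepy.Stream(auth=auth, listener=CollectListener())
stream.filter(track=["election", "candidate"], languages=["en"])
```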

It is also obvious that different social media platforms are used in different countries. Although the most popular platforms vary from country to country, a number of studies do not take this into consideration. For example, at the time of the 2016 US presidential election, the mainstream media in the US did not predict Trump’s victory, whereas Google Trends pointed to a Trump win early on because Trump was searched for more often than Clinton. Could this result be applied to other countries? It is true that Google Trends has successfully predicted US presidential elections (Lui et al., 2011; Metaxas et al., 2011). However, the situation in Korea and the US is different. According to Naver, Naver ranked first in the Korean PC search market in March 2018 with 75.4% of searches, while Google had only 6.7%. Because Google Trends reflects only searches made on Google, it is much less representative of the Korean search market (Traffic Difference: US vs. Korea, July 2018, Alexa.com).

Representativeness: Biased Sample

The preceding problems lead to biased samples. The main challenge is securing the representativeness of social media data. Having a large number of tweets does not by itself mean that the voting population has been representatively sampled. A common assumption underlying many large-scale social media-based studies of human behavior is that a large enough sample of users will drown out noise introduced by the peculiarities of the platform’s population. In modern opinion polling, ensuring a representative sample is a core issue; that is, each individual in the target population should have a greater-than-zero probability of being sampled (Barberá & Rivero, 2015; L. Chen et al., 2012; Gayo-Avello, 2011; Malik et al., 2015; Ruths & Pfeffer, 2014; Tufekci, 2014).

Moreover, biases vary across social media platforms. For instance, in the US, Instagram is “especially appealing to adults aged 18 to 29, African-American, Latinos, women, urban residents,” whereas Pinterest is dominated by females, aged 25 to 34, with an average annual household income of $100,000 (Ruths & Pfeffer, 2014). Such factors can affect the results of social media prediction analysis. Representativeness is a crucial problem for social media data, and researchers in various fields, from social science to computer science, consistently point out this limitation.

Validity of methods: Irreplicability and incomparability

There is no agreement among researchers yet on the criteria for successful prediction. Metaxas et al. (2011) tested the predictive power of social media in several Senate races across two recent US Congressional elections. They reviewed the findings of other researchers and tried to replicate them in terms of both data volume and sentiment analysis. They described three standards that any theory aiming to predict elections competently and consistently using social media data should meet: 1) the prediction should be based on an algorithm with carefully predetermined parameters, 2) the analysis should take into account the difference between social media data and natural phenomena data, and 3) it should offer some explanation of why it works.

In most cases, researchers have filtered their data on the basis of decisions clearly made after the elections were over and the results were known (including which parties’ tweets to include). This has made it impossible to replicate the reported success rates (Gayo-Avello, 2013; Marchetti-Bowick et al., 2012).

Overcoming the Limitations

Recently, there have been various efforts to address the issues identified above. Here, we will present attempts to solve the most discussed problems of representativeness and other analytical methodologies that can utilize social media.

Sampling: User demographics

An important limitation of previous studies of political behavior using Twitter data is the biased sample. Differences in demographic composition have been documented between web panels and the US population in the 2014 election (Scott et al., 2015) and between search and Twitter data in the 2012 US presidential election (Diaz et al., 2016). Many researchers address this challenge by developing new machine learning methods to estimate the age, gender, and race of Twitter users in the US with high accuracy. A variety of methodologies are used, including classifying and predicting users through machine learning, with all variables available online as input data (Chang et al., 2010; Karimi et al., 2016; Mislove et al., 2011).
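A simplified sketch of how such demographic inference might be set up is shown below; it is our illustration only, with hypothetical profile texts and labels, not the models of the cited authors, who use much richer inputs (names, networks, profile images, posting behavior).

```python
# Sketch: inferring a user attribute (here, gender) from profile text with a
# standard classifier. Profiles and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

profiles = [                                     # user bios (hypothetical)
    "proud mom, teacher, coffee lover",
    "software engineer, gamer, sports fan",
    "grandmother of three, quilting and church",
    "crypto trader, gym, fast cars",
]
labels = ["female", "male", "female", "male"]    # hand-labeled training examples

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(profiles, labels)

print(model.predict(["retired nurse, loves her grandkids"]))
```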

After inferring the demographics of social media users, these studies apply weighting techniques to predict the percentage of votes individual candidates will receive (Barberá, 2016; Choy et al., 2011, 2012; Davis, 2017; Diaz et al., 2016; Flemming & Sonner, 1999; Gayo-Avello, 2011; W. Wang et al., 2015). Researchers can match the demographic distribution of the population, or apply weights derived from census data to correct the results. For example, some researchers (Diaz et al., 2016; Gayo-Avello, 2011) weighted the demographic distributions of social media users according to the census, while random sampling (Barberá, 2016) or quota sampling (Davis, 2017) has also been used in sample construction.
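The sketch below illustrates the basic weighting idea as a simplified cell-weighting example with hypothetical numbers; it is not the procedure of any specific cited study. Candidate support measured within each inferred demographic group is reweighted by that group's share of the census population.

```python
# Sketch of demographic weighting: candidate support within each inferred cell is
# reweighted by the cell's census population share. All shares are hypothetical.
sample = {
    # cell: (share of social media sample, observed support for candidate A)
    "18-29": (0.55, 0.62),
    "30-49": (0.30, 0.48),
    "50+":   (0.15, 0.35),
}
census_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

unweighted = sum(share * support for share, support in sample.values())
weighted = sum(census_share[cell] * support for cell, (_, support) in sample.items())

print(round(unweighted, 3), round(weighted, 3))   # weighting shifts the estimate
```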

Data and method

There have been many studies aimed at improving methods of analysis and data collection. Some researchers (Kalampokis et al., 2013; Tufekci, 2014) insist that data filtering is an important step. For example, when crawling social media data, researchers collect posts containing keywords or hashtags relevant to the research topic. If the wrong keywords are selected at this stage, meaningful data may be discarded or meaningless data collected, which generates noise. Advanced natural language processing algorithms have been developed to reduce such noise, not only in election prediction but also in other predictive studies; however, it remains a difficult task (Baldwin et al., 2013; Zhang et al., 2016).

Suggestions

There have been attempts to analyze the structure or diffusion process of public opinion, showing how opinions spread or how media messages are structured. We believe these methods are better suited to understanding public opinion, and we think social media analysis can be a way to understand public opinion during elections rather than predict election results.

Exploring Information Structure by Network Analysis

Tweets are propagated in various ways. Retweeting is the most effective, as it can potentially reach the most people, given its viral nature (Petrovic et al., 2011). Retweeting is the act of reposting someone else’s tweet in one’s own message stream, and there are generally two ways to do it on Twitter. Users can either manually edit the original tweet and add “RT @userA” (or something similar) to indicate that the original tweet came from userA, or they can use the retweet button, which does not allow them to change the original tweet. In Twitter’s API, the tweet-retweet connection is marked. In short, an RT network is a message propagation network, from which we can see who spread whose message. Through network analysis, we can also detect which issues spread widely on Twitter (Cameron et al., 2016; Dokoohaki et al., 2015; Petrovic et al., 2011). By analyzing networks of frequently used keywords on Twitter, we can calculate clustering coefficients showing that some information spreads widely while other information fails to spread and remains isolated.
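A minimal sketch of building and summarizing a retweet network with the networkx package follows; the edges are hypothetical, whereas real studies work with millions of tweet-retweet pairs drawn from the API field mentioned above.

```python
# Sketch: a retweet (RT) network as a directed graph where an edge u -> v means
# "u retweeted v". Edges are hypothetical. Degree and clustering measures then
# identify influential spreaders and isolated pockets of information.
import networkx as nx

retweets = [("userB", "userA"), ("userC", "userA"),
            ("userD", "userA"), ("userD", "userC"), ("userE", "userF")]

G = nx.DiGraph()
G.add_edges_from(retweets)

# Who gets retweeted the most (in-degree)?
print(sorted(G.in_degree(), key=lambda pair: -pair[1])[:3])

# Clustering coefficient on the undirected projection: higher values indicate
# tightly knit groups; low values point to isolated, non-spreading messages.
print(nx.average_clustering(G.to_undirected()))
```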

Finding Topics and Frames by News Article Analysis

Traditionally, researchers read a random sample of articles and then categorize them or analyze their tone, a method called content analysis by human coders. Mining public opinion from news articles is a traditional area of opinion analysis, and through news articles we can measure the direction of public opinion. As computational methods have been combined with content analysis, it has become possible to analyze news articles and reports on a large scale (An & Gower, 2009; Krstajić et al., 2010; Sjøvaag & Stavelin, 2012). Automated techniques such as clustering and classification can be applied to news articles. Typically, by counting word frequencies in a document set and representing each document as a vector, we can apply word co-occurrence analysis (B. Chen, 2009; Li et al., 2013), topic detection (Papadopoulos et al., 2014; C. Wang et al., 2008), and discourse framing analysis (Dimitrova et al., 2005; Gamson, 1989; Semetko & Valkenburg, 2000; Tian & Stewart, 2005). By analyzing news articles, we can understand the discourse constituted in the media and measure the direction of public opinion.
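The sketch below illustrates the bag-of-words pipeline this describes (term counting, vectorization, and simple topic detection with LDA), assuming a recent scikit-learn; the article snippets are hypothetical, and the cited studies use far larger corpora.

```python
# Sketch: vectorize a small set of news snippets and extract topics with LDA.
# Snippets are hypothetical; real analyses use thousands of full articles.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "candidate unveils jobs plan for young voters",
    "unemployment and jobs dominate the debate",
    "scandal over campaign funding deepens",
    "prosecutors widen funding scandal inquiry",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)            # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_terms = [terms[j] for j in topic.argsort()[-4:]]
    print(f"topic {i}:", top_terms)
```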

Understanding Social Media User Behavior

Each social media platform allows different user behaviors to be analyzed. On Twitter in particular, three major activities (retweets, likes, and replies) can be interpreted differently. Researchers (Stieglitz & Dang-Xuan, 2012) and the political blog FiveThirtyEight (Roeder et al., 2017) have suggested a ternary plot to measure the social media presences of some of the most powerful politicians in the United States: tweets toward the top have a higher share of retweets, those toward the bottom right have a higher share of likes, and those toward the bottom left, with a higher share of replies, fall in the “ratio danger zone.” Facebook is a more personalized platform, allowing the analysis of friendships, privacy settings, likes, and favorite pages, and it makes it relatively easy to measure the performance of a Facebook page by tracking likes, page views, reach, and more, although obtaining and using this data is becoming increasingly difficult due to privacy concerns. As mentioned above, many studies analyze different user behaviors on different platforms, for example, YouTube’s likes and liked channels, or Instagram’s likes, replies, and tags.
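A small sketch of the computation underlying such a ternary plot is given below (the counts are hypothetical); the plot itself simply positions each tweet by these three shares on a triangle.

```python
# Sketch: compute the retweet/like/reply shares that a ternary plot positions.
# Engagement counts are hypothetical.
def engagement_shares(retweets, likes, replies):
    total = retweets + likes + replies
    return {"retweet": retweets / total,
            "like": likes / total,
            "reply": replies / total}

# A tweet with many replies relative to likes and retweets ("ratio danger zone"):
print(engagement_shares(retweets=120, likes=300, replies=2500))
```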

Discussion

Opinions on research methods and results that use social media remain divided. There is controversy about the scientific soundness of the research methods and the generalizability of the results, and the claim that election results can be predicted with social media data is not well supported. Despite criticism of the usefulness and predictive power of social media data, social media research continues to be attempted because of advantages such as autonomy, immediacy, and the sheer size of the data.

In this study, we examined the usefulness of social media analysis for studying public opinion formation, its propagation process, and media issues. Researchers have tried to overcome the problems of methodology and representativeness through sampling and data collection efforts, but the emergence of bot and spam accounts has heightened the controversy about the manipulability and reliability of social media data (Chu et al., 2010; Haustein et al., 2016; Morstatter et al., 2016; Ratkiewicz et al., 2011; Shu et al., 2017). The methodological problems that remain have not yet been sufficiently overcome.

Therefore, we propose using social media data to understand agendas rather than for forecasting. In other words, we suggest utilizing the methods of network analysis and news content analysis introduced above. Rather than forecasting election results, it is reasonable to identify which candidates’ campaigns or issues set the agenda through the media, and to identify public interest in that agenda. Because of problems such as data noise and sample bias, it is too risky to predict behavior; instead, it is possible to analyze the issues and agenda of a specific politician or political party. Taking advantage of the benefits of social media data to identify public agendas and measure interest in them will be more relevant and useful for grasping public opinion in the future.


Biographical Notes

Jin-ah KWAK is a Ph.D. candidate at KAIST in Daejeon, South Korea. Big data analysis is one of her major research interests.

She can be reached at KAIST, 291 Daehak-ro, Eoeun-dong, Yuseong-gu, South Korea or by email at <jinah.kwak@kaist.ac.kr.>

Sung Kyum CHO is a professor in the Department of Communication at Chungnam National University and director of the Center for Asian Public Opinion Research & Collaboration Initiative (CAPORCI). He was the first president of the Asian Network for Public Opinion Research (ANPOR). He has also been president of the Korean Association for Survey Research (KASR) and the Korean Society for Journalism and Communication Studies (KSJCS). He is part of the team that conducts the Korean Academic Multimode Open Survey (KAMOS). He is also an associate editor for and publisher of the Asian Journal for Public Opinion Research (AJPOR).

He can be reached at Chungnam National University, Department of Communication 99, Daehak-ro, Yuseong-gu, Daejeon, 305-764 or by email at <skcho@cnu.ac.kr.>


Date of Submission: 2018-08-19

Date of the Review Results: 2018-08-25

Date of the Decision: 2018-08-28