Twitter's drive to catalogue, organize and data mine human thought

Image credit: iStockphoto by Getty Images

Being the 'single largest repository of human thought in the world', in which users share and discuss their interests and opinions on just about every possible topic, Twitter presents a wealth of data that organizations, brands and agencies can potentially leverage, according to Ben Truscello, Asia Pacific Head of Twitter Data for Global Brands and Agencies.

He relates that when his team heads out to talk to clients in the brands and agencies space, “we like asking them the question: what’s possible when you could know anything in the world at any point in time? When you think about the volume of Twitter users and content out there, you can ask that corpus of data a wide range of questions.”

Publicly available Twitter data is provided and licensed to “several ecosystem companies in the world.” According to Truscello, these range from software and analytics companies to companies such as IBM and Salesforce, which integrate Twitter data into their platforms to offer services such as social listening dashboards and customer service solutions, and which serve major brands “to enable them to understand what their customers are saying, what their competitors are announcing, and how to understand their audiences better on Twitter.”

Twitter’s acquisition of GNIP enables brands and agencies to work directly with Twitter, “allowing them to do deeper analytics, sophisticated audience segmentation, and integrate our data with their own proprietary data.”

To learn more about what marketers and data scientists can do with access to this real-time and historical public source of ideas, opinions and happenings, Enterprise Innovation sat down with Truscello and Scott Hendrickson, Principal Data Scientist & Manager Data Services, Twitter Data, who currently leads a team of data scientists and data strategists working with some of the biggest companies in the world. They explained the strategic framework for how brands are leveraging social data today to develop operational insights and new business strategies.

There are currently many ecosystem players across different business niches and categories. Is it fair to say that GNIP is seeking to replace these players by engaging directly with brands and agencies?

Hendrickson: Much of the work we do is cutting-edge in the sense that we’re not necessarily inventing new data science, but we are bringing in applications of data science to social data in a way that builds on top of what most brands are already experiencing from the ecosystem partners. So I think we have yet to ever work with a brand that’s not working with one or more of those ecosystem partners. It’s very rare for a brand to start with raw data. They would mostly start with analytics solutions and want to go deeper. So no, we don’t replace their solutions. Instead, we work with the brand to leverage those solutions and add their internal analytics capabilities to those solutions.

Let’s say I’m Adobe. I have my own internal multi-channel customer engagement solution and on top of that I’m listening on Twitter to whatever hashtags that come out on my product. Is there a way that you guys can combine the very unstructured, free-flowing, very real-time data on Twitter with what my brand has internally for its customer service?

Hendrickson: The use case is this: Adobe is a partner consuming Twitter data and providing a solution to their customers. Adobe is building the solution that combines other parts of the Adobe solution like website monitoring, marketing automation, writing things, etc. with Twitter data insights. So we work closely with Adobe to combine the data to fit the use case that their product is trying to present to their customers.

Truscello: One quick point that I think is always fun to point out is obviously folks think about Twitter as 140 characters. But behind the scenes, and this is where Scott’s team works so to speak, there are about a hundred and fifty data fields that are accessible. It’s also in that space where those kinds of analytics and software providers are turning these data into insights and making it real for a lot of clients.

The client will be able to access all of these metadata through APIs?

Hendrickson: There are challenges in doing so, and that again is where our team helps out. In pushing innovation, in pushing skill sets into our ecosystem and into our brand and customers around data science, around identifying the right tools, around proprietary data integration, etc., we often find ourselves in a consultative role with our ecosystem partners and brand clients as they begin to advance in their capabilities. Our ultimate goal is to help brands find value and insights from Twitter data; we’re aiming toward broader enterprise adoption where Twitter data can provide insights to a wide range of really difficult business questions. Operationalization is a key objective in these engagements.

Twitter has historical data going back to 2006—that is the corpus of Tweets, of human thought as you said. But it has to be corrected against the user base, right? I mean, the first user was probably Jack Dorsey. Obviously that has grown tremendously since then—so how do you correct against that in order to provide insights for a specific brand who wants to know what its historical perception has been like?

Hendrickson: We provide data access along multiple dimensions of shaping the stream, so for example, you can filter by geolocation or by keywords that are matched, or you can filter by pure sampling with just a percentage of the total tweets. Typically when our analytics partners and my team work with brands, we use a combination of a few of our APIs to normalize based on the business case we are working on. We wouldn’t try to say we’ve corrected all the Twitter streams for all the worldwide conversations, but what we would say, very, very carefully, is that we can compare over time the growth of the audience versus the growth of the conversation by that audience on Topic X. The data science discipline around this is something we try to educate our customers and partners on because there are many dimensions to it, and it’s very important for model-building, for deducing trends.
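To make the idea concrete, here is a minimal Python sketch of the three ways of shaping a stream that Hendrickson mentions (keywords, geolocation, percentage sampling), followed by a simple normalization of topic volume against audience size. It is an illustration only and does not use Twitter's or Gnip's actual APIs: the `Tweet` record, the function names and the audience figures are all hypothetical.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    """Hypothetical, simplified tweet record (not Twitter's real schema)."""
    text: str
    year: int
    lat: Optional[float] = None
    lon: Optional[float] = None


def by_keywords(tweets, keywords):
    """Keep tweets containing at least one keyword as a whole word."""
    wanted = {k.lower() for k in keywords}
    return [t for t in tweets if wanted & set(t.text.lower().split())]


def by_bounding_box(tweets, south, west, north, east):
    """Keep geotagged tweets that fall inside a latitude/longitude box."""
    return [
        t for t in tweets
        if t.lat is not None and t.lon is not None
        and south <= t.lat <= north and west <= t.lon <= east
    ]


def by_sample(tweets, fraction, seed=0):
    """Keep roughly `fraction` of the tweets, chosen at random."""
    rng = random.Random(seed)
    return [t for t in tweets if rng.random() < fraction]


def tweets_per_thousand_users(topic_tweets, audience_by_year):
    """Topic tweets per 1,000 users for each year in `audience_by_year`.

    Dividing by audience size separates growth of the conversation
    from growth of the user base.
    """
    counts = Counter(t.year for t in topic_tweets)
    return {
        year: 1000 * counts[year] / users
        for year, users in sorted(audience_by_year.items())
        if users > 0
    }


# Made-up data: raw mentions double while the audience quadruples.
tweets = [
    Tweet("love my new widget", 2010),
    Tweet("widget broke again", 2014),
    Tweet("widget support was great", 2014),
    Tweet("nice weather", 2014, lat=1.3, lon=103.8),
]
topic = by_keywords(tweets, ["widget"])
rates = tweets_per_thousand_users(topic, {2010: 1000, 2014: 4000})
```

In this toy data the raw number of "widget" tweets rises from one to two, yet the rate per thousand users falls, which is the kind of distinction that comparing audience growth with conversation growth is meant to surface.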

There’s so much unstructured and real-time stuff happening with Twitter data. What are some of the challenges you’ve encountered in terms of making sense of this massive amount of data that just keeps flooding through? How do you filter past the sentiment, the sarcasm, the velocity to end up with a credible piece of research?

Hendrickson: Most of our partners are using a combination of the three ways of accessing. And the reason that’s an important answer to your question is that the first step in model building is making sure that you’re analyzing a corpus that represents the audience and conversation that you actually thought it did.

One classic difficult case is the game of Go. In the United States it’s called Go, which is a very common English word, and it’s impossible to find on Twitter. So if you started with that as your filtering mechanism, you would not end up with a conversation about the game of Go. What we end up doing in cases like this is using techniques of text analysis to refine and refine and refine it until we get to the conversation we are looking for. So in the case of Go, we add “stones”, “kyu” and other words associated with Go, and then we start getting a corpus of tweets about the game. Once you’ve done that, you make sure you address the sentiment model to the kinds of conversations people would have in that domain. A sentiment model that’s not trained on the conversation you’re looking for is typically going to be somewhat unreliable, and as a data scientist, I often recommend that we don’t take a generic plus, minus, neutral sentiment and try to predict things with it. Partly it’s an approach. I think we need to put a lot of pressure on data scientists to be very careful that we are answering the question that we are supposed to be answering.
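The refinement loop described for Go can be sketched roughly as follows. This is a hedged illustration, not the method Hendrickson's team actually uses: the function names, the sample tweets, the stopword list and the extra term "board" are all assumptions made for the example. The ambiguous seed word only counts when it appears alongside a known domain term, and the words that co-occur most often in the matched tweets become candidates for the next round.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase a tweet and split it into a set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def matches(text, seed, context_terms):
    """True if the seed word appears together with a domain term."""
    words = tokens(text)
    return seed in words and bool(words & context_terms)


def suggest_terms(tweets, seed, context_terms, stopwords, top_n=3):
    """Rank words that co-occur with the seed in matched tweets.

    The suggestions are candidates for a human to review before
    they are added to the filter for the next refinement pass.
    """
    counts = Counter()
    for text in tweets:
        if matches(text, seed, context_terms):
            counts.update(tokens(text) - context_terms - stopwords - {seed})
    return [word for word, _ in counts.most_common(top_n)]


# Made-up sample tweets; only two are about the game.
sample = [
    "Let's go to the store",
    "Played go tonight, lost by two stones on the board",
    "Go club meetup: bring your own stones and board",
    "go go go team!",
]
stopwords = {"the", "to", "by", "on", "and", "your", "own", "two"}
context = {"stones"}

on_topic = [t for t in sample if matches(t, "go", context)]
candidates = suggest_terms(sample, "go", context, stopwords, top_n=1)
```

Each pass narrows the corpus toward the intended conversation. A sentiment model would then be fitted or checked on that narrowed corpus rather than on generic text.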

Could you talk about some of the challenges you think data science is going to face in the future? How do you see the complexity increasing in the data science profession?

Hendrickson: Well, that’s probably two or three dimensions. At a high level I think that there are the big data challenges, which are velocity and scale. There’s over a billion Tweets every two days, so it’s a lot of tweets coming through and that’s hard to handle. There is a set of technologies growing through a discipline that we’re calling 'big data engineering' or 'big data architecture', things that are helping build up scalable technologies for the large side.

The second kind would be more of the model-building and the amount of computing power that we have, plus the evolution of techniques over the last few years, which has enabled all kinds of amazing results from recommendation engines. And then there’s predictive analytics, and you know there are so many amazing things happening on that side.

I think the other thing you’re bringing up is there really is a need for the profession to take seriously the concerns of doing the work well, and having integrity around knowing the pitfalls and avoiding them. Doing the work for good is also very important—doing the data science work in the service of things that build society.

Twitter users behave in a certain way, and I presume you can analyze all that behavior in a very concrete manner. But the profiles of the users themselves are a bit of a black box, right? I mean, you don’t know if a certain segment of users is already a customer of, let’s say, Brand X or Y. Do you have any way of segmenting the users themselves, of finding out any information about the users, maybe an anonymous way, that helps you segment them?

Hendrickson: So those are usually solutions that the company implements—not Twitter—and that’s partly about permission and explicit customer expectations. We’re very careful with the customer experience and not creating any unpleasant surprises, so in each case we require that the customer gives permission for connecting their personal data to any other internal personal data. And so you will see this happening very successfully in some places, where during the log-in experience you’re asked if you’re connecting your Twitter handle to Acme Widgets’ campaign—whoever Acme Widgets is—and in those cases there are some amazing customer experiences happening.

Truscello: To add to that, quite often the major brands spend a lot of money on primary research, and they have this perspective of their customers that is defined by a certain profile. There are different ways to think about this, depending on your need. But we do spend a lot of time there too. I think the conversational analysis is where there are a lot of ah-hahs and unknowns about your customer base that you can now get because they’re out there actively talking about those things every day. Whether they mention the brand or product specifically often doesn’t matter. What’s more interesting is that they’re talking about the category of your product, and you can discover a lot from there.