You’ve had 20 years of technical experience working in big data for financial institutions and the government. How has the concept of big data changed, and what are some of the key themes you see gaining traction in 2015?
That is a good question. I can’t believe it’s been 20 years… Back in the late 90s, big data referred to maybe a million customers. Yes, there were large transactions processed by mainframes, but data was regional and fragmented. Things didn’t really start to evolve into what we now understand as big data until around 2001. That shift came from the nexus of two events: 9/11 and the dot-com bubble.
With the introduction of the internet, email, and online travel and shopping sites, consumers developed higher expectations of services. For the first time, people could make their own travel arrangements online. That freedom encouraged travel in ways that hadn’t been seen before. It also exposed the inefficiencies of industry systems, especially banking. At that time, if you lived in Florida and had an account with a national bank, it wasn’t uncommon to travel to New York, for instance, and find you couldn’t access your account from a branch there. This didn’t make sense to consumers, but given the growth strategies prevalent in banking at the time, it’s easy to understand how national banks ended up facing the complex challenge of integrating and correlating massive quantities of data from multiple acquisitions of regional banks. When banking was done primarily at the regional level, there was less need to centralize operations, and de-duplicating accounts wasn’t worth the cost. So the popularity of the internet, coupled with the surging growth of national banks, raised consumer expectations and, with them, the demand for online banking. To meet that expectation, banks needed to centralize their operations for a “golden view” of their customers.

The next big shift occurred around 2010, as social media and the internet of things started to create even larger volumes of very different types of data. How you connect and leverage that data has become key as our society grows increasingly digitally active.
As for 9/11, it was a turning point for government security. Known terrorists were entering the country because of name transposition, and that was unacceptable. There was more data available, but there wasn’t good technology to connect the dots, technology that could deal with dirty data or build the keys necessary to truly understand real-world individuals and their relationships. This is why government agencies turned to inventive solutions to solve these very complex problems. As more unstructured and streaming data came on the scene from 2008 onward, agencies continuously adapted to enable correlation of that data. This is necessary to ensure that the best information is used to uncover threats and that data gathered on potentially harmful sources is kept separate from data collected on non-harmful sources. The technology should enable gathering the right information, at the right time, on those who pose a risk, rather than collecting information on everyone.
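To make the name-transposition point concrete, here is a minimal, purely illustrative Python sketch. The names and functions are hypothetical and this is not a description of any agency’s or Novetta’s actual matching logic; it only shows why an exact string comparison misses a transposed name, while even a crude token-normalized key catches it.

```python
# Illustrative sketch only (fictional names; not any agency's or Novetta's
# actual matching logic): exact comparison misses a transposed name,
# while a crude token-normalized key catches it.

def naive_match(a: str, b: str) -> bool:
    """Exact comparison: fails as soon as name order or punctuation differs."""
    return a.strip().lower() == b.strip().lower()

def normalized_key(name: str) -> tuple:
    """Sort the name's tokens so 'Last, First' and 'First Last' yield the same key."""
    tokens = name.replace(",", " ").lower().split()
    return tuple(sorted(tokens))

watchlist_entry = "Doe, John A."   # hypothetical record
travel_record = "John A. Doe"      # hypothetical record, same person

print(naive_match(watchlist_entry, travel_record))                       # False: missed match
print(normalized_key(watchlist_entry) == normalized_key(travel_record))  # True: caught
```

Real-world resolution is far messier than this, of course, which is the point Jenn makes about needing technology that builds keys robust to dirty data.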
What are some of the questions you’re getting from customers you encounter as they implement enterprise big data initiatives to make data work for their organizations?
The question customers most often ask is “How do I determine the relevance of the data?” They’re storing information, but until they can connect it to the business, it isn’t serving a purpose. Organizations have become so afraid to get rid of data because it “might” be useful. My goal in conversations with customers is to shift the focus back to the business and help them define what keeps them up at night, so we can translate that into a use case that data can help solve. That’s where we come in. The second question we get is “How do I determine the quality or accuracy of these unstructured sources?” If you are going to pull data into a BI (business intelligence) model, this is important. How can I be sure of the data points? This is why what we do is so cool. We help measure the quality of not only the entities and sources, but also the relationships. This unique step eliminates ambiguity, which leaves end users confident that they can trust their data.
Are companies truly able to deliver on the promise of Hadoop yet? Sounds like there are some pretty specific challenges along the way.
It depends on what a company wants to do. If it wants to save on ETL costs, sure, companies are seeing those savings today. If it wants more historical trending than it could get from its EDW (enterprise data warehouse), because the warehouse keeps only two to six months of data active, it can do that as well.
But for the more complicated use cases that require integration of very disparate internal and external sources, that road hasn’t been easy. Those that have attempted it have often tried to do the work themselves and quickly found themselves in data mud. Data deconfliction and integration are difficult enough when you are dealing with thousands or millions of records. But when you get into the hundreds of millions and billions, simple name, address, and phone number rules start to fall apart. Those that have tried machine learning know that probabilities and training sets only get you so far. This is where software like Novetta Entity Analytics can really help. Our software cut its teeth on these types of dirty, fragmented, and varied sources, and with the development of data-adaptive rules, it has been able to resolve entities and relationships at a scale and speed that would have taken years using traditional methods.
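To give a feel for why simple rules fall apart at that scale, here is a minimal, hypothetical Python sketch. The record fields and blocking rule are made up and this is not Novetta Entity Analytics’ approach; it only illustrates that brute-force pairwise matching grows quadratically with record count, and that the common shortcut of blocking on an exact key quietly misses matches once the data gets dirty.

```python
# Hypothetical sketch (made-up fields and rules; not Novetta Entity Analytics'
# approach): the cost of brute-force matching, and how exact-key blocking
# silently drops matches on dirty data.

def pairwise_comparisons(n: int) -> int:
    """Number of record pairs a brute-force matcher would have to score."""
    return n * (n - 1) // 2

for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} records -> {pairwise_comparisons(n):,} candidate pairs")
# At a billion records that is roughly 5 * 10**17 pairs: infeasible to score directly.

# A simple blocking key: only records that share the key are ever compared.
def blocking_key(record: dict) -> tuple:
    return (record["last_name"].lower(), record["phone"])

a = {"last_name": "Smith", "phone": "555-0100"}
b = {"last_name": "Smyth", "phone": "5550100"}  # same person, dirty data
print(blocking_key(a) == blocking_key(b))       # False: this pair is never even compared
```

The quadratic blow-up is why matching has to be narrowed somehow, and the dirty-data example is why naive name, address, and phone keys stop being good enough as volumes grow.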
What are the key principles companies need to be mindful of in order to be successful with Hadoop?
The key principle to be mindful of is that a data strategy isn’t just buying a Hadoop distribution and storing data. The key is the same as it is for other parts of the business: define the business case. That could be “I need to reduce the cost of data preparation” or “How do I target the right customers at the right time and stop flooding their inboxes and phones with offers they delete?” Both use cases can be implemented by leveraging a Hadoop distribution, but the applications you need and how they are applied will be very different.
Alright, let’s switch gears a bit. What’s the one tech tool (or three) you can’t live without in your role as Product Director at Novetta?
That’s funny, because I am going to sound like a walking Apple advertisement. I depend heavily on my MacBook Pro so I can compile, test, demo, and collaborate with the development team that is working on Linux. My iPhone, which needs no explanation. And an iPad, to motivate me to exercise at home and while on the road.
Want to hear more of Jenn’s thoughts on big data and Hadoop? Check out her blog series on the features and benefits of Novetta Entity Analytics and Hadoop Analytics, as well as various webinars and podcasts, or if you’re attending Strata + Hadoop World next week in San Jose, stop by booth 1331 and meet her in person. Schedule a demo of Novetta Entity Analytics and you’ll get a free gift from Novetta as a bonus!