Data analytics based on machine learning (ML) is rewriting the rules for how enterprises handle data. Research into machine learning and analytics is already yielding success in turning vast amounts of data, shaped with the help of data scientists, into analytical rules that can spot things that would have escaped human analysis in the past, whether in pursuit of pushing forward genome research or predicting problems with complex machinery.
Now machine learning is beginning to move into the business world. But most organizations haven’t truly grasped how machine learning will change the way they do business—or how it will change the shape of their organizations in the process. Companies are looking to ML to automate processes or to augment humans by assisting them in data-driven tasks. And it’s possible that ML could turn enterprises into vendors—turning lessons learned from their own vast stores of data into algorithms they can license to software and service providers.
But getting there will depend on how machine learning capabilities evolve over the next five years and what implications that evolution has for today’s long-term hiring and recruitment strategies. And nowhere is this more crucial than in unsupervised machine learning, where systems are given vast datasets and told to find the patterns without humans having first figured out what the software needs to look for. Because little human effort is needed before the task begins, unsupervised machine learning scales far better.
David Dittman, director of business intelligence and analytics services at Procter & Gamble, explained that the biggest analytics problem he sees today with other large US companies is that “they are becoming enamored by [machine learning and analytics] technology, while not understanding that they have to build the foundation [for it], because it can be hard, expensive and requires vision.” Instead, Dittman said, companies mistakenly believe that machine learning will reveal the vision for them: “‘Can’t I have artificial intelligence just tell me the answer?’”
The problem is that “artificial intelligence” doesn’t really work that way. ML currently falls into two broad categories: supervised and unsupervised. And neither of these works without having a solid data foundation.
Supervised ML requires humans to create sets of training data and validate the results of the training. Speech recognition is a prime example of this, explained Yisong Yue, assistant professor of computing and mathematics at Caltech. “Speech recognition is trained in a highly supervised way,” said Yue. “You start with gigantic data—asking people to say certain sentences.”
But collecting and classifying enough data for supervised training can be challenging, Yue said. “Imagine how expensive that is, to say all these sentences in a range of ways. [Data scientists] are annotating this stuff left and right. That simply isn’t scalable to every task that you want to solve. There is a fundamental limit to supervised ML.”
Unsupervised machine learning reduces that interaction. The data scientist chooses a presumably massive dataset and essentially tells the software to find the patterns within it, all without humans having to first figure out what the software needs to look for. Because little human effort is needed before the task begins, unsupervised ML scales far better, particularly in terms of the upfront human workload. But the term “unsupervised” can be misleading: a data scientist still needs to choose the data to be examined.
Unsupervised ML software is asked to “find clusters of data that may be interesting, and a human analyzes [those groupings] and decides what to do next,” said Mike Gualtieri, Forrester Research’s vice president and principal analyst for advanced analytics and machine learning. Human analysis is still required to make sense of the groupings of data the software creates.
But the payoffs of unsupervised ML could be much broader. For example, Yue said, unsupervised learning may have applications in medical tasks such as cancer diagnosis. Today, he explained, standard diagnostic efforts involve taking a biopsy and sending it to a lab. The problem is that biopsies—themselves a human-intensive analytics effort—are time-consuming and expensive. And when a doctor and patient need to know right away if it’s cancer, waiting for the biopsy result can be medically hazardous. Today, a radiologist typically will look at the tissue, explained Yue, “and the radiologist makes a prediction—the probability of it containing cancerous tissue.”
With a big enough training data set, this could be an application for supervised machine learning, Yue said. “Suppose we took that dataset—the images of the tissue and those biopsy results—and ran supervised ML analysis.” That would be labor-intensive up front, but it could detect similarities in the images of those that had positive biopsies.
But, Yue asked, what if instead the process was done as an unsupervised learning effort?
“Suppose we had a dataset of images and we had no biopsy results? We can use this to figure out what we can predict with clustering.” Assume that the number of samples was 1,000. The software would group the images and look for all of the similarities and differences, which is basic pattern recognition. “Let’s say it finds 10 such clusters, and suppose I can only afford to run 10 biopsies. We could choose to test just one from each cluster,” Yue said. “This is just step one in a long sequence of steps, of course, as it looks at multiple types of cancer.”
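The approach Yue describes, grouping unlabeled samples and then testing one from each group, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two-number "feature vectors" are synthetic stand-ins for image features, plain k-means is swapped in as the clustering method (Yue does not name one), and the helper names `kmeans` and `pick_representatives` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iterations=25, seed=0):
    """Plain k-means (an assumed stand-in for whatever clustering is used)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, assign(points, centroids)

def pick_representatives(points, centroids, labels):
    """For each non-empty cluster, pick the member closest to its centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            reps[c] = min(members, key=lambda i: sq_dist(points[i], centroid))
    return reps

# Synthetic data: three loose groups of hypothetical 2-D image features.
data_rng = random.Random(42)
group_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
          for cx, cy in group_centers for _ in range(40)]

centroids, labels = kmeans(points, k=3)
to_biopsy = pick_representatives(points, centroids, labels)
for cluster, index in sorted(to_biopsy.items()):
    print(f"cluster {cluster}: send sample #{index} for biopsy")
```

The software only proposes the groups. Which sample from each cluster gets tested, and what the clusters turn out to mean, remain decisions for a human.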
Guider versus decider
Unsupervised learning still needs a human being to assign a value to the clusters or patterns of data it finds, so it’s not necessarily ready for totally hands-off tasks. Rather, it is currently better suited to enhance the performance of humans by highlighting patterns of data that might be of interest. But there are places where that may soon change—driven largely by the quality and quantity of data.
“I think, right now, that people are jumping to automation when they should be focused on augmenting their existing decision process,” said Dittman. “Five years from now, we’ll have the proper data assets and then you’ll want more automation and less augmentation. But not yet. Today, there is a lack of usable data for machine learning. It’s not granular enough, not broad enough.”
Even as ML data analytics become more sophisticated, it’s not yet clear how that will change the shape of companies’ IT organizations. Forrester’s Gualtieri anticipates a reduction in the need for data scientists five years from now, in much the same way that Web developers who created webpages from scratch were needed far more in 1995 than in 2000, once so many webpage functions had been automated and sold as modular scripts. A similar shift is likely in machine learning, he suggested, as software and service providers begin to offer application programming interfaces to commercial machine learning platforms.
Gualtieri foresees a simple change in the enterprise IT build-or-buy model. “Today, you’re going to make a build decision and hire more data scientists,” he explained. “As these APIs enter the market, it will move to ‘buy’ as opposed to ‘build.’” He added that “we are seeing the beginnings of this right now.” A couple of examples are Clarifai, which can search through video for a particular moment (such as looking at thousands of wedding videos and learning to recognize the ring ceremony or the “you may kiss the bride” moment), and Affectiva, which tries to determine someone’s mood from an image.
Dittman agrees with Gualtieri that companies will likely create many specialized scripts automating many ML tasks. But he disagrees that this will result in fewer computer science jobs in five years.
“If you look at the number of practicing data scientists, that will sharply increase, but it will increase far slower than the digitization of technology, as [ML] moves into more and more whitespaces,” Dittman explained. “Consider the open source trend and the fact that data-scientist tools are going to start to get easier and easier to use, moving from code generation to code reuse.”
Caltech’s Yue argued that demand for data scientists will continue to rise as ML successes beget more ML attempts. And as the technology improves, he explained, more and more units in a business will be able to take advantage of ML, which means a need for far more data scientists to write those programs in the first place.
From consumer to provider
Part of what may drive a continuing demand for data scientists is the hunger for data to make ML more effective. Gualtieri sees some enterprises—roughly five years from now—also playing the role of vendor. “Boeing may decide to be that provider of domain-specific machine learning and sell [those modules] to suppliers who could then become customers,” he said.
P&G’s Dittman sees both ends of the analytics equation—data and the ML code—being highly sellable, potentially as a new major revenue source for enterprises. “Companies are going to start monetizing their data,” he explained. “The data industry is going to explode. Data is absolutely exploding, but there is a lack of a data strategy. Getting the right data that you need for your business case, that tends to be the challenge.”
But Yue has a different concern. “Five years from now, [ML] will naturally come into conflict with legal issues. We have strong laws about discrimination, protected classes,” he said. “What if you use data algorithms to decide who to make loans to? How do you know that it’s not discriminatory? It’s a question for policy makers.”
Yue offered the example of software finding a correlation between consumers defaulting on their loans and those who have blue eyes. The software could decide to scan every customer’s eye color and use that information to decide whether or not to approve a loan. “If a human made that decision, it would be considered discriminatory,” Yue said.
That legal issue speaks to the core role a data analyst plays in unsupervised ML. The software’s job is to find the links, but it’s ostensibly the human who decides what to do about those links. One way or the other, HR is going to need to recruit a lot more data scientists for quite some time.