Updating the Definition of ‘Data Scientist’ as Machine Learning Evolves


By Bernardo Lustosa, Partner, cofounder, and COO at ClearSale

In the early days of machine learning, hiring good statisticians was the key challenge for AI projects. Now, machine learning has evolved from its early focus on statistics to more emphasis on computation. As the process of building algorithms has become simpler and the applications for AI technology have grown, human resources professionals in AI face a new challenge. Not only are data scientists in short supply, but what makes a successful data scientist has changed.

Divergence between statistical models and neural networks

As recently as six years ago, there were minimal differences between statistical models (usually logistic regressions) and neural networks. The neural network had a slightly larger separation capacity (statistical performance) at the cost of being a black box. Since they had similar potential, the choice of whether to use a neural network or a statistical model was determined by the requirements of each scenario and by the type of professional available to create the algorithm.

More recently, though, neural networks have evolved to support many layers. This deep learning allows for, among other things, effective and novel exploitation of unstructured data such as text, voice, images, and videos. Increased processing capacity, image identifiers, simultaneous translators, text interpreters, and other innovations have set neural networks further apart from statistical models. With this evolution comes the need for data scientists with new skills.

Unchanging elements of building algorithms

Despite the changes in algorithm structures and capabilities, the process of constructing high-quality predictive models still follows a series of steps that hasn’t changed much. More important than the fit and method used is the ability to perform each step of this process efficiently and creatively.

Field interviews. Data scientists are not usually experts in the subject they are working on. Instead, they are experts on the accuracy and precision required to create the algorithms for various corporate or academic decision-making processes. Today, however, data scientists are expected to understand the problem the algorithm is meant to solve, so interviews with subject matter experts focused on that particular problem are essential. Data scientists can now work on neural networks that span a broad range of knowledge areas, from predicting the mortality of African butterflies to deciding when and where to publish advertising for seniors. This means that today’s data scientists must be able and eager to learn from experts on many subjects.

Understanding the problem. Each prediction hinges on a wealth of factors, all of which the data scientist must know about in order to understand the causal relationships among them. For example, to predict which applicants will default on their loans, the data scientist must know to ask questions such as:

  • Why do people default?
  • Are they planning to default when they apply?
  • Do defaulters have outsize debt relative to their income?
  • Is there fraud in the application process?
  • Is there sales pressure to apply for the loan?

These are some of the many questions to ask on this topic, and there are long lists of questions for every machine learning process. A data scientist who only wants to create algorithms without talking in depth with those involved in the phenomenon being explored will have a limited ability to create effective algorithms.

Identifying relevant information. As a data scientist sifts through the answers to these types of questions, he or she must also be skilled at picking out the information that may explain the phenomenon. A well-trained, inquisitive data scientist will also seek out related data online via search, crawler, and API to pinpoint the most relevant predictive factors.

Sampling. Statistical knowledge (on top of computational knowledge, experience, and judgment) matters for the definition of the response variable, the separation of the database, the certification of past data use, the separation of data among adjustment (training), validation, and testing sets, and other sampling steps. However, the computational approach supports the use of the ever-larger databases that are required for the construction of complex algorithms. Therefore, both statistical and computational skill sets are a must for today’s data scientists.
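
To make the sampling step concrete, here is a minimal sketch of separating a dataset into adjustment (training), validation, and testing sets. It is an illustration only, not a method described in the article: the function name split_dataset, the 60/20/20 proportions, and the fixed random seed are all assumptions chosen for the example, and real projects often need stratified or time-based splits instead of a simple shuffle.

```python
import random


def split_dataset(records, train_pct=60, valid_pct=20, seed=0):
    """Shuffle records and split them into train, validation, and test lists.

    Hypothetical helper for illustration; percentages are whole numbers,
    and whatever is left after train and validation goes to the test set.
    """
    if train_pct <= 0 or valid_pct < 0 or train_pct + valid_pct >= 100:
        raise ValueError("percentages must leave room for a test set")

    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility

    n = len(shuffled)
    n_train = n * train_pct // 100
    n_valid = n * valid_pct // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Example usage with placeholder record IDs (assumed data, not from the article).
records = list(range(100))
train, valid, test = split_dataset(records)
print(len(train), len(valid), len(test))
```

Keeping the test set untouched until the very end is what lets it serve as an honest check on a model that was fitted on the training set and tuned on the validation set.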

Read the source article in VentureBeat.