Product Management Archives | Life Around Data

Over-the-Wall Data Science and How to Avoid Its Pitfalls

Sergei Izrailev — Sat, 03 Nov 2018 13:27:24 +0000

Over-the-wall data science is a common organizational pattern for deploying data science team output to production systems. A data scientist develops an algorithm, a model, or a machine learning pipeline, and then an engineer, often from another team, is responsible for putting the data scientist’s code in production.

Such a pattern of development attempts to solve for the following:

Quality: We want production code to be of high quality and maintained by engineering teams. Since most data scientists are not great software engineers, they are not expected to write end-to-end production-quality code.
Resource Allocation: Building and maintaining production systems requires special expertise, and data scientists can contribute more value solving problems for which they were trained rather than spend the time acquiring such expertise.
Skills: The programming language used in production may be different from what the data scientist is normally using.

However, there are numerous pitfalls in the over-the-wall development pattern that can be avoided with proper planning and resourcing.

What is over-the-wall data science?

A data scientist writes some code and spends a lot of time getting it to behave correctly. For example, the code may assemble data in a certain way and build a machine learning model that performs well on test data. Getting to this point is where data scientists spend most of their time, iterating over the code and the data. The work product could be a set of scripts, or a Jupyter or RStudio notebook containing code snippets, documentation, and reproducible test results. In the extreme, the data scientist produces a document detailing the algorithm, using mathematical formulas and references to library calls, and doesn’t even give any code to the engineering team.

At this point, the code is thrown over the wall to Engineering.

An engineer is then tasked with productionizing the data scientist’s code. If the data scientist used R, and the production applications use Java, that could be a real challenge that in the worst case leads to rewriting everything in a different language. Even in a common and much simpler case of Python on both sides, the engineer may want to rewrite the code to satisfy coding standards, add tests, optimize it for performance, etc. As a result, the ownership of the production code lies with the engineer, and the data scientist can’t modify it.

This is, of course, an oversimplification, and there are many variations of such a process.

What is wrong with having a wall?

Let’s assume that the engineer successfully built the new code, the data scientist compared its results to the results of their own code, and the new code is released to production. Time goes by, and the data scientist needs to change something in the algorithm. The engineer has meanwhile moved on to other projects. Changing the algorithm in production becomes a lengthy process that involves waiting for an engineer (hopefully the same one) to become available. In many cases, after going through the process a couple of times, the data scientist simply gives up, and only critical updates are ever released.

Such interaction between data science and engineering frustrates data scientists because it makes it hard to make changes and strips them of ownership of the final code. It also makes it very difficult to troubleshoot production issues. It is also frustrating for engineers because they feel that they are excluded from the original design, don’t participate in the most interesting part of the project, and have to fix someone else’s code. The frustration on both sides makes the whole process even more difficult.

Breaking down the wall between data science and engineering

The need for over-the-wall data science can be eliminated entirely if data scientists are self-sufficient and can safely deploy their own code to production. This can be achieved by minimizing the footprint of the data scientist’s code on production systems and by making engineers part of the AI system design and development process upfront. AI system development is a team sport, and both engineers and data scientists are required for success. Hiring and resource allocation must take that into account.

Make the teams cross-functional

Involving engineering early in the data science projects avoids the “us” and “them” mentality, makes the product much better, and encourages knowledge sharing. Even when a full cross-functional team of engineers and data scientists is not practical, forming a project team working together towards a common goal solves most of the problems of over-the-wall data science.

Expect data scientists to become better engineers

In the end, data scientists should own the logic of the AI code in the production application, and that logic needs to be isolated in the application so that data scientists can modify it themselves. In order to do so, data scientists must follow the same best practices as engineers. For example, writing unit and integration tests may feel like a lot of overhead for data scientists at first; however, the value of knowing that your code still works after you’ve made a change soon overcomes that feeling. Also, engineers must be part of the data scientists’ code review process to make sure the code is of production quality and there are no scalability or other issues.
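As a minimal sketch of what isolated, tested model logic might look like, the example below keeps the scoring rule in one pure function with two unit tests beside it. The ChurnFeatures record, the churn_score function, and the weights are all hypothetical and chosen only for illustration; they do not come from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChurnFeatures:
    """Hypothetical input record; the field names are illustrative only."""
    days_since_last_login: int
    support_tickets_90d: int


def churn_score(features: ChurnFeatures) -> float:
    """Model logic owned by the data scientist, free of I/O and framework code.

    The weights are placeholders, not a trained model.
    """
    raw = 0.02 * features.days_since_last_login + 0.1 * features.support_tickets_90d
    # Clamp to [0, 1] so downstream code can treat the result as a score.
    return max(0.0, min(1.0, raw))


def test_score_is_bounded() -> None:
    # A change to the weights must never push the score outside [0, 1].
    assert churn_score(ChurnFeatures(0, 0)) == 0.0
    assert churn_score(ChurnFeatures(10_000, 50)) == 1.0


def test_score_increases_with_inactivity() -> None:
    recent = churn_score(ChurnFeatures(1, 0))
    stale = churn_score(ChurnFeatures(30, 0))
    assert stale > recent
```

Because the function takes plain values and returns a plain value, the data scientist can change the weights and rerun the tests without touching the surrounding application code.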

Provide production tooling for data scientists

Engineers should build production-ready reusable components and wrappers, testing, deployment, and monitoring tools, as well as infrastructure and automation for data science related code. Data scientists can then focus on a much smaller portion of the code containing the main logic of the AI application. When the tooling is not in place, data scientists tend to spend much of their time on building the tools themselves.
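One possible shape for such a reusable wrapper is sketched below, assuming a Python production stack. The production_safe decorator, the logger name, and the predict_ratio function are hypothetical names invented for this example. The decorator adds timing, logging, and a fallback value around any prediction function, so the data scientist writes only the function body.

```python
import logging
import time
from functools import wraps
from typing import Callable

logger = logging.getLogger("model_serving")  # hypothetical logger name


def production_safe(fallback: float) -> Callable[[Callable[..., float]], Callable[..., float]]:
    """Wrap a prediction function with timing, logging, and a fallback value."""

    def decorator(predict: Callable[..., float]) -> Callable[..., float]:
        @wraps(predict)
        def wrapper(*args, **kwargs) -> float:
            start = time.perf_counter()
            try:
                return predict(*args, **kwargs)
            except Exception:
                # Log the full traceback, then degrade gracefully.
                logger.exception("prediction failed in %s; returning fallback", predict.__name__)
                return fallback
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                logger.info("%s took %.2f ms", predict.__name__, elapsed_ms)

        return wrapper

    return decorator


@production_safe(fallback=0.0)
def predict_ratio(clicks: int, impressions: int) -> float:
    # The only part the data scientist needs to own and modify.
    return clicks / impressions
```

Here a call such as predict_ratio(1, 0) would raise ZeroDivisionError inside the function, and the wrapper would log it and return the fallback instead of failing the request. Whether a silent fallback is acceptable is a design decision that depends on the application.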

Avoid rewriting the code in another language

The production environment is one of the constraints on the types of acceptable machine learning packages, algorithms, and languages. This constraint has to be enforced at the beginning of the project to avoid rewrites. A number of companies are offering production-oriented data science platforms and AI model deployment strategies both in open source and commercial products. These products, such as TensorFlow and H2O.ai, help solve the problem of a production environment being very different from that normally used by data scientists.

Images by MabelAmber and Wokadandapix on Pixabay

The post Over-the-Wall Data Science and How to Avoid Its Pitfalls by Sergei Izrailev appeared first on Life Around Data.

AI Systems Development Cycle And How It’s Different From Other Software

Sergei Izrailev — Thu, 27 Sep 2018 13:45:20 +0000

Most software development projects go through the same four phases: discovery, research, prototype, and production. Usually, the research and prototype stages are fairly light because experienced engineers can design a solution and, when necessary, test their ideas with a quick proof-of-concept (PoC). The AI systems development cycle, on the other hand, depends heavily on research to find whether we can actually build a machine learning model that performs well. In addition, putting an AI system in production operationally involves much more than building the models. Therefore, a working prototype is typically required for AI systems in order to have the confidence that the system will work end-to-end.

Let’s look at each of the four stages of the AI systems development cycle in more detail.

Discovery

The discovery phase is responsible for defining the project: what are its goals, what is the business problem it is solving, why solving it is important, what is the value of solving it, what are the constraints, and how will we know that we’ve succeeded. Frequently, such information is captured in a Product Requirements Document (PRD) or a similar document, defining the “what” of the project. Some aspects of discovery are described in another article on reducing the risk of machine learning projects.

For AI systems, the feasibility and quality of a solution to the problem at hand are usually not obvious from the start. Carefully defining the constraints can dramatically narrow down the choice of technology and algorithms. However, creating new machine learning models still largely remains a task for an expert. As a result, a research stage is needed in order to find whether or not an AI solution is possible, as well as to estimate its value and cost.

Research

The research phase answers in detail how we are going to solve the business problem. Relevant documentation of a typical software project may include a system design, various options considered during design and their trade-offs, specifications, etc., with enough information for an engineering team to build the software.

The research phase of the AI systems development cycle is highly iterative, often manual, and heavy on visualizations and analytics. First, we need to check whether we can solve the problem with machine learning given the available data and constraints established in the discovery phase. We collect the data, extract it, and transform it into inputs to a machine learning algorithm. We usually build many variants of a model, experiment with input data and algorithms, and test and evaluate the models. Then we frequently go back to collecting and transforming the data. This cycle stops when, after a few (and sometimes many) iterations at every step, there’s a model that makes predictions with acceptable accuracy. Information gathered during this process is passed back into the discovery phase.
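The build, evaluate, and repeat loop described above can be sketched in a few lines. This is a deliberately tiny illustration, not a real workflow: the "model" is a single threshold on one feature, the validation data is made up, and the function names and the target accuracy are assumptions introduced for this example.

```python
from typing import Callable, Optional, Sequence

Example = tuple[float, int]  # (feature value, label); toy data for illustration


def accuracy(model: Callable[[float], int], data: Sequence[Example]) -> float:
    """Fraction of examples for which the model predicts the correct label."""
    correct = sum(1 for x, y in data if model(x) == y)
    return correct / len(data)


def make_threshold_model(threshold: float) -> Callable[[float], int]:
    """One model variant: predict 1 when the feature reaches the threshold."""
    return lambda x: 1 if x >= threshold else 0


def search_models(
    candidates: Sequence[float],
    validation: Sequence[Example],
    target_accuracy: float,
) -> tuple[Optional[float], float]:
    """Try variants in order and stop once one is accurate enough."""
    best_threshold, best_acc = None, -1.0
    for threshold in candidates:
        acc = accuracy(make_threshold_model(threshold), validation)
        if acc > best_acc:
            best_threshold, best_acc = threshold, acc
        if best_acc >= target_accuracy:
            break  # acceptable accuracy reached; stop iterating
    return best_threshold, best_acc


validation_set = [(0.1, 0), (0.2, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]
best = search_models([0.1, 0.3, 0.5, 0.8], validation_set, target_accuracy=0.95)
```

In a real project, each pass through the loop may also change the data collection and transformation steps, which this sketch leaves out.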

Prototype

A prototype for an AI system is proof that a system reflecting the production design, without all the bells and whistles, can run end-to-end as code and produce predictions within the predefined constraints. Sometimes, the output of the research phase is close to a prototype, after a little clean-up and converting some manual steps into scripts. As we are getting closer to production, it is better to keep the prototype code at production quality and involve engineers who will be working on the production AI system.

Note that the goal of the prototype stage is not for a data scientist to create something that will then be rewritten by an engineer in a different language. Often referred to as “over the wall” development, such a pattern is extremely inefficient and should be avoided.

Production

The production stage of the AI systems development cycle is responsible for the final system that is able to reliably build, deploy and operate machine learning models. The reliability requirements lead to a plethora of components that can easily take much of the time and effort of the whole project. Such components include testing, validation, model tracking and versioning, deployment, automation, logging, monitoring, alerting, and error handling, to name a few.

Summary

The AI systems development cycle has the same stages as most other software. It differs in the much higher proportion of the effort allocated to the research and prototype stages. The operational components of AI systems at the production stage may also require much effort, especially in the first iteration of the whole cycle. Once the first AI system is in production, the frameworks used for operationalizing machine learning can be reused and improved on in the subsequent cycles.


How Can We Reduce the Risk of Machine Learning Projects?

Sergei Izrailev — Thu, 20 Sep 2018 13:46:22 +0000

The overall risk of machine learning projects in a business can be relatively high because they tend to be long and complex. Before embarking on building a machine learning solution, you need to decide what business problem you are trying to solve and how machine learning fits in the solution. It’s a thought experiment, in which a magic black box provides a perfect prediction of whatever it is that needs to be predicted in order to solve a business problem. Imagine you have it. Then answer the following questions, which help identify and manage the risks.

Is this the right problem to solve – right now?

This is by far the most important question, and it lies squarely in the product management and business area. To answer it effectively, strong communication must be established between the data science team and the product and business teams. On one hand, it is important to have a process that enables ideas generated in the engineering world to be validated quickly with customers. This avoids situations in which a product feature is developed on the premise of “Wouldn’t it be cool if…” and there’s no demand for the feature. On the other hand, there needs to be an efficient way to prototype and validate solutions for inbound ideas and requests from customers.

If we solved the problem, what would be the value?

The value is the net difference between the benefit of having the feature and the cost of building it. While it may be hard to estimate either of these quantities accurately, some idea of the financial or other impact should provide guidance on whether the benefit is worth the risk of investing in the project. The company is exposed to a potentially large opportunity and financial risk, for example, when a product feature is built without an estimate of either what it would cost to build and maintain, or what revenue impact it is expected to have.
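A rough version of this estimate can be written down as simple arithmetic. The sketch below assumes that the cost includes both building and maintaining the feature over a chosen time horizon; the function name, its parameters, and every figure are hypothetical and serve only to show the calculation.

```python
def net_value(annual_benefit: int, build_cost: int, annual_maintenance: int, years: int) -> int:
    """Benefit minus cost over a fixed horizon, in the same currency units."""
    total_benefit = annual_benefit * years
    total_cost = build_cost + annual_maintenance * years
    return total_benefit - total_cost


# Made-up figures: 500k benefit per year, 300k to build, 100k per year to maintain.
two_year_value = net_value(
    annual_benefit=500_000,
    build_cost=300_000,
    annual_maintenance=100_000,
    years=2,
)  # 1,000,000 - 500,000 = 500,000
```

Even with wide error bars on each input, running the numbers for a pessimistic and an optimistic scenario gives a sense of whether the project is worth the risk.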

Is machine learning the right tool to solve the problem?

Building a production machine learning system is hard and can be expensive. In many cases, an outcome that is good enough to solve the business problem can be achieved with simpler methods that are much easier to implement. During a panel discussion at H2O World 2015, Monica Rogati made this point beautifully: “My favorite data science algorithm is division because you can actually get very far with just division…” Understanding that using machine learning is not the goal and is not necessarily an appropriate tool can sometimes be disappointing to data scientists. However, the satisfaction of solving a real problem and having a business impact easily outweighs this disappointment.
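As one illustration of how far division alone can go, the sketch below computes a conversion rate per segment, which is often a reasonable first baseline before any model is built. The conversion_rates function, the segment names, and the events are all made up for this example.

```python
from collections import defaultdict
from typing import Iterable


def conversion_rates(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Conversions divided by exposures, per segment."""
    shown: dict[str, int] = defaultdict(int)
    converted: dict[str, int] = defaultdict(int)
    for segment, did_convert in events:
        shown[segment] += 1
        converted[segment] += int(did_convert)
    return {segment: converted[segment] / shown[segment] for segment in shown}


# Toy event log: (segment, whether the user converted).
events = [
    ("email", True), ("email", False), ("email", True),
    ("ads", False), ("ads", False), ("ads", True),
]
rates = conversion_rates(events)
```

If ranking segments by such a ratio already answers the business question, a machine learning model may not be needed at all.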

What are the constraints?

Any project has its constraints, and machine learning systems are not an exception. The starting point is the available people and their skills. If the existing skills do not match the task at hand, the project will depend heavily on the ability to train, hire or outsource in order to fill the gaps. More on this in my post Four machine learning skills of a successful AI team.

Further, if machine learning models have to run in a production environment, the data scientists who build the models need to understand upfront the existing production environment architecture, the technology stack, and the requirements for scale, which typically limit the choice of programming language and algorithms that are acceptable in production. Scalability of machine learning systems is a large topic in itself, and I leave it for another blog post.

Summary

By solving the right problem, understanding the value of the solution, being confident that machine learning is the right tool to solve the problem, and defining the constraints upfront, we drastically reduce the overall risk of machine learning projects. Answering the questions above helps avoid wasted funds, time, and effort, as well as frustration across the organization. Not all questions can be readily answered, and some discovery with customers and proof-of-concept projects may be needed. While it may appear to be unnecessary extra work, the resulting clarity about the project is so powerful that it is worth the investment.

Photo by rawpixel on Unsplash
