Big Data, Data Science and Machine Learning: A Manager’s Conceptual Map

Last month, the IT manager of a mid-sized retail chain sat through three separate presentations at a board meeting. The first was titled ‘big data strategy,’ the second proposed a ‘data science infrastructure,’ and the third recommended ‘customer segmentation through machine learning.’ Each came with its own budget line, its own team requirements and its own software licensing costs. After the meeting, the manager asked one question: ‘Which one sits inside which?’ Without a clear answer to that question, no investment decision can be sound.

The most reliable way to build a conceptual map is to use a three-part framework: raw material, discipline and method. Big data is the raw material. It describes the mass of unprocessed data accumulating across a company’s servers, log files, sales records and, increasingly, sensor outputs. For data to qualify as ‘big,’ three properties need to be present simultaneously: volume (at the terabyte scale), velocity (real-time or high-frequency generation) and variety (structured table data alongside unstructured types such as text, images and logs). Big data on its own is not a solution; it is a logistics problem that needs solving. Storing, moving and making this data queryable demands serious infrastructure investment before any analysis can begin.

Data science is the discipline that processes this raw material. It combines statistics, computer science and domain expertise to extract meaningful patterns and actionable insights from raw data. A data scientist queries, cleans, models and interprets the data that the big data infrastructure stores and serves. The output can be a report, a predictive model or a decision-support tool. What separates data science from classical business intelligence is that it does not only look backward; it uses patterns to produce probabilistic inferences about the future. This is why a data science team cannot be built by purchasing software alone: it requires finding people who understand both statistics and programming and who can also read a business process.

Machine learning is one of the methods that data science uses. It is not the discipline itself but a specific set of approaches within the toolbox. In classical programming, rules are written by humans: ‘If a customer has not placed an order in 30 days, flag as at-risk.’ In machine learning, the rule is learned from data: the system examines historical records and calculates for itself which pattern correlates with an at-risk customer. This approach often outperforms rule-based systems when data is variable and multi-dimensional. However, building such a model first requires a sufficient volume of clean, labelled data, which immediately reveals the dependency on big data infrastructure and the data science discipline.
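The contrast between a hand-written rule and a learned one can be sketched in a few lines of Python. This is a deliberately simplified illustration, not a production method: the `history` records are invented for the example, the function names are hypothetical, and the "learning" step is only a search for the single cut-off that best fits the historical labels, whereas real machine learning models weigh many variables at once.

```python
# Hypothetical labelled history: (days since last order, did the customer later leave?)
history = [
    (5, False), (12, False), (18, False), (21, False),
    (25, True), (28, True), (33, True), (40, True),
]


def rule_based_at_risk(days_since_last_order: int) -> bool:
    """Classical programming: a human chose the 30-day cut-off."""
    return days_since_last_order >= 30


def learn_threshold(records) -> int:
    """Learn the cut-off from data: pick the value with the fewest
    misclassified historical records."""
    candidates = sorted({days for days, _ in records})
    best_threshold, best_errors = candidates[0], len(records) + 1
    for threshold in candidates:
        errors = sum((days >= threshold) != left for days, left in records)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


learned = learn_threshold(history)
print("learned cut-off (days):", learned)
print("hand-written rule flags a 28-day customer:", rule_based_at_risk(28))
print("learned rule flags a 28-day customer:", 28 >= learned)
```

On this invented data the learned cut-off sits below the hand-written 30 days, so the learned rule catches customers the fixed rule would miss. The sketch also shows the dependency described above: without clean, labelled records in `history`, there is nothing to learn from.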

The hierarchy reads as follows: without big data infrastructure, the data science discipline has no raw material to work with; without the data science discipline, machine learning methods cannot be applied correctly. This sequence also defines investment priority. Trying to build models without infrastructure is like drawing up a production plan before the factory exists. Many mid-sized businesses in Turkey are making exactly this mistake right now: purchasing a machine learning tool before a clean data foundation or the competency to interpret results is in place.

The practical challenge is the skills gap. The data scientist profile is only beginning to take shape locally; universities are just starting to produce graduates with the right combination of statistical and programming skills. Companies are generally bridging this gap in one of two ways: bringing in consultants or retraining existing statisticians and software developers. Both paths carry high cost and long lead times. On the infrastructure side, cloud computing options are making the required storage and processing capacity more accessible than it was even two years ago, but managing that infrastructure still demands real technical skill. When calculating total cost of ownership, the licensing and hardware line items need to be accompanied by staffing and training costs; otherwise the ROI projection will be misleading from the start.
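The effect of leaving people costs out of the calculation can be shown with simple arithmetic. The sketch below is illustrative only: every figure is a made-up placeholder, and the `CostLines` class and `roi` function are hypothetical names introduced for this example, not part of any standard costing tool.

```python
from dataclasses import dataclass


@dataclass
class CostLines:
    """Cost line items for one project (all figures are hypothetical)."""
    licensing: float
    hardware: float
    staffing: float
    training: float

    def technology_only(self) -> float:
        # What a proposal shows when it lists only software and hardware.
        return self.licensing + self.hardware

    def total(self) -> float:
        # Total cost of ownership including the people side.
        return self.technology_only() + self.staffing + self.training


def roi(expected_benefit: float, cost: float) -> float:
    """Simple return on investment: net gain divided by cost."""
    return (expected_benefit - cost) / cost


costs = CostLines(licensing=100_000, hardware=50_000,
                  staffing=200_000, training=50_000)
expected_benefit = 500_000  # assumed benefit, for illustration only

print(f"ROI on technology costs only: {roi(expected_benefit, costs.technology_only()):.0%}")
print(f"ROI on total cost of ownership: {roi(expected_benefit, costs.total()):.0%}")
```

With these placeholder numbers, the same expected benefit looks far more attractive when staffing and training are omitted than when they are included, which is the distortion the paragraph above warns about.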

The concrete decision criterion for any manager is this: take every proposal that arrives under the ‘big data’ label and separate it into infrastructure, discipline and method layers. Which layer are you actually buying, and which layer is still missing? If the infrastructure is not solid, the data science investment has nothing to stand on. If data science competency is absent, the machine learning tool sits unused. Companies that plan to build all three layers in a coordinated sequence, starting with a pilot project and scaling deliberately, will gain a real operational advantage over competitors who spend large budgets and produce no usable output.
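The layer check described above can be written down as a short sketch. The layer names follow the article's framework; the function name and the example inputs are hypothetical, and a real assessment would of course involve judgment rather than a lookup.

```python
# Ordered from foundation to top: big data, data science, machine learning.
LAYERS = ["infrastructure", "discipline", "method"]


def missing_prerequisites(proposal_layer: str, layers_in_place: set) -> list:
    """Return the lower layers a proposal depends on that are not yet in place."""
    position = LAYERS.index(proposal_layer)
    return [layer for layer in LAYERS[:position] if layer not in layers_in_place]


# Hypothetical case: a machine learning tool is proposed,
# but only the data infrastructure exists so far.
gaps = missing_prerequisites("method", {"infrastructure"})
print("missing before this purchase makes sense:", gaps)
```

An empty result means the proposal rests on layers that already exist; a non-empty one names what has to be built first.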

This article was originally written in Turkish by Gökhan MERCANOĞLU on July 16, 2012 and has been automatically translated into English and other languages using machine translation.