October 2, 2019

Six Must-Do Steps for Optimal Machine Learning and Data Science


Abraham Maslow’s hierarchy of human needs, depicted in the form of a pyramid, was first proposed in 1943 and laid out what motivates humans: basic necessities like nutrition and safety first, then “belongingness” and a strong sense of self, and finally self-actualization, that is, reaching one’s full potential. This last phase is the most difficult for humans to achieve and requires the most work. Reaching “full potential” in leveraging machine learning and artificial intelligence (AI) for data analysis is just as difficult, and Maslow’s hierarchy, slightly re-imagined for solving business problems, is the perfect template to answer the question, “How do we use AI and machine learning to improve our business?”


Data Collection
At the bottom of the data science pyramid is the most basic step: data collection. Collecting is the process of gathering data from instruments, logs, websites, email engagements, sensors (think the heart rate monitor on your Apple Watch), external data, or even user-generated content like survey answers. Data collection is the foundation of our pyramid because the right, relevant dataset is what makes machine learning possible. Consider the following questions when thinking about data collection:

  • What data do you need and for what purpose?
  • Have profiles of all audience members been built that classify them by channel interest, e.g., is the recipient more disposed to receiving email than banner ads?
    • Are assets properly tagged in order to make this determination?
  • Have the appropriate third-party agreements been put in place if data is being collected from another entity you don’t control?
  • Is there a plan in place for how to bring together the data being gathered directly and from outside sources?
  • Has a measurement plan to track and report the most important and relevant key performance indicators been developed in conjunction with preparing for data collection?

After determining what to collect comes a big decision: how does data get captured or flow through the system? Where is it stored? How much work must be done to clean the data? Is it easy to get to? Is it easy to analyze?

Exploration and Transformation
When data is available, the next step is exploration. This is the phase most often misunderstood by those outside of analysis; the thinking is, since the data has been gathered, why can’t we immediately dive in and solve problems? The answer is that exploration includes data cleaning: uncovering and fixing bad or unexpected data. For example, tagging may be incomplete, data may be missing or in the wrong format, or a flag wasn’t tripped in code logic. This phase solidifies the third tier of the pyramid by ensuring the data is usable.
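
As an illustration of the kind of checks this phase involves, here is a minimal sketch in Python; the record layout and field names (`campaign_tag`, `event_date`) are hypothetical stand-ins for a real schema:

```python
from datetime import datetime

def clean_record(record):
    """Flag and fix common data-quality problems in a raw event record.

    The fields here are hypothetical; a real pipeline would mirror its
    own schema. Returns (cleaned_record, issues_found).
    """
    issues = []
    cleaned = dict(record)

    # Missing value: the tag may simply never have fired.
    if not cleaned.get("campaign_tag"):
        issues.append("missing campaign_tag")
        cleaned["campaign_tag"] = "UNKNOWN"

    # Wrong format: dates can arrive as strings in more than one layout.
    raw_date = cleaned.get("event_date", "")
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            cleaned["event_date"] = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue
    else:
        issues.append(f"unparseable event_date: {raw_date!r}")
        cleaned["event_date"] = None

    return cleaned, issues

record, problems = clean_record({"campaign_tag": "", "event_date": "10/02/2019"})
```

Even a lightweight pass like this surfaces how many records would otherwise silently skew later analysis.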

Aggregating and Labeling
With reliable, ready data, the next step is developing business intelligence (BI) or analytics. Much of what’s measured in this phase should match the measurement plan, which simplifies the process of ensuring the right and relevant measures are reported. Analytics report on and offer insights into seasonality, sensitivity, changes over time, and other factors. Some rough segmentation through aggregation is possible as well. Aggregating (i.e., clustering) is a way to make sense of data that isn’t yet classified by finding logical groups based on similarities within the data that also differentiate between groups. Clustering can uncover previously unknown attributes of importance, known as features, for later machine-learning modeling. Features are the attributes a model determines best describe or replicate the situation it is predicting. What are features? Think of a cheeseburger: its required features are a bun, cheese, and a patty. Different restaurants may offer versions with more or fewer optional ingredients (features), but a predictive model would still identify the bun, cheese, and patty as required.
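
To make the aggregation idea concrete, here is a minimal sketch of one-dimensional k-means clustering in pure Python (a real project would typically reach for a library such as scikit-learn); the engagement scores are made up for illustration:

```python
import statistics

def kmeans_1d(values, k=2, iters=20):
    """Minimal one-dimensional k-means: find k group centers, then
    assign each value to its nearest center."""
    # Seed centers by picking evenly spaced sorted values.
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group.
        centers = [statistics.mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical monthly engagement scores: two natural segments emerge.
scores = [2, 3, 4, 20, 22, 25]
centers, groups = kmeans_1d(scores, k=2)
```

The two groups that fall out (low versus high engagement) are exactly the kind of rough segmentation described above, and the attributes that separate them are candidate features for later modeling.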

The tail end of this phase is where data scientists begin preparing training data through labeling, either automatically or manually; for example, did this HCP (healthcare professional) continue prescribing our treatment this month, yes or no? Did the patient switch treatments, yes or no? (This is also a prime spot for data stories: simple insights that can embellish reports and grab attention.)
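
A labeling pass of this kind can be very simple in code. The sketch below, using hypothetical prescription-history data, turns monthly counts into the yes/no labels described above:

```python
def label_continuation(rx_counts_by_month, month):
    """Label whether a prescriber continued prescribing in a given month.

    `rx_counts_by_month` maps month strings to prescription counts; the
    data shape is a hypothetical stand-in for real claims data.
    Returns 1 (continued) or 0 (did not): the yes/no training label.
    """
    return 1 if rx_counts_by_month.get(month, 0) > 0 else 0

# Made-up history for one prescriber.
history = {"2019-08": 5, "2019-09": 3, "2019-10": 0}
labels = {m: label_continuation(history, m) for m in history}
```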

Learning and Optimizing
Now that we have valid, relevant (training) data, there’s another question to consider: are we building models for internal use, or are they external-facing or embedded within another system, such as Salesforce? If we’re building internal models, we can start developing machine-learning models right away; otherwise, not yet. For the best results, there should be an experimentation framework that supports incremental deployment, so we can avoid major disasters and get rough estimates of the effects of a change before it’s fully rolled out. This phase also supports baseline development; for example, if we’re building a model to determine whether an HCP would drop a treatment, we might build a “profile” of prescribers that summarizes adoption patterns, average number of prescriptions, other treatments prescribed, and so on. While this might not seem exciting, it’s hard to beat simple heuristics (rules of thumb), and they give us guidelines for debugging a more complex machine-learning solution later. At this point, simple math is our best friend.
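
To show how far simple math can go, here is a sketch of a rule-of-thumb baseline for the drop-a-treatment example; the threshold and the prescription counts are made up for illustration, not recommendations:

```python
import statistics

def baseline_drop_risk(monthly_rx, threshold=0.5):
    """Rule-of-thumb baseline: flag a prescriber as at risk of dropping
    the treatment if the most recent month falls below `threshold` times
    their historical average. The 0.5 cutoff is an illustrative choice."""
    history, latest = monthly_rx[:-1], monthly_rx[-1]
    return latest < threshold * statistics.mean(history)

# Hypothetical monthly prescription counts for two prescribers.
steady = [10, 11, 9, 10]   # latest month in line with history
fading = [10, 11, 9, 3]    # latest month well below average
```

A baseline this simple gives a sanity check to beat: if a complex model can’t outperform it, the model (or the data) needs debugging.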

A data model illustrates the relationships among data. During the learning and optimizing phase, simple machine-learning algorithms can be deployed to begin the modeling process, which uncovers signals and features. Signals are indicators of variables or attributes of interest for further modeling; features are the attributes most important for uncovering the patterns in data that can be used for later prediction. Put another way, signals are the things that make us go, “Hmm … ” and stop to consider the situation. For example, a train signal tells us to stop because a train is coming, and ignoring it has potentially terrible consequences; a yellow light is a signal that gives us a chance to decide whether to go through or stop. Signals in data work the same way: they’re results that stand out and warrant further investigation. Signals are also a method for uncovering new features (combinations of individual attributes, calculated fields, or a single attribute split into multiple attributes) that help uncover those patterns for more successful models. This phase can take a great deal of time but is almost always a huge benefit.
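
The kinds of new features just mentioned (calculated fields, combinations of attributes, and splits of a single attribute) might look like this in a sketch; all field names and cutoffs here are hypothetical:

```python
def derive_features(raw):
    """Turn raw attributes into candidate model features: a calculated
    field, a combination of attributes, and a single attribute split
    into parts. Every field name here is a hypothetical example."""
    features = {}
    # Calculated field: click-through rate from two raw counts.
    features["ctr"] = raw["clicks"] / raw["emails_sent"] if raw["emails_sent"] else 0.0
    # Combination: interaction of preferred channel and recency.
    features["recent_email_engager"] = int(raw["preferred_channel"] == "email"
                                           and raw["days_since_last_open"] <= 30)
    # Split: pull a region out of a combined territory code like "NE-042".
    features["region"] = raw["territory"].split("-")[0]
    return features

feats = derive_features({"clicks": 12, "emails_sent": 100,
                         "preferred_channel": "email",
                         "days_since_last_open": 14,
                         "territory": "NE-042"})
```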

We’ve reached the top of the data science pyramid. Data is defined, collected, cleaned, available, reliable, explored, measured, and used regularly for experimentation, and we’re now ready to fully move into machine learning and artificial intelligence. This pre-work, moving up the hierarchy of needs before attacking your biggest problems with AI, is the prerequisite to more successful, optimized, and valuable marketing.

Want to learn how data science can work for your brand? Reach out to your team today!

Sam Johnson is a senior director for Intouch’s Advanced Analytics Lab.