CASB Machine Learning: Easy to Say, Hard to Train

Lately, it’s become popular to say that CASB tools use Machine Learning (ML) to detect anomalies in cloud apps. However, you have to be cautious about these claims. What is not widely known is that building a machine learning model is the easy part; training the model is the hard part, and it requires enormous amounts of data. Unfortunately, data availability is often a problem for CASB vendors. The data available to them is limited to their current set of customers and prospects, a problem that the famous example in the HBO series “Silicon Valley” illustrated so well.

Some more serious examples include:

1. Detection of suspicious logins: Several leading SaaS providers, including Google, Microsoft, Dropbox, and Box, have APIs that provide information on suspicious logins. This determination is based on ML models. For example, Microsoft Office 365 leverages ML to detect suspicious logins based on these six conditions:
    • Users with leaked credentials
    • Sign-ins from anonymous IP addresses
    • Impossible travel to atypical locations
    • Sign-ins from unfamiliar locations
    • Sign-ins from infected devices
    • Sign-ins from IP addresses with suspicious activity

Enterprise security teams should leverage this data and correlate it with other user activity to proactively defend against potential risks.
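As a rough illustration of what that correlation could look like, here is a minimal Python sketch. It assumes the suspicious-login events and the user activity records have already been pulled from the providers and normalized; the `RiskEvent` and `Activity` classes, their field names, and the one-hour window are all hypothetical and do not reflect any provider's actual API or schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RiskEvent:
    """A suspicious sign-in reported by a provider (hypothetical shape)."""
    user: str
    kind: str        # e.g. "impossible_travel", "anonymous_ip"
    at: datetime


@dataclass(frozen=True)
class Activity:
    """A normalized user action in a cloud app (hypothetical shape)."""
    user: str
    action: str      # e.g. "file_download", "external_share"
    at: datetime


def correlate(
    risk_events: List[RiskEvent],
    activities: List[Activity],
    window: timedelta = timedelta(hours=1),  # assumed window, tune as needed
) -> List[Tuple[RiskEvent, Activity]]:
    """Pair each flagged sign-in with the same user's actions that
    happened at or after it, within `window`."""
    # Index activity by user so each risk event only scans that user's actions.
    by_user: Dict[str, List[Activity]] = {}
    for act in activities:
        by_user.setdefault(act.user, []).append(act)

    matches: List[Tuple[RiskEvent, Activity]] = []
    for ev in risk_events:
        for act in by_user.get(ev.user, []):
            if ev.at <= act.at <= ev.at + window:
                matches.append((ev, act))
    return matches
```

A file download a few minutes after an "impossible travel" sign-in by the same user would be returned as a pair, while the same download by a different user, or one made many hours later, would not.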

2. DLP in your images: The rise in the use of smartphones over the last decade has encouraged employees to take photos of documents, whiteboards, company credit cards, customer contracts, and so on. Hidden in these photos, which are often stored in the cloud, is sensitive data. Traditional Data Loss Prevention (DLP) products do not scan these images and, as a result, miss most of the risks. A neural network-based machine learning service like Google Vision (based on Google TensorFlow) runs OCR on these images; the extracted text is then fed through DLP checks to discover and redact the sensitive data hidden in them.
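To make the second half of that pipeline concrete, here is a minimal Python sketch of one DLP check run over text that an OCR service has already extracted from an image. The OCR call itself is deliberately left out. The function names, the `[REDACTED CARD]` placeholder, and the choice of a credit card check are assumptions for illustration; a real DLP engine applies many more rules than this.

```python
import re
from typing import Tuple

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_card_numbers(ocr_text: str) -> Tuple[str, int]:
    """Replace Luhn-valid card numbers in OCR output with a placeholder.

    Returns the redacted text and the number of redactions made.
    """
    count = 0

    def _replace(match):
        nonlocal count
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            count += 1
            return "[REDACTED CARD]"
        return match.group()  # digit run that is not a valid card number

    return CARD_CANDIDATE.sub(_replace, ocr_text), count
```

The Luhn checksum is used here to cut down on false positives: a 16-digit order number or phone list in a whiteboard photo will usually fail it, while a real card number will pass.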

After concluding that we would spend years training a new machine learning model, and realizing that the data available to train such models is limited, we decided not to roll out our own machine learning models. Instead, we went to the leaders. For example, to classify an image, we go to Google, which trains its Google Vision algorithm on images found on the Internet. In fact, its model has been trained with a data set as large as the Internet.

For detecting risks, we leverage Google ML algorithms, which are trained on enormous data sets. For detecting suspicious user logins, we get the intelligence feeds directly from G Suite, Dropbox, and Office 365, all of which have data about the end users of their services.

My point is that when it comes to security technologies, it has become trendy to throw around words like AI, ML, and analytics. Beware of these trendy claims, and ask vendors how they build and train their ML models and where the data for their analytics comes from.

See how we leverage industry leaders’ work on ML to detect risks and sensitive data hiding in your cloud apps. Request your free trial of Cloud Access Monitor today.