If you are doing anything around Data Science and Machine Learning in the Microsoft space then I am sure you have come across Azure Machine Learning. Azure ML is Microsoft’s push to bring machine learning to the masses, much the same way it has done everything in its power to bring Business Intelligence to the masses.

Azure ML has a host of fantastic features, along with the ability to use custom R and Python scripts if what is provided out of the box does not fit your needs. It offers an SSIS-like drag-and-drop environment in which to build your machine learning experiments, which is fantastic as it speeds up the process of iteratively developing a machine learning solution. However, there is one key feature in this drive to bring machine learning to the masses that has been overlooked.

Power Query datasets as a source for Azure ML experiments

In the spirit of bringing technology and capabilities to the masses, I believe Microsoft is currently missing out on a feature that could be truly great for users of Azure ML and Power BI. That feature would be the ability to use a Power Query dataset which has been saved to your Corporate Data Catalog as the source of data for an Azure ML experiment.

Currently Azure ML allows a range of data sources, which are all listed in the documentation. In essence it boils down to the following ways that you can bring data into Azure ML (a small sketch of the web-URL route follows the list):

  • Azure BLOB storage, table, or SQL database
  • Hadoop using HiveQL
  • A web URL using HTTP
  • A data feed provider (OData)
  • Uploaded ahead of time from a local machine
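
Of the routes above, the web URL one is the simplest to illustrate. Below is a minimal sketch, in Python, of pulling a CSV over HTTP into a pandas DataFrame, which is the shape data typically takes inside an experiment; the URL is a hypothetical placeholder, not a real endpoint.

```python
# Minimal sketch of the "web URL using HTTP" route: pull a CSV over
# HTTP into a pandas DataFrame before handing it to an experiment.
import pandas as pd

DATA_URL = "https://example.com/datasets/sales.csv"  # hypothetical endpoint

df = pd.read_csv(DATA_URL)
print(df.head())
```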

By leaving Power Query datasets off that list, Microsoft is doing its entire product suite a great disservice. As I have previously blogged, Power BI allows us to change the way in which we build BI solutions, and at the heart of this sits Power Query. The ability for people outside of the traditional IT function to take control of ETL for POCs and new experimental projects sits at the heart of this change.

Now I am sure there will be those who are quick to point out that users could simply use Power Query to load data into an Excel spreadsheet, save it to CSV, and upload that file to Azure ML. This certainly works, but it introduces a manual step that would need to be repeated each time the model is retrained, which makes it an undesirable solution. It would be far more valuable if a team member could instead create a Power Query dataset, save it to the Corporate Data Catalog, and have Azure ML consume the dataset directly from there.
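
For context on what the experiment sees once data has been uploaded: when a dataset is wired into an Execute Python Script module in Azure ML Studio, it arrives as a pandas DataFrame through the module's azureml_main entry point. The sketch below assumes that module; the dropna call is only an illustrative transformation, not part of the module itself.

```python
import pandas as pd

# Entry point that Azure ML Studio invokes for an Execute Python Script
# module; dataframe1 holds the dataset connected to the first input port.
def azureml_main(dataframe1=None, dataframe2=None):
    cleaned = dataframe1.dropna()  # illustrative transformation only
    return cleaned,  # the module returns a tuple of DataFrames
```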

What would be the benefits?

To put it bluntly, this would decrease the amount of time your machine learning team spends preparing data for their experiments. As with almost all analytics projects, a vast amount of time goes into getting quality data in the correct format. This is still the case for machine learning, where it is arguably even more critical to get the best and cleanest possible data, as it has a very real and direct impact on the quality of your solution. With machine learning there is an even greater dependency on providing data in the correct format, since the format required can differ greatly based on the question you are trying to answer as well as the type of learning algorithm you choose to employ, as the small illustration below shows.
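
As a small illustration of format-dependent preparation: many learning algorithms expect purely numeric input, so categorical columns have to be expanded into indicator columns first. The column names here are made up purely for the example.

```python
import pandas as pd

# Hypothetical raw data with one categorical and one numeric feature
raw = pd.DataFrame({
    "region": ["North", "South", "North"],  # categorical feature
    "spend":  [120.0, 80.5, 99.9],          # numeric feature
})

# get_dummies expands "region" into one indicator column per category,
# giving the purely numeric layout many algorithms require
model_ready = pd.get_dummies(raw, columns=["region"])
print(model_ready)
```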

So instead of a team spending many hours trying to pull data into spreadsheets or using some other ETL tool to generate the data, it would be extremely useful to let them use Power Query instead. If during experimentation they realise they need to augment the data, this can easily be done in Power Query. This allows them to move fast and get more done, and at the end of the process make the dataset available for consumption by publishing it to the Corporate Data Catalog. That dataset is then reusable: it pulls fresh data each time it is run, so it can be used to retrain your model with new data without any manual steps involved, along the lines of the sketch below.
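
To make the pay-off concrete, here is a rough sketch of the retraining loop this would enable, with fresh data pulled on every run and no manual export in between. The fetch_dataset helper, its URL, and the churned label column are hypothetical stand-ins for resolving the published Power Query dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

DATASET_URL = "https://example.com/catalog/churn.csv"  # hypothetical

def fetch_dataset() -> pd.DataFrame:
    # In the proposed setup this would resolve the Power Query dataset
    # from the Corporate Data Catalog; a plain CSV-over-HTTP read stands
    # in for that here.
    return pd.read_csv(DATASET_URL)

def retrain() -> LogisticRegression:
    df = fetch_dataset()
    labels = df["churned"]                                   # assumed label column
    features = pd.get_dummies(df.drop(columns=["churned"]))  # numeric layout
    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)
    return model

model = retrain()  # re-run on a schedule to retrain on fresh data
```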

This article has 2 comments

  1. shaun

    I reckon they have some work to do on the data sources. You can’t load data from your on-premise data repositories!? You have to copy them up to Azure first. So you’re forced into paying for Azure storage just to use ML Studio. There’s also a 10 GB limit on training sets; it will be interesting to see how that plays out with Hadoop sources. Big Data mining techniques rely on using counts of distinct sets of attributes with low dimensionality and then working out the intersecting probabilities. I haven’t worked it out, but I’m not sure how much 10 GB allows for in complex scenarios. It looks like it could be SSIS under the covers. The old man in me just wishes they would add the ML tasks to SSIS 🙁

    1. Christo Olivier

      There are definitely quite a few areas in which Azure ML falls short. I did not know about the 10 GB limit on training sets. (I should read the documentation more carefully 🙂 ) That in and of itself would make it infeasible for some large-scale ML projects.

      It would have been fantastic if they could make some of the ML tasks available for SSIS. It would definitely be of great benefit to people who need to use it in an on-premise environment. I do, however, think that going that route would run counter to Microsoft’s new direction, which seems to be to get the world to use Azure first, with on-premise as a very distant afterthought. I will still keep a bit of hope alive that we do actually get them in SSIS some day 🙂