If you are doing anything around Data Science and Machine Learning in the Microsoft space, then I am sure you have come across Azure Machine Learning. Azure ML is Microsoft’s push to bring machine learning to the masses, much the same way that Microsoft has done everything in its power to bring Business Intelligence to the masses.
Azure ML has a host of fantastic features, including the ability to use custom R and Python scripts if what is provided out of the box does not fit your needs. It makes for an SSIS-like drag-and-drop environment in which to build your machine learning experiments, which is fantastic as it speeds up the process of iteratively developing a machine learning solution. However, there is one key feature in this drive to bring machine learning to the masses that has been overlooked.
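For context, the custom-script route works by dropping an Execute Python Script module into the experiment, which calls a function named `azureml_main`. A minimal sketch of what such a module might look like follows; the cleaning steps themselves are purely hypothetical examples, not anything prescribed by Azure ML:

```python
import pandas as pd


def azureml_main(dataframe1=None, dataframe2=None):
    # Azure ML passes the connected dataset(s) in as pandas
    # DataFrames and expects a tuple of DataFrames back.
    cleaned = dataframe1.dropna()
    # Hypothetical cleaning step: normalise the column names.
    cleaned = cleaned.rename(
        columns=lambda c: c.strip().lower().replace(" ", "_")
    )
    return (cleaned,)
```

Anything the built-in modules cannot express can be handled this way, at the cost of writing and maintaining the script yourself.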
Power Query datasets as a source for Azure ML experiments
In the spirit of bringing technology and capabilities to the masses, I believe Microsoft is currently missing out on a feature that could be truly great for users of Azure ML and Power BI: the ability to use a Power Query dataset that has been saved to your Corporate Data Catalog as the data source for an Azure ML experiment.
Currently Azure ML allows a range of data sources, all of which are listed in the documentation. In essence, it boils down to the following ways to bring data into Azure ML:
- Azure BLOB storage, table, or SQL database
- Hadoop using HiveQL
- A web URL using HTTP
- A data feed provider (OData)
- Uploaded ahead of time from a local machine
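Of those options, the web URL route is often the simplest to script. A minimal sketch using pandas (the function name is mine, and the URL would be whatever HTTP endpoint exposes your CSV):

```python
import pandas as pd


def load_csv_over_http(url):
    # pandas issues the HTTP request itself when handed a URL;
    # the same call also accepts a local file path, which is
    # effectively what the "uploaded ahead of time" option gives you.
    return pd.read_csv(url)
```

The point of the feature request below is that a published Power Query dataset could slot into this list just as naturally as any of these sources.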
By leaving Power Query datasets off that list, Microsoft is doing its entire product suite a great disservice. As I have previously blogged, Power BI allows us to change the way in which we build BI solutions, and at the heart of this sits Power Query. The ability of people outside the traditional IT function to take control of ETL for POCs and new experimental projects sits at the heart of this change.
Now I am sure there will be those who are quick to point out that users could simply use Power Query to load data into an Excel spreadsheet, save it to CSV, and upload that to Azure ML. This would certainly work, but it introduces a manual step that would need to be repeated each time the model is retrained, which is undesirable. Instead, it would be far more valuable if a team member could create a Power Query dataset, save it to the Corporate Data Catalog, and have Azure ML consume the dataset from there.
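If a team does go the manual route, at least the Excel-to-CSV step can be scripted. A sketch assuming the Power Query output lands on a known sheet (the function and sheet names here are placeholders, and reading .xlsx files with pandas requires the openpyxl package):

```python
import pandas as pd


def excel_sheet_to_csv(xlsx_path, sheet_name, csv_path):
    # Read the sheet the Power Query output lands on and re-save
    # it as the flat CSV that Azure ML's upload option expects.
    df = pd.read_excel(xlsx_path, sheet_name=sheet_name)
    df.to_csv(csv_path, index=False)
    return df
```

This removes some of the drudgery, but it still leaves a manual refresh-and-run step in the loop, which is exactly what direct Power Query support would eliminate.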
What would be the benefits?
To put it bluntly, this would decrease the amount of time your machine learning team spends preparing data for their experiments. As with almost all analytics projects, a vast amount of time is spent getting quality data into the correct format. This is still the case for machine learning, where it is arguably even more critical to get the best and cleanest data possible, as it has a very real and direct impact on the quality of your solution. With machine learning there is an even greater dependency on the format of the data, since the format required can differ greatly based on the question you are trying to answer as well as the type of learning algorithm you choose to employ.
So instead of a team spending many hours pulling data into spreadsheets, or using some other ETL tool to generate the data, it would be extremely useful to let them use Power Query instead. If during experimentation they realise they need to augment the data, this can easily be done in Power Query. This allows them to move fast and get more done, and at the end of the process they can make the dataset available for consumption by publishing it to the Corporate Data Catalog. That dataset is then reusable: it pulls fresh data each time it is run, so it can be used to retrain your model with new data without any manual steps involved.