I started writing this post previously but found myself waxing lyrically about the challenges that building a traditional data warehouse brings to a BI project. However I realized that subject is a post on its own and something I don't want to get into fully right now. Instead I want to show how BigQuery can be used to achieve better results than we would get from a traditional data warehouse while providing so much more flexibility in the process.
The standard approach of building a data warehouse as part of a BI solution is one of the main reasons why we fail to deliver solutions that can grow with the business needs at a rapid pace. Data Warehouse technology, the principles they are founded on, is more than 20 years old! In a cloud enabled world BigQuery allows us to achieve better results than a traditional data warehouse for a fraction of the cost, while allowing us to move and change with business requirements in a much faster manner.
What do I have against Data Warehouses?
Until recently I had no problem with a traditional data warehouse. It is a solid technology that delivers what is says on the tin. It is tried and tested and there is a load of good material about how to build one properly. I have built data warehouses for almost a decade now but have started acknowledging some of the issues they do produce.
As developers we are sometimes very much uninterested in the commercial implications of our technology. A data warehouse is not a cheap undertaking. For a large volume data warehouse we need proper hardware, that means fast SANS and good amounts of memory. (yes i know people say storage is cheap but enterprise storage such as SANS seem to be made from mithril mined by mythical dwarves deep in the mountains of middle earth) This hardware is not cheap. In addition to this we will also need to have this solution regularly backed up and possibly need a cluster to cater for fail-over requirements. That ends up coming to a sizable amount of cash just to keep the lights on.
The argument will exist that we can always spin up virtual machines in the cloud instead of buying hardware and that will definitely bring down the cost. However you will suffer the usual challenges with performance (have a look at the IOPS and throughput you get on Azure) and the top tier of these products are anything but cheap compared to what I will show you in this post. Remember that even VMs in the cloud will still require someone to administer them and look after them. So you still incur and admin cost that is similar to that of physical machines even though it might be reduced.
In addition to the cost of hardware we are also effectively delivering a solution that can only ever answer the questions it was design to answer. If we come up with a question for which we do not have the data in the format we require then we have to go through a change process to add this data. If you have ever done work on a big BI project where you have cubes and reports on top of a data warehouse you will know that changing the data warehouse means a ripple effect of change throughout the entire solution. Yes we can mitigate this to a certain extent by using views and layer obfuscation where possible but fundamentally change in a large data warehouse always comes with changes at other levels of the solution.
More importantly than the issue of change is the actual speed at which we can make these changes. Even if you follow an "agile" data warehouse delivery approach you will still not be able to react fast enough to the requirements of today's business environment. The lag between identifying new analysis requirements and delivering the data to allow the analysis is just too slow.
BigQuery and a new approach to delivering BI and Analytics
The issues identified above are by no means exhaustive but as I said at the beginning I don't want to get caught up in bemoaning a technology that has served us well. Instead lets look at how we can use the skills we already have and deliver something new to address these issues. Most of the challenges that data warehouses bring are due to the fact that they are based on the hardware and software solutions in a pre-cloud era. So how do we bring our BI and Analytics projects into the cloud era? Enter BigQuery.
What is BigQuery
BigQuery is Google's Big Data tool provided as a service. I wont retype all of the info that is already on the Google Cloud Platform website, you can read up on BigQuery here. What is very interesting about BigQuery is the fact that it is a true cloud product, meaning that you are not starting up a bunch of VMs in the cloud. You actually just use it as a service. There is no selecting tiers of performance (sorry Azure) and the billing is ridiculously simple and cheap. It can be queried using a version of SQL which is also nice since if you are reading this you probably are more than familiar with SQL already.
BigQuery is fast!
BigQuery is fast! Yes really fast. In some cases it is OLAP cube performance fast. It does not suffer from issues doing Distinct counts as we normally would have in cubes. It also scales marvelously well.
One thing to keep in mind with BigQuery is that it performs best on denormalized sets of data. By denormalized I don't mean a star schema, I am talking about one big wide table. Start schemas still work really well however joins do slow down BigQuery especially if the table on the right hand side of a join is large.
Just to give an example I have loaded an expanded version of the Contoso DW sample database into BigQuery. It contains 202 million records and I have two versions of it, one that is a traditional Star Schema (RetailDW) and one that is one big flat wide table (RetailDenormalized).
The data stored in the OnlineSales table in the RetailDenormalized data set is 211GB due to the fact that it is completely denormalized into one flat data set.
The data stored in the FactOnlineSales table in the RetailDW data set is 36GB for the same set of records due to it being in a star schema.
In the normal database world this would make a big difference to the hardware requirements of the solution and we would definitely NOT be creating one big flat table. However remember that BigQuery is a different solution so lets see how performance differs between the two data sets when we run our example query.
The query I will be using is one that calculates the number of distinct customer per Product Sub Category as well as the Average SalesAmount for each Product Sub Category filtered where the Gender is equal to Female. Since the SQL used in BigQuery is a little different than ANSI SQL there are certain functions like DISTINCT which is not supported yet so the query might look a little different to what you are normally used to.
The query used on the denormalized table is shown below:
select averages.ProductSubcategoryName as ProductSubCategoryName, averages.AvgSalesAmount as AverageSalesAmount, distincts.DistinctCountOfCustomer as DistinctCustomerCount from ( select ProductSubcategoryName, avg(SalesAmount) as AvgSalesAmount from RetailDenormalized.OnlineSales where Gender = 'F' group by ProductSubcategoryName ) as averages inner join (select ProductSubcategoryName, count(CustomerKey) as DistinctCountOfCustomer from ( select ProductSubcategoryName, CustomerKey from RetailDenormalized.OnlineSales where Gender = 'F' group by ProductSubcategoryName, CustomerKey ) group by ProductSubcategoryName) as distincts on averages.ProductSubcategoryName = distincts.ProductSubcategoryName order by averages.ProductSubcategoryName
Keep in mind that you do not build any indexes in BigQuery you simply load the data and query it. The queries being run here are also not being returned from the Cache, they are actively run against the data. The result of this query is shown below.
That query ran and completed in 3.3 seconds. It churned through 6.79GB of data out of the total 211GB of the table. (More on how this works in another blog post.)
Lets now have a look at the query for the Start Schema data set which will contain some joins.
select averages.ProductSubcategoryName as ProductSubCategoryName, averages.AvgSalesAmount as AverageSalesAmount, distincts.DistinctCountOfCustomer as DistinctCustomerCount from ( select product.ProductSubcategoryName, avg(SalesAmount) as AvgSalesAmount from RetailDW.FactOnlineSales as sales inner join RetailDW.DimProduct as product on sales.ProductKey = product.ProductKey inner join RetailDW.DimCustomer as customer on sales.CustomerKey = customer.CustomerKey where customer.Gender = 'F' group by product.ProductSubcategoryName ) as averages inner join (select ProductSubcategoryName, count(CustomerKey) as DistinctCountOfCustomer from ( select product.ProductSubcategoryName, sales.CustomerKey from RetailDW.FactOnlineSales as sales inner join RetailDW.DimProduct as product on sales.ProductKey = product.ProductKey inner join RetailDW.DimCustomer as customer on sales.CustomerKey = customer.CustomerKey where customer.Gender = 'F' group by product.ProductSubcategoryName, sales.CustomerKey ) group by ProductSubcategoryName) as distincts on averages.ProductSubcategoryName = distincts.ProductSubcategoryName order by averages.ProductSubcategoryName
That query ran and completed in 5.5 seconds. It churned through 4.52GB of data out of the total 36GB of the tables in the RetailDW star schema data set.
The above two runs clearly show the impact that joins have on BigQuery but more importantly it shows that BigQuery is not impacted by very large denormalized data sets. (I will cover more on this in another blog post)
The cost of using BigQuery
So the obvious question now turns to how much all of this costs and I am very happy to report that it might actually be the best part of all of this. Here is a link to the pricing of BigQuery. The cost shown here is that you are charged the following:
- Storage - $0.02 per GB per month.
- Streaming Inserts - $0.01 per 100,000 rows (These are inserts streamed into BigQuery in real time not data that is loaded into the solution in the normal ETL batch way)
- Queries - $5 per TB (yes $5 per Terabyte of data queried)
- Queries against cache - Free
- Queries that returns an error - Free
- Loading Data - Free
- Exporting Data - Free
If we do the math for how much it would cost to have the above 211GB data set in BigQuery and run 2TB of queries against it then we arrive at the following:
- 211GB * $0.02 = $4.22
- 2TB of queries * $5 = $10
- Total Cost Per Month: $14.22
That is probably the lowest cost for a "data warehouse" that I have ever seen. There is also no need for redundancy or backups of the warehouse as this is a proper cloud service from Google and not a bunch of VMs spun up in the cloud somewhere that still requires administrators to manage them.
How do I connect my reporting tools to BigQuery?
The answer to this is luckily that it is relatively simple. There is a whole host of reporting tools that already connect directly to BigQuery such as Tableau, Qlikview, Targit and Bime. In addition to this there are also third party ODBC providers that you can use to connect all of your normal Microsoft tools, such as Tabular Models, PowerPivot, SSRS, PowerBI etc directly against BigQuery. For a list of third party components have a look at this link here. The last time I checked the cost of the Simba ODBC provider for BigQuery was $199 for a licence.
So what is this new approach to delivering BI solutions?
So after seeing all of the above we need to take a step back and think about what it is we are trying to achieve with our data warehouses. They primarily serve as the repository of data for our BI solutions. We place cubes on top of the data warehouse to increase the performance of querying data at aggregate levels and to allow free form analysis.
In a normal BI project we have roughly the following steps:
- Identify the business process / area that we want to analyse
- Identify the analysis that is required against this business process / area
- Build the dimensional model that would satisfy these requirements
- Design a pyhiscal data warehouse based on the dimensional model
- Build ETL to populate the data warehouse
- Build an OLAP cube or Tabular model on top of the data warehouse
- Build reports on the OLAP cube or Tabular model and allow free form analysis against them
- Realise that additional analysis is required and that we need to change / add to the dimensional model (make changes from steps 3 to 7)
I don't want to get into too much detail but step 8 above is normally where all the friction comes from and when things start slowing down and impacting the BI solution delivery. Always remember that users cannot possibly know all of the analysis they want to do. If you develop a good BI solution it will spark more questions from users as it opens up the data to new queries previously not thought about.
Lets look at how we can change our approach using BigQuery:
- Identify the business process / area that we want to analyse
- Identify all of the data available in source systems that relate to the business process / area
- Identify the analysis that is required against the business process / area
- Build a simple ETL process that extracts all of the data out into denormalized data sets and upload it to BigQuery
- Analyse the data in BigQuery by creating views and aggregate tables where appropriate
- Load views and aggregate tables into reporting tools and report against them (could use Tabular / PowerPivot models where it adds business value)
- Realise that additional or new analysis is required against (Repeat steps 5 and 6)/ Realise that an additional data set is required ( Repeat steps 4 to 6)
The above new approach is high level but shows how we can change the way we deliver BI solutions when we make use of a product such as BigQuery. It allow us to work with our data in a very unique new way due to the scale at which it can operate at.
The above should serve as a good high level overview of what BigQuery makes possible. There is a load of detail around how we actually implement solutions using this new approach, which I will cover in future blog posts. BigQuery was designed to serve as a Big Data tool. However from what I have seen it can have a profound impact not just on Big Data projects but more specifically on how we actually go about delivering BI and Analytics solutions.
In the next posts I will walk through some of the detail about how we implement a solution against BigQuery. I will also cover some of the questions that people will have around security and data access.