A customer of yours is eager to start a new data project on the Azure Cloud Platform. They are looking to use a lot of the goodness Azure has to offer in terms of data orchestration, big data, data warehousing, hybrid connectivity, etc. All the cool stuff. You pull out the Cortana Intelligence Suite diagram and start explaining the various Azure services available under this brand name.
After an hour-long discussion, several Azure services are selected and an architecture diagram is drawn out that looks something like this.
"This is great!" the customer says. "Here are some details..." and he starts providing the number of data sources, tables to load, estimated data sizes, SLA load times, the number of report users, etc. "So, how much will the Azure subscription cost me? Can you give a high-level estimate?" your customer asks.
"Ermm, ok... there's this Azure Calculator and some pricing guidelines on the web that we can use," and you pull out some links like this one, trying to understand it all, but eventually say, "Let me get back to you!" Several days later, you still haven't figured things out.
You may have encountered a similar customer situation before, or perhaps you are the customer trying to obtain an estimated Azure cost but are stuck on how to proceed. Let's look at this sample use case and come up with a hypothetical estimate. First, let's understand the proposed architecture and how these Azure services fit together.
Azure Blob Storage acts as a data staging area (or a form of data lake) for all data copied from the sources. Azure Data Factory is used to connect to and copy data from on-premises databases into Azure Blob Storage. Data Factory also orchestrates data movement and transformation using an on-demand HDInsight cluster running Hive or Pig scripts. Hence, the HDInsight (Hadoop) cluster is used mainly for data transformation, preparing the data to be loaded into SQL Data Warehouse, which is the MPP query store for Power BI.
Azure Machine Learning is used by the data scientists to conduct ML experiments, and it also exposes web service APIs for other applications (like Power BI) to call to obtain predicted results from the trained ML models. Data Factory is used to retrain the Azure ML models. Power BI is the business intelligence tool for end users to query data directly from SQL DW or from the raw data stored in Blob storage.
Here are the requirements gathered from the customer.
- There are 5 databases/sources to connect to. Let's say they consist of one each of SQL Server, Oracle, MySQL, DB2 and a flat file folder (ok, I know... that's not a DB).
- Each of these sources contains 5 tables that need to be fully loaded into Azure.
- Each source is approximately 1 TB in size (for the 5 tables), so that's a total of 5 TB across the 5 sources.
- Total data storage size will be 12 TB after 5 years estimated growth.
- Data loading and transformation (ELT) from on-premises to cloud is done daily.
- Reporting is done 8 hours per working day.
- There are 100 reporting users who utilize their data wrangling super powers!
- There are 5 data scientists who work 9 to 5 at home and never intend to ever come to the office.
Copy data from On-Premises to Cloud
The data copy activity is done by Data Factory via a free downloadable Data Gateway that establishes connections to the 5 data sources. We start by calculating the Data Factory charges for the "On-Premises Copy Activity" to Blob Storage. We will have 1 pipeline copy activity for each table within each data source, which means there are 25 activities (5 data sources x 5 tables). Since it's a daily load, these activities are categorized as "low frequency activities". Take note that these charges are on a "per month" basis, which means even if the copy happens every day for all 25 tables, we do NOT multiply by 30 days. That's good news in terms of pricing. We will also not be re-running any of the activities more than 1,000 times, an unlikely scenario for our case, hence we can ignore the "re-run activity" cost.
Each of the 25 tables copied will produce a csv file, giving 25 csv files, and let's say each copy takes approximately 2 minutes per day to complete. Because we only have a 10 Mbps line dedicated to this copy activity, in theory we are loading a total of 3.75 GB per day (75 MB/min x 2 min x 25 tables = 3,750 MB). That's slightly over 1.3 TB of data growth per year, hence about 6.5 TB in 5 years. To calculate the "On-Premises Data Movement" cost, we only need the total hours taken for data movement: 25 tables x 2 min x 30 days (per month) = 1,500 min, or 25 hours per month. Here's how it looks in the Azure Calculator.
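As a quick sanity check, the copy-volume and data-movement arithmetic above can be sketched in a few lines of Python (all figures come straight from the assumptions stated, not from any Azure API):

```python
# Daily copy volume and monthly data-movement hours for the on-premises copy.
MBPS = 10                       # dedicated line speed, megabits per second
MB_PER_MIN = MBPS / 8 * 60      # = 75 MB transferred per minute
TABLES = 25                     # 5 sources x 5 tables
COPY_MIN_PER_TABLE = 2          # assumed copy duration per table per day

daily_mb = MB_PER_MIN * COPY_MIN_PER_TABLE * TABLES    # 3,750 MB per day
yearly_tb = daily_mb * 365 / 1_000_000                 # ~1.37 TB per year

# Data movement is billed by the hour, so total the minutes and convert.
monthly_movement_hours = TABLES * COPY_MIN_PER_TABLE * 30 / 60   # 25 hours

print(daily_mb, round(yearly_tb, 2), monthly_movement_hours)
```

Multiply the 25 hours by the per-hour on-premises data movement rate in the Azure Calculator to get the monthly movement charge.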
We then add in the Blob storage cost to potentially store 12 TB worth of data. We will assume a 1-to-1 table-to-csv-file sizing (no table indexes, no compression, etc.). We will use "general purpose block blob with LRS" data redundancy (data is automatically replicated 3 times within the same regional data center). The number of storage transactions will not go beyond 100,000, where one transaction is one API call (such as uploading 1 csv file). We will also assume all data is ingress to Azure and no data is egress to downstream applications, hence we will not worry about bandwidth charges.
Data Transformation on Cloud
Once the 25 "raw" csv files are in Blob storage, we perform the data transformation logic on these files. Since we will be using Pig scripts for the transformation, we need to spin up a Hadoop cluster. We will go with on-demand HDInsight transformation, where Data Factory orchestration automatically provisions a Hadoop cluster for the data transformation and automatically shuts it down once it's done.
We will first calculate the Data Factory cloud activity cost. We will have 5 activities, one for each data source, i.e. one per batch of 5 csv files. Each activity runs a Pig script file to handle empty data values, perform a word count or consolidate the csv files. It could be many other forms of data transformation; that's up for interpretation.
We will have an additional chained activity for each batch of 5 csv files (5 activities in total) to perform further data transformation, perhaps to normalize the data or remove the intermediate files altogether. Again, the kind of transformation being done is up for interpretation. For this use case, we will assume each batch of 5 csv files is consolidated into 1 single csv file.
This is an example of how the Data Factory pipeline will look: a total of 10 low frequency cloud activities running on-demand HDInsight clusters.
Next, we will calculate the cost of the HDInsight cluster that's provisioned to compute the data transformation tasks. Since we are only processing 5 csv files per activity, or 750 MB (150 MB x 5 csv) worth of data, the required cluster size is very small. The HDInsight compute assigned is 2 head nodes and 4 worker nodes, all D3v2 virtual machines. D3v2 machines have 4 cores and 14 GB RAM, sufficient for our use case.
Each activity transformation run is assumed to take 30 minutes of HDInsight compute time to complete. There's a total of 10 separate activities, hence the total compute time would be 10 activities x 30 min x 30 days = 150 hours per month. We will also size in 1 TB of Blob storage for HDInsight as its staging data store.
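The HDInsight compute-hour estimate can be sketched the same way. Note that billed node-hours are a multiple of cluster-hours since the on-demand cluster runs 6 nodes (2 head + 4 worker), all assumptions taken from the sizing above:

```python
# Monthly HDInsight compute hours for the on-demand transformation cluster.
activities = 10        # 5 transform + 5 consolidation cloud activities
minutes_per_run = 30   # assumed duration of each activity run
days_per_month = 30    # daily schedule

cluster_hours = activities * minutes_per_run * days_per_month / 60  # 150 h

# 2 head nodes + 4 worker nodes (D3v2) are billed while the cluster is up.
node_hours = cluster_hours * (2 + 4)   # 900 node-hours per month

print(cluster_hours, node_hours)
```

Plug the D3v2 hourly rate against the 900 node-hours (or let the Azure Calculator's HDInsight entry do it) to get the monthly cluster charge.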
Copy data from Blob to SQL DW
After the data files are processed and landed in Blob storage, we proceed to load them into Azure SQL Data Warehouse. This will be executed via 5 low frequency cloud activities loading data into 5 separate tables in SQL DW. Assuming it takes 50 minutes per day to load everything, the cloud data movement cost will be 50 min x 30 days = 1,500 min, or 25 hours per month.
We will assume a stored procedure needs to be run for each loaded table for further data preparation. This requires an additional 5 low frequency cloud activities to execute the stored procedures. These stored procedures may be used to create new calculated columns, create statistics, change data types, etc.
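Before looking at the pricing screenshot, it helps to total up the Data Factory line items accumulated so far. A small tally (counts taken from the steps above) looks like this:

```python
# Running totals of Data Factory activities and data-movement hours so far.
onprem_copy = 25           # 5 sources x 5 tables, low frequency on-premises
hdinsight_transform = 10   # 5 transform + 5 consolidation cloud activities
dw_load = 5                # blob -> SQL DW copy activities
dw_sproc = 5               # stored-procedure activities

cloud_activities = hdinsight_transform + dw_load + dw_sproc   # 20 so far

onprem_movement_hours = 25 * 2 * 30 / 60   # 25 h: on-prem copy to blob
cloud_movement_hours = 50 * 30 / 60        # 25 h: blob to SQL DW loads

print(cloud_activities, onprem_movement_hours + cloud_movement_hours)
```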
Here's Azure Data Factory pricing so far.
Azure SQL Data Warehouse sizing
Azure SQL Data Warehouse will be online only during working days, for 8 hours each day: 22 days x 8 hours = 176 hours. We will assume an additional 30 minutes every day is required to run the stored procedures during the data loading stage: 30 days x 30 minutes = 15 hours. Hence the total SQL Data Warehouse compute time required per month is 176 + 15 = 191 hours. The Azure Automation service will be used to automate pausing and resuming SQL DW; its free tier of 500 minutes of job run time should be sufficient.
The required storage size is presumed to be 12 TB, assuming we load everything in, including the growth estimates. However, we will assume a 75% data compression ratio, hence the actual required storage size is 3 TB. We will run on DWU 400 based on the 3 TB usage estimate.
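The SQL DW compute-hour and storage arithmetic above, expressed as a small sketch (the 75% compression ratio is the assumption stated in the text, not a measured figure):

```python
# Monthly SQL Data Warehouse compute hours, per the schedule above.
working_days = 22
reporting_hours = working_days * 8        # 176 h online for reporting
sproc_hours = 30 * 30 / 60                # 15 h of daily stored procedures
dw_hours = reporting_hours + sproc_hours  # 191 compute hours per month

# Storage after the assumed 75% compression ratio.
raw_tb = 12
compressed_tb = raw_tb * (1 - 0.75)       # 3 TB actually stored

print(dw_hours, compressed_tb)
```

The 191 hours are then billed at the DWU 400 hourly rate, plus the 3 TB of storage.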
This is the cost of running Azure SQL Data Warehouse per month.
Azure Machine Learning
The customer has 5 "work from home" data scientists who will be using Azure ML Studio. That would be 5 user seats. Out of the 8 working hours per day, let's say they sit around (watching Netflix) for about 1 hour per day waiting for their project experiments to complete. Hence they effectively work for only 6 hours per day (minus 1 hour for lunch and toilet breaks). Regardless, that's 1 hour x 22 days x 5 users = 110 experiment hours per month.
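The experiment-hour figure comes out of this small calculation (the 1 hour of experiment runtime per person per day is the assumption above):

```python
# Azure ML Studio experiment-hour estimate for the data science team.
users = 5
experiment_hours_per_day = 1   # time experiments actually run per user
working_days = 22              # work-from-home weekdays per month

experiment_hours = users * experiment_hours_per_day * working_days  # 110 h

print(experiment_hours)
```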
The ML predictive models are deployed as web services, and they have deployed 5 of them. The web services are mainly used internally, receiving only about 10 transaction calls per model per day. Each model is able to complete its computation in less than 15 minutes. We will size in the Standard S1 Web API tier, which allows 10 deployed web services, 100,000 transaction calls and 25 compute hours, sufficient for their use case.
We do need to retrain the models with new data every now and then. This will again be done using the awesome Data Factory. It will be 2 low frequency cloud activities per model, an ML Batch Execution activity and an ML Update Resource activity, which together perform the model retraining. We do this once per month: 2 activities x 5 web services = 10 low frequency cloud activities. We will also size in 500 GB of Blob storage for Azure ML usage (output predicted data, staging data, etc.).
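A quick check that the S1 tier comfortably covers the web service traffic, and a tally of the retraining activities (all counts from the use case above):

```python
# Web service transaction volume vs. the S1 tier's included 100,000 calls.
models = 5
calls_per_model_per_day = 10
transactions = models * calls_per_model_per_day * 30   # 1,500 calls/month
s1_included_calls = 100_000
headroom_ok = transactions < s1_included_calls         # True: well within S1

# Monthly retraining: Batch Execution + Update Resource per model.
retrain_activities = 2 * models                        # 10 cloud activities

print(transactions, headroom_ok, retrain_activities)
```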
Azure Data Factory Cloud cost so far.
Add in the Azure Active Directory free tier for access and identity management and with that... we have our finalized cost estimate based on our hypothetical use case. This is the entire cost breakdown that we made.
We didn't factor in HA and DR requirements. The thing about cloud Platform as a Service is that HA/DR is mostly built in: Blob storage is locally redundant with 3 copies, SQL DW has built-in snapshot backups, HDInsight is HA by design with 2 head nodes and 4 worker nodes, etc. Even without going deep into high availability and disaster recovery requirements, the basics are covered.
Another cool thing about going for cloud solutions is that we can always scale most of these services up or down, giving us the flexibility to "change our mind". Even if our usage estimates go terribly wrong, it's never as bad for cloud solutions compared to, say, paying millions of dollars for an on-premises data warehouse that takes up lab space with the risk of utilizing only 10% of it.
An estimated cost of close to USD 38k per year is also incredibly low for the capability to process and store terabytes of data in an MPP data warehouse, plus machine learning capability for predictive analytics workloads.
Oh ya, one more thing... it seems like we missed out on Power BI... just add USD 10 per user per month. ;-)