Continuous integration in data warehouse development
A lot has been said about Continuous Integration (CI) in software application development. But when it comes to data warehouse development the concepts involved present different challenges to a team wanting to benefit from CI.
Citing the heading of an article by Martin Fowler, continuous integration can be summarised as:
Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.
Importing Continuous Integration techniques into Data Warehousing
Software engineering practices often come from the software application development community, be that .Net, Java or others. The software development community has been highly effective in identifying patterns and developing techniques for automating repetitive tasks, in so doing the community has developed a set of tools and practices which assist in the delivery of high quality software solutions covering the build process, testing, integration and deployment. Applying the same techniques and concepts in database development and specifically in data warehouse development has historically been more complex. One reason for this is that databases are live containers which keep important data and changes to database objects, such as tables, may have an impact on the data and any derived meaning. The database server industry has typically been more focused on adding processing capabilities to their products leaving other kinds of tools, techniques and approaches behind which include some better practices. For example, in order to change a table in a database data may have to be moved around, in most cases using manually created scripts. Even if you just need to change a stored procedure, function or add a trigger to a table, this change should be thoroughly tested before it is sent to a live environment.
Some database specialist companies have developed their own tools to make continuous integration involving databases much easier, such as Redgate ,but none seem to be tailored to deal with the whole lot of technologies involved in a typical Microsoft data warehouse development project . On a typical data warehouse project using Microsoft technologies you will find SSIS objects containing the ETL logic and a Sql Server database as your data repository. If you intend to use continuous integration on a project you will need to have other support tools. A source code repository which allows versioning and configuration management, such as Team Foundation Server, tools for test automation of your ETL logic and some sort of mechanism to automate your builds in which you run those tests periodically to check that your source code always has the latest working version of your solution. Ideally you should be able to automate database change releases as well.
Concentra's approach to Continuous Integration in Microsoft Data Warehousing
At Concentra we have been building on continuous integration practices that suit our development practices . In a typical data warehouse project we need development and deployment solutions for our SSIS components and the database objects. For the ETL we use SSIS and now with Sql Server Data Tools database development and deployment is almost as easy as deploying a .Net application. Using Powershell we can control any deployment of both, database and SSIS objects fully or incrementally. In addition we have developed our own test automation framework for developers and testers which is flexible enough to create comprehensive test plans at the beginning of any project. To close the loop, using Team City we can run all automated tests during the night and get full reports the next morning. Since the process is automated we can set up similar tests in different environments (development, testing, staging, UAT, etc.) ensuring that the build will be released into a production environment smoothly.
The additional steps involved will initially require more resources, time and effort to set up but implementing all or some of these CI techniques within a project will result in time-savings by reducing errors and automating standard processes. Automating tasks also allows the development and testing team to focus where it matters most. Practices alone will not necessarily guarantee a successful delivery but they can definitely help to enhance your team’s confidence as well as to build trust on the solutions you deliver to your clients.
Continuous integration as a best practice
Database development practices have taken a long time to catch up with the rest of the software development industry techniques, such as CI but Microsoft and other software providers seem to be putting their efforts in the right places now. With a variety of additional features and improvements across existing toolsets we can create and enhance our data warehousing practices to build more mature and higher quality solutions.