To subset or not to subset
One of our application development customers found that giving developers and QA full copies of production was consuming excessive storage. To reduce costs, they decided to use subsets of production data for development and QA. Here is how it played out:
- Data was growing and storage costs were too high, so they decided to roll out subsetting
- App teams and IT Ops teams had to coordinate and manage the complexity of the shift to subsets in dev/test
- Scripts had to be written to extract correct and coherent data, for example by selecting the right date ranges and respecting referential integrity
- It’s difficult to build a subset that holds 50% of the data but preserves 100% of the skew, rather than 50% of the data with only 50% of the skew
- Scripts were constantly breaking as production data evolved, requiring more work on the subsetting scripts
- QA teams had to rewrite automated test scripts to run correctly on subsets
- Time lost in the ADLC and SDLC to make subsets work (converting CapEx into higher OpEx) put pressure on release schedules
- Errors were caught late in UAT, performance, and integration testing, creating “integration or testing hell” at the end of development cycles
- Major incidents occurred after deployment, forcing more detailed tracking of root cause analysis (RCA)
- 20-40% of the production bugs that caused downtime were due to non-representative data sets and volumes
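To give a feel for what even the simplest of these extraction scripts involves, here is a minimal sketch in Python using the standard `sqlite3` module. The `customers` and `orders` tables, their columns, and the function names are hypothetical and chosen purely for illustration; a real production schema has many more tables and foreign-key chains, which is where the maintenance burden comes from. The sketch pulls orders in a date range and then pulls only the parent rows those orders reference, so the subset has no orphaned foreign keys.

```python
import sqlite3

# Hypothetical two-table schema, used for both the source and the subset.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    order_date TEXT NOT NULL  -- ISO dates compare correctly as text
);
"""


def build_source(conn):
    """Create a tiny stand-in for production (illustrative rows only)."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Ann"), (2, "Bob"), (3, "Cy")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (10, 1, "2023-11-02"),
            (11, 2, "2024-01-15"),
            (12, 2, "2024-02-20"),
            (13, 3, "2024-03-05"),
        ],
    )
    conn.commit()


def extract_subset(src, dst, start, end):
    """Copy orders with start <= order_date < end, plus their parent customers."""
    dst.executescript(SCHEMA)
    orders = src.execute(
        "SELECT id, customer_id, order_date FROM orders "
        "WHERE order_date >= ? AND order_date < ?",
        (start, end),
    ).fetchall()

    # Collect only the parent keys the selected child rows actually reference.
    customer_ids = sorted({row[1] for row in orders})
    customers = []
    if customer_ids:
        placeholders = ",".join("?" * len(customer_ids))
        customers = src.execute(
            f"SELECT id, name FROM customers WHERE id IN ({placeholders})",
            customer_ids,
        ).fetchall()

    # Insert parents before children so the subset is coherent at every step.
    dst.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()


def orphan_count(conn):
    """Count orders whose customer is missing, i.e. broken referential integrity."""
    return conn.execute(
        "SELECT COUNT(*) FROM orders o "
        "LEFT JOIN customers c ON c.id = o.customer_id "
        "WHERE c.id IS NULL"
    ).fetchone()[0]
```

Every new table or foreign key added in production means another query like these to write and keep in sync, and nothing in this sketch addresses skew: a date-range filter keeps whatever distribution happens to fall inside the window.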
Moral of the story: if you roll out subsetting, it’s worth holding the teams accountable and tracking the total cost and impact across teams and release cycles. What is the real cost impact of going to subsetting? How much extra time goes into building and maintaining the subsets, and more importantly, what is the cost impact of letting bugs slip into production because of the subsets?
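One rough way to frame that accounting is a simple net-impact calculation. The sketch below is only an illustration: the function name, the parameters, and the idea of pricing everything in a single currency per release cycle are assumptions, not a model taken from any customer engagement. You would plug in your own tracked figures.

```python
def subsetting_net_impact(
    storage_savings,      # storage cost avoided by using subsets
    maintenance_hours,    # hours spent building and fixing subset scripts and tests
    hourly_rate,          # loaded cost of one engineering hour
    escaped_incidents,    # production incidents attributed to non-representative data
    cost_per_incident,    # average cost of one such incident (downtime, RCA, rework)
):
    """Return storage savings minus the hidden costs of subsetting.

    A negative result means subsetting cost more than it saved.
    """
    maintenance_cost = maintenance_hours * hourly_rate
    incident_cost = escaped_incidents * cost_per_incident
    return storage_savings - (maintenance_cost + incident_cost)
```

The hard part in practice is not the arithmetic but the attribution: deciding which hours and which incidents really trace back to the subsets, which is why the tracking has to span teams and release cycles.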
A robust, efficient, and cost-saving alternative is database virtualization. With database virtualization, database copies take up almost no space and can be made in minutes, and all the overhead and complexity listed above go away. In addition, database virtualization reduces CapEx/OpEx in many other areas, such as:
- Provisioning operational reporting environments
- Providing controlled backup/restore for DBAs
- Providing full-scale test environments
Subsets also do not provide the data control features that database virtualization offers to accelerate application projects (including analytics, MDM, ADLC/SDLC, etc.). Our customers repeatedly see 50% acceleration on project timelines and cost, gains that generally dwarf the CapEx and OpEx storage expense lines, thanks to the features we make available in our virtual environments:
- Fast data refresh
- Branching (split off a copy of the dev database for use in QA in minutes)
- Automated secure branches (masked data for dev)
- Bookmarks for version control or compliance preservation
- Share (pass errors and the data environment from test to dev: if QA finds a bug, they can pass a copy of the database back to dev for investigation)
- Reset/rollback (recover to pre-test state or pre-error state)
- Parallelize all steps: run multiple QA databases so QA suites can execute in parallel, and give every developer their own copy of the database so they can develop without impacting other developers