Allocate on Write Thin Cloning
Three challenges specifically stand out when considering Copy on Write filesystem snapshots described in the previous section:
- The number of snapshots you can take of source database LUNs is limited
- The size of the snapshots is limited
- Difficulties arise sharing the base image of source databases at multiple points in time. In some cases it is not possible, in others difficult or resource heavy.
These challenges highlight a specific need: to create thin provision clones of a source database from multiple points of time at the same time without using any additional space consumption. This requirement is important, as it allows one base image to serve as the foundation for all subsequent clones and imposes no unplanned storage or refresh requirements on users of the target (cloned) systems.
With a filesystem storage technology called Allocate on Write, these challenges can be met. In allocate on write filesystems, data blocks are never modified. When modifications are requested to a block, the block with the new changes is written to a new location. After a request to modify a block has been issued and completed there will be two versions of the block: the version that existed prior to modification and the modified block. The location of the blocks and the versioning information for each block is located in a metadata area that is in turn managed by the same allocate on write mechanism. When a new version of a block has been written to a new location, the metadata has to be modified. However, instead of modifying the contents of the relevant metadata block, the new metadata block is written to a new location. These allocations of new metadata blocks with points to the new block ripple up the metadata structures all the way to the root block of the metadata. Ultimately, the root metadata block will be allocated in a new place pointing to the new versions of all blocks, meaning that the previous root block points to the filesystem at a previous point in time. The current, recently modified root block points to the filesystem at the current point in time. Through this mechanism an allocate on write system is capable of holding complete version history of not only a block, but all blocks involved in that block’s tracking.
Figure 10. When a datablock in the bottom left is modified, instead of modifying the current block a new block is allocated with the modified contents. The metadata pointing to this new location has to be modified as well, and again instead of modifying the current metadata block, a new metadata block is allocated. These changes ripple up the structure such that the current root block points to the filesystem at the current point in time while the previous root block points to the filesystem at the previous point in time.
Allocate on write has many similar properties with EMC’s VNX copy on write and NetApp’s WAFL systems, but the way allocate on write has been implemented in ZFS eliminates the boundaries found in both. With ZFS there is no practical size limitations to snapshots, no practical limit to the number of snapshots, and snapshots are almost instantaneously and practical zero space (on the order of a few kilobytes).
ZFS was developed by Sun Microsystems to address the limitations and complexity of filesystems and storage. Storage capacity is growing rapidly, yet filesystems have many limitations on how many files can be in a directory or how big a volume can be. Volume sizes are predetermined and have to be shrunk or expanded later depending on how far off the original calculation was, making capacity planning an incredibly important task. Any requirement to change filesystem sizes could cause hours of outages while filesystems are remounted and fsck is run. ZFS has no need for filesystem checks because it is designed to always be consistent on disk. The filesystems can be allocated without size constraints because they are allocated out of a storage pool that can easily be extended on the fly. The storage pool is a set of disks or LUNs. All disks are generally assigned to one pool on a system, and thus all ZFS filesystems using that pool have access to the entire space in the pool. More importantly, they have access to all the I/O operations for the spindles in that pool. In many ways, it completely eliminates the traditional idea of volumes.
On a non-ZFS filesystem the interface is a block device. Writes are done per block and there are no transaction boundaries. In the case of a loss of power or other critical issue there is also a loss of consistency. While the inconsistency issues have been addressed by journaling, that solution impacts performance and can be complex.
In a ZFS filesystem all writes are executed via allocate on write, and thus no data is overwritten. Writes are written in transaction groups such that all related writes succeed or fail as a whole, alleviating the need for fsck operations or journaling. On-disk states are always valid and there are no on-disk “windows of vulnerability”. Everything is checksummed and there is no silent data corruption.
Figure 11. Comparison of non-ZFS filesystems on top and ZFS filesystems on the bottom. The ZFS filesystems are created in a storage pool that has all the available spindles, giving filesystems access to all the storage and IOPS from the entire pool. On the other hand, the non-ZFS filesystems are created on volumes and those volumes are attached to a specific set of spindles, creating islands of storage and limiting the IOPS for each filesystem.
Excepting certain hardware or volume manager specific software packages, the general comparison between non-ZFS and ZFS filesystems is as follows:
- One filesystem per volume
- Filesystem has limited bandwidth
- Storage is stranded on the volume
- Many filesystems in a pool
- Filesystems grow automatically
- Filesystems have access to all bandwidth
Along with many filesystem improvements, ZFS basically has moved the size barrier beyond any existing hardware that has yet been created and has no limitations on the number of snapshots that can be created. The maximum number of snapshots is 2^64 (18 quintillion) and the maximum size of a filesystem is 2^64 bytes (18.45 Exabytes).
A ZFS snapshot is a read-only copy of a filesystem. Snapshot creation is basically instantaneous and the number of snapshots is practically unlimited. Each snapshot takes up no additional space until original blocks become modified or deleted. As snapshots are used for clones and the clones are modified, the new modified blocks will take up additional space. A clone is a writeable copy of a snapshot. Creation of a clone is practically instantaneous and for all practical purposes the number of clones is unlimited.
Snapshots can be sent to a remote ZFS array via a send and receive protocol. Either a full snapshot or incremental changes between snapshots can be sent. Incremental snaps generally send and receive quickly and can efficiently locate modified blocks.
One concern with allocate on write technology is that a single block modification can set off a cascade of block allocations. First, the datablock to be modified is not overwritten but a new block is allocated and the modified contents are written into the new block (similar to copy on write). The metadata that points to the new datablock location has to be modified; but again, instead of overwriting the metadata block, a new block is allocated and the modified data is written into the new block. These changes cascade all the way up the metadata tree to the root block or uber block (see Figure 10). Thus for one data block change there can be 5 new blocks allocated. These allocations are quick as they take place in memory, but what happens when they are written out to disk? Blocks are written out to disk in batches every few seconds for non-synchronous writes. On an idle or low activity filesystem a single block change could create 5 writes to disk, but on an active filesystem the total number of metadata blocks changed will be small compared to the number of datablocks. For every metadata block written there will typically be several datablocks that have been modified. On an active filesystem typically a single metadata block covers the modifications of 10 or 20 datablocks and thus the extra number of blocks written to disk is usually on the order of 10% the actual metadata block count.
Figure 12. The flow of transaction data through in-memory buffers and disk.
But what happens for sync writes that can’t wait for block write batches that happen every few seconds? In those cases the sync writes must be written out immediately. Sync writes depend on another structure called the ZFS Intent Log (ZIL). The ZIL is like a database change log or redo log. It contains just the change vectors and is written sequentially and continuously such that a synchronous write request for a datablock change only has to wait for the write to the ZIL to complete. There is a ZIL per filesystem, and it is responsible for handling synchronous write semantics. The ZIL creates log records for events that change the filesystem (write, create, etc.). The log records will have enough information to replay any changes that might be lost in memory in case of a power outage where the block changes in memory are lost. Log records are stored in memory until either:
- Transaction group commits
- A synchronous write requirement is encountered (e.g. fsync() or O_DSYNC)
In the event of a power failure or panic, log records are replayed. Synchronous writes will not return until ZIL log records are committed to disk.
Another concern is that blocks that were initially written sequentially next to each other may end up spread over the disk after modifications to those blocks due to the updates resulting in a new block being allocated to a different location. This fragmentation has little effect on random read workloads but multiblock reads can suffer from this because a simple request for a continuous number of blocks may turn into several individual reads by ZFS.
ZFS also introduced the concept of hybrid storage pools where both traditional spinning disks and modern flash-based SSDs are used in conjunction. In general, disks are cheap and large in size but are limited both in latency and throughput by mechanics. Flash devices on the other hand provide I/O requests with latency that is only a small fraction of that of disks; however, they are very expensive per gigabytes. So while it may be tempting to achieve the best possible performance by putting all data on SSDs, this is usually still too cost prohibitive. ZFS allows mixing these two storage technologies in a storage pool, after which the ZIL can be placed on a mirror of flash devices to speed up synchronous write requests where latency is crucial.
Another use for SSDs in ZFS is for cache devices. ZFS caches blocks in a memory area called the Adaptive Replacement Cache—also the name of the algorithm used to determine which blocks have a higher chance of being requested again. The ARC is limited in size by the available system memory; however, a stripe of SSD devices for a level 2 ARC can be configured to extend the size of the cache. Since many clones can be dependent on one snapshot, being able to cache that snapshot will speed up access to all the thin clones based off of that snapshot.
Figure 13. A storage pool with an SSD caching layer and ZFS Intent Log for syncing.
With these capabilities in mind, there are several methods available to use this technology for database thin clones:
- Open Source ZFS snapshots and clones
- ZFS Storage Appliance from Oracle with RMAN
- ZFS Storage Appliance from Oracle with Dataguard
(Open) Solaris ZFS
ZFS is available in a number of operating systems today. It was released in Solaris 10 and has gained even more features and importance in Solaris 11. After the acquisition of Sun by Oracle, the OpenSolaris project was abandoned but the community forked a number of open source projects, the most notable of which is Illumos and OpenIndiana. These releases are still actively being developed and maintained. Many commercial products are built on these open source projects.
Any one of these systems can be used to build your own ZFS based storage system to support thin cloning:
- Database storage on local ZFS
- ZFS storage as an NFS filer
- ZFS storage as an iSCSI/block storage array
When a database is already running on Solaris with local disks, a ZFS filesystem can be used to hold all database files. Creating snapshots and clones on that filesystem is a simple matter of using a few ZFS commands; however, one does not have to bother with storage protocols like NFS. If Solaris is in use and datafiles are on ZFS anyways, it may also be a good idea to automate regular snapshots as an extra layer of security and to enable a “poor man’s flashback database”.
When a database is not running locally on a Solaris server, you can still benefit from ZFS features by building your own ZFS storage server. You can share ZFS volumes via iSCSI or fibre channel and use ASM on the database server for datafiles but instead we will focus on the easier setup with ZFS filesystems and the NFS protocol to share the volumes.
On a Solaris Storage server
- Create a zpool (ZFS pool)
- Create a ZFS filesystem in the pool
- Export that filesystem via NFS
On the source database server
- Mount the NFS filesystem
- Put datafiles on the NFS mount as one of:
- “live” data (this may have performance implications)
- backup image copies (or an RMAN clone)
- a replication target
On the Solaris Storage server
- Take snapshots whenever necessary
- Create clones from the snapshots
- Export the clones via NFS
On the target database server
- Mount NFS clones
- Use this thin clone
ZFS Storage Appliance with RMAN
Oracle sells a ZFS storage appliance preconfigured with disks, memory, ZFS filesystem, and a powerful monitoring and analytics dashboard. One of these appliances can be used to create database thin clones; in fact, Oracle has published a 44-page white paper outlining the steps (found at https://www.oracle.com/technetwork/articles/systems-hardware-architecture/cloning-solution-353626.pdf). In brief, the steps involved are:
On the ZFS Appliance
- Create a “db_master” project
- Create a “db_clone” project
- For both the “db_clone” and “db_master” project, create 4 filesystems:
On the Source Database
- Mount a directory from the ZFS Appliance via NFS
- Backup the source database with RMAN to the NFS mount directory
On the ZFS Appliance
- Select the “db_master” project
- Snapshot the “db_master” project
- Clone each filesystem on “db_master” to the “db_clone” project
On the Target Host
- Mount the 4 filesystems from the db_clone project via NFS
- Startup the clone database on the target host using the directories from the db_clone project mount via NFS from the ZFS storage appliance
Figure 14. A diagram of the procedure used to clone databases using the ZFS storage appliance and RMAN. First a directory is mounted on the source machine from the ZFS storage appliance via NFS. Then an RMAN backup is taken of the source database onto the NFS mounted directory. The snapshot can be taken off the RMAN backup on the ZFS storage appliance and then used to create thin clones.
ZFS Storage Appliance with DataGuard
One way to efficiently address getting changes from a source database onto a ZFS storage appliance is by using Dataguard as outlined in Oracle’s white paper on Maximum Availability Architecture (MAA) DB Cloning. You can find the document at the following link:
The concept revolves around using Dataguard to host the datafiles from a Dataguard instance on the ZFS storage appliance. With the datafiles hosted in ZFS, all changes from the source database will be propagated to the ZFS Storage Appliance via the Dataguard instance. Once the Dataguard datafiles are hosted on the ZFS storage appliance, the snapshots of the datafiles can easily be taken at desired points in time and clones can be made from the snapshots. The ZFS clones can be used to start up database thin clones on target database hosts by mounting those datafiles via NFS to the target hosts.
Figure 15. Using Dataguard, files can be shared with a ZFS storage appliance via NFS to use for thin cloning of a target database.