Report written by Hugo Bonnaffé

Why OVH is betting on OpenZFS

The annual OpenZFS Developer Summit took place on October 19-20 in San Francisco. Used rather discreetly by IT professionals until now, OpenZFS is celebrating its 10-year anniversary and is becoming increasingly popular. This open source technology was used to store the twelve petabytes of dailies for the movie Gravity (1), and it has been integrated as a native file system into the latest release of Ubuntu. This year, two members of the OVH storage team attended the Developer Summit to learn about new developments and to present their contribution to the community, “Live migration with Zmotion”. François Lesage and Alexandre Lecuyer, OVH storage engineers, give more details and an overview of the event.

OVH offers a patch to migrate data with no down time

The OpenZFS project derives from the ZFS project sponsored by Sun Microsystems (purchased by Oracle in 2009). It is the result of a fork in 2005, supported by an independent community mainly made up of former Sun employees. Apart from Netflix, which uses OpenZFS on its Titan (2) platform, few major IT players publicly claim to use OpenZFS, even though its reliability is now proven.

Traditional storage vendors offer reliable and robust proprietary solutions, whose undeniable advantage is to reassure users about the durability of their data. “The downside,” explains François, “is the price paid. These types of systems rely on specific hardware, which significantly raises the cost per gigabyte, and they’re real black boxes: it's not possible to know how the code works, nor to modify it yourself.”

Compatible with standard machines (x86 architecture), OpenZFS drastically reduces the cost of storage. In addition, the code is open, so OVH is able to adapt and improve it. “Ten years after the launch of the project, OpenZFS has reached maturity,” continues François. “Its data integrity verification system, which prevents silent file corruption, is among the most effective. Moreover, OpenZFS is very rich in features: snapshots, hot and cold storage, etc.” In 2007, OVH abandoned EXT3 to take advantage of ZFS, and it took an interest in OpenZFS in 2009. The first OVH projects in production under OpenZFS became reality at the end of 2011, and it has now become the underlying technology of many services: e-mail, web hosting, VPS Classic, NAS-HA, Dedicated Cloud, Backup Storage, etc. “At the time, the technology was already mature and the expertise we acquired has allowed us to effectively compete with proprietary storage systems.”

This OpenZFS experience is exactly what François and Alexandre went to San Francisco to share. “Our technical presentation covered the subject of data migration under OpenZFS. Our hosting activity requires that we continuously allocate and deallocate storage spaces (zpools), which causes fragmentation issues. For this reason, we must regularly perform data migrations. Unfortunately, due to constraints linked to the use of NFS in our infrastructures, OpenZFS doesn’t allow these operations to be carried out without downtime. We attempted to solve this problem through trial and error.” The patch, written by Alexandre and named “Zmotion”, was reviewed at a community workshop attended by Matt Ahrens, cofounder of the ZFS project. It has been proposed for inclusion upstream and is already available on GitHub. “We think that this is something of interest to large companies because the issue of data migration is a recurring subject of discussion on the community mailing list,” explains Alexandre.

Watch the “Live Migration with Zmotion” presentation by OVH:

A SlideShare presentation is also available at:

9 contributions which will make OpenZFS even better

“In general, the presentations given by the different contributors were of a very high technical level,” François and Alexandre report. “They promise significant progress for OpenZFS, from which users will benefit in the short, medium and long term.”

1 - Compressed ARC (Adaptive Replacement Cache)
OpenZFS operates with a two-level cache system (ARC and L2ARC), in which the most frequently accessed data is stored. The first level of cache draws its resources from RAM, while the second level is hosted on disks, most often SSDs. The goal of George Wilson’s (Delphix) contribution is to make it possible to compress and decompress on the fly the data held in RAM, the first level of cache. For example, a cached .txt file would take up about three times less space. Result: cache performance could be increased without adding more hardware. Availability in OpenZFS: short term.
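
To get a feel for the space saving described here, the following sketch compresses a block of plain text in memory and reports the ratio. It is only an illustration under stated assumptions: Python's standard zlib module stands in for whatever compressor a pool actually uses, the sample text is made up, and such a repetitive sample compresses far better than ordinary text would.

```python
import zlib

def compression_ratio(data: bytes) -> float:
    """Return original size divided by compressed size."""
    compressed = zlib.compress(data)
    # Round-trip check: what is decompressed on a cache hit must match the original.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed)

# Made-up sample: repetitive plain text, loosely resembling a log or .txt file.
sample = ("GET /index.php HTTP/1.1 200 OK\n" * 500).encode("utf-8")
ratio = compression_ratio(sample)
print(f"{len(sample)} bytes of text, compression ratio {ratio:.1f}x")
```

A ratio above 1 means the same amount of RAM can hold more cached data, at the cost of CPU time spent compressing and decompressing.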

2 - Discontiguous caching with ABD
OpenZFS’s own caching system, ARC, duplicates the work of the OS cache, such as the page cache under Linux. This means that a file is actually cached twice: once in the OpenZFS cache and a second time in the OS cache, which wastes RAM. David Chen’s contribution (OSNexus) is intended to allow OpenZFS to use the standard OS caching mechanisms. Availability in OpenZFS: mid term.

3 - Persistent L2ARC
The second-level OpenZFS caching mechanism is particularly sophisticated. The choice of cached data (on SSDs in most cases, since production data is hosted on slower disks) is the result of the work of two competing algorithms. This caching mechanism has a weak point: in the event of a reboot, the cache must be rebuilt. This operation can take several hours (up to 24 hours in the case of a shared hosting filer cache) before maximum performance is regained. Saso Kiselkov's contribution (Nexenta) preserves the cache and the hot cache even after a reboot, thanks to a modification of the format. Availability in OpenZFS: short term on Illumos.

4 - Writeback cache
The idea behind Alex Aizman’s contribution (Nexenta) is to put in place a write cache mechanism, separate from the disks dedicated to logs. For example, in the case of write bursts, data is written to an SSD or a PCI card, then transferred asynchronously to the pool of disks. Availability in OpenZFS: not known.

5 - Compressed Send and Receive
Dan Kimmel’s contribution (Delphix) consists of directly sending a compressed data stream during a backup or migration. This makes it possible to eliminate the decompression step when sending and the recompression step when receiving. Result: bandwidth consumption is reduced, backups complete faster and the CPU load during backup is lower. Availability in OpenZFS: short to medium term.

6 - Resumable ZFS send/receive
This contribution by Matthew Ahrens (Delphix), presented in Paris six months ago at the OpenZFS European Conference, makes it possible to resume a backup with a token in the event of a network failure during a transfer. Today, by contrast, an interrupted backup has to be restarted from zero. This is already available on Illumos and FreeBSD, and will soon be available on Linux.
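
The principle of resuming with a token can be sketched with a small simulation. This is a toy model, not the OpenZFS implementation: the `send` function, the `Receiver` class and the use of a plain byte offset as the token are all hypothetical choices made for illustration.

```python
def send(stream: bytes, token: int, chunk: int = 4):
    """Yield (offset, data) pairs, starting at the offset given by the token."""
    for off in range(token, len(stream), chunk):
        yield off, stream[off:off + chunk]

class Receiver:
    def __init__(self):
        self.data = bytearray()

    @property
    def resume_token(self) -> int:
        # Hypothetical token: the number of bytes safely written so far.
        return len(self.data)

    def receive(self, chunks, fail_after=None):
        for n, (off, data) in enumerate(chunks):
            if fail_after is not None and n == fail_after:
                raise ConnectionError("simulated network failure")
            assert off == len(self.data)  # chunks must arrive in order
            self.data += data

stream = b"snapshot-data-0123456789"
rx = Receiver()
try:
    # First attempt: the link drops after three chunks.
    rx.receive(send(stream, rx.resume_token), fail_after=3)
except ConnectionError:
    pass

resumed_from = rx.resume_token           # sender restarts here, not at zero
rx.receive(send(stream, resumed_from))
print(resumed_from, bytes(rx.data) == stream)
```

Only the part of the stream that had not yet been received is sent again after the failure.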

7 - Parity Declustered RAID-Z/Mirror
This contribution by Isaac Huang (Intel) is based on extensive research in applied mathematics and aims at quicker reconstruction of a RAIDZ (OpenZFS RAID) after a disk failure. With today's large-capacity infrastructures, resilvering a RAIDZ sometimes takes days, and a second disk in the RAIDZ may fail before the reconstruction is complete. Should this happen, there is a great risk that data will be lost. By optimizing the distribution of data blocks across the disks, this project, still at the R&D stage, considerably reduces the time needed to rebuild a RAID.

8 - Dedup Ceiling
Data deduplication has existed for a long time in OpenZFS. In practice it is hardly used, because it can be risky when handling a large number of files. The dedup table is stored in RAM and, when it becomes too large, the response time of the deduplication system increases dramatically. Saso Kiselkov's contribution (Nexenta) is intended to store the dedup table on a dedicated disk, with the ability to predefine its maximum size. Result: deduplication of files finally becomes usable! The impact on OVH shared web hosting could be considerable. For example, the core files of WordPress are replicated millions of times, and deduplication would provide very significant space savings. Availability in OpenZFS: mid term.
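
The mechanism can be illustrated with a minimal sketch of a dedup table keyed by content checksum. The `DedupStore` class and the sample data are hypothetical and greatly simplified: real deduplication works on blocks inside the pool, not on Python objects.

```python
import hashlib

class DedupStore:
    def __init__(self):
        # Dedup table: checksum -> [block contents, reference count]
        self.table = {}

    def write(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        if key in self.table:
            self.table[key][1] += 1      # already stored: just count a reference
        else:
            self.table[key] = [block, 1]
        return key

    def logical_bytes(self) -> int:
        return sum(len(block) * refs for block, refs in self.table.values())

    def physical_bytes(self) -> int:
        return sum(len(block) for block, _ in self.table.values())

store = DedupStore()
core_file = b"<?php /* identical core file */ ?>" * 100   # made-up content
for site in range(1000):
    store.write(core_file)                   # same file on every site
    store.write(f"config for site {site}".encode())  # unique per site

print(len(store.table), store.logical_bytes(), store.physical_bytes())
```

The shared file is stored once however many sites use it, but the table gains one entry per unique block, which is why its size becomes a concern when it has to fit in RAM.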

9 - SPA Metadata Allocation Classes
Don Brady's contribution (Intel) is an opportunity to remember that the heart of OpenZFS is an object store, made usable as a file system and relying on metadata objects to reconstruct a tree structure. Some objects are thus stored files while others are metadata (UNIX, ACL, etc.), and these objects are indiscriminately mixed in the ZFS pool. The goal of Don Brady's work is to make it possible to organize objects based on their nature and, for example, to assign metadata objects to SSDs in order to increase the pool's performance. Furthermore, it proposes to associate devices (SATA, SSD, NVMe, disks…) with certain types of metadata. Availability in OpenZFS: unknown.

(2) (slide 18)

A close-up look at the OpenZFS ARC cache system

by Rémy Vandepoel, systems architect at OVH

ZFS and OpenZFS use two different algorithms to decide whether or not to place an object/file in the ARC (the cache stored in RAM). The MRU and MFU (Most Recently Used and Most Frequently Used) algorithms optimize the response time for the most recent and the most requested objects, using almost all of the RAM available on the server. Performance is all the better because the system balances the use of the two algorithms according to the situation.
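
The interplay between the two lists can be sketched with a deliberately simplified cache. This is not the actual ARC algorithm, which also adapts the relative size of the two lists; the `TwoListCache` class and its eviction rule (oldest MRU entry first) are assumptions made for illustration.

```python
from collections import OrderedDict

class TwoListCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.mru = OrderedDict()   # objects seen once, most recent last
        self.mfu = OrderedDict()   # objects seen at least twice
        self.stats = {"mru_hits": 0, "mfu_hits": 0, "misses": 0}

    def get(self, key, load):
        if key in self.mfu:
            self.stats["mfu_hits"] += 1
            self.mfu.move_to_end(key)
            return self.mfu[key]
        if key in self.mru:
            # Second access: promote the object from the MRU to the MFU list.
            self.stats["mru_hits"] += 1
            value = self.mru.pop(key)
            self.mfu[key] = value
            return value
        # Miss: fetch from "disk" and remember it as recently used.
        self.stats["misses"] += 1
        value = load(key)
        self.mru[key] = value
        self._evict()
        return value

    def _evict(self):
        # Simplified policy: drop the oldest MRU entry first, then the oldest MFU entry.
        while len(self.mru) + len(self.mfu) > self.capacity:
            victim = self.mru if self.mru else self.mfu
            victim.popitem(last=False)

cache = TwoListCache(capacity=2)
for key in ["a", "a", "b", "c", "a"]:
    cache.get(key, load=str.upper)   # str.upper stands in for a disk read
print(cache.stats)
```

In this run the object "a" survives the arrival of "b" and "c" because its second access moved it to the MFU list.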

Many statistics are available through the kstat utility, notably the variables zfs::arcstats:hits and zfs::arcstats:misses, which make it possible to assess how effective the cache is.

# kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
zfs:0:arcstats:hits 51250034196
zfs:0:arcstats:misses 5983583701

Here, for example, the hit rate is ~89.5 %.
In other words, only about 10.5 % of requests are served from disk rather than from the RAM cache.
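
The hit rate can be recomputed from the raw counters as hits / (hits + misses). The sketch below parses the two lines of kstat output shown above; the `parse_kstat` helper is a hypothetical name, and the text is pasted in rather than read from a live system.

```python
def parse_kstat(output: str) -> dict:
    """Map the last component of each 'module:instance:name:stat value' line to an int."""
    stats = {}
    for line in output.strip().splitlines():
        name, value = line.split()
        stats[name.rsplit(":", 1)[-1]] = int(value)
    return stats

# Output copied from the kstat command above.
sample = """
zfs:0:arcstats:hits 51250034196
zfs:0:arcstats:misses 5983583701
"""

stats = parse_kstat(sample)
hit_ratio = stats["hits"] / (stats["hits"] + stats["misses"])
print(f"hit ratio: {hit_ratio:.1%}, miss ratio: {1 - hit_ratio:.1%}")
```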

We can also compare the responses made by the two algorithms:
# kstat -p zfs:0:arcstats:mfu_hits zfs:0:arcstats:mru_hits
zfs:0:arcstats:mfu_hits 28694504638
zfs:0:arcstats:mru_hits 11647492315
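
As a short worked example, the relative contribution of each algorithm can be derived from these two counters. The share is computed against the sum of mfu_hits and mru_hits only, which is an assumption made here for simplicity.

```python
# Counters copied from the kstat output above.
mfu_hits = 28694504638
mru_hits = 11647492315

total = mfu_hits + mru_hits
mfu_share = mfu_hits / total
mru_share = mru_hits / total
print(f"MFU: {mfu_share:.1%} of these hits, MRU: {mru_share:.1%}")
```

On this server, roughly seven out of ten of these hits come from the MFU list, which suggests a workload dominated by repeatedly requested objects.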

Tip: All statistics are available using ‘kstat -pn arcstats’.