EXT4 vs XFS: large volumes with low-end RAID controller

Some months ago, I wrote an article comparing EXT3, EXT4, XFS and BTRFS filesystem performance on a Fedora 14 x86_64 installation running on a Dell Latitude D620 laptop. While the results were quite interesting (especially for evaluating BTRFS performance), they were collected on a consumer machine (a laptop) with a consumer-grade processor and HDD. So, the results do not necessarily translate to the server world in a linear manner: a very good filesystem for a single 2.5” HDD can be inadequate for a multi-disk server machine, and vice versa.

Today, thanks to the “Center for Research Computing” at the University of Notre Dame, and especially to Paul Brenner, Serguei Fedorov and Rich Sudlow, I am able to present some filesystem benchmark results collected on a quite powerful Dell R510 server, loaded with 12 x 2 TB SATA disks connected to a low-end, inexpensive PERC H200 controller. The article will focus on EXT4 vs XFS performance, as EXT3 can not grow bigger than 2 TB and BTRFS is way too young (and unproven) to be considered in the server world. I hope that these data can help you to choose the right filesystem for your workload.

While reading this article, please keep in mind that different usage patterns can favor different filesystems, so I don’t claim to elect the always-better, strongest FS in the world. I simply want to give you some numbers collected under various usage patterns, to help you choose the right filesystem for some common jobs. Please also consider that FS performance can vary dramatically between kernel releases; however, this behavior should be mitigated by the fact that RHEL 6.0 uses very conservative, security-focused kernel updates.

Filesystems, mount options and others

As you probably know, mount options can significantly impact filesystem speed, features and reliability. Moreover, the existence of filesystem-specific options means that it is often quite hard to match them 100% across the various filesystems.

Fortunately, the vast majority of FS-specific options have very reasonable and reliability-focused default values, so we can generally use the defaults with no problem. However, if you want to do a meaningful comparison, one option should absolutely be the same between the different setups: the write barrier option.

Write barriers are a synchronization method that enables the OS to safely flush the on-disk cache content to the physical disk platters. Without write barriers, an fsync() call will flush the main memory disk cache, but it will not flush the disk/controller cache. While disabling barriers can sometimes speed up the filesystem/disks combo considerably, it can also lead to data loss, even when the OS assumes that all data were safely written to disk. For example, a power outage will cause the loss of any data in the disk cache that were not yet written to the disk platters.

However, there are circumstances in which write barriers can be disabled without problems: think of a UPS-protected server with a battery-backed disk cache, or simply of a controller/disk combo with no DRAM cache at all. In these cases, a power outage will not cause any loss of cached data, so barriers can be safely disabled.

The Dell R510 server system used for this benchmark round is equipped with a PERC H200 disk controller with no DRAM cache. Moreover, this controller disables any disk-level cache found on the attached disks, so I disabled write barriers with the “nobarrier” mount option.

Please keep in mind that enabling write barriers can cause a different, FS-specific performance drop. For example, XFS generally incurs a greater drop than EXT4. So, while the relative standing should remain more or less similar, the following results should be considered valid only for installations with write barriers disabled.

UPDATE 05/04/2011: for more information about EXT4 and XFS history, mount options and other details, you can visit their respective Wikipedia pages.

Testbed and methods

The Dell R510 has the following hardware and software configuration:

  • 2x Intel Xeon E5620 with HT OFF (4 cores, 4 threads, 12 MB L3 cache) @ 2.4 GHz
  • 8x 4 GB DDR3 RAM (32 GB total RAM)
  • PERC H200 RAID Controller
  • 12x 2 TB 7.2K RPM SATA 3 Gbps disks
  • Red Hat Enterprise Linux 6.0 64 bit

The 12 disks were assigned to 2 RAID arrays:

  • a first, 2-disk RAID 1 array for the OS installation
  • a second, 10-disk RAID 10 array for the benchmark runs

To run the benchmarks, I used the following software packages:

  • bonnie++-1.96-1.el6.rf.x86_64.rpm
  • sysbench-0.4.12-1.el6.x86_64.rpm
  • mysql-server-5.1.52-1.el6_0.1.x86_64.rpm
  • mysql-bench-5.1.52-1.el6_0.1.x86_64.rpm
  • postgresql-server-8.4.7-1.el6_0.1.x86_64.rpm
  • postgresql-test-8.4.7-1.el6_0.1.x86_64.rpm

Please note that the benchmarked filesystems were optimized for the physical array layout (in this case, 5 active data disks and a 64 KB stripe size). Remember that, as stated before, the PERC H200 controller does not have any onboard cache, and it disables any disk-level cache it finds on the attached disks. For this reason, write barriers were disabled.

I ran each benchmark at least 3 times and then reported the mean value.

A note on the CPU load numbers: as this Dell R510 has 8 physical cores that can manage 8 hardware threads (HyperThreading was set to OFF), the maximum CPU load percentage, as reported by the Linux kernel, is 800%. So, when you read something like “100% CPU load”, this means that, on average, only one core (of the 8 available) was fully utilized.
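The relation between the reported load percentage and the number of busy cores can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names are hypothetical and are not part of any benchmark tool.

```python
def max_load_percent(sockets, cores_per_socket, threads_per_core=1):
    """Upper bound of the summed CPU load: 100% for each hardware thread."""
    return sockets * cores_per_socket * threads_per_core * 100


def busy_cores(load_percent):
    """Average number of fully busy cores behind a summed load percentage."""
    return load_percent / 100.0


if __name__ == "__main__":
    # Two quad-core CPUs with HyperThreading off, as in the testbed above.
    print(max_load_percent(sockets=2, cores_per_socket=4))  # 800
    print(busy_cores(100))                                  # 1.0
```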

UPDATE 05/06/2011: the hardware description was updated to correctly describe the core/thread configuration. I originally wrote that the CPUs were two hexa-core processors, while they really are two quad-core processors.

UPDATE 05/10/2011: a reader asked me to explicitly specify the mkfs and mount parameters. For filesystem creation, I used the following commands:

  • EXT4: mkfs.ext4 /dev/sdb1 -E stride=16,stripe-width=80
  • XFS: mkfs.xfs /dev/sdb1 -d su=64k,sw=5

Both filesystems were mounted with default parameters and the “nobarrier” option.
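The alignment values in these commands follow from the array geometry (64 KB stripe size, 5 active data disks). The sketch below shows one way to derive them. It assumes a 4 KB filesystem block size, which is not stated in the article but is consistent with stride=16; the function name is hypothetical.

```python
def raid_alignment(stripe_unit_kb, data_disks, fs_block_kb=4):
    """Derive filesystem alignment options for a striped array.

    stride       = filesystem blocks in one stripe unit (the chunk on one disk)
    stripe width = filesystem blocks in a full stripe across all data disks
    fs_block_kb is an assumption (4 KB), not a value taken from the article.
    """
    if stripe_unit_kb % fs_block_kb != 0:
        raise ValueError("stripe unit must be a multiple of the block size")
    stride = stripe_unit_kb // fs_block_kb
    stripe_width = stride * data_disks
    return {
        "ext4": f"stride={stride},stripe-width={stripe_width}",
        "xfs": f"su={stripe_unit_kb}k,sw={data_disks}",
    }


if __name__ == "__main__":
    options = raid_alignment(stripe_unit_kb=64, data_disks=5)
    print(options["ext4"])  # stride=16,stripe-width=80
    print(options["xfs"])   # su=64k,sw=5
```

Note that XFS takes the stripe unit in bytes (here with a "k" suffix) and the width as a number of data disks, while EXT4 expresses both values in filesystem blocks.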

Filesystem creation and checking time

The first test is related to filesystem creation and checking time. The following graph shows the time needed to create and fsck the ~10 TB filesystem used to fill the RAID 10 array. The fsck command was run after the creation of a significant number of small files, obtained by unpacking the linux-2.6.36.4.tar.bz2 file downloaded from kernel.org:

As you can see, XFS was way faster than EXT4 in creating and checking this large volume. However, you should not overestimate these results: remember that you generally create the FS only once, and the fsck operation should be a rare one (after all, both filesystems are journaled for this reason). On the other hand, if you plan to create or check a large filesystem very often, stay away from EXT4 and go with XFS.

Bonnie++ results

Sequential and random read/write speeds are two factors that can greatly influence final application speed. Let’s start by examining Bonnie++ sequential speed and CPU usage:

While EXT4 and XFS generally show comparable results both in normal, cached mode and in synchronous mode, XFS leads the sequential output (write) test by a very large margin. To tell the truth, the EXT4 sequential output test results seem unrealistically low.

What about random speed? Bonnie++’s random I/O test returns the number of seeks per second that the disk subsystem can sustain:

The mechanical nature of current hard disks implies results that are some orders of magnitude lower than the sequential ones: considering 512-byte sectors, we are speaking about a maximum I/O transfer rate of ~264 KB/s. Considering 4096-byte sectors, the I/O transfer rate grows to a maximum of ~2114 KB/s. In this test, we see that EXT4 has a slight advantage; however, in synchronous mode the two contenders are tied.
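The two transfer rates above are simply the seek rate multiplied by the sector size. The sketch below back-calculates the seek rate implied by the ~264 KB/s figure and rescales it to 4096-byte sectors. It assumes that "KB" means 1024 bytes, and the seek rate is derived from the text, not read from the graph; the function names are hypothetical.

```python
KB = 1024  # assumption: "KB" in the text means 1024 bytes


def implied_seeks_per_sec(throughput_kb, sector_bytes):
    """Seek rate implied by a throughput when each seek moves one sector."""
    return throughput_kb * KB / sector_bytes


def seek_bound_throughput_kb(seeks_per_sec, sector_bytes):
    """Throughput in KB/s when each seek transfers exactly one sector."""
    return seeks_per_sec * sector_bytes / KB


if __name__ == "__main__":
    seeks = implied_seeks_per_sec(264, 512)
    print(seeks)                                   # 528.0
    print(seek_bound_throughput_kb(seeks, 4096))   # 2112.0
```

The small difference from the ~2114 KB/s quoted above comes from rounding in the 264 KB/s starting value.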

Let’s now see file creation/deletion, aka metadata handling, performance. First, normal mode:

EXT4 really eclipses XFS in this test, scoring some very high results. However, you can argue that the ~2500 new files/sec scored by XFS should be enough for any kind of workload.

Now, synchronous mode:

This time, XFS was the best.

So, from the Bonnie++ tests we note that, while EXT4 excels in metadata handling, XFS seems to be faster at transferring I/O blocks to and from the disk subsystem, and its synchronous behavior seems to be more robust than EXT4’s.

One last thing to note is that Bonnie++ sometimes crashed the entire machine when running on top of the EXT4 filesystem. The cause of the crash is under investigation, but it seems related to out-of-memory conditions. While Bonnie++ (in synchronous mode) was the only test that triggered the crash, the fact that it brought down the entire machine is a bad thing. XFS, on the other hand, never had this problem.

Sysbench file benchmark

Filesystem I/O performance is a difficult thing to profile. For this reason, I ran another set of sequential and random I/O transfer benchmarks using the sysbench utility. Sequential speed tests were run with 2 MB blocks, while random speed tests were run with 4 KB blocks.

Let’s start with sequential speed:

While in normal, cached mode the two filesystems are quite evenly matched, in the synchronous test we see some divergence: XFS is faster in sequential write, while EXT4 is faster in sequential read.

Please note that EXT4 sequential read is higher in synchronous mode than in the normal one: can this be related to a delayed allocation side effect? Remember that in normal mode, sysbench’s test issues one fsync() per 100 writes, while in synchronous mode it issues one fsync() for each write, effectively disabling the delayed allocator. My two cents: if the read speed of the just-written files is greater in the latter mode, it may be that the delayed allocation feature can sometimes lower performance.
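The difference between the two write patterns can be illustrated with a small Python sketch. This is not sysbench’s implementation: it is a hypothetical helper that writes a number of blocks and calls fsync() at a configurable interval, so that the "one fsync() per 100 writes" and "one fsync() per write" cases can be compared side by side.

```python
import os
import tempfile


def write_blocks(path, block, count, fsync_every):
    """Write `count` copies of `block`, calling fsync() every `fsync_every` writes.

    Returns the number of fsync() calls issued.
    """
    fsync_calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for i in range(1, count + 1):
            os.write(fd, block)
            if i % fsync_every == 0:
                # Force the data written so far out of the OS page cache.
                os.fsync(fd)
                fsync_calls += 1
    finally:
        os.close(fd)
    return fsync_calls


if __name__ == "__main__":
    block = bytes(4096)  # one 4 KB block of zeros
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "testfile")  # hypothetical file name
        print(write_blocks(target, block, 200, fsync_every=100))  # 2
        print(write_blocks(target, block, 200, fsync_every=1))    # 200
```

With frequent fsync() calls the filesystem has little chance to batch allocations, which is why the synchronous pattern effectively bypasses delayed allocation.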

Now, random speed:

I’m not sure how to interpret the XFS random read speed, as it seems to be higher than the theoretical maximum speed (considering a 4 ms rotational delay, 4 KB blocks and 5 active data disks, we end up with a maximum of ~5000 KB/s). Probably, when using XFS, this read benchmark is greatly influenced by OS caching and/or read-ahead settings. Write speed seems fine though, and we see that XFS is faster here by quite a large margin. However, the absolute results are very low: this is, again, a consequence of the mechanical nature of current hard disks and the lack of any caching by the controller/disks combo.
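The theoretical ceiling quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that each disk completes exactly one I/O per access period and that all data disks work in parallel; the function name is hypothetical.

```python
def max_random_read_kb_per_sec(access_time_ms, block_kb, data_disks):
    """Upper bound for random reads when every I/O pays the full access time."""
    ios_per_disk = 1000.0 / access_time_ms  # I/O operations per second per disk
    return ios_per_disk * block_kb * data_disks


if __name__ == "__main__":
    # 4 ms per access, 4 KB blocks, 5 active data disks
    print(max_random_read_kb_per_sec(4, 4, 5))  # 5000.0
```

Any measured value well above this bound suggests that part of the requests were served from memory (page cache or read-ahead) rather than from the platters.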

Untar and cat time

It is very common in the Linux world to distribute a very large number of quite small files as a single compressed archive created with the tar and bzip2/gzip utilities. For example, the Linux kernel sources (downloadable from kernel.org) are distributed in this specific manner.

So, an interesting benchmark would be to record the time needed to untar (extract) the Linux kernel .tar.bz2 file, and then to read back the just-extracted files:

EXT4 is faster in the extraction process, especially considering the very low final sync time.

When considering cat (read) time, however, XFS is the best.

So, these first results show us that there is not a single, best-of-all filesystem. It all depends on the I/O request (read or write) and the workload type (sequential, random, cached, synchronous, etc.).

UPDATE 05/04/2011: I added the detailed mysql-bench results graph.

MySQL benchmarks

It’s now time for some database testing.

The first one is about creating and populating a MySQL database with 10 million rows, using the sysbench oltp prepare benchmark. Which is faster, XFS or EXT4?

It seems that XFS wins by a small margin.

What happens when we start to query the db?

In this simple, read-only test we have a tie.

Now, the complex, read-write, transactional test:

We have another virtual tie here.

Last but not least, we have the mysql-bench benchmark scores:

Please note that this benchmark tests various aspects of a MySQL database, and some of them are not directly influenced by I/O speed. So, the XFS win is quite a remarkable one.

Finally, have a look at the detailed mysql-bench report:

So, summarizing the MySQL results, we can conclude that while XFS is slightly faster than EXT4, you can not go wrong with either of these two filesystems.

PostgreSQL benchmarks

Another popular, open source database server is PostgreSQL. Which filesystem is the fastest here?

The first test is about creating and populating a PostgreSQL database with 100 thousand rows, using the sysbench oltp prepare test:

We have a great EXT4 victory here, with a prepare time way lower than the XFS one.

Now, let’s start to query the db with the simple, read-only sysbench oltp benchmark:

In this read-only test, XFS is no slower than EXT4.

What happens in the complex, read-write, transactional benchmark?

EXT4 is again much faster than XFS.

From these tests it seems that when dealing with writes, EXT4 is faster than XFS in PostgreSQL’s workload type.

Finally, I ran the pgbench benchmark, with scale and requests per client both set to 1000. First, the prepare time:

This time, XFS shows the same performance as EXT4.

Now, the real benchmark run:

EXT4 is again over 2X faster than XFS.

So, in the end, if you plan to use PostgreSQL, go with the EXT4 filesystem (especially if you plan to execute a large number of INSERT / UPDATE / TRANSACTION statements).

Fragmentation

Fragmentation is the #1 enemy of mechanical disks, as every extra head movement lowers total I/O performance.

Both EXT4 and XFS have a reputation for being very fragmentation resistant, but which is the best? Let’s start by counting fragments per file after the extraction of the Linux kernel .tar.bz2 file (see the untar test above for more information):

Yeah, both filesystems were exceptionally resistant to fragmentation here, showing perfect results.
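Counting fragments per file can be scripted. The sketch below is one possible approach, not necessarily the method used for the graphs: it assumes the per-file summary line printed by the filefrag utility from e2fsprogs ("path: N extents found") and parses it to get per-file counts and an average. The sample paths and counts are made up for illustration only.

```python
import re

# Matches filefrag summary lines such as "some/file: 3 extents found"
# or "some/file: 1 extent found".
_FILEFRAG_LINE = re.compile(r"^(?P<path>.+): (?P<count>\d+) extents? found")


def fragments_per_file(filefrag_output):
    """Map each file path to its extent count, skipping unrecognized lines."""
    counts = {}
    for line in filefrag_output.splitlines():
        match = _FILEFRAG_LINE.match(line.strip())
        if match:
            counts[match.group("path")] = int(match.group("count"))
    return counts


def average_fragments(counts):
    """Mean number of extents per file (0.0 for an empty mapping)."""
    return sum(counts.values()) / len(counts) if counts else 0.0


if __name__ == "__main__":
    # Hypothetical sample output, not measured data.
    sample = "linux/Makefile: 1 extent found\nlinux/README: 1 extent found\n"
    counts = fragments_per_file(sample)
    print(counts)
    print(average_fragments(counts))  # 1.0
```

An average of 1.0 corresponds to the "perfect" result described above, where every file occupies a single contiguous extent.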

Sysbench’s sequential and random tests give us another interesting point of reference in this discipline. First, the fragmentation status after the sequential write test:

Now XFS is the leader, with EXT4 lagging quite far behind. It is interesting to note that in the synchronous test (one write / one fsync) EXT4 exhibits lower fragmentation: this can explain the higher sequential read results in synchronous mode recorded earlier. As for XFS, it seems that this filesystem manages large files optimally, and its high sequential read/write speeds are likely a result of the complete lack of fragmentation in this class of files.

The random write test is a harder one:

In this case, both filesystems become heavily fragmented, proving that no filesystem is completely immune to this issue. However, XFS has an edge here: it ships with a functional, proven defragmenter, while the EXT4 package lacks an official, stable defrag utility (such a utility exists, but it is more or less at a beta stage).

Conclusions

Well, if you arrived here, congratulations: you had the patience to analyze about 20 graphs!

So, in the end, which filesystem should you choose for your server, EXT4 or XFS? As stated above, it all depends on the expected workload type. Below are my recommendations:

  • workstation machine: you can not go wrong with either of these two filesystems. While EXT4 is better at file creation and deletion (a common job on any machine), XFS rebalances the choice thanks to its higher speed with large files and near-perfect fragmentation resistance
  • development machine: if you plan to often create / delete / check large volumes, absolutely go with XFS
  • web server (apache + mysql): although EXT4 is competitive, XFS’s higher MySQL and large-file performance gives it the edge here
  • file server: if you plan to store and actively use some large files, go with XFS; otherwise (small files) go with EXT4
  • MySQL database server: I slightly prefer XFS for this kind of workload
  • PostgreSQL server: definitely go with EXT4
  • virtualization (consolidation) server: while virtual machine consolidation is a very complex topic and a definitive answer would require extensive testing, I think that XFS should be the better choice, as it has great large-file performance and excellent fragmentation behavior (also don’t forget its on-line defrag utility)

UPDATE 05/04/2011: Paul asked me to better explain the different filesystem choices for the two database systems benchmarked (MySQL and PostgreSQL). The point is that, while both MySQL and PostgreSQL are very common open source databases, their implementations (and, to a certain extent, their purposes) are very different. For example, MySQL has optimizations aimed at converting (or delaying) some random I/O operations into sequential ones. With these optimizations, MySQL can coalesce several random I/O operations into a single sequential read/write. PostgreSQL, instead, uses different optimizations and generally tends not to delay random I/O writes. So, it is not surprising that EXT4 and XFS behave quite differently with these two database servers.

Remember that the above benchmarks were collected with write barriers disabled! If you have to enable them to guarantee data integrity, the absolute results can be quite different (but the relative standing should remain more or less similar).
