EXT4 vs XFS: large volumes with low-end RAID controller

Some months ago, I wrote an article comparing EXT3, EXT4, XFS and BTRFS filesystem performance with a Fedora 14 x86_64 installation on a Dell Latitude D620 laptop. While the results were quite interesting (especially for evaluating BTRFS performance), they were collected on a consumer machine (a laptop), with a consumer-grade processor and HDD. So, the results do not necessarily translate linearly to the server world: a very good filesystem for a single 2.5” HDD can be inadequate for a multi-disk server machine, and vice versa.

Today, thanks to the “Center for Research Computing” at the University of Notre Dame, and especially to Paul Brenner, Serguei Fedorov and Rich Sudlow, I am able to present some filesystem benchmark results collected on a quite powerful Dell R510 server, loaded with 12 x 2 TB SATA disks connected to a low-end, inexpensive PERC H200 controller. The article will focus on EXT4 vs XFS performance, as EXT3 can not grow bigger than 2 TB and BTRFS is way too young (and unproven) to be considered in the server world. I hope that these data can help you choose the right filesystem for your workload.

While reading this article, please keep in mind that different usage patterns can favor different filesystems, so I don’t pretend to elect the always-better, strongest FS in the world. I simply want to give you some numbers collected under various usage patterns, to help you choose the right filesystem for some common jobs. Please also consider that FS performance can vary dramatically between kernel releases; however, this behavior should be mitigated by the fact that RHEL 6.0 uses very conservative, security-focused kernel updates.

Filesystems, mount options and others

As you probably know, mount options can significantly impact filesystem speed, features and reliability. Moreover, the existence of filesystem-specific options means that it is often quite hard to match them 100% across the various filesystems.

Fortunately, the vast majority of FS-specific options have very reasonable and reliability-focused default values, so we can generally use the defaults with no problem. However, if you want to do a meaningful comparison, one option should absolutely be the same between the different setups: the write barrier option.

Write barriers are a synchronization method that enables the OS to safely flush the on-disk cache content to the physical disk platters. Without write barriers, a fsync() call will flush the main memory disk cache, but it will not flush the disk/controller cache. While disabling barriers can sometimes speed up the filesystem/disk combo considerably, it can also lead to data loss, even when the OS assumes that all data were safely written to disk. For example, a power outage will cause the loss of any data in the disk cache that were not written to the disk platters.

However, there are circumstances when write barriers can be disabled without problems: think of a UPS-protected server with a battery-backed disk cache, or simply of a controller/disk combo with no DRAM cache at all. In this case, a power outage will not imply a cache data loss, so barriers can be safely disabled.

The Dell R510 server system used for this benchmark round is equipped with a PERC H200 disk controller with no DRAM cache. Moreover, this controller disables any disk-level cache found on the attached disks, so I disabled write barriers with the “nobarrier” mount option.

Please keep in mind that enabling write barriers can cause a different, FS-specific performance drop. For example, XFS generally incurs a greater drop than EXT4. So, while the relative standing should remain more or less similar, the following results should be considered valid only for installations with write barriers disabled.

UPDATE 05/04/2011:

For more information about EXT4 and XFS history, mount options and other things, you can visit the following Wikipedia pages:

Testbed and methods

The Dell R510 has the following hardware and software configuration:

  • 2x Intel Xeon E5620 with HT OFF (4 cores, 4 threads, 12 MB L3 cache) @ 2.4 GHz
  • 8x 4 GB DDR3 RAM (32 GB total RAM)
  • PERC H200 RAID Controller
  • 12x 2 TB 7.2K RPM SATA 3 Gbps disks
  • Red Hat Enterprise Linux 6.0 64 bit

The 12 disks were assigned to two RAID arrays:

  • a first, 2-disk RAID 1 array for OS installation
  • a second, 10-disk RAID 10 array for the benchmark runs

To run the benchmarks, I used the following software packages:

  • bonnie++-1.96-1.el6.rf.x86_64.rpm
  • sysbench-0.4.12-1.el6.x86_64.rpm
  • mysql-server-5.1.52-1.el6_0.1.x86_64.rpm
  • mysql-bench-5.1.52-1.el6_0.1.x86_64.rpm
  • postgresql-server-8.4.7-1.el6_0.1.x86_64.rpm
  • postgresql-test-8.4.7-1.el6_0.1.x86_64.rpm

Please note that the benchmarked filesystems were optimized for the physical array layout (in this case, 5 active data disks and 64 KB stripe size). Remember that, as stated before, the PERC H200 controller does not have any onboard cache, and it disables any disk-level cache it finds on the attached disks. For this reason, write barriers were disabled.

I ran each benchmark at least 3 times and then reported the mean value.

A note on the CPU load number: as this Dell R510 has 8 physical cores that can manage 8 hardware threads (HyperThreading was set to OFF), the maximum CPU load percentage, as reported by the Linux kernel, is 800%. So, if you read something similar to “100% CPU load”, this means that, on average, only one core (of the 8 available) was fully utilized.

UPDATE 05/06/2011: the hardware description was updated to correctly describe the core/thread configuration. I originally wrote that the CPUs were two hexa-core processors, while they really are two quad-core processors.

UPDATE 05/10/2011: a reader asked me to explicitly specify the mkfs and mount parameters. To create the filesystems, I used the following commands:

  • EXT4: mkfs.ext4 /dev/sdb1 -E stride=16,stripe-width=80
  • XFS: mkfs.xfs /dev/sdb1 -d su=64k,sw=5

Both filesystems were mounted with default parameters and the “nobarrier” option.
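The alignment values in the mkfs commands above follow directly from the array layout (64 KB stripe unit, 5 active data disks). The following minimal sketch shows the arithmetic; the 4 KB filesystem block size is an assumption (it is the usual EXT4 default and is consistent with stride=16), and the helper name is purely illustrative.

```python
def ext4_stripe_params(chunk_kib, block_kib, data_disks):
    """Return (stride, stripe_width), both expressed in filesystem blocks.

    stride       = filesystem blocks per RAID chunk on one disk
    stripe_width = filesystem blocks per full stripe across all data disks
    """
    stride = chunk_kib // block_kib
    stripe_width = stride * data_disks
    return stride, stripe_width


# Layout described in the article: 64 KB chunk, 5 active data disks.
# Assumption: 4 KB filesystem blocks.
stride, stripe_width = ext4_stripe_params(chunk_kib=64, block_kib=4, data_disks=5)

print(f"EXT4: -E stride={stride},stripe-width={stripe_width}")
print("XFS:  -d su=64k,sw=5  (stripe unit in bytes, stripe width in data disks)")
```

XFS takes the same two facts in a different form: the stripe unit as a size and the stripe width as a count of data disks, so no conversion to blocks is needed.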

Filesystem creation and checking time

The first test is related to filesystem creation and checking time. The following graph shows the time needed to create and fsck the ~10 TB filesystem used to fill the RAID 10 array. The fsck command was run after the creation of a significant number of small files, obtained by unpacking the linux-2.6.36.4.tar.bz2 file downloaded from kernel.org:

As you can see, XFS was way faster than EXT4 in this large volume creation and checking. However, you should not overestimate these results: remember that you generally create the FS only once, and the fsck operation should be a rare one (after all, both filesystems are journaled for this reason). On the other hand, if you plan to create/check a large filesystem very often, stay away from EXT4 and go with XFS.

Bonnie++ results

Sequential and random read/write speeds are two factors that can greatly influence final application speed. Let’s start examining Bonnie++ sequential speed and CPU usage:

While EXT4 and XFS generally show comparable results both in normal, cached mode and in synchronous mode, XFS leads the sequential output (write) test by a very large margin. To tell the truth, the EXT4 sequential output test results seem unrealistically low.

What about random speed? Bonnie++’s random I/O test returns the number of seeks per second that the disk subsystem can sustain:

The mechanical nature of current hard disks implies results that are some orders of magnitude lower than the sequential ones: considering 512-byte sectors, we are speaking about a maximum I/O transfer rate of ~264 KB/s. Considering 4096-byte sectors, the I/O transfer rate grows to a maximum of ~2114 KB/s. In this test, we see that EXT4 has a slight advantage; however, in synchronous mode the two contenders are tied.
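To make the conversion from seeks per second to transfer rate explicit, here is a small sketch. The seek rate of 528.5 seeks/s is not quoted in the text; it is back-calculated from the ~264 KB/s figure above and should be read as an inferred value, not a measured one.

```python
def seeks_to_kb_per_s(seeks_per_s, sector_bytes):
    """Transfer rate if every seek moves exactly one sector."""
    return seeks_per_s * sector_bytes / 1024


# Inferred (not measured here): 264 KB/s with 512-byte sectors
# corresponds to roughly 528.5 seeks per second.
SEEKS_PER_S = 528.5

for sector_bytes in (512, 4096):
    rate = seeks_to_kb_per_s(SEEKS_PER_S, sector_bytes)
    print(f"{sector_bytes:4d}-byte sectors: ~{rate:.0f} KB/s")
```

Both figures are tiny compared with sequential throughput, which is the point the paragraph makes: on mechanical disks the seek rate, not the bus, bounds random I/O.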

Let’s now see file creation/deletion, aka metadata handling, performance. First, normal mode:

EXT4 really eclipses XFS in this test, scoring some very high results. However, you can argue that the ~2500 new files/sec scored by XFS should be enough for any kind of workload.

Now, synchronous mode:

This time, XFS was the best.

So, from the Bonnie++ tests we noted that, while EXT4 excels in metadata handling, XFS seems to be faster at transferring I/O blocks from the disk subsystem, and its synchronous behavior seems to be more robust than EXT4’s.

One last thing to note is that Bonnie++ sometimes crashed the entire machine when running on top of the EXT4 filesystem. The cause of the crash is under investigation, but it seems related to out-of-memory conditions. While Bonnie++ (in synchronous mode) was the only test that triggered the crash, the fact that it brought down the entire machine is a bad thing. XFS, on the other hand, never had this problem.

Sysbench file benchmark

Filesystem I/O performance is a difficult thing to profile. For this reason, I ran another set of sequential and random I/O transfer benchmarks using the sysbench utility. Sequential speed tests were run with 2 MB blocks, while random speed tests used 4 KB blocks.

Let’s start with sequential speed:

While in normal, cached mode the two filesystems are quite well matched, in the synchronous test we see some divergence: XFS is faster in sequential write, while EXT4 is faster in sequential read.

Please note that EXT4 sequential read is higher in synchronous mode than in the normal one: can this be related to a delayed allocation side effect? Remember that in normal mode, sysbench’s test issues one fsync() per 100 writes, while in synchronous mode it issues one fsync() for each write, effectively disabling the delayed allocator. My two cents are that if the read speed of the just-written files is greater in the latter mode, it may be that the delayed allocation feature can sometimes lower performance.
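The difference between the two modes can be illustrated with a short sketch of a write loop that calls fsync() every N writes. This is not sysbench's code; the function name and parameters are hypothetical and only mimic the "one fsync per 100 writes" versus "one fsync per write" behavior described above.

```python
import os
import tempfile


def write_blocks(path, n_blocks, block_size, fsync_every):
    """Write n_blocks blocks sequentially, calling fsync() every fsync_every writes.

    Returns the number of fsync() calls issued.
    """
    buf = b"\0" * block_size
    fsyncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for i in range(1, n_blocks + 1):
            os.write(fd, buf)
            if i % fsync_every == 0:
                os.fsync(fd)  # force dirty data (and block allocation) to disk now
                fsyncs += 1
    finally:
        os.close(fd)
    return fsyncs


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "testfile")
    normal = write_blocks(target, 200, 4096, fsync_every=100)
    synchronous = write_blocks(target, 200, 4096, fsync_every=1)
    print(f"normal mode: {normal} fsync calls, synchronous mode: {synchronous}")
```

With an fsync() after every write the filesystem has to place each block immediately, so there is little room left for a delayed allocator to batch its decisions.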

Now, random speed:

I’m not sure how to interpret the XFS random read speed, as it seems to be higher than the theoretical maximum speed (considering a 4 ms rotational delay, 4 KB blocks and 5 active data disks we end up with a ~5000 KB/s maximum). Probably, when using XFS, this read benchmark is greatly influenced by OS caching and/or read-ahead settings. Write speed seems fine though, and we see that XFS is faster here, by quite a large margin. However, the absolute results are very low: this is, again, a consequence of the mechanical nature of current hard disks and the lack of any caching by the controller/disk combo.
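The theoretical ceiling quoted above can be reproduced with a few lines. This sketch uses only the three inputs given in the text (4 ms per access, 4 KB blocks, 5 data disks) and deliberately ignores seek time and caching, so it is a rough bound rather than a model of the array.

```python
def max_random_read_kb_per_s(delay_ms, block_kb, data_disks):
    """Upper bound on random read throughput for an array of independent disks."""
    iops_per_disk = 1000 / delay_ms  # one access per rotational delay
    return iops_per_disk * data_disks * block_kb


limit = max_random_read_kb_per_s(delay_ms=4, block_kb=4, data_disks=5)
print(f"theoretical maximum: ~{limit:.0f} KB/s")
```

Any measured value well above this bound is a hint that part of the requests never reached the platters.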

Untar and cat time

It is very common in the Linux world to distribute a very large number of quite small files using a compressed, single-file archive created with the tar and bzip2/gzip utilities. For example, the Linux kernel sources (downloadable from kernel.org) are distributed in this manner.

So, an interesting benchmark would be to record the time needed to untar (extract) the Linux kernel .tar.bz2 file, and then to read-back the just-extracted files:

EXT4 is faster in the extraction process, especially considering the very low final sync time.

When considering cat (read) time, however, XFS is the best.

So, these first results show us that there is no single, best-of-all filesystem. It all depends on the I/O request (read or write) and the workload type (sequential, random, cached, synchronous, etc).

UPDATE 05/04/2011: I added the detailed mysql-bench results graph.

MySQL benchmarks

It’s now time for some database testing.

The first one is about creating and populating a MySQL database with 10 million rows, using the sysbench oltp prepare benchmark. Which is faster, XFS or EXT4?

It seems that XFS wins by a small margin.

What happens when we start to query the db?

In this simple, read-only test we have a tie.

Now, the complex, read-write, transactional test:

We have another virtual tie here.

Last but not least, we have the mysql-bench benchmark scores:

Please note that this benchmark tests various aspects of a MySQL database, and some of them are not directly influenced by I/O speed. So, XFS’s win is quite a remarkable one.

Finally, have a look at the detailed mysql-bench report:

So, summarizing the MySQL results, we can conclude that while XFS is slightly faster than EXT4, you cannot go wrong with either of these two filesystems.

PostgreSQL benchmarks

Another popular, open source database server is PostgreSQL. Which filesystem is the fastest here?

The first test is about creating and populating a PostgreSQL database with 100 thousand rows, using the sysbench oltp prepare test:

We have a great EXT4 victory here, with a prepare time way lower than the XFS one.

Now, let’s start to query the db with the simple, read-only sysbench oltp benchmark:

In this read-only test, XFS is no slower than EXT4.

What happens in the complex, read-write, transactional benchmark?

EXT4 is again much faster than XFS.

From these tests it seems that when dealing with writes, EXT4 is faster than XFS in PostgreSQL’s workload type.

Finally, I ran the pgbench benchmark, with scale and requests per client both set to 1000. First, the prepare time:

This time, XFS shows the same performance as EXT4.

Now, the real benchmark run:

EXT4 is again over 2X faster than XFS.

So, in the end, if you plan to use PostgreSQL, go with EXT4 filesystem (especially if you plan to execute a large number of INSERT / UPDATE / TRANSACTION statements).

Fragmentation

Fragmentation is the #1 enemy of mechanical disks, as every head movement corresponds to lower total I/O performance.

Both EXT4 and XFS have a reputation for being very fragmentation resistant, but which is better? Let’s start by counting fragments per file after the extraction of the Linux kernel .tar.bz2 file (see the untar test above for more information):

Yeah, both filesystems were exceptionally resistant to fragmentation here, showing perfect results.

Sysbench’s sequential and random tests give us another interesting point of reference in this discipline. First, the fragmentation status after the sequential write test:

Now XFS is the leader, with EXT4 lagging quite far behind. It is interesting to note that in the synchronous test (one write / one fsync) EXT4 exhibits lower fragmentation: this can explain the higher sequential read results in synchronous mode recorded earlier. Speaking about XFS, it seems that this filesystem manages large files optimally, and its high sequential read/write speeds are likely a result of the complete lack of fragmentation in this class of files.

The random write test is a harder one:

In this case, both filesystems become heavily fragmented, proving that no filesystem is completely immune to this issue. However, XFS has an edge here: it ships with a functional, proven defragmenter, while the EXT4 package lacks an official, stable-released defrag utility (while this utility exists, it is more or less in a beta stage).

Conclusions

Well, if you arrived here, congratulations: you had the patience to analyze about 20 graphs!

So, in the end, which filesystem should you choose for your server, EXT4 or XFS? As stated above, it all depends on the expected workload type. Below are my recommendations:

  • workstation machine: you cannot go wrong with either of these two filesystems. While EXT4 is better at file creation and deletion (a common job on any machine), XFS rebalances the choice thanks to higher speed with large files and near-perfect fragmentation resistance
  • development machine: if you plan to often create / delete / check any large volume, absolutely go with XFS
  • web server (apache + mysql): although EXT4 is competitive, XFS’s higher MySQL and large file performance gives it the edge here
  • file server: if you plan to store and actively use some large files, go with XFS; in the other case (small files) go with EXT4
  • MySQL database server: I slightly prefer XFS for this kind of workload
  • PostgreSQL server: definitely go with EXT4
  • virtualization (consolidation) server: while virtual machine consolidation is a very complex topic and a definitive answer will require extensive testing, I think that XFS should be the better choice as it has great large files performance and excellent fragmentation behavior (also don’t forget its on-line defrag utility)

UPDATE 05/04/2011: Paul asked me to better explain the different filesystem choice for the two database systems benchmarked (MySQL and PostgreSQL). The point is that, while both MySQL and PostgreSQL are very common open source databases, their implementations (and, to a certain extent, their purposes) are very different. For example, MySQL has optimizations aimed at converting (or delaying) some random I/O operations into sequential ones. With these optimizations, MySQL can coalesce several random I/O operations into a single sequential read/write. PostgreSQL, instead, uses different optimizations and generally tends not to delay random I/O writes. So, it is not surprising that EXT4 and XFS behave quite differently with these two database servers.

Remember that the above benchmarks were collected with write barriers disabled! If you have to enable them to guarantee data integrity, the absolute results can be quite different (but the relative standing should remain more or less similar).

alloc_sem of Ext4 block group

Yesterday Amir Goldstein sent me an email about a deadlock issue. I was on Chinese New Year vacation and did not have time to check the code (I also knew I could not answer his question with ease). Thanks to Ted, who provided a quite clear answer. Ted’s answer was also very informative to me, so I copy and paste the conversation from linux-ext4@vger.kernel.org to my blog. The copyright of the text quoted below belongs to the original authors.

On Sun, Feb 06, 2011 at 10:43:58AM +0200, Amir Goldstein wrote:
> When looking at alloc_sem, I realized that it is only needed to avoid
> race with adjacent group buddy initialization.
Actually, alloc_sem is used to protect all of the block group specific
data structures; the buddy bitmap counters, adjusting the buddy bitmap
itself, the largest free order in a block group, etc.  So even in the
case where block_size == page_size, alloc_sem is still needed!
- Ted

Three Practical System Workloads of Taobao

A few days ago, I gave a talk at an academic seminar at ACT of Beihang University (http://act.buaa.edu.cn/). In my talk, I introduced three typical system workloads that we (a group of system software developers inside Taobao) observed on the most heavily used and deployed product lines. The introduction was quite brief and did not go into detail. We don’t mind sharing what we did imperfectly, and we would like to cooperate with the open source community and industry to improve :-)

If you find anything unclear or misleading, please let me know. Communication makes things better most of the time :-)

Don’t waste your SSD blocks

Recently, one of my colleagues asked me a question. He had formatted an ~80G Ext3 file system on an SSD. After mounting the file system, the df output was:

Filesystem  1K-blocks    Used  Available  Use%  Mounted on
/dev/sdb1    77418272  184216   73301344    1%  /mnt

The fdisk output showed:

Device     Boot  Start    End    Blocks  Id  System
/dev/sdb1         7834  17625  78654240  83  Linux

From his observation, before formatting the SSD there were 78654240 1K blocks available on the partition; after formatting, only 77418272 1K blocks could be used, which means about 1.2 GB of the partition went unused.

A more serious question came from the df output: used blocks + available blocks = 73485560, but the file system had 77418272 blocks, so 3932712 1K blocks had disappeared! This 160G SSD cost him 430 USD, and he complained that around 15 USD was paid for nothing.
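The arithmetic behind these two complaints can be checked directly from the df and fdisk numbers quoted above. This is only a worked calculation; the variable names are illustrative.

```python
# All values are 1K blocks, copied from the fdisk and df output above.
partition_kb = 78654240   # fdisk "Blocks" for /dev/sdb1
fs_total_kb = 77418272    # df "1K-blocks"
used_kb = 184216          # df "Used"
available_kb = 73301344   # df "Available"

format_overhead_kb = partition_kb - fs_total_kb  # lost when formatting
accounted_kb = used_kb + available_kb            # what df lets a user see
hidden_kb = fs_total_kb - accounted_kb           # neither used nor available

print(f"lost at format time : {format_overhead_kb} KB "
      f"(~{format_overhead_kb / 1024**2:.2f} GiB)")
print(f"used + available    : {accounted_kb} KB")
print(f"hidden from df      : {hidden_kb} KB (~{hidden_kb / 1024**2:.2f} GiB)")
```

The two gaps have different causes: the first is file system metadata, the second is space that exists in the file system but is held back from ordinary users.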

IMHO, this is quite an interesting question, and one asked by many people many times. This time, I’d like to spend some time explaining how the blocks are wasted, and how to make better use of every block on the SSD (since it’s quite expensive).

First of all, better storage usage depends on the I/O pattern in practice. This SSD is used to store large files for random I/O; almost all of the I/O (99%+) is reads at random file offsets, and writes can almost be ignored. Therefore, the goal is to use every available block to store a few very big files on the Ext3 file system.

If you format an Ext3 file system with only the default command line, like “mkfs.ext3 /dev/sdb1”, mkfs.ext3 does the following block allocations:

- Allocates reserved blocks for the root user, to prevent unprivileged users from using up all disk space.

- Allocates metadata such as the superblock, backup superblocks, block group descriptors, and the block bitmap, inode bitmap and inode table for each block group.

- Allocates reserved block group blocks for offline file system extension.

- Allocates blocks for journal

Since the SSD is only for data storage, with no operating system installed on it, write performance is disregarded, there is no requirement to extend the file system later, and only a few files are stored on it, some of these block allocations are unnecessary:

- Journal blocks

- Inodes blocks

- Reserved group descriptor blocks for file system resize

- Reserved blocks for root user

Let’s run dumpe2fs to see how many blocks are wasted on the above items. I only list part of the output here:

> dumpe2fs /dev/sdb1

Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          f335ba18-70cc-43f9-bdc8-ed0a8a1a5ad3
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              4923392
Block count:              19663560
Reserved block count:     983178
Free blocks:              19308514
Free inodes:              4923381
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1019
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512

Filesystem created:       Tue Jul  6 21:42:32 2010
Last mount time:          Tue Jul  6 21:44:42 2010
Last write time:          Tue Jul  6 21:44:42 2010
Mount count:              1
Maximum mount count:      39
Last checked:             Tue Jul  6 21:42:32 2010
Check interval:           15552000 (6 months)
Next check after:         Sun Jan  2 21:42:32 2011
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      3ef6ca72-c800-4c44-8c77-532a21bcad5a
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00000001
Journal start:            0

Group 0: (Blocks 0-32767)
Primary superblock at 0, Group descriptors at 1-5
Reserved GDT blocks at 6-1024
Block bitmap at 1025 (+1025), Inode bitmap at 1026 (+1026)
Inode table at 1027-1538 (+1027)
31223 free blocks, 8181 free inodes, 2 directories
Free blocks: 1545-32767
Free inodes: 12-8192

[snip ....]

The file system block size is 4KB, which differs from the 1KB block unit used in the df and fdisk output. The relevant lines of the above output are repeated below. First, the line for reserved blocks:

Reserved block count:     983178

These 983178 4K blocks are reserved for the root user. Since the system and user home directories are not on the SSD, we don’t need to reserve these blocks. According to mkfs.ext3(8), the ‘-m’ parameter sets the reserved-blocks-percentage; use ‘-m 0’ to reserve zero blocks for the privileged user.

From the file system features line, we can see that resize_inode is one of the features enabled by default:
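A quick calculation shows how much the root reserve accounts for. Converting the 983178 reserved 4K blocks to the 1K units used by df gives exactly the gap between df's total (77418272) and its used + available sum (73485560). All inputs below are taken from the dumpe2fs and df output; only the variable names are mine.

```python
BLOCK_SIZE_KB = 4                  # dumpe2fs "Block size: 4096"
reserved_blocks = 983178           # dumpe2fs "Reserved block count"
total_blocks = 19663560            # dumpe2fs "Block count"

reserved_kb = reserved_blocks * BLOCK_SIZE_KB
reserved_pct = 100 * reserved_blocks / total_blocks
df_gap_kb = 77418272 - (184216 + 73301344)  # df total - (used + available)

print(f"reserved for root: {reserved_kb} KB ({reserved_pct:.1f}% of all blocks)")
print(f"gap seen in df   : {df_gap_kb} KB")
```

The 5% ratio is the mkfs default for the reserved-blocks-percentage, which is why it adds up to several gigabytes on a volume of this size.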

Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file

The resize_inode feature reserves quite a lot of blocks for future block group descriptors. These blocks can be found in lines like:

Reserved GDT blocks at 6-1024

When the resize_inode feature is enabled, mkfs.ext3 reserves some blocks after the block group descriptor blocks, called “Reserved GDT blocks”. If the file system is extended in the future (e.g. the file system is created on a logical volume), these reserved blocks can be used for new block group descriptors. Here the storage medium is an SSD and the file system will not be extended, so we don’t have to pay money (on SSD, blocks mean money) for these blocks. To disable the resize_inode feature, use “-O ^resize_inode” in mkfs.ext3(8).

Then look at these 2 lines for inode blocks,

Inodes per group:         8192
Inode blocks per group:   512

We store no more than 5 files on the whole file system, but 512 blocks in each block group are allocated for the inode table. There are 601 block groups, which means 512×601=307712 blocks (≈ 1.2GB of space) are wasted on inode tables. Use ‘-N 16’ in mkfs.ext3(8) to request only 16 inodes in the file system. mkfs.ext3(8) still allocates at least one inode table block in each block group (so more than 16 inodes in total), but now we waste only 1 block per group instead of 512 for the inode table.
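The inode table numbers can be derived from the dumpe2fs fields shown earlier. The sketch below recomputes the group count and the space used by inode tables, then the footprint if each group keeps a single inode table block (the behavior described above, taken here as given rather than verified against mkfs).

```python
import math

BLOCK_SIZE = 4096               # dumpe2fs "Block size"
INODE_SIZE = 256                # dumpe2fs "Inode size"
TOTAL_BLOCKS = 19663560         # dumpe2fs "Block count"
BLOCKS_PER_GROUP = 32768        # dumpe2fs "Blocks per group"
INODE_BLOCKS_PER_GROUP = 512    # dumpe2fs "Inode blocks per group"

n_groups = math.ceil(TOTAL_BLOCKS / BLOCKS_PER_GROUP)
default_itable_blocks = n_groups * INODE_BLOCKS_PER_GROUP
default_itable_gib = default_itable_blocks * BLOCK_SIZE / 1024**3

# With one inode table block per group (as the article describes for -N 16):
minimal_itable_blocks = n_groups
minimal_inodes = n_groups * (BLOCK_SIZE // INODE_SIZE)

print(f"block groups          : {n_groups}")
print(f"default inode tables  : {default_itable_blocks} blocks "
      f"(~{default_itable_gib:.2f} GiB)")
print(f"minimal inode tables  : {minimal_itable_blocks} blocks, "
      f"{minimal_inodes} inodes")
```

Even the minimal layout leaves thousands of inodes, far more than the handful of files this file system will ever hold.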

Journal size:             128M

If most of the I/O is reads and write performance does not matter, and you really care about space usage, the journal can be reduced to its minimum size (1024 file system blocks). For Ext3 with 4KB blocks, that is 4MB: -J size=4M

With the above changes, around 4GB+ of space is back in use. If you really care about the space usage efficiency of your SSD, how about making the file system with:

mkfs.ext3 -J size=4M -m 0 -O ^resize_inode -N 16  <device>

Then you have a chance to get more data blocks into use on your expensive SSD :-)
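As a rough cross-check of the total, the savings from each option can be summed from the dumpe2fs figures. One part of this sketch is an assumption on my side: it counts reserved GDT blocks only in the groups that carry a superblock copy under sparse_super (groups 0, 1 and powers of 3, 5 and 7), since the output above shows only group 0. Treat the result as an estimate.

```python
import math

BLOCK_KB = 4
n_groups = math.ceil(19663560 / 32768)  # 601 block groups


def sparse_super_groups(n):
    """Groups with a superblock copy under sparse_super (assumed rule):
    group 0, group 1, and powers of 3, 5 and 7."""
    groups = {0, 1}
    for base in (3, 5, 7):
        g = base
        while g < n:
            groups.add(g)
            g *= base
    return sorted(g for g in groups if g < n)


saved_blocks = {
    "root reserve (-m 0)": 983178,
    "inode tables (-N 16)": n_groups * (512 - 1),
    "journal (-J size=4M)": 32768 - 1024,
    # Assumption: 1019 reserved GDT blocks in every group with a superblock copy.
    "reserved GDT (^resize_inode)": 1019 * len(sparse_super_groups(n_groups)),
}

total_blocks = sum(saved_blocks.values())
for name, blocks in saved_blocks.items():
    print(f"{name:30s} {blocks:8d} blocks  {blocks * BLOCK_KB / 1024**2:5.2f} GiB")
print(f"{'total':30s} {total_blocks:8d} blocks  "
      f"{total_blocks * BLOCK_KB / 1024**2:5.2f} GiB")
```

Under these assumptions the total lands above the 4GB mark mentioned in the text, with the root reserve and the inode tables making up nearly all of it.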

Random I/O — Is raw device always faster than file system ?

For some implementations of distributed file systems, like TFS [1], developers think storing data directly on a raw device (e.g. /dev/sdb, /dev/sdc…) might be faster than on a file system.

Their choice is reasonable,

1, Random I/O on a large file cannot get any help from the file system page cache.

2, <logical offset, physical offset> mapping introduces more I/O on file systems than on raw disk

3, Managing metadata on other powerful servers avoids the need to use file systems on data nodes.

The penalty for the “higher” performance is management cost. Storing data on a raw device introduces difficulties like:

1, Harder to backup/restore the data.

2, Cannot do more flexible management without special management tools for the raw device.

3, No convenient method to access/manage the data on the raw device.

The above penalties are hard for system administrators to ignore. Furthermore, the story of “higher” performance is not exactly true today:

1, For file systems using block pointers for <logical offset, physical offset> mapping, a large file takes too many pointer blocks. For example, on Ext3 with 4KB blocks, a 2TB file needs around 520K+ pointer blocks. Most of the pointer blocks are cold in random I/O, which results in lower random I/O performance than on a raw device.

2, For file systems using extents for <logical offset, physical offset> mapping, the number of extent blocks depends on how many fragments a large file has. For example, on Ext4, with a max block group size of 128MB, a 2TB file has around 16384 fragments. To map these 16K fragments, 16K extent records are needed, which can be placed in 50+ extent blocks. It is very easy to hit a hot extent in memory for random I/O on a large file.

3, If the <logical offset, physical offset> mapping can be cached in memory as hot, random I/O performance on a file system might not be worse than on a raw device.
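The two mapping-size estimates above can be reproduced with a short calculation. It is a simplified count: for Ext3 it ignores the 12 direct pointers in the inode and simply counts each level of the indirect tree; for Ext4 it assumes one extent per 128MB block group and uses the 12-byte header and 12-byte entry sizes of the on-disk extent format.

```python
import math

BLOCK_SIZE = 4096
FILE_SIZE = 2 * 2**40                       # 2 TiB file
data_blocks = FILE_SIZE // BLOCK_SIZE

# Ext3: 4-byte block pointers, so 1024 pointers per 4KB indirect block.
ptrs_per_block = BLOCK_SIZE // 4
leaf_indirect = math.ceil(data_blocks / ptrs_per_block)
mid_indirect = math.ceil(leaf_indirect / ptrs_per_block)
ext3_pointer_blocks = leaf_indirect + mid_indirect + 1   # +1 top-level block

# Ext4: one extent per 128MB block group (worst case described in the text).
extents = FILE_SIZE // (128 * 2**20)
entries_per_block = (BLOCK_SIZE - 12) // 12   # 12-byte header, 12-byte entries
ext4_extent_blocks = math.ceil(extents / entries_per_block)

print(f"Ext3 pointer blocks: {ext3_pointer_blocks}")
print(f"Ext4 extents       : {extents}")
print(f"Ext4 extent blocks : {ext4_extent_blocks}")
```

By this count the Ext3 mapping needs about 2GB of pointer blocks while the Ext4 mapping fits in roughly 200KB, which is why the latter stays hot in memory so easily.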

In order to verify my guess, I did some performance testing.  I share part of the data here.

Processor: AMD Opteron 6174 (2.2 GHz) x 2

Memory: DDR3 1333MHz 4GB x 4

Hard disk: 5400RPM SATA 2TB x 3 [2]

File size: almost 2TB (created by dd)

Random I/O access: 100K times read

IO size: 512 bytes

File systems: Ext3, Ext4 (with and without directio)

test tool: seekrw [3]

* With page cache

- Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r

- Performance result

      Device    tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
Ext3  sdc     95.88      767.07           0     46024         0
Ext4  sdd     60.72      485.6            0     29136         0

- Wall clock time

Ext3: real time: 34 minutes 23 seconds 557537 usec

Ext4: real time: 24 minutes 44 seconds 10118 usec

* directio (without pagecache)

- Command

seekrw -f /mnt/ext3/img -a 100000 -l 512 -r -d

seekrw -f /mnt/ext4/img -a 100000 -l 512 -r -d

- Performance result

      Device    tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
Ext3  sdc     94.93      415.77           0     12473         0
Ext4  sdd     67.9        67.9            0      2037         0
Raw   sdf     67.27      538.13           0     16144         0

- Wall clock time

Ext3: real time: 33 minutes 26 seconds 947875 usec

Ext4: real time: 24 minutes 25 seconds 545536 usec

sdf: real time: 24 minutes 38 seconds 523379 usec    (raw device)

From the above performance numbers, Ext4 is about 39% faster than Ext3 on random I/O with page cache and about 37% faster with direct I/O; this is expected.

The results of random I/O on Ext4 and on the raw device are almost the same. This is also as expected. For file systems that map <logical offset, physical offset> by extents, it’s quite easy to keep most of the mapping records hot in memory. Random I/O on a raw device has *NO* obvious performance advantage over Ext4.
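The comparison can be recomputed from the wall clock times listed above. The sketch converts each time to seconds and derives the relative speedups; the helper names are illustrative.

```python
def to_seconds(minutes, seconds, usec):
    """Convert a 'M minutes S seconds U usec' wall clock time to seconds."""
    return minutes * 60 + seconds + usec / 1e6


def speedup_pct(slow, fast):
    """How much faster (in percent) the 'fast' run finished the same work."""
    return (slow / fast - 1) * 100


cached = {
    "ext3": to_seconds(34, 23, 557537),
    "ext4": to_seconds(24, 44, 10118),
}
direct = {
    "ext3": to_seconds(33, 26, 947875),
    "ext4": to_seconds(24, 25, 545536),
    "raw": to_seconds(24, 38, 523379),
}

print(f"page cache: Ext4 vs Ext3 {speedup_pct(cached['ext3'], cached['ext4']):.1f}%")
print(f"direct I/O: Ext4 vs Ext3 {speedup_pct(direct['ext3'], direct['ext4']):.1f}%")
print(f"direct I/O: Ext4 vs raw  {speedup_pct(direct['raw'], direct['ext4']):.1f}%")
```

The Ext4 and raw device runs differ by less than 1%, which is well within what one would expect from run-to-run noise on mechanical disks.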

Dear developers, how about considering extent based file systems now :-)

[1] TFS, TaobaoFS. A distributed file system deployed for http://www.taobao.com . It is developed by the core system team of Taobao and will be open sourced very soon.

[2] The hard disks are connected to the system through a RocketRAID 644 card via eSATA connectors.

[3] seekrw source code can be downloaded from http://www.mlxos.org/misc/seekrw.c

Tengine, a customized Nginx, goes to open source

We’re glad to announce that Tengine, our home-baked Nginx at Taobao, is now an open source project.

Taobao is the largest e-commerce website in Asia and ranked #12 on Alexa’s top global sites list. Our website serves billions of pageviews per day. For a website as busy as ours, Nginx is obviously the best choice. Thanks to Nginx’s high performance, small footprint and flexibility, we have done more with less.

We first learned the Nginx internals by using it as a traditional web server and developing dozens of modules. Then from June of this year we started hacking the Nginx core to expand its capabilities. As some of the features we have developed may also benefit other Nginx users and websites, why not open source them? We do not want to be just open source software users, but also open source contributors. That’s why the Tengine open source project came about.

Tengine is based on the latest stable version of Nginx (Nginx-1.0.10). Here are a few Tengine features and bug fixes you may be interested in:

  • Logging enhancement. It supports syslog (local and remote) and pipe logging. You can also do log sampling, i.e. not all requests have to be written.
  • Protects the server when the system load and memory use go high.
  • Combines multiple CSS or JavaScript requests into one request to reduce the downloading time.
  • Sets the worker process number and CPU affinities automatically. Setting Nginx’s worker_cpu_affinity is not a pain any more.
  • Enhanced limit_req module with whitelist support and more limit_req directives in one location.
  • More operations-engineer-friendly server information, so a host can be located easily when an error happens.
  • More command line options. You can list all compiled-in modules and the supported directives, and even the content of the configuration file itself.
  • Set expiration for files according to specific content type.
  • Error pages can be set back to ‘default’.

Basically, Tengine can be considered a better Nginx, or a superset of it. You can download the tarball here:
http://tengine.taobao.org/download/tengine-1.2.0.tar.gz

We want to say thank you to the Nginx team, especially to Igor. Thank you very much for your great work! We would love to donate the patches against the Nginx-1.1 branch later if you think the patches are okay.

Frankly, I’m not sure whether the features in Tengine right now will impress you or not. It’s our first step towards open source, after all. We have built a team to work on Tengine and have quite a long to-do list. I promise you more enhancements are on the way.


Nginx Internals Talk in Guangzhou, China

[Image: nginx mind map]

I’m going to give a free talk on nginx’s internals next month (September 19), in Guangzhou, China.

I’ve been reading the source code of nginx for a few days. Digging into this charming code is really a pleasant experience, though at first glance it appeared a little bit difficult to understand. Nginx is becoming more and more popular, but unfortunately there is not enough documentation on its architecture and implementation. Now that I have spent a considerable amount of time reading the source code and have gained some knowledge, why not share it with those who want to know things under the hood?

So, if you are interested in this talk and you can be in Guangzhou that day, feel free to join in. Please comment on this post or drop me an email to let me know which parts you are interested in (see the mind map above, draft version though).

There might be a thousand Hamlets in a thousand people’s eyes. Note that I’m not Igor, and the only way I can try to understand the nuts and bolts is by reverse engineering them, hence I can’t guarantee that my talk will be free of mistakes or misunderstandings. And frankly, it is not a trivial topic after all, not only because of the size of nginx’s code base, but also because of its elaborate design.

The speech will be in Chinese while slides will be in English. Specifics of time and location are coming soon. Stay tuned.

Update:
Time: 14:30-17:30, September 19, 2009
Location: Netease Building Tower E, Guangzhou Information Port #16 Keyun RD. Tianhe District, Guangzhou
Registration: http://blog.laiyonghao.com/2009/09/programming-tech-party/370


A Handy Strace Option


August 20, 2009 at 3:39 pm · Filed under Programming

I didn’t notice strace’s ‘-ff’ option until I came across it today. With it turned on, strace not only follows fork(2)s but also writes each process’s trace to its own file, tracefile.pid, where pid is the process ID of that process. Typical usage might look like this:

# strace -o tracelog.txt -ff -T command

This option can be quite handy when debugging programs that spawn child processes.
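Once a run like the one above has finished, the per-process files still have to be gathered up. The snippet below is only a rough sketch of one way to do that in Python: the helper name collect_traces is made up for illustration, and it assumes the files sit in one directory and are named tracelog.txt.<pid>, as in the command above.

```python
from pathlib import Path


def collect_traces(directory, prefix="tracelog.txt"):
    """Map each PID to the lines of its per-process strace output file.

    Assumes files are named <prefix>.<pid>; anything else is skipped.
    """
    traces = {}
    for path in Path(directory).glob(prefix + ".*"):
        suffix = path.name[len(prefix) + 1:]
        if suffix.isdigit():  # ignore files such as tracelog.txt.bak
            traces[int(suffix)] = path.read_text().splitlines()
    return traces
```

Calling collect_traces(".") after the strace run would give a dictionary keyed by process ID, which makes it easy to see at a glance which child made how many system calls.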


Why Python?

Cardinal Biggles had Eric in the comfy chair for over four hours before wringing this confession from him…

My first look at Python was an accident, and I didn’t much like what I saw at the time. It was early 1997, and Mark Lutz’s book Programming Python from O’Reilly & Associates had recently come out. O’Reilly books occasionally land on my doorstep, selected from among the new releases by some mysterious benefactor inside the organization using a random process I’ve given up trying to understand.

One of them was Programming Python. I found this somewhat interesting, as I collect computer languages. I know over two dozen general-purpose languages, write compilers and interpreters for fun, and have designed any number of special-purpose languages and markup formalisms myself. My most recently completed project, as I write this, is a special-purpose language called SNG for manipulating PNG (Portable Network Graphics) images. Interested readers can surf to the SNG home page at http://www.catb.org/~esr/sng/. I have also written implementations of several odd general-purpose languages on my Retrocomputing Museum page, http://www.catb.org/retro/.

I had already heard just enough about Python to know that it is what is nowadays called a “scripting language”, an interpretive language with its own built-in memory management and good facilities for calling and cooperating with other programs. So I dived into Programming Python with one question uppermost in my mind: what has this got that Perl does not?

Perl, of course, is the 800-pound gorilla of modern scripting languages. It has largely replaced shell as the scripting language of choice for system administrators, thanks partly to its comprehensive set of UNIX library and system calls, and partly to the huge collection of Perl modules built by a very active Perl community. The language is commonly estimated to be the CGI language behind about 85% of the “live” content on the Net. Larry Wall, its creator, is rightly considered one of the most important leaders in the Open Source community, and often ranks third behind Linus Torvalds and Richard Stallman in the current pantheon of hacker demigods.

At that time, I had used Perl for a number of small projects. I’d found it quite powerful, even if the syntax and some other aspects of the language seemed rather ad hoc and prone to bite one if not used with care. It seemed to me that Python would have quite a hill to climb as yet another scripting language, so as I read, I looked first for what seemed to set it apart from Perl.

I immediately tripped over the first odd feature of Python that everyone notices: the fact that whitespace (indentation) is actually significant in the language syntax. The language has no analog of the C and Perl brace syntax; instead, changes in indentation delimit statement groups. And, like most hackers on first realizing this fact, I recoiled in reflexive disgust.

I am just barely old enough to have programmed in batch FORTRAN for a few months back in the 1970s. Most hackers aren’t these days, but somehow our culture seems to have retained a pretty accurate folk memory of how nasty those old-style fixed-field languages were. Indeed, the term “free format”, used back then to describe the newer style of token-oriented syntax in Pascal and C, has almost been forgotten; all languages have been designed that way for decades now. Or almost all, anyway. It’s hard to blame anyone, on seeing this Python feature, for initially reacting as though they had unexpectedly stepped in a steaming pile of dinosaur dung.

That’s certainly how I felt. I skimmed through the rest of the language description without much interest. I didn’t see much else to recommend Python, except maybe that the syntax seemed rather cleaner than Perl’s and the facilities for doing basic GUI elements like buttons and menus looked fairly good.

I put the book back on the shelf, making a mental note that I should code some kind of small GUI-centered project in Python sometime, just to make sure I really understood the language. But I didn’t believe what I’d seen would ever compete effectively with Perl.

A lot of other things conspired to keep that note way down on my priority list for many months. The rest of 1997 was eventful for me; it was, among other things, the year I wrote and published the original version of “The Cathedral and the Bazaar”. But I did find time to write several Perl programs, including two of significant size and complexity. One of them, keeper, is the assistant still used to file incoming submissions at the Metalab software archive. It generates the web pages you see at metalab.unc.edu/pub/Linux/!INDEX.html. The other, anthologize, was used to automatically generate the PostScript for the sixth edition of Linux from the Linux Documentation Project’s archive of HOWTOs. Both programs are available at Metalab.

Writing these programs left me progressively less satisfied with Perl. Larger project size seemed to magnify some of Perl’s annoyances into serious, continuing problems. The syntax that had seemed merely eccentric at a hundred lines began to seem like a nigh-impenetrable hedge of thorns at a thousand. “More than one way to do it” lent flavor and expressiveness at a small scale, but made it significantly harder to maintain consistent style across a wider code base. And many of the features that were later patched into Perl to address the complexity-control needs of bigger programs (objects, lexical scoping, “use strict”, etc.) had a fragile, jerry-rigged feel about them.

These problems combined to make large volumes of Perl code seem unreasonably difficult to read and grasp as a whole after only a few days’ absence. Also, I found I was spending more and more time wrestling with artifacts of the language rather than my application problems. And, most damning of all, the resulting code was ugly—this matters. Ugly programs are like ugly suspension bridges: they’re much more liable to collapse than pretty ones, because the way humans (especially engineer-humans) perceive beauty is intimately related to our ability to process and understand complexity. A language that makes it hard to write elegant code makes it hard to write good code.

With a baseline of two dozen languages under my belt, I could detect all the telltale signs of a language design that had been pushed to the edge of its functional envelope. By mid-1997, I was thinking “there has to be a better way” and began casting about for a more elegant scripting language.

One course I did not consider was going back to C as a default language. The days when it made sense to do your own memory management in a new program are long over, outside of a few specialty areas like kernel hacking, scientific computing and 3-D graphics—places where you absolutely must get maximum speed and tight control of memory usage, because you need to push the hardware as hard as possible.

For most other situations, accepting the debugging overhead of buffer overruns, pointer-aliasing problems, malloc/free memory leaks and all the other associated ills is just crazy on today’s machines. Far better to trade a few cycles and a few kilobytes of memory for the overhead of a scripting language’s memory manager and economize on far more valuable human time. Indeed, the advantages of this strategy are precisely what has driven the explosive growth of Perl since the mid-1990s.

I flirted with Tcl, only to discover quickly that it scales up even more poorly than Perl. Old LISPer that I am, I also looked at various current dialects of Lisp and Scheme—but, as is historically usual for Lisp, lots of clever design was rendered almost useless by scanty or nonexistent documentation, incomplete access to POSIX/UNIX facilities, and a small but nevertheless deeply fragmented user community. Perl’s popularity is not an accident; most of its competitors are either worse than Perl for large projects or somehow nowhere near as useful as their theoretically superior designs ought to make them.

My second look at Python was almost as accidental as my first. In October 1997, a series of questions on the fetchmail-friends mailing list made it clear that end users were having increasing trouble generating configuration files for my fetchmail utility. The file uses a simple, classically UNIX free-format syntax, but can become forbiddingly complicated when a user has POP3 and IMAP accounts at multiple sites. As an example, see Listing 1 for a somewhat simplified version of mine.

Listing 1

I decided to attack the problem by writing an end-user-friendly configuration editor, fetchmailconf. The design objective of fetchmailconf was clear: to completely hide the control file syntax behind a fashionable, ergonomically correct GUI interface replete with selection buttons, slider bars and fill-out forms.

The thought of implementing this in Perl did not thrill me. I had seen GUI code in Perl, and it was a spiky mixture of Perl and Tcl that looked even uglier than my own pure-Perl code. It was at this point I remembered the bit I had set more than six months earlier. This could be an opportunity to get some hands-on experience with Python.

Of course, this brought me face to face once again with Python’s pons asinorum, the significance of whitespace. This time, however, I charged ahead and roughed out some code for a handful of sample GUI elements. Oddly enough, Python’s use of whitespace stopped feeling unnatural after about twenty minutes. I just indented code, pretty much as I would have done in a C program anyway, and it worked.

That was my first surprise. My second came a couple of hours into the project, when I noticed (allowing for pauses needed to look up new features in Programming Python) I was generating working code nearly as fast as I could type. When I realized this, I was quite startled. An important measure of effort in coding is the frequency with which you write something that doesn’t actually match your mental representation of the problem, and have to backtrack on realizing that what you just typed won’t actually tell the language to do what you’re thinking. An important measure of good language design is how rapidly the percentage of missteps of this kind falls as you gain experience with the language.

When you’re writing working code nearly as fast as you can type and your misstep rate is near zero, it generally means you’ve achieved mastery of the language. But that didn’t make sense, because it was still day one and I was regularly pausing to look up new language and library features!

This was my first clue that, in Python, I was actually dealing with an exceptionally good design. Most languages have so much friction and awkwardness built into their design that you learn most of their feature set long before your misstep rate drops anywhere near zero. Python was the first general-purpose language I’d ever used that reversed this process.

Not that it took me very long to learn the feature set. I wrote a working, usable fetchmailconf, with GUI, in six working days, of which perhaps the equivalent of two days were spent learning Python itself. This reflects another useful property of the language: it is compact–you can hold its entire feature set (and at least a concept index of its libraries) in your head. C is a famously compact language. Perl is notoriously not; one of the things the notion “There’s more than one way to do it!” costs Perl is the possibility of compactness.

But my most dramatic moment of discovery lay ahead. My design had a problem: I could easily generate configuration files from the user’s GUI actions, but editing them was a much harder problem. Or, rather, reading them into an editable form was a problem.

The parser for fetchmail’s configuration file syntax is rather elaborate. It’s actually written in YACC and Lex, two classic UNIX tools for generating language-parsing code in C. In order for fetchmailconf to be able to edit existing configuration files, I thought it would have to replicate that elaborate parser in Python. I was very reluctant to do this, partly because of the amount of work involved and partly because I wasn’t sure how to ascertain that two parsers in two different languages accept the same syntax. The last thing I needed was the extra labor of keeping the two parsers in synchronization as the configuration language evolved!

This problem stumped me for a while. Then I had an inspiration: I’d let fetchmailconf use fetchmail’s own parser! I added a --configdump option to fetchmail that would parse .fetchmailrc and dump the result to standard output in the format of a Python initializer. For the file above, the result would look roughly like Listing 2 (to save space, some data not relevant to the example is omitted).

Listing 2

Python could then evaluate the fetchmail --configdump output and have the configuration available as the value of the variable “fetchmail”.
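Since the listings are not reproduced on this page, here is a small sketch of what evaluating such a dump could look like in present-day Python. It is not the article's own code: the SAMPLE_DUMP text, its field names and the load_config helper are all invented for illustration. Only the idea (the dump is Python source that assigns a variable called fetchmail) comes from the text above.

```python
# Hypothetical stand-in for the output of `fetchmail --configdump`.
# The real dump has more fields; these names are illustrative only.
SAMPLE_DUMP = """
fetchmail = {
    'poll_interval': 300,
    'servers': [
        {'pollname': 'imap.example.org', 'protocol': 'IMAP',
         'users': [{'remote': 'alice', 'localnames': ['alice']}]},
    ],
}
"""


def load_config(dump_text):
    """Execute the dump as Python source and return the 'fetchmail' value.

    exec() runs arbitrary code, so this is only reasonable for output
    produced by a program you trust.
    """
    namespace = {}
    # Builtins are emptied so the dump can only build literals.
    exec(dump_text, {"__builtins__": {}}, namespace)
    return namespace["fetchmail"]


config = load_config(SAMPLE_DUMP)
first_server = config["servers"][0]["pollname"]
```

The point of the trick is that no second parser is needed: the C program does all the parsing, and Python merely executes an assignment.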

This wasn’t quite the last step in the dance. What I really wanted wasn’t just for fetchmailconf to have the existing configuration, but to turn it into a linked tree of live objects. There would be three kinds of objects in this tree: Configuration (the top-level object representing the entire configuration), Site (representing one of the sites to be polled) and User (representing user data attached to a site). The example file describes five site objects, each with one user object attached to it.

I had already designed and written the three object classes (that’s what took four days, most of it spent getting the layout of the widgets just right). Each had a method that caused it to pop up a GUI edit panel to modify its instance data. My last remaining problem was somehow to transform the dead data in this Python initializer into live objects.

I considered writing code that would explicitly know about the structure of all three classes and use that knowledge to grovel through the initializer creating matching objects, but rejected that idea because new class members were likely to be added over time as the configuration language grew new features. If I wrote the object-creation code in the obvious way, it would be fragile and tend to fall out of sync when either the class definitions or the initializer structure changed.

What I really wanted was code that would analyze the shape and members of the initializer, query the class definitions themselves about their members, and then adjust itself to impedance-match the two sets.
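To make that idea concrete, here is a rough sketch in present-day Python. It is not the Listing 3 code referred to in the article, which is not reproduced on this page; the Site and User classes are cut down to a few invented attributes, and the copy_instance and build_site helpers are written purely for illustration.

```python
class User:
    def __init__(self):
        self.remote = None
        self.localnames = []


class Site:
    def __init__(self):
        self.pollname = None
        self.protocol = None
        self.users = []


def copy_instance(target, initializer, ignore=()):
    """Copy a dict into an object's attributes after checking they match.

    The object is asked for its own attribute names (vars), so adding a
    member to the class needs no change here; a mismatch is reported.
    """
    attrs = set(vars(target)) - set(ignore)
    keys = set(initializer) - set(ignore)
    if attrs != keys:
        raise ValueError(
            f"class-only: {sorted(attrs - keys)}, "
            f"initializer-only: {sorted(keys - attrs)}"
        )
    for name in keys:
        setattr(target, name, initializer[name])
    return target


def build_site(site_dict):
    """Turn one 'dead' site dictionary into a Site with live User objects."""
    site = copy_instance(Site(), site_dict, ignore=("users",))
    site.users = [copy_instance(User(), u) for u in site_dict["users"]]
    return site
```

Because the function compares the two sets of names instead of hard-coding them, a forgotten or misspelled member shows up as an error message rather than as silently missing data.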

This kind of thing is called metaclass hacking and is generally considered fearsomely esoteric, deep black magic. Most object-oriented languages don’t support it at all; in those that do (Perl being one), it tends to be a complicated and fragile undertaking. I had been impressed by Python’s low coefficient of friction so far, but here was a real test. How hard would I have to wrestle with the language to get it to do this? I knew from previous experience that the bout was likely to be painful, even assuming I won, but I dived into the book and read up on Python’s metaclass facilities. The resulting function is shown in Listing 3, and the code that calls it is in Listing 4.

Listing 3

Listing 4

That doesn’t look too bad for deep black magic, does it? Thirty-two lines, counting comments. Just from knowing what I’ve said about the class structure, the calling code is even readable. But the size of this code isn’t the real shocker. Brace yourself: this code only took me about ninety minutes to write—and it worked correctly the first time I ran it.

To say I was astonished would have been positively wallowing in understatement. It’s remarkable enough when implementations of simple techniques work exactly as expected the first time; but my first metaclass hack in a new language, six days from a cold standing start? Even if we stipulate that I am a fairly talented hacker, this is an amazing testament to Python’s clarity and elegance of design.

There was simply no way I could have pulled off a coup like this in Perl, even with my vastly greater experience level in that language. It was at this point I realized I was probably leaving Perl behind.

This was my most dramatic Python moment. But, when all is said and done, it was just a clever hack. The long-term usefulness of a language comes not in its ability to support clever hacks, but from how well and how unobtrusively it supports the day-to-day work of programming. The day-to-day work of programming consists not of writing new programs, but mostly reading and modifying existing ones.

So the real punchline of the story is this: weeks and months after writing fetchmailconf, I could still read the fetchmailconf code and grok what it was doing without serious mental effort. And the true reason I no longer write Perl for anything but tiny projects is that was never true when I was writing large masses of Perl code. I fear the prospect of ever having to modify keeper or anthologize again—but fetchmailconf gives me no qualms at all.

Perl still has its uses. For tiny projects (100 lines or fewer) that involve a lot of text pattern matching, I am still more likely to tinker up a Perl-regexp-based solution than to reach for Python. For good recent examples of such things, see the timeseries and growthplot scripts in the fetchmail distribution. Actually, these are much like the things Perl did in its original role as a sort of combination awk/sed/grep/sh, before it had functions and direct access to the operating system API. For anything larger or more complex, I have come to prefer the subtle virtues of Python, and I think you will, too.

Resources

All listings referred to in this article are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue73/3882.tgz.

Eric Raymond is a Linux advocate and the author of The Cathedral & The Bazaar . He can be reached via e-mail at (esr@thyrsus.com). 


Linux: Neighbour Table Overflow Error and Solution

I set up a CentOS Linux based server running as a gateway and firewall. However, I’m getting the following messages in the /var/log/messages log file:

Dec 20 00:41:01 fw01 kernel: Neighbour table overflow.
Dec 20 00:41:01 fw01 last message repeated 20 times

OR


Dec 20 00:41:01 fw03 kernel: [ 8987.821184] Neighbour table overflow.
Dec 20 00:41:01 fw03 kernel: [ 8987.860465] printk: 100 messages suppressed.

Why does the kernel throw “Neighbour table overflow” messages in syslog? How do I fix this problem under Debian / CentOS / RHEL / Fedora / Ubuntu Linux?

On busy networks (or on a Linux gateway / firewall server) you need to increase the size of the kernel’s internal ARP cache. The following kernel variables are used:

net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh3

To see current values, type:
# sysctl net.ipv4.neigh.default.gc_thresh1
Sample outputs:

net.ipv4.neigh.default.gc_thresh1 = 128

Type the following command:
# sysctl net.ipv4.neigh.default.gc_thresh2
Sample outputs:

net.ipv4.neigh.default.gc_thresh2 = 512

Type the following command:
# sysctl net.ipv4.neigh.default.gc_thresh3
Sample outputs:

net.ipv4.neigh.default.gc_thresh3 = 1024

So you need to make the ARP table bigger than the above defaults. The default limits are fine for a small network or a single server, but not for a busy gateway. An overflowing table will also affect your DNS traffic.
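As a rough illustration of how the three thresholds relate to the number of entries in the table, here is a small Python sketch. The function names parse_sysctl and overflow_risk are made up for this example, and the classification is a simplification of what the kernel really does: gc_thresh2 is treated as the soft limit and gc_thresh3 as the hard limit.

```python
PREFIX = "net.ipv4.neigh.default.gc_thresh"


def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by sysctl, into a dict of ints."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = int(value.strip())
    return values


def overflow_risk(neighbour_count, thresholds):
    """Classify a neighbour-table size against the soft and hard limits."""
    if neighbour_count >= thresholds[PREFIX + "3"]:
        return "hard limit reached"
    if neighbour_count >= thresholds[PREFIX + "2"]:
        return "above soft limit"
    return "ok"


defaults = parse_sysctl("""
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
""")
status = overflow_risk(600, defaults)
```

The neighbour count itself could come from, for example, counting the lines printed by `ip -4 neigh show` on the gateway.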

How Do I Fix “Neighbour Table Overflow” Error?

Edit the /etc/sysctl.conf file, enter:
# vi /etc/sysctl.conf
Append the following values (taken from a server that protects over 200 desktops running MS-Windows, Linux, and Apple OS X):

## works best with <= 500 client computers ##
# Interval (in seconds) between garbage collection runs
net.ipv4.neigh.default.gc_interval = 3600

# Interval (in seconds) between checks for stale ARP cache entries
net.ipv4.neigh.default.gc_stale_time = 3600

# ARP cache size thresholds
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh1 = 1024

To load the new settings, type the following command:
# sysctl -p
