Large-file Sequential Write Comparison Using FFSB

Test bed:

Kernel: 2.6.26-rc2 with ext4-patch-queue-c23f8dd92fe816954b102d8a892276f17440159a applied (as of May 20, 2008)
Partition size: 5 TB (IBM DS4100 Storage System with 400GB SATA 7200rpm disks)
IO scheduler: CFQ

Tests were run on a two-processor Xeon machine with 2 GB of RAM and hyper-threading enabled (4 logical CPUs). An excerpt from /proc/cpuinfo:

processor      : 4
vendor_id      : GenuineIntel
cpu family     : 15
model          : 4
model name     : Intel(R) Xeon(TM) CPU 2.80GHz
cpu MHz        : 2793.078
cache size     : 1024 KB
bogomips       : 5586.59

#hdparm -t /dev/md0

/dev/md0:
 Timing buffered disk reads:  332 MB in  3.00 seconds = 110.51 MB/sec

#cat /proc/mdstat
Personalities : [linear]
md0 : active linear sdg[2] sdf[1] sde[0]
      5462560320 blocks 64k rounding
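
For reference, a linear array like this can be created with mdadm roughly as follows (a sketch; the exact command used is not recorded, and the device names are taken from the mdstat output above):

#mdadm --create /dev/md0 --level=linear --rounding=64 --raid-devices=3 /dev/sde /dev/sdf /dev/sdg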

Tool versions used:
e2fsprogs 1.41-WIP from the "next" branch of the e2fsprogs git tree (as of May 20, 2008)
xfsprogs-2.9.8

Synopsis:

FFSB (Flexible Filesystem Benchmark) version 5.1, available at
http://sourceforge.net/projects/ffsb/

ffsb profile used:
num_filesystems=1
num_threadgroups=1
directio=0
time=600

[filesystem0]
        location=/mnt/test/
        num_files=0
        num_dirs=0
        max_filesize=1073741824
        min_filesize=1073741824
[end0]

[threadgroup0]
        num_threads=128
        write_size=65536
        write_blocksize=65536
        create_weight=1
[end0]

This profile runs 128 threads which create 128 1 GB files, writing in 64 KB chunks. We collect the throughput and CPU utilization values reported by the ffsb output.
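
ffsb takes the profile file as its single argument, so a run amounts to the following (the profile filename here is an assumption):

#ffsb ffsb_create.prof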

The partition is reformatted with mkfs and mounted with the appropriate options before each ffsb run; a filesystem check is done after each run.
The average number of extents per file is obtained with the filefrag command.

The script used is available here.
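
In outline, each iteration of the script does something like the sketch below (the device, mount point, profile filename, and the extent-averaging one-liner are assumptions; the mkfs and mount options vary per row of the tables that follow):

# one iteration (sketch): format, mount, run ffsb, then measure
# e2fsck time, average extents per file, and rm -rf time
DEV=/dev/md0
MNT=/mnt/test
mkfs.ext4 -I 256 -O uninit_groups $DEV       # mkfs options vary per row
mount -t ext4 -o data=writeback $DEV $MNT    # fs type may still be ext4dev on 2.6.26
ffsb ffsb_create.prof
umount $MNT
time e2fsck -f -y $DEV
mount -t ext4 $DEV $MNT
find $MNT -type f -exec filefrag {} + \
        | awk '{ s += $2; n++ } END { print s/n }'   # average extents per file
time rm -rf $MNT/*
umount $MNT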

Results


FS   | mkfs options                          | mount options                        | Throughput | CPU usage | Avg. extents | e2fsck time
     |                                       |                                      | (MB/s)     | (%)       | per file     | (seconds)
-----+---------------------------------------+--------------------------------------+------------+-----------+--------------+------------
ext3 | -I 256                                | data=writeback                       | 75.7       | 52.8      | -            | 5336
xfs  | defaults                              | defaults                             | 43.2       | 43.3      | -            | 17
ext4 | -I 256                                | data=writeback                       | 76.9       | 29.6      | 128          | 2577
ext4 | -I 256                                | data=writeback,nodelalloc            | 76.9       | 54.3      | 134          | 2584
ext4 | -I 256 -O uninit_groups               | data=writeback                       | 70.9 (1)   | 31.3      | 130          | 248
ext4 | -I 256 -O uninit_groups               | data=writeback,nomballoc             | 78.9       | 32.1      | 320          | 249
ext4 | -I 256 -O uninit_groups               | data=writeback,nomballoc,nodelalloc  | 78.4       | 57.7      | 272          | 249
ext4 | -I 256 -O uninit_groups               | data=writeback,journal_async_commit  | 76.5       | 32.4      | 130          | -
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback                       | 75.0       | 29.5      | 115          | 243
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,nomballoc             | 78.8       | 32.5      | 316          | 243
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,nomballoc,nodelalloc  | 78.3       | 60.8      | 260          | 243
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,journal_checksum      | 75.3       | 29.3      | 110          | 243
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,journal_async_commit  | 75.1       | 30.1      | 130          | -

("-" = no value reported)
(1) In its first pass, the multiblock allocator skips uninitialized block groups when trying to allocate blocks. On a large filesystem, scanning a great number of groups before finding an initialized one hurts the performance of this workload. I ran the same tests with the patch below, which allows block allocation in uninitialized groups:

Index: linux-2.6.26-rc2/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext4/mballoc.c     2008-05-20 11:28:22.000000000 +0200
+++ linux-2.6.26-rc2/fs/ext4/mballoc.c  2008-05-22 12:47:34.000000000 +0200
@@ -1643,10 +1643,6 @@ static int ext4_mb_good_group(struct ext
        switch (cr) {
        case 0:
                BUG_ON(ac->ac_2order == 0);
-               /* If this group is uninitialized, skip it initially */
-               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
-               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
-                       return 0;

                bits = ac->ac_sb->s_blocksize_bits + 1;
                for (i = ac->ac_2order; i <= bits; i++)

Results with the patch above applied:

FS   | mkfs options                          | mount options                        | Throughput | CPU usage | Avg. extents | e2fsck time | rm -rf time
     |                                       |                                      | (MB/s)     | (%)       | per file     | (seconds)   | (seconds)
-----+---------------------------------------+--------------------------------------+------------+-----------+--------------+-------------+------------
ext3 | -I 256                                | data=ordered                         | 76.1       | 61.9      | 272          | -           | 266.8
ext3 | -I 256                                | data=writeback                       | 75.1       | 52.2      | 270          | 5336        | 281.8
xfs  | defaults                              | defaults                             | 41         | 47.8      | 750          | 17          | 34.6
ext4 | -I 256                                | data=writeback                       | 76.9       | 30.9      | 130          | 2583        | 13.8
ext4 | -I 256                                | data=writeback,nodelalloc            | 76.7       | 54.0      | 136          | 2621        | 13.9
ext4 | -I 256 -O uninit_groups               | data=ordered,nodelalloc              | 78.6       | 66.2      | 136          | -           | 14.2
ext4 | -I 256 -O uninit_groups               | data=ordered,nodelalloc,nomballoc    | 78.4       | 68.3      | 261          | -           | 12.1
ext4 | -I 256 -O uninit_groups               | data=writeback                       | 78.9       | 29.8      | 129          | 248         | 14.2
ext4 | -I 256 -O uninit_groups               | data=writeback,nomballoc             | 78.9       | 31.1      | 347          | 249         | 12.1
ext4 | -I 256 -O uninit_groups               | data=writeback,nomballoc,nodelalloc  | 78.5       | 56.7      | 275          | 249         | 12.4
ext4 | -I 256 -O uninit_groups               | data=writeback,journal_async_commit  | 79.0       | 30.5      | 130          | -           | 14.0
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback                       | 78.9       | 31.3      | 129          | 242         | 10.6
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=ordered,nodelalloc              | 78.8       | 64.6      | 135          | -           | 10.5
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=ordered,nodelalloc,nomballoc    | 78.7       | 68.6      | 260          | -           | 9.6
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,nomballoc             | 78.8       | 32.8      | 328          | 243         | 9.9
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,nomballoc,nodelalloc  | 78.4       | 61.0      | 259          | 243         | 9.6
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,journal_checksum      | 79.0       | 31.8      | 128          | 242         | 10.3
ext4 | -I 256 -O uninit_groups,flex_bg -G 64 | data=writeback,journal_async_commit  | 78.9       | 31.4      | 129          | -           | 10.5

("-" = no value reported)