Ext2/Ext3 improvement project

The goal of this project is to improve the ext2/ext3 filesystems.
One of these improvements is a 64bit ext3 filesystem.
Test Plan
  • Test plan
Patches from Alex Tomas
  • ext3-extents-2.6.11-0.4.patch
  • ext3-mballoc2-2.6.11-0.6.patch
  • ext3-delayed-allocation-2.6.11-0.5.patch
Benchmarks
  • IOzone
  • IOzone (updated with the latest extents patches)
  • tiobench (from Ram)
  • kernel build
  • sqlbench
  • sysbench (oltp)
  • Low memory tests
extents
  • EXT3 extents with little-endian patch (1)
  • EXT3 extents with 64 bits physical blocks patch (2) (to apply on top of previous patch (1))
  • EXT3 extents with 64 bits logical blocks patch (3) (to apply on top of previous patch (2))
  • Benchmarking of ext3 extents on 64 bits
64bit

Ext3/64bit

Ready for the future

According to Kryder's Law, hard drive capacity doubles every 13 months.

With perpendicular recording, a density of 230 gigabits per square inch is achievable, which allows a 1 terabyte 3.5-inch hard drive in 2007.
The size limit of an ext2/ext3 filesystem is 2 TB with 1 kB blocks and 8 TB with 4 kB blocks. If we want to be able to use ext2/ext3 on PCs beyond 2010, we should implement a 64bit addressing mode in ext2/ext3.
What will be true for PCs in 2010 is already true for servers in 2006: we can already reach 8 TB on our systems using storage arrays, RAID and multiple 500 GB disks.
After a short study, I propose to make ext3 ready for the future by implementing a 64bit addressing mode now.
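
As a quick check of the limits quoted above, here is a minimal sketch. It assumes the limits come from block numbers having 31 usable bits; the text above does not state this, it is simply the value that reproduces the 2 TB and 8 TB figures. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest filesystem size, in bytes, when a block number has
 * `block_bits` usable bits and each block is `block_size` bytes.
 * Assumption: block_bits = 31 matches the ext2/ext3 limits quoted above.
 */
static uint64_t max_fs_bytes(unsigned block_bits, uint64_t block_size)
{
    assert(block_bits < 64);
    return ((uint64_t)1 << block_bits) * block_size;
}
```

With 31 bits this gives 2 TiB for 1 KiB blocks and 8 TiB for 4 KiB blocks, so every extra addressing bit doubles the limit.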

Big block groups

As explained in Design and Implementation of the Second Extended Filesystem, ext2/ext3 is made up of block groups.

According to function ext3_get_inode_block() in fs/ext3/inode.c, the block group of a given inode is given by:
        block_group = (ino - 1) / EXT3_INODES_PER_GROUP(sb)
Now imagine we have a 64bit filesystem whose number of block groups is greater than or equal to 2^32. As the inode number is a 32bit value, this means EXT3_INODES_PER_GROUP is less than or equal to 1. Of course, this case only happens with a filesystem larger than (with 4 KiB blocks):
= number of groups (2^32) * size of a group (128 MiB)
= 2^32 * 2^27 = 2^59 bytes
= 512 PiB.

(a 64bit filesystem should be able to manage 2^64 * 2^12 = 2^76 bytes = 64 ZiB)
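
The arithmetic above can be sketched in a few lines of C. The helper names are hypothetical and sizes are handled as powers of two to avoid overflowing 64 bits; only the inode-to-group formula is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Block group of an inode, same formula as quoted above (inodes start at 1). */
static uint32_t inode_to_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino >= 1 && inodes_per_group >= 1);
    return (ino - 1) / inodes_per_group;
}

/* log2 of the filesystem size in bytes: group count times group size. */
static unsigned fs_size_log2(unsigned group_count_log2, unsigned group_size_log2)
{
    return group_count_log2 + group_size_log2;
}

/* Convert a power-of-two byte count into PiB (1 PiB = 2^50 bytes). */
static uint64_t size_in_pib(unsigned size_log2)
{
    assert(size_log2 >= 50 && size_log2 - 50 < 64);
    return (uint64_t)1 << (size_log2 - 50);
}
```

For 2^32 groups of 128 MiB (2^27 bytes), fs_size_log2(32, 27) is 59 and size_in_pib(59) is 512, matching the figure above.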

Because the Linux VFS also stores the inode number on 32 bits, we should consider increasing the size of block groups.

In ext3, the size limit of a group is set first by the number of bits that fit in the single block used for the group bitmap, and then by the maximum value of a 16bit field. So, to increase the size of a group, we should be able to use several blocks to store the bitmap and to use 32bit (or 24bit) fields.
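
To make the two limits concrete, here is a small sketch. The function names and the bitmap_blocks parameter are hypothetical, and the text does not say which 16bit field is the limiting one, so the check below only tests whether a block count fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* One bitmap bit per block: a bitmap of `bitmap_blocks` blocks covers this many blocks. */
static uint64_t max_blocks_per_group(uint64_t block_size, unsigned bitmap_blocks)
{
    assert(block_size > 0 && bitmap_blocks > 0);
    return 8 * block_size * bitmap_blocks;
}

/* Size in bytes of the largest group such a bitmap can describe. */
static uint64_t max_group_bytes(uint64_t block_size, unsigned bitmap_blocks)
{
    return max_blocks_per_group(block_size, bitmap_blocks) * block_size;
}

/* Can a per-group block count still be stored in a 16bit field? */
static int fits_in_u16(uint64_t count)
{
    return count <= UINT16_MAX;
}
```

With 4 KiB blocks and a single bitmap block, a group holds 32768 blocks (128 MiB) and the count fits in 16 bits. With four bitmap blocks the group grows to 131072 blocks, which no longer fits, hence the need for wider fields.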

64bit in JBD

This work has been done by Zach Brown for OCFS2.

64bit in files

Based on extents from Alex Tomas and the work of Pierre Peiffer.

The on-disk format of Alex's extents reserves 48 bits for addressing, but the kernel only manages 32 bits.
This is why Pierre worked on this part, so that the kernel really uses more than 32 bits to address blocks in extents. Pierre uses 64bit fields because our goal is a 64bit filesystem, but this point is open to discussion: we could also manage both formats (48 and 64 bit) by using a flag in the filesystem.
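
As an illustration of 48bit addressing, here is a minimal sketch that splits a physical block number into a 32bit low part and a 16bit high part. The structure and field names are hypothetical and are not taken from the patches; on-disk byte order is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory view of an extent's 48bit start block. */
struct demo_extent {
    uint32_t start_lo;  /* low 32 bits of the physical block  */
    uint16_t start_hi;  /* high 16 bits of the physical block */
};

/* Store a physical block number; it must fit in 48 bits. */
static void demo_extent_store(struct demo_extent *ex, uint64_t pblock)
{
    assert(pblock < ((uint64_t)1 << 48));
    ex->start_lo = (uint32_t)pblock;
    ex->start_hi = (uint16_t)(pblock >> 32);
}

/* Rebuild the full physical block number from the two halves. */
static uint64_t demo_extent_pblock(const struct demo_extent *ex)
{
    return ((uint64_t)ex->start_hi << 32) | ex->start_lo;
}
```

Code that only reads the low 32 bits would silently truncate any block number at or above 2^32, which is the situation described above for the kernel side.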

64bit in metadata

As a lot of work has already been done by Zach, Alex and Pierre, the modifications to on-disk structures are very "light":
in the ext2_super_block structure, we add:
+       /* 64bit support valid if EXT3_FEATURE_RO_COMPAT_64BIT */
+       __u32   s_blocks_count_hi;      /* Blocks count */
+       __u32   s_r_blocks_count_hi;    /* Reserved blocks count */
+       __u32   s_free_blocks_count_hi; /* Free blocks count */
and we can read it with the following macro:
#define EXT2_BLOCKS_COUNT(s)                                          \
        ((((s)->s_feature_ro_compat & EXT2_FEATURE_RO_COMPAT_64BIT) ? \
          ((__u64)(s)->s_blocks_count_hi << 32) : 0) |                \
         (__u64)(s)->s_blocks_count)
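
The same logic can be written as a small function, which may be easier to follow than the macro. This is only a sketch: the structure below keeps just the three fields involved, and the flag value is a placeholder, not the real feature bit.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder value, for illustration only (not the real feature bit). */
#define DEMO_FEATURE_RO_COMPAT_64BIT 0x0010u

/* Reduced, hypothetical view of the superblock fields used here. */
struct demo_super_block {
    uint32_t s_blocks_count;       /* low 32 bits of the block count  */
    uint32_t s_feature_ro_compat;  /* read-only compatible features   */
    uint32_t s_blocks_count_hi;    /* high 32 bits, valid if flag set */
};

/* Full 64bit block count; the high half is ignored without the flag. */
static uint64_t demo_blocks_count(const struct demo_super_block *s)
{
    uint64_t hi = 0;

    assert(s != 0);
    if (s->s_feature_ro_compat & DEMO_FEATURE_RO_COMPAT_64BIT)
        hi = (uint64_t)s->s_blocks_count_hi << 32;
    return hi | s->s_blocks_count;
}
```

Ignoring the high half when the flag is clear means a superblock written by an older mkfs, where those bytes may hold anything, still reads back its 32bit count.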
As the extents patch from Pierre already manages 64bit addressing for file blocks, we only have to manage 64bit block addresses for the block bitmap, inode bitmap and inode table.

We don't have to store a 64bit value in the group descriptor: as these blocks are physically in the group, we can apply the following algorithm:
  • if the base address of the group is < 2^32, the bitmap block address is bg_block_bitmap (so we keep compatibility with existing ext3)
  • if the base address of the group is ≥ 2^32, the bitmap block address is bg_block_bitmap + group base address (we use relative addressing)

The base address of a group is given by:
#define EXT2_GROUP_BASE(s,g)   ((blk_t)(s)->s_first_data_block + \
                                (blk_t)(s)->s_blocks_per_group * (g))
where g is the group number and s the super block.
So, for instance, the address of the block bitmap for a given group is given by:
#define EXT2_RELATIVE(group_base, block)                  \
       ((blk_t)(group_base) & 0xFFFFFFFF00000000 ?        \
               (blk_t)(block) + (blk_t)(group_base) - 1 : \
               (blk_t)(block))
#define EXT2_BLOCK_BITMAP(bg, group_base)      \
               EXT2_RELATIVE((group_base),(bg)->bg_block_bitmap)
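
The two macros above can be exercised with a function version. This is a sketch with hypothetical names, using a 64bit block type; the "- 1" is copied from EXT2_RELATIVE as written, and the text does not explain it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_blk_t;  /* assumption: block numbers are 64bit */

/* First block of group g, as in EXT2_GROUP_BASE. */
static demo_blk_t demo_group_base(uint32_t first_data_block,
                                  uint32_t blocks_per_group, uint64_t g)
{
    return (demo_blk_t)first_data_block + (demo_blk_t)blocks_per_group * g;
}

/*
 * Resolve a 32bit descriptor field into a block address, as in
 * EXT2_RELATIVE: absolute below 2^32, relative to the group at or above it.
 */
static demo_blk_t demo_relative(demo_blk_t group_base, uint32_t block)
{
    if (group_base & 0xFFFFFFFF00000000ULL)
        return (demo_blk_t)block + group_base - 1;
    return (demo_blk_t)block;
}
```

For example, with 32768 blocks per group, group 10 starts at block 327680 and its descriptor fields are used unchanged, while group 131072 starts exactly at 2^32 and its fields are interpreted relative to that base.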
These modifications don't increase the size used by the filesystem to store files.

For instance, I copied the Linux source tree 295 times onto a 70 GB disk.
The ext3 filesystem was created with "mkfs -t ext3" and the ext3/64bit filesystem with "mkfs -t ext3 -g 638232 -N 319488".

Filesystem   Total block count   Free blocks before copy   Free blocks after copy
ext3         17870296            17581367                  59614
ext3/64bit   17870296            17557081                  276896

64bit benchmarks
  • sysbench (oltp)
  • iozone
  • kernbuild
64bit patches
  • e2fsprogs upstream patches (updated 2006/05/24)
  • kernel 2.6.16 patches (updated 2006/04/07)
64bit binaries
  • ia64 binaries (updated 2006/03/09)
  • x86_64 binaries (updated 2006/03/08)
Mailing lists
  • ext2-devel
Links
  • Wiki

Page maintained by: Laurent Vivier.