Ext2/Ext3 improvement project
Ext3/64bit

Ready for the future

With perpendicular recording, we can expect a density of 230 gigabits per square inch and a 1 terabyte 3.5-inch hard drive in 2007. The limit of an ext2/ext3 filesystem is 2 TB with 1 kB blocks and 8 TB with 4 kB blocks. If we want to be able to use ext2/ext3 on PCs beyond 2010, we should implement a 64bit addressing mode in ext2/ext3.

What will be true for PCs in 2010 is already true in 2006 for servers: we can already reach 8 TB on our systems using storage arrays, RAID and multiple 500 GB disks.

After a little study, I propose to make ext3 ready for the future by implementing a 64bit addressing mode now.

Big block groups

As explained in Design and Implementation of the Second Extended Filesystem, ext2/ext3 is made up of block groups.

According to the function ext3_get_inode_block() in fs/ext3/inode.c, the block group of a given inode is given by:
block_group = (ino - 1) / EXT3_INODES_PER_GROUP(sb)
Now, imagine we have a 64bit filesystem whose number of block groups is
greater than or equal to 2^32. As the inode number is a 32bit value, at most
2^32 inode numbers are shared between at least 2^32 groups, which means
EXT3_INODES_PER_GROUP is less than or equal to 1.
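To make this constraint concrete, here is a minimal standalone sketch of the mapping above. The helper names (inode_to_group, max_inodes_per_group) are hypothetical and are not kernel code; only the formula and the 32bit inode number come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Block group of an inode, following the formula above.
 * inodes_per_group stands in for EXT3_INODES_PER_GROUP(sb). */
static inline uint32_t inode_to_group(uint32_t ino, uint32_t inodes_per_group)
{
    assert(ino != 0);              /* inode numbers start at 1 */
    assert(inodes_per_group != 0);
    return (ino - 1) / inodes_per_group;
}

/* Upper bound on inodes per group when 2^32 inode numbers are shared
 * between group_count groups (hypothetical helper, for illustration). */
static inline uint64_t max_inodes_per_group(uint64_t group_count)
{
    assert(group_count != 0);
    return (UINT64_C(1) << 32) / group_count;
}
```

With 8192 inodes per group, inode_to_group(8193, 8192) returns 1. With 2^32 groups, max_inodes_per_group(UINT64_C(1) << 32) returns 1, and with 2^33 groups it returns 0, so some groups could not hold any inode at all.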
Of course, with 4 KiB blocks, this case only happens with a filesystem larger than:

number of groups (2^32) * size of a group (128 MiB) = 2^32 * 2^27 = 2^59 bytes = 512 PiB

where 128 MiB is the size of a group whose bitmap fills one 4 KiB block (8 * 4096 = 32768 blocks of 4 KiB). A 64bit filesystem should be able to manage 2^64 * 2^12 = 2^76 bytes = 64 ZiB.

Because the inode number is also stored on 32 bits in the Linux VFS, we should consider increasing the size of block groups. In ext3, the size of a group is limited first by the number of bits that fit in the single block used for the group bitmap, and then by the maximum value of a 16bit field. So, to increase the size of a group, we should be able to use several blocks to store the bitmap and to use 32bit (or 24bit) fields.

64bit in JBD

This work has been done by Zach Brown for OCFS2.

64bit in files

Based on extents from Alex Tomas and the work of Pierre Peiffer.

The on-disk implementation of Alex's extents reserves 48 bits for addressing, but the kernel only manages 32 bits. This is why Pierre reworked this part so that the kernel really uses more than 32 bits to address blocks in the extents. Pierre uses 64bit fields, because our goal is a 64bit filesystem, but this point can be discussed: we could also manage both formats (48 and 64 bit) by using a flag in the filesystem.

64bit in metadata

As a lot of work has already been done by Zach, Alex and Pierre, the modifications to the on-disk structures are very "light".

In the ext2_super_block structure, we add:
+ /* 64bit support valid if EXT3_FEATURE_RO_COMPAT_64BIT */
+ __u32 s_blocks_count_hi; /* Blocks count */
+ __u32 s_r_blocks_count_hi; /* Reserved blocks count */
+ __u32 s_free_blocks_count_hi; /* Free blocks count */
and we can read it with the following macro:
#define EXT2_BLOCKS_COUNT(s) \
(((s)->s_feature_ro_compat & EXT2_FEATURE_RO_COMPAT_64BIT ? \
((__u64)(s)->s_blocks_count_hi << 32) : 0) | \
(__u64)(s)->s_blocks_count)
As the extents patch from Pierre already manages 64bit addressing for file blocks, we only have to manage 64bit block addresses for the block bitmap, inode bitmap and inode table.
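The split block count read by the EXT2_BLOCKS_COUNT macro can be tried outside the kernel with a small sketch. Everything prefixed with demo_ or DEMO_ is hypothetical: the structure is a cut-down stand-in for ext2_super_block, and the flag value is invented for illustration because this page does not give the bit assigned to the 64bit feature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, for illustration only. */
#define DEMO_FEATURE_RO_COMPAT_64BIT 0x0008u

/* Cut-down stand-in for ext2_super_block with only the fields used here. */
struct demo_sb_counts {
    uint32_t s_blocks_count;      /* low 32 bits of the block count */
    uint32_t s_feature_ro_compat; /* read-only compatible feature flags */
    uint32_t s_blocks_count_hi;   /* high 32 bits, valid only with the flag */
};

/* Same logic as EXT2_BLOCKS_COUNT: the high half is ignored unless the
 * 64bit feature flag is set, so old filesystems read exactly as before. */
static inline uint64_t demo_blocks_count(const struct demo_sb_counts *sb)
{
    uint64_t hi = 0;

    assert(sb != NULL);
    if (sb->s_feature_ro_compat & DEMO_FEATURE_RO_COMPAT_64BIT)
        hi = (uint64_t)sb->s_blocks_count_hi << 32;
    return hi | sb->s_blocks_count;
}
```

With s_blocks_count = 5 and s_blocks_count_hi = 1, the function returns 5 while the flag is clear and 0x100000005 once it is set. Gating the high half on the feature flag means that stale data in a previously unused superblock field cannot corrupt the count on a filesystem created without 64bit support.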
We don't have to store a 64bit value in the group descriptor: as these blocks are physically located inside the group, we can apply the following algorithm.
The base address of a group is given by:
#define EXT2_GROUP_BASE(s,g) ((blk_t)(s)->s_first_data_block + \
(blk_t)(s)->s_blocks_per_group * (g))
where g is the group number and s the super block.
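The casts to blk_t in this macro matter: both operands are widened before the multiplication, so the product cannot wrap at 2^32. The sketch below illustrates this under the assumption that blk_t is a 64bit unsigned type (this page does not show its definition). The demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t demo_blk_t; /* assumption: blk_t is 64 bits wide */

/* Cut-down stand-in for the superblock fields used by EXT2_GROUP_BASE. */
struct demo_sb_geom {
    uint32_t s_first_data_block;
    uint32_t s_blocks_per_group;
};

/* Same arithmetic as EXT2_GROUP_BASE: widen first, then multiply. */
static inline demo_blk_t demo_group_base(const struct demo_sb_geom *sb,
                                         uint32_t g)
{
    assert(sb != NULL);
    return (demo_blk_t)sb->s_first_data_block +
           (demo_blk_t)sb->s_blocks_per_group * g;
}

/* Counter-example: the same expression evaluated in 32 bits wraps
 * modulo 2^32 as soon as the group lies beyond block 2^32 - 1. */
static inline uint32_t demo_group_base_32(const struct demo_sb_geom *sb,
                                          uint32_t g)
{
    assert(sb != NULL);
    return sb->s_first_data_block + sb->s_blocks_per_group * g;
}
```

For a filesystem with 32768 blocks per group and s_first_data_block = 0, group 131072 starts at block 2^32: demo_group_base() returns 4294967296, while the 32bit variant returns 0.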
So, for instance, the address of the block bitmap for a given group is given by:
#define EXT2_RELATIVE(group_base, block) \
((blk_t)(group_base) & 0xFFFFFFFF00000000 ? \
(blk_t)(block) + (blk_t)(group_base) - 1 : \
(blk_t)(block))
#define EXT2_BLOCK_BITMAP(bg, group_base) \
EXT2_RELATIVE((group_base),(bg)->bg_block_bitmap)
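As I read EXT2_RELATIVE, a 32bit descriptor field is treated as an absolute block number as long as the group base fits in 32 bits, and as an offset from the group base otherwise. The following sketch restates that logic as a function, again assuming a 64bit blk_t. The demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t demo_blk_t; /* assumption: blk_t is 64 bits wide */

static_assert(sizeof(demo_blk_t) == 8, "demo_blk_t must be 64 bits wide");

/* Same decision as EXT2_RELATIVE:
 * - group base below 2^32: the stored value is already an absolute block;
 * - group base at or above 2^32: the stored value is relative to the
 *   group base, with stored value 1 mapping to the base itself. */
static inline demo_blk_t demo_relative(demo_blk_t group_base, uint32_t block)
{
    if (group_base & UINT64_C(0xFFFFFFFF00000000))
        return (demo_blk_t)block + group_base - 1;
    return block;
}
```

For a group based at block 32768, a stored value of 32770 resolves to block 32770. For a group based at block 2^32, a stored value of 3 resolves to block 2^32 + 2. Groups located below 2^32 keep the existing on-disk meaning of the field, so only groups beyond that boundary use the relative encoding.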
These modifications don't increase the space used by the filesystem to store files. For instance, I copied the Linux source tree 295 times onto a 70 GB disk: the ext3 filesystem was created with "mkfs -t ext3" and the ext3/64bit one with "mkfs -t ext3 -g 638232 -N 319488".
Page maintained by: Laurent Vivier.