* [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
From: Badari Pulavarty @ 2006-02-20 21:21 UTC (permalink / raw)
To: christoph, mcao, akpm; +Cc: lkml, linux-fsdevel, pbadari

Hi,

The following patches add support for mapping multiple blocks in
->get_block(). This will allow us to handle the mapping of multiple
disk blocks for mpage_readpages(), mpage_writepages(), etc. Instead of
adding a new argument, I use "b_size" to indicate the amount of disk
mapping needed for get_block(). And also, on success, get_block()
actually indicates the amount of disk mapping it did.

Now that get_block() can handle multiple blocks, there is no need for
->get_blocks(), which was added for DIO.

[PATCH 1/3] pass b_size to ->get_block()
[PATCH 2/3] map multiple blocks for mpage_readpages()
[PATCH 3/3] remove ->get_blocks() support

I noticed decent improvements (reduced sys time) on JFS, XFS and ext3
(on simple "dd" read tests):

        (rc3.mm1)    (rc3.mm1 + patches)
real    0m18.814s    0m18.482s
user    0m0.000s     0m0.004s
sys     0m3.240s     0m2.912s

Andrew, could you include it in the -mm tree?

Comments?

Thanks,
Badari

^ permalink raw reply [flat|nested] 32+ messages in thread
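[Editorial note: the in/out convention the cover letter describes — the caller
fills b_size with the number of bytes it wants mapped, and the filesystem
rewrites b_size with the number of bytes it actually mapped — can be sketched
in userspace. The struct, the function name toy_get_block(), and the 16-block
extent are all hypothetical illustrations, not kernel code.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the fields the patches use; not the kernel struct. */
struct toy_bh {
	unsigned long long b_blocknr;  /* out: first mapped disk block */
	size_t b_size;                 /* in: bytes wanted; out: bytes mapped */
	int mapped;                    /* out: buffer_mapped() stand-in */
};

/* Hypothetical filesystem whose only extent maps file blocks [0, 16)
 * to contiguous disk blocks starting at 1000. */
static int toy_get_block(unsigned long long iblock, struct toy_bh *bh,
			 unsigned blkbits)
{
	unsigned long max_blocks = bh->b_size >> blkbits; /* caller's request */
	unsigned long extent_left = iblock < 16 ? 16 - iblock : 0;
	unsigned long got = max_blocks < extent_left ? max_blocks : extent_left;

	if (got == 0) {
		bh->mapped = 0;        /* a hole */
		return 0;
	}
	bh->b_blocknr = 1000 + iblock;
	bh->b_size = got << blkbits;   /* report what was actually mapped */
	bh->mapped = 1;
	return 0;
}
```

A filesystem that ignores b_size and maps a single block keeps working: the
caller primed b_size with one block's worth of bytes, which is also what a
one-block mapping reports back.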
* [PATCH 1/3] pass b_size to ->get_block()
From: Badari Pulavarty @ 2006-02-20 21:23 UTC
To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 211 bytes --]

[PATCH 1/3] pass b_size to ->get_block()

Pass the amount of disk that needs to be mapped to get_block(). This
way one can modify the fs ->get_block() functions to map multiple
blocks at the same time.

Thanks,
Badari

[-- Attachment #2: getblock-take-size.patch --]
[-- Type: text/x-patch, Size: 3339 bytes --]

Pass the amount of disk that needs to be mapped to get_block(). This
way one can modify the fs ->get_block() functions to map multiple
blocks at the same time.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/buffer.c                 |    6 ++++++
 fs/mpage.c                  |    2 ++
 include/linux/buffer_head.h |    1 +
 3 files changed, 9 insertions(+)

Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-20 09:37:58.000000000 -0800
@@ -1785,6 +1785,7 @@ static int __block_write_full_page(struc
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 		} else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
+			bh->b_size = 1 << inode->i_blkbits;
 			err = get_block(inode, block, bh, 1);
 			if (err)
 				goto recover;
@@ -1938,6 +1939,7 @@ static int __block_prepare_write(struct
 		if (buffer_new(bh))
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {
+			bh->b_size = 1 << inode->i_blkbits;
 			err = get_block(inode, block, bh, 1);
 			if (err)
 				break;
@@ -2093,6 +2095,7 @@ int block_read_full_page(struct page *pa
 			fully_mapped = 0;
 			if (iblock < lblock) {
+				bh->b_size = 1 << inode->i_blkbits;
 				err = get_block(inode, iblock, bh, 0);
 				if (err)
 					SetPageError(page);
@@ -2414,6 +2417,7 @@ int nobh_prepare_write(struct page *page
 			create = 1;
 		if (block_start >= to)
 			create = 0;
+		map_bh.b_size = 1 << blkbits;
 		ret = get_block(inode, block_in_file + block_in_page,
 					&map_bh, create);
 		if (ret)
@@ -2674,6 +2678,7 @@ int block_truncate_page(struct address_s
 	err = 0;
 	if (!buffer_mapped(bh)) {
+		bh->b_size = 1 << inode->i_blkbits;
 		err = get_block(inode, iblock, bh, 0);
 		if (err)
 			goto unlock;
@@ -2760,6 +2765,7 @@ sector_t generic_block_bmap(struct addre
 	struct inode *inode = mapping->host;
 	tmp.b_state = 0;
 	tmp.b_blocknr = 0;
+	tmp.b_size = 1 << inode->i_blkbits;
 	get_block(inode, block, &tmp, 0);
 	return tmp.b_blocknr;
 }
Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-20 09:39:20.000000000 -0800
@@ -277,6 +277,7 @@ map_bh(struct buffer_head *bh, struct su
 	set_buffer_mapped(bh);
 	bh->b_bdev = sb->s_bdev;
 	bh->b_blocknr = block;
+	bh->b_size = sb->s_blocksize;
 }

 /*
Index: linux-2.6.16-rc4/fs/mpage.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/mpage.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/mpage.c	2006-02-20 09:45:37.000000000 -0800
@@ -192,6 +192,7 @@ do_mpage_readpage(struct bio *bio, struc
 			page_block++, block_in_file++) {
 		bh.b_state = 0;
 		if (block_in_file < last_block) {
+			bh.b_size = blocksize;
 			if (get_block(inode, block_in_file, &bh, 0))
 				goto confused;
 		}
@@ -472,6 +473,7 @@ __mpage_writepage(struct bio *bio, struc
 	for (page_block = 0; page_block < blocks_per_page; ) {
 		map_bh.b_state = 0;
+		map_bh.b_size = 1 << blkbits;
 		if (get_block(inode, block_in_file, &map_bh, 1))
 			goto confused;
 		if (buffer_new(&map_bh))
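[Editorial note: every hunk in the patch above encodes a block count as a byte
count before the call, and filesystems decode it back afterwards. That
conversion is the whole interface; a minimal sketch of it (helper names are
hypothetical, not from the patch):]

```c
#include <assert.h>
#include <stddef.h>

/* Callers encode a block count as bytes in b_size before get_block()... */
static size_t blocks_to_bsize(unsigned long nblocks, unsigned blkbits)
{
	return (size_t)nblocks << blkbits;
}

/* ...and filesystems decode the request (and later the result) back. */
static unsigned long bsize_to_blocks(size_t b_size, unsigned blkbits)
{
	return (unsigned long)(b_size >> blkbits);
}
```

A legacy caller that wants exactly one block therefore sets
b_size = blocks_to_bsize(1, inode->i_blkbits), which is the
"bh->b_size = 1 << inode->i_blkbits" line repeated in each hunk.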
* [PATCH 2/3] map multiple blocks for mpage_readpages()
From: Badari Pulavarty @ 2006-02-20 21:23 UTC
To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

[PATCH 2/3] map multiple blocks for mpage_readpages()

This patch changes mpage_readpages() and get_block() to get the disk
mapping information for multiple blocks at the same time.

b_size represents the amount of disk mapping that needs to be mapped.
On a successful get_block(), b_size indicates the amount of disk
mapping that was actually mapped. Only filesystems that care to use
this information and can provide multiple disk blocks at a time need
do so.

No changes are needed for filesystems that want to ignore this.

Thanks,
Badari

[-- Attachment #2: getblocks-readpages.patch --]
[-- Type: text/x-patch, Size: 7570 bytes --]

This patch changes mpage_readpages() and get_block() to get the disk
mapping information for multiple blocks at the same time.

b_size represents the amount of disk mapping that needs to be mapped.
On a successful get_block(), b_size indicates the amount of disk
mapping that was actually mapped. Only filesystems that care to use
this information and can provide multiple disk blocks at a time need
do so.

No changes are needed for filesystems that want to ignore this.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> fs/jfs/inode.c | 3 - fs/mpage.c | 98 +++++++++++++++++++++++++++++++++++--------- fs/xfs/linux-2.6/xfs_aops.c | 4 - 3 files changed, 83 insertions(+), 22 deletions(-) Index: linux-2.6.16-rc4/fs/mpage.c =================================================================== --- linux-2.6.16-rc4.orig/fs/mpage.c 2006-02-20 09:45:37.000000000 -0800 +++ linux-2.6.16-rc4/fs/mpage.c 2006-02-20 09:51:24.000000000 -0800 @@ -165,7 +165,9 @@ map_buffer_to_page(struct page *page, st static struct bio * do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages, - sector_t *last_block_in_bio, get_block_t get_block) + sector_t *last_block_in_bio, struct buffer_head *map_bh, + unsigned long *first_logical_block, int *map_valid, + get_block_t get_block) { struct inode *inode = page->mapping->host; const unsigned blkbits = inode->i_blkbits; @@ -173,34 +175,72 @@ do_mpage_readpage(struct bio *bio, struc const unsigned blocksize = 1 << blkbits; sector_t block_in_file; sector_t last_block; + sector_t last_block_in_file; sector_t blocks[MAX_BUF_PER_PAGE]; unsigned page_block; unsigned first_hole = blocks_per_page; struct block_device *bdev = NULL; - struct buffer_head bh; int length; int fully_mapped = 1; + unsigned nblocks, i; if (page_has_buffers(page)) goto confused; block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); - last_block = (i_size_read(inode) + blocksize - 1) >> blkbits; + last_block = block_in_file + (nr_pages * blocks_per_page); + last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits; + if (last_block > last_block_in_file) + last_block = last_block_in_file; + page_block = 0; + + /* + * Map blocks using the result from the last get_blocks call first. 
+ */ + nblocks = map_bh->b_size >> blkbits; + if (*map_valid && + block_in_file > *first_logical_block && + block_in_file < (*first_logical_block + nblocks)) { + unsigned map_offset = block_in_file - *first_logical_block; + unsigned last = nblocks - map_offset; + + for (i = 0; ; i++) { + if (i == last) { + *map_valid = 0; + break; + } + if (page_block == blocks_per_page) + break; + blocks[page_block] = map_bh->b_blocknr + map_offset + i; + page_block++; + block_in_file++; + } + bdev = map_bh->b_bdev; + } + + /* + * Then do more get_blocks calls until we are done with this page. + */ + map_bh->b_page = page; + while (page_block < blocks_per_page) { + map_bh->b_state = 0; + map_bh->b_size = 0; - bh.b_page = page; - for (page_block = 0; page_block < blocks_per_page; - page_block++, block_in_file++) { - bh.b_state = 0; if (block_in_file < last_block) { - bh.b_size = blocksize; - if (get_block(inode, block_in_file, &bh, 0)) + map_bh->b_size = (last_block - block_in_file) << blkbits; + if (get_block(inode, block_in_file, map_bh, 0)) goto confused; + *first_logical_block = block_in_file; + *map_valid = 1; } - if (!buffer_mapped(&bh)) { + if (!buffer_mapped(map_bh)) { fully_mapped = 0; if (first_hole == blocks_per_page) first_hole = page_block; + page_block++; + block_in_file++; + *map_valid = 0; continue; } @@ -210,8 +250,8 @@ do_mpage_readpage(struct bio *bio, struc * we just collected from get_block into the page's buffers * so readpage doesn't have to repeat the get_block call */ - if (buffer_uptodate(&bh)) { - map_buffer_to_page(page, &bh, page_block); + if (buffer_uptodate(map_bh)) { + map_buffer_to_page(page, map_bh, page_block); goto confused; } @@ -219,10 +259,20 @@ do_mpage_readpage(struct bio *bio, struc goto confused; /* hole -> non-hole */ /* Contiguous blocks? 
*/ - if (page_block && blocks[page_block-1] != bh.b_blocknr-1) + if (page_block && blocks[page_block-1] != map_bh->b_blocknr-1) goto confused; - blocks[page_block] = bh.b_blocknr; - bdev = bh.b_bdev; + nblocks = map_bh->b_size >> blkbits; + for (i = 0; ; i++) { + if (i == nblocks) { + *map_valid = 0; + break; + } else if (page_block == blocks_per_page) + break; + blocks[page_block] = map_bh->b_blocknr + i; + page_block++; + block_in_file++; + } + bdev = map_bh->b_bdev; } if (first_hole != blocks_per_page) { @@ -261,7 +311,7 @@ alloc_new: goto alloc_new; } - if (buffer_boundary(&bh) || (first_hole != blocks_per_page)) + if (buffer_boundary(map_bh) || (first_hole != blocks_per_page)) bio = mpage_bio_submit(READ, bio); else *last_block_in_bio = blocks[blocks_per_page - 1]; @@ -332,6 +382,9 @@ mpage_readpages(struct address_space *ma unsigned page_idx; sector_t last_block_in_bio = 0; struct pagevec lru_pvec; + struct buffer_head map_bh; + unsigned long first_logical_block = 0; + int map_valid = 0; pagevec_init(&lru_pvec, 0); for (page_idx = 0; page_idx < nr_pages; page_idx++) { @@ -343,7 +396,9 @@ mpage_readpages(struct address_space *ma page->index, GFP_KERNEL)) { bio = do_mpage_readpage(bio, page, nr_pages - page_idx, - &last_block_in_bio, get_block); + &last_block_in_bio, &map_bh, + &first_logical_block, + &map_valid, get_block); if (!pagevec_add(&lru_pvec, page)) __pagevec_lru_add(&lru_pvec); } else { @@ -365,9 +420,14 @@ int mpage_readpage(struct page *page, ge { struct bio *bio = NULL; sector_t last_block_in_bio = 0; + struct buffer_head map_bh; + unsigned long first_logical_block = 0; + int map_valid = 0; + - bio = do_mpage_readpage(bio, page, 1, - &last_block_in_bio, get_block); + bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio, + &map_bh, &first_logical_block, &map_valid, + get_block); if (bio) mpage_bio_submit(READ, bio); return 0; Index: linux-2.6.16-rc4/fs/jfs/inode.c =================================================================== --- 
linux-2.6.16-rc4.orig/fs/jfs/inode.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/jfs/inode.c 2006-02-20 09:48:33.000000000 -0800 @@ -257,7 +257,8 @@ jfs_get_blocks(struct inode *ip, sector_ static int jfs_get_block(struct inode *ip, sector_t lblock, struct buffer_head *bh_result, int create) { - return jfs_get_blocks(ip, lblock, 1, bh_result, create); + return jfs_get_blocks(ip, lblock, bh_result->b_size >> ip->i_blkbits, + bh_result, create); } static int jfs_writepage(struct page *page, struct writeback_control *wbc) Index: linux-2.6.16-rc4/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux-2.6.16-rc4.orig/fs/xfs/linux-2.6/xfs_aops.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/xfs/linux-2.6/xfs_aops.c 2006-02-20 09:48:33.000000000 -0800 @@ -1136,8 +1136,8 @@ linvfs_get_block( struct buffer_head *bh_result, int create) { - return __linvfs_get_block(inode, iblock, 0, bh_result, - create, 0, BMAPI_WRITE); + return __linvfs_get_block(inode, iblock, bh_result->b_size >> inode->i_blkbits, + bh_result, create, 0, BMAPI_WRITE); } STATIC int ^ permalink raw reply [flat|nested] 32+ messages in thread
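[Editorial note: the key change in this patch is that do_mpage_readpage() now
consumes a still-valid mapping left over from a previous get_block() call
before issuing a new one. A userspace model of that reuse window (the variable
names follow the patch; the fixed page size and harness are hypothetical):]

```c
#include <assert.h>

#define BLOCKS_PER_PAGE 4

/* Fill blocks[] for one page from a previous get_block() result that
 * covered file blocks [first_logical_block, first_logical_block + nblocks)
 * mapped to disk starting at map_blocknr. Returns how many page blocks
 * were satisfied; clears *map_valid when the old mapping is used up,
 * mirroring the loop added to do_mpage_readpage(). */
static unsigned reuse_mapping(unsigned long long blocks[],
			      unsigned long long block_in_file,
			      unsigned long long first_logical_block,
			      unsigned long long map_blocknr,
			      unsigned nblocks, int *map_valid)
{
	unsigned page_block = 0;

	if (!*map_valid || block_in_file <= first_logical_block ||
	    block_in_file >= first_logical_block + nblocks)
		return 0;            /* nothing reusable; call get_block() */

	unsigned map_offset = (unsigned)(block_in_file - first_logical_block);
	unsigned last = nblocks - map_offset;

	for (unsigned i = 0; ; i++) {
		if (i == last) {
			*map_valid = 0;          /* old mapping consumed */
			break;
		}
		if (page_block == BLOCKS_PER_PAGE)
			break;                   /* page full; mapping still valid */
		blocks[page_block++] = map_blocknr + map_offset + i;
	}
	return page_block;
}
```

This is why mpage_readpages() can now carry map_bh, first_logical_block and
map_valid across pages: a single large get_block() answer services several
consecutive pages with no further filesystem calls.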
* Re: [PATCH 2/3] map multiple blocks for mpage_readpages()
From: Suparna Bhattacharya @ 2006-02-23 3:29 UTC
To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel

On Mon, Feb 20, 2006 at 01:23:55PM -0800, Badari Pulavarty wrote:
> [PATCH 2/3] map multiple blocks for mpage_readpages()
>
> This patch changes mpage_readpages() and get_block() to get the disk
> mapping information for multiple blocks at the same time.
>
> b_size represents the amount of disk mapping that needs to be mapped.
> On a successful get_block(), b_size indicates the amount of disk
> mapping that was actually mapped. Only filesystems that care to use
> this information and can provide multiple disk blocks at a time need
> do so.
>
> No changes are needed for filesystems that want to ignore this.

That's a nice way to handle this!

Just one minor suggestion on the patches - instead of adding all those
arguments to do_mpage_readpage(), have you considered using an mio
structure, as we did in the mpage_writepages() multiple-blocks case?

Regards
Suparna

--
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India
* Re: [PATCH 2/3] map multiple blocks for mpage_readpages()
From: Badari Pulavarty @ 2006-02-23 22:18 UTC
To: suparna; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel

On Thu, 2006-02-23 at 08:59 +0530, Suparna Bhattacharya wrote:
> That's a nice way to handle this!
>
> Just one minor suggestion on the patches - instead of adding all those
> arguments to do_mpage_readpage(), have you considered using an mio
> structure, as we did in the mpage_writepages() multiple-blocks case?

I did think about moving the state into an "mio" and passing it
around. Then I realized I would need to think hard about
mpage_writepages() as well, to come up with a super-set "mio". For
now, I settled on plain arguments; when we clean up mpage_writepages()
we can take a pass at cleaning up mpage_readpages() too. (BTW, only 2
more arguments were added.)

Thanks,
Badari
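[Editorial note: the "mio" structure Suparna suggests would bundle the three
cross-call arguments into one cursor. A hypothetical sketch of what that
consolidation might look like — the struct name, fields, and helper are
illustrative only; no such patch exists in this thread:]

```c
#include <assert.h>
#include <string.h>

/* Hypothetical "mio" cursor carrying the readpages state that the patch
 * instead passes as three extra do_mpage_readpage() arguments. */
struct mio {
	unsigned long long last_block_in_bio;
	unsigned long first_logical_block;
	unsigned long map_bh_size;   /* stand-in for the cached map_bh */
	int map_valid;
};

/* One call replaces the three zero-initializations in mpage_readpages(). */
static void mio_init(struct mio *m)
{
	memset(m, 0, sizeof(*m));
}
```

The trade-off Badari describes is that mpage_writepages() would want a
superset of these fields, so a shared structure is better designed once both
paths are cleaned up together.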
* [PATCH 3/3] remove ->get_blocks() support
From: Badari Pulavarty @ 2006-02-20 21:24 UTC
To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 261 bytes --]

[PATCH 3/3] remove ->get_blocks() support

Now that get_block() can handle mapping multiple disk blocks, there is
no need for ->get_blocks(). This patch removes the fs-specific
->get_blocks() that was added for DIO and makes its users use
get_block() instead.

Thanks,
Badari

[-- Attachment #2: remove-getblocks.patch --]
[-- Type: text/x-patch, Size: 16572 bytes --]

Now that get_block() can handle mapping multiple disk blocks, there is
no need for ->get_blocks(). This patch removes the fs-specific
->get_blocks() that was added for DIO and makes its users use
get_block() instead.
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> fs/block_dev.c | 3 ++- fs/direct-io.c | 27 ++++++++++++++------------- fs/ext2/inode.c | 14 +------------- fs/ext3/inode.c | 3 +-- fs/fat/inode.c | 2 +- fs/hfs/inode.c | 13 +------------ fs/hfsplus/inode.c | 13 +------------ fs/jfs/inode.c | 2 +- fs/ocfs2/aops.c | 2 +- fs/reiserfs/inode.c | 1 - fs/xfs/linux-2.6/xfs_aops.c | 5 ++--- include/linux/fs.h | 17 +++++++---------- 12 files changed, 32 insertions(+), 70 deletions(-) Index: linux-2.6.16-rc3/fs/direct-io.c =================================================================== --- linux-2.6.16-rc3.orig/fs/direct-io.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/direct-io.c 2006-02-14 15:57:27.000000000 -0800 @@ -86,12 +86,12 @@ struct dio { unsigned first_block_in_page; /* doesn't change, Used only once */ int boundary; /* prev block is at a boundary */ int reap_counter; /* rate limit reaping */ - get_blocks_t *get_blocks; /* block mapping function */ + get_block_t *get_block; /* block mapping function */ dio_iodone_t *end_io; /* IO completion function */ sector_t final_block_in_bio; /* current final block in bio + 1 */ sector_t next_block_for_io; /* next block to be put under IO, in dio_blocks units */ - struct buffer_head map_bh; /* last get_blocks() result */ + struct buffer_head map_bh; /* last get_block() result */ /* * Deferred addition of a page to the dio. These variables are @@ -210,9 +210,9 @@ static struct page *dio_get_page(struct /* * Called when all DIO BIO I/O has been completed - let the filesystem - * know, if it registered an interest earlier via get_blocks. Pass the + * know, if it registered an interest earlier via get_block. Pass the * private field of the map buffer_head so that filesystems can use it - * to hold additional state between get_blocks calls and dio_complete. + * to hold additional state between get_block calls and dio_complete. 
*/ static void dio_complete(struct dio *dio, loff_t offset, ssize_t bytes) { @@ -488,7 +488,7 @@ static int dio_bio_reap(struct dio *dio) * The fs is allowed to map lots of blocks at once. If it wants to do that, * it uses the passed inode-relative block number as the file offset, as usual. * - * get_blocks() is passed the number of i_blkbits-sized blocks which direct_io + * get_block() is passed the number of i_blkbits-sized blocks which direct_io * has remaining to do. The fs should not map more than this number of blocks. * * If the fs has mapped a lot of blocks, it should populate bh->b_size to @@ -501,7 +501,7 @@ static int dio_bio_reap(struct dio *dio) * In the case of filesystem holes: the fs may return an arbitrarily-large * hole by returning an appropriate value in b_size and by clearing * buffer_mapped(). However the direct-io code will only process holes one - * block at a time - it will repeatedly call get_blocks() as it walks the hole. + * block at a time - it will repeatedly call get_block() as it walks the hole. */ static int get_more_blocks(struct dio *dio) { @@ -543,7 +543,8 @@ static int get_more_blocks(struct dio *d * at a higher level for inside-i_size block-instantiating * writes. */ - ret = (*dio->get_blocks)(dio->inode, fs_startblk, fs_count, + map_bh->b_size = fs_count << dio->blkbits; + ret = (*dio->get_block)(dio->inode, fs_startblk, map_bh, create); } return ret; @@ -778,11 +779,11 @@ static void dio_zero_block(struct dio *d * happily perform page-sized but 512-byte aligned IOs. It is important that * blockdev IO be able to have fine alignment and large sizes. * - * So what we do is to permit the ->get_blocks function to populate bh.b_size + * So what we do is to permit the ->get_block function to populate bh.b_size * with the size of IO which is permitted at this offset and this i_blkbits. * * For best results, the blockdev should be set up with 512-byte i_blkbits and - * it should set b_size to PAGE_SIZE or more inside get_blocks(). 
This gives + * it should set b_size to PAGE_SIZE or more inside get_block(). This gives * fine alignment but still allows this function to work in PAGE_SIZE units. */ static int do_direct_IO(struct dio *dio) @@ -942,7 +943,7 @@ out: static ssize_t direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, const struct iovec *iov, loff_t offset, unsigned long nr_segs, - unsigned blkbits, get_blocks_t get_blocks, dio_iodone_t end_io, + unsigned blkbits, get_block_t get_block, dio_iodone_t end_io, struct dio *dio) { unsigned long user_addr; @@ -964,7 +965,7 @@ direct_io_worker(int rw, struct kiocb *i dio->boundary = 0; dio->reap_counter = 0; - dio->get_blocks = get_blocks; + dio->get_block = get_block; dio->end_io = end_io; dio->map_bh.b_private = NULL; dio->final_block_in_bio = -1; @@ -1170,7 +1171,7 @@ direct_io_worker(int rw, struct kiocb *i ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io, + unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, int dio_lock_type) { int seg; @@ -1266,7 +1267,7 @@ __blockdev_direct_IO(int rw, struct kioc (end > i_size_read(inode))); retval = direct_io_worker(rw, iocb, inode, iov, offset, - nr_segs, blkbits, get_blocks, end_io, dio); + nr_segs, blkbits, get_block, end_io, dio); if (rw == READ && dio_lock_type == DIO_LOCKING) reader_with_isem = 0; Index: linux-2.6.16-rc3/fs/ext2/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/ext2/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/ext2/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -667,18 +667,6 @@ static sector_t ext2_bmap(struct address return generic_block_bmap(mapping,block,ext2_get_block); } -static int -ext2_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks, - struct buffer_head *bh_result, int 
create) -{ - int ret; - - ret = ext2_get_block(inode, iblock, bh_result, create); - if (ret == 0) - bh_result->b_size = (1 << inode->i_blkbits); - return ret; -} - static ssize_t ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) @@ -687,7 +675,7 @@ ext2_direct_IO(int rw, struct kiocb *ioc struct inode *inode = file->f_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, ext2_get_blocks, NULL); + offset, nr_segs, ext2_get_block, NULL); } static int Index: linux-2.6.16-rc3/fs/ext3/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/ext3/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/ext3/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -806,8 +806,7 @@ static int ext3_get_block(struct inode * static int ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh_result, - int create) + struct buffer_head *bh_result, int create) { handle_t *handle = journal_current_handle(); int ret = 0; Index: linux-2.6.16-rc3/fs/fat/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/fat/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/fat/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -101,11 +101,11 @@ static int __fat_get_blocks(struct inode } static int fat_get_blocks(struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh_result, int create) { struct super_block *sb = inode->i_sb; int err; + unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits; err = __fat_get_blocks(inode, iblock, &max_blocks, bh_result, create); if (err) Index: linux-2.6.16-rc3/fs/hfs/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/hfs/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ 
linux-2.6.16-rc3/fs/hfs/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -98,17 +98,6 @@ static int hfs_releasepage(struct page * return res ? try_to_free_buffers(page) : 0; } -static int hfs_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks, - struct buffer_head *bh_result, int create) -{ - int ret; - - ret = hfs_get_block(inode, iblock, bh_result, create); - if (!ret) - bh_result->b_size = (1 << inode->i_blkbits); - return ret; -} - static ssize_t hfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { @@ -116,7 +105,7 @@ static ssize_t hfs_direct_IO(int rw, str struct inode *inode = file->f_dentry->d_inode->i_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, hfs_get_blocks, NULL); + offset, nr_segs, hfs_get_block, NULL); } static int hfs_writepages(struct address_space *mapping, Index: linux-2.6.16-rc3/fs/hfsplus/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/hfsplus/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/hfsplus/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -93,17 +93,6 @@ static int hfsplus_releasepage(struct pa return res ? 
try_to_free_buffers(page) : 0; } -static int hfsplus_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks, - struct buffer_head *bh_result, int create) -{ - int ret; - - ret = hfsplus_get_block(inode, iblock, bh_result, create); - if (!ret) - bh_result->b_size = (1 << inode->i_blkbits); - return ret; -} - static ssize_t hfsplus_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_t offset, unsigned long nr_segs) { @@ -111,7 +100,7 @@ static ssize_t hfsplus_direct_IO(int rw, struct inode *inode = file->f_dentry->d_inode->i_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, hfsplus_get_blocks, NULL); + offset, nr_segs, hfsplus_get_block, NULL); } static int hfsplus_writepages(struct address_space *mapping, Index: linux-2.6.16-rc3/fs/ocfs2/aops.c =================================================================== --- linux-2.6.16-rc3.orig/fs/ocfs2/aops.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/ocfs2/aops.c 2006-02-14 15:57:27.000000000 -0800 @@ -538,7 +538,6 @@ bail: * fs_count, map_bh, dio->rw == WRITE); */ static int ocfs2_direct_IO_get_blocks(struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh_result, int create) { int ret; @@ -546,6 +545,7 @@ static int ocfs2_direct_IO_get_blocks(st u64 p_blkno; int contig_blocks; unsigned char blocksize_bits; + unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits; if (!inode || !bh_result) { mlog(ML_ERROR, "inode or bh_result is null\n"); Index: linux-2.6.16-rc3/fs/reiserfs/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/reiserfs/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/reiserfs/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -466,7 +466,6 @@ static int reiserfs_get_block_create_0(s direct_IO request. 
*/ static int reiserfs_get_blocks_direct_io(struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh_result, int create) { Index: linux-2.6.16-rc3/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- linux-2.6.16-rc3.orig/fs/xfs/linux-2.6/xfs_aops.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/xfs/linux-2.6/xfs_aops.c 2006-02-14 15:57:27.000000000 -0800 @@ -1144,12 +1144,11 @@ STATIC int linvfs_get_blocks_direct( struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh_result, int create) { - return __linvfs_get_block(inode, iblock, max_blocks, bh_result, - create, 1, BMAPI_WRITE|BMAPI_DIRECT); + return __linvfs_get_block(inode, iblock, bh_result->b_size >> inode->i_blkbits, + bh_result, create, 1, BMAPI_WRITE|BMAPI_DIRECT); } STATIC void Index: linux-2.6.16-rc3/include/linux/fs.h =================================================================== --- linux-2.6.16-rc3.orig/include/linux/fs.h 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/include/linux/fs.h 2006-02-14 15:57:27.000000000 -0800 @@ -240,9 +240,6 @@ extern void __init files_init(unsigned l struct buffer_head; typedef int (get_block_t)(struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create); -typedef int (get_blocks_t)(struct inode *inode, sector_t iblock, - unsigned long max_blocks, - struct buffer_head *bh_result, int create); typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset, ssize_t bytes, void *private); @@ -1621,7 +1618,7 @@ static inline void do_generic_file_read( ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, loff_t offset, - unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io, + unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, int lock_type); enum { @@ -1632,29 +1629,29 @@ enum { static inline ssize_t 
blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks, + loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io) { return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_blocks, end_io, DIO_LOCKING); + nr_segs, get_block, end_io, DIO_LOCKING); } static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks, + loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io) { return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_blocks, end_io, DIO_NO_LOCKING); + nr_segs, get_block, end_io, DIO_NO_LOCKING); } static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb, struct inode *inode, struct block_device *bdev, const struct iovec *iov, - loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks, + loff_t offset, unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io) { return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, - nr_segs, get_blocks, end_io, DIO_OWN_LOCKING); + nr_segs, get_block, end_io, DIO_OWN_LOCKING); } extern struct file_operations generic_ro_fops; Index: linux-2.6.16-rc3/fs/block_dev.c =================================================================== --- linux-2.6.16-rc3.orig/fs/block_dev.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/block_dev.c 2006-02-14 15:57:27.000000000 -0800 @@ -135,9 +135,10 @@ blkdev_get_block(struct inode *inode, se static int blkdev_get_blocks(struct inode *inode, sector_t iblock, - unsigned long max_blocks, struct buffer_head *bh, int create) + struct buffer_head *bh, int create) { sector_t end_block = max_block(I_BDEV(inode)); + unsigned long max_blocks = bh->b_size >> 
inode->i_blkbits; if ((iblock + max_blocks) > end_block) { max_blocks = end_block - iblock; Index: linux-2.6.16-rc3/fs/jfs/inode.c =================================================================== --- linux-2.6.16-rc3.orig/fs/jfs/inode.c 2006-02-14 15:54:12.000000000 -0800 +++ linux-2.6.16-rc3/fs/jfs/inode.c 2006-02-14 15:57:27.000000000 -0800 @@ -301,7 +301,7 @@ static ssize_t jfs_direct_IO(int rw, str struct inode *inode = file->f_mapping->host; return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov, - offset, nr_segs, jfs_get_blocks, NULL); + offset, nr_segs, jfs_get_block, NULL); } struct address_space_operations jfs_aops = { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty ` (2 preceding siblings ...) 2006-02-20 21:24 ` [PATCH 3/3] remove ->get_blocks() support Badari Pulavarty @ 2006-02-20 21:59 ` Nathan Scott 2006-02-20 23:06 ` Badari Pulavarty 2006-02-21 2:41 ` Jeremy Higdon 2006-02-22 15:12 ` christoph 4 siblings, 2 replies; 32+ messages in thread From: Nathan Scott @ 2006-02-20 21:59 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > Hi, Hi Badari, > Following patches add support to map multiple blocks in ->get_block(). > This is will allow us to handle mapping of multiple disk blocks for > mpage_readpages() and mpage_writepages() etc. Instead of adding new > argument, I use "b_size" to indicate the amount of disk mapping needed > for get_block(). And also, on success get_block() actually indicates > the amount of disk mapping it did. Thanks for doing this work! > Now that get_block() can handle multiple blocks, there is no need > for ->get_blocks() which was added for DIO. > > [PATCH 1/3] pass b_size to ->get_block() > > [PATCH 2/3] map multiple blocks for mpage_readpages() > > [PATCH 3/3] remove ->get_blocks() support > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. > (on simple "dd" read tests). > > (rc3.mm1) (rc3.mm1 + patches) > real 0m18.814s 0m18.482s > user 0m0.000s 0m0.004s > sys 0m3.240s 0m2.912s > > Andrew, Could you include it in -mm tree ? > > Comments ? I've been running these patches in my development tree for awhile and have not seen any problems. My one (possibly minor) concern is that we pass get_block a size in units of bytes, e.g.... bh->b_size = 1 << inode->i_blkbits; err = get_block(inode, block, bh, 1); And b_size is a u32. 
We have had the situation in the past where people (I'm looking at you, Jeremy ;) have been issuing multiple-gigabyte direct reads/writes through XFS. The syscall interface takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit byte count. I wonder if this change will end up ruining things for the lunatic fringe issuing these kinds of IOs? Maybe the get_block call could take a block count rather than a byte count? (I guess that would equate to dropping get_block_t rather than get_blocks_t... which is kinda the alternate direction to what you took here). On the other hand, maybe it'd be simpler to change b_size to be a size_t instead of u32? Although, since we are now mapping multiple blocks at once, "get_blocks_t" does seem an appropriate name. *shrug*, whatever ... the main thing that'd be good to see addressed is the 32 bit size. cheers. -- Nathan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott @ 2006-02-20 23:06 ` Badari Pulavarty 2006-02-20 23:16 ` Nathan Scott 2006-02-21 2:41 ` Jeremy Higdon 1 sibling, 1 reply; 32+ messages in thread From: Badari Pulavarty @ 2006-02-20 23:06 UTC (permalink / raw) To: Nathan Scott; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote: > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > > Hi, > > Hi Badari, > > > Following patches add support to map multiple blocks in ->get_block(). > > This is will allow us to handle mapping of multiple disk blocks for > > mpage_readpages() and mpage_writepages() etc. Instead of adding new > > argument, I use "b_size" to indicate the amount of disk mapping needed > > for get_block(). And also, on success get_block() actually indicates > > the amount of disk mapping it did. > > Thanks for doing this work! > > > Now that get_block() can handle multiple blocks, there is no need > > for ->get_blocks() which was added for DIO. > > > > [PATCH 1/3] pass b_size to ->get_block() > > > > [PATCH 2/3] map multiple blocks for mpage_readpages() > > > > [PATCH 3/3] remove ->get_blocks() support > > > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. > > (on simple "dd" read tests). > > > > (rc3.mm1) (rc3.mm1 + patches) > > real 0m18.814s 0m18.482s > > user 0m0.000s 0m0.004s > > sys 0m3.240s 0m2.912s > > > > Andrew, Could you include it in -mm tree ? > > > > Comments ? > > I've been running these patches in my development tree for awhile > and have not seen any problems. My one (possibly minor) concern > is that we pass get_block a size in units of bytes, e.g.... > > bh->b_size = 1 << inode->i_blkbits; > err = get_block(inode, block, bh, 1); > > And b_size is a u32. 
We have had the situation in the past where > people (I'm looking at you, Jeremy ;) have been issuing multiple- > gigabyte direct reads/writes through XFS. The syscall interface > takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit > byte count. > I wonder if this change will end up ruining things for the lunatic > fringe issuing these kinds of IOs? Maybe the get_block call could > take a block count rather than a byte count? Yes. I thought about it too.. I wanted to pass "block count" instead of "byte count". Right now it does .. bh->b_size = 1 << inode->i_blkbits; call get_block(); First thing get_block() does is blocks = bh->b_size >> inode->i_blkbits; All that unnecessary shifting around for nothing :( But, I ended up doing "byte count" just to avoid the confusion of asking in "blocks" and getting back "bytes". I have no problem making b_size a "size_t" to handle 64-bit. But again, if we are fiddling with buffer_head - maybe it's time to look at an alternative to "buffer_head" with exactly the information we need for get_block()? Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-20 23:06 ` Badari Pulavarty @ 2006-02-20 23:16 ` Nathan Scott 0 siblings, 0 replies; 32+ messages in thread From: Nathan Scott @ 2006-02-20 23:16 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel On Mon, Feb 20, 2006 at 03:06:11PM -0800, Badari Pulavarty wrote: > On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote: > ... > > I wonder if this change will end up ruining things for the lunatic > > fringe issuing these kinds of IOs? Maybe the get_block call could > > take a block count rather than a byte count? > > Yes. I thought about it too.. I wanted to pass "block count" instead > of "byte count". Right now it does .. > > bh->b_size = 1 << inode->i_blkbits; > call get_block(); > > First thing get_block() does is > blocks = bh->b_size >> inode->i_blkbits; > > All that unnecessary shifting around for nothing :( Yeah, pretty silly really, but there's not much choice if the goal is to keep this simple. Oh well. > But, I ended up doing "byte count" just to avoid the confusion of > asking in "blocks" and getting back "bytes". Understood. > I have no problem making b_size a "size_t" to handle 64-bit. > But again, if we are fiddling with buffer_head - maybe it's time > to look at an alternative to "buffer_head" with exactly the information > we need for get_block()? That is a much bigger change - I'm not in a position to make the call on whether that's in everyone's best interests. However, I do want to make sure we don't regress anything, so I guess the u32 to size_t switch probably should be made to resolve this issue. Thanks again for following up on this. cheers. -- Nathan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott 2006-02-20 23:06 ` Badari Pulavarty @ 2006-02-21 2:41 ` Jeremy Higdon 2006-02-21 16:03 ` Badari Pulavarty 1 sibling, 1 reply; 32+ messages in thread From: Jeremy Higdon @ 2006-02-21 2:41 UTC (permalink / raw) To: Nathan Scott; +Cc: Badari Pulavarty, christoph, mcao, akpm, lkml, linux-fsdevel On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote: > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > > I've been running these patches in my development tree for awhile > and have not seen any problems. My one (possibly minor) concern > is that we pass get_block a size in units of bytes, e.g.... > > bh->b_size = 1 << inode->i_blkbits; > err = get_block(inode, block, bh, 1); > > And b_size is a u32. We have had the situation in the past where > people (I'm looking at you, Jeremy ;) have been issuing multiple- > gigabyte direct reads/writes through XFS. The syscall interface > takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit > byte count. > > I wonder if this change will end up ruining things for the lunatic > fringe issuing these kinds of IOs? Maybe the get_block call could Hey! Lunatic fringe indeed. Harumph! :-) Yes, there are a few people out there who will need to issue really large I/O reads or writes to get maximum I/O bandwidth on large stripes. The largest I've done so far is 4GiB, but I expect that number will likely increase this year, and more likely next year, if not. jeremy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-21 2:41 ` Jeremy Higdon @ 2006-02-21 16:03 ` Badari Pulavarty 2006-02-21 21:39 ` Nathan Scott 0 siblings, 1 reply; 32+ messages in thread From: Badari Pulavarty @ 2006-02-21 16:03 UTC (permalink / raw) To: Jeremy Higdon; +Cc: Nathan Scott, christoph, mcao, akpm, lkml, linux-fsdevel On Mon, 2006-02-20 at 18:41 -0800, Jeremy Higdon wrote: > On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote: > > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > > > > I've been running these patches in my development tree for awhile > > and have not seen any problems. My one (possibly minor) concern > > is that we pass get_block a size in units of bytes, e.g.... > > > > bh->b_size = 1 << inode->i_blkbits; > > err = get_block(inode, block, bh, 1); > > > > And b_size is a u32. We have had the situation in the past where > > people (I'm looking at you, Jeremy ;) have been issuing multiple- > > gigabyte direct reads/writes through XFS. The syscall interface > > takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit > > byte count. > > > > I wonder if this change will end up ruining things for the lunatic > > fringe issuing these kinds of IOs? Maybe the get_block call could > > Hey! Lunatic fringe indeed. Harumph! :-) > > Yes, there are a few people out there who will need to issue really > large I/O reads or writes to get maximum I/O bandwidth on large > stripes. The largest I've done so far is 4GiB, but I expect that > number will likely increase this year, and more likely next year, > if not. Thinking more about it, currently (without my patches) only DIO code can request for large chunks of mapping through get_blocks(). Since b_size is only u32, you can only map 4GB max. Isn't it ? With my patches, now mpage_readpages() also can request large chunks. (through readahead). So, my patches are not adding any extra limitation. Its carrying the same existing limitation. 
In order to handle larger chunks of disk mapping, changing b_size to u64 is required, and we should pursue it irrespective of my patches. Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-21 16:03 ` Badari Pulavarty @ 2006-02-21 21:39 ` Nathan Scott 0 siblings, 0 replies; 32+ messages in thread From: Nathan Scott @ 2006-02-21 21:39 UTC (permalink / raw) To: Badari Pulavarty Cc: Jeremy Higdon, christoph, mcao, akpm, lkml, linux-fsdevel On Tue, Feb 21, 2006 at 08:03:54AM -0800, Badari Pulavarty wrote: > ... > Thinking more about it, currently (without my patches) only DIO > code can request for large chunks of mapping through get_blocks(). > Since b_size is only u32, you can only map 4GB max. Isn't it ? > With my patches, now mpage_readpages() also can request large > chunks. (through readahead). So, my patches are not adding any > extra limitation. Its carrying the same existing limitation. The DIO get_blocks is super-freaky at the moment, in that it passes in an unsigned long max_blocks (units of blocks) but it expects the filesystem to fill in a u32 b_size (in bytes). This does at least give the filesystem the hint that this was a monster write and it gets the chance to allocate contiguously, but that was probably unintentional, who knows? I guess that subtlety would be dropped with this change if b_size isn't made a 64 bit entity. > In order to handle larger chunks of disk mapping, changing b_size > to u64 is required and we should request for it irrespective of my > patches. *shrug*... its all very closely inter-related, I'd keep it together. I'd also thought that size_t was the right type here, not u64? 32 bit platforms aren't capable of submitting IOs this large, so they needn't be doing any 64 bit manipulation here, no? cheers. -- Nathan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty ` (3 preceding siblings ...) 2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott @ 2006-02-22 15:12 ` christoph 2006-02-22 16:58 ` Badari Pulavarty 2006-02-24 17:19 ` Badari Pulavarty 4 siblings, 2 replies; 32+ messages in thread From: christoph @ 2006-02-22 15:12 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > Hi, > > Following patches add support to map multiple blocks in ->get_block(). > This is will allow us to handle mapping of multiple disk blocks for > mpage_readpages() and mpage_writepages() etc. Instead of adding new > argument, I use "b_size" to indicate the amount of disk mapping needed > for get_block(). And also, on success get_block() actually indicates > the amount of disk mapping it did. > > Now that get_block() can handle multiple blocks, there is no need > for ->get_blocks() which was added for DIO. > > [PATCH 1/3] pass b_size to ->get_block() > > [PATCH 2/3] map multiple blocks for mpage_readpages() > > [PATCH 3/3] remove ->get_blocks() support > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. > (on simple "dd" read tests). > > (rc3.mm1) (rc3.mm1 + patches) > real 0m18.814s 0m18.482s > user 0m0.000s 0m0.004s > sys 0m3.240s 0m2.912s > > Andrew, Could you include it in -mm tree ? Thanks Badari, with those interface changes the mpage_readpage changes look a lot nicer than my original version. I'd like to second the request to put it into -mm. And if the namesys folks could try out whether this works for their reiser4 requirements it'd be nice. If you have an even faster ->readpages I'd be interested in that secret sauce recipe for further improvement to mpage_readpages. 
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 15:12 ` christoph @ 2006-02-22 16:58 ` Badari Pulavarty 2006-02-22 16:59 ` christoph ` (2 more replies) 2006-02-24 17:19 ` Badari Pulavarty 1 sibling, 3 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-22 16:58 UTC (permalink / raw) To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel, vs, zam On Wed, 2006-02-22 at 16:12 +0100, christoph wrote: > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote: > > Hi, > > > > Following patches add support to map multiple blocks in ->get_block(). > > This is will allow us to handle mapping of multiple disk blocks for > > mpage_readpages() and mpage_writepages() etc. Instead of adding new > > argument, I use "b_size" to indicate the amount of disk mapping needed > > for get_block(). And also, on success get_block() actually indicates > > the amount of disk mapping it did. > > > > Now that get_block() can handle multiple blocks, there is no need > > for ->get_blocks() which was added for DIO. > > > > [PATCH 1/3] pass b_size to ->get_block() > > > > [PATCH 2/3] map multiple blocks for mpage_readpages() > > > > [PATCH 3/3] remove ->get_blocks() support > > > > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. > > (on simple "dd" read tests). > > > > (rc3.mm1) (rc3.mm1 + patches) > > real 0m18.814s 0m18.482s > > user 0m0.000s 0m0.004s > > sys 0m3.240s 0m2.912s > > > > Andrew, Could you include it in -mm tree ? > > Thanks Badari, with that interface changes the mpage_readpage changes > look a lot nicer than my original version. I'd like to second > the request to put it into -mm. Thanks. Only current issue Nathan raised is, he wants to see b_size change to u64 (instead of u32) to support really-huge-IO requests. Does this sound reasonable to you ? Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 16:58 ` Badari Pulavarty @ 2006-02-22 16:59 ` christoph 2006-02-23 1:40 ` Nathan Scott 2006-02-22 17:23 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach 2006-02-22 18:37 ` Dave Kleikamp 2 siblings, 1 reply; 32+ messages in thread From: christoph @ 2006-02-22 16:59 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote: > Thanks. Only current issue Nathan raised is, he wants to see > b_size change to u64 (instead of u32) to support really-huge-IO > requests. Does this sound reasonable to you ? I know that we didn't want to increase b_size at some point because of concerns about the number of objects fitting into a page in the slab allocator. If the same number of bigger heads fit into the same page I see no problems with the increase. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 16:59 ` christoph @ 2006-02-23 1:40 ` Nathan Scott 2006-02-23 1:59 ` Andrew Morton 0 siblings, 1 reply; 32+ messages in thread From: Nathan Scott @ 2006-02-23 1:40 UTC (permalink / raw) To: Badari Pulavarty, Christoph Hellwig; +Cc: mcao, akpm, lkml, linux-fsdevel On Wed, Feb 22, 2006 at 05:59:42PM +0100, christoph wrote: > On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote: > > Thanks. Only current issue Nathan raised is, he wants to see > > b_size change to u64 [size_t] (instead of u32) to support > > really-huge-IO requests. And also to not go backwards on what the current DIO mapping interface provides us. > Does this sound reasonable to you ? > > I know that we didn't want to increase b_size at some point because > of concerns about the number of objects fitting into a page in the There's four extra bytes on an 88 byte structure (with sector_t CONFIG'd at 64 bits) - so, yes, there'll be a small decrease in the number that fit in a page on 64 bit platforms. Perhaps it's not worth it, but it would be good to sort out these pesky size mismatches cos they introduce very subtle bugs. > slab allocator. If the same number of bigger heads fit into the > same page I see no problems with the increase. Heh, bigger heads. Well, the same number fit in the page for 32 bit platforms, does that count? Taking a quick look at the struct definition, there are some oddities that crept in since the 2.4 version - looks like sct had arranged the fields in nice 32-bit-platform-cache-aligned groups, with several cache-alignment related comments, some of which now remain and make very little sense on their own (& with the now-incorrect grouping). Anyway, here's a patch Badari - I reckon it's worth it, but then I am a bit biased (as it's likely only XFS is going to be seeing this size of I/O for some time, and as someone who has hunted down some of these size mismatch bugs...) cheers. 
-- Nathan Increase the size of the buffer_head b_size field for 64 bit platforms. Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. Signed-off-by: Nathan Scott <nathans@sgi.com> --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -46,20 +46,25 @@ struct address_space; typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); /* - * Keep related fields in common cachelines. The most commonly accessed - * field (b_state) goes at the start so the compiler does not generate - * indexed addressing for it. + * Historically, a buffer_head was used to map a single block + * within a page, and of course as the unit of I/O through the + * filesystem and block layers. Nowadays the basic I/O unit + * is the bio, and buffer_heads are used for extracting block + * mappings (via a get_block_t call), for tracking state within + * a page (via a page_mapping) and for wrapping bio submission + * for backward compatibility reasons (e.g. submit_bh). There + * may be one or two other uses too (I used it for drying the + * dishes the other night when I couldn't find a tea towel...). 
*/ struct buffer_head { - /* First cache line: */ unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ - atomic_t b_count; /* users using this block */ - u32 b_size; /* block size */ + atomic_t b_count; /* users using this buffer_head */ - sector_t b_blocknr; /* block number */ - char *b_data; /* pointer to data block */ + size_t b_size; /* size of mapping */ + sector_t b_blocknr; /* start block number */ + char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-23 1:40 ` Nathan Scott @ 2006-02-23 1:59 ` Andrew Morton 2006-02-23 16:28 ` [PATCH] change b_size to size_t Badari Pulavarty 0 siblings, 1 reply; 32+ messages in thread From: Andrew Morton @ 2006-02-23 1:59 UTC (permalink / raw) To: Nathan Scott; +Cc: pbadari, hch, mcao, linux-kernel, linux-fsdevel Nathan Scott <nathans@sgi.com> wrote: > > There's four extra bytes on an 88 byte structure (with sector_t > CONFIG'd at 64 bits) oy, buffer_heads are 48 bytes - took a few months out of my life, that did. That's where it most counts - memory-constrained non-LBD 32-bit platforms. Making b_size size_t doesn't affect that, so.. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] change b_size to size_t 2006-02-23 1:59 ` Andrew Morton @ 2006-02-23 16:28 ` Badari Pulavarty 2006-02-23 16:32 ` Benjamin LaHaise 2006-02-23 16:40 ` Dave Kleikamp 0 siblings, 2 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-23 16:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Nathan Scott, christoph, mcao, lkml, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 572 bytes --] On Wed, 2006-02-22 at 17:59 -0800, Andrew Morton wrote: > Nathan Scott <nathans@sgi.com> wrote: > > > > There's four extra bytes on an 88 byte structure (with sector_t > > CONFIG'd at 64 bits) > > oy, buffer_heads are 48 bytes - took a few months out of my life, that did. > That's where it most counts - memory-constrained non-LBD 32-bit platforms. > > Making b_size size_t doesn't affect that, so.. Here is the updated version of the patch, which changes buffer_head.b_size to size_t to support mapping large amount of disk blocks (for large IOs). Thanks, Badari [-- Attachment #2: increase-bsize.patch --] [-- Type: text/x-patch, Size: 4537 bytes --] Increase the size of the buffer_head b_size field for 64 bit platforms. Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. Signed-off-by: Nathan Scott <nathans@sgi.com> Signed-off-by: Badari Pulavary <pbadari@us.ibm.com> Index: linux-2.6.16-rc4/include/linux/buffer_head.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 08:33:29.000000000 -0800 @@ -46,20 +46,23 @@ struct address_space; typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); /* - * Keep related fields in common cachelines. 
The most commonly accessed - * field (b_state) goes at the start so the compiler does not generate - * indexed addressing for it. + * Historically, a buffer_head was used to map a single block + * within a page, and of course as the unit of I/O through the + * filesystem and block layers. Nowadays the basic I/O unit + * is the bio, and buffer_heads are used for extracting block + * mappings (via a get_block_t call), for tracking state within + * a page (via a page_mapping) and for wrapping bio submission + * for backward compatibility reasons (e.g. submit_bh). */ struct buffer_head { - /* First cache line: */ unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ - atomic_t b_count; /* users using this block */ - u32 b_size; /* block size */ + atomic_t b_count; /* users using this buffer_head */ - sector_t b_blocknr; /* block number */ - char *b_data; /* pointer to data block */ + size_t b_size; /* size of mapping */ + sector_t b_blocknr; /* start block number */ + char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ Index: linux-2.6.16-rc4/fs/buffer.c =================================================================== --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:19:15.000000000 -0800 @@ -432,7 +432,7 @@ __find_get_block_slow(struct block_devic printk("__find_get_block_slow() failed. 
" "block=%llu, b_blocknr=%llu\n", (unsigned long long)block, (unsigned long long)bh->b_blocknr); - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); + printk("b_state=0x%08lx, b_size=%llu\n", bh->b_state, (unsigned long long)bh->b_size); printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); } out_unlock: Index: linux-2.6.16-rc4/fs/reiserfs/prints.c =================================================================== --- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:20:01.000000000 -0800 @@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu char b[BDEVNAME_SIZE]; sprintf(buf, - "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", - bdevname(bh->b_bdev, b), bh->b_size, + "dev %s, size %lld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", + bdevname(bh->b_bdev, b), (unsigned long long)bh->b_size, (unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)), bh->b_state, bh->b_page, buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE", Index: linux-2.6.16-rc4/fs/ocfs2/journal.c =================================================================== --- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:23:34.000000000 -0800 @@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo BUG_ON(!bh); BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED)); - mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n", + mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %llu\n", (unsigned long long)bh->b_blocknr, type, (type == OCFS2_JOURNAL_ACCESS_CREATE) ? "OCFS2_JOURNAL_ACCESS_CREATE" : "OCFS2_JOURNAL_ACCESS_WRITE", - bh->b_size); + (unsigned long long)bh->b_size); /* we can safely remove this assertion after testing. */ if (!buffer_uptodate(bh)) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 16:28 ` [PATCH] change b_size to size_t Badari Pulavarty @ 2006-02-23 16:32 ` Benjamin LaHaise 2006-02-23 17:20 ` Badari Pulavarty 2006-02-23 17:28 ` Badari Pulavarty 1 sibling, 2 replies; 32+ messages in thread From: Benjamin LaHaise @ 2006-02-23 16:32 UTC (permalink / raw) To: Badari Pulavarty Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote: > Here is the updated version of the patch, which changes > buffer_head.b_size to size_t to support mapping large > amount of disk blocks (for large IOs). Your patch doesn't seem to be inline, so I can't quote it. Several problems: on 64 bit platforms you introduced 4 bytes of padding into buffer_head. atomic_t only takes up 4 bytes, while size_t is 8-byte aligned. This is a waste of memory, imo, seeing as the vast majority of systems out there will not be doing 4GB+ IOs any time soon. Also, the cast to unsigned long long for size_t is pretty atrocious. Cast to unsigned long if anything, as size_t is unsigned long on all platforms linux runs on. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <dont@kvack.org>. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 16:32 ` Benjamin LaHaise @ 2006-02-23 17:20 ` Badari Pulavarty 2006-02-23 17:28 ` Badari Pulavarty 1 sibling, 0 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-23 17:20 UTC (permalink / raw) To: Benjamin LaHaise Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 6042 bytes --] On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote: > On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote: > > Here is the updated version of the patch, which changes > > buffer_head.b_size to size_t to support mapping large > > amount of disk blocks (for large IOs). > > Your patch doesn't seem to be inline, so I can't quote it. Several > problems: on 64 bit platforms you introduced 4 bytes of padding into > buffer_head. atomic_t only takes up 4 bytes, while size_t is 8-byte > aligned. I moved stuff around, but still see 96 bytes (an 8-byte increase) for the structure. What am I doing wrong ? Can you check ? > This is a waste of memory, imo, seeing as the vast majority > of systems out there will not be doing 4GB+ IOs any time soon. Yep. I agree. I was modifying get_block() to take b_size as the amount of disk mapping requested, so I can get rid of ->get_blocks() and also add support for mpage_readpages() and mpage_writepages() to deal with multiple blocks. XFS isn't requesting >4GB IOs yet, but is doing 4GB IOs through DIO today and planning to see bigger IOs in future. Since "buffer_head" is not a primary structure used for IO anymore, and this change affects only 64-bit machines, isn't it worth doing? > Also, the cast to unsigned long long for size_t is pretty atrocious. > Cast to unsigned long if anything, as size_t is unsigned long on all > platforms linux runs on. Okay. Changed it to (unsigned long) instead. I was just following sector_t :( Thanks, Badari Increase the size of the buffer_head b_size field for 64 bit platforms. 
Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. Signed-off-by: Nathan Scott <nathans@sgi.com> Signed-off-by: Badari Pulavary <pbadari@us.ibm.com> Index: linux-2.6.16-rc4/include/linux/buffer_head.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 09:00:06.000000000 -0800 @@ -46,20 +46,23 @@ struct address_space; typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); /* - * Keep related fields in common cachelines. The most commonly accessed - * field (b_state) goes at the start so the compiler does not generate - * indexed addressing for it. + * Historically, a buffer_head was used to map a single block + * within a page, and of course as the unit of I/O through the + * filesystem and block layers. Nowadays the basic I/O unit + * is the bio, and buffer_heads are used for extracting block + * mappings (via a get_block_t call), for tracking state within + * a page (via a page_mapping) and for wrapping bio submission + * for backward compatibility reasons (e.g. submit_bh). 
*/ struct buffer_head { - /* First cache line: */ unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ - atomic_t b_count; /* users using this block */ - u32 b_size; /* block size */ + char *b_data; /* pointer to data within the page */ - sector_t b_blocknr; /* block number */ - char *b_data; /* pointer to data block */ + size_t b_size; /* size of mapping */ + sector_t b_blocknr; /* start block number */ + atomic_t b_count; /* users using this buffer_head */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ Index: linux-2.6.16-rc4/fs/buffer.c =================================================================== --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:55:18.000000000 -0800 @@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic printk("__find_get_block_slow() failed. 
" "block=%llu, b_blocknr=%llu\n", (unsigned long long)block, (unsigned long long)bh->b_blocknr); - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); + printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state, + (unsigned long)bh->b_size); printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); } out_unlock: Index: linux-2.6.16-rc4/fs/reiserfs/prints.c =================================================================== --- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:56:17.000000000 -0800 @@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu char b[BDEVNAME_SIZE]; sprintf(buf, - "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", - bdevname(bh->b_bdev, b), bh->b_size, + "dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", + bdevname(bh->b_bdev, b), (unsigned long)bh->b_size, (unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)), bh->b_state, bh->b_page, buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE", Index: linux-2.6.16-rc4/fs/ocfs2/journal.c =================================================================== --- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:56:55.000000000 -0800 @@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo BUG_ON(!bh); BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED)); - mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n", + mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n", (unsigned long long)bh->b_blocknr, type, (type == OCFS2_JOURNAL_ACCESS_CREATE) ? "OCFS2_JOURNAL_ACCESS_CREATE" : "OCFS2_JOURNAL_ACCESS_WRITE", - bh->b_size); + (unsigned long)bh->b_size); /* we can safely remove this assertion after testing. 
*/ if (!buffer_uptodate(bh)) { [-- Attachment #2: increase-bsize.patch --] [-- Type: text/x-patch, Size: 4519 bytes --] Increase the size of the buffer_head b_size field for 64 bit platforms. Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. Signed-off-by: Nathan Scott <nathans@sgi.com> Signed-off-by: Badari Pulavary <pbadari@us.ibm.com> Index: linux-2.6.16-rc4/include/linux/buffer_head.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 09:00:06.000000000 -0800 @@ -46,20 +46,23 @@ struct address_space; typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); /* - * Keep related fields in common cachelines. The most commonly accessed - * field (b_state) goes at the start so the compiler does not generate - * indexed addressing for it. + * Historically, a buffer_head was used to map a single block + * within a page, and of course as the unit of I/O through the + * filesystem and block layers. Nowadays the basic I/O unit + * is the bio, and buffer_heads are used for extracting block + * mappings (via a get_block_t call), for tracking state within + * a page (via a page_mapping) and for wrapping bio submission + * for backward compatibility reasons (e.g. submit_bh). 
*/ struct buffer_head { - /* First cache line: */ unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ - atomic_t b_count; /* users using this block */ - u32 b_size; /* block size */ + char *b_data; /* pointer to data within the page */ - sector_t b_blocknr; /* block number */ - char *b_data; /* pointer to data block */ + size_t b_size; /* size of mapping */ + sector_t b_blocknr; /* start block number */ + atomic_t b_count; /* users using this buffer_head */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ Index: linux-2.6.16-rc4/fs/buffer.c =================================================================== --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:55:18.000000000 -0800 @@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic printk("__find_get_block_slow() failed. 
" "block=%llu, b_blocknr=%llu\n", (unsigned long long)block, (unsigned long long)bh->b_blocknr); - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); + printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state, + (unsigned long)bh->b_size); printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); } out_unlock: Index: linux-2.6.16-rc4/fs/reiserfs/prints.c =================================================================== --- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:56:17.000000000 -0800 @@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu char b[BDEVNAME_SIZE]; sprintf(buf, - "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", - bdevname(bh->b_bdev, b), bh->b_size, + "dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", + bdevname(bh->b_bdev, b), (unsigned long)bh->b_size, (unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)), bh->b_state, bh->b_page, buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE", Index: linux-2.6.16-rc4/fs/ocfs2/journal.c =================================================================== --- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:56:55.000000000 -0800 @@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo BUG_ON(!bh); BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED)); - mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n", + mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n", (unsigned long long)bh->b_blocknr, type, (type == OCFS2_JOURNAL_ACCESS_CREATE) ? "OCFS2_JOURNAL_ACCESS_CREATE" : "OCFS2_JOURNAL_ACCESS_WRITE", - bh->b_size); + (unsigned long)bh->b_size); /* we can safely remove this assertion after testing. */ if (!buffer_uptodate(bh)) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 16:32 ` Benjamin LaHaise 2006-02-23 17:20 ` Badari Pulavarty @ 2006-02-23 17:28 ` Badari Pulavarty 2006-02-23 17:29 ` Benjamin LaHaise 2006-02-23 17:40 ` Badari Pulavarty 1 sibling, 2 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-23 17:28 UTC (permalink / raw) To: Benjamin LaHaise Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote: > On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote: > > Here is the updated version of the patch, which changes > > buffer_head.b_size to size_t to support mapping large > > amount of disk blocks (for large IOs). > > Your patch doesn't seem to be inline, so I can't quote it. Several > problems: on 64 bit platforms you introduced 4 bytes of padding into > buffer_head. atomic_t only takes up 4 byte, while size_t is 8 byte > aligned. Ignore my previous mail. How about doing this ? Change b_state to u32 and change b_size to "size_t". This way, we don't increase the overall size of the structure on 64-bit machines. Isn't it ? Thanks, Badari struct buffer_head { u32 b_state; /* buffer state bitmap (see above) */ atomic_t b_count; /* users using this buffer_head */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ size_t b_size; /* size of mapping */ sector_t b_blocknr; /* start block number */ char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ }; Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 17:28 ` Badari Pulavarty @ 2006-02-23 17:29 ` Benjamin LaHaise 2006-02-23 18:46 ` Badari Pulavarty 2006-02-23 17:40 ` Badari Pulavarty 1 sibling, 1 reply; 32+ messages in thread From: Benjamin LaHaise @ 2006-02-23 17:29 UTC (permalink / raw) To: Badari Pulavarty Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote: > How about doing this ? Change b_state to u32 and change b_size > to "size_t". This way, we don't increase the overall size of > the structure on 64-bit machines. Isn't it ? I was thinking of that too, but that doesn't work with the bit operations on big endian machines (which require unsigned long). Oh well, I guess that even with adding an extra 8 bytes on x86-64 we're still at the 96 bytes, or 92 bytes if the atomic_t is put at the end of the struct. -ben -- "Ladies and gentlemen, I'm sorry to interrupt, but the police are here and they've asked us to stop the party." Don't Email: <dont@kvack.org>. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 17:29 ` Benjamin LaHaise @ 2006-02-23 18:46 ` Badari Pulavarty 0 siblings, 0 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-23 18:46 UTC (permalink / raw) To: Benjamin LaHaise Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, 2006-02-23 at 12:29 -0500, Benjamin LaHaise wrote: > On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote: > > How about doing this ? Change b_state to u32 and change b_size > > to "size_t". This way, we don't increase the overall size of > > the structure on 64-bit machines. Isn't it ? > > I was thinking of that too, but that doesn't work with the bit operations > on big endian machines (which require unsigned long). Oh well, I guess > that even with adding an extra 8 bytes on x86-64 we're still at the 96 > bytes, or 92 bytes if the atomic_t is put at the end of the struct. > > -ben Okay. Here is the final version (hopefully). I moved atomic_t to end of the structure. I still see 96-bytes for x86-64 also :( buffer_head 40425 40760 96 40 1 : tunables 120 60 8 : slabdata 1019 1019 0 Thanks, Badari Increase the size of the buffer_head b_size field for 64 bit platforms. Update some old and moldy comments in and around the structure as well. The b_size increase allows us to perform larger mappings and allocations for large I/O requests from userspace, which tie in with other changes allowing the get_block_t() interface to map multiple blocks at once. 
Signed-off-by: Nathan Scott <nathans@sgi.com> Signed-off-by: Badari Pulavary <pbadari@us.ibm.com> Index: linux-2.6.16-rc4/include/linux/buffer_head.h =================================================================== --- linux-2.6.16-rc4.orig/include/linux/buffer_head.h 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/include/linux/buffer_head.h 2006-02-23 10:15:33.000000000 -0800 @@ -46,25 +46,28 @@ struct address_space; typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate); /* - * Keep related fields in common cachelines. The most commonly accessed - * field (b_state) goes at the start so the compiler does not generate - * indexed addressing for it. + * Historically, a buffer_head was used to map a single block + * within a page, and of course as the unit of I/O through the + * filesystem and block layers. Nowadays the basic I/O unit + * is the bio, and buffer_heads are used for extracting block + * mappings (via a get_block_t call), for tracking state within + * a page (via a page_mapping) and for wrapping bio submission + * for backward compatibility reasons (e.g. submit_bh). 
*/ struct buffer_head { - /* First cache line: */ unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ - atomic_t b_count; /* users using this block */ - u32 b_size; /* block size */ - sector_t b_blocknr; /* block number */ - char *b_data; /* pointer to data block */ + sector_t b_blocknr; /* start block number */ + size_t b_size; /* size of mapping */ + char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ + atomic_t b_count; /* users using this buffer_head */ }; /* Index: linux-2.6.16-rc4/fs/buffer.c =================================================================== --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:55:18.000000000 -0800 @@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic printk("__find_get_block_slow() failed. 
" "block=%llu, b_blocknr=%llu\n", (unsigned long long)block, (unsigned long long)bh->b_blocknr); - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); + printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state, + (unsigned long)bh->b_size); printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); } out_unlock: Index: linux-2.6.16-rc4/fs/reiserfs/prints.c =================================================================== --- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/reiserfs/prints.c 2006-02-23 08:56:17.000000000 -0800 @@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu char b[BDEVNAME_SIZE]; sprintf(buf, - "dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", - bdevname(bh->b_bdev, b), bh->b_size, + "dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)", + bdevname(bh->b_bdev, b), (unsigned long)bh->b_size, (unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)), bh->b_state, bh->b_page, buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE", Index: linux-2.6.16-rc4/fs/ocfs2/journal.c =================================================================== --- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c 2006-02-17 14:23:45.000000000 -0800 +++ linux-2.6.16-rc4/fs/ocfs2/journal.c 2006-02-23 08:56:55.000000000 -0800 @@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo BUG_ON(!bh); BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED)); - mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n", + mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n", (unsigned long long)bh->b_blocknr, type, (type == OCFS2_JOURNAL_ACCESS_CREATE) ? "OCFS2_JOURNAL_ACCESS_CREATE" : "OCFS2_JOURNAL_ACCESS_WRITE", - bh->b_size); + (unsigned long)bh->b_size); /* we can safely remove this assertion after testing. */ if (!buffer_uptodate(bh)) { ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 17:28 ` Badari Pulavarty 2006-02-23 17:29 ` Benjamin LaHaise @ 2006-02-23 17:40 ` Badari Pulavarty 1 sibling, 0 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-23 17:40 UTC (permalink / raw) To: Benjamin LaHaise Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, 2006-02-23 at 09:28 -0800, Badari Pulavarty wrote: > On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote: > > On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote: > > > Here is the updated version of the patch, which changes > > > buffer_head.b_size to size_t to support mapping large > > > amount of disk blocks (for large IOs). > > > > Your patch doesn't seem to be inline, so I can't quote it. Several > > problems: on 64 bit platforms you introduced 4 bytes of padding into > > buffer_head. atomic_t only takes up 4 byte, while size_t is 8 byte > > aligned. > > > Ignore my previous mail. > > How about doing this ? Change b_state to u32 and change b_size > to "size_t". This way, we don't increase the overall size of > the structure on 64-bit machines. Isn't it ? I hate to correct myself again. But this won't work either. If we do this, we can use bit_spin_lock() helpers any more to do bh_state manipulation :( Yes. Bottom line is, we would increase the size of the structure by 8-bytes on 64-bit machines. I don't see any way out of it. But this would provide ability to let the filesystems know that the we are dealing with large (> 4GB) of IOs (may be they can allocated as much as possible contiguously), even if we don't really do that big IOs. Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] change b_size to size_t 2006-02-23 16:28 ` [PATCH] change b_size to size_t Badari Pulavarty 2006-02-23 16:32 ` Benjamin LaHaise @ 2006-02-23 16:40 ` Dave Kleikamp 1 sibling, 0 replies; 32+ messages in thread From: Dave Kleikamp @ 2006-02-23 16:40 UTC (permalink / raw) To: Badari Pulavarty Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel On Thu, 2006-02-23 at 08:28 -0800, Badari Pulavarty wrote: > Index: linux-2.6.16-rc4/fs/buffer.c > =================================================================== > --- linux-2.6.16-rc4.orig/fs/buffer.c 2006-02-17 14:23:45.000000000 -0800 > +++ linux-2.6.16-rc4/fs/buffer.c 2006-02-23 08:19:15.000000000 -0800 > @@ -432,7 +432,7 @@ __find_get_block_slow(struct block_devic > printk("__find_get_block_slow() failed. " > "block=%llu, b_blocknr=%llu\n", > (unsigned long long)block, (unsigned long long)bh->b_blocknr); > - printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size); > + printk("b_state=0x%08lx, b_size=%llu\n", bh->b_state, (unsigned long long)bh->b_size); Wrapping at 80 columns, it looks right, but you've got a long line in here. > printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits); > } > -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 16:58 ` Badari Pulavarty 2006-02-22 16:59 ` christoph @ 2006-02-22 17:23 ` Peter Staubach 2006-02-22 18:37 ` Dave Kleikamp 2 siblings, 0 replies; 32+ messages in thread From: Peter Staubach @ 2006-02-22 17:23 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam Badari Pulavarty wrote: > >Thanks. Only current issue Nathan raised is, he wants to see >b_size change to u64 (instead of u32) to support really-huge-IO >requests. Does this sound reasonable to you ? > I believe that the various read and write system calls impose a 32 bit limit on the size of the i/o request, so you may need to look at that too. Increasing the size of b_size would address a bug which exists when trying to read more than 4GB from a block device which is greater than 4GB in size. Currently, that bug is masked because the maximum system call size is arbitrarily limited. Thanx... ps ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 16:58 ` Badari Pulavarty 2006-02-22 16:59 ` christoph 2006-02-22 17:23 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach @ 2006-02-22 18:37 ` Dave Kleikamp 2006-02-22 19:00 ` Badari Pulavarty 2 siblings, 1 reply; 32+ messages in thread From: Dave Kleikamp @ 2006-02-22 18:37 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote: > Thanks. Only current issue Nathan raised is, he wants to see > b_size change to u64 (instead of u32) to support really-huge-IO > requests. Does this sound reasonable to you ? Didn't someone point out that size_t would make more sense? There's no reason for a 32-bit architecture to have a 64-bit b_size. > Thanks, > Badari Shaggy -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 18:37 ` Dave Kleikamp @ 2006-02-22 19:00 ` Badari Pulavarty 0 siblings, 0 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-02-22 19:00 UTC (permalink / raw) To: Dave Kleikamp; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam Dave Kleikamp wrote: > On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote: > >>Thanks. Only current issue Nathan raised is, he wants to see >>b_size change to u64 (instead of u32) to support really-huge-IO >>requests. Does this sound reasonable to you ? > > > Didn't someone point out that size_t would make more sense? There's no > reason for a 32-bit architecture to have a 64-bit b_size. Yes. I meant to say size_t. - Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-22 15:12 ` christoph 2006-02-22 16:58 ` Badari Pulavarty @ 2006-02-24 17:19 ` Badari Pulavarty 2006-03-06 10:03 ` Suparna Bhattacharya 1 sibling, 1 reply; 32+ messages in thread From: Badari Pulavarty @ 2006-02-24 17:19 UTC (permalink / raw) To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel, vs, zam On Wed, 2006-02-22 at 16:12 +0100, christoph wrote: .. > Thanks Badari, with that interface changes the mpage_readpage changes > look a lot nicer than my original version. I'd like to second > the request to put it into -mm. > > And if the namesys folks could try out whether this works for their > reiser4 requirements it'd be nice. If you have an even faster > ->readpages I'd be interested in that secrete souce receipe for > further improvement to mpage_readpages. I don't have any secret receipes, but I was thinking of re-organizing the code a little. Complexity is due to "confused" case and "blocksize < pagesize" cases + going in-and-out of the worker routine with stored state. I am thinking of having a "fast path" which doesn't deal with any of those and "slow" path to deal with all that non-sense. Something like .. mpage_readpages() { if (block-size < page-size) slow_path; while (nr-pages) { if (get_block(bh)) slow_path; if (uptodate(bh)) slow_path; while (bh.b_size) { if (not contig) submit bio(); add all the pages we can to bio(); bh.b_size -= size-of-pages-added; nr_pages -= count-of-pages-added; } } } slow_path is going to be slow & ugly. How important is to handle 1k, 2k filesystems efficiently ? Should I try ? Thanks, Badari ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-02-24 17:19 ` Badari Pulavarty @ 2006-03-06 10:03 ` Suparna Bhattacharya 2006-03-06 22:39 ` Nathan Scott 0 siblings, 1 reply; 32+ messages in thread From: Suparna Bhattacharya @ 2006-03-06 10:03 UTC (permalink / raw) To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote: > On Wed, 2006-02-22 at 16:12 +0100, christoph wrote: > .. > > Thanks Badari, with that interface changes the mpage_readpage changes > > look a lot nicer than my original version. I'd like to second > > the request to put it into -mm. > > > > And if the namesys folks could try out whether this works for their > > reiser4 requirements it'd be nice. If you have an even faster > > ->readpages I'd be interested in that secrete souce receipe for > > further improvement to mpage_readpages. > > I don't have any secret receipes, but I was thinking of re-organizing > the code a little. Complexity is due to "confused" case and > "blocksize < pagesize" cases + going in-and-out of the worker routine > with stored state. > > I am thinking of having a "fast path" which doesn't deal with any > of those and "slow" path to deal with all that non-sense. > Something like .. > > mpage_readpages() > { > if (block-size < page-size) > slow_path; > > while (nr-pages) { > if (get_block(bh)) > slow_path; > if (uptodate(bh)) > slow_path; > while (bh.b_size) { > if (not contig) > submit bio(); > add all the pages we can to bio(); > bh.b_size -= size-of-pages-added; > nr_pages -= count-of-pages-added; > } > > } > } > > slow_path is going to be slow & ugly. How important is to handle > 1k, 2k filesystems efficiently ? Should I try ? With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems as well, not sure how likely that would be ? 
Regards Suparna -- Suparna Bhattacharya (suparna@in.ibm.com) Linux Technology Center IBM Software Lab, India ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-03-06 10:03 ` Suparna Bhattacharya @ 2006-03-06 22:39 ` Nathan Scott 2006-03-07 9:00 ` Badari Pulavarty 0 siblings, 1 reply; 32+ messages in thread From: Nathan Scott @ 2006-03-06 22:39 UTC (permalink / raw) To: Suparna Bhattacharya Cc: Badari Pulavarty, christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote: > On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote: > > > > I am thinking of having a "fast path" which doesn't deal with any > > of those and "slow" path to deal with all that non-sense. > > ... > > slow_path is going to be slow & ugly. How important is to handle > > 1k, 2k filesystems efficiently ? Should I try ? > > With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems > as well, not sure how likely that would be ? A number of architectures have a pagesize greater than 4K. Most (OK, sample size of 2) mkfs programs default to using 4K blocksizes. So, any/all platforms not having a 4K pagesize will be disadvantaged. Search on the definition of PAGE_SHIFT in asm-*/page.h and for all platforms where its not defined to 12, this will hurt. cheers. -- Nathan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() 2006-03-06 22:39 ` Nathan Scott @ 2006-03-07 9:00 ` Badari Pulavarty 0 siblings, 0 replies; 32+ messages in thread From: Badari Pulavarty @ 2006-03-07 9:00 UTC (permalink / raw) To: Nathan Scott Cc: Suparna Bhattacharya, christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam Nathan Scott wrote: >On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote: > >>On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote: >> >>>I am thinking of having a "fast path" which doesn't deal with any >>>of those and "slow" path to deal with all that non-sense. >>>... >>>slow_path is going to be slow & ugly. How important is to handle >>>1k, 2k filesystems efficiently ? Should I try ? >>> >>With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems >>as well, not sure how likely that would be ? >> > >A number of architectures have a pagesize greater than 4K. Most >(OK, sample size of 2) mkfs programs default to using 4K blocksizes. >So, any/all platforms not having a 4K pagesize will be disadvantaged. >Search on the definition of PAGE_SHIFT in asm-*/page.h and for all >platforms where its not defined to 12, this will hurt. > I agree, I haven't made up my mind either on if its really worth doing. I was hoping that it will help simple cases + it will help filesystems which use page->private for something other than buffer heads. Thanks, Badari > ^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2006-03-07 1:01 UTC | newest] Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty 2006-02-20 21:23 ` [PATCH 1/3] pass b_size to ->get_block() Badari Pulavarty 2006-02-20 21:23 ` [PATCH 2/3] map multiple blocks for mpage_readpages() Badari Pulavarty 2006-02-23 3:29 ` Suparna Bhattacharya 2006-02-23 22:18 ` Badari Pulavarty 2006-02-20 21:24 ` [PATCH 3/3] remove ->get_blocks() support Badari Pulavarty 2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott 2006-02-20 23:06 ` Badari Pulavarty 2006-02-20 23:16 ` Nathan Scott 2006-02-21 2:41 ` Jeremy Higdon 2006-02-21 16:03 ` Badari Pulavarty 2006-02-21 21:39 ` Nathan Scott 2006-02-22 15:12 ` christoph 2006-02-22 16:58 ` Badari Pulavarty 2006-02-22 16:59 ` christoph 2006-02-23 1:40 ` Nathan Scott 2006-02-23 1:59 ` Andrew Morton 2006-02-23 16:28 ` [PATCH] change b_size to size_t Badari Pulavarty 2006-02-23 16:32 ` Benjamin LaHaise 2006-02-23 17:20 ` Badari Pulavarty 2006-02-23 17:28 ` Badari Pulavarty 2006-02-23 17:29 ` Benjamin LaHaise 2006-02-23 18:46 ` Badari Pulavarty 2006-02-23 17:40 ` Badari Pulavarty 2006-02-23 16:40 ` Dave Kleikamp 2006-02-22 17:23 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach 2006-02-22 18:37 ` Dave Kleikamp 2006-02-22 19:00 ` Badari Pulavarty 2006-02-24 17:19 ` Badari Pulavarty 2006-03-06 10:03 ` Suparna Bhattacharya 2006-03-06 22:39 ` Nathan Scott 2006-03-07 9:00 ` Badari Pulavarty