* [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
@ 2006-02-20 21:21 Badari Pulavarty
  2006-02-20 21:23 ` [PATCH 1/3] pass b_size to ->get_block() Badari Pulavarty
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-20 21:21 UTC (permalink / raw)
  To: christoph, mcao, akpm; +Cc: lkml, linux-fsdevel, pbadari

Hi,

The following patches add support for mapping multiple blocks in ->get_block().
This will allow us to handle mapping of multiple disk blocks for
mpage_readpages(), mpage_writepages(), etc. Instead of adding a new
argument, I use "b_size" to indicate the amount of disk mapping needed
for get_block(). On success, get_block() also indicates the amount of
disk mapping it actually did.
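
To make the convention concrete, here is a minimal caller-side sketch
(not taken from the patches; inode, block_in_file, nr_blocks and
get_block are assumed to be in scope):

	struct buffer_head bh = { .b_state = 0 };
	unsigned long nr_mapped;
	int err;

	/* ask for up to nr_blocks worth of mapping, expressed in bytes */
	bh.b_size = nr_blocks << inode->i_blkbits;
	err = get_block(inode, block_in_file, &bh, 0);
	if (!err && buffer_mapped(&bh)) {
		/* on success, b_size holds the bytes actually mapped,
		 * starting at block bh.b_blocknr */
		nr_mapped = bh.b_size >> inode->i_blkbits;
	}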

Now that get_block() can handle multiple blocks, there is no need
for ->get_blocks(), which was added for DIO.

[PATCH 1/3] pass b_size to ->get_block()

[PATCH 2/3] map multiple blocks for mpage_readpages()

[PATCH 3/3] remove ->get_blocks() support

I noticed decent improvements (reduced sys time) on JFS, XFS and ext3
on simple "dd" read tests.
	
         (rc3.mm1)	(rc3.mm1 + patches)
real    0m18.814s	0m18.482s
user    0m0.000s	0m0.004s
sys     0m3.240s	0m2.912s

Andrew, could you include it in the -mm tree?

Comments?

Thanks,
Badari




* [PATCH 1/3] pass b_size to ->get_block()
  2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
@ 2006-02-20 21:23 ` Badari Pulavarty
  2006-02-20 21:23 ` [PATCH 2/3] map multiple blocks for mpage_readpages() Badari Pulavarty
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-20 21:23 UTC (permalink / raw)
  To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 211 bytes --]

[PATCH 1/3] pass b_size to ->get_block()

Pass the amount of disk mapping needed to get_block().
This way the filesystem ->get_block() functions can be
modified to map multiple blocks at the same time.
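
As a rough sketch of what the filesystem side then looks like (hypothetical
samplefs_* names; the real conversions are done in patches 2 and 3):

static int samplefs_get_block(struct inode *inode, sector_t iblock,
			      struct buffer_head *bh_result, int create)
{
	/* the caller now says, via b_size, how much mapping it wants */
	unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;

	return samplefs_get_blocks(inode, iblock, max_blocks,
				   bh_result, create);
}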

Thanks,
Badari



[-- Attachment #2: getblock-take-size.patch --]
[-- Type: text/x-patch, Size: 3339 bytes --]

Pass the amount of disk mapping needed to get_block().
This way the filesystem ->get_block() functions can be
modified to map multiple blocks at the same time.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/buffer.c                 |    6 ++++++
 fs/mpage.c                  |    2 ++
 include/linux/buffer_head.h |    1 +
 3 files changed, 9 insertions(+)

Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-20 09:37:58.000000000 -0800
@@ -1785,6 +1785,7 @@ static int __block_write_full_page(struc
 			clear_buffer_dirty(bh);
 			set_buffer_uptodate(bh);
 		} else if (!buffer_mapped(bh) && buffer_dirty(bh)) {
+			bh->b_size = 1 << inode->i_blkbits;
 			err = get_block(inode, block, bh, 1);
 			if (err)
 				goto recover;
@@ -1938,6 +1939,7 @@ static int __block_prepare_write(struct 
 		if (buffer_new(bh))
 			clear_buffer_new(bh);
 		if (!buffer_mapped(bh)) {
+			bh->b_size = 1 << inode->i_blkbits;
 			err = get_block(inode, block, bh, 1);
 			if (err)
 				break;
@@ -2093,6 +2095,7 @@ int block_read_full_page(struct page *pa
 
 			fully_mapped = 0;
 			if (iblock < lblock) {
+				bh->b_size = 1 << inode->i_blkbits;
 				err = get_block(inode, iblock, bh, 0);
 				if (err)
 					SetPageError(page);
@@ -2414,6 +2417,7 @@ int nobh_prepare_write(struct page *page
 		create = 1;
 		if (block_start >= to)
 			create = 0;
+		map_bh.b_size = 1 << blkbits;
 		ret = get_block(inode, block_in_file + block_in_page,
 					&map_bh, create);
 		if (ret)
@@ -2674,6 +2678,7 @@ int block_truncate_page(struct address_s
 
 	err = 0;
 	if (!buffer_mapped(bh)) {
+		bh->b_size = 1 << inode->i_blkbits;
 		err = get_block(inode, iblock, bh, 0);
 		if (err)
 			goto unlock;
@@ -2760,6 +2765,7 @@ sector_t generic_block_bmap(struct addre
 	struct inode *inode = mapping->host;
 	tmp.b_state = 0;
 	tmp.b_blocknr = 0;
+	tmp.b_size = 1 << inode->i_blkbits;
 	get_block(inode, block, &tmp, 0);
 	return tmp.b_blocknr;
 }
Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-20 09:39:20.000000000 -0800
@@ -277,6 +277,7 @@ map_bh(struct buffer_head *bh, struct su
 	set_buffer_mapped(bh);
 	bh->b_bdev = sb->s_bdev;
 	bh->b_blocknr = block;
+	bh->b_size =  sb->s_blocksize;
 }
 
 /*
Index: linux-2.6.16-rc4/fs/mpage.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/mpage.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/mpage.c	2006-02-20 09:45:37.000000000 -0800
@@ -192,6 +192,7 @@ do_mpage_readpage(struct bio *bio, struc
 				page_block++, block_in_file++) {
 		bh.b_state = 0;
 		if (block_in_file < last_block) {
+			bh.b_size = blocksize;
 			if (get_block(inode, block_in_file, &bh, 0))
 				goto confused;
 		}
@@ -472,6 +473,7 @@ __mpage_writepage(struct bio *bio, struc
 	for (page_block = 0; page_block < blocks_per_page; ) {
 
 		map_bh.b_state = 0;
+		map_bh.b_size = 1 << blkbits;
 		if (get_block(inode, block_in_file, &map_bh, 1))
 			goto confused;
 		if (buffer_new(&map_bh))


* [PATCH 2/3] map multiple blocks for mpage_readpages()
  2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
  2006-02-20 21:23 ` [PATCH 1/3] pass b_size to ->get_block() Badari Pulavarty
@ 2006-02-20 21:23 ` Badari Pulavarty
  2006-02-23  3:29   ` Suparna Bhattacharya
  2006-02-20 21:24 ` [PATCH 3/3] remove ->get_blocks() support Badari Pulavarty
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-20 21:23 UTC (permalink / raw)
  To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

[PATCH 2/3] map multiple blocks for mpage_readpages()

This patch changes mpage_readpages() and get_block() to
get the disk mapping information for multiple blocks at the
same time.

On entry, b_size represents the amount of disk mapping that needs
to be mapped. On a successful get_block() call, b_size indicates the
amount of disk mapping that was actually mapped.  Only filesystems
that care to use this information and can provide multiple
disk blocks at a time need to do so.

No changes are needed for filesystems that want to ignore this.
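
As an illustration of the filesystem side (a hedged sketch with hypothetical
myfs_* names, assuming a helper that returns the length of a contiguous
on-disk extent; create handling is omitted):

static int myfs_get_block(struct inode *inode, sector_t iblock,
			  struct buffer_head *bh_result, int create)
{
	unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;
	unsigned long nblocks;
	sector_t phys;

	/* find at most max_blocks contiguous blocks starting at iblock */
	nblocks = myfs_lookup_extent(inode, iblock, max_blocks, &phys);
	if (!nblocks)
		return 0;	/* hole: leave the buffer unmapped */

	map_bh(bh_result, inode->i_sb, phys);
	/* report back how much of the request was actually mapped */
	bh_result->b_size = nblocks << inode->i_blkbits;
	return 0;
}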

Thanks,
Badari



[-- Attachment #2: getblocks-readpages.patch --]
[-- Type: text/x-patch, Size: 7570 bytes --]

This patch changes mpage_readpages() and get_block() to
get the disk mapping information for multiple blocks at the
same time.

On entry, b_size represents the amount of disk mapping that needs
to be mapped. On a successful get_block() call, b_size indicates the
amount of disk mapping that was actually mapped.  Only filesystems
that care to use this information and can provide multiple
disk blocks at a time need to do so.

No changes are needed for filesystems that want to ignore this.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/jfs/inode.c              |    3 -
 fs/mpage.c                  |   98 +++++++++++++++++++++++++++++++++++---------
 fs/xfs/linux-2.6/xfs_aops.c |    4 -
 3 files changed, 83 insertions(+), 22 deletions(-)

Index: linux-2.6.16-rc4/fs/mpage.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/mpage.c	2006-02-20 09:45:37.000000000 -0800
+++ linux-2.6.16-rc4/fs/mpage.c	2006-02-20 09:51:24.000000000 -0800
@@ -165,7 +165,9 @@ map_buffer_to_page(struct page *page, st
 
 static struct bio *
 do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
-			sector_t *last_block_in_bio, get_block_t get_block)
+			sector_t *last_block_in_bio, struct buffer_head *map_bh,
+			unsigned long *first_logical_block, int *map_valid,
+			get_block_t get_block)
 {
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
@@ -173,34 +175,72 @@ do_mpage_readpage(struct bio *bio, struc
 	const unsigned blocksize = 1 << blkbits;
 	sector_t block_in_file;
 	sector_t last_block;
+	sector_t last_block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
 	unsigned page_block;
 	unsigned first_hole = blocks_per_page;
 	struct block_device *bdev = NULL;
-	struct buffer_head bh;
 	int length;
 	int fully_mapped = 1;
+	unsigned nblocks, i;
 
 	if (page_has_buffers(page))
 		goto confused;
 
 	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
-	last_block = (i_size_read(inode) + blocksize - 1) >> blkbits;
+	last_block = block_in_file +  (nr_pages * blocks_per_page);
+	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
+	if (last_block > last_block_in_file)
+		last_block = last_block_in_file;
+	page_block = 0;
+
+	/*
+	 * Map blocks using the result from the last get_blocks call first.
+	 */
+	nblocks = map_bh->b_size >> blkbits;
+	if (*map_valid &&
+	    block_in_file > *first_logical_block &&
+	    block_in_file < (*first_logical_block + nblocks)) {
+		unsigned map_offset = block_in_file - *first_logical_block;
+		unsigned last = nblocks - map_offset;
+
+		for (i = 0; ; i++) {
+			if (i == last) {
+				*map_valid = 0;
+				break;
+			}
+			if (page_block == blocks_per_page)
+				break;
+			blocks[page_block] = map_bh->b_blocknr + map_offset + i;
+			page_block++;
+			block_in_file++;
+		}
+		bdev = map_bh->b_bdev;
+	}
+
+	/*
+	 * Then do more get_blocks calls until we are done with this page.
+	 */
+	map_bh->b_page = page;
+	while (page_block < blocks_per_page) {
+		map_bh->b_state = 0;
+		map_bh->b_size = 0;
 
-	bh.b_page = page;
-	for (page_block = 0; page_block < blocks_per_page;
-				page_block++, block_in_file++) {
-		bh.b_state = 0;
 		if (block_in_file < last_block) {
-			bh.b_size = blocksize;
-			if (get_block(inode, block_in_file, &bh, 0))
+			map_bh->b_size = (last_block - block_in_file) << blkbits;
+			if (get_block(inode, block_in_file, map_bh, 0))
 				goto confused;
+			*first_logical_block = block_in_file;
+			*map_valid  = 1;
 		}
 
-		if (!buffer_mapped(&bh)) {
+		if (!buffer_mapped(map_bh)) {
 			fully_mapped = 0;
 			if (first_hole == blocks_per_page)
 				first_hole = page_block;
+			page_block++;
+			block_in_file++;
+			*map_valid = 0;
 			continue;
 		}
 
@@ -210,8 +250,8 @@ do_mpage_readpage(struct bio *bio, struc
 		 * we just collected from get_block into the page's buffers
 		 * so readpage doesn't have to repeat the get_block call
 		 */
-		if (buffer_uptodate(&bh)) {
-			map_buffer_to_page(page, &bh, page_block);
+		if (buffer_uptodate(map_bh)) {
+			map_buffer_to_page(page, map_bh, page_block);
 			goto confused;
 		}
 	
@@ -219,10 +259,20 @@ do_mpage_readpage(struct bio *bio, struc
 			goto confused;		/* hole -> non-hole */
 
 		/* Contiguous blocks? */
-		if (page_block && blocks[page_block-1] != bh.b_blocknr-1)
+		if (page_block && blocks[page_block-1] != map_bh->b_blocknr-1)
 			goto confused;
-		blocks[page_block] = bh.b_blocknr;
-		bdev = bh.b_bdev;
+		nblocks = map_bh->b_size >> blkbits;
+		for (i = 0; ; i++) {
+			if (i == nblocks) {
+				*map_valid = 0;
+				break;
+			} else if (page_block == blocks_per_page)
+				break;
+			blocks[page_block] = map_bh->b_blocknr + i;
+			page_block++;
+			block_in_file++;
+		}
+		bdev = map_bh->b_bdev;
 	}
 
 	if (first_hole != blocks_per_page) {
@@ -261,7 +311,7 @@ alloc_new:
 		goto alloc_new;
 	}
 
-	if (buffer_boundary(&bh) || (first_hole != blocks_per_page))
+	if (buffer_boundary(map_bh) || (first_hole != blocks_per_page))
 		bio = mpage_bio_submit(READ, bio);
 	else
 		*last_block_in_bio = blocks[blocks_per_page - 1];
@@ -332,6 +382,9 @@ mpage_readpages(struct address_space *ma
 	unsigned page_idx;
 	sector_t last_block_in_bio = 0;
 	struct pagevec lru_pvec;
+	struct buffer_head map_bh;
+	unsigned long first_logical_block = 0;
+	int map_valid = 0;
 
 	pagevec_init(&lru_pvec, 0);
 	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
@@ -343,7 +396,9 @@ mpage_readpages(struct address_space *ma
 					page->index, GFP_KERNEL)) {
 			bio = do_mpage_readpage(bio, page,
 					nr_pages - page_idx,
-					&last_block_in_bio, get_block);
+					&last_block_in_bio, &map_bh,
+					&first_logical_block,
+					&map_valid, get_block);
 			if (!pagevec_add(&lru_pvec, page))
 				__pagevec_lru_add(&lru_pvec);
 		} else {
@@ -365,9 +420,14 @@ int mpage_readpage(struct page *page, ge
 {
 	struct bio *bio = NULL;
 	sector_t last_block_in_bio = 0;
+	struct buffer_head map_bh;
+	unsigned long first_logical_block = 0;
+	int map_valid = 0;
+
 
-	bio = do_mpage_readpage(bio, page, 1,
-			&last_block_in_bio, get_block);
+	bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
+			&map_bh, &first_logical_block, &map_valid,
+			get_block);
 	if (bio)
 		mpage_bio_submit(READ, bio);
 	return 0;
Index: linux-2.6.16-rc4/fs/jfs/inode.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/jfs/inode.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/jfs/inode.c	2006-02-20 09:48:33.000000000 -0800
@@ -257,7 +257,8 @@ jfs_get_blocks(struct inode *ip, sector_
 static int jfs_get_block(struct inode *ip, sector_t lblock,
 			 struct buffer_head *bh_result, int create)
 {
-	return jfs_get_blocks(ip, lblock, 1, bh_result, create);
+	return jfs_get_blocks(ip, lblock, bh_result->b_size >> ip->i_blkbits,
+			bh_result, create);
 }
 
 static int jfs_writepage(struct page *page, struct writeback_control *wbc)
Index: linux-2.6.16-rc4/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/xfs/linux-2.6/xfs_aops.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/xfs/linux-2.6/xfs_aops.c	2006-02-20 09:48:33.000000000 -0800
@@ -1136,8 +1136,8 @@ linvfs_get_block(
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __linvfs_get_block(inode, iblock, 0, bh_result,
-					create, 0, BMAPI_WRITE);
+	return __linvfs_get_block(inode, iblock, bh_result->b_size >> inode->i_blkbits,
+				bh_result, create, 0, BMAPI_WRITE);
 }
 
 STATIC int


* [PATCH 3/3] remove ->get_blocks() support
  2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
  2006-02-20 21:23 ` [PATCH 1/3] pass b_size to ->get_block() Badari Pulavarty
  2006-02-20 21:23 ` [PATCH 2/3] map multiple blocks for mpage_readpages() Badari Pulavarty
@ 2006-02-20 21:24 ` Badari Pulavarty
  2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott
  2006-02-22 15:12 ` christoph
  4 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-20 21:24 UTC (permalink / raw)
  To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 261 bytes --]

[PATCH 3/3] remove ->get_blocks() support


Now that get_block() can handle mapping multiple disk blocks,
there is no need for ->get_blocks(). This patch removes the
fs-specific ->get_blocks() variants added for DIO and makes
their users use get_block() instead.


Thanks,
Badari



[-- Attachment #2: remove-getblocks.patch --]
[-- Type: text/x-patch, Size: 16572 bytes --]

Now that get_block() can handle mapping multiple disk blocks,
there is no need for ->get_blocks(). This patch removes the
fs-specific ->get_blocks() variants added for DIO and makes
their users use get_block() instead.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

 fs/block_dev.c              |    3 ++-
 fs/direct-io.c              |   27 ++++++++++++++-------------
 fs/ext2/inode.c             |   14 +-------------
 fs/ext3/inode.c             |    3 +--
 fs/fat/inode.c              |    2 +-
 fs/hfs/inode.c              |   13 +------------
 fs/hfsplus/inode.c          |   13 +------------
 fs/jfs/inode.c              |    2 +-
 fs/ocfs2/aops.c             |    2 +-
 fs/reiserfs/inode.c         |    1 -
 fs/xfs/linux-2.6/xfs_aops.c |    5 ++---
 include/linux/fs.h          |   17 +++++++----------
 12 files changed, 32 insertions(+), 70 deletions(-)

Index: linux-2.6.16-rc3/fs/direct-io.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/direct-io.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/direct-io.c	2006-02-14 15:57:27.000000000 -0800
@@ -86,12 +86,12 @@ struct dio {
 	unsigned first_block_in_page;	/* doesn't change, Used only once */
 	int boundary;			/* prev block is at a boundary */
 	int reap_counter;		/* rate limit reaping */
-	get_blocks_t *get_blocks;	/* block mapping function */
+	get_block_t *get_block;		/* block mapping function */
 	dio_iodone_t *end_io;		/* IO completion function */
 	sector_t final_block_in_bio;	/* current final block in bio + 1 */
 	sector_t next_block_for_io;	/* next block to be put under IO,
 					   in dio_blocks units */
-	struct buffer_head map_bh;	/* last get_blocks() result */
+	struct buffer_head map_bh;	/* last get_block() result */
 
 	/*
 	 * Deferred addition of a page to the dio.  These variables are
@@ -210,9 +210,9 @@ static struct page *dio_get_page(struct 
 
 /*
  * Called when all DIO BIO I/O has been completed - let the filesystem
- * know, if it registered an interest earlier via get_blocks.  Pass the
+ * know, if it registered an interest earlier via get_block.  Pass the
  * private field of the map buffer_head so that filesystems can use it
- * to hold additional state between get_blocks calls and dio_complete.
+ * to hold additional state between get_block calls and dio_complete.
  */
 static void dio_complete(struct dio *dio, loff_t offset, ssize_t bytes)
 {
@@ -488,7 +488,7 @@ static int dio_bio_reap(struct dio *dio)
  * The fs is allowed to map lots of blocks at once.  If it wants to do that,
  * it uses the passed inode-relative block number as the file offset, as usual.
  *
- * get_blocks() is passed the number of i_blkbits-sized blocks which direct_io
+ * get_block() is passed the number of i_blkbits-sized blocks which direct_io
  * has remaining to do.  The fs should not map more than this number of blocks.
  *
  * If the fs has mapped a lot of blocks, it should populate bh->b_size to
@@ -501,7 +501,7 @@ static int dio_bio_reap(struct dio *dio)
  * In the case of filesystem holes: the fs may return an arbitrarily-large
  * hole by returning an appropriate value in b_size and by clearing
  * buffer_mapped().  However the direct-io code will only process holes one
- * block at a time - it will repeatedly call get_blocks() as it walks the hole.
+ * block at a time - it will repeatedly call get_block() as it walks the hole.
  */
 static int get_more_blocks(struct dio *dio)
 {
@@ -543,7 +543,8 @@ static int get_more_blocks(struct dio *d
 		 * at a higher level for inside-i_size block-instantiating
 		 * writes.
 		 */
-		ret = (*dio->get_blocks)(dio->inode, fs_startblk, fs_count,
+		map_bh->b_size = fs_count << dio->blkbits;
+		ret = (*dio->get_block)(dio->inode, fs_startblk,
 						map_bh, create);
 	}
 	return ret;
@@ -778,11 +779,11 @@ static void dio_zero_block(struct dio *d
  * happily perform page-sized but 512-byte aligned IOs.  It is important that
  * blockdev IO be able to have fine alignment and large sizes.
  *
- * So what we do is to permit the ->get_blocks function to populate bh.b_size
+ * So what we do is to permit the ->get_block function to populate bh.b_size
  * with the size of IO which is permitted at this offset and this i_blkbits.
  *
  * For best results, the blockdev should be set up with 512-byte i_blkbits and
- * it should set b_size to PAGE_SIZE or more inside get_blocks().  This gives
+ * it should set b_size to PAGE_SIZE or more inside get_block().  This gives
  * fine alignment but still allows this function to work in PAGE_SIZE units.
  */
 static int do_direct_IO(struct dio *dio)
@@ -942,7 +943,7 @@ out:
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
-	unsigned blkbits, get_blocks_t get_blocks, dio_iodone_t end_io,
+	unsigned blkbits, get_block_t get_block, dio_iodone_t end_io,
 	struct dio *dio)
 {
 	unsigned long user_addr; 
@@ -964,7 +965,7 @@ direct_io_worker(int rw, struct kiocb *i
 
 	dio->boundary = 0;
 	dio->reap_counter = 0;
-	dio->get_blocks = get_blocks;
+	dio->get_block = get_block;
 	dio->end_io = end_io;
 	dio->map_bh.b_private = NULL;
 	dio->final_block_in_bio = -1;
@@ -1170,7 +1171,7 @@ direct_io_worker(int rw, struct kiocb *i
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
-	unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io,
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
 	int dio_lock_type)
 {
 	int seg;
@@ -1266,7 +1267,7 @@ __blockdev_direct_IO(int rw, struct kioc
 		(end > i_size_read(inode)));
 
 	retval = direct_io_worker(rw, iocb, inode, iov, offset,
-				nr_segs, blkbits, get_blocks, end_io, dio);
+				nr_segs, blkbits, get_block, end_io, dio);
 
 	if (rw == READ && dio_lock_type == DIO_LOCKING)
 		reader_with_isem = 0;
Index: linux-2.6.16-rc3/fs/ext2/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/ext2/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/ext2/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -667,18 +667,6 @@ static sector_t ext2_bmap(struct address
 	return generic_block_bmap(mapping,block,ext2_get_block);
 }
 
-static int
-ext2_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
-			struct buffer_head *bh_result, int create)
-{
-	int ret;
-
-	ret = ext2_get_block(inode, iblock, bh_result, create);
-	if (ret == 0)
-		bh_result->b_size = (1 << inode->i_blkbits);
-	return ret;
-}
-
 static ssize_t
 ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 			loff_t offset, unsigned long nr_segs)
@@ -687,7 +675,7 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 	struct inode *inode = file->f_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				offset, nr_segs, ext2_get_blocks, NULL);
+				offset, nr_segs, ext2_get_block, NULL);
 }
 
 static int
Index: linux-2.6.16-rc3/fs/ext3/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/ext3/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/ext3/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -806,8 +806,7 @@ static int ext3_get_block(struct inode *
 
 static int
 ext3_direct_io_get_blocks(struct inode *inode, sector_t iblock,
-		unsigned long max_blocks, struct buffer_head *bh_result,
-		int create)
+		struct buffer_head *bh_result, int create)
 {
 	handle_t *handle = journal_current_handle();
 	int ret = 0;
Index: linux-2.6.16-rc3/fs/fat/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/fat/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/fat/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -101,11 +101,11 @@ static int __fat_get_blocks(struct inode
 }
 
 static int fat_get_blocks(struct inode *inode, sector_t iblock,
-			  unsigned long max_blocks,
 			  struct buffer_head *bh_result, int create)
 {
 	struct super_block *sb = inode->i_sb;
 	int err;
+	unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;
 
 	err = __fat_get_blocks(inode, iblock, &max_blocks, bh_result, create);
 	if (err)
Index: linux-2.6.16-rc3/fs/hfs/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/hfs/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/hfs/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -98,17 +98,6 @@ static int hfs_releasepage(struct page *
 	return res ? try_to_free_buffers(page) : 0;
 }
 
-static int hfs_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
-			  struct buffer_head *bh_result, int create)
-{
-	int ret;
-
-	ret = hfs_get_block(inode, iblock, bh_result, create);
-	if (!ret)
-		bh_result->b_size = (1 << inode->i_blkbits);
-	return ret;
-}
-
 static ssize_t hfs_direct_IO(int rw, struct kiocb *iocb,
 		const struct iovec *iov, loff_t offset, unsigned long nr_segs)
 {
@@ -116,7 +105,7 @@ static ssize_t hfs_direct_IO(int rw, str
 	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, hfs_get_blocks, NULL);
+				  offset, nr_segs, hfs_get_block, NULL);
 }
 
 static int hfs_writepages(struct address_space *mapping,
Index: linux-2.6.16-rc3/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/hfsplus/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/hfsplus/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -93,17 +93,6 @@ static int hfsplus_releasepage(struct pa
 	return res ? try_to_free_buffers(page) : 0;
 }
 
-static int hfsplus_get_blocks(struct inode *inode, sector_t iblock, unsigned long max_blocks,
-			      struct buffer_head *bh_result, int create)
-{
-	int ret;
-
-	ret = hfsplus_get_block(inode, iblock, bh_result, create);
-	if (!ret)
-		bh_result->b_size = (1 << inode->i_blkbits);
-	return ret;
-}
-
 static ssize_t hfsplus_direct_IO(int rw, struct kiocb *iocb,
 		const struct iovec *iov, loff_t offset, unsigned long nr_segs)
 {
@@ -111,7 +100,7 @@ static ssize_t hfsplus_direct_IO(int rw,
 	struct inode *inode = file->f_dentry->d_inode->i_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				  offset, nr_segs, hfsplus_get_blocks, NULL);
+				  offset, nr_segs, hfsplus_get_block, NULL);
 }
 
 static int hfsplus_writepages(struct address_space *mapping,
Index: linux-2.6.16-rc3/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/ocfs2/aops.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/ocfs2/aops.c	2006-02-14 15:57:27.000000000 -0800
@@ -538,7 +538,6 @@ bail:
  * 					fs_count, map_bh, dio->rw == WRITE);
  */
 static int ocfs2_direct_IO_get_blocks(struct inode *inode, sector_t iblock,
-				     unsigned long max_blocks,
 				     struct buffer_head *bh_result, int create)
 {
 	int ret;
@@ -546,6 +545,7 @@ static int ocfs2_direct_IO_get_blocks(st
 	u64 p_blkno;
 	int contig_blocks;
 	unsigned char blocksize_bits;
+	unsigned long max_blocks = bh_result->b_size >> inode->i_blkbits;
 
 	if (!inode || !bh_result) {
 		mlog(ML_ERROR, "inode or bh_result is null\n");
Index: linux-2.6.16-rc3/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/reiserfs/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/reiserfs/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -466,7 +466,6 @@ static int reiserfs_get_block_create_0(s
    direct_IO request. */
 static int reiserfs_get_blocks_direct_io(struct inode *inode,
 					 sector_t iblock,
-					 unsigned long max_blocks,
 					 struct buffer_head *bh_result,
 					 int create)
 {
Index: linux-2.6.16-rc3/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/xfs/linux-2.6/xfs_aops.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/xfs/linux-2.6/xfs_aops.c	2006-02-14 15:57:27.000000000 -0800
@@ -1144,12 +1144,11 @@ STATIC int
 linvfs_get_blocks_direct(
 	struct inode		*inode,
 	sector_t		iblock,
-	unsigned long		max_blocks,
 	struct buffer_head	*bh_result,
 	int			create)
 {
-	return __linvfs_get_block(inode, iblock, max_blocks, bh_result,
-					create, 1, BMAPI_WRITE|BMAPI_DIRECT);
+	return __linvfs_get_block(inode, iblock, bh_result->b_size >> inode->i_blkbits,
+				bh_result, create, 1, BMAPI_WRITE|BMAPI_DIRECT);
 }
 
 STATIC void
Index: linux-2.6.16-rc3/include/linux/fs.h
===================================================================
--- linux-2.6.16-rc3.orig/include/linux/fs.h	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/include/linux/fs.h	2006-02-14 15:57:27.000000000 -0800
@@ -240,9 +240,6 @@ extern void __init files_init(unsigned l
 struct buffer_head;
 typedef int (get_block_t)(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
-typedef int (get_blocks_t)(struct inode *inode, sector_t iblock,
-			unsigned long max_blocks,
-			struct buffer_head *bh_result, int create);
 typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 			ssize_t bytes, void *private);
 
@@ -1621,7 +1618,7 @@ static inline void do_generic_file_read(
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
-	unsigned long nr_segs, get_blocks_t get_blocks, dio_iodone_t end_io,
+	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
 	int lock_type);
 
 enum {
@@ -1632,29 +1629,29 @@ enum {
 
 static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
 	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
+	loff_t offset, unsigned long nr_segs, get_block_t get_block,
 	dio_iodone_t end_io)
 {
 	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_blocks, end_io, DIO_LOCKING);
+				nr_segs, get_block, end_io, DIO_LOCKING);
 }
 
 static inline ssize_t blockdev_direct_IO_no_locking(int rw, struct kiocb *iocb,
 	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
+	loff_t offset, unsigned long nr_segs, get_block_t get_block,
 	dio_iodone_t end_io)
 {
 	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_blocks, end_io, DIO_NO_LOCKING);
+				nr_segs, get_block, end_io, DIO_NO_LOCKING);
 }
 
 static inline ssize_t blockdev_direct_IO_own_locking(int rw, struct kiocb *iocb,
 	struct inode *inode, struct block_device *bdev, const struct iovec *iov,
-	loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks,
+	loff_t offset, unsigned long nr_segs, get_block_t get_block,
 	dio_iodone_t end_io)
 {
 	return __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
-				nr_segs, get_blocks, end_io, DIO_OWN_LOCKING);
+				nr_segs, get_block, end_io, DIO_OWN_LOCKING);
 }
 
 extern struct file_operations generic_ro_fops;
Index: linux-2.6.16-rc3/fs/block_dev.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/block_dev.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/block_dev.c	2006-02-14 15:57:27.000000000 -0800
@@ -135,9 +135,10 @@ blkdev_get_block(struct inode *inode, se
 
 static int
 blkdev_get_blocks(struct inode *inode, sector_t iblock,
-		unsigned long max_blocks, struct buffer_head *bh, int create)
+		struct buffer_head *bh, int create)
 {
 	sector_t end_block = max_block(I_BDEV(inode));
+	unsigned long max_blocks = bh->b_size >> inode->i_blkbits;
 
 	if ((iblock + max_blocks) > end_block) {
 		max_blocks = end_block - iblock;
Index: linux-2.6.16-rc3/fs/jfs/inode.c
===================================================================
--- linux-2.6.16-rc3.orig/fs/jfs/inode.c	2006-02-14 15:54:12.000000000 -0800
+++ linux-2.6.16-rc3/fs/jfs/inode.c	2006-02-14 15:57:27.000000000 -0800
@@ -301,7 +301,7 @@ static ssize_t jfs_direct_IO(int rw, str
 	struct inode *inode = file->f_mapping->host;
 
 	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
-				offset, nr_segs, jfs_get_blocks, NULL);
+				offset, nr_segs, jfs_get_block, NULL);
 }
 
 struct address_space_operations jfs_aops = {


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
                   ` (2 preceding siblings ...)
  2006-02-20 21:24 ` [PATCH 3/3] remove ->get_blocks() support Badari Pulavarty
@ 2006-02-20 21:59 ` Nathan Scott
  2006-02-20 23:06   ` Badari Pulavarty
  2006-02-21  2:41   ` Jeremy Higdon
  2006-02-22 15:12 ` christoph
  4 siblings, 2 replies; 32+ messages in thread
From: Nathan Scott @ 2006-02-20 21:59 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel

On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> Hi,

Hi Badari,

> Following patches add support to map multiple blocks in ->get_block().
> This is will allow us to handle mapping of multiple disk blocks for
> mpage_readpages() and mpage_writepages() etc. Instead of adding new
> argument, I use "b_size" to indicate the amount of disk mapping needed
> for get_block(). And also, on success get_block() actually indicates
> the amount of disk mapping it did.

Thanks for doing this work!

> Now that get_block() can handle multiple blocks, there is no need
> for ->get_blocks() which was added for DIO. 
> 
> [PATCH 1/3] pass b_size to ->get_block()
> 
> [PATCH 2/3] map multiple blocks for mpage_readpages()
> 
> [PATCH 3/3] remove ->get_blocks() support
> 
> I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. 
> (on simple "dd" read tests).
> 	
>          (rc3.mm1)	(rc3.mm1 + patches)
> real    0m18.814s	0m18.482s
> user    0m0.000s	0m0.004s
> sys     0m3.240s	0m2.912s
> 
> Andrew, Could you include it in -mm tree ?
> 
> Comments ?

I've been running these patches in my development tree for a while
and have not seen any problems.  My one (possibly minor) concern
is that we pass get_block a size in units of bytes, e.g.:

	bh->b_size = 1 << inode->i_blkbits;
	err = get_block(inode, block, bh, 1);

And b_size is a u32.  We have had the situation in the past where
people (I'm looking at you, Jeremy ;) have been issuing multiple-
gigabyte direct reads/writes through XFS.  The syscall interface
takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
byte count.

I wonder if this change will end up ruining things for the lunatic
fringe issuing these kinds of IOs?  Maybe the get_block call could
take a block count rather than a byte count?  (I guess that would
equate to dropping get_block_t rather than get_blocks_t... which is
kinda the alternate direction to what you took here).  On the other
hand, maybe it'd be simpler to change b_size to be a size_t instead
of u32?  Although, since we are now mapping multiple blocks at once,
"get_blocks_t" does seem an appropriate name.  *shrug*, whatever ...
the main thing that'd be good to see addressed is the 32 bit size.
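
To make the wrap-around concrete, here is a small self-contained
user-space sketch (not kernel code, illustrative names only) of what
happens when a multi-gigabyte byte count lands in a 32-bit field:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t request = 5ULL << 30;	/* a 5GiB direct I/O request */
		uint32_t as_u32  = request;	/* what a u32 b_size would hold */
		size_t   as_size = request;	/* size_t b_size on a 64-bit box */

		printf("requested: %llu bytes\n", (unsigned long long)request);
		printf("u32 field: %u bytes (wrapped)\n", as_u32);
		printf("size_t   : %zu bytes\n", as_size);
		return 0;
	}

On a 64-bit platform this prints 1073741824 for the u32 field and
5368709120 for the size_t one; a 32-bit size_t would wrap as well, but
such platforms cannot submit I/Os of that size in the first place.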

cheers.

-- 
Nathan


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott
@ 2006-02-20 23:06   ` Badari Pulavarty
  2006-02-20 23:16     ` Nathan Scott
  2006-02-21  2:41   ` Jeremy Higdon
  1 sibling, 1 reply; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-20 23:06 UTC (permalink / raw)
  To: Nathan Scott; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel

On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> > Hi,
> 
> Hi Badari,
> 
> > Following patches add support to map multiple blocks in ->get_block().
> > This is will allow us to handle mapping of multiple disk blocks for
> > mpage_readpages() and mpage_writepages() etc. Instead of adding new
> > argument, I use "b_size" to indicate the amount of disk mapping needed
> > for get_block(). And also, on success get_block() actually indicates
> > the amount of disk mapping it did.
> 
> Thanks for doing this work!
> 
> > Now that get_block() can handle multiple blocks, there is no need
> > for ->get_blocks() which was added for DIO. 
> > 
> > [PATCH 1/3] pass b_size to ->get_block()
> > 
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> > 
> > [PATCH 3/3] remove ->get_blocks() support
> > 
> > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. 
> > (on simple "dd" read tests).
> > 	
> >          (rc3.mm1)	(rc3.mm1 + patches)
> > real    0m18.814s	0m18.482s
> > user    0m0.000s	0m0.004s
> > sys     0m3.240s	0m2.912s
> > 
> > Andrew, Could you include it in -mm tree ?
> > 
> > Comments ?
> 
> I've been running these patches in my development tree for awhile
> and have not seen any problems.  My one (possibly minor) concern
> is that we pass get_block a size in units of bytes, e.g....
> 
> 	bh->b_size = 1 << inode->i_blkbits;
> 	err = get_block(inode, block, bh, 1);
> 
> And b_size is a u32.  We have had the situation in the past where
> people (I'm looking at you, Jeremy ;) have been issuing multiple-
> gigabyte direct reads/writes through XFS.  The syscall interface
> takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> byte count.

> I wonder if this change will end up ruining things for the lunatic
> fringe issuing these kinds of IOs?  Maybe the get_block call could
> take a block count rather than a byte count?  

Yes, I thought about it too - I wanted to pass a "block count" instead
of a "byte count". Right now it does:

	bh->b_size = 1 << inode->i_blkbits;
	call get_block();

and the first thing get_block() does is:

	blocks = bh->b_size >> inode->i_blkbits;

All that unnecessary shifting around for nothing :(

But I ended up using a "byte count" just to avoid the confusion of
asking in "blocks" and getting back "bytes".

I have no problem making b_size a "size_t" to handle 64-bit.
But again, if we are fiddling with buffer_head - maybe it's time
to look at an alternative to "buffer_head" that carries exactly the
information we need for get_block()?

Thanks,
Badari



* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-20 23:06   ` Badari Pulavarty
@ 2006-02-20 23:16     ` Nathan Scott
  0 siblings, 0 replies; 32+ messages in thread
From: Nathan Scott @ 2006-02-20 23:16 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, jeremy, linux-fsdevel

On Mon, Feb 20, 2006 at 03:06:11PM -0800, Badari Pulavarty wrote:
> On Tue, 2006-02-21 at 08:59 +1100, Nathan Scott wrote:
> ...
> > I wonder if this change will end up ruining things for the lunatic
> > fringe issuing these kinds of IOs?  Maybe the get_block call could
> > take a block count rather than a byte count?  
> 
> Yes. I thought about it too.. I wanted to pass "block count" instead
> of "byte count". Right now it does ..
> 
> 	bh->b_size = 1 << inode->i_blkbits;
> 	call get_block();
> 
> First thing get_block() does is
> 	blocks = bh->b_size >> inode->i_blkbits;
> 
> All, the unnecessary shifting around for nothing :(

Yeah, pretty silly really, but there's not much choice if the
goal is to keep this simple.  Oh well.

> But, I ended up doing "byte count" just to avoid confusion of
> asking in "blocks" getting back in "bytes".

Understood.

> I have no problem making b_size as "size_t" to handle 64-bit.
> But again, if we are fiddling with buffer_head - may be its time
> to look at alternative to "buffer_head" with the information exactly 
> we need for getblock() ?

That is a much bigger change - I'm not in a position to make the
call on whether that's in everyone's best interests.  However, I do
want to make sure we don't regress anything, so I guess the u32
to size_t switch probably should be made to resolve this issue.

Thanks again for following up on this.

cheers.

-- 
Nathan


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott
  2006-02-20 23:06   ` Badari Pulavarty
@ 2006-02-21  2:41   ` Jeremy Higdon
  2006-02-21 16:03     ` Badari Pulavarty
  1 sibling, 1 reply; 32+ messages in thread
From: Jeremy Higdon @ 2006-02-21  2:41 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Badari Pulavarty, christoph, mcao, akpm, lkml, linux-fsdevel

On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> 
> I've been running these patches in my development tree for awhile
> and have not seen any problems.  My one (possibly minor) concern
> is that we pass get_block a size in units of bytes, e.g....
> 
> 	bh->b_size = 1 << inode->i_blkbits;
> 	err = get_block(inode, block, bh, 1);
> 
> And b_size is a u32.  We have had the situation in the past where
> people (I'm looking at you, Jeremy ;) have been issuing multiple-
> gigabyte direct reads/writes through XFS.  The syscall interface
> takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> byte count.
> 
> I wonder if this change will end up ruining things for the lunatic
> fringe issuing these kinds of IOs?  Maybe the get_block call could

Hey!  Lunatic fringe indeed.  Harumph!  :-)

Yes, there are a few people out there who will need to issue really
large I/O reads or writes to get maximum I/O bandwidth on large
stripes.  The largest I've done so far is 4GiB, but I expect that
number will likely increase this year, and more likely next year,
if not.

jeremy


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-21  2:41   ` Jeremy Higdon
@ 2006-02-21 16:03     ` Badari Pulavarty
  2006-02-21 21:39       ` Nathan Scott
  0 siblings, 1 reply; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-21 16:03 UTC (permalink / raw)
  To: Jeremy Higdon; +Cc: Nathan Scott, christoph, mcao, akpm, lkml, linux-fsdevel

On Mon, 2006-02-20 at 18:41 -0800, Jeremy Higdon wrote:
> On Tue, Feb 21, 2006 at 08:59:53AM +1100, Nathan Scott wrote:
> > On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> > 
> > I've been running these patches in my development tree for awhile
> > and have not seen any problems.  My one (possibly minor) concern
> > is that we pass get_block a size in units of bytes, e.g....
> > 
> > 	bh->b_size = 1 << inode->i_blkbits;
> > 	err = get_block(inode, block, bh, 1);
> > 
> > And b_size is a u32.  We have had the situation in the past where
> > people (I'm looking at you, Jeremy ;) have been issuing multiple-
> > gigabyte direct reads/writes through XFS.  The syscall interface
> > takes an (s)size_t in bytes, which on 64 bit platforms is a 64 bit
> > byte count.
> > 
> > I wonder if this change will end up ruining things for the lunatic
> > fringe issuing these kinds of IOs?  Maybe the get_block call could
> 
> Hey!  Lunatic fringe indeed.  Harumph!  :-)
> 
> Yes, there are a few people out there who will need to issue really
> large I/O reads or writes to get maximum I/O bandwidth on large
> stripes.  The largest I've done so far is 4GiB, but I expect that
> number will likely increase this year, and more likely next year,
> if not.

Thinking more about it: currently (without my patches) only the DIO
code can request large chunks of mapping, through get_blocks().
Since b_size is only a u32, you can only map 4GB at most, isn't it?
With my patches, mpage_readpages() can now also request large
chunks (through readahead). So my patches are not adding any
extra limitation; they carry the same existing limitation.

In order to handle larger chunks of disk mapping, changing b_size
to u64 is required, and we should do that irrespective of my
patches.

Thanks,
Badari



* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-21 16:03     ` Badari Pulavarty
@ 2006-02-21 21:39       ` Nathan Scott
  0 siblings, 0 replies; 32+ messages in thread
From: Nathan Scott @ 2006-02-21 21:39 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Jeremy Higdon, christoph, mcao, akpm, lkml, linux-fsdevel

On Tue, Feb 21, 2006 at 08:03:54AM -0800, Badari Pulavarty wrote:
> ...
> Thinking more about it, currently (without my patches) only DIO
> code can request for large chunks of mapping through get_blocks().
> Since b_size is only u32, you can only map 4GB max. Isn't it ?
> With my patches, now mpage_readpages() also can request large
> chunks. (through readahead). So, my patches are not adding any
> extra limitation. Its carrying the same existing limitation. 

The DIO get_blocks is super-freaky at the moment, in that it passes
in an unsigned long max_blocks (units of blocks) but it expects the
filesystem to fill in a u32 b_size (in bytes).  This does at least
give the filesystem the hint that this was a monster write and it
gets the chance to allocate contiguously, but that was probably
unintentional, who knows?  I guess that subtlety would be dropped
with this change if b_size isn't made a 64 bit entity.

> In order to handle larger chunks of disk mapping, changing b_size 
> to u64 is required and we should request for it irrespective of my
> patches.

*shrug*... it's all very closely inter-related, I'd keep it together.
I'd also thought that size_t was the right type here, not u64?  32
bit platforms aren't capable of submitting IOs this large, so they
needn't be doing any 64 bit manipulation here, no?

cheers.

-- 
Nathan


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
                   ` (3 preceding siblings ...)
  2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott
@ 2006-02-22 15:12 ` christoph
  2006-02-22 16:58   ` Badari Pulavarty
  2006-02-24 17:19   ` Badari Pulavarty
  4 siblings, 2 replies; 32+ messages in thread
From: christoph @ 2006-02-22 15:12 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> Hi,
> 
> Following patches add support to map multiple blocks in ->get_block().
> This is will allow us to handle mapping of multiple disk blocks for
> mpage_readpages() and mpage_writepages() etc. Instead of adding new
> argument, I use "b_size" to indicate the amount of disk mapping needed
> for get_block(). And also, on success get_block() actually indicates
> the amount of disk mapping it did.
> 
> Now that get_block() can handle multiple blocks, there is no need
> for ->get_blocks() which was added for DIO. 
> 
> [PATCH 1/3] pass b_size to ->get_block()
> 
> [PATCH 2/3] map multiple blocks for mpage_readpages()
> 
> [PATCH 3/3] remove ->get_blocks() support
> 
> I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. 
> (on simple "dd" read tests).
> 	
>          (rc3.mm1)	(rc3.mm1 + patches)
> real    0m18.814s	0m18.482s
> user    0m0.000s	0m0.004s
> sys     0m3.240s	0m2.912s
> 
> Andrew, Could you include it in -mm tree ?

Thanks Badari, with these interface changes the mpage_readpage changes
look a lot nicer than my original version.  I'd like to second
the request to put it into -mm.

And it would be nice if the namesys folks could try out whether this
works for their reiser4 requirements.  If you have an even faster
->readpages I'd be interested in that secret sauce recipe for
further improvements to mpage_readpages.


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 15:12 ` christoph
@ 2006-02-22 16:58   ` Badari Pulavarty
  2006-02-22 16:59     ` christoph
                       ` (2 more replies)
  2006-02-24 17:19   ` Badari Pulavarty
  1 sibling, 3 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-22 16:58 UTC (permalink / raw)
  To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel, vs, zam

On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
> On Mon, Feb 20, 2006 at 01:21:27PM -0800, Badari Pulavarty wrote:
> > Hi,
> > 
> > Following patches add support to map multiple blocks in ->get_block().
> > This is will allow us to handle mapping of multiple disk blocks for
> > mpage_readpages() and mpage_writepages() etc. Instead of adding new
> > argument, I use "b_size" to indicate the amount of disk mapping needed
> > for get_block(). And also, on success get_block() actually indicates
> > the amount of disk mapping it did.
> > 
> > Now that get_block() can handle multiple blocks, there is no need
> > for ->get_blocks() which was added for DIO. 
> > 
> > [PATCH 1/3] pass b_size to ->get_block()
> > 
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> > 
> > [PATCH 3/3] remove ->get_blocks() support
> > 
> > I noticed decent improvements (reduced sys time) on JFS, XFS and ext3. 
> > (on simple "dd" read tests).
> > 	
> >          (rc3.mm1)	(rc3.mm1 + patches)
> > real    0m18.814s	0m18.482s
> > user    0m0.000s	0m0.004s
> > sys     0m3.240s	0m2.912s
> > 
> > Andrew, Could you include it in -mm tree ?
> 
> Thanks Badari, with that interface changes the mpage_readpage changes
> look a lot nicer than my original version.  I'd like to second
> the request to put it into -mm. 

Thanks. The only current issue Nathan raised is that he wants to see
b_size changed to u64 (instead of u32) to support really huge I/O
requests. Does this sound reasonable to you?

Thanks,
Badari



* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 16:58   ` Badari Pulavarty
@ 2006-02-22 16:59     ` christoph
  2006-02-23  1:40       ` Nathan Scott
  2006-02-22 17:23     ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach
  2006-02-22 18:37     ` Dave Kleikamp
  2 siblings, 1 reply; 32+ messages in thread
From: christoph @ 2006-02-22 16:59 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote:
> Thanks. Only current issue Nathan raised is, he wants to see
> b_size change to u64 (instead of u32) to support really-huge-IO
> requests. Does this sound reasonable to you ?

I know that we didn't want to increase b_size at some point because
of concerns about the number of objects fitting into a page in the
slab allocator.  If the same number of bigger heads fit into the
same page I see no problems with the increase.


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 16:58   ` Badari Pulavarty
  2006-02-22 16:59     ` christoph
@ 2006-02-22 17:23     ` Peter Staubach
  2006-02-22 18:37     ` Dave Kleikamp
  2 siblings, 0 replies; 32+ messages in thread
From: Peter Staubach @ 2006-02-22 17:23 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

Badari Pulavarty wrote:

>
>Thanks. Only current issue Nathan raised is, he wants to see
>b_size change to u64 (instead of u32) to support really-huge-IO
>requests. Does this sound reasonable to you ?
>

I believe that the various read and write system calls impose a 32 bit limit
on the size of the i/o request, so you may need to look at that too.

Increasing the size of b_size would address a bug which exists when trying
to read more than 4GB from a block device which is greater than 4GB in size.
Currently, that bug is masked because the maximum system call size is
arbitrarily limited.

    Thanx...

       ps


* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 16:58   ` Badari Pulavarty
  2006-02-22 16:59     ` christoph
  2006-02-22 17:23     ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach
@ 2006-02-22 18:37     ` Dave Kleikamp
  2006-02-22 19:00       ` Badari Pulavarty
  2 siblings, 1 reply; 32+ messages in thread
From: Dave Kleikamp @ 2006-02-22 18:37 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote:
> Thanks. Only current issue Nathan raised is, he wants to see
> b_size change to u64 (instead of u32) to support really-huge-IO
> requests. Does this sound reasonable to you ?

Didn't someone point out that size_t would make more sense?  There's no
reason for a 32-bit architecture to have a 64-bit b_size.

> Thanks,
> Badari

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 18:37     ` Dave Kleikamp
@ 2006-02-22 19:00       ` Badari Pulavarty
  0 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-22 19:00 UTC (permalink / raw)
  To: Dave Kleikamp; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

Dave Kleikamp wrote:

> On Wed, 2006-02-22 at 08:58 -0800, Badari Pulavarty wrote:
> 
>>Thanks. Only current issue Nathan raised is, he wants to see
>>b_size change to u64 (instead of u32) to support really-huge-IO
>>requests. Does this sound reasonable to you ?
> 
> 
> Didn't someone point out that size_t would make more sense?  There's no
> reason for a 32-bit architecture to have a 64-bit b_size.

Yes. I meant to say size_t.

- Badari



* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 16:59     ` christoph
@ 2006-02-23  1:40       ` Nathan Scott
  2006-02-23  1:59         ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Nathan Scott @ 2006-02-23  1:40 UTC (permalink / raw)
  To: Badari Pulavarty, Christoph Hellwig; +Cc: mcao, akpm, lkml, linux-fsdevel

On Wed, Feb 22, 2006 at 05:59:42PM +0100, christoph wrote:
> On Wed, Feb 22, 2006 at 08:58:30AM -0800, Badari Pulavarty wrote:
> > Thanks. Only current issue Nathan raised is, he wants to see
> > b_size change to u64 [size_t] (instead of u32) to support
> > really-huge-IO requests.

And also to not go backwards on what the current DIO mapping
interface provides us.

> Does this sound reasonable to you ?
> 
> I know that we didn't want to increase b_size at some point because
> of concerns about the number of objects fitting into a page in the

There are four extra bytes on an 88 byte structure (with sector_t
CONFIG'd at 64 bits) - so, yes, there'll be a small decrease in
the number that fit in a page on 64 bit platforms.  Perhaps it's
not worth it, but it would be good to sort out these pesky size
mismatches because they introduce very subtle bugs.

> slab allocator.  If the same number of bigger heads fit into the
> same page I see no problems with the increase.

Heh, bigger heads.  Well, the same number fit in the page for 32
bit platforms, does that count?

Taking a quick look at the struct definition, some oddities have
crept in since the 2.4 version - it looks like sct had arranged the
fields in nice 32-bit-platform-cache-aligned groups, with several
cache-alignment related comments, some of which now remain and make
very little sense on their own (and with the now-incorrect grouping).

Anyway, here's a patch, Badari - I reckon it's worth it, but then I
am a bit biased (as it's likely only XFS is going to be seeing this
size of I/O for some time, and as someone who has hunted down some
of these size mismatch bugs...)

cheers.

-- 
Nathan


Increase the size of the buffer_head b_size field for 64 bit
platforms.  Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <nathans@sgi.com>

--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -46,20 +46,25 @@ struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
 
 /*
- * Keep related fields in common cachelines.  The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).  There
+ * may be one or two other uses too (I used it for drying the
+ * dishes the other night when I couldn't find a tea towel...).
  */
 struct buffer_head {
-	/* First cache line: */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */
-	atomic_t b_count;		/* users using this block */
-	u32 b_size;			/* block size */
+	atomic_t b_count;		/* users using this buffer_head */
 
-	sector_t b_blocknr;		/* block number */
-	char *b_data;			/* pointer to data block */
+	size_t b_size;			/* size of mapping */
+	sector_t b_blocknr;		/* start block number */
+	char *b_data;			/* pointer to data within the page */
 
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-23  1:40       ` Nathan Scott
@ 2006-02-23  1:59         ` Andrew Morton
  2006-02-23 16:28           ` [PATCH] change b_size to size_t Badari Pulavarty
  0 siblings, 1 reply; 32+ messages in thread
From: Andrew Morton @ 2006-02-23  1:59 UTC (permalink / raw)
  To: Nathan Scott; +Cc: pbadari, hch, mcao, linux-kernel, linux-fsdevel

Nathan Scott <nathans@sgi.com> wrote:
>
> There's four extra bytes on an 88 byte structure (with sector_t
>  CONFIG'd at 64 bits)

oy, buffer_heads are 48 bytes - took a few months out of my life, that did.
That's where it most counts - memory-constrained non-LBD 32-bit platforms.

Making b_size size_t doesn't affect that, so..

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/3] map multiple blocks for mpage_readpages()
  2006-02-20 21:23 ` [PATCH 2/3] map multiple blocks for mpage_readpages() Badari Pulavarty
@ 2006-02-23  3:29   ` Suparna Bhattacharya
  2006-02-23 22:18     ` Badari Pulavarty
  0 siblings, 1 reply; 32+ messages in thread
From: Suparna Bhattacharya @ 2006-02-23  3:29 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel

On Mon, Feb 20, 2006 at 01:23:55PM -0800, Badari Pulavarty wrote:
> [PATCH 2/3] map multiple blocks for mpage_readpages()
> 
> This patch changes mpage_readpages() and get_block() to
> get the disk mapping information for multiple blocks at the
> same time.
> 
> b_size represents the amount of disk mapping that needs to be
> mapped.  On a successful get_block(), b_size indicates the
> amount of disk mapping that's actually mapped.  Only the
> filesystems that care to use this information and provide multiple
> disk blocks at a time can choose to do so.
> 
> No changes are needed for filesystems that want to ignore this.

That's a nice way to handle this !

Just one minor suggestion on the patches - instead of adding all those
arguments to do_mpage_readpage, have you considered using
an mio structure as we did in the mpage_writepages multiple blocks case ?

Regards
Suparna

> 
> Thanks,
> Badari
> 
> 



-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] change b_size to size_t
  2006-02-23  1:59         ` Andrew Morton
@ 2006-02-23 16:28           ` Badari Pulavarty
  2006-02-23 16:32             ` Benjamin LaHaise
  2006-02-23 16:40             ` Dave Kleikamp
  0 siblings, 2 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 16:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Nathan Scott, christoph, mcao, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 572 bytes --]

On Wed, 2006-02-22 at 17:59 -0800, Andrew Morton wrote:
> Nathan Scott <nathans@sgi.com> wrote:
> >
> > There's four extra bytes on an 88 byte structure (with sector_t
> >  CONFIG'd at 64 bits)
> 
> oy, buffer_heads are 48 bytes - took a few months out of my life, that did.
> That's where it most counts - memory-constrained non-LBD 32-bit platforms.
> 
> Making b_size size_t doesn't affect that, so..

Here is the updated version of the patch, which changes
buffer_head.b_size to size_t to support mapping a large
number of disk blocks (for large IOs).

Thanks,
Badari



[-- Attachment #2: increase-bsize.patch --]
[-- Type: text/x-patch, Size: 4537 bytes --]

Increase the size of the buffer_head b_size field for 64 bit
platforms.  Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-23 08:33:29.000000000 -0800
@@ -46,20 +46,23 @@ struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
 
 /*
- * Keep related fields in common cachelines.  The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
  */
 struct buffer_head {
-	/* First cache line: */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */
-	atomic_t b_count;		/* users using this block */
-	u32 b_size;			/* block size */
+	atomic_t b_count;		/* users using this buffer_head */
 
-	sector_t b_blocknr;		/* block number */
-	char *b_data;			/* pointer to data block */
+	size_t b_size;			/* size of mapping */
+	sector_t b_blocknr;		/* start block number */
+	char *b_data;			/* pointer to data within the page */
 
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-23 08:19:15.000000000 -0800
@@ -432,7 +432,7 @@ __find_get_block_slow(struct block_devic
 		printk("__find_get_block_slow() failed. "
 			"block=%llu, b_blocknr=%llu\n",
 			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
-		printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+		printk("b_state=0x%08lx, b_size=%llu\n", bh->b_state, 					(unsigned long long)bh->b_size);
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
 out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c	2006-02-23 08:20:01.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
 	char b[BDEVNAME_SIZE];
 
 	sprintf(buf,
-		"dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-		bdevname(bh->b_bdev, b), bh->b_size,
+		"dev %s, size %lld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
+		bdevname(bh->b_bdev, b), (unsigned long long)bh->b_size,
 		(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
 		bh->b_state, bh->b_page,
 		buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c	2006-02-23 08:23:34.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
 	BUG_ON(!bh);
 	BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));
 
-	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %llu\n",
 		   (unsigned long long)bh->b_blocknr, type,
 		   (type == OCFS2_JOURNAL_ACCESS_CREATE) ?
 		   "OCFS2_JOURNAL_ACCESS_CREATE" :
 		   "OCFS2_JOURNAL_ACCESS_WRITE",
-		   bh->b_size);
+		   (unsigned long long)bh->b_size);
 
 	/* we can safely remove this assertion after testing. */
 	if (!buffer_uptodate(bh)) {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 16:28           ` [PATCH] change b_size to size_t Badari Pulavarty
@ 2006-02-23 16:32             ` Benjamin LaHaise
  2006-02-23 17:20               ` Badari Pulavarty
  2006-02-23 17:28               ` Badari Pulavarty
  2006-02-23 16:40             ` Dave Kleikamp
  1 sibling, 2 replies; 32+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 16:32 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> Here is the updated version of the patch, which changes
> buffer_head.b_size to size_t to support mapping large
> amount of disk blocks (for large IOs).

Your patch doesn't seem to be inline, so I can't quote it.  Several
problems: on 64-bit platforms you introduced 4 bytes of padding into
buffer_head.  atomic_t only takes up 4 bytes, while size_t is 8-byte
aligned.  This is a waste of memory, imo, seeing as the vast majority
of systems out there will not be doing 4GB+ IOs any time soon.

Also, the cast to unsigned long long for size_t is pretty atrocious.
Cast to unsigned long if anything, as size_t is unsigned long on all
platforms Linux runs on.
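
Both points can be seen with a small user-space sketch (stand-in
typedefs, assuming an LP64 platform where size_t and long are 8 bytes;
this is not the kernel structure itself):

#include <stdio.h>
#include <stddef.h>

typedef struct { int counter; } fake_atomic_t;	/* 4 bytes, like atomic_t */

struct layout {
	fake_atomic_t count;	/* 4 bytes ...                              */
	size_t size;		/* ... 4 bytes of padding precede this here */
};

int main(void)
{
	size_t sz = 5;

	/* On LP64: offsetof(size) is 8 (not 4) and sizeof is 16 (not 12). */
	printf("offsetof(size)=%zu sizeof=%zu\n",
	       offsetof(struct layout, size), sizeof(struct layout));

	/* size_t is unsigned long here, so %lu with a long cast is enough. */
	printf("size=%lu\n", (unsigned long)sz);
	return 0;
}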

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 16:28           ` [PATCH] change b_size to size_t Badari Pulavarty
  2006-02-23 16:32             ` Benjamin LaHaise
@ 2006-02-23 16:40             ` Dave Kleikamp
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Kleikamp @ 2006-02-23 16:40 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, 2006-02-23 at 08:28 -0800, Badari Pulavarty wrote:
> Index: linux-2.6.16-rc4/fs/buffer.c
> ===================================================================
> --- linux-2.6.16-rc4.orig/fs/buffer.c   2006-02-17 14:23:45.000000000 -0800
> +++ linux-2.6.16-rc4/fs/buffer.c        2006-02-23 08:19:15.000000000 -0800
> @@ -432,7 +432,7 @@ __find_get_block_slow(struct block_devic
>                 printk("__find_get_block_slow() failed. "
>                         "block=%llu, b_blocknr=%llu\n",
>                         (unsigned long long)block, (unsigned long long)bh->b_blocknr);
> -               printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
> +               printk("b_state=0x%08lx, b_size=%llu\n", bh->b_state,                                   (unsigned long long)bh->b_size);

Wrapping at 80 columns, it looks right, but you've got a long line in
here.

>                 printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
>         }
> 
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 16:32             ` Benjamin LaHaise
@ 2006-02-23 17:20               ` Badari Pulavarty
  2006-02-23 17:28               ` Badari Pulavarty
  1 sibling, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 17:20 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 6042 bytes --]

On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > Here is the updated version of the patch, which changes
> > buffer_head.b_size to size_t to support mapping large
> > amount of disk blocks (for large IOs).
> 
> Your patch doesn't seem to be inline, so I can't quote it.  Several 
> problems: on 64 bit platforms you introduced 4 bytes of padding into 
> buffer_head.  atomic_t only takes up 4 byte, while size_t is 8 byte 
> aligned.  

I moved stuff around, but still see 96 bytes (an 8-byte increase) for
the structure.  What am I doing wrong?  Can you check?

> This is a waste of memory, imo, seeing as the vast majority 
> of systems out there will not be doing 4GB+ ios any time soon.

Yep, I agree.  I was modifying get_block() to take b_size as the amount
of disk mapping requested, so I can get rid of ->get_blocks() and also
add support for mpage_readpages() and mpage_writepages() to deal with
multiple blocks.  XFS seems to be requesting >4GB IOs, but is doing 4GB
IOs through DIO today and is planning on bigger IOs in the future.
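
To make that calling convention concrete, here is a minimal stand-alone
sketch with stand-in types (not the kernel's buffer_head or get_block_t,
purely an illustration of b_size as an in/out parameter):

#include <stdio.h>
#include <stddef.h>

struct fake_bh {
	unsigned long long blocknr;	/* start block of the mapping */
	size_t b_size;			/* in: bytes wanted, out: bytes mapped */
};

/* Pretend filesystem that can map at most 8 contiguous 4K blocks per call. */
static int fake_get_block(unsigned long long iblock, struct fake_bh *bh)
{
	size_t max_extent = 8 * 4096;

	bh->blocknr = iblock + 1000;		/* some made-up on-disk location */
	if (bh->b_size > max_extent)
		bh->b_size = max_extent;	/* report how much was actually mapped */
	return 0;
}

int main(void)
{
	struct fake_bh bh;

	bh.b_size = 64 * 4096;			/* caller asks for a 64-block mapping */
	if (fake_get_block(0, &bh) == 0)
		printf("mapped %zu bytes starting at block %llu\n",
		       bh.b_size, bh.blocknr);
	return 0;
}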

Since "buffer_head" is not a primary structure used for IO anymore, and
this change affects only 64-bit machines, is it worth doing this?

> Also, the cast to unsigned long long for size_t is pretty atrocious.  
> Cast to unsigned long if anything, as size_t is unsigned long on all 
> platforms linux runs on.

Okay. Changed it to (unsigned long) instead. I was just following
sector_t :(

Thanks,
Badari

Increase the size of the buffer_head b_size field for 64 bit
platforms.  Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-23 09:00:06.000000000 -0800
@@ -46,20 +46,23 @@ struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
 
 /*
- * Keep related fields in common cachelines.  The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
  */
 struct buffer_head {
-	/* First cache line: */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */
-	atomic_t b_count;		/* users using this block */
-	u32 b_size;			/* block size */
+	char *b_data;			/* pointer to data within the page */
 
-	sector_t b_blocknr;		/* block number */
-	char *b_data;			/* pointer to data block */
+	size_t b_size;			/* size of mapping */
+	sector_t b_blocknr;		/* start block number */
+	atomic_t b_count;		/* users using this buffer_head */
 
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-23 08:55:18.000000000 -0800
@@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
 		printk("__find_get_block_slow() failed. "
 			"block=%llu, b_blocknr=%llu\n",
 			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
-		printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+		printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
+				(unsigned long)bh->b_size);
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
 out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c	2006-02-23 08:56:17.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
 	char b[BDEVNAME_SIZE];
 
 	sprintf(buf,
-		"dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-		bdevname(bh->b_bdev, b), bh->b_size,
+		"dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
+		bdevname(bh->b_bdev, b), (unsigned long)bh->b_size,
 		(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
 		bh->b_state, bh->b_page,
 		buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c	2006-02-23 08:56:55.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
 	BUG_ON(!bh);
 	BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));
 
-	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n",
 		   (unsigned long long)bh->b_blocknr, type,
 		   (type == OCFS2_JOURNAL_ACCESS_CREATE) ?
 		   "OCFS2_JOURNAL_ACCESS_CREATE" :
 		   "OCFS2_JOURNAL_ACCESS_WRITE",
-		   bh->b_size);
+		   (unsigned long)bh->b_size);
 
 	/* we can safely remove this assertion after testing. */
 	if (!buffer_uptodate(bh)) {


[-- Attachment #2: increase-bsize.patch --]
[-- Type: text/x-patch, Size: 4519 bytes --]

Increase the size of the buffer_head b_size field for 64 bit
platforms.  Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-23 09:00:06.000000000 -0800
@@ -46,20 +46,23 @@ struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
 
 /*
- * Keep related fields in common cachelines.  The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
  */
 struct buffer_head {
-	/* First cache line: */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */
-	atomic_t b_count;		/* users using this block */
-	u32 b_size;			/* block size */
+	char *b_data;			/* pointer to data within the page */
 
-	sector_t b_blocknr;		/* block number */
-	char *b_data;			/* pointer to data block */
+	size_t b_size;			/* size of mapping */
+	sector_t b_blocknr;		/* start block number */
+	atomic_t b_count;		/* users using this buffer_head */
 
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-23 08:55:18.000000000 -0800
@@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
 		printk("__find_get_block_slow() failed. "
 			"block=%llu, b_blocknr=%llu\n",
 			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
-		printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+		printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
+				(unsigned long)bh->b_size);
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
 out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c	2006-02-23 08:56:17.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
 	char b[BDEVNAME_SIZE];
 
 	sprintf(buf,
-		"dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-		bdevname(bh->b_bdev, b), bh->b_size,
+		"dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
+		bdevname(bh->b_bdev, b), (unsigned long)bh->b_size,
 		(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
 		bh->b_state, bh->b_page,
 		buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c	2006-02-23 08:56:55.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
 	BUG_ON(!bh);
 	BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));
 
-	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n",
 		   (unsigned long long)bh->b_blocknr, type,
 		   (type == OCFS2_JOURNAL_ACCESS_CREATE) ?
 		   "OCFS2_JOURNAL_ACCESS_CREATE" :
 		   "OCFS2_JOURNAL_ACCESS_WRITE",
-		   bh->b_size);
+		   (unsigned long)bh->b_size);
 
 	/* we can safely remove this assertion after testing. */
 	if (!buffer_uptodate(bh)) {

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 16:32             ` Benjamin LaHaise
  2006-02-23 17:20               ` Badari Pulavarty
@ 2006-02-23 17:28               ` Badari Pulavarty
  2006-02-23 17:29                 ` Benjamin LaHaise
  2006-02-23 17:40                 ` Badari Pulavarty
  1 sibling, 2 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 17:28 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > Here is the updated version of the patch, which changes
> > buffer_head.b_size to size_t to support mapping large
> > amount of disk blocks (for large IOs).
> 
> Your patch doesn't seem to be inline, so I can't quote it.  Several 
> problems: on 64 bit platforms you introduced 4 bytes of padding into 
> buffer_head.  atomic_t only takes up 4 byte, while size_t is 8 byte 
> aligned. 


Ignore my previous mail.

How about doing this?  Change b_state to u32 and change b_size
to "size_t".  That way, we don't increase the overall size of
the structure on 64-bit machines, do we?

Thanks,
Badari


struct buffer_head {
        u32 b_state;                    /* buffer state bitmap (see above) */
        atomic_t b_count;               /* users using this buffer_head */
        struct buffer_head *b_this_page;/* circular list of page's buffers */
        struct page *b_page;            /* the page this bh is mapped to */
        size_t b_size;                  /* size of mapping */

        sector_t b_blocknr;             /* start block number */
        char *b_data;                   /* pointer to data within the page */

        struct block_device *b_bdev;
        bh_end_io_t *b_end_io;          /* I/O completion */
        void *b_private;                /* reserved for b_end_io */
        struct list_head b_assoc_buffers; /* associated with another mapping */
};


Thanks,
Badari


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 17:28               ` Badari Pulavarty
@ 2006-02-23 17:29                 ` Benjamin LaHaise
  2006-02-23 18:46                   ` Badari Pulavarty
  2006-02-23 17:40                 ` Badari Pulavarty
  1 sibling, 1 reply; 32+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 17:29 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote:
> How about doing this ? Change b_state to u32 and change b_size
> to "size_t". This way, we don't increase the overall size of
> the structure on 64-bit machines. Isn't it ?

I was thinking of that too, but that doesn't work with the bit operations
on big-endian machines (which require unsigned long).  Oh well, I guess
that even with adding an extra 8 bytes on x86-64 we're still at 96
bytes, or 92 bytes if the atomic_t is put at the end of the struct.
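
A small arithmetic sketch of why (not kernel code): the generic bitops
operate on unsigned long words, and on big-endian the low-order bits
live in the last byte of the word, which for an 8-byte long falls
outside a 4-byte field.

#include <stdio.h>

/* On a big-endian machine, bit 0 of a word lives in its last byte. */
static unsigned int be_byte_holding_bit0(unsigned int word_bytes)
{
	return word_bytes - 1;
}

int main(void)
{
	/* A u32 b_state spans bytes 0-3; the bitops' 8-byte long does not. */
	printf("bit 0 of a 4-byte field is in byte %u\n", be_byte_holding_bit0(4));
	printf("bit 0 of an 8-byte long  is in byte %u\n", be_byte_holding_bit0(8));
	return 0;
}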

		-ben
-- 
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here 
and they've asked us to stop the party."  Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 17:28               ` Badari Pulavarty
  2006-02-23 17:29                 ` Benjamin LaHaise
@ 2006-02-23 17:40                 ` Badari Pulavarty
  1 sibling, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 17:40 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, 2006-02-23 at 09:28 -0800, Badari Pulavarty wrote:
> On Thu, 2006-02-23 at 11:32 -0500, Benjamin LaHaise wrote:
> > On Thu, Feb 23, 2006 at 08:28:12AM -0800, Badari Pulavarty wrote:
> > > Here is the updated version of the patch, which changes
> > > buffer_head.b_size to size_t to support mapping large
> > > amount of disk blocks (for large IOs).
> > 
> > Your patch doesn't seem to be inline, so I can't quote it.  Several 
> > problems: on 64 bit platforms you introduced 4 bytes of padding into 
> > buffer_head.  atomic_t only takes up 4 byte, while size_t is 8 byte 
> > aligned. 
> 
> 
> Ignore my previous mail.
> 
> How about doing this ? Change b_state to u32 and change b_size
> to "size_t". This way, we don't increase the overall size of
> the structure on 64-bit machines. Isn't it ?

I hate to correct myself again, but this won't work either.
If we do this, we can't use the bit_spin_lock() helpers any more
to do bh_state manipulation :(

Yes, the bottom line is that we would increase the size of the
structure by 8 bytes on 64-bit machines.  I don't see any way out
of it.  But this would provide the ability to let the filesystems
know that we are dealing with large (> 4GB) IOs (maybe they can
allocate as much as possible contiguously), even if we don't
really do IOs that big.

Thanks,
Badari


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] change b_size to size_t
  2006-02-23 17:29                 ` Benjamin LaHaise
@ 2006-02-23 18:46                   ` Badari Pulavarty
  0 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 18:46 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrew Morton, Nathan Scott, christoph, mcao, lkml, linux-fsdevel

On Thu, 2006-02-23 at 12:29 -0500, Benjamin LaHaise wrote:
> On Thu, Feb 23, 2006 at 09:28:58AM -0800, Badari Pulavarty wrote:
> > How about doing this ? Change b_state to u32 and change b_size
> > to "size_t". This way, we don't increase the overall size of
> > the structure on 64-bit machines. Isn't it ?
> 
> I was thinking of that too, but that doesn't work with the bit operations 
> on big endian machines (which require unsigned long).  Oh well, I guess 
> that even with adding an extra 8 bytes on x86-64 we're still at the 96 
> bytes, or 92 bytes if the atomic_t is put at the end of the struct.
> 
> 		-ben

Okay, here is the final version (hopefully).  I moved atomic_t to
the end of the structure.  I still see 96 bytes on x86-64 as well :(


buffer_head        40425  40760     96   40    1 : tunables  120   60    8 : slabdata   1019   1019      0

Thanks,
Badari

Increase the size of the buffer_head b_size field for 64 bit
platforms.  Update some old and moldy comments in and around
the structure as well.

The b_size increase allows us to perform larger mappings and
allocations for large I/O requests from userspace, which tie
in with other changes allowing the get_block_t() interface to
map multiple blocks at once.

Signed-off-by: Nathan Scott <nathans@sgi.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>

Index: linux-2.6.16-rc4/include/linux/buffer_head.h
===================================================================
--- linux-2.6.16-rc4.orig/include/linux/buffer_head.h	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/include/linux/buffer_head.h	2006-02-23 10:15:33.000000000 -0800
@@ -46,25 +46,28 @@ struct address_space;
 typedef void (bh_end_io_t)(struct buffer_head *bh, int uptodate);
 
 /*
- * Keep related fields in common cachelines.  The most commonly accessed
- * field (b_state) goes at the start so the compiler does not generate
- * indexed addressing for it.
+ * Historically, a buffer_head was used to map a single block
+ * within a page, and of course as the unit of I/O through the
+ * filesystem and block layers.  Nowadays the basic I/O unit
+ * is the bio, and buffer_heads are used for extracting block
+ * mappings (via a get_block_t call), for tracking state within
+ * a page (via a page_mapping) and for wrapping bio submission
+ * for backward compatibility reasons (e.g. submit_bh).
  */
 struct buffer_head {
-	/* First cache line: */
 	unsigned long b_state;		/* buffer state bitmap (see above) */
 	struct buffer_head *b_this_page;/* circular list of page's buffers */
 	struct page *b_page;		/* the page this bh is mapped to */
-	atomic_t b_count;		/* users using this block */
-	u32 b_size;			/* block size */
 
-	sector_t b_blocknr;		/* block number */
-	char *b_data;			/* pointer to data block */
+	sector_t b_blocknr;		/* start block number */
+	size_t b_size;			/* size of mapping */
+	char *b_data;			/* pointer to data within the page */
 
 	struct block_device *b_bdev;
 	bh_end_io_t *b_end_io;		/* I/O completion */
  	void *b_private;		/* reserved for b_end_io */
 	struct list_head b_assoc_buffers; /* associated with another mapping */
+	atomic_t b_count;		/* users using this buffer_head */
 };
 
 /*
Index: linux-2.6.16-rc4/fs/buffer.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/buffer.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/buffer.c	2006-02-23 08:55:18.000000000 -0800
@@ -432,7 +432,8 @@ __find_get_block_slow(struct block_devic
 		printk("__find_get_block_slow() failed. "
 			"block=%llu, b_blocknr=%llu\n",
 			(unsigned long long)block, (unsigned long long)bh->b_blocknr);
-		printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
+		printk("b_state=0x%08lx, b_size=%lu\n", bh->b_state,
+				(unsigned long)bh->b_size);
 		printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);
 	}
 out_unlock:
Index: linux-2.6.16-rc4/fs/reiserfs/prints.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/reiserfs/prints.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/reiserfs/prints.c	2006-02-23 08:56:17.000000000 -0800
@@ -143,8 +143,8 @@ static void sprintf_buffer_head(char *bu
 	char b[BDEVNAME_SIZE];
 
 	sprintf(buf,
-		"dev %s, size %d, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
-		bdevname(bh->b_bdev, b), bh->b_size,
+		"dev %s, size %ld, blocknr %llu, count %d, state 0x%lx, page %p, (%s, %s, %s)",
+		bdevname(bh->b_bdev, b), (unsigned long)bh->b_size,
 		(unsigned long long)bh->b_blocknr, atomic_read(&(bh->b_count)),
 		bh->b_state, bh->b_page,
 		buffer_uptodate(bh) ? "UPTODATE" : "!UPTODATE",
Index: linux-2.6.16-rc4/fs/ocfs2/journal.c
===================================================================
--- linux-2.6.16-rc4.orig/fs/ocfs2/journal.c	2006-02-17 14:23:45.000000000 -0800
+++ linux-2.6.16-rc4/fs/ocfs2/journal.c	2006-02-23 08:56:55.000000000 -0800
@@ -377,12 +377,12 @@ int ocfs2_journal_access(struct ocfs2_jo
 	BUG_ON(!bh);
 	BUG_ON(!(handle->flags & OCFS2_HANDLE_STARTED));
 
-	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %hu\n",
+	mlog_entry("bh->b_blocknr=%llu, type=%d (\"%s\"), bh->b_size = %lu\n",
 		   (unsigned long long)bh->b_blocknr, type,
 		   (type == OCFS2_JOURNAL_ACCESS_CREATE) ?
 		   "OCFS2_JOURNAL_ACCESS_CREATE" :
 		   "OCFS2_JOURNAL_ACCESS_WRITE",
-		   bh->b_size);
+		   (unsigned long)bh->b_size);
 
 	/* we can safely remove this assertion after testing. */
 	if (!buffer_uptodate(bh)) {



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/3] map multiple blocks for mpage_readpages()
  2006-02-23  3:29   ` Suparna Bhattacharya
@ 2006-02-23 22:18     ` Badari Pulavarty
  0 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-23 22:18 UTC (permalink / raw)
  To: suparna; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel

On Thu, 2006-02-23 at 08:59 +0530, Suparna Bhattacharya wrote:
> On Mon, Feb 20, 2006 at 01:23:55PM -0800, Badari Pulavarty wrote:
> > [PATCH 2/3] map multiple blocks for mpage_readpages()
> > 
> > This patch changes mpage_readpages() and get_block() to
> > get the disk mapping information for multiple blocks at the
> > same time.
> > 
> > b_size represents the amount of disk mapping that needs to be
> > mapped.  On a successful get_block(), b_size indicates the
> > amount of disk mapping that's actually mapped.  Only the
> > filesystems that care to use this information and provide multiple
> > disk blocks at a time can choose to do so.
> > 
> > No changes are needed for filesystems that want to ignore this.
> 
> That's a nice way to handle this !
> 
> Just one minor suggestion on the patches - instead of adding all those
> arguments to do_mpage_readpage, have you considered using
> an mio structure as we did in the mpage_writepages multiple blocks case ?

I did think about moving the state into an "mio" structure and passing it
around.  Then I realized I would need to think hard about mpage_writepages()
as well to come up with a superset "mio".  For now, I settled on plain
arguments; when we clean up mpage_writepages() we can take a pass at
cleaning up mpage_readpages() too.  (BTW, only 2 more arguments are added.)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-22 15:12 ` christoph
  2006-02-22 16:58   ` Badari Pulavarty
@ 2006-02-24 17:19   ` Badari Pulavarty
  2006-03-06 10:03     ` Suparna Bhattacharya
  1 sibling, 1 reply; 32+ messages in thread
From: Badari Pulavarty @ 2006-02-24 17:19 UTC (permalink / raw)
  To: christoph; +Cc: mcao, akpm, lkml, linux-fsdevel, vs, zam

On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
..
> Thanks Badari, with that interface changes the mpage_readpage changes
> look a lot nicer than my original version.  I'd like to second
> the request to put it into -mm. 
> 
> And if the namesys folks could try out whether this works for their
> reiser4 requirements it'd be nice.  If you have an even faster
> ->readpages I'd be interested in that secret sauce recipe for
> further improvement to mpage_readpages.

I don't have any secret recipes, but I was thinking of re-organizing
the code a little.  The complexity is due to the "confused" case and
the "blocksize < pagesize" cases, plus going in and out of the worker
routine with stored state.

I am thinking of having a "fast path" which doesn't deal with any
of those and a "slow path" to deal with all that nonsense.
Something like ..

mpage_readpages()
{
	if (block-size < page-size)
		slow_path;

	while (nr-pages) {
		if (get_block(bh)) 
			slow_path;
		if (uptodate(bh))
			slow_path;
		while (bh.b_size) {
			if (not contig)
				submit bio();
			add all the pages we can to bio();
			bh.b_size -= size-of-pages-added;
			nr_pages -= count-of-pages-added;
		}

	}
}

slow_path is going to be slow & ugly.  How important is it to handle
1k and 2k blocksize filesystems efficiently?  Should I try?
	
Thanks,
Badari


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-02-24 17:19   ` Badari Pulavarty
@ 2006-03-06 10:03     ` Suparna Bhattacharya
  2006-03-06 22:39       ` Nathan Scott
  0 siblings, 1 reply; 32+ messages in thread
From: Suparna Bhattacharya @ 2006-03-06 10:03 UTC (permalink / raw)
  To: Badari Pulavarty; +Cc: christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
> On Wed, 2006-02-22 at 16:12 +0100, christoph wrote:
> ..
> > Thanks Badari, with that interface changes the mpage_readpage changes
> > look a lot nicer than my original version.  I'd like to second
> > the request to put it into -mm. 
> > 
> > And if the namesys folks could try out whether this works for their
> > reiser4 requirements it'd be nice.  If you have an even faster
> > ->readpages I'd be interested in that secret sauce recipe for
> > further improvement to mpage_readpages.
> 
> I don't have any secret receipes, but I was thinking of re-organizing
> the code a little. Complexity is due to "confused" case and 
> "blocksize < pagesize" cases + going in-and-out of the worker routine
> with stored state.
> 
> I am thinking of having a "fast path" which doesn't deal with any
> of those and "slow" path to deal with all that non-sense.
> Something like ..
> 
> mpage_readpages()
> {
> 	if (block-size < page-size)
> 		slow_path;
> 
> 	while (nr-pages) {
> 		if (get_block(bh)) 
> 			slow_path;
> 		if (uptodate(bh))
> 			slow_path;
> 		while (bh.b_size) {
> 			if (not contig)
> 				submit bio();
> 			add all the pages we can to bio();
> 			bh.b_size -= size-of-pages-added;
> 			nr_pages -= count-of-pages-added;
> 		}
> 
> 	}
> }
> 
> slow_path is going to be slow & ugly. How important is to handle
> 1k, 2k filesystems efficiently ? Should I try ?

With a 64K page size that could include 4K, 8K, 16K and 32K block size
filesystems as well; not sure how likely that would be?

Regards
Suparna

> 	
> Thanks,
> Badari
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Suparna Bhattacharya (suparna@in.ibm.com)
Linux Technology Center
IBM Software Lab, India


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-03-06 10:03     ` Suparna Bhattacharya
@ 2006-03-06 22:39       ` Nathan Scott
  2006-03-07  9:00         ` Badari Pulavarty
  0 siblings, 1 reply; 32+ messages in thread
From: Nathan Scott @ 2006-03-06 22:39 UTC (permalink / raw)
  To: Suparna Bhattacharya
  Cc: Badari Pulavarty, christoph, mcao, akpm, lkml, linux-fsdevel, vs, zam

On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote:
> On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
> > 
> > I am thinking of having a "fast path" which doesn't deal with any
> > of those and "slow" path to deal with all that non-sense.
> > ...
> > slow_path is going to be slow & ugly. How important is to handle
> > 1k, 2k filesystems efficiently ? Should I try ?
> 
> With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems
> as well, not sure how likely that would be ?

A number of architectures have a pagesize greater than 4K.  Most
(OK, sample size of 2) mkfs programs default to using 4K blocksizes.
So, any/all platforms not having a 4K pagesize will be disadvantaged.
Search for the definition of PAGE_SHIFT in asm-*/page.h; for all
platforms where it's not defined to 12, this will hurt.
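
For a rough feel of what that means, a quick stand-alone calculation
(not kernel code) of how many per-block mappings a single page needs
for a few page/block size combinations:

#include <stdio.h>

int main(void)
{
	unsigned int page_shifts[] = { 12, 14, 16 };	/* 4K, 16K and 64K pages */

	for (unsigned int i = 0; i < 3; i++) {
		unsigned int page_shift = page_shifts[i];

		/* from 1K blocks (blkbits 10) up to blocksize == pagesize */
		for (unsigned int blkbits = 10; blkbits <= page_shift; blkbits++)
			printf("%2uK page, %2uK block: %2u blocks per page\n",
			       1u << (page_shift - 10), 1u << (blkbits - 10),
			       1u << (page_shift - blkbits));
	}
	return 0;
}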

cheers.

-- 
Nathan

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages()
  2006-03-06 22:39       ` Nathan Scott
@ 2006-03-07  9:00         ` Badari Pulavarty
  0 siblings, 0 replies; 32+ messages in thread
From: Badari Pulavarty @ 2006-03-07  9:00 UTC (permalink / raw)
  To: Nathan Scott
  Cc: Suparna Bhattacharya, christoph, mcao, akpm, lkml, linux-fsdevel,
	vs, zam



Nathan Scott wrote:

>On Mon, Mar 06, 2006 at 03:33:21PM +0530, Suparna Bhattacharya wrote:
>
>>On Fri, Feb 24, 2006 at 09:19:08AM -0800, Badari Pulavarty wrote:
>>
>>>I am thinking of having a "fast path" which doesn't deal with any
>>>of those and "slow" path to deal with all that non-sense.
>>>...
>>>slow_path is going to be slow & ugly. How important is to handle
>>>1k, 2k filesystems efficiently ? Should I try ?
>>>
>>With 64K page size that could include 4K, 8K, 16K, 32K block size filesystems
>>as well, not sure how likely that would be ?
>>
>
>A number of architectures have a pagesize greater than 4K.  Most
>(OK, sample size of 2) mkfs programs default to using 4K blocksizes.
>So, any/all platforms not having a 4K pagesize will be disadvantaged.
>Search on the definition of PAGE_SHIFT in asm-*/page.h and for all
>platforms where its not defined to 12, this will hurt.
>

I agree.  I haven't made up my mind either on whether it's really worth doing.

I was hoping that it would help the simple cases, and that it would also
help filesystems which use page->private for something other than buffer
heads.

Thanks,
Badari

>




^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2006-03-07  1:01 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-20 21:21 [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Badari Pulavarty
2006-02-20 21:23 ` [PATCH 1/3] pass b_size to ->get_block() Badari Pulavarty
2006-02-20 21:23 ` [PATCH 2/3] map multiple blocks for mpage_readpages() Badari Pulavarty
2006-02-23  3:29   ` Suparna Bhattacharya
2006-02-23 22:18     ` Badari Pulavarty
2006-02-20 21:24 ` [PATCH 3/3] remove ->get_blocks() support Badari Pulavarty
2006-02-20 21:59 ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Nathan Scott
2006-02-20 23:06   ` Badari Pulavarty
2006-02-20 23:16     ` Nathan Scott
2006-02-21  2:41   ` Jeremy Higdon
2006-02-21 16:03     ` Badari Pulavarty
2006-02-21 21:39       ` Nathan Scott
2006-02-22 15:12 ` christoph
2006-02-22 16:58   ` Badari Pulavarty
2006-02-22 16:59     ` christoph
2006-02-23  1:40       ` Nathan Scott
2006-02-23  1:59         ` Andrew Morton
2006-02-23 16:28           ` [PATCH] change b_size to size_t Badari Pulavarty
2006-02-23 16:32             ` Benjamin LaHaise
2006-02-23 17:20               ` Badari Pulavarty
2006-02-23 17:28               ` Badari Pulavarty
2006-02-23 17:29                 ` Benjamin LaHaise
2006-02-23 18:46                   ` Badari Pulavarty
2006-02-23 17:40                 ` Badari Pulavarty
2006-02-23 16:40             ` Dave Kleikamp
2006-02-22 17:23     ` [PATCH 0/3] map multiple blocks in get_block() and mpage_readpages() Peter Staubach
2006-02-22 18:37     ` Dave Kleikamp
2006-02-22 19:00       ` Badari Pulavarty
2006-02-24 17:19   ` Badari Pulavarty
2006-03-06 10:03     ` Suparna Bhattacharya
2006-03-06 22:39       ` Nathan Scott
2006-03-07  9:00         ` Badari Pulavarty
