linux-kernel.vger.kernel.org archive mirror
* [00/36] Large Blocksize Support V6
@ 2007-08-28 19:05 clameter
  2007-08-28 19:05 ` [01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user clameter
                   ` (37 more replies)
  0 siblings, 38 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[An update before the Kernel Summit because of the numerous requests that I
have had for this patchset. Please speak up if you feel that we need something
like this.]

This patchset modifies the Linux kernel so that larger block sizes than
page size can be supported. Larger block sizes are handled by using
compound pages of an arbitrary order for the page cache instead of
single pages with order 0.

Support is added in a way that limits the changes to existing code.
As a result filesystems can support I/O using large buffers with minimal
changes.

The page cache functions are mostly unchanged. Instead of a page struct
representing a single page they take a head page struct (which looks
the same as a regular page struct apart from the compound flags) and
operate on those. Most page cache functions can stay as they are.

No locking protocols are added or modified.

The support is also fully transparent at the level of the OS. No
specialized heuristics are added to switch to larger pages. Large
page support is enabled by filesystems or device drivers when a device
or volume is mounted. Larger block sizes are usually set during volume
creation although the patchset supports setting these sizes per file.
The formatted partition will then always be accessed with the
configured blocksize.

Some of the changes are:

- Replace the use of PAGE_CACHE_XXX constants to calculate offsets into
  pages with functions that do the same and allow the constants to
  be parameterized.

- Extend the capabilities of compound pages so that they can be
  put onto the LRU and reclaimed.

- Allow setting a larger blocksize via set_blocksize()

Rationales:
-----------

1. The ability to handle memory of an arbitrarily large size using
   a single page struct "handle" is essential for scaling memory handling
   and reducing overhead in multiple kernel subsystems. This patchset
   is a strategic move that allows performance gains throughout the
   kernel.

2. Reduce fsck times. Larger block sizes mean faster file system checking.
   Using 64k block size will reduce the number of blocks to be managed
   by a factor of 16 and produce much denser and contiguous metadata.

3. Performance. If we look at IA64 vs. x86_64 then it seems that the
   faster interrupt handling on x86_64 compensates for the speed loss due to
   a smaller page size (4k vs 16k on IA64). Supporting larger block sizes
   on all arches allows a significant reduction in I/O overhead and increases
   the size of I/O that can be performed by hardware in a single request
   since the number of scatter gather entries is typically limited for
   one request. This is going to become increasingly important for supporting
   the ever growing memory sizes since we may have to handle excessively
   large numbers of 4k requests for data sizes that may become common
   soon. For example, to write a 1 terabyte file the kernel would have to
   handle 256 million 4k chunks.

4. Cross arch compatibility: It is currently not possible to mount
   a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.
   With this patch this becomes possible. Note that this also means that
   some filesystems are already capable of working with blocksizes of
   up to 64k (ext2, XFS) which is currently only available on a select
   few arches. This patchset enables that functionality on all arches.
   There are no special modifications needed to the filesystems. The
   set_blocksize() function call will simply support a larger blocksize.

5. VM scalability
   Large block sizes mean less state keeping for the information being
   transferred. For a 1TB file one needs to handle 256 million page
   structs in the VM if one uses 4k page size. A 64k page size reduces
   that amount to 16 million. If the limitations in existing filesystems
   are removed then even higher reductions become possible. For very
   large files like that a page size of 2 MB may be beneficial, which
   will reduce the number of page structs to handle to 512k. The variable
   nature of the block size means that the size can be tuned at file
   system creation time for the anticipated needs on a volume.

6. IO scalability
   The IO layer will receive large blocks of contiguous memory with
   this patchset. This means that fewer scatter gather elements are needed
   and the memory used is guaranteed to be contiguous. Instead of having
   to handle 4k chunks we can f.e. handle 64k chunks in one go.

7. Limited scatter gather support restricts I/O sizes.

   A lot of I/O controllers are limited in the number of scatter gather
   elements that they support. For example a controller that supports 128
   entries in the scatter gather lists can only perform I/O of 128*4k =
   512k in one go. If the blocksize is larger (f.e. 64k) then we can perform
   larger I/O transfers. If we support 128 entries then 128*64k = 8M
   can be transferred in one transaction.

   Dave Chinner measured a performance increase of 50% when going to 64k
   blocksize with XFS with an earlier version of this patchset.

8. We have problems supporting devices with a higher blocksize than
   page size. This is for example important to support CD and DVDs that
   can only read and write 32k or 64k blocks. We currently have a shim
   layer in there to deal with this situation which limits the speed
   of I/O. The developers are currently looking for ways to completely
   bypass the page cache because of this deficiency.

9. 32/64k blocksize is also used in flash devices. Same issues.

10. Future harddisks will support bigger block sizes than Linux can
   currently support since we are limited to PAGE_SIZE. Granted, the on board
   cache may buffer this for us but what is the point of handling smaller
   page sizes than what the drive supports?


Acceptance issues:
------------------

The patchset is a pretty significant change to the way that the Linux kernel
operates. I have tried to keep the changes as minimal as possible and I
believe that this is a reasonable start to introduce large block I/O
capabilities.

The Linux VM is gradually acquiring abilities to defragment memory. These
capabilities are partially present for 2.6.23. Later versions may merge
more of the defragmentation work. The use of large pages may cause
significant fragmentation to memory. Without proper defragmentation support
these patches cannot work reliably and may cause OOMs (although I have
rarely seen those, and I have seen none with 2.6.23 when testing with 16k
blocksize; possibly the effect of the limited defragmentation capabilities
in 2.6.23 is already sufficient). Beware: larger blocksizes can cause
more fragmentation.

A number of key developers are hesitant about this functionality given the
problems that we have had in the past with memory defragmentation and the
invasiveness of the patchset. It is to be expected that it will take a while
until confidence in the defragmentation logic builds up. So the patchset
may still have to exist for a long time unmerged. However, it is necessary
that work on this patchset continue even outside of the kernel tree in order
to mature this patchset until it can be merged at some point in the future.

The most serious shortcoming of this patchset is the lack of mmap support.
This means f.e. that it is not possible to execute binaries. Adding
mmap support would mean more changes to the VM. In order to do that some
support by multiple developers is likely needed. The current idea is to
allow the mapping of 4K segments of larger pages to preserve mmap semantics.
This means that application programs could still map pages in 4k chunks and
the VM would provide a mapping into a subsection of a larger page.

How to make this patchset work:
-------------------------------

1. Apply this patchset or do a

git pull
  git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git
  largeblock

(The git archive is used to keep the patchset up to date. Please send patches
against the git tree)

2. Enable LARGE_BLOCKSIZE Support
3. Compile kernel

In order to use a filesystem with a larger blocksize it needs to be formatted
for that larger blocksize. This is done using the mkfs.xxx tool for each
filesystem. Surprisingly the existing tools work without modification. These
formatting tools may warn you that the blocksize you specify is not supported
on your particular architecture. Ignore that warning since this is no longer
true after you have applied this patchset.

Tested file systems:

Filesystem	Max Blocksize	Changes

Reiserfs	8k		Page size functions
Ext2		64k		Page size functions
XFS		64k		Page size functions / Remove PAGE_SIZE check
Ramfs		MAX_ORDER	Parameter to specify order

Todo/Issues:

- There are certainly numerous issues with this patch. I have only tested
  copying files back and forth, volume creation etc. Others have run
  fsxlinux on the volumes. The missing mmap support limits what can be
  done for now.

- ZONE_MOVABLE is available in 2.6.23. Using the kernelcore=xxx as a kernel
  parameter enables an area where defragmentation can work. This may be
  necessary to avoid OOMs although I have seen no problems with up to 32k
  blocksize even without that measure.

- The antifragmentation patches in Andrew's tree address more fragmentation
  issues. However, large orders may still lead to fragmentation
  of the movable sections. Memory compaction is still not merged and will
  likely be needed to reliably support even larger orders of 256k or more.
  How memory compaction impacts performance still has to be determined.

- Support for bouncing pages.

- Remove PAGE_CACHE_xxx constants after using page_cache_xxx functions
  everywhere. But that will have to wait until merging becomes possible.
  For now certain subsystems (shmem f.e.) are not using these functions.
  They will only use order 0 pages.

- Support for non harddisk based filesystems. Remove the pktdvd etc
  layers needed because the VM currently does not support sufficiently
  large blocksizes for these devices. Look for other places in the kernel
  where we have similar issues.

- Mmap support

V5->V6:
- Rediff against 2.6.23-rc4
- Fix breakage introduced by updates to reiserfs
- Readahead fixes by Fengguang Wu <fengguang.wu@gmail.com>
- Provide a git tree that is kept up to date

V4->V5:
- Diff against 2.6.22-rc6-mm1
- provide test tree on ftp.kernel.org:/pub/linux

V3->V4
- It is possible to transparently make filesystems support larger
  blocksizes by simply allowing larger blocksizes in set_blocksize.
  Remove all special modifications for mmap etc from the filesystems.
  This now makes 3 disk based filesystems that can use larger blocks
  (reiser, ext2, xfs). Are there any other useful ones to make work?
- Patch against 2.6.22-rc4-mm2 which allows the use of Mel's antifrag
  logic to avoid fragmentation.
- More page cache cleanup by applying the functions to filesystems.
- Disable bouncing when the gfp mask is setup.
- Disable mmap directly in mm/filemap.c to avoid filesystem changes
  while we have no mmap support for higher order pages.

RFC V2->V3
- More restructuring
- It actually works!
- Add XFS support
- Fix up UP support
- Work out the direct I/O issues
- Add CONFIG_LARGE_BLOCKSIZE. Off by default which makes the inlines revert
  back to constants. Disabled for 32bit and HIGHMEM configurations.
  This also allows a gradual migration to the new page cache
  inline functions. LARGE_BLOCKSIZE capabilities can be
  added gradually and if there is a problem then we can disable
  a subsystem.

RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

-- 


* [01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [02/36] Define functions for page cache handling clameter
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0001-Pagecache-zeroing-zero_user_segment-zero_user_segm.patch --]
[-- Type: text/plain, Size: 28592 bytes --]

Simplify page cache zeroing of segments of pages through 3 functions

zero_user_segments(page, start1, end1, start2, end2)

        Zeros two segments of the page. It takes the position where to
        start and end the zeroing which avoids length calculations.

zero_user_segment(page, start, end)

        Same for a single segment.

zero_user(page, start, length)

        Length variant for the case where we know the length.

We remove the zero_user_page macro. Issues:

1. It's a macro. Inline functions are preferable.

2. The KM_USER0 macro is only defined for HIGHMEM.

   Having to treat this special case everywhere makes the
   code needlessly complex. The parameter for zeroing is always
   KM_USER0 except in one single case that we open code.

Avoiding KM_USER0 means a lot of code no longer has to deal
with the special casing for HIGHMEM. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.

Since KM_USER0 depends on HIGHMEM, the existing zero_user_page
function could not be an inline function. The zero_user_* functions
introduced here can be inline because that constant is not used when
these functions are called.

Also extract the flushing of the caches to be outside of the kmap.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/block/loop.c       |    2 +-
 fs/buffer.c                |   48 +++++++++++++-----------------------------
 fs/cifs/inode.c            |    2 +-
 fs/direct-io.c             |    4 +-
 fs/ecryptfs/mmap.c         |    7 ++---
 fs/ext3/inode.c            |    4 +-
 fs/ext4/inode.c            |    4 +-
 fs/gfs2/bmap.c             |    2 +-
 fs/gfs2/ops_address.c      |    2 +-
 fs/libfs.c                 |   11 +++------
 fs/mpage.c                 |    7 +----
 fs/nfs/read.c              |   10 ++++----
 fs/nfs/write.c             |    2 +-
 fs/ntfs/aops.c             |   18 +++++++++-------
 fs/ntfs/file.c             |   32 +++++++++++++---------------
 fs/ocfs2/aops.c            |    6 ++--
 fs/reiserfs/inode.c        |    4 +-
 fs/xfs/linux-2.6/xfs_lrw.c |    2 +-
 include/linux/highmem.h    |   49 +++++++++++++++++++++++++++----------------
 mm/filemap_xip.c           |    2 +-
 mm/truncate.c              |    2 +-
 21 files changed, 104 insertions(+), 116 deletions(-)

Index: linux-2.6/drivers/block/loop.c
===================================================================
--- linux-2.6.orig/drivers/block/loop.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/drivers/block/loop.c	2007-08-27 19:22:17.000000000 -0700
@@ -251,7 +251,7 @@ static int do_lo_send_aops(struct loop_d
 			 */
 			printk(KERN_ERR "loop: transfer error block %llu\n",
 			       (unsigned long long)index);
-			zero_user_page(page, offset, size, KM_USER0);
+			zero_user(page, offset, size);
 		}
 		flush_dcache_page(page);
 		ret = aops->commit_write(file, page, offset,
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-08-27 19:22:17.000000000 -0700
@@ -1803,19 +1803,10 @@ static int __block_prepare_write(struct 
 					set_buffer_uptodate(bh);
 					continue;
 				}
-				if (block_end > to || block_start < from) {
-					void *kaddr;
-
-					kaddr = kmap_atomic(page, KM_USER0);
-					if (block_end > to)
-						memset(kaddr+to, 0,
-							block_end-to);
-					if (block_start < from)
-						memset(kaddr+block_start,
-							0, from-block_start);
-					flush_dcache_page(page);
-					kunmap_atomic(kaddr, KM_USER0);
-				}
+				if (block_end > to || block_start < from)
+					zero_user_segments(page,
+						to, block_end,
+						block_start, from);
 				continue;
 			}
 		}
@@ -1863,7 +1854,7 @@ static int __block_prepare_write(struct 
 			break;
 		if (buffer_new(bh)) {
 			clear_buffer_new(bh);
-			zero_user_page(page, block_start, bh->b_size, KM_USER0);
+			zero_user(page, block_start, bh->b_size);
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
 		}
@@ -1951,8 +1942,7 @@ int block_read_full_page(struct page *pa
 					SetPageError(page);
 			}
 			if (!buffer_mapped(bh)) {
-				zero_user_page(page, i * blocksize, blocksize,
-						KM_USER0);
+				zero_user(page, i * blocksize, blocksize);
 				if (!err)
 					set_buffer_uptodate(bh);
 				continue;
@@ -2116,8 +2106,7 @@ int cont_prepare_write(struct page *page
 						PAGE_CACHE_SIZE, get_block);
 		if (status)
 			goto out_unmap;
-		zero_user_page(new_page, zerofrom, PAGE_CACHE_SIZE - zerofrom,
-				KM_USER0);
+		zero_user_segment(new_page, zerofrom, PAGE_CACHE_SIZE);
 		generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
 		unlock_page(new_page);
 		page_cache_release(new_page);
@@ -2144,7 +2133,7 @@ int cont_prepare_write(struct page *page
 	if (status)
 		goto out1;
 	if (zerofrom < offset) {
-		zero_user_page(page, zerofrom, offset - zerofrom, KM_USER0);
+		zero_user_segment(page, zerofrom, offset);
 		__block_commit_write(inode, page, zerofrom, offset);
 	}
 	return 0;
@@ -2277,7 +2266,6 @@ int nobh_prepare_write(struct page *page
 	unsigned block_in_page;
 	unsigned block_start;
 	sector_t block_in_file;
-	char *kaddr;
 	int nr_reads = 0;
 	int i;
 	int ret = 0;
@@ -2317,13 +2305,8 @@ int nobh_prepare_write(struct page *page
 		if (PageUptodate(page))
 			continue;
 		if (buffer_new(&map_bh) || !buffer_mapped(&map_bh)) {
-			kaddr = kmap_atomic(page, KM_USER0);
-			if (block_start < from)
-				memset(kaddr+block_start, 0, from-block_start);
-			if (block_end > to)
-				memset(kaddr + to, 0, block_end - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
+			zero_user_segments(page, block_start, from,
+						to, block_end);
 			continue;
 		}
 		if (buffer_uptodate(&map_bh))
@@ -2389,7 +2372,7 @@ failed:
 	 * Error recovery is pretty slack.  Clear the page and mark it dirty
 	 * so we'll later zero out any blocks which _were_ allocated.
 	 */
-	zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+	zero_user(page, 0, PAGE_CACHE_SIZE);
 	SetPageUptodate(page);
 	set_page_dirty(page);
 	return ret;
@@ -2458,7 +2441,7 @@ int nobh_writepage(struct page *page, ge
 	 * the  page size, the remaining memory is zeroed when mapped, and
 	 * writes to that region are not written out to the file."
 	 */
-	zero_user_page(page, offset, PAGE_CACHE_SIZE - offset, KM_USER0);
+	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
@@ -2492,8 +2475,7 @@ int nobh_truncate_page(struct address_sp
 	to = (offset + blocksize) & ~(blocksize - 1);
 	ret = a_ops->prepare_write(NULL, page, offset, to);
 	if (ret == 0) {
-		zero_user_page(page, offset, PAGE_CACHE_SIZE - offset,
-				KM_USER0);
+		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 		/*
 		 * It would be more correct to call aops->commit_write()
 		 * here, but this is more efficient.
@@ -2572,7 +2554,7 @@ int block_truncate_page(struct address_s
 			goto unlock;
 	}
 
-	zero_user_page(page, offset, length, KM_USER0);
+	zero_user(page, offset, length);
 	mark_buffer_dirty(bh);
 	err = 0;
 
@@ -2618,7 +2600,7 @@ int block_write_full_page(struct page *p
 	 * the  page size, the remaining memory is zeroed when mapped, and
 	 * writes to that region are not written out to the file."
 	 */
-	zero_user_page(page, offset, PAGE_CACHE_SIZE - offset, KM_USER0);
+	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	return __block_write_full_page(inode, page, get_block, wbc);
 }
 
Index: linux-2.6/fs/cifs/inode.c
===================================================================
--- linux-2.6.orig/fs/cifs/inode.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/cifs/inode.c	2007-08-27 19:22:17.000000000 -0700
@@ -1353,7 +1353,7 @@ static int cifs_truncate_page(struct add
 	if (!page)
 		return -ENOMEM;
 
-	zero_user_page(page, offset, PAGE_CACHE_SIZE - offset, KM_USER0);
+	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	unlock_page(page);
 	page_cache_release(page);
 	return rc;
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/direct-io.c	2007-08-27 19:22:17.000000000 -0700
@@ -887,8 +887,8 @@ do_holes:
 					page_cache_release(page);
 					goto out;
 				}
-				zero_user_page(page, block_in_page << blkbits,
-						1 << blkbits, KM_USER0);
+				zero_user(page, block_in_page << blkbits,
+						1 << blkbits);
 				dio->block_in_file++;
 				block_in_page++;
 				goto next_block;
Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ecryptfs/mmap.c	2007-08-27 19:22:17.000000000 -0700
@@ -371,8 +371,7 @@ static int fill_zeros_to_end_of_page(str
 	end_byte_in_page = i_size_read(inode) % PAGE_CACHE_SIZE;
 	if (to > end_byte_in_page)
 		end_byte_in_page = to;
-	zero_user_page(page, end_byte_in_page,
-		PAGE_CACHE_SIZE - end_byte_in_page, KM_USER0);
+	zero_user_segment(page, end_byte_in_page, PAGE_CACHE_SIZE);
 out:
 	return 0;
 }
@@ -422,7 +421,7 @@ static int ecryptfs_prepare_write(struct
 			}
 		}
 		if (end_of_prev_pg_pos + 1 > i_size_read(page->mapping->host))
-			zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+			zero_user(page, 0, PAGE_CACHE_SIZE);
 	}
 out:
 	return rc;
@@ -790,7 +789,7 @@ ecryptfs_write_zeros(struct file *file, 
 		page_cache_release(tmp_page);
 		goto out;
 	}
-	zero_user_page(tmp_page, start, num_zeros, KM_USER0);
+	zero_user(tmp_page, start, num_zeros);
 	rc = ecryptfs_commit_write(file, tmp_page, start, start + num_zeros);
 	if (rc < 0) {
 		ecryptfs_printk(KERN_ERR, "Error attempting to write zero's "
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ext3/inode.c	2007-08-27 19:22:17.000000000 -0700
@@ -1778,7 +1778,7 @@ static int ext3_block_truncate_page(hand
 	 */
 	if (!page_has_buffers(page) && test_opt(inode->i_sb, NOBH) &&
 	     ext3_should_writeback_data(inode) && PageUptodate(page)) {
-		zero_user_page(page, offset, length, KM_USER0);
+		zero_user(page, offset, length);
 		set_page_dirty(page);
 		goto unlock;
 	}
@@ -1831,7 +1831,7 @@ static int ext3_block_truncate_page(hand
 			goto unlock;
 	}
 
-	zero_user_page(page, offset, length, KM_USER0);
+	zero_user(page, offset, length);
 	BUFFER_TRACE(bh, "zeroed end of block");
 
 	err = 0;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ext4/inode.c	2007-08-27 19:22:17.000000000 -0700
@@ -1777,7 +1777,7 @@ int ext4_block_truncate_page(handle_t *h
 	 */
 	if (!page_has_buffers(page) && test_opt(inode->i_sb, NOBH) &&
 	     ext4_should_writeback_data(inode) && PageUptodate(page)) {
-		zero_user_page(page, offset, length, KM_USER0);
+		zero_user(page, offset, length);
 		set_page_dirty(page);
 		goto unlock;
 	}
@@ -1830,7 +1830,7 @@ int ext4_block_truncate_page(handle_t *h
 			goto unlock;
 	}
 
-	zero_user_page(page, offset, length, KM_USER0);
+	zero_user(page, offset, length);
 
 	BUFFER_TRACE(bh, "zeroed end of block");
 
Index: linux-2.6/fs/gfs2/bmap.c
===================================================================
--- linux-2.6.orig/fs/gfs2/bmap.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/gfs2/bmap.c	2007-08-27 19:22:17.000000000 -0700
@@ -933,7 +933,7 @@ static int gfs2_block_truncate_page(stru
 	if (sdp->sd_args.ar_data == GFS2_DATA_ORDERED || gfs2_is_jdata(ip))
 		gfs2_trans_add_bh(ip->i_gl, bh, 0);
 
-	zero_user_page(page, offset, length, KM_USER0);
+	zero_user(page, offset, length);
 
 unlock:
 	unlock_page(page);
Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/gfs2/ops_address.c	2007-08-27 19:22:17.000000000 -0700
@@ -208,7 +208,7 @@ static int stuffed_readpage(struct gfs2_
 	 * so we need to supply one here. It doesn't happen often.
 	 */
 	if (unlikely(page->index)) {
-		zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+		zero_user(page, 0, PAGE_CACHE_SIZE);
 		return 0;
 	}
 
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/libfs.c	2007-08-27 19:22:17.000000000 -0700
@@ -340,13 +340,10 @@ int simple_prepare_write(struct file *fi
 			unsigned from, unsigned to)
 {
 	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE) {
-			void *kaddr = kmap_atomic(page, KM_USER0);
-			memset(kaddr, 0, from);
-			memset(kaddr + to, 0, PAGE_CACHE_SIZE - to);
-			flush_dcache_page(page);
-			kunmap_atomic(kaddr, KM_USER0);
-		}
+		if (to - from != PAGE_CACHE_SIZE)
+			zero_user_segments(page,
+				0, from,
+				to, PAGE_CACHE_SIZE);
 	}
 	return 0;
 }
Index: linux-2.6/fs/mpage.c
===================================================================
--- linux-2.6.orig/fs/mpage.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/mpage.c	2007-08-27 19:22:17.000000000 -0700
@@ -284,9 +284,7 @@ do_mpage_readpage(struct bio *bio, struc
 	}
 
 	if (first_hole != blocks_per_page) {
-		zero_user_page(page, first_hole << blkbits,
-				PAGE_CACHE_SIZE - (first_hole << blkbits),
-				KM_USER0);
+		zero_user_segment(page, first_hole << blkbits, PAGE_CACHE_SIZE);
 		if (first_hole == 0) {
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -585,8 +583,7 @@ page_is_mapped:
 
 		if (page->index > end_index || !offset)
 			goto confused;
-		zero_user_page(page, offset, PAGE_CACHE_SIZE - offset,
-				KM_USER0);
+		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
 	}
 
 	/*
Index: linux-2.6/fs/nfs/read.c
===================================================================
--- linux-2.6.orig/fs/nfs/read.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/nfs/read.c	2007-08-27 19:22:17.000000000 -0700
@@ -79,7 +79,7 @@ void nfs_readdata_release(void *data)
 static
 int nfs_return_empty_page(struct page *page)
 {
-	zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+	zero_user(page, 0, PAGE_CACHE_SIZE);
 	SetPageUptodate(page);
 	unlock_page(page);
 	return 0;
@@ -103,10 +103,10 @@ static void nfs_readpage_truncate_uninit
 	pglen = PAGE_CACHE_SIZE - base;
 	for (;;) {
 		if (remainder <= pglen) {
-			zero_user_page(*pages, base, remainder, KM_USER0);
+			zero_user(*pages, base, remainder);
 			break;
 		}
-		zero_user_page(*pages, base, pglen, KM_USER0);
+		zero_user(*pages, base, pglen);
 		pages++;
 		remainder -= pglen;
 		pglen = PAGE_CACHE_SIZE;
@@ -130,7 +130,7 @@ static int nfs_readpage_async(struct nfs
 		return PTR_ERR(new);
 	}
 	if (len < PAGE_CACHE_SIZE)
-		zero_user_page(page, len, PAGE_CACHE_SIZE - len, KM_USER0);
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
 
 	nfs_list_add_request(new, &one_request);
 	if (NFS_SERVER(inode)->rsize < PAGE_CACHE_SIZE)
@@ -538,7 +538,7 @@ readpage_async_filler(void *data, struct
 		goto out_error;
 
 	if (len < PAGE_CACHE_SIZE)
-		zero_user_page(page, len, PAGE_CACHE_SIZE - len, KM_USER0);
+		zero_user_segment(page, len, PAGE_CACHE_SIZE);
 	nfs_pageio_add_request(desc->pgio, new);
 	return 0;
 out_error:
Index: linux-2.6/fs/nfs/write.c
===================================================================
--- linux-2.6.orig/fs/nfs/write.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/nfs/write.c	2007-08-27 19:22:17.000000000 -0700
@@ -168,7 +168,7 @@ static void nfs_mark_uptodate(struct pag
 	if (count != nfs_page_length(page))
 		return;
 	if (count != PAGE_CACHE_SIZE)
-		zero_user_page(page, count, PAGE_CACHE_SIZE - count, KM_USER0);
+		zero_user_segment(page, count, PAGE_CACHE_SIZE);
 	SetPageUptodate(page);
 }
 
Index: linux-2.6/fs/ntfs/aops.c
===================================================================
--- linux-2.6.orig/fs/ntfs/aops.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ntfs/aops.c	2007-08-27 19:22:17.000000000 -0700
@@ -87,13 +87,17 @@ static void ntfs_end_buffer_async_read(s
 		/* Check for the current buffer head overflowing. */
 		if (unlikely(file_ofs + bh->b_size > init_size)) {
 			int ofs;
+			void *kaddr;
 
 			ofs = 0;
 			if (file_ofs < init_size)
 				ofs = init_size - file_ofs;
 			local_irq_save(flags);
-			zero_user_page(page, bh_offset(bh) + ofs,
-					 bh->b_size - ofs, KM_BIO_SRC_IRQ);
+			kaddr = kmap_atomic(page, KM_BIO_SRC_IRQ);
+			memset(kaddr + bh_offset(bh) + ofs, 0,
+					bh->b_size - ofs);
+			flush_dcache_page(page);
+			kunmap_atomic(kaddr, KM_BIO_SRC_IRQ);
 			local_irq_restore(flags);
 		}
 	} else {
@@ -334,7 +338,7 @@ handle_hole:
 		bh->b_blocknr = -1UL;
 		clear_buffer_mapped(bh);
 handle_zblock:
-		zero_user_page(page, i * blocksize, blocksize, KM_USER0);
+		zero_user(page, i * blocksize, blocksize);
 		if (likely(!err))
 			set_buffer_uptodate(bh);
 	} while (i++, iblock++, (bh = bh->b_this_page) != head);
@@ -451,7 +455,7 @@ retry_readpage:
 	 * ok to ignore the compressed flag here.
 	 */
 	if (unlikely(page->index > 0)) {
-		zero_user_page(page, 0, PAGE_CACHE_SIZE, KM_USER0);
+		zero_user(page, 0, PAGE_CACHE_SIZE);
 		goto done;
 	}
 	if (!NInoAttr(ni))
@@ -780,8 +784,7 @@ lock_retry_remap:
 		if (err == -ENOENT || lcn == LCN_ENOENT) {
 			bh->b_blocknr = -1;
 			clear_buffer_dirty(bh);
-			zero_user_page(page, bh_offset(bh), blocksize,
-					KM_USER0);
+			zero_user(page, bh_offset(bh), blocksize);
 			set_buffer_uptodate(bh);
 			err = 0;
 			continue;
@@ -1406,8 +1409,7 @@ retry_writepage:
 		if (page->index >= (i_size >> PAGE_CACHE_SHIFT)) {
 			/* The page straddles i_size. */
 			unsigned int ofs = i_size & ~PAGE_CACHE_MASK;
-			zero_user_page(page, ofs, PAGE_CACHE_SIZE - ofs,
-					KM_USER0);
+			zero_user_segment(page, ofs, PAGE_CACHE_SIZE);
 		}
 		/* Handle mst protected attributes. */
 		if (NInoMstProtected(ni))
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ntfs/file.c	2007-08-27 19:22:17.000000000 -0700
@@ -607,8 +607,8 @@ do_next_page:
 					ntfs_submit_bh_for_read(bh);
 					*wait_bh++ = bh;
 				} else {
-					zero_user_page(page, bh_offset(bh),
-							blocksize, KM_USER0);
+					zero_user(page, bh_offset(bh),
+							blocksize);
 					set_buffer_uptodate(bh);
 				}
 			}
@@ -683,9 +683,8 @@ map_buffer_cached:
 						ntfs_submit_bh_for_read(bh);
 						*wait_bh++ = bh;
 					} else {
-						zero_user_page(page,
-							bh_offset(bh),
-							blocksize, KM_USER0);
+						zero_user(page, bh_offset(bh),
+								blocksize);
 						set_buffer_uptodate(bh);
 					}
 				}
@@ -703,8 +702,8 @@ map_buffer_cached:
 			 */
 			if (bh_end <= pos || bh_pos >= end) {
 				if (!buffer_uptodate(bh)) {
-					zero_user_page(page, bh_offset(bh),
-							blocksize, KM_USER0);
+					zero_user(page, bh_offset(bh),
+							blocksize);
 					set_buffer_uptodate(bh);
 				}
 				mark_buffer_dirty(bh);
@@ -743,8 +742,7 @@ map_buffer_cached:
 				if (!buffer_uptodate(bh))
 					set_buffer_uptodate(bh);
 			} else if (!buffer_uptodate(bh)) {
-				zero_user_page(page, bh_offset(bh), blocksize,
-						KM_USER0);
+				zero_user(page, bh_offset(bh), blocksize);
 				set_buffer_uptodate(bh);
 			}
 			continue;
@@ -868,8 +866,8 @@ rl_not_mapped_enoent:
 					if (!buffer_uptodate(bh))
 						set_buffer_uptodate(bh);
 				} else if (!buffer_uptodate(bh)) {
-					zero_user_page(page, bh_offset(bh),
-							blocksize, KM_USER0);
+					zero_user(page, bh_offset(bh),
+						blocksize);
 					set_buffer_uptodate(bh);
 				}
 				continue;
@@ -1128,8 +1126,8 @@ rl_not_mapped_enoent:
 
 				if (likely(bh_pos < initialized_size))
 					ofs = initialized_size - bh_pos;
-				zero_user_page(page, bh_offset(bh) + ofs,
-						blocksize - ofs, KM_USER0);
+				zero_user_segment(page, bh_offset(bh) + ofs,
+						blocksize);
 			}
 		} else /* if (unlikely(!buffer_uptodate(bh))) */
 			err = -EIO;
@@ -1269,8 +1267,8 @@ rl_not_mapped_enoent:
 				if (PageUptodate(page))
 					set_buffer_uptodate(bh);
 				else {
-					zero_user_page(page, bh_offset(bh),
-							blocksize, KM_USER0);
+					zero_user(page, bh_offset(bh),
+							blocksize);
 					set_buffer_uptodate(bh);
 				}
 			}
@@ -1330,7 +1328,7 @@ err_out:
 		len = PAGE_CACHE_SIZE;
 		if (len > bytes)
 			len = bytes;
-		zero_user_page(*pages, 0, len, KM_USER0);
+		zero_user(*pages, 0, len);
 	}
 	goto out;
 }
@@ -1451,7 +1449,7 @@ err_out:
 		len = PAGE_CACHE_SIZE;
 		if (len > bytes)
 			len = bytes;
-		zero_user_page(*pages, 0, len, KM_USER0);
+		zero_user(*pages, 0, len);
 	}
 	goto out;
 }
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/ocfs2/aops.c	2007-08-27 19:22:17.000000000 -0700
@@ -238,7 +238,7 @@ static int ocfs2_readpage(struct file *f
 	 * XXX sys_readahead() seems to get that wrong?
 	 */
 	if (start >= i_size_read(inode)) {
-		zero_user_page(page, 0, PAGE_SIZE, KM_USER0);
+		zero_user(page, 0, PAGE_SIZE);
 		SetPageUptodate(page);
 		ret = 0;
 		goto out_alloc;
@@ -746,7 +746,7 @@ int ocfs2_map_page_blocks(struct page *p
 		if (block_start >= to)
 			break;
 
-		zero_user_page(page, block_start, bh->b_size, KM_USER0);
+		zero_user(page, block_start, bh->b_size);
 		set_buffer_uptodate(bh);
 		mark_buffer_dirty(bh);
 
@@ -905,7 +905,7 @@ static void ocfs2_zero_new_buffers(struc
 					start = max(from, block_start);
 					end = min(to, block_end);
 
-					zero_user_page(page, start, end - start, KM_USER0);
+					zero_user_segment(page, start, end);
 					set_buffer_uptodate(bh);
 				}
 
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/reiserfs/inode.c	2007-08-27 19:22:17.000000000 -0700
@@ -2150,7 +2150,7 @@ int reiserfs_truncate_file(struct inode 
 		/* if we are not on a block boundary */
 		if (length) {
 			length = blocksize - length;
-			zero_user_page(page, offset, length, KM_USER0);
+			zero_user(page, offset, length);
 			if (buffer_mapped(bh) && bh->b_blocknr != 0) {
 				mark_buffer_dirty(bh);
 			}
@@ -2374,7 +2374,7 @@ static int reiserfs_write_full_page(stru
 			unlock_page(page);
 			return 0;
 		}
-		zero_user_page(page, last_offset, PAGE_CACHE_SIZE - last_offset, KM_USER0);
+		zero_user_segment(page, last_offset, PAGE_CACHE_SIZE);
 	}
 	bh = head;
 	block = page->index << (PAGE_CACHE_SHIFT - s->s_blocksize_bits);
Index: linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_lrw.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/xfs/linux-2.6/xfs_lrw.c	2007-08-27 19:22:17.000000000 -0700
@@ -159,7 +159,7 @@ xfs_iozero(
 		if (status)
 			goto unlock;
 
-		zero_user_page(page, offset, bytes, KM_USER0);
+		zero_user(page, offset, bytes);
 
 		status = mapping->a_ops->commit_write(NULL, page, offset,
 							offset + bytes);
Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/highmem.h	2007-08-27 19:22:17.000000000 -0700
@@ -124,28 +124,41 @@ static inline void clear_highpage(struct
 	kunmap_atomic(kaddr, KM_USER0);
 }
 
-/*
- * Same but also flushes aliased cache contents to RAM.
- *
- * This must be a macro because KM_USER0 and friends aren't defined if
- * !CONFIG_HIGHMEM
- */
-#define zero_user_page(page, offset, size, km_type)		\
-	do {							\
-		void *kaddr;					\
-								\
-		BUG_ON((offset) + (size) > PAGE_SIZE);		\
-								\
-		kaddr = kmap_atomic(page, km_type);		\
-		memset((char *)kaddr + (offset), 0, (size));	\
-		flush_dcache_page(page);			\
-		kunmap_atomic(kaddr, (km_type));		\
-	} while (0)
+static inline void zero_user_segments(struct page *page,
+	unsigned start1, unsigned end1,
+	unsigned start2, unsigned end2)
+{
+	void *kaddr = kmap_atomic(page, KM_USER0);
+
+	BUG_ON(end1 > PAGE_SIZE ||
+		end2 > PAGE_SIZE);
+
+	if (end1 > start1)
+		memset(kaddr + start1, 0, end1 - start1);
+
+	if (end2 > start2)
+		memset(kaddr + start2, 0, end2 - start2);
+
+	kunmap_atomic(kaddr, KM_USER0);
+	flush_dcache_page(page);
+}
+
+static inline void zero_user_segment(struct page *page,
+	unsigned start, unsigned end)
+{
+	zero_user_segments(page, start, end, 0, 0);
+}
+
+static inline void zero_user(struct page *page,
+	unsigned start, unsigned size)
+{
+	zero_user_segments(page, start, start + size, 0, 0);
+}
 
 static inline void __deprecated memclear_highpage_flush(struct page *page,
 			unsigned int offset, unsigned int size)
 {
-	zero_user_page(page, offset, size, KM_USER0);
+	zero_user(page, offset, size);
 }
 
 #ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/filemap_xip.c	2007-08-27 19:22:17.000000000 -0700
@@ -433,7 +433,7 @@ xip_truncate_page(struct address_space *
 		else
 			return PTR_ERR(page);
 	}
-	zero_user_page(page, offset, length, KM_USER0);
+	zero_user(page, offset, length);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(xip_truncate_page);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/truncate.c	2007-08-27 19:22:17.000000000 -0700
@@ -47,7 +47,7 @@ void do_invalidatepage(struct page *page
 
 static inline void truncate_partial_page(struct page *page, unsigned partial)
 {
-	zero_user_page(page, partial, PAGE_CACHE_SIZE - partial, KM_USER0);
+	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
 	if (PagePrivate(page))
 		do_invalidatepage(page, partial);
 }
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c	2007-08-27 19:22:39.000000000 -0700
+++ linux-2.6/fs/affs/file.c	2007-08-27 19:23:05.000000000 -0700
@@ -628,7 +628,7 @@ static int affs_prepare_write_ofs(struct
 			return err;
 	}
 	if (to < PAGE_CACHE_SIZE) {
-		zero_user_page(page, to, PAGE_CACHE_SIZE - to, KM_USER0);
+		zero_user_segment(page, to, PAGE_CACHE_SIZE);
 		if (size > offset + to) {
 			if (size < offset + PAGE_CACHE_SIZE)
 				tmp = size & ~PAGE_CACHE_MASK;
Index: linux-2.6/fs/reiserfs/file.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/file.c	2007-08-27 19:23:15.000000000 -0700
+++ linux-2.6/fs/reiserfs/file.c	2007-08-27 19:24:31.000000000 -0700
@@ -1060,11 +1060,11 @@ static int reiserfs_prepare_file_region_
 	   parts of first and last pages in write area (if needed) */
 	if ((pos & ~((loff_t) PAGE_CACHE_SIZE - 1)) > inode->i_size) {
 		if (from != 0)		/* First page needs to be partially zeroed */
-			zero_user_page(prepared_pages[0], 0, from, KM_USER0);
+			zero_user(prepared_pages[0], 0, from);
 
 		if (to != PAGE_CACHE_SIZE)	/* Last page needs to be partially zeroed */
-			zero_user_page(prepared_pages[num_pages-1], to,
-					PAGE_CACHE_SIZE - to, KM_USER0);
+			zero_user_segment(prepared_pages[num_pages-1], to,
+					PAGE_CACHE_SIZE);
 
 		/* Since all blocks are new - use already calculated value */
 		return blocks;
@@ -1191,9 +1191,8 @@ static int reiserfs_prepare_file_region_
 					ll_rw_block(READ, 1, &bh);
 					*wait_bh++ = bh;
 				} else {	/* Not mapped, zero it */
-					zero_user_page(prepared_pages[0],
-						       block_start,
-						       from - block_start, KM_USER0);
+					zero_user_segment(prepared_pages[0],
+						       block_start, from);
 					set_buffer_uptodate(bh);
 				}
 			}
@@ -1225,8 +1224,8 @@ static int reiserfs_prepare_file_region_
 					ll_rw_block(READ, 1, &bh);
 					*wait_bh++ = bh;
 				} else {	/* Not mapped, zero it */
-					zero_user_page(prepared_pages[num_pages-1],
-							to, block_end - to, KM_USER0);
+					zero_user_segment(prepared_pages[num_pages-1],
+							to, block_end);
 					set_buffer_uptodate(bh);
 				}
 			}

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [02/36] Define functions for page cache handling
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
  2007-08-28 19:05 ` [01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [03/36] Use page_cache_xxx functions in mm/filemap.c clameter
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0002-Define-functions-for-page-cache-handling.patch --]
[-- Type: text/plain, Size: 3491 bytes --]

We use the macros PAGE_CACHE_SIZE, PAGE_CACHE_SHIFT, PAGE_CACHE_MASK
and PAGE_CACHE_ALIGN in various places in the kernel. Common operations
such as calculating the offset into a page or the page index are often
open-coded using shifts and adds. This patch provides inline functions
that accomplish these calculations without explicit shifting and
adding of constants.

All functions take an address_space pointer. The address space pointer
will eventually be used to support a variable size page cache;
information reachable via the mapping may then determine the page
size.

New function                    Related base page constant
====================================================================
page_cache_shift(a)             PAGE_CACHE_SHIFT
page_cache_size(a)              PAGE_CACHE_SIZE
page_cache_mask(a)              PAGE_CACHE_MASK
page_cache_index(a, pos)        Calculate page number from position
page_cache_next(a, pos)         Page number of next page
page_cache_offset(a, pos)       Calculate offset into a page
page_cache_pos(a, index, offset)
                                Form position based on page number
                                and an offset.

This provides a basis that would allow the conversion of all page cache
handling in the kernel and ultimately allow the removal of the PAGE_CACHE_*
constants.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/pagemap.h |   54 +++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8a83537..836e9dd 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -52,12 +52,66 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  * space in smaller chunks for same flexibility).
  *
  * Or rather, it _will_ be done in larger chunks.
+ *
+ * The following constants can be used if a filesystem only supports a single
+ * page size.
  */
 #define PAGE_CACHE_SHIFT	PAGE_SHIFT
 #define PAGE_CACHE_SIZE		PAGE_SIZE
 #define PAGE_CACHE_MASK		PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
+/*
+ * Functions that are currently set up for a fixed PAGE_SIZE. The use of
+ * these will allow a variable page size pagecache in the future.
+ */
+static inline int mapping_order(struct address_space *a)
+{
+	return 0;
+}
+
+static inline int page_cache_shift(struct address_space *a)
+{
+	return PAGE_SHIFT;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+	return PAGE_SIZE;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+	return (loff_t)PAGE_MASK;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+		loff_t pos)
+{
+	return pos & ~PAGE_MASK;
+}
+
+static inline pgoff_t page_cache_index(struct address_space *a,
+		loff_t pos)
+{
+	return pos >> page_cache_shift(a);
+}
+
+/*
+ * Index of the page starting on or after the given position.
+ */
+static inline pgoff_t page_cache_next(struct address_space *a,
+		loff_t pos)
+{
+	return page_cache_index(a, pos + page_cache_size(a) - 1);
+}
+
+static inline loff_t page_cache_pos(struct address_space *a,
+		pgoff_t index, unsigned long offset)
+{
+	return ((loff_t)index << page_cache_shift(a)) + offset;
+}
+
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
-- 
1.5.2.4

-- 


* [03/36] Use page_cache_xxx functions in mm/filemap.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
  2007-08-28 19:05 ` [01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user clameter
  2007-08-28 19:05 ` [02/36] Define functions for page cache handling clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [04/36] Use page_cache_xxx in mm/page-writeback.c clameter
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0003-Use-page_cache_xxx-functions-in-mm-filemap.c.patch --]
[-- Type: text/plain, Size: 6637 bytes --]

Convert the uses of PAGE_CACHE_xxx to use page_cache_xxx instead.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/filemap.c |   56 ++++++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 19:31:13.000000000 -0700
@@ -303,8 +303,8 @@ int wait_on_page_writeback_range(struct 
 int sync_page_range(struct inode *inode, struct address_space *mapping,
 			loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -335,8 +335,8 @@ EXPORT_SYMBOL(sync_page_range);
 int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
 			   loff_t pos, loff_t count)
 {
-	pgoff_t start = pos >> PAGE_CACHE_SHIFT;
-	pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT;
+	pgoff_t start = page_cache_index(mapping, pos);
+	pgoff_t end = page_cache_index(mapping, pos + count - 1);
 	int ret;
 
 	if (!mapping_cap_writeback_dirty(mapping) || !count)
@@ -365,7 +365,7 @@ int filemap_fdatawait(struct address_spa
 		return 0;
 
 	return wait_on_page_writeback_range(mapping, 0,
-				(i_size - 1) >> PAGE_CACHE_SHIFT);
+				page_cache_index(mapping, i_size - 1));
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
@@ -413,8 +413,8 @@ int filemap_write_and_wait_range(struct 
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
 			int err2 = wait_on_page_writeback_range(mapping,
-						lstart >> PAGE_CACHE_SHIFT,
-						lend >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, lstart),
+					page_cache_index(mapping, lend));
 			if (!err)
 				err = err2;
 		}
@@ -877,12 +877,12 @@ void do_generic_mapping_read(struct addr
 	struct file_ra_state ra = *_ra;
 
 	cached_page = NULL;
-	index = *ppos >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(mapping, *ppos);
 	next_index = index;
 	prev_index = ra.prev_index;
 	prev_offset = ra.prev_offset;
-	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	last_index = page_cache_next(mapping, *ppos + desc->count);
+	offset = page_cache_offset(mapping, *ppos);
 
 	for (;;) {
 		struct page *page;
@@ -919,16 +919,16 @@ page_ok:
 		 */
 
 		isize = i_size_read(inode);
-		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		end_index = page_cache_index(mapping, isize - 1);
 		if (unlikely(!isize || index > end_index)) {
 			page_cache_release(page);
 			goto out;
 		}
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index == end_index) {
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				page_cache_release(page);
 				goto out;
@@ -963,8 +963,8 @@ page_ok:
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 		prev_offset = offset;
 		ra.prev_offset = offset;
 
@@ -1058,7 +1058,7 @@ out:
 	*_ra = ra;
 	_ra->prev_index = prev_index;
 
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (cached_page)
 		page_cache_release(cached_page);
 	if (filp)
@@ -1240,8 +1240,8 @@ asmlinkage ssize_t sys_readahead(int fd,
 	if (file) {
 		if (file->f_mode & FMODE_READ) {
 			struct address_space *mapping = file->f_mapping;
-			unsigned long start = offset >> PAGE_CACHE_SHIFT;
-			unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+			unsigned long start = page_cache_index(mapping, offset);
+			unsigned long end = page_cache_index(mapping, offset + count - 1);
 			unsigned long len = end - start + 1;
 			ret = do_readahead(mapping, file, start, len);
 		}
@@ -1310,7 +1310,7 @@ int filemap_fault(struct vm_area_struct 
 	int did_readaround = 0;
 	int ret = 0;
 
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = page_cache_next(mapping, i_size_read(inode));
 	if (vmf->pgoff >= size)
 		goto outside_data_content;
 
@@ -1385,7 +1385,7 @@ retry_find:
 		goto page_not_uptodate;
 
 	/* Must recheck i_size under page lock */
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = page_cache_next(mapping, i_size_read(inode));
 	if (unlikely(vmf->pgoff >= size)) {
 		unlock_page(page);
 		goto outside_data_content;
@@ -1869,9 +1869,9 @@ generic_file_buffered_write(struct kiocb
 		unsigned long offset;
 		size_t copied;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = page_cache_offset(mapping, pos); /* Within page */
+		index = page_cache_index(mapping, pos);
+		bytes = page_cache_size(mapping) - offset;
 
 		/* Limit the size of the copy to the caller's write size */
 		bytes = min(bytes, count);
@@ -2082,8 +2082,8 @@ __generic_file_aio_write_nolock(struct k
 		if (err == 0) {
 			written = written_buffered;
 			invalidate_mapping_pages(mapping,
-						 pos >> PAGE_CACHE_SHIFT,
-						 endbyte >> PAGE_CACHE_SHIFT);
+						 page_cache_index(mapping, pos),
+						 page_cache_index(mapping, endbyte));
 		} else {
 			/*
 			 * We don't know how much we wrote, so just return
@@ -2170,7 +2170,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE) {
 		write_len = iov_length(iov, nr_segs);
-		end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT;
+		end = page_cache_index(mapping, offset + write_len - 1);
 	       	if (mapping_mapped(mapping))
 			unmap_mapping_range(mapping, offset, write_len, 0);
 	}
@@ -2187,7 +2187,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE && mapping->nrpages) {
 		retval = invalidate_inode_pages2_range(mapping,
-					offset >> PAGE_CACHE_SHIFT, end);
+					page_cache_index(mapping, offset), end);
 		if (retval)
 			goto out;
 	}
@@ -2205,7 +2205,7 @@ generic_file_direct_IO(int rw, struct ki
 	 */
 	if (rw == WRITE && mapping->nrpages) {
 		int err = invalidate_inode_pages2_range(mapping,
-					      offset >> PAGE_CACHE_SHIFT, end);
+					      page_cache_index(mapping, offset), end);
 		if (err && retval >= 0)
 			retval = err;
 	}

-- 


* [04/36] Use page_cache_xxx in mm/page-writeback.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (2 preceding siblings ...)
  2007-08-28 19:05 ` [03/36] Use page_cache_xxx functions in mm/filemap.c clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [05/36] Use page_cache_xxx in mm/truncate.c clameter
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0004-Use-page_cache_xxx-in-mm-page-writeback.c.patch --]
[-- Type: text/plain, Size: 1222 bytes --]

Use page_cache_xxx in mm/page-writeback.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/page-writeback.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 63512a9..ebe76e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -624,8 +624,8 @@ int write_cache_pages(struct address_space *mapping,
 		index = mapping->writeback_index; /* Start from prev offset */
 		end = -1;
 	} else {
-		index = wbc->range_start >> PAGE_CACHE_SHIFT;
-		end = wbc->range_end >> PAGE_CACHE_SHIFT;
+		index = page_cache_index(mapping, wbc->range_start);
+		end = page_cache_index(mapping, wbc->range_end);
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
 			range_whole = 1;
 		scanned = 1;
@@ -827,7 +827,7 @@ int __set_page_dirty_nobuffers(struct page *page)
 			WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
-				task_io_account_write(PAGE_CACHE_SIZE);
+				task_io_account_write(page_cache_size(mapping));
 			}
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
-- 
1.5.2.4

-- 


* [05/36] Use page_cache_xxx in mm/truncate.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (3 preceding siblings ...)
  2007-08-28 19:05 ` [04/36] Use page_cache_xxx in mm/page-writeback.c clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [06/36] Use page_cache_xxx in mm/rmap.c clameter
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0005-Use-page_cache_xxx-in-mm-truncate.c.patch --]
[-- Type: text/plain, Size: 3970 bytes --]

Use page_cache_xxx in mm/truncate.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/truncate.c |   35 ++++++++++++++++++-----------------
 1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index bf8068d..8c3d32e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -45,9 +45,10 @@ void do_invalidatepage(struct page *page, unsigned long offset)
 		(*invalidatepage)(page, offset);
 }
 
-static inline void truncate_partial_page(struct page *page, unsigned partial)
+static inline void truncate_partial_page(struct address_space *mapping,
+			struct page *page, unsigned partial)
 {
-	zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+	zero_user_segment(page, partial, page_cache_size(mapping));
 	if (PagePrivate(page))
 		do_invalidatepage(page, partial);
 }
@@ -95,7 +96,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+	cancel_dirty_page(page, page_cache_size(mapping));
 
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
@@ -157,9 +158,9 @@ invalidate_complete_page(struct address_space *mapping, struct page *page)
 void truncate_inode_pages_range(struct address_space *mapping,
 				loff_t lstart, loff_t lend)
 {
-	const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+	const pgoff_t start = page_cache_next(mapping, lstart);
 	pgoff_t end;
-	const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
+	const unsigned partial = page_cache_offset(mapping, lstart);
 	struct pagevec pvec;
 	pgoff_t next;
 	int i;
@@ -167,8 +168,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
 	if (mapping->nrpages == 0)
 		return;
 
-	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
-	end = (lend >> PAGE_CACHE_SHIFT);
+	BUG_ON(page_cache_offset(mapping, lend) !=
+				page_cache_size(mapping) - 1);
+	end = page_cache_index(mapping, lend);
 
 	pagevec_init(&pvec, 0);
 	next = start;
@@ -194,8 +196,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			}
 			if (page_mapped(page)) {
 				unmap_mapping_range(mapping,
-				  (loff_t)page_index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
+				  page_cache_pos(mapping, page_index, 0),
+				  page_cache_size(mapping), 0);
 			}
 			truncate_complete_page(mapping, page);
 			unlock_page(page);
@@ -208,7 +210,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
 		struct page *page = find_lock_page(mapping, start - 1);
 		if (page) {
 			wait_on_page_writeback(page);
-			truncate_partial_page(page, partial);
+			truncate_partial_page(mapping, page, partial);
 			unlock_page(page);
 			page_cache_release(page);
 		}
@@ -236,8 +238,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
 			wait_on_page_writeback(page);
 			if (page_mapped(page)) {
 				unmap_mapping_range(mapping,
-				  (loff_t)page->index<<PAGE_CACHE_SHIFT,
-				  PAGE_CACHE_SIZE, 0);
+				  page_cache_pos(mapping, page->index, 0),
+				  page_cache_size(mapping), 0);
 			}
 			if (page->index > next)
 				next = page->index;
@@ -421,9 +423,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 					 * Zap the rest of the file in one hit.
 					 */
 					unmap_mapping_range(mapping,
-					   (loff_t)page_index<<PAGE_CACHE_SHIFT,
-					   (loff_t)(end - page_index + 1)
-							<< PAGE_CACHE_SHIFT,
+					   page_cache_pos(mapping, page_index, 0),
+					   page_cache_pos(mapping, end - page_index + 1, 0),
 					    0);
 					did_range_unmap = 1;
 				} else {
@@ -431,8 +432,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
 					 * Just zap this page
 					 */
 					unmap_mapping_range(mapping,
-					  (loff_t)page_index<<PAGE_CACHE_SHIFT,
-					  PAGE_CACHE_SIZE, 0);
+					  page_cache_pos(mapping, page_index, 0),
+					  page_cache_size(mapping), 0);
 				}
 			}
 			BUG_ON(page_mapped(page));
-- 
1.5.2.4

-- 


* [06/36] Use page_cache_xxx in mm/rmap.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (4 preceding siblings ...)
  2007-08-28 19:05 ` [05/36] Use page_cache_xxx in mm/truncate.c clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:05 ` [07/36] Use page_cache_xxx in mm/filemap_xip.c clameter
                   ` (31 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0006-Use-page_cache_xxx-in-mm-rmap.c.patch --]
[-- Type: text/plain, Size: 1998 bytes --]

Use page_cache_xxx in mm/rmap.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/rmap.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 41ac397..d6a1771 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -188,9 +188,14 @@ static void page_unlock_anon_vma(struct anon_vma *anon_vma)
 static inline unsigned long
 vma_address(struct page *page, struct vm_area_struct *vma)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff;
 	unsigned long address;
 
+	if (PageAnon(page))
+		pgoff = page->index;
+	else
+		pgoff = page->index << mapping_order(page->mapping);
+
 	address = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	if (unlikely(address < vma->vm_start || address >= vma->vm_end)) {
 		/* page should be within any vma from prio_tree_next */
@@ -335,7 +340,7 @@ static int page_referenced_file(struct page *page)
 {
 	unsigned int mapcount;
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int referenced = 0;
@@ -447,7 +452,7 @@ out:
 
 static int page_mkclean_file(struct address_space *mapping, struct page *page)
 {
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = 0;
@@ -863,7 +868,7 @@ static int try_to_unmap_anon(struct page *page, int migration)
 static int try_to_unmap_file(struct page *page, int migration)
 {
 	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = page->index << (page_cache_shift(mapping) - PAGE_SHIFT);
 	struct vm_area_struct *vma;
 	struct prio_tree_iter iter;
 	int ret = SWAP_AGAIN;
-- 
1.5.2.4

-- 


* [07/36] Use page_cache_xxx in mm/filemap_xip.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (5 preceding siblings ...)
  2007-08-28 19:05 ` [06/36] Use page_cache_xxx in mm/rmap.c clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:49   ` Jörn Engel
  2007-08-28 19:05 ` [08/36] Use page_cache_xxx in mm/migrate.c clameter
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0007-Use-page_cache_xxx-in-mm-filemap_xip.c.patch --]
[-- Type: text/plain, Size: 2886 bytes --]

Use page_cache_xxx in mm/filemap_xip.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/filemap_xip.c |   28 ++++++++++++++--------------
 1 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index ba6892d..5237e53 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -61,24 +61,24 @@ do_xip_mapping_read(struct address_space *mapping,
 
 	BUG_ON(!mapping->a_ops->get_xip_page);
 
-	index = *ppos >> PAGE_CACHE_SHIFT;
-	offset = *ppos & ~PAGE_CACHE_MASK;
+	index = page_cache_index(mapping, *ppos);
+	offset = page_cache_offset(mapping, *ppos);
 
 	isize = i_size_read(inode);
 	if (!isize)
 		goto out;
 
-	end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, isize - 1);
 	for (;;) {
 		struct page *page;
 		unsigned long nr, ret;
 
 		/* nr is the maximum number of bytes to copy from this page */
-		nr = PAGE_CACHE_SIZE;
+		nr = page_cache_size(mapping);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
-			nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			nr = page_cache_offset(mapping, isize - 1) + 1;
 			if (nr <= offset) {
 				goto out;
 			}
@@ -117,8 +117,8 @@ do_xip_mapping_read(struct address_space *mapping,
 		 */
 		ret = actor(desc, page, offset, nr);
 		offset += ret;
-		index += offset >> PAGE_CACHE_SHIFT;
-		offset &= ~PAGE_CACHE_MASK;
+		index += page_cache_index(mapping, offset);
+		offset = page_cache_offset(mapping, offset);
 
 		if (ret == nr && desc->count)
 			continue;
@@ -131,7 +131,7 @@ no_xip_page:
 	}
 
 out:
-	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
+	*ppos = page_cache_pos(mapping, index, offset);
 	if (filp)
 		file_accessed(filp);
 }
@@ -220,7 +220,7 @@ static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf)
 
 	/* XXX: are VM_FAULT_ codes OK? */
 
-	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	size = page_cache_next(mapping, i_size_read(inode));
 	if (vmf->pgoff >= size)
 		return VM_FAULT_SIGBUS;
 
@@ -289,9 +289,9 @@ __xip_file_write(struct file *filp, const char __user *buf,
 		unsigned long offset;
 		size_t copied;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = page_cache_offset(mapping, pos); /* Within page */
+		index = page_cache_index(mapping, pos);
+		bytes = page_cache_size(mapping) - offset;
 		if (bytes > count)
 			bytes = count;
 
@@ -405,8 +405,8 @@ EXPORT_SYMBOL_GPL(xip_file_write);
 int
 xip_truncate_page(struct address_space *mapping, loff_t from)
 {
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	pgoff_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize;
 	unsigned length;
 	struct page *page;
-- 
1.5.2.4

-- 

* [08/36] Use page_cache_xxx in mm/migrate.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (6 preceding siblings ...)
  2007-08-28 19:05 ` [07/36] Use page_cache_xxx in mm/filemap_xip.c clameter
@ 2007-08-28 19:05 ` clameter
  2007-08-28 19:06 ` [09/36] Use page_cache_xxx in fs/libfs.c clameter
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:05 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0008-Use-page_cache_xxx-in-mm-migrate.c.patch --]
[-- Type: text/plain, Size: 651 bytes --]

Use page_cache_xxx in mm/migrate.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/migrate.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 37c73b9..4949927 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -195,7 +195,7 @@ static void remove_file_migration_ptes(struct page *old, struct page *new)
 	struct vm_area_struct *vma;
 	struct address_space *mapping = page_mapping(new);
 	struct prio_tree_iter iter;
-	pgoff_t pgoff = new->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	pgoff_t pgoff = new->index << mapping_order(mapping);
 
 	if (!mapping)
 		return;
-- 
1.5.2.4

-- 

* [09/36] Use page_cache_xxx in fs/libfs.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (7 preceding siblings ...)
  2007-08-28 19:05 ` [08/36] Use page_cache_xxx in mm/migrate.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [10/36] Use page_cache_xxx in fs/sync clameter
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0009-Use-page_cache_xxx-in-fs-libfs.c.patch --]
[-- Type: text/plain, Size: 1492 bytes --]

Use page_cache_xxx in fs/libfs.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/libfs.c |   12 +++++++-----
 1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 53b3dc5..e90f894 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -16,7 +16,8 @@ int simple_getattr(struct vfsmount *mnt, struct dentry *dentry,
 {
 	struct inode *inode = dentry->d_inode;
 	generic_fillattr(inode, stat);
-	stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9);
+	stat->blocks = inode->i_mapping->nrpages <<
+				(page_cache_shift(inode->i_mapping) - 9);
 	return 0;
 }
 
@@ -340,10 +341,10 @@ int simple_prepare_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
 	if (!PageUptodate(page)) {
-		if (to - from != PAGE_CACHE_SIZE)
+		if (to - from != page_cache_size(file->f_mapping))
 			zero_user_segments(page,
 				0, from,
-				to, PAGE_CACHE_SIZE);
+				to, page_cache_size(file->f_mapping));
 	}
 	return 0;
 }
@@ -351,8 +352,9 @@ int simple_prepare_write(struct file *file, struct page *page,
 int simple_commit_write(struct file *file, struct page *page,
 			unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	loff_t pos = page_cache_pos(mapping, page->index, to);
 
 	if (!PageUptodate(page))
 		SetPageUptodate(page);
-- 
1.5.2.4

-- 

* [10/36] Use page_cache_xxx in fs/sync
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (8 preceding siblings ...)
  2007-08-28 19:06 ` [09/36] Use page_cache_xxx in fs/libfs.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [11/36] Use page_cache_xxx in fs/buffer.c clameter
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0010-Use-page_cache_xxx-in-fs-sync.patch --]
[-- Type: text/plain, Size: 1026 bytes --]

Use page_cache_xxx in fs/sync.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/sync.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..f30d7eb 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -260,8 +260,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 	ret = 0;
 	if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 		if (ret < 0)
 			goto out;
 	}
@@ -275,8 +275,8 @@ int do_sync_mapping_range(struct address_space *mapping, loff_t offset,
 
 	if (flags & SYNC_FILE_RANGE_WAIT_AFTER) {
 		ret = wait_on_page_writeback_range(mapping,
-					offset >> PAGE_CACHE_SHIFT,
-					endbyte >> PAGE_CACHE_SHIFT);
+					page_cache_index(mapping, offset),
+					page_cache_index(mapping, endbyte));
 	}
 out:
 	return ret;
-- 
1.5.2.4

-- 

* [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (9 preceding siblings ...)
  2007-08-28 19:06 ` [10/36] Use page_cache_xxx in fs/sync clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-30  9:20   ` Dmitry Monakhov
  2007-08-28 19:06 ` [12/36] Use page_cache_xxx in mm/mpage.c clameter
                   ` (26 subsequent siblings)
  37 siblings, 1 reply; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0011-Use-page_cache_xxx-in-fs-buffer.c.patch --]
[-- Type: text/plain, Size: 14461 bytes --]

Use page_cache_xxx in fs/buffer.c.

We have a special situation in set_bh_page() since reiserfs calls that
function before setting up the mapping. So retrieve the page size
from the page struct rather than the mapping.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/buffer.c |  110 +++++++++++++++++++++++++++++++++---------------------------
 1 file changed, 62 insertions(+), 48 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-28 11:37:13.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-08-28 11:37:58.000000000 -0700
@@ -257,7 +257,7 @@ __find_get_block_slow(struct block_devic
 	struct page *page;
 	int all_mapped = 1;
 
-	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
+	index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits);
 	page = find_get_page(bd_mapping, index);
 	if (!page)
 		goto out;
@@ -697,7 +697,7 @@ static int __set_page_dirty(struct page 
 
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
-			task_io_account_write(PAGE_CACHE_SIZE);
+			task_io_account_write(page_cache_size(mapping));
 		}
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
@@ -891,10 +891,11 @@ struct buffer_head *alloc_page_buffers(s
 {
 	struct buffer_head *bh, *head;
 	long offset;
+	unsigned int page_size = page_cache_size(page->mapping);
 
 try_again:
 	head = NULL;
-	offset = PAGE_SIZE;
+	offset = page_size;
 	while ((offset -= size) >= 0) {
 		bh = alloc_buffer_head(GFP_NOFS);
 		if (!bh)
@@ -1426,7 +1427,7 @@ void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset)
 {
 	bh->b_page = page;
-	BUG_ON(offset >= PAGE_SIZE);
+	BUG_ON(offset >= compound_size(page));
 	if (PageHighMem(page))
 		/*
 		 * This catches illegal uses and preserves the offset:
@@ -1605,6 +1606,7 @@ static int __block_write_full_page(struc
 	struct buffer_head *bh, *head;
 	const unsigned blocksize = 1 << inode->i_blkbits;
 	int nr_underway = 0;
+	struct address_space *mapping = inode->i_mapping;
 
 	BUG_ON(!PageLocked(page));
 
@@ -1625,7 +1627,8 @@ static int __block_write_full_page(struc
 	 * handle that here by just cleaning them.
 	 */
 
-	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	block = (sector_t)page->index <<
+		(page_cache_shift(mapping) - inode->i_blkbits);
 	head = page_buffers(page);
 	bh = head;
 
@@ -1742,7 +1745,7 @@ recover:
 	} while ((bh = bh->b_this_page) != head);
 	SetPageError(page);
 	BUG_ON(PageWriteback(page));
-	mapping_set_error(page->mapping, err);
+	mapping_set_error(mapping, err);
 	set_page_writeback(page);
 	do {
 		struct buffer_head *next = bh->b_this_page;
@@ -1767,8 +1770,8 @@ static int __block_prepare_write(struct 
 	struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
 
 	BUG_ON(!PageLocked(page));
-	BUG_ON(from > PAGE_CACHE_SIZE);
-	BUG_ON(to > PAGE_CACHE_SIZE);
+	BUG_ON(from > page_cache_size(inode->i_mapping));
+	BUG_ON(to > page_cache_size(inode->i_mapping));
 	BUG_ON(from > to);
 
 	blocksize = 1 << inode->i_blkbits;
@@ -1777,7 +1780,8 @@ static int __block_prepare_write(struct 
 	head = page_buffers(page);
 
 	bbits = inode->i_blkbits;
-	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
+	block = (sector_t)page->index <<
+		(page_cache_shift(inode->i_mapping) - bbits);
 
 	for(bh = head, block_start = 0; bh != head || !block_start;
 	    block++, block_start=block_end, bh = bh->b_this_page) {
@@ -1921,7 +1925,8 @@ int block_read_full_page(struct page *pa
 		create_empty_buffers(page, blocksize, 0);
 	head = page_buffers(page);
 
-	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+	iblock = (sector_t)page->index <<
+		(page_cache_shift(page->mapping) - inode->i_blkbits);
 	lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
 	bh = head;
 	nr = 0;
@@ -2045,7 +2050,7 @@ int generic_cont_expand(struct inode *in
 	pgoff_t index;
 	unsigned int offset;
 
-	offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
+	offset = page_cache_offset(inode->i_mapping, size); /* Within page */
 
 	/* ugh.  in prepare/commit_write, if from==to==start of block, we
 	** skip the prepare.  make sure we never send an offset for the start
@@ -2055,7 +2060,7 @@ int generic_cont_expand(struct inode *in
 		/* caller must handle this extra byte. */
 		offset++;
 	}
-	index = size >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(inode->i_mapping, size);
 
 	return __generic_cont_expand(inode, size, index, offset);
 }
@@ -2063,8 +2068,8 @@ int generic_cont_expand(struct inode *in
 int generic_cont_expand_simple(struct inode *inode, loff_t size)
 {
 	loff_t pos = size - 1;
-	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
+	pgoff_t index = page_cache_index(inode->i_mapping, pos);
+	unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1;
 
 	/* prepare/commit_write can handle even if from==to==start of block. */
 	return __generic_cont_expand(inode, size, index, offset);
@@ -2086,28 +2091,28 @@ int cont_prepare_write(struct page *page
 	unsigned zerofrom;
 	unsigned blocksize = 1 << inode->i_blkbits;
 
-	while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
+	while (page->index > (pgpos = page_cache_index(mapping, *bytes))) {
 		status = -ENOMEM;
 		new_page = grab_cache_page(mapping, pgpos);
 		if (!new_page)
 			goto out;
 		/* we might sleep */
-		if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
+		if (page_cache_index(mapping, *bytes) != pgpos) {
 			unlock_page(new_page);
 			page_cache_release(new_page);
 			continue;
 		}
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
+		zerofrom = page_cache_offset(mapping, *bytes);
 		if (zerofrom & (blocksize-1)) {
 			*bytes |= (blocksize-1);
 			(*bytes)++;
 		}
 		status = __block_prepare_write(inode, new_page, zerofrom,
-						PAGE_CACHE_SIZE, get_block);
+					page_cache_size(mapping), get_block);
 		if (status)
 			goto out_unmap;
-		zero_user_segment(new_page, zerofrom, PAGE_CACHE_SIZE);
-		generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
+		zero_user_segment(new_page, zerofrom, page_cache_size(mapping));
+		generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping));
 		unlock_page(new_page);
 		page_cache_release(new_page);
 	}
@@ -2117,7 +2122,7 @@ int cont_prepare_write(struct page *page
 		zerofrom = offset;
 	} else {
 		/* page covers the boundary, find the boundary offset */
-		zerofrom = *bytes & ~PAGE_CACHE_MASK;
+		zerofrom = page_cache_offset(mapping, *bytes);
 
 		/* if we will expand the thing last block will be filled */
 		if (to > zerofrom && (zerofrom & (blocksize-1))) {
@@ -2169,8 +2174,9 @@ int block_commit_write(struct page *page
 int generic_commit_write(struct file *file, struct page *page,
 		unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	loff_t pos = page_cache_pos(mapping, page->index, to);
 	__block_commit_write(inode,page,from,to);
 	/*
 	 * No need to use i_size_read() here, the i_size
@@ -2206,20 +2212,22 @@ block_page_mkwrite(struct vm_area_struct
 	unsigned long end;
 	loff_t size;
 	int ret = -EINVAL;
+	struct address_space *mapping;
 
 	lock_page(page);
+	mapping = page->mapping;
 	size = i_size_read(inode);
-	if ((page->mapping != inode->i_mapping) ||
+	if ((mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
 		goto out_unlock;
 	}
 
 	/* page is wholly or partially inside EOF */
-	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
-		end = size & ~PAGE_CACHE_MASK;
+	if (page_cache_pos(mapping, page->index + 1, 0) > size)
+		end = page_cache_offset(mapping, size);
 	else
-		end = PAGE_CACHE_SIZE;
+		end = page_cache_size(mapping);
 
 	ret = block_prepare_write(page, 0, end, get_block);
 	if (!ret)
@@ -2258,6 +2266,7 @@ static void end_buffer_read_nobh(struct 
 int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
 			get_block_t *get_block)
 {
+	struct address_space *mapping = page->mapping;
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	const unsigned blocksize = 1 << blkbits;
@@ -2265,6 +2274,7 @@ int nobh_prepare_write(struct page *page
 	struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
 	unsigned block_in_page;
 	unsigned block_start;
+	unsigned page_size = page_cache_size(mapping);
 	sector_t block_in_file;
 	int nr_reads = 0;
 	int i;
@@ -2274,7 +2284,8 @@ int nobh_prepare_write(struct page *page
 	if (PageMappedToDisk(page))
 		return 0;
 
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index <<
+			(page_cache_shift(mapping) - blkbits);
 	map_bh.b_page = page;
 
 	/*
@@ -2283,7 +2294,7 @@ int nobh_prepare_write(struct page *page
 	 * page is fully mapped-to-disk.
 	 */
 	for (block_start = 0, block_in_page = 0;
-		  block_start < PAGE_CACHE_SIZE;
+		  block_start < page_size;
 		  block_in_page++, block_start += blocksize) {
 		unsigned block_end = block_start + blocksize;
 		int create;
@@ -2372,7 +2383,7 @@ failed:
 	 * Error recovery is pretty slack.  Clear the page and mark it dirty
 	 * so we'll later zero out any blocks which _were_ allocated.
 	 */
-	zero_user(page, 0, PAGE_CACHE_SIZE);
+	zero_user(page, 0, page_size);
 	SetPageUptodate(page);
 	set_page_dirty(page);
 	return ret;
@@ -2386,8 +2397,9 @@ EXPORT_SYMBOL(nobh_prepare_write);
 int nobh_commit_write(struct file *file, struct page *page,
 		unsigned from, unsigned to)
 {
-	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+	loff_t pos = page_cache_pos(mapping, page->index, to);
 
 	SetPageUptodate(page);
 	set_page_dirty(page);
@@ -2407,9 +2419,10 @@ EXPORT_SYMBOL(nobh_commit_write);
 int nobh_writepage(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc)
 {
-	struct inode * const inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode * const inode = mapping->host;
 	loff_t i_size = i_size_read(inode);
-	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+	const pgoff_t end_index = page_cache_index(mapping, i_size);
 	unsigned offset;
 	int ret;
 
@@ -2418,7 +2431,7 @@ int nobh_writepage(struct page *page, ge
 		goto out;
 
 	/* Is the page fully outside i_size? (truncate in progress) */
-	offset = i_size & (PAGE_CACHE_SIZE-1);
+	offset = page_cache_offset(mapping, i_size);
 	if (page->index >= end_index+1 || !offset) {
 		/*
 		 * The page may have dirty, unmapped buffers.  For example,
@@ -2441,7 +2454,7 @@ int nobh_writepage(struct page *page, ge
 	 * the  page size, the remaining memory is zeroed when mapped, and
 	 * writes to that region are not written out to the file."
 	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+	zero_user_segment(page, offset, page_cache_size(mapping));
 out:
 	ret = mpage_writepage(page, get_block, wbc);
 	if (ret == -EAGAIN)
@@ -2457,8 +2470,8 @@ int nobh_truncate_page(struct address_sp
 {
 	struct inode *inode = mapping->host;
 	unsigned blocksize = 1 << inode->i_blkbits;
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	pgoff_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned to;
 	struct page *page;
 	const struct address_space_operations *a_ops = mapping->a_ops;
@@ -2475,7 +2488,7 @@ int nobh_truncate_page(struct address_sp
 	to = (offset + blocksize) & ~(blocksize - 1);
 	ret = a_ops->prepare_write(NULL, page, offset, to);
 	if (ret == 0) {
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		zero_user_segment(page, offset, page_cache_size(mapping));
 		/*
 		 * It would be more correct to call aops->commit_write()
 		 * here, but this is more efficient.
@@ -2493,8 +2506,8 @@ EXPORT_SYMBOL(nobh_truncate_page);
 int block_truncate_page(struct address_space *mapping,
 			loff_t from, get_block_t *get_block)
 {
-	pgoff_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	pgoff_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize;
 	sector_t iblock;
 	unsigned length, pos;
@@ -2511,8 +2524,8 @@ int block_truncate_page(struct address_s
 		return 0;
 
 	length = blocksize - length;
-	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
-	
+	iblock = (sector_t)index <<
+			(page_cache_shift(mapping) - inode->i_blkbits);
 	page = grab_cache_page(mapping, index);
 	err = -ENOMEM;
 	if (!page)
@@ -2571,9 +2584,10 @@ out:
 int block_write_full_page(struct page *page, get_block_t *get_block,
 			struct writeback_control *wbc)
 {
-	struct inode * const inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode * const inode = mapping->host;
 	loff_t i_size = i_size_read(inode);
-	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
+	const pgoff_t end_index = page_cache_index(mapping, i_size);
 	unsigned offset;
 
 	/* Is the page fully inside i_size? */
@@ -2581,7 +2595,7 @@ int block_write_full_page(struct page *p
 		return __block_write_full_page(inode, page, get_block, wbc);
 
 	/* Is the page fully outside i_size? (truncate in progress) */
-	offset = i_size & (PAGE_CACHE_SIZE-1);
+	offset = page_cache_offset(mapping, i_size);
 	if (page->index >= end_index+1 || !offset) {
 		/*
 		 * The page may have dirty, unmapped buffers.  For example,
@@ -2600,7 +2614,7 @@ int block_write_full_page(struct page *p
 	 * the  page size, the remaining memory is zeroed when mapped, and
 	 * writes to that region are not written out to the file."
 	 */
-	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+	zero_user_segment(page, offset, page_cache_size(mapping));
 	return __block_write_full_page(inode, page, get_block, wbc);
 }
 
@@ -2854,7 +2868,7 @@ int try_to_free_buffers(struct page *pag
 	 * dirty bit from being lost.
 	 */
 	if (ret)
-		cancel_dirty_page(page, PAGE_CACHE_SIZE);
+		cancel_dirty_page(page, page_cache_size(mapping));
 	spin_unlock(&mapping->private_lock);
 out:
 	if (buffers_to_free) {

-- 

* [12/36] Use page_cache_xxx in mm/mpage.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (10 preceding siblings ...)
  2007-08-28 19:06 ` [11/36] Use page_cache_xxx in fs/buffer.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [13/36] Use page_cache_xxx in mm/fadvise.c clameter
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0012-Use-page_cache_xxx-in-mm-mpage.c.patch --]
[-- Type: text/plain, Size: 4285 bytes --]

Use page_cache_xxx in mm/mpage.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/mpage.c |   28 ++++++++++++++++------------
 1 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index a5e1385..2843ed7 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev,
 static void 
 map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block) 
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	struct buffer_head *page_bh, *head;
 	int block = 0;
 
@@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block)
 		 * don't make any buffers if there is only one buffer on
 		 * the page and the page just needs to be set up to date
 		 */
-		if (inode->i_blkbits == PAGE_CACHE_SHIFT && 
+		if (inode->i_blkbits == page_cache_shift(mapping) &&
 		    buffer_uptodate(bh)) {
-			SetPageUptodate(page);    
+			SetPageUptodate(page);
 			return;
 		}
 		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
@@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 		sector_t *last_block_in_bio, struct buffer_head *map_bh,
 		unsigned long *first_logical_block, get_block_t get_block)
 {
-	struct inode *inode = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	sector_t block_in_file;
 	sector_t last_block;
@@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	if (page_has_buffers(page))
 		goto confused;
 
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits);
 	last_block = block_in_file + nr_pages * blocks_per_page;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
@@ -284,7 +286,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages,
 	}
 
 	if (first_hole != blocks_per_page) {
-		zero_user_segment(page, first_hole << blkbits, PAGE_CACHE_SIZE);
+		zero_user_segment(page, first_hole << blkbits,
+					page_cache_size(mapping));
 		if (first_hole == 0) {
 			SetPageUptodate(page);
 			unlock_page(page);
@@ -468,7 +471,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	struct inode *inode = page->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
 	unsigned long end_index;
-	const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits;
+	const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits;
 	sector_t last_block;
 	sector_t block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
@@ -537,7 +540,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	 * The page has no buffers: map it to disk
 	 */
 	BUG_ON(!PageUptodate(page));
-	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
+	block_in_file = (sector_t)page->index <<
+			(page_cache_shift(mapping) - blkbits);
 	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_page = page;
 	for (page_block = 0; page_block < blocks_per_page; ) {
@@ -569,7 +573,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
 	first_unmapped = page_block;
 
 page_is_mapped:
-	end_index = i_size >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(mapping, i_size);
 	if (page->index >= end_index) {
 		/*
 		 * The page straddles i_size.  It must be zeroed out on each
@@ -579,11 +583,11 @@ page_is_mapped:
 		 * is zeroed when mapped, and writes to that region are not
 		 * written out to the file."
 		 */
-		unsigned offset = i_size & (PAGE_CACHE_SIZE - 1);
+		unsigned offset = page_cache_offset(mapping, i_size);
 
 		if (page->index > end_index || !offset)
 			goto confused;
-		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
+		zero_user_segment(page, offset, page_cache_size(mapping));
 	}
 
 	/*
-- 
1.5.2.4

-- 

* [13/36] Use page_cache_xxx in mm/fadvise.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (11 preceding siblings ...)
  2007-08-28 19:06 ` [12/36] Use page_cache_xxx in mm/mpage.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [14/36] Use page_cache_xxx in fs/splice.c clameter
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0013-Use-page_cache_xxx-in-mm-fadvise.c.patch --]
[-- Type: text/plain, Size: 1188 bytes --]

Use page_cache_xxx in mm/fadvise.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/fadvise.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 0df4c89..804c2a9 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 		}
 
 		/* First and last PARTIAL page! */
-		start_index = offset >> PAGE_CACHE_SHIFT;
-		end_index = endbyte >> PAGE_CACHE_SHIFT;
+		start_index = page_cache_index(mapping, offset);
+		end_index = page_cache_index(mapping, endbyte);
 
 		/* Careful about overflow on the "+1" */
 		nrpages = end_index - start_index + 1;
@@ -100,8 +100,8 @@ asmlinkage long sys_fadvise64_64(int fd, loff_t offset, loff_t len, int advice)
 			filemap_flush(mapping);
 
 		/* First and last FULL page! */
-		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
-		end_index = (endbyte >> PAGE_CACHE_SHIFT);
+		start_index = page_cache_next(mapping, offset);
+		end_index = page_cache_index(mapping, endbyte);
 
 		if (end_index >= start_index)
 			invalidate_mapping_pages(mapping, start_index,
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [14/36] Use page_cache_xxx in fs/splice.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (12 preceding siblings ...)
  2007-08-28 19:06 ` [13/36] Use page_cache_xxx in mm/fadvise.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [15/36] Use page_cache_xxx functions in fs/ext2 clameter
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0014-Use-page_cache_xxx-in-fs-splice.c.patch --]
[-- Type: text/plain, Size: 3215 bytes --]

Use page_cache_xxx in fs/splice.c
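
The `req_pages` change below replaces the open-coded round-up `(len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT` with `page_cache_next(mapping, len + loff)`. Both count the pages touched by `len` bytes starting `loff` bytes into the first page. A user-space sketch checking that equivalence (the `shift` parameter stands in for the mapping's page-cache shift; function names are illustrative, not from the patch):

```c
#include <assert.h>

/* old idiom: round len + loff up to whole pages */
static unsigned long pages_spanned_old(unsigned long len, unsigned long loff,
				       unsigned int shift)
{
	unsigned long size = 1UL << shift;

	return (len + loff + size - 1) >> shift;
}

/* brute force: distinct page indices covered by [loff, loff + len) */
static unsigned long pages_spanned_brute(unsigned long len, unsigned long loff,
					 unsigned int shift)
{
	if (len == 0)
		return 0;
	return ((loff + len - 1) >> shift) - (loff >> shift) + 1;
}
```

For `loff` within the first page (as in `__generic_file_splice_read()`, where it comes from `page_cache_offset()`), the two agree for any page size.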

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/splice.c |   27 ++++++++++++++-------------
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index c010a72..7910f32 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -279,9 +279,9 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		.ops = &page_cache_pipe_buf_ops,
 	};
 
-	index = *ppos >> PAGE_CACHE_SHIFT;
-	loff = *ppos & ~PAGE_CACHE_MASK;
-	req_pages = (len + loff + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(mapping, *ppos);
+	loff = page_cache_offset(mapping, *ppos);
+	req_pages = page_cache_next(mapping, len + loff);
 	nr_pages = min(req_pages, (unsigned)PIPE_BUFFERS);
 
 	/*
@@ -336,7 +336,7 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 	 * Now loop over the map and see if we need to start IO on any
 	 * pages, fill in the partial map, etc.
 	 */
-	index = *ppos >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(mapping, *ppos);
 	nr_pages = spd.nr_pages;
 	spd.nr_pages = 0;
 	for (page_nr = 0; page_nr < nr_pages; page_nr++) {
@@ -348,7 +348,8 @@ __generic_file_splice_read(struct file *in, loff_t *ppos,
 		/*
 		 * this_len is the max we'll use from this page
 		 */
-		this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
+		this_len = min_t(unsigned long, len,
+					page_cache_size(mapping) - loff);
 		page = pages[page_nr];
 
 		if (PageReadahead(page))
@@ -408,7 +409,7 @@ fill_it:
 		 * i_size must be checked after PageUptodate.
 		 */
 		isize = i_size_read(mapping->host);
-		end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+		end_index = page_cache_index(mapping, isize - 1);
 		if (unlikely(!isize || index > end_index))
 			break;
 
@@ -422,7 +423,7 @@ fill_it:
 			/*
 			 * max good bytes in this page
 			 */
-			plen = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
+			plen = page_cache_offset(mapping, isize - 1) + 1;
 			if (plen <= loff)
 				break;
 
@@ -573,12 +574,12 @@ static int pipe_to_file(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 	if (unlikely(ret))
 		return ret;
 
-	index = sd->pos >> PAGE_CACHE_SHIFT;
-	offset = sd->pos & ~PAGE_CACHE_MASK;
+	index = page_cache_index(mapping, sd->pos);
+	offset = page_cache_offset(mapping, sd->pos);
 
 	this_len = sd->len;
-	if (this_len + offset > PAGE_CACHE_SIZE)
-		this_len = PAGE_CACHE_SIZE - offset;
+	if (this_len + offset > page_cache_size(mapping))
+		this_len = page_cache_size(mapping) - offset;
 
 find_page:
 	page = find_lock_page(mapping, index);
@@ -839,7 +840,7 @@ generic_file_splice_write_nolock(struct pipe_inode_info *pipe, struct file *out,
 		unsigned long nr_pages;
 
 		*ppos += ret;
-		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+		nr_pages = page_cache_next(mapping, ret);
 
 		/*
 		 * If file or inode is SYNC and we actually wrote some data,
@@ -896,7 +897,7 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		unsigned long nr_pages;
 
 		*ppos += ret;
-		nr_pages = (ret + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+		nr_pages = page_cache_next(mapping, ret);
 
 		/*
 		 * If file or inode is SYNC and we actually wrote some data,
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [15/36] Use page_cache_xxx functions in fs/ext2
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (13 preceding siblings ...)
  2007-08-28 19:06 ` [14/36] Use page_cache_xxx in fs/splice.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [16/36] Use page_cache_xxx in fs/ext3 clameter
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0015-Use-page_cache_xxx-functions-in-fs-ext2.patch --]
[-- Type: text/plain, Size: 5775 bytes --]

Use page_cache_xxx functions in fs/ext2
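
The converted `dir_pages()` below computes how many page-cache pages hold `i_size` bytes of directory data, with the page size now taken from the mapping. A user-space sketch of that round-up (here `shift` stands in for `page_cache_shift(inode->i_mapping)`; the function name is illustrative):

```c
#include <assert.h>

/* pages needed to hold i_size bytes when a page is (1 << shift) bytes */
static unsigned long dir_pages_sketch(unsigned long long i_size,
				      unsigned int shift)
{
	unsigned long long size = 1ULL << shift;

	return (unsigned long)((i_size + size - 1) >> shift);
}
```

For example, a 5KiB directory needs two 4KiB pages but only one 16KiB page, which is why the helpers must be per-mapping rather than compile-time constants.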

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext2/dir.c |   40 +++++++++++++++++++++++-----------------
 1 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2bf49d7..d72926f 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -43,7 +43,8 @@ static inline void ext2_put_page(struct page *page)
 
 static inline unsigned long dir_pages(struct inode *inode)
 {
-	return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT;
+	return (inode->i_size+page_cache_size(inode->i_mapping)-1)>>
+			page_cache_shift(inode->i_mapping);
 }
 
 /*
@@ -54,10 +55,11 @@ static unsigned
 ext2_last_byte(struct inode *inode, unsigned long page_nr)
 {
 	unsigned last_byte = inode->i_size;
+	struct address_space *mapping = inode->i_mapping;
 
-	last_byte -= page_nr << PAGE_CACHE_SHIFT;
-	if (last_byte > PAGE_CACHE_SIZE)
-		last_byte = PAGE_CACHE_SIZE;
+	last_byte -= page_nr << page_cache_shift(mapping);
+	if (last_byte > page_cache_size(mapping))
+		last_byte = page_cache_size(mapping);
 	return last_byte;
 }
 
@@ -76,18 +78,19 @@ static int ext2_commit_chunk(struct page *page, unsigned from, unsigned to)
 
 static void ext2_check_page(struct page *page)
 {
-	struct inode *dir = page->mapping->host;
+	struct address_space *mapping = page->mapping;
+	struct inode *dir = mapping->host;
 	struct super_block *sb = dir->i_sb;
 	unsigned chunk_size = ext2_chunk_size(dir);
 	char *kaddr = page_address(page);
 	u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count);
 	unsigned offs, rec_len;
-	unsigned limit = PAGE_CACHE_SIZE;
+	unsigned limit = page_cache_size(mapping);
 	ext2_dirent *p;
 	char *error;
 
-	if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) {
-		limit = dir->i_size & ~PAGE_CACHE_MASK;
+	if (page_cache_index(mapping, dir->i_size) == page->index) {
+		limit = page_cache_offset(mapping, dir->i_size);
 		if (limit & (chunk_size - 1))
 			goto Ebadsize;
 		if (!limit)
@@ -139,7 +142,7 @@ Einumber:
 bad_entry:
 	ext2_error (sb, "ext2_check_page", "bad entry in directory #%lu: %s - "
 		"offset=%lu, inode=%lu, rec_len=%d, name_len=%d",
-		dir->i_ino, error, (page->index<<PAGE_CACHE_SHIFT)+offs,
+		dir->i_ino, error, page_cache_pos(mapping, page->index, offs),
 		(unsigned long) le32_to_cpu(p->inode),
 		rec_len, p->name_len);
 	goto fail;
@@ -148,7 +151,7 @@ Eend:
 	ext2_error (sb, "ext2_check_page",
 		"entry in directory #%lu spans the page boundary"
 		"offset=%lu, inode=%lu",
-		dir->i_ino, (page->index<<PAGE_CACHE_SHIFT)+offs,
+		dir->i_ino, page_cache_pos(mapping, page->index, offs),
 		(unsigned long) le32_to_cpu(p->inode));
 fail:
 	SetPageChecked(page);
@@ -246,8 +249,9 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 	loff_t pos = filp->f_pos;
 	struct inode *inode = filp->f_path.dentry->d_inode;
 	struct super_block *sb = inode->i_sb;
-	unsigned int offset = pos & ~PAGE_CACHE_MASK;
-	unsigned long n = pos >> PAGE_CACHE_SHIFT;
+	struct address_space *mapping = inode->i_mapping;
+	unsigned int offset = page_cache_offset(mapping, pos);
+	unsigned long n = page_cache_index(mapping, pos);
 	unsigned long npages = dir_pages(inode);
 	unsigned chunk_mask = ~(ext2_chunk_size(inode)-1);
 	unsigned char *types = NULL;
@@ -268,14 +272,14 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 			ext2_error(sb, __FUNCTION__,
 				   "bad page in #%lu",
 				   inode->i_ino);
-			filp->f_pos += PAGE_CACHE_SIZE - offset;
+			filp->f_pos += page_cache_size(mapping) - offset;
 			return -EIO;
 		}
 		kaddr = page_address(page);
 		if (unlikely(need_revalidate)) {
 			if (offset) {
 				offset = ext2_validate_entry(kaddr, offset, chunk_mask);
-				filp->f_pos = (n<<PAGE_CACHE_SHIFT) + offset;
+				filp->f_pos = page_cache_pos(mapping, n, offset);
 			}
 			filp->f_version = inode->i_version;
 			need_revalidate = 0;
@@ -298,7 +302,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 
 				offset = (char *)de - kaddr;
 				over = filldir(dirent, de->name, de->name_len,
-						(n<<PAGE_CACHE_SHIFT) | offset,
+						page_cache_pos(mapping, n, offset),
 						le32_to_cpu(de->inode), d_type);
 				if (over) {
 					ext2_put_page(page);
@@ -324,6 +328,7 @@ struct ext2_dir_entry_2 * ext2_find_entry (struct inode * dir,
 			struct dentry *dentry, struct page ** res_page)
 {
 	const char *name = dentry->d_name.name;
+	struct address_space *mapping = dir->i_mapping;
 	int namelen = dentry->d_name.len;
 	unsigned reclen = EXT2_DIR_REC_LEN(namelen);
 	unsigned long start, n;
@@ -365,7 +370,7 @@ struct ext2_dir_entry_2 * ext2_find_entry (struct inode * dir,
 		if (++n >= npages)
 			n = 0;
 		/* next page is past the blocks we've got */
-		if (unlikely(n > (dir->i_blocks >> (PAGE_CACHE_SHIFT - 9)))) {
+		if (unlikely(n > (dir->i_blocks >> (page_cache_shift(mapping) - 9)))) {
 			ext2_error(dir->i_sb, __FUNCTION__,
 				"dir %lu size %lld exceeds block count %llu",
 				dir->i_ino, dir->i_size,
@@ -434,6 +439,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 int ext2_add_link (struct dentry *dentry, struct inode *inode)
 {
 	struct inode *dir = dentry->d_parent->d_inode;
+	struct address_space *mapping = inode->i_mapping;
 	const char *name = dentry->d_name.name;
 	int namelen = dentry->d_name.len;
 	unsigned chunk_size = ext2_chunk_size(dir);
@@ -463,7 +469,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
 		kaddr = page_address(page);
 		dir_end = kaddr + ext2_last_byte(dir, n);
 		de = (ext2_dirent *)kaddr;
-		kaddr += PAGE_CACHE_SIZE - reclen;
+		kaddr += page_cache_size(mapping) - reclen;
 		while ((char *)de <= kaddr) {
 			if ((char *)de == dir_end) {
 				/* We hit i_size */
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [16/36] Use page_cache_xxx in fs/ext3
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (14 preceding siblings ...)
  2007-08-28 19:06 ` [15/36] Use page_cache_xxx functions in fs/ext2 clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [17/36] Use page_cache_xxx in fs/ext4 clameter
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0016-Use-page_cache_xxx-in-fs-ext3.patch --]
[-- Type: text/plain, Size: 5301 bytes --]

Use page_cache_xxx in fs/ext3

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext3/dir.c   |    3 ++-
 fs/ext3/inode.c |   34 +++++++++++++++++-----------------
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..a65b5a7 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -137,7 +137,8 @@ static int ext3_readdir(struct file * filp,
 						&map_bh, 0, 0);
 		if (err > 0) {
 			pgoff_t index = map_bh.b_blocknr >>
-					(PAGE_CACHE_SHIFT - inode->i_blkbits);
+				(page_cache_shift(inode->i_mapping)
+					- inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb3c264..986519b 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1224,7 +1224,7 @@ static int ext3_ordered_commit_write(struct file *file, struct page *page,
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = page_cache_pos(page->mapping, page->index, to);
 		if (new_i_size > EXT3_I(inode)->i_disksize)
 			EXT3_I(inode)->i_disksize = new_i_size;
 		ret = generic_commit_write(file, page, from, to);
@@ -1243,7 +1243,7 @@ static int ext3_writeback_commit_write(struct file *file, struct page *page,
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(inode->i_mapping, page->index, to);
 	if (new_i_size > EXT3_I(inode)->i_disksize)
 		EXT3_I(inode)->i_disksize = new_i_size;
 
@@ -1270,7 +1270,7 @@ static int ext3_journalled_commit_write(struct file *file,
 	/*
 	 * Here we duplicate the generic_commit_write() functionality
 	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	pos = page_cache_pos(page->mapping, page->index, to);
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
 				to, &partial, commit_write_fn);
@@ -1422,6 +1422,7 @@ static int ext3_ordered_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	J_ASSERT(PageLocked(page));
 
@@ -1444,8 +1445,7 @@ static int ext3_ordered_writepage(struct page *page,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
 	}
 	page_bufs = page_buffers(page);
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bget_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one);
 
 	ret = block_write_full_page(page, ext3_get_block, wbc);
 
@@ -1462,13 +1462,12 @@ static int ext3_ordered_writepage(struct page *page,
 	 * and generally junk.
 	 */
 	if (ret == 0) {
-		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
-					NULL, journal_dirty_data_fn);
+		err = walk_page_buffers(handle, page_bufs, 0, pagesize,
+			NULL, journal_dirty_data_fn);
 		if (!ret)
 			ret = err;
 	}
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bput_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one);
 	err = ext3_journal_stop(handle);
 	if (!ret)
 		ret = err;
@@ -1520,6 +1519,7 @@ static int ext3_journalled_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	if (ext3_journal_current_handle())
 		goto no_write;
@@ -1536,17 +1536,16 @@ static int ext3_journalled_writepage(struct page *page,
 		 * doesn't seem much point in redirtying the page here.
 		 */
 		ClearPageChecked(page);
-		ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
-					ext3_get_block);
+		ret = block_prepare_write(page, 0, pagesize, ext3_get_block);
 		if (ret != 0) {
 			ext3_journal_stop(handle);
 			goto out_unlock;
 		}
 		ret = walk_page_buffers(handle, page_buffers(page), 0,
-			PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
+			pagesize, NULL, do_journal_get_write_access);
 
 		err = walk_page_buffers(handle, page_buffers(page), 0,
-				PAGE_CACHE_SIZE, NULL, commit_write_fn);
+				pagesize, NULL, commit_write_fn);
 		if (ret == 0)
 			ret = err;
 		EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
@@ -1761,8 +1760,8 @@ void ext3_set_aops(struct inode *inode)
 static int ext3_block_truncate_page(handle_t *handle, struct page *page,
 		struct address_space *mapping, loff_t from)
 {
-	ext3_fsblk_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	ext3_fsblk_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize, iblock, length, pos;
 	struct inode *inode = mapping->host;
 	struct buffer_head *bh;
@@ -1770,7 +1769,8 @@ static int ext3_block_truncate_page(handle_t *handle, struct page *page,
 
 	blocksize = inode->i_sb->s_blocksize;
 	length = blocksize - (offset & (blocksize - 1));
-	iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	iblock = index <<
+		(page_cache_shift(mapping) - inode->i_sb->s_blocksize_bits);
 
 	/*
 	 * For "nobh" option,  we can only work if we don't need to
@@ -2249,7 +2249,7 @@ void ext3_truncate(struct inode *inode)
 		page = NULL;
 	} else {
 		page = grab_cache_page(mapping,
-				inode->i_size >> PAGE_CACHE_SHIFT);
+				page_cache_index(mapping, inode->i_size));
 		if (!page)
 			return;
 	}
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [17/36] Use page_cache_xxx in fs/ext4
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (15 preceding siblings ...)
  2007-08-28 19:06 ` [16/36] Use page_cache_xxx in fs/ext3 clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [18/36] Use page_cache_xxx in fs/reiserfs clameter
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0017-Use-page_cache_xxx-in-fs-ext4.patch --]
[-- Type: text/plain, Size: 5277 bytes --]

Use page_cache_xxx in fs/ext4

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext4/dir.c   |    3 ++-
 fs/ext4/inode.c |   31 ++++++++++++++++---------------
 2 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3ab01c0..9d6cd51 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -136,7 +136,8 @@ static int ext4_readdir(struct file * filp,
 		err = ext4_get_blocks_wrap(NULL, inode, blk, 1, &map_bh, 0, 0);
 		if (err > 0) {
 			pgoff_t index = map_bh.b_blocknr >>
-					(PAGE_CACHE_SHIFT - inode->i_blkbits);
+				(page_cache_shift(inode->i_mapping)
+					- inode->i_blkbits);
 			if (!ra_has_index(&filp->f_ra, index))
 				page_cache_sync_readahead(
 					sb->s_bdev->bd_inode->i_mapping,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3fe1e40..0be5bf8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1223,7 +1223,7 @@ static int ext4_ordered_commit_write(struct file *file, struct page *page,
 		 */
 		loff_t new_i_size;
 
-		new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+		new_i_size = page_cache_pos(page->mapping, page->index, to);
 		if (new_i_size > EXT4_I(inode)->i_disksize)
 			EXT4_I(inode)->i_disksize = new_i_size;
 		ret = generic_commit_write(file, page, from, to);
@@ -1242,7 +1242,7 @@ static int ext4_writeback_commit_write(struct file *file, struct page *page,
 	int ret = 0, ret2;
 	loff_t new_i_size;
 
-	new_i_size = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	new_i_size = page_cache_pos(page->mapping, page->index, to);
 	if (new_i_size > EXT4_I(inode)->i_disksize)
 		EXT4_I(inode)->i_disksize = new_i_size;
 
@@ -1269,7 +1269,7 @@ static int ext4_journalled_commit_write(struct file *file,
 	/*
 	 * Here we duplicate the generic_commit_write() functionality
 	 */
-	pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+	pos = page_cache_pos(page->mapping, page->index, to);
 
 	ret = walk_page_buffers(handle, page_buffers(page), from,
 				to, &partial, commit_write_fn);
@@ -1421,6 +1421,7 @@ static int ext4_ordered_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	J_ASSERT(PageLocked(page));
 
@@ -1443,8 +1444,7 @@ static int ext4_ordered_writepage(struct page *page,
 				(1 << BH_Dirty)|(1 << BH_Uptodate));
 	}
 	page_bufs = page_buffers(page);
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bget_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bget_one);
 
 	ret = block_write_full_page(page, ext4_get_block, wbc);
 
@@ -1461,13 +1461,12 @@ static int ext4_ordered_writepage(struct page *page,
 	 * and generally junk.
 	 */
 	if (ret == 0) {
-		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
+		err = walk_page_buffers(handle, page_bufs, 0, pagesize,
 					NULL, jbd2_journal_dirty_data_fn);
 		if (!ret)
 			ret = err;
 	}
-	walk_page_buffers(handle, page_bufs, 0,
-			PAGE_CACHE_SIZE, NULL, bput_one);
+	walk_page_buffers(handle, page_bufs, 0, pagesize, NULL, bput_one);
 	err = ext4_journal_stop(handle);
 	if (!ret)
 		ret = err;
@@ -1519,6 +1518,7 @@ static int ext4_journalled_writepage(struct page *page,
 	handle_t *handle = NULL;
 	int ret = 0;
 	int err;
+	int pagesize = page_cache_size(inode->i_mapping);
 
 	if (ext4_journal_current_handle())
 		goto no_write;
@@ -1535,17 +1535,17 @@ static int ext4_journalled_writepage(struct page *page,
 		 * doesn't seem much point in redirtying the page here.
 		 */
 		ClearPageChecked(page);
-		ret = block_prepare_write(page, 0, PAGE_CACHE_SIZE,
+		ret = block_prepare_write(page, 0, pagesize,
 					ext4_get_block);
 		if (ret != 0) {
 			ext4_journal_stop(handle);
 			goto out_unlock;
 		}
 		ret = walk_page_buffers(handle, page_buffers(page), 0,
-			PAGE_CACHE_SIZE, NULL, do_journal_get_write_access);
+			pagesize, NULL, do_journal_get_write_access);
 
 		err = walk_page_buffers(handle, page_buffers(page), 0,
-				PAGE_CACHE_SIZE, NULL, commit_write_fn);
+				pagesize, NULL, commit_write_fn);
 		if (ret == 0)
 			ret = err;
 		EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
@@ -1760,8 +1760,8 @@ void ext4_set_aops(struct inode *inode)
 int ext4_block_truncate_page(handle_t *handle, struct page *page,
 		struct address_space *mapping, loff_t from)
 {
-	ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
-	unsigned offset = from & (PAGE_CACHE_SIZE-1);
+	ext4_fsblk_t index = page_cache_index(mapping, from);
+	unsigned offset = page_cache_offset(mapping, from);
 	unsigned blocksize, iblock, length, pos;
 	struct inode *inode = mapping->host;
 	struct buffer_head *bh;
@@ -1769,7 +1769,8 @@ int ext4_block_truncate_page(handle_t *handle, struct page *page,
 
 	blocksize = inode->i_sb->s_blocksize;
 	length = blocksize - (offset & (blocksize - 1));
-	iblock = index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	iblock = index <<
+		(page_cache_shift(mapping) - inode->i_sb->s_blocksize_bits);
 
 	/*
 	 * For "nobh" option,  we can only work if we don't need to
@@ -2249,7 +2250,7 @@ void ext4_truncate(struct inode *inode)
 		page = NULL;
 	} else {
 		page = grab_cache_page(mapping,
-				inode->i_size >> PAGE_CACHE_SHIFT);
+				page_cache_index(mapping, inode->i_size));
 		if (!page)
 			return;
 	}
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [18/36] Use page_cache_xxx in fs/reiserfs
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (16 preceding siblings ...)
  2007-08-28 19:06 ` [17/36] Use page_cache_xxx in fs/ext4 clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [19/36] Use page_cache_xxx for fs/xfs clameter
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0018-Use-page_cache_xxx-in-fs-reiserfs.patch --]
[-- Type: text/plain, Size: 19829 bytes --]

Use page_cache_xxx in fs/reiserfs

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/reiserfs/file.c            |   83 ++++++++++++++++++++++-------------------
 fs/reiserfs/inode.c           |   33 ++++++++++------
 fs/reiserfs/ioctl.c           |    2 +-
 fs/reiserfs/stree.c           |    8 ++-
 fs/reiserfs/tail_conversion.c |    5 +-
 fs/reiserfs/xattr.c           |   19 +++++----
 6 files changed, 84 insertions(+), 66 deletions(-)

Index: linux-2.6/fs/reiserfs/file.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/file.c	2007-08-27 21:22:40.000000000 -0700
+++ linux-2.6/fs/reiserfs/file.c	2007-08-27 21:50:01.000000000 -0700
@@ -187,9 +187,11 @@ static int reiserfs_allocate_blocks_for_
 	int curr_block;		// current block used to keep track of unmapped blocks.
 	int i;			// loop counter
 	int itempos;		// position in item
-	unsigned int from = (pos & (PAGE_CACHE_SIZE - 1));	// writing position in
+	struct address_space *mapping = prepared_pages[0]->mapping;
+	unsigned int from = page_cache_offset(mapping, pos);	// writing position in
 	// first page
-	unsigned int to = ((pos + write_bytes - 1) & (PAGE_CACHE_SIZE - 1)) + 1;	/* last modified byte offset in last page */
+	unsigned int to = page_cache_offset(mapping, pos + write_bytes - 1) + 1;
+					/* last modified byte offset in last page */
 	__u64 hole_size;	// amount of blocks for a file hole, if it needed to be created.
 	int modifying_this_item = 0;	// Flag for items traversal code to keep track
 	// of the fact that we already prepared
@@ -731,19 +733,22 @@ static int reiserfs_copy_from_user_to_fi
 	long page_fault = 0;	// status of copy_from_user.
 	int i;			// loop counter.
 	int offset;		// offset in page
+	struct address_space *mapping = prepared_pages[0]->mapping;
 
-	for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages;
+	for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages;
 	     i++, offset = 0) {
-		size_t count = min_t(size_t, PAGE_CACHE_SIZE - offset, write_bytes);	// How much of bytes to write to this page
+		size_t count = min_t(size_t, page_cache_size(mapping) - offset,
+			write_bytes);	// How much of bytes to write to this page
 		struct page *page = prepared_pages[i];	// Current page we process.
 
 		fault_in_pages_readable(buf, count);
 
 		/* Copy data from userspace to the current page */
 		kmap(page);
-		page_fault = __copy_from_user(page_address(page) + offset, buf, count);	// Copy the data.
+		page_fault = __copy_from_user(page_address(page) + offset, buf, count);
+					// Copy the data.
 		/* Flush processor's dcache for this page */
-		flush_dcache_page(page);
+		flush_mapping_page(page);
 		kunmap(page);
 		buf += count;
 		write_bytes -= count;
@@ -763,11 +768,12 @@ int reiserfs_commit_page(struct inode *i
 	int partial = 0;
 	unsigned blocksize;
 	struct buffer_head *bh, *head;
-	unsigned long i_size_index = inode->i_size >> PAGE_CACHE_SHIFT;
+	unsigned long i_size_index =
+		page_cache_index(inode->i_mapping, inode->i_size);
 	int new;
 	int logit = reiserfs_file_data_log(inode);
 	struct super_block *s = inode->i_sb;
-	int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
+	int bh_per_page = page_cache_size(inode->i_mapping) / s->s_blocksize;
 	struct reiserfs_transaction_handle th;
 	int ret = 0;
 
@@ -839,10 +845,11 @@ static int reiserfs_submit_file_region_f
 	int offset;		// Writing offset in page.
 	int orig_write_bytes = write_bytes;
 	int sd_update = 0;
+	struct address_space *mapping = inode->i_mapping;
 
-	for (i = 0, offset = (pos & (PAGE_CACHE_SIZE - 1)); i < num_pages;
+	for (i = 0, offset = page_cache_offset(mapping, pos); i < num_pages;
 	     i++, offset = 0) {
-		int count = min_t(int, PAGE_CACHE_SIZE - offset, write_bytes);	// How much of bytes to write to this page
+		int count = min_t(int, page_cache_size(mapping) - offset, write_bytes);	// How much of bytes to write to this page
 		struct page *page = prepared_pages[i];	// Current page we process.
 
 		status =
@@ -985,12 +992,12 @@ static int reiserfs_prepare_file_region_
     )
 {
 	int res = 0;		// Return values of different functions we call.
-	unsigned long index = pos >> PAGE_CACHE_SHIFT;	// Offset in file in pages.
-	int from = (pos & (PAGE_CACHE_SIZE - 1));	// Writing offset in first page
-	int to = ((pos + write_bytes - 1) & (PAGE_CACHE_SIZE - 1)) + 1;
+	struct address_space *mapping = inode->i_mapping;	// Pages are mapped here.
+	unsigned long index = page_cache_index(mapping, pos);	// Offset in file in pages.
+	int from = page_cache_offset(mapping, pos);	// Writing offset in first page
+	int to = page_cache_offset(mapping, pos + write_bytes - 1) + 1;
 	/* offset of last modified byte in last
 	   page */
-	struct address_space *mapping = inode->i_mapping;	// Pages are mapped here.
 	int i;			// Simple counter
 	int blocks = 0;		/* Return value (blocks that should be allocated) */
 	struct buffer_head *bh, *head;	// Current bufferhead and first bufferhead
@@ -1041,30 +1048,31 @@ static int reiserfs_prepare_file_region_
 		/* These are full-overwritten pages so we count all the blocks in
 		   these pages are counted as needed to be allocated */
 		blocks =
-		    (num_pages - 2) << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+		    (num_pages - 2) << (page_cache_shift(mapping) - inode->i_blkbits);
 
 	/* count blocks needed for first page (possibly partially written) */
-	blocks += ((PAGE_CACHE_SIZE - from) >> inode->i_blkbits) + !!(from & (inode->i_sb->s_blocksize - 1));	/* roundup */
+	blocks += ((page_cache_size(mapping) - from) >> inode->i_blkbits)
+		+ !!(from & (inode->i_sb->s_blocksize - 1));	/* roundup */
 
 	/* Now we account for last page. If last page == first page (we
 	   overwrite only one page), we substract all the blocks past the
 	   last writing position in a page out of already calculated number
 	   of blocks */
-	blocks += ((num_pages > 1) << (PAGE_CACHE_SHIFT - inode->i_blkbits)) -
-	    ((PAGE_CACHE_SIZE - to) >> inode->i_blkbits);
+	blocks += ((num_pages > 1) << (page_cache_shift(mapping) - inode->i_blkbits)) -
+	    ((page_cache_size(mapping) - to) >> inode->i_blkbits);
 	/* Note how we do not roundup here since partial blocks still
 	   should be allocated */
 
 	/* Now if all the write area lies past the file end, no point in
 	   maping blocks, since there is none, so we just zero out remaining
 	   parts of first and last pages in write area (if needed) */
-	if ((pos & ~((loff_t) PAGE_CACHE_SIZE - 1)) > inode->i_size) {
+	if ((pos & ~((loff_t) page_cache_size(mapping) - 1)) > inode->i_size) {
 		if (from != 0)		/* First page needs to be partially zeroed */
 			zero_user(prepared_pages[0], 0, from);
 
-		if (to != PAGE_CACHE_SIZE)	/* Last page needs to be partially zeroed */
+		if (to != page_cache_size(mapping))	/* Last page needs to be partially zeroed */
 			zero_user_segment(prepared_pages[num_pages-1], to,
-					PAGE_CACHE_SIZE);
+					page_cache_size(mapping));
 
 		/* Since all blocks are new - use already calculated value */
 		return blocks;
@@ -1201,8 +1209,8 @@ static int reiserfs_prepare_file_region_
 
 	/* Last page, see if it is not uptodate, or if the last page is past the end of the file. */
 	if (!PageUptodate(prepared_pages[num_pages - 1]) ||
-	    ((pos + write_bytes) >> PAGE_CACHE_SHIFT) >
-	    (inode->i_size >> PAGE_CACHE_SHIFT)) {
+	    page_cache_index(mapping, pos + write_bytes) >
+	    page_cache_index(mapping, inode->i_size)) {
 		head = page_buffers(prepared_pages[num_pages - 1]);
 
 		/* for each buffer in page */
@@ -1292,6 +1300,7 @@ static ssize_t reiserfs_file_write(struc
 	   locked pages in array for now */
 	struct page *prepared_pages[REISERFS_WRITE_PAGES_AT_A_TIME];
 	struct reiserfs_transaction_handle th;
+	struct address_space *mapping = inode->i_mapping;
 	th.t_trans_id = 0;
 
 	/* If a filesystem is converted from 3.5 to 3.6, we'll have v3.5 items
@@ -1358,10 +1367,10 @@ static ssize_t reiserfs_file_write(struc
 		size_t blocks_to_allocate;	/* how much blocks we need to allocate for this iteration */
 
 		/*  (pos & (PAGE_CACHE_SIZE-1)) is an idiom for offset into a page of pos */
-		num_pages = !!((pos + count) & (PAGE_CACHE_SIZE - 1)) +	/* round up partial
+		num_pages = !!(page_cache_offset(mapping, pos + count)) +	/* round up partial
 									   pages */
 		    ((count +
-		      (pos & (PAGE_CACHE_SIZE - 1))) >> PAGE_CACHE_SHIFT);
+		      page_cache_offset(mapping, pos)) >> page_cache_shift(mapping));
 		/* convert size to amount of
 		   pages */
 		reiserfs_write_lock(inode->i_sb);
@@ -1374,8 +1383,8 @@ static ssize_t reiserfs_file_write(struc
 			    min_t(size_t, REISERFS_WRITE_PAGES_AT_A_TIME,
 				  reiserfs_can_fit_pages(inode->i_sb));
 			/* Also we should not forget to set size in bytes accordingly */
-			write_bytes = (num_pages << PAGE_CACHE_SHIFT) -
-			    (pos & (PAGE_CACHE_SIZE - 1));
+			write_bytes = (num_pages << page_cache_shift(mapping)) -
+			page_cache_offset(mapping, pos);
 			/* If position is not on the
 			   start of the page, we need
 			   to substract the offset
@@ -1387,7 +1396,7 @@ static ssize_t reiserfs_file_write(struc
 		   we still have the space to write the blocks to */
 		reiserfs_claim_blocks_to_be_allocated(inode->i_sb,
 						      num_pages <<
-						      (PAGE_CACHE_SHIFT -
+						      (page_cache_shift(mapping) -
 						       inode->i_blkbits));
 		reiserfs_write_unlock(inode->i_sb);
 
@@ -1412,7 +1421,7 @@ static ssize_t reiserfs_file_write(struc
 			// No blocks were claimed before, so do it now.
 			reiserfs_claim_blocks_to_be_allocated(inode->i_sb,
 							      1 <<
-							      (PAGE_CACHE_SHIFT
+							      (page_cache_shift(mapping)
 							       -
 							       inode->
 							       i_blkbits));
@@ -1429,7 +1438,7 @@ static ssize_t reiserfs_file_write(struc
 		if (res < 0) {
 			reiserfs_release_claimed_blocks(inode->i_sb,
 							num_pages <<
-							(PAGE_CACHE_SHIFT -
+							(page_cache_shift(mapping) -
 							 inode->i_blkbits));
 			break;
 		}
@@ -1439,7 +1448,7 @@ static ssize_t reiserfs_file_write(struc
 		/* First we correct our estimate of how many blocks we need */
 		reiserfs_release_claimed_blocks(inode->i_sb,
 						(num_pages <<
-						 (PAGE_CACHE_SHIFT -
+						 (page_cache_shift(mapping) -
 						  inode->i_sb->
 						  s_blocksize_bits)) -
 						blocks_to_allocate);
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2007-08-27 21:22:40.000000000 -0700
+++ linux-2.6/fs/reiserfs/inode.c	2007-08-27 21:22:40.000000000 -0700
@@ -336,7 +336,8 @@ static int _get_block_create_0(struct in
 		goto finished;
 	}
 	// read file tail into part of page
-	offset = (cpu_key_k_offset(&key) - 1) & (PAGE_CACHE_SIZE - 1);
+	offset = page_cache_offset(inode->i_mapping,
+				cpu_key_k_offset(&key) - 1);
 	fs_gen = get_generation(inode->i_sb);
 	copy_item_head(&tmp_ih, ih);
 
@@ -522,10 +523,10 @@ static int convert_tail_for_hole(struct 
 		return -EIO;
 
 	/* always try to read until the end of the block */
-	tail_start = tail_offset & (PAGE_CACHE_SIZE - 1);
+	tail_start = page_cache_offset(inode->i_mapping, tail_offset);
 	tail_end = (tail_start | (bh_result->b_size - 1)) + 1;
 
-	index = tail_offset >> PAGE_CACHE_SHIFT;
+	index = page_cache_index(inode->i_mapping, tail_offset);
 	/* hole_page can be zero in case of direct_io, we are sure
 	   that we cannot get here if we write with O_DIRECT into
 	   tail page */
@@ -2007,11 +2008,13 @@ static int grab_tail_page(struct inode *
 	/* we want the page with the last byte in the file,
 	 ** not the page that will hold the next byte for appending
 	 */
-	unsigned long index = (p_s_inode->i_size - 1) >> PAGE_CACHE_SHIFT;
+	unsigned long index = page_cache_index(p_s_inode->i_mapping,
+						p_s_inode->i_size - 1);
 	unsigned long pos = 0;
 	unsigned long start = 0;
 	unsigned long blocksize = p_s_inode->i_sb->s_blocksize;
-	unsigned long offset = (p_s_inode->i_size) & (PAGE_CACHE_SIZE - 1);
+	unsigned long offset = page_cache_offset(p_s_inode->i_mapping,
+							p_s_inode->i_size);
 	struct buffer_head *bh;
 	struct buffer_head *head;
 	struct page *page;
@@ -2083,7 +2086,8 @@ int reiserfs_truncate_file(struct inode 
 {
 	struct reiserfs_transaction_handle th;
 	/* we want the offset for the first byte after the end of the file */
-	unsigned long offset = p_s_inode->i_size & (PAGE_CACHE_SIZE - 1);
+	unsigned long offset = page_cache_offset(p_s_inode->i_mapping,
+							p_s_inode->i_size);
 	unsigned blocksize = p_s_inode->i_sb->s_blocksize;
 	unsigned length;
 	struct page *page = NULL;
@@ -2232,7 +2236,7 @@ static int map_block_for_writepage(struc
 	} else if (is_direct_le_ih(ih)) {
 		char *p;
 		p = page_address(bh_result->b_page);
-		p += (byte_offset - 1) & (PAGE_CACHE_SIZE - 1);
+		p += page_cache_offset(inode->i_mapping, byte_offset - 1);
 		copy_size = ih_item_len(ih) - pos_in_item;
 
 		fs_gen = get_generation(inode->i_sb);
@@ -2331,7 +2335,8 @@ static int reiserfs_write_full_page(stru
 				    struct writeback_control *wbc)
 {
 	struct inode *inode = page->mapping->host;
-	unsigned long end_index = inode->i_size >> PAGE_CACHE_SHIFT;
+	unsigned long end_index = page_cache_index(inode->i_mapping,
+							inode->i_size);
 	int error = 0;
 	unsigned long block;
 	sector_t last_block;
@@ -2341,7 +2346,7 @@ static int reiserfs_write_full_page(stru
 	int checked = PageChecked(page);
 	struct reiserfs_transaction_handle th;
 	struct super_block *s = inode->i_sb;
-	int bh_per_page = PAGE_CACHE_SIZE / s->s_blocksize;
+	int bh_per_page = page_cache_size(inode->i_mapping) / s->s_blocksize;
 	th.t_trans_id = 0;
 
 	/* no logging allowed when nonblocking or from PF_MEMALLOC */
@@ -2368,16 +2373,18 @@ static int reiserfs_write_full_page(stru
 	if (page->index >= end_index) {
 		unsigned last_offset;
 
-		last_offset = inode->i_size & (PAGE_CACHE_SIZE - 1);
+		last_offset = page_cache_offset(inode->i_mapping, inode->i_size);
 		/* no file contents in this page */
 		if (page->index >= end_index + 1 || !last_offset) {
 			unlock_page(page);
 			return 0;
 		}
-		zero_user_segment(page, last_offset, PAGE_CACHE_SIZE);
+		zero_user_segment(page, last_offset,
+				page_cache_size(inode->i_mapping));
 	}
 	bh = head;
-	block = page->index << (PAGE_CACHE_SHIFT - s->s_blocksize_bits);
+	block = page->index << (page_cache_shift(inode->i_mapping)
+						- s->s_blocksize_bits);
 	last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
 	/* first map all the buffers, logging any direct items we find */
 	do {
@@ -2608,7 +2615,7 @@ static int reiserfs_commit_write(struct 
 				 unsigned from, unsigned to)
 {
 	struct inode *inode = page->mapping->host;
-	loff_t pos = ((loff_t) page->index << PAGE_CACHE_SHIFT) + to;
+	loff_t pos = page_cache_pos(page->mapping, page->index, to);
 	int ret = 0;
 	int update_sd = 0;
 	struct reiserfs_transaction_handle *th = NULL;
Index: linux-2.6/fs/reiserfs/ioctl.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/ioctl.c	2007-08-27 21:22:36.000000000 -0700
+++ linux-2.6/fs/reiserfs/ioctl.c	2007-08-27 21:22:40.000000000 -0700
@@ -168,8 +168,8 @@ static int reiserfs_unpack(struct inode 
 	 ** reiserfs_prepare_write on that page.  This will force a 
 	 ** reiserfs_get_block to unpack the tail for us.
 	 */
-	index = inode->i_size >> PAGE_CACHE_SHIFT;
 	mapping = inode->i_mapping;
+	index = page_cache_index(mapping, inode->i_size);
 	page = grab_cache_page(mapping, index);
 	retval = -ENOMEM;
 	if (!page) {
Index: linux-2.6/fs/reiserfs/stree.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/stree.c	2007-08-27 21:22:36.000000000 -0700
+++ linux-2.6/fs/reiserfs/stree.c	2007-08-27 21:22:40.000000000 -0700
@@ -1283,7 +1283,8 @@ int reiserfs_delete_item(struct reiserfs
 		 */
 
 		data = kmap_atomic(p_s_un_bh->b_page, KM_USER0);
-		off = ((le_ih_k_offset(&s_ih) - 1) & (PAGE_CACHE_SIZE - 1));
+		off = page_cache_offset(p_s_inode->i_mapping,
+					le_ih_k_offset(&s_ih) - 1);
 		memcpy(data + off,
 		       B_I_PITEM(PATH_PLAST_BUFFER(p_s_path), &s_ih),
 		       n_ret_value);
@@ -1439,7 +1440,7 @@ static void unmap_buffers(struct page *p
 
 	if (page) {
 		if (page_has_buffers(page)) {
-			tail_index = pos & (PAGE_CACHE_SIZE - 1);
+			tail_index = page_cache_offset(page_mapping(page), pos);
 			cur_index = 0;
 			head = page_buffers(page);
 			bh = head;
@@ -1459,7 +1460,8 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				cancel_dirty_page(page, PAGE_CACHE_SIZE);
+				cancel_dirty_page(page,
+					page_cache_size(page_mapping(page)));
 			}
 		}
 	}
Index: linux-2.6/fs/reiserfs/tail_conversion.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/tail_conversion.c	2007-08-27 21:22:36.000000000 -0700
+++ linux-2.6/fs/reiserfs/tail_conversion.c	2007-08-27 21:22:40.000000000 -0700
@@ -128,7 +128,8 @@ int direct2indirect(struct reiserfs_tran
 	 */
 	if (up_to_date_bh) {
 		unsigned pgoff =
-		    (tail_offset + total_tail - 1) & (PAGE_CACHE_SIZE - 1);
+			page_cache_offset(inode->i_mapping,
+			tail_offset + total_tail - 1);
 		char *kaddr = kmap_atomic(up_to_date_bh->b_page, KM_USER0);
 		memset(kaddr + pgoff, 0, n_blk_size - total_tail);
 		kunmap_atomic(kaddr, KM_USER0);
@@ -238,7 +239,7 @@ int indirect2direct(struct reiserfs_tran
 	 ** the page was locked and this part of the page was up to date when
 	 ** indirect2direct was called, so we know the bytes are still valid
 	 */
-	tail = tail + (pos & (PAGE_CACHE_SIZE - 1));
+	tail = tail + page_cache_offset(p_s_inode->i_mapping, pos);
 
 	PATH_LAST_POSITION(p_s_path)++;
 
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c	2007-08-27 21:22:36.000000000 -0700
+++ linux-2.6/fs/reiserfs/xattr.c	2007-08-27 21:56:54.000000000 -0700
@@ -487,13 +487,13 @@ reiserfs_xattr_set(struct inode *inode, 
 	while (buffer_pos < buffer_size || buffer_pos == 0) {
 		size_t chunk;
 		size_t skip = 0;
-		size_t page_offset = (file_pos & (PAGE_CACHE_SIZE - 1));
-		if (buffer_size - buffer_pos > PAGE_CACHE_SIZE)
-			chunk = PAGE_CACHE_SIZE;
+		size_t page_offset = page_cache_offset(mapping, file_pos);
+		if (buffer_size - buffer_pos > page_cache_size(mapping))
+			chunk = page_cache_size(mapping);
 		else
 			chunk = buffer_size - buffer_pos;
 
-		page = reiserfs_get_page(xinode, file_pos >> PAGE_CACHE_SHIFT);
+		page = reiserfs_get_page(xinode, page_cache_index(mapping, file_pos));
 		if (IS_ERR(page)) {
 			err = PTR_ERR(page);
 			goto out_filp;
@@ -505,8 +505,8 @@ reiserfs_xattr_set(struct inode *inode, 
 		if (file_pos == 0) {
 			struct reiserfs_xattr_header *rxh;
 			skip = file_pos = sizeof(struct reiserfs_xattr_header);
-			if (chunk + skip > PAGE_CACHE_SIZE)
-				chunk = PAGE_CACHE_SIZE - skip;
+			if (chunk + skip > page_cache_size(mapping))
+				chunk = page_cache_size(mapping) - skip;
 			rxh = (struct reiserfs_xattr_header *)data;
 			rxh->h_magic = cpu_to_le32(REISERFS_XATTR_MAGIC);
 			rxh->h_hash = cpu_to_le32(xahash);
@@ -597,12 +597,13 @@ reiserfs_xattr_get(const struct inode *i
 		size_t chunk;
 		char *data;
 		size_t skip = 0;
-		if (isize - file_pos > PAGE_CACHE_SIZE)
-			chunk = PAGE_CACHE_SIZE;
+		if (isize - file_pos > page_cache_size(xinode->i_mapping))
+			chunk = page_cache_size(xinode->i_mapping);
 		else
 			chunk = isize - file_pos;
 
-		page = reiserfs_get_page(xinode, file_pos >> PAGE_CACHE_SHIFT);
+		page = reiserfs_get_page(xinode,
+				page_cache_index(xinode->i_mapping, file_pos));
 		if (IS_ERR(page)) {
 			err = PTR_ERR(page);
 			goto out_dput;

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [19/36] Use page_cache_xxx for fs/xfs
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (17 preceding siblings ...)
  2007-08-28 19:06 ` [18/36] Use page_cache_xxx in fs/reiserfs clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [20/36] Use page_cache_xxx in drivers/block/rd.c clameter
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0019-Use-page_cache_xxx-for-fs-xfs.patch --]
[-- Type: text/plain, Size: 6776 bytes --]

Use page_cache_xxx for fs/xfs

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |   55 ++++++++++++++++++++++--------------------
 fs/xfs/linux-2.6/xfs_lrw.c  |    6 ++--
 2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index fd4105d..e48817a 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -74,7 +74,7 @@ xfs_page_trace(
 	xfs_inode_t	*ip;
 	bhv_vnode_t	*vp = vn_from_inode(inode);
 	loff_t		isize = i_size_read(inode);
-	loff_t		offset = page_offset(page);
+	loff_t		offset = page_cache_pos(page->mapping, page->index, 0);
 	int		delalloc = -1, unmapped = -1, unwritten = -1;
 
 	if (page_has_buffers(page))
@@ -610,7 +610,7 @@ xfs_probe_page(
 					break;
 			} while ((bh = bh->b_this_page) != head);
 		} else
-			ret = mapped ? 0 : PAGE_CACHE_SIZE;
+			ret = mapped ? 0 : page_cache_size(page->mapping);
 	}
 
 	return ret;
@@ -637,7 +637,7 @@ xfs_probe_cluster(
 	} while ((bh = bh->b_this_page) != head);
 
 	/* if we reached the end of the page, sum forwards in following pages */
-	tlast = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+	tlast = page_cache_index(inode->i_mapping, i_size_read(inode));
 	tindex = startpage->index + 1;
 
 	/* Prune this back to avoid pathological behavior */
@@ -655,14 +655,14 @@ xfs_probe_cluster(
 			size_t pg_offset, len = 0;
 
 			if (tindex == tlast) {
-				pg_offset =
-				    i_size_read(inode) & (PAGE_CACHE_SIZE - 1);
+				pg_offset = page_cache_offset(inode->i_mapping,
+							i_size_read(inode));
 				if (!pg_offset) {
 					done = 1;
 					break;
 				}
 			} else
-				pg_offset = PAGE_CACHE_SIZE;
+				pg_offset = page_cache_size(inode->i_mapping);
 
 			if (page->index == tindex && !TestSetPageLocked(page)) {
 				len = xfs_probe_page(page, pg_offset, mapped);
@@ -744,7 +744,8 @@ xfs_convert_page(
 	int			bbits = inode->i_blkbits;
 	int			len, page_dirty;
 	int			count = 0, done = 0, uptodate = 1;
- 	xfs_off_t		offset = page_offset(page);
+	struct address_space	*map = inode->i_mapping;
+	xfs_off_t		offset = page_cache_pos(map, page->index, 0);
 
 	if (page->index != tindex)
 		goto fail;
@@ -752,7 +753,7 @@ xfs_convert_page(
 		goto fail;
 	if (PageWriteback(page))
 		goto fail_unlock_page;
-	if (page->mapping != inode->i_mapping)
+	if (page->mapping != map)
 		goto fail_unlock_page;
 	if (!xfs_is_delayed_page(page, (*ioendp)->io_type))
 		goto fail_unlock_page;
@@ -764,20 +765,20 @@ xfs_convert_page(
 	 * Derivation:
 	 *
 	 * End offset is the highest offset that this page should represent.
-	 * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
-	 * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
+	 * If we are on the last page, (end_offset & page_cache_mask())
+	 * will evaluate non-zero and be less than page_cache_size() and
 	 * hence give us the correct page_dirty count. On any other page,
 	 * it will be zero and in that case we need page_dirty to be the
 	 * count of buffers on the page.
 	 */
 	end_offset = min_t(unsigned long long,
-			(xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT,
+			(xfs_off_t)(page->index + 1) << page_cache_shift(map),
 			i_size_read(inode));
 
 	len = 1 << inode->i_blkbits;
-	p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
-					PAGE_CACHE_SIZE);
-	p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+	p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+					page_cache_size(map));
+	p_offset = p_offset ? roundup(p_offset, len) : page_cache_size(map);
 	page_dirty = p_offset / len;
 
 	bh = head = page_buffers(page);
@@ -933,6 +934,8 @@ xfs_page_state_convert(
 	int			page_dirty, count = 0;
 	int			trylock = 0;
 	int			all_bh = unmapped;
+	struct address_space	*map = inode->i_mapping;
+	int			pagesize = page_cache_size(map);
 
 	if (startio) {
 		if (wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking)
@@ -941,11 +944,11 @@ xfs_page_state_convert(
 
 	/* Is this page beyond the end of the file? */
 	offset = i_size_read(inode);
-	end_index = offset >> PAGE_CACHE_SHIFT;
-	last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
+	end_index = page_cache_index(map, offset);
+	last_index = page_cache_index(map, (offset - 1));
 	if (page->index >= end_index) {
 		if ((page->index >= end_index + 1) ||
-		    !(i_size_read(inode) & (PAGE_CACHE_SIZE - 1))) {
+		    !(page_cache_offset(map, i_size_read(inode)))) {
 			if (startio)
 				unlock_page(page);
 			return 0;
@@ -959,22 +962,22 @@ xfs_page_state_convert(
 	 * Derivation:
 	 *
 	 * End offset is the highest offset that this page should represent.
-	 * If we are on the last page, (end_offset & (PAGE_CACHE_SIZE - 1))
-	 * will evaluate non-zero and be less than PAGE_CACHE_SIZE and
-	 * hence give us the correct page_dirty count. On any other page,
+	 * If we are on the last page, (page_cache_offset(mapping, end_offset))
+	 * will evaluate non-zero and be less than page_cache_size(mapping)
+	 * and hence give us the correct page_dirty count. On any other page,
 	 * it will be zero and in that case we need page_dirty to be the
 	 * count of buffers on the page.
  	 */
 	end_offset = min_t(unsigned long long,
-			(xfs_off_t)(page->index + 1) << PAGE_CACHE_SHIFT, offset);
+			(xfs_off_t)page_cache_pos(map, page->index + 1, 0), offset);
 	len = 1 << inode->i_blkbits;
-	p_offset = min_t(unsigned long, end_offset & (PAGE_CACHE_SIZE - 1),
-					PAGE_CACHE_SIZE);
-	p_offset = p_offset ? roundup(p_offset, len) : PAGE_CACHE_SIZE;
+	p_offset = min_t(unsigned long, page_cache_offset(map, end_offset),
+					pagesize);
+	p_offset = p_offset ? roundup(p_offset, len) : pagesize;
 	page_dirty = p_offset / len;
 
 	bh = head = page_buffers(page);
-	offset = page_offset(page);
+	offset = page_cache_pos(map, page->index, 0);
 	flags = BMAPI_READ;
 	type = IOMAP_NEW;
 
@@ -1122,7 +1125,7 @@ xfs_page_state_convert(
 
 	if (ioend && iomap_valid) {
 		offset = (iomap.iomap_offset + iomap.iomap_bsize - 1) >>
-					PAGE_CACHE_SHIFT;
+					page_cache_shift(map);
 		tlast = min_t(pgoff_t, offset, last_index);
 		xfs_cluster_write(inode, page->index + 1, &iomap, &ioend,
 					wbc, startio, all_bh, tlast);
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index 1614b81..d1f3130 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -143,9 +143,9 @@ xfs_iozero(
 	do {
 		unsigned long index, offset;
 
-		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
-		index = pos >> PAGE_CACHE_SHIFT;
-		bytes = PAGE_CACHE_SIZE - offset;
+		offset = page_cache_offset(mapping, pos); /* Within page */
+		index = page_cache_index(mapping, pos);
+		bytes = page_cache_size(mapping) - offset;
 		if (bytes > count)
 			bytes = count;
 
-- 
1.5.2.4

-- 


* [20/36] Use page_cache_xxx in drivers/block/rd.c
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (18 preceding siblings ...)
  2007-08-28 19:06 ` [19/36] Use page_cache_xxx for fs/xfs clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [21/36] compound pages: PageHead/PageTail instead of PageCompound clameter
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0020-Use-page_cache_xxx-in-drivers-block-rd.c.patch --]
[-- Type: text/plain, Size: 1423 bytes --]

Use page_cache_xxx in drivers/block/rd.c

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 drivers/block/rd.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 65150b5..e148b3b 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -121,7 +121,7 @@ static void make_page_uptodate(struct page *page)
 			}
 		} while ((bh = bh->b_this_page) != head);
 	} else {
-		memset(page_address(page), 0, PAGE_CACHE_SIZE);
+		memset(page_address(page), 0, page_cache_size(page_mapping(page)));
 	}
 	flush_dcache_page(page);
 	SetPageUptodate(page);
@@ -201,9 +201,9 @@ static const struct address_space_operations ramdisk_aops = {
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
 				struct address_space *mapping)
 {
-	pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
+	pgoff_t index = sector >> (page_cache_shift(mapping) - 9);
 	unsigned int vec_offset = vec->bv_offset;
-	int offset = (sector << 9) & ~PAGE_CACHE_MASK;
+	int offset = page_cache_offset(mapping, (sector << 9));
 	int size = vec->bv_len;
 	int err = 0;
 
@@ -213,7 +213,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
 		char *src;
 		char *dst;
 
-		count = PAGE_CACHE_SIZE - offset;
+		count = page_cache_size(mapping) - offset;
 		if (count > size)
 			count = size;
 		size -= count;
-- 
1.5.2.4

-- 


* [21/36] compound pages: PageHead/PageTail instead of PageCompound
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (19 preceding siblings ...)
  2007-08-28 19:06 ` [20/36] Use page_cache_xxx in drivers/block/rd.c clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [22/36] compound pages: Add new support functions clameter
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0021-compound-pages-PageHead-PageTail-instead-of-PageCom.patch --]
[-- Type: text/plain, Size: 6099 bytes --]

This patch enhances the handling of compound pages in the VM. It may also
be important for the antifrag patches, which need to manage a set of
higher order free pages, and for other uses of compound pages.

For now it simplifies accounting for SLUB pages, but the groundwork here is
important for the large block size patches and for allowing page migration
of larger pages. With this framework we may be able to get to a point where
compound pages keep their flags while they are free and Mel may avoid having
special functions for determining the page order of higher order freed pages.
If we can avoid the setup and teardown of higher order pages then allocation
and release of compound pages will be faster.

Looking at the handling of compound pages, we see that the fact that a page
is part of a higher order page is rarely interesting in itself. What matters
is the distinction between head pages and tail pages of higher order pages.
Head pages keep the page state, and it is usually sufficient to pass a
pointer to a head page. Encountering a tail page is usually an error, or the
tail page may need to be treated like a PAGE_SIZE page. So a single compound
flag in the page flags is not what we need. Instead we introduce a flag for
the head page and another for the tail page. The PageCompound test is
preserved for backward compatibility and tests whether either PageTail or
PageHead has been set.

After this patchset the uses of PageCompound() will be reduced significantly
in the core VM. The I/O layer will still use PageCompound() for direct I/O.
However, if we at some point convert direct I/O to also support compound
pages as a single unit, then PageCompound() there may become unnecessary,
as may the leftover check in mm/swap.c. We may end up mostly with checks
for PageTail and PageHead.

This patch:

Use two separate page flags for the head and tail of compound pages.
PageHead() and PageTail() become more efficient.

PageCompound then becomes a check for PageTail || PageHead. Over time
it is expected that PageCompound will mostly go away, since head page
processing will be different from tail page processing in most situations.

We can remove the compound page check from set_page_refcounted since
PG_reclaim is no longer overloaded.

Also the check in _free_one_page can only be for PageHead. We cannot
free a tail page.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/page-flags.h |   41 +++++++++++------------------------------
 mm/internal.h              |    2 +-
 mm/page_alloc.c            |    2 +-
 3 files changed, 13 insertions(+), 32 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 209d3a4..2786693 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,13 +83,15 @@
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
-#define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define PG_head			21	/* Page is head of a compound page */
+#define PG_tail			22	/* Page is tail of a compound page */
+
 /* PG_readahead is only used for file reads; PG_reclaim is only for writes */
 #define PG_readahead		PG_reclaim /* Reminder to do async read-ahead */
 
@@ -216,37 +218,16 @@ static inline void SetPageUptodate(struct page *page)
 #define ClearPageReclaim(page)	clear_bit(PG_reclaim, &(page)->flags)
 #define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)
 
-#define PageCompound(page)	test_bit(PG_compound, &(page)->flags)
-#define __SetPageCompound(page)	__set_bit(PG_compound, &(page)->flags)
-#define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags)
-
-/*
- * PG_reclaim is used in combination with PG_compound to mark the
- * head and tail of a compound page
- *
- * PG_compound & PG_reclaim	=> Tail page
- * PG_compound & ~PG_reclaim	=> Head page
- */
-
-#define PG_head_tail_mask ((1L << PG_compound) | (1L << PG_reclaim))
+#define PageHead(page)		test_bit(PG_head, &(page)->flags)
+#define __SetPageHead(page)	__set_bit(PG_head, &(page)->flags)
+#define __ClearPageHead(page)	__clear_bit(PG_head, &(page)->flags)
 
-#define PageTail(page)	((page->flags & PG_head_tail_mask) \
-				== PG_head_tail_mask)
-
-static inline void __SetPageTail(struct page *page)
-{
-	page->flags |= PG_head_tail_mask;
-}
-
-static inline void __ClearPageTail(struct page *page)
-{
-	page->flags &= ~PG_head_tail_mask;
-}
+#define PageTail(page)		test_bit(PG_tail, &(page->flags))
+#define __SetPageTail(page)	__set_bit(PG_tail, &(page)->flags)
+#define __ClearPageTail(page)	__clear_bit(PG_tail, &(page)->flags)
 
-#define PageHead(page)	((page->flags & PG_head_tail_mask) \
-				== (1L << PG_compound))
-#define __SetPageHead(page)	__SetPageCompound(page)
-#define __ClearPageHead(page)	__ClearPageCompound(page)
+#define PageCompound(page)	((page)->flags & \
+				((1L << PG_head) | (1L << PG_tail)))
 
 #ifdef CONFIG_SWAP
 #define PageSwapCache(page)	test_bit(PG_swapcache, &(page)->flags)
diff --git a/mm/internal.h b/mm/internal.h
index a3110c0..c79791c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -24,7 +24,7 @@ static inline void set_page_count(struct page *page, int v)
  */
 static inline void set_page_refcounted(struct page *page)
 {
-	VM_BUG_ON(PageCompound(page) && PageTail(page));
+	VM_BUG_ON(PageTail(page));
 	VM_BUG_ON(atomic_read(&page->_count));
 	set_page_count(page, 1);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d3550c..8f59605 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -405,7 +405,7 @@ static inline void __free_one_page(struct page *page,
 	unsigned long page_idx;
 	int order_size = 1 << order;
 
-	if (unlikely(PageCompound(page)))
+	if (unlikely(PageHead(page)))
 		destroy_compound_page(page, order);
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
-- 
1.5.2.4

-- 


* [22/36] compound pages: Add new support functions
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (20 preceding siblings ...)
  2007-08-28 19:06 ` [21/36] compound pages: PageHead/PageTail instead of PageCompound clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [23/36] compound pages: vmstat support clameter
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0022-compound-pages-Add-new-support-functions.patch --]
[-- Type: text/plain, Size: 1082 bytes --]

compound_pages(page)	-> Determine the number of base pages of a compound page

compound_shift(page)	-> Determine the page shift of a compound page

compound_size(page)	-> Determine the size of a compound page

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm.h |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3e9e8fe..fa4cbab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -362,6 +362,21 @@ static inline void set_compound_order(struct page *page, unsigned long order)
 	page[1].lru.prev = (void *)order;
 }
 
+static inline int compound_pages(struct page *page)
+{
+	return 1 << compound_order(page);
+}
+
+static inline int compound_shift(struct page *page)
+{
+	return PAGE_SHIFT + compound_order(page);
+}
+
+static inline int compound_size(struct page *page)
+{
+	return PAGE_SIZE << compound_order(page);
+}
+
 /*
  * Multiple processes may "see" the same page. E.g. for untouched
  * mappings of /dev/null, all processes see the same page full of
-- 
1.5.2.4

-- 


* [23/36] compound pages: vmstat support
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (21 preceding siblings ...)
  2007-08-28 19:06 ` [22/36] compound pages: Add new support functions clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [24/36] compound pages: Use new compound vmstat functions in SLUB clameter
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0023-compound-pages-vmstat-support.patch --]
[-- Type: text/plain, Size: 2650 bytes --]

Add support for compound pages so that

inc_xxx and dec_xxx

will adjust the ZVCs by the number of base pages of the compound page.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/vmstat.h |    5 ++---
 mm/vmstat.c            |   18 +++++++++++++-----
 2 files changed, 15 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/vmstat.h	2007-08-27 20:59:42.000000000 -0700
@@ -234,7 +234,7 @@ static inline void __inc_zone_state(stru
 static inline void __inc_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	__mod_zone_page_state(page_zone(page), item, compound_pages(page));
 }
 
 static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
@@ -246,8 +246,7 @@ static inline void __dec_zone_state(stru
 static inline void __dec_zone_page_state(struct page *page,
 			enum zone_stat_item item)
 {
-	atomic_long_dec(&page_zone(page)->vm_stat[item]);
-	atomic_long_dec(&vm_stat[item]);
+	__mod_zone_page_state(page_zone(page), item, -compound_pages(page));
 }
 
 /*
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/vmstat.c	2007-08-27 20:59:42.000000000 -0700
@@ -225,7 +225,12 @@ void __inc_zone_state(struct zone *zone,
 
 void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__inc_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__inc_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, compound_pages(page));
 }
 EXPORT_SYMBOL(__inc_zone_page_state);
 
@@ -246,7 +251,12 @@ void __dec_zone_state(struct zone *zone,
 
 void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
 {
-	__dec_zone_state(page_zone(page), item);
+	struct zone *z = page_zone(page);
+
+	if (likely(!PageHead(page)))
+		__dec_zone_state(z, item);
+	else
+		__mod_zone_page_state(z, item, -compound_pages(page));
 }
 EXPORT_SYMBOL(__dec_zone_page_state);
 
@@ -262,11 +272,9 @@ void inc_zone_state(struct zone *zone, e
 void inc_zone_page_state(struct page *page, enum zone_stat_item item)
 {
 	unsigned long flags;
-	struct zone *zone;
 
-	zone = page_zone(page);
 	local_irq_save(flags);
-	__inc_zone_state(zone, item);
+	__inc_zone_page_state(page, item);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(inc_zone_page_state);

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [24/36] compound pages: Use new compound vmstat functions in SLUB
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (22 preceding siblings ...)
  2007-08-28 19:06 ` [23/36] compound pages: vmstat support clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [25/36] compound pages: Allow use of get_page_unless_zero with compound pages clameter
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0024-compound-pages-Use-new-compound-vmstat-functions-in.patch --]
[-- Type: text/plain, Size: 1550 bytes --]

Use the new dec/inc functions to simplify SLUB's accounting
of pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/slub.c |   13 ++++---------
 1 files changed, 4 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/slub.c	2007-08-27 21:02:51.000000000 -0700
@@ -1038,7 +1038,6 @@ static inline void kmem_cache_open_debug
 static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
 	struct page * page;
-	int pages = 1 << s->order;
 
 	if (s->order)
 		flags |= __GFP_COMP;
@@ -1054,10 +1053,9 @@ static struct page *allocate_slab(struct
 	if (!page)
 		return NULL;
 
-	mod_zone_page_state(page_zone(page),
+	inc_zone_page_state(page,
 		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		pages);
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE);
 
 	return page;
 }
@@ -1124,8 +1122,6 @@ out:
 
 static void __free_slab(struct kmem_cache *s, struct page *page)
 {
-	int pages = 1 << s->order;
-
 	if (unlikely(SlabDebug(page))) {
 		void *p;
 
@@ -1135,10 +1131,9 @@ static void __free_slab(struct kmem_cach
 		ClearSlabDebug(page);
 	}
 
-	mod_zone_page_state(page_zone(page),
+	dec_zone_page_state(page,
 		(s->flags & SLAB_RECLAIM_ACCOUNT) ?
-		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE,
-		- pages);
+		NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE);
 
 	page->mapping = NULL;
 	__free_pages(page, s->order);

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [25/36] compound pages: Allow use of get_page_unless_zero with compound pages
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (23 preceding siblings ...)
  2007-08-28 19:06 ` [24/36] compound pages: Use new compound vmstat functions in SLUB clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [26/36] compound pages: Allow freeing of compound pages via pagevec clameter
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0025-compound-pages-Allow-use-of-get_page_unless_zero-wi.patch --]
[-- Type: text/plain, Size: 959 bytes --]

This is needed by slab defragmentation. The refcount of a head page
may be incremented to ensure that a compound page will not go away under us.

It may also be needed for defragmentation of higher order pages: moving
compound pages may require establishing a reference before the page
migration functions can be used.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-08-27 20:59:40.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-08-27 21:03:20.000000000 -0700
@@ -290,7 +290,7 @@ static inline int put_page_testzero(stru
  */
 static inline int get_page_unless_zero(struct page *page)
 {
-	VM_BUG_ON(PageCompound(page));
+	VM_BUG_ON(PageTail(page));
 	return atomic_inc_not_zero(&page->_count);
 }
 

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [26/36] compound pages: Allow freeing of compound pages via pagevec
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (24 preceding siblings ...)
  2007-08-28 19:06 ` [25/36] compound pages: Allow use of get_page_unless_zero with compound pages clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [27/36] Compound page zeroing and flushing clameter
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0026-compound-pages-Allow-freeing-of-compound-pages-via.patch --]
[-- Type: text/plain, Size: 2492 bytes --]

Allow the freeing of compound pages via pagevec.

In release_pages() we currently special case for compound pages in order to
be sure to always decrement the page count of the head page and not the
tail page. However that redirection to the head page is only necessary for
tail pages. So we can actually use PageTail instead of PageCompound there
by avoiding the redirection to the first page. Tail page handling is
not changed.

The head page of a compound page now represents a single large page.
We do the usual processing, including checking whether it is on the LRU
and removing it if so (not useful right now, but this will work once
compound pages can be on the LRU). Then we add the compound page to the
pagevec. Only head pages end up on the pagevec, never tail pages.

In __pagevec_free() we then check if we are freeing a head page and if
so call the destructor for the compound page.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 mm/page_alloc.c |   13 +++++++++++--
 mm/swap.c       |    8 +++++++-
 2 files changed, 18 insertions(+), 3 deletions(-)

Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c	2007-08-27 20:59:38.000000000 -0700
+++ linux-2.6/mm/page_alloc.c	2007-08-27 21:05:34.000000000 -0700
@@ -1441,8 +1441,17 @@ void __pagevec_free(struct pagevec *pvec
 {
 	int i = pagevec_count(pvec);
 
-	while (--i >= 0)
-		free_hot_cold_page(pvec->pages[i], pvec->cold);
+	while (--i >= 0) {
+		struct page *page = pvec->pages[i];
+
+		if (PageHead(page)) {
+			compound_page_dtor *dtor;
+
+			dtor = get_compound_page_dtor(page);
+			(*dtor)(page);
+		} else
+			free_hot_cold_page(page, pvec->cold);
+	}
 }
 
 fastcall void __free_pages(struct page *page, unsigned int order)
Index: linux-2.6/mm/swap.c
===================================================================
--- linux-2.6.orig/mm/swap.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/swap.c	2007-08-27 21:05:34.000000000 -0700
@@ -263,7 +263,13 @@ void release_pages(struct page **pages, 
 	for (i = 0; i < nr; i++) {
 		struct page *page = pages[i];
 
-		if (unlikely(PageCompound(page))) {
+		/*
+		 * If we have a tail page on the LRU then we need to
+		 * decrement the page count of the head page. There
+		 * is no further need to do anything since tail pages
+		 * cannot be on the LRU.
+		 */
+		if (unlikely(PageTail(page))) {
 			if (zone) {
 				spin_unlock_irq(&zone->lru_lock);
 				zone = NULL;

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [27/36] Compound page zeroing and flushing
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (25 preceding siblings ...)
  2007-08-28 19:06 ` [26/36] compound pages: Allow freeing of compound pages via pagevec clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [28/36] Fix PAGE SIZE assumption in miscellaneous places clameter
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0027-Compound-page-zeroing-and-flushing.patch --]
[-- Type: text/plain, Size: 3903 bytes --]

We may now have to zero and flush higher order pages. Implement
clear_mapping_page and flush_mapping_page to do that job. Replace
the flushing and clearing at some key locations for the pagecache.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/libfs.c              |    4 ++--
 include/linux/highmem.h |   31 +++++++++++++++++++++++++++++--
 mm/filemap.c            |    4 ++--
 mm/filemap_xip.c        |    4 ++--
 4 files changed, 35 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2007-08-27 20:51:55.000000000 -0700
+++ linux-2.6/fs/libfs.c	2007-08-27 21:08:04.000000000 -0700
@@ -330,8 +330,8 @@ int simple_rename(struct inode *old_dir,
 
 int simple_readpage(struct file *file, struct page *page)
 {
-	clear_highpage(page);
-	flush_dcache_page(page);
+	clear_mapping_page(page);
+	flush_mapping_page(page);
 	SetPageUptodate(page);
 	unlock_page(page);
 	return 0;
Index: linux-2.6/include/linux/highmem.h
===================================================================
--- linux-2.6.orig/include/linux/highmem.h	2007-08-27 19:22:17.000000000 -0700
+++ linux-2.6/include/linux/highmem.h	2007-08-27 21:08:04.000000000 -0700
@@ -124,14 +124,41 @@ static inline void clear_highpage(struct
 	kunmap_atomic(kaddr, KM_USER0);
 }
 
+/*
+ * Clear a higher order page
+ */
+static inline void clear_mapping_page(struct page *page)
+{
+	int nr_pages = compound_pages(page);
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		clear_highpage(page + i);
+}
+
+/*
+ * Primitive support for flushing higher order pages.
+ *
+ * A bit stupid: On many platforms flushing the first page
+ * will flush any TLB starting there
+ */
+static inline void flush_mapping_page(struct page *page)
+{
+	int nr_pages = compound_pages(page);
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		flush_dcache_page(page + i);
+}
+
 static inline void zero_user_segments(struct page *page,
 	unsigned start1, unsigned end1,
 	unsigned start2, unsigned end2)
 {
 	void *kaddr = kmap_atomic(page, KM_USER0);
 
-	BUG_ON(end1 > PAGE_SIZE ||
-		end2 > PAGE_SIZE);
+	BUG_ON(end1 > compound_size(page) ||
+		end2 > compound_size(page));
 
 	if (end1 > start1)
 		memset(kaddr + start1, 0, end1 - start1);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-08-27 19:31:13.000000000 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 21:08:04.000000000 -0700
@@ -941,7 +941,7 @@ page_ok:
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_mapping_page(page);
 
 		/*
 		 * When a sequential read accesses a page several times,
@@ -1932,7 +1932,7 @@ generic_file_buffered_write(struct kiocb
 		else
 			copied = filemap_copy_from_user_iovec(page, offset,
 						cur_iov, iov_base, bytes);
-		flush_dcache_page(page);
+		flush_mapping_page(page);
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (status == AOP_TRUNCATED_PAGE) {
 			page_cache_release(page);
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c	2007-08-27 20:51:40.000000000 -0700
+++ linux-2.6/mm/filemap_xip.c	2007-08-27 21:08:04.000000000 -0700
@@ -104,7 +104,7 @@ do_xip_mapping_read(struct address_space
 		 * before reading the page on the kernel side.
 		 */
 		if (mapping_writably_mapped(mapping))
-			flush_dcache_page(page);
+			flush_mapping_page(page);
 
 		/*
 		 * Ok, we have the page, so now we can copy it to user space...
@@ -320,7 +320,7 @@ __xip_file_write(struct file *filp, cons
 		}
 
 		copied = filemap_copy_from_user(page, offset, buf, bytes);
-		flush_dcache_page(page);
+		flush_mapping_page(page);
 		if (likely(copied > 0)) {
 			status = copied;
 

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [28/36] Fix PAGE SIZE assumption in miscellaneous places
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (26 preceding siblings ...)
  2007-08-28 19:06 ` [27/36] Compound page zeroing and flushing clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [29/36] Fix up reclaim counters clameter
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0028-Fix-PAGE-SIZE-assumption-in-miscellaneous-places.patch --]
[-- Type: text/plain, Size: 658 bytes --]

Fix PAGE SIZE assumption in miscellaneous places.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 kernel/futex.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a124250..c6102e8 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -258,7 +258,7 @@ int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared,
 	err = get_user_pages(current, mm, address, 1, 0, 0, &page, NULL);
 	if (err >= 0) {
 		key->shared.pgoff =
-			page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+			page->index << (compound_order(page) - PAGE_SHIFT);
 		put_page(page);
 		return 0;
 	}
-- 
1.5.2.4

-- 

^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [29/36] Fix up reclaim counters
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (27 preceding siblings ...)
  2007-08-28 19:06 ` [28/36] Fix PAGE SIZE assumption in miscellaneous places clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [30/36] Add VM_BUG_ONs to check for correct page order clameter
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0029-Fix-up-reclaim-counters.patch --]
[-- Type: text/plain, Size: 5028 bytes --]

Compound pages of an arbitrary order may now be on the LRU and
may be reclaimed.

Adjust the counting in vmscan.c to count the number of base
pages.

Also change the active and inactive accounting to do the same.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 include/linux/mm_inline.h |   36 +++++++++++++++++++++++++++---------
 mm/vmscan.c               |   22 ++++++++++++----------
 2 files changed, 39 insertions(+), 19 deletions(-)

Index: linux-2.6/include/linux/mm_inline.h
===================================================================
--- linux-2.6.orig/include/linux/mm_inline.h	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/mm_inline.h	2007-08-27 21:08:27.000000000 -0700
@@ -2,39 +2,57 @@ static inline void
 add_page_to_active_list(struct zone *zone, struct page *page)
 {
 	list_add(&page->lru, &zone->active_list);
-	__inc_zone_state(zone, NR_ACTIVE);
+	if (!PageHead(page))
+		__inc_zone_state(zone, NR_ACTIVE);
+	else
+		__inc_zone_page_state(page, NR_ACTIVE);
 }
 
 static inline void
 add_page_to_inactive_list(struct zone *zone, struct page *page)
 {
 	list_add(&page->lru, &zone->inactive_list);
-	__inc_zone_state(zone, NR_INACTIVE);
+	if (!PageHead(page))
+		__inc_zone_state(zone, NR_INACTIVE);
+	else
+		__inc_zone_page_state(page, NR_INACTIVE);
 }
 
 static inline void
 del_page_from_active_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_ACTIVE);
+	if (!PageHead(page))
+		__dec_zone_state(zone, NR_ACTIVE);
+	else
+		__dec_zone_page_state(page, NR_ACTIVE);
 }
 
 static inline void
 del_page_from_inactive_list(struct zone *zone, struct page *page)
 {
 	list_del(&page->lru);
-	__dec_zone_state(zone, NR_INACTIVE);
+	if (!PageHead(page))
+		__dec_zone_state(zone, NR_INACTIVE);
+	else
+		__dec_zone_page_state(page, NR_INACTIVE);
 }
 
 static inline void
 del_page_from_lru(struct zone *zone, struct page *page)
 {
+	enum zone_stat_item counter = NR_ACTIVE;
+
 	list_del(&page->lru);
-	if (PageActive(page)) {
+	if (PageActive(page))
 		__ClearPageActive(page);
-		__dec_zone_state(zone, NR_ACTIVE);
-	} else {
-		__dec_zone_state(zone, NR_INACTIVE);
-	}
+	else
+		counter = NR_INACTIVE;
+
+	if (!PageHead(page))
+		__dec_zone_state(zone, counter);
+	else
+		__dec_zone_page_state(page, counter);
 }
 
+
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/vmscan.c	2007-08-27 21:08:27.000000000 -0700
@@ -466,14 +466,14 @@ static unsigned long shrink_page_list(st
 
 		VM_BUG_ON(PageActive(page));
 
-		sc->nr_scanned++;
+		sc->nr_scanned += compound_pages(page);
 
 		if (!sc->may_swap && page_mapped(page))
 			goto keep_locked;
 
 		/* Double the slab pressure for mapped and swapcache pages */
 		if (page_mapped(page) || PageSwapCache(page))
-			sc->nr_scanned++;
+			sc->nr_scanned += compound_pages(page);
 
 		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
@@ -590,7 +590,7 @@ static unsigned long shrink_page_list(st
 
 free_it:
 		unlock_page(page);
-		nr_reclaimed++;
+		nr_reclaimed += compound_pages(page);
 		if (!pagevec_add(&freed_pvec, page))
 			__pagevec_release_nonlru(&freed_pvec);
 		continue;
@@ -682,22 +682,23 @@ static unsigned long isolate_lru_pages(u
 	unsigned long nr_taken = 0;
 	unsigned long scan;
 
-	for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
+	for (scan = 0; scan < nr_to_scan && !list_empty(src); ) {
 		struct page *page;
 		unsigned long pfn;
 		unsigned long end_pfn;
 		unsigned long page_pfn;
+		int pages;
 		int zone_id;
 
 		page = lru_to_page(src);
 		prefetchw_prev_lru_page(page, src, flags);
-
+		pages = compound_pages(page);
 		VM_BUG_ON(!PageLRU(page));
 
 		switch (__isolate_lru_page(page, mode)) {
 		case 0:
 			list_move(&page->lru, dst);
-			nr_taken++;
+			nr_taken += pages;
 			break;
 
 		case -EBUSY:
@@ -743,8 +744,8 @@ static unsigned long isolate_lru_pages(u
 			switch (__isolate_lru_page(cursor_page, mode)) {
 			case 0:
 				list_move(&cursor_page->lru, dst);
-				nr_taken++;
-				scan++;
+				nr_taken += compound_pages(cursor_page);
+				scan += compound_pages(cursor_page);
 				break;
 
 			case -EBUSY:
@@ -754,6 +755,7 @@ static unsigned long isolate_lru_pages(u
 				break;
 			}
 		}
+		scan += pages;
 	}
 
 	*scanned = scan;
@@ -1010,7 +1012,7 @@ force_reclaim_mapped:
 		ClearPageActive(page);
 
 		list_move(&page->lru, &zone->inactive_list);
-		pgmoved++;
+		pgmoved += compound_pages(page);
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_INACTIVE, pgmoved);
 			spin_unlock_irq(&zone->lru_lock);
@@ -1038,7 +1040,7 @@ force_reclaim_mapped:
 		SetPageLRU(page);
 		VM_BUG_ON(!PageActive(page));
 		list_move(&page->lru, &zone->active_list);
-		pgmoved++;
+		pgmoved += compound_pages(page);
 		if (!pagevec_add(&pvec, page)) {
 			__mod_zone_page_state(zone, NR_ACTIVE, pgmoved);
 			pgmoved = 0;

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [30/36] Add VM_BUG_ONs to check for correct page order
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (28 preceding siblings ...)
  2007-08-28 19:06 ` [29/36] Fix up reclaim counters clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [31/36] Large Blocksize: Core piece clameter
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0030-Add-VM_BUG_ONs-to-check-for-correct-page-order.patch --]
[-- Type: text/plain, Size: 3916 bytes --]

Before allowing different page orders it may be wise to put some checkpoints
in at various places. The checkpoints help debugging whenever a page of the
wrong order shows up in a mapping. This is useful when converting new
filesystems to utilize larger pages.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/buffer.c  |    1 +
 mm/filemap.c |   18 +++++++++++++++---
 2 files changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-27 20:52:34.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-08-27 21:09:19.000000000 -0700
@@ -893,6 +893,7 @@ struct buffer_head *alloc_page_buffers(s
 	long offset;
 	unsigned int page_size = page_cache_size(page->mapping);
 
+	BUG_ON(size > page_size);
 try_again:
 	head = NULL;
 	offset = page_size;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-08-27 21:08:04.000000000 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 21:09:19.000000000 -0700
@@ -128,6 +128,7 @@ void remove_from_page_cache(struct page 
 	struct address_space *mapping = page->mapping;
 
 	BUG_ON(!PageLocked(page));
+	VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 
 	write_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
@@ -269,6 +270,7 @@ int wait_on_page_writeback_range(struct 
 			if (page->index > end)
 				continue;
 
+			VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 			wait_on_page_writeback(page);
 			if (PageError(page))
 				ret = -EIO;
@@ -440,6 +442,7 @@ int add_to_page_cache(struct page *page,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
@@ -599,8 +602,10 @@ struct page * find_get_page(struct addre
 
 	read_lock_irq(&mapping->tree_lock);
 	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
+	if (page) {
+		VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 		page_cache_get(page);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return page;
 }
@@ -625,6 +630,7 @@ struct page *find_lock_page(struct addre
 repeat:
 	page = radix_tree_lookup(&mapping->page_tree, offset);
 	if (page) {
+		VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 		page_cache_get(page);
 		if (TestSetPageLocked(page)) {
 			read_unlock_irq(&mapping->tree_lock);
@@ -715,8 +721,10 @@ unsigned find_get_pages(struct address_s
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup(&mapping->page_tree,
 				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
+	for (i = 0; i < ret; i++) {
+		VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
 		page_cache_get(pages[i]);
+	}
 	read_unlock_irq(&mapping->tree_lock);
 	return ret;
 }
@@ -746,6 +754,7 @@ unsigned find_get_pages_contig(struct ad
 		if (pages[i]->mapping == NULL || pages[i]->index != index)
 			break;
 
+		VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
 		page_cache_get(pages[i]);
 		index++;
 	}
@@ -774,8 +783,10 @@ unsigned find_get_pages_tag(struct addre
 	read_lock_irq(&mapping->tree_lock);
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
+	for (i = 0; i < ret; i++) {
+		VM_BUG_ON(mapping_order(mapping) != compound_order(pages[i]));
 		page_cache_get(pages[i]);
+	}
 	if (ret)
 		*index = pages[ret - 1]->index + 1;
 	read_unlock_irq(&mapping->tree_lock);
@@ -2233,6 +2244,7 @@ int try_to_release_page(struct page *pag
 	struct address_space * const mapping = page->mapping;
 
 	BUG_ON(!PageLocked(page));
+	VM_BUG_ON(mapping_order(mapping) != compound_order(page));
 	if (PageWriteback(page))
 		return 0;
 

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [31/36] Large Blocksize: Core piece
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (29 preceding siblings ...)
  2007-08-28 19:06 ` [30/36] Add VM_BUG_ONs to check for correct page order clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-30  0:11   ` Mingming Cao
  2007-08-28 19:06 ` [32/36] Readahead changes to support large blocksize clameter
                   ` (6 subsequent siblings)
  37 siblings, 1 reply; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0031-Large-Blocksize-Core-piece.patch --]
[-- Type: text/plain, Size: 16688 bytes --]

Provide an alternate definition for the page_cache_xxx(mapping, ...)
functions that can determine the current page size from the mapping
and generate the appropriate shifts, sizes and mask for the page cache
operations. Change the basic functions that allocate pages for the
page cache to be able to handle higher order allocations.

Provide a new function

mapping_setup(struct address_space *, gfp_t mask, int order)

that allows the setup of a mapping of any compound page order.

mapping_set_gfp_mask() is still provided but it sets mappings to order 0.
Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in
order for the filesystem to be able to use larger pages. For some key block
devices and filesystems the conversion is done here.

mapping_setup() for higher order is only allowed if the mapping does not
use DMA mappings or HIGHMEM since we do not support bouncing at the moment.
Thus BUG() on DMA mappings and clear the highmem bit of higher order mappings.

Modify the set_blocksize() function so that an arbitrary blocksize can be set.
Blocksizes up to order MAX_ORDER - 1 can be set. This is typically 8MB on many
platforms (order 11). Typically file systems are not only limited by the core
VM but also by the structure of their internal data structures. The core VM
limitations fall away with this patch. The functionality provided here
can do nothing about the internal limitations of filesystems.

Known internal limitations:

Ext2            64k
XFS             64k
Reiserfs        8k
Ext3            4k (rumor has it that changing a constant can remove the limit)
Ext4            4k

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 block/Kconfig               |   17 ++++++
 drivers/block/rd.c          |    6 ++-
 fs/block_dev.c              |   29 +++++++---
 fs/buffer.c                 |    4 +-
 fs/inode.c                  |    7 ++-
 fs/xfs/linux-2.6/xfs_buf.c  |    3 +-
 include/linux/buffer_head.h |   12 ++++-
 include/linux/fs.h          |    5 ++
 include/linux/pagemap.h     |  121 ++++++++++++++++++++++++++++++++++++++++--
 mm/filemap.c                |   17 ++++--
 10 files changed, 192 insertions(+), 29 deletions(-)

Index: linux-2.6/block/Kconfig
===================================================================
--- linux-2.6.orig/block/Kconfig	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/block/Kconfig	2007-08-27 21:16:38.000000000 -0700
@@ -62,6 +62,20 @@ config BLK_DEV_BSG
 	protocols (e.g. Task Management Functions and SMP in Serial
 	Attached SCSI).
 
+#
+# The functions to switch on larger pages in a filesystem will return an error
+# if the gfp flags for a mapping require only DMA pages. Highmem will always
+# be switched off for higher order mappings.
+#
+config LARGE_BLOCKSIZE
+	bool "Support blocksizes larger than page size"
+	default n
+	depends on EXPERIMENTAL
+	help
+	  Allows the page cache to support higher orders of pages. Higher
+	  order page cache pages may be useful to increase I/O performance
+	  and support special devices like CD or DVDs and Flash.
+
 endif # BLOCK
 
 source block/Kconfig.iosched
Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c	2007-08-27 20:59:27.000000000 -0700
+++ linux-2.6/drivers/block/rd.c	2007-08-27 21:10:38.000000000 -0700
@@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa
 			}
 		} while ((bh = bh->b_this_page) != head);
 	} else {
-		memset(page_address(page), 0, page_cache_size(page_mapping(page)));
+		memset(page_address(page), 0,
+			page_cache_size(page_mapping(page)));
 	}
 	flush_dcache_page(page);
 	SetPageUptodate(page);
@@ -380,7 +381,8 @@ static int rd_open(struct inode *inode, 
 		gfp_mask = mapping_gfp_mask(mapping);
 		gfp_mask &= ~(__GFP_FS|__GFP_IO);
 		gfp_mask |= __GFP_HIGH;
-		mapping_set_gfp_mask(mapping, gfp_mask);
+		mapping_setup(mapping, gfp_mask,
+			page_cache_blkbits_to_order(inode->i_blkbits));
 	}
 
 	return 0;
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/block_dev.c	2007-08-27 21:10:38.000000000 -0700
@@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic
 		return;
 	invalidate_bh_lrus();
 	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-}	
+}
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	/* Size must be a power of two, and between 512 and PAGE_SIZE */
-	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+	int order;
+
+	if (size > (PAGE_SIZE << (MAX_ORDER - 1)) ||
+			size < 512 || !is_power_of_2(size))
 		return -EINVAL;
 
 	/* Size cannot be smaller than the size supported by the device */
 	if (size < bdev_hardsect_size(bdev))
 		return -EINVAL;
 
+	order = page_cache_blocksize_to_order(size);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		int bits = blksize_bits(size);
+		struct address_space *mapping =
+			bdev->bd_inode->i_mapping;
+
 		sync_blockdev(bdev);
-		bdev->bd_block_size = size;
-		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
+		bdev->bd_block_size = size;
+		bdev->bd_inode->i_blkbits = bits;
+		mapping_setup(mapping, GFP_NOFS, order);
 	}
 	return 0;
 }
-
 EXPORT_SYMBOL(set_blocksize);
 
 int sb_set_blocksize(struct super_block *sb, int size)
 {
 	if (set_blocksize(sb->s_bdev, size))
 		return 0;
-	/* If we get here, we know size is power of two
-	 * and it's value is between 512 and PAGE_SIZE */
+	/*
+	 * If we get here, we know size is power of two
+	 * and its value is valid for the page cache
+	 */
 	sb->s_blocksize = size;
 	sb->s_blocksize_bits = blksize_bits(size);
 	return sb->s_blocksize;
@@ -574,7 +584,8 @@ struct block_device *bdget(dev_t dev)
 		inode->i_rdev = dev;
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
-		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+		mapping_setup(&inode->i_data, GFP_USER,
+			page_cache_blkbits_to_order(inode->i_blkbits));
 		inode->i_data.backing_dev_info = &default_backing_dev_info;
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-08-27 21:09:19.000000000 -0700
+++ linux-2.6/fs/buffer.c	2007-08-27 21:10:38.000000000 -0700
@@ -1090,7 +1090,7 @@ __getblk_slow(struct block_device *bdev,
 {
 	/* Size must be multiple of hard sectorsize */
 	if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
-			(size < 512 || size > PAGE_SIZE))) {
+		size < 512 || size > (PAGE_SIZE << (MAX_ORDER - 1)))) {
 		printk(KERN_ERR "getblk(): invalid block size %d requested\n",
 					size);
 		printk(KERN_ERR "hardsect size: %d\n",
@@ -1811,7 +1811,7 @@ static int __block_prepare_write(struct 
 				if (block_end > to || block_start < from)
 					zero_user_segments(page,
 						to, block_end,
-						block_start, from)
+						block_start, from);
 				continue;
 			}
 		}
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/inode.c	2007-08-27 21:10:38.000000000 -0700
@@ -145,7 +145,8 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
-		mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
+		mapping_setup(mapping, GFP_HIGHUSER_PAGECACHE,
+				page_cache_blkbits_to_order(inode->i_blkbits));
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 
@@ -243,7 +244,7 @@ void clear_inode(struct inode *inode)
 {
 	might_sleep();
 	invalidate_inode_buffers(inode);
-       
+
 	BUG_ON(inode->i_data.nrpages);
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
@@ -528,7 +529,7 @@ repeat:
  *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
  *	If HIGHMEM pages are unsuitable or it is known that pages allocated
  *	for the page cache are not reclaimable or migratable,
- *	mapping_set_gfp_mask() must be called with suitable flags on the
+ *	mapping_setup() must be called with suitable flags and bits on the
  *	newly created inode's mapping
  *
  */
Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c	2007-08-27 21:10:38.000000000 -0700
@@ -1547,7 +1547,8 @@ xfs_mapping_buftarg(
 	mapping = &inode->i_data;
 	mapping->a_ops = &mapping_aops;
 	mapping->backing_dev_info = bdi;
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
+	mapping_setup(mapping, GFP_NOFS,
+		page_cache_blkbits_to_order(inode->i_blkbits));
 	btp->bt_mapping = mapping;
 	return 0;
 }
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/buffer_head.h	2007-08-27 21:10:38.000000000 -0700
@@ -129,7 +129,17 @@ BUFFER_FNS(Ordered, ordered)
 BUFFER_FNS(Eopnotsupp, eopnotsupp)
 BUFFER_FNS(Unwritten, unwritten)
 
-#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
+static inline unsigned long bh_offset(struct buffer_head *bh)
+{
+	/*
+	 * No mapping available. Use page struct to obtain
+	 * order.
+	 */
+	unsigned long mask = compound_size(bh->b_page) - 1;
+
+	return (unsigned long)bh->b_data & mask;
+}
+
 #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
 
 /* If we *know* page->private refers to buffer_heads */
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/include/linux/fs.h	2007-08-27 21:10:38.000000000 -0700
@@ -446,6 +446,11 @@ struct address_space {
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+	loff_t			offset_mask;	/* Mask to get to offset bits */
+	unsigned int		order;		/* Page order of the pages in here */
+	unsigned int		shift;		/* Shift of index */
+#endif
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-08-27 19:29:55.000000000 -0700
+++ linux-2.6/include/linux/pagemap.h	2007-08-27 21:15:58.000000000 -0700
@@ -39,10 +39,35 @@ static inline gfp_t mapping_gfp_mask(str
  * This is non-atomic.  Only to be used before the mapping is activated.
  * Probably needs a barrier...
  */
-static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
+static inline void mapping_setup(struct address_space *m,
+					gfp_t mask, int order)
 {
 	m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
 				(__force unsigned long)mask;
+
+#ifdef CONFIG_LARGE_BLOCKSIZE
+	m->order = order;
+	m->shift = order + PAGE_SHIFT;
+	m->offset_mask = (PAGE_SIZE << order) - 1;
+	if (order) {
+		/*
+		 * Bouncing is not supported. Requests for DMA
+		 * memory will not work
+		 */
+		BUG_ON(m->flags & (__GFP_DMA|__GFP_DMA32));
+		/*
+		 * Bouncing not supported. We cannot use HIGHMEM
+		 */
+		m->flags &= ~__GFP_HIGHMEM;
+		m->flags |= __GFP_COMP;
+		/*
+		 * If we could raise the kswapd order then it should be
+		 * done here.
+		 *
+		 * raise_kswapd_order(order);
+		 */
+	}
+#endif
 }
 
 /*
@@ -62,6 +87,78 @@ static inline void mapping_set_gfp_mask(
 #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
 
 /*
+ * The next set of functions allows writing code that is capable of dealing
+ * with multiple page sizes.
+ */
+#ifdef CONFIG_LARGE_BLOCKSIZE
+/*
+ * Determine page order from the blkbits in the inode structure
+ */
+static inline int page_cache_blkbits_to_order(int shift)
+{
+	BUG_ON(shift < 9);
+
+	if (shift < PAGE_SHIFT)
+		return 0;
+
+	return shift - PAGE_SHIFT;
+}
+
+/*
+ * Determine page order from a given blocksize
+ */
+static inline int page_cache_blocksize_to_order(unsigned long size)
+{
+	return page_cache_blkbits_to_order(ilog2(size));
+}
+
+static inline int mapping_order(struct address_space *a)
+{
+	return a->order;
+}
+
+static inline int page_cache_shift(struct address_space *a)
+{
+	return a->shift;
+}
+
+static inline unsigned int page_cache_size(struct address_space *a)
+{
+	return a->offset_mask + 1;
+}
+
+static inline loff_t page_cache_mask(struct address_space *a)
+{
+	return ~a->offset_mask;
+}
+
+static inline unsigned int page_cache_offset(struct address_space *a,
+		loff_t pos)
+{
+	return pos & a->offset_mask;
+}
+#else
+/*
+ * Kernel configured for a fixed PAGE_SIZEd page cache
+ */
+static inline int page_cache_blkbits_to_order(int shift)
+{
+	if (shift < 9)
+		return -EINVAL;
+	if (shift > PAGE_SHIFT)
+		return -EINVAL;
+	return 0;
+}
+
+static inline int page_cache_blocksize_to_order(unsigned long size)
+{
+	if (size >= 512 && size <= PAGE_SIZE)
+		return 0;
+
+	return -EINVAL;
+}
+
+/*
  * Functions that are currently setup for a fixed PAGE_SIZEd. The use of
  * these will allow a variable page size pagecache in the future.
  */
@@ -90,6 +187,7 @@ static inline unsigned int page_cache_of
 {
 	return pos & ~PAGE_MASK;
 }
+#endif
 
 static inline pgoff_t page_cache_index(struct address_space *a,
 		loff_t pos)
@@ -112,27 +210,37 @@ static inline loff_t page_cache_pos(stru
 	return ((loff_t)index << page_cache_shift(a)) + offset;
 }
 
+/*
+ * Legacy function. Only supports order 0 pages.
+ */
+static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
+{
+	BUG_ON(mapping_order(m));
+	mapping_setup(m, mask, 0);
+}
+
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
 #ifdef CONFIG_NUMA
-extern struct page *__page_cache_alloc(gfp_t gfp);
+extern struct page *__page_cache_alloc(gfp_t gfp, int);
 #else
-static inline struct page *__page_cache_alloc(gfp_t gfp)
+static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
 {
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 #endif
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x));
+	return __page_cache_alloc(mapping_gfp_mask(x), mapping_order(x));
 }
 
 static inline struct page *page_cache_alloc_cold(struct address_space *x)
 {
-	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
+				mapping_order(x));
 }
 
 typedef int filler_t(void *, struct page *);
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-08-27 21:09:19.000000000 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 21:14:55.000000000 -0700
@@ -471,13 +471,13 @@ int add_to_page_cache_lru(struct page *p
 }
 
 #ifdef CONFIG_NUMA
-struct page *__page_cache_alloc(gfp_t gfp)
+struct page *__page_cache_alloc(gfp_t gfp, int order)
 {
 	if (cpuset_do_page_mem_spread()) {
 		int n = cpuset_mem_spread_node();
-		return alloc_pages_node(n, gfp, 0);
+		return alloc_pages_node(n, gfp, order);
 	}
-	return alloc_pages(gfp, 0);
+	return alloc_pages(gfp, order);
 }
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
@@ -678,7 +678,7 @@ repeat:
 	if (!page) {
 		if (!cached_page) {
 			cached_page =
-				__page_cache_alloc(gfp_mask);
+				__page_cache_alloc(gfp_mask, mapping_order(mapping));
 			if (!cached_page)
 				return NULL;
 		}
@@ -818,7 +818,8 @@ grab_cache_page_nowait(struct address_sp
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
+				mapping_order(mapping));
 	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
 		page_cache_release(page);
 		page = NULL;
@@ -1479,6 +1480,12 @@ int generic_file_mmap(struct file * file
 {
 	struct address_space *mapping = file->f_mapping;
 
+	/*
+	 * Forbid mmap access to higher order mappings.
+	 */
+	if (mapping_order(mapping))
+		return -ENOSYS;
+
 	if (!mapping->a_ops->readpage)
 		return -ENOEXEC;
 	file_accessed(file);

-- 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* [32/36] Readahead changes to support large blocksize.
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (30 preceding siblings ...)
  2007-08-28 19:06 ` [31/36] Large Blocksize: Core piece clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [33/36] Large blocksize support in ramfs clameter
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Fengguang Wu, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, swin wang, totty.lu,
	H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0032-Readahead-changes-to-support-large-blocksize.patch --]
[-- Type: text/plain, Size: 5373 bytes --]

Fix up readahead for large I/O operations.

Calculate the readahead only up to the 2MB boundary, then fall back to
one page.

Signed-off-by: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>

===================================================================
---
 include/linux/mm.h |    2 +-
 mm/fadvise.c       |    4 ++--
 mm/filemap.c       |    5 ++---
 mm/madvise.c       |    2 +-
 mm/readahead.c     |   22 ++++++++++++++--------
 5 files changed, 20 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-08-27 21:03:20.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-08-27 21:14:44.000000000 -0700
@@ -1142,7 +1142,7 @@ void page_cache_async_readahead(struct a
 				pgoff_t offset,
 				unsigned long size);
 
-unsigned long max_sane_readahead(unsigned long nr);
+unsigned long max_sane_readahead(unsigned long nr, int order);
 
 /* Do stack extension */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
Index: linux-2.6/mm/fadvise.c
===================================================================
--- linux-2.6.orig/mm/fadvise.c	2007-08-27 20:52:49.000000000 -0700
+++ linux-2.6/mm/fadvise.c	2007-08-27 21:14:44.000000000 -0700
@@ -86,10 +86,10 @@ asmlinkage long sys_fadvise64_64(int fd,
 		nrpages = end_index - start_index + 1;
 		if (!nrpages)
 			nrpages = ~0UL;
-		
+
 		ret = force_page_cache_readahead(mapping, file,
 				start_index,
-				max_sane_readahead(nrpages));
+				nrpages);
 		if (ret > 0)
 			ret = 0;
 		break;
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-08-27 21:10:38.000000000 -0700
+++ linux-2.6/mm/filemap.c	2007-08-27 21:14:44.000000000 -0700
@@ -1237,8 +1237,7 @@ do_readahead(struct address_space *mappi
 	if (!mapping || !mapping->a_ops || !mapping->a_ops->readpage)
 		return -EINVAL;
 
-	force_page_cache_readahead(mapping, filp, index,
-					max_sane_readahead(nr));
+	force_page_cache_readahead(mapping, filp, index, nr);
 	return 0;
 }
 
@@ -1373,7 +1372,7 @@ retry_find:
 			count_vm_event(PGMAJFAULT);
 		}
 		did_readaround = 1;
-		ra_pages = max_sane_readahead(file->f_ra.ra_pages);
+		ra_pages = file->f_ra.ra_pages;
 		if (ra_pages) {
 			pgoff_t start = 0;
 
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/madvise.c	2007-08-27 21:14:44.000000000 -0700
@@ -124,7 +124,7 @@ static long madvise_willneed(struct vm_a
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
 	force_page_cache_readahead(file->f_mapping,
-			file, start, max_sane_readahead(end - start));
+			file, start, end - start);
 	return 0;
 }
 
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c	2007-08-27 19:22:13.000000000 -0700
+++ linux-2.6/mm/readahead.c	2007-08-27 21:14:44.000000000 -0700
@@ -44,7 +44,8 @@ EXPORT_SYMBOL_GPL(default_backing_dev_in
 void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
-	ra->ra_pages = mapping->backing_dev_info->ra_pages;
+	ra->ra_pages = DIV_ROUND_UP(mapping->backing_dev_info->ra_pages,
+				    page_cache_size(mapping));
 	ra->prev_index = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
@@ -84,7 +85,7 @@ int read_cache_pages(struct address_spac
 			put_pages_list(pages);
 			break;
 		}
-		task_io_account_read(PAGE_CACHE_SIZE);
+		task_io_account_read(page_cache_size(mapping));
 	}
 	pagevec_lru_add(&lru_pvec);
 	return ret;
@@ -151,7 +152,7 @@ __do_page_cache_readahead(struct address
 	if (isize == 0)
 		goto out;
 
-	end_index = ((isize - 1) >> PAGE_CACHE_SHIFT);
+	end_index = page_cache_index(mapping, isize - 1);
 
 	/*
 	 * Preallocate as many pages as we will need.
@@ -204,10 +205,12 @@ int force_page_cache_readahead(struct ad
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
+	nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
 	while (nr_to_read) {
 		int err;
 
-		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE;
+		unsigned long this_chunk = DIV_ROUND_UP(2 * 1024 * 1024,
+						page_cache_size(mapping));
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
@@ -237,17 +240,20 @@ int do_page_cache_readahead(struct addre
 	if (bdi_read_congested(mapping->backing_dev_info))
 		return -1;
 
+	nr_to_read = max_sane_readahead(nr_to_read, mapping_order(mapping));
 	return __do_page_cache_readahead(mapping, filp, offset, nr_to_read, 0);
 }
 
 /*
- * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a
+ * Given a desired number of page order readahead pages, return a
  * sensible upper limit.
  */
-unsigned long max_sane_readahead(unsigned long nr)
+unsigned long max_sane_readahead(unsigned long nr, int order)
 {
-	return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE)
-		+ node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2);
+	unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE)
+			+ node_page_state(numa_node_id(), NR_FREE_PAGES);
+
+	return min(nr, (base_pages / 2) >> order);
 }
 
 /*

-- 


* [33/36] Large blocksize support in ramfs
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (31 preceding siblings ...)
  2007-08-28 19:06 ` [32/36] Readahead changes to support large blocksize clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [34/36] Large blocksize support in XFS clameter
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0033-Large-blocksize-support-in-ramfs.patch --]
[-- Type: text/plain, Size: 2073 bytes --]

The simplest file system to use for large blocksize support is ramfs.

Note that ramfs does not use the lower layers (buffer I/O etc) so this
case is useful for initial testing of changes to large buffer size
support if one just wants to exercise the higher layers.

The patch adds the ability to specify a mount parameter to modify the
order for the pages that are allocated by ramfs.

Here is an example of how to mount a volume with order 10 pages:

	mount -tramfs -o10 none /media

Mounts a ramfs filesystem with 4MB sized pages. Then copy
a file onto it.

	cp linux-2.6.21-rc7.tar.gz /media

This will populate the ramfs volume. Note that we allocated 14 pages
of 4M each instead of 13508.

Get rid of the large pages again

	umount /media


Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ramfs/inode.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index ef2b46d..b317f80 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -60,7 +60,8 @@ struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
 		inode->i_blocks = 0;
 		inode->i_mapping->a_ops = &ramfs_aops;
 		inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
-		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_setup(inode->i_mapping, GFP_HIGHUSER,
+				sb->s_blocksize_bits - PAGE_SHIFT);
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		switch (mode & S_IFMT) {
 		default:
@@ -164,10 +165,15 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 {
 	struct inode * inode;
 	struct dentry * root;
+	int order = 0;
+	char *options = data;
+
+	if (options && *options)
+		order = simple_strtoul(options, NULL, 10);
 
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_CACHE_SIZE;
-	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
+	sb->s_blocksize = PAGE_CACHE_SIZE << order;
+	sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT;
 	sb->s_magic = RAMFS_MAGIC;
 	sb->s_op = &ramfs_ops;
 	sb->s_time_gran = 1;
-- 
1.5.2.4

-- 


* [34/36] Large blocksize support in XFS
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (32 preceding siblings ...)
  2007-08-28 19:06 ` [33/36] Large blocksize support in ramfs clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:06 ` [35/36] Large blocksize support for ext2 clameter
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Dave Chinner, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, Jens Axboe, Badari Pulavarty,
	Maxim Levitsky, Fengguang Wu, swin wang, totty.lu,
	H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0034-Large-blocksize-support-in-XFS.patch --]
[-- Type: text/plain, Size: 913 bytes --]

The only change needed to enable Large Block I/O in XFS is to remove
the check for a too large blocksize ;-)

Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/xfs/xfs_mount.c |   13 -------------
 1 files changed, 0 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index a66b398..47ddc89 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -326,19 +326,6 @@ xfs_mount_validate_sb(
 		return XFS_ERROR(ENOSYS);
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
-	if (unlikely(sbp->sb_blocksize > PAGE_SIZE)) {
-		xfs_fs_mount_cmn_err(flags,
-			"file system with blocksize %d bytes",
-			sbp->sb_blocksize);
-		xfs_fs_mount_cmn_err(flags,
-			"only pagesize (%ld) or less will currently work.",
-			PAGE_SIZE);
-		return XFS_ERROR(ENOSYS);
-	}
-
 	return 0;
 }
 
-- 
1.5.2.4

-- 


* [35/36] Large blocksize support for ext2
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (33 preceding siblings ...)
  2007-08-28 19:06 ` [34/36] Large blocksize support in XFS clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:22   ` Christoph Hellwig
  2007-08-28 19:06 ` [36/36] Reiserfs: Fix up for mapping_set_gfp_mask clameter
                   ` (2 subsequent siblings)
  37 siblings, 1 reply; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0035-Large-blocksize-support-for-ext2.patch --]
[-- Type: text/plain, Size: 1072 bytes --]

This adds support for a block size of up to 64k on any platform.
It enables mounting filesystems that have a larger blocksize
than the page size.

F.e. the following is possible on x86_64 and i386 that have only a 4k page
size:

mke2fs -b 16384 /dev/hdd2	<Ignore warning about too large block size>

mount /dev/hdd2 /media
ls -l /media

.... Do more things with the volume, which uses a 16k page cache size on
a 4k-page platform.

Hmmm... Actually there is nothing additional to be done after the earlier
cleanup of the macros. So just modify copyright.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/ext2/inode.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0079b2c..5ff775a 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -20,6 +20,9 @@
  * 	(jj@sunsite.ms.mff.cuni.cz)
  *
  *  Assorted race fixes, rewrite of ext2_get_block() by Al Viro, 2000
+ *
+ *  (C) 2007 SGI.
+ *  Large blocksize support by Christoph Lameter
  */
 
 #include <linux/smp_lock.h>
-- 
1.5.2.4

-- 


* [36/36] Reiserfs: Fix up for mapping_set_gfp_mask
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (34 preceding siblings ...)
  2007-08-28 19:06 ` [35/36] Large blocksize support for ext2 clameter
@ 2007-08-28 19:06 ` clameter
  2007-08-28 19:20 ` [00/36] Large Blocksize Support V6 Christoph Hellwig
  2007-09-01 19:17 ` Peter Zijlstra
  37 siblings, 0 replies; 124+ messages in thread
From: clameter @ 2007-08-28 19:06 UTC (permalink / raw)
  To: torvalds
  Cc: linux-fsdevel, linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

[-- Attachment #1: 0036-Reiserfs-Fix-up-for-mapping_set_gfp_mask.patch --]
[-- Type: text/plain, Size: 955 bytes --]

mapping_set_gfp_mask only works on order 0 page cache operations. Reiserfs
can use 8k pages (order 1). Replace the mapping_set_gfp_mask with
mapping_setup to make this work properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
 fs/reiserfs/xattr.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/reiserfs/xattr.c b/fs/reiserfs/xattr.c
index c86f570..5ca01f3 100644
--- a/fs/reiserfs/xattr.c
+++ b/fs/reiserfs/xattr.c
@@ -405,9 +405,10 @@ static struct page *reiserfs_get_page(struct inode *dir, unsigned long n)
 {
 	struct address_space *mapping = dir->i_mapping;
 	struct page *page;
+
 	/* We can deadlock if we try to free dentries,
 	   and an unlink/rmdir has just occured - GFP_NOFS avoids this */
-	mapping_set_gfp_mask(mapping, GFP_NOFS);
+	mapping_setup(mapping, GFP_NOFS, mapping_order(mapping));
 	page = read_mapping_page(mapping, n, NULL);
 	if (!IS_ERR(page)) {
 		kmap(page);
-- 
1.5.2.4

-- 


* Re: [00/36] Large Blocksize Support V6
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (35 preceding siblings ...)
  2007-08-28 19:06 ` [36/36] Reiserfs: Fix up for mapping_set_gfp_mask clameter
@ 2007-08-28 19:20 ` Christoph Hellwig
  2007-08-28 19:55   ` Christoph Lameter
  2007-09-01 19:17 ` Peter Zijlstra
  37 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-08-28 19:20 UTC (permalink / raw)
  To: clameter
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

Stoooooopp!

This patchseries is entirely unacceptable!

one patch per file is the most braindead and most unacceptable way
to split a series.  Please stop whatever you're doing right now and
correct it and send out a patch that has one patch per logical change
for the whole tree.  This means people can actually read the patch,
and it's bisectable.


* Re: [35/36] Large blocksize support for ext2
  2007-08-28 19:06 ` [35/36] Large blocksize support for ext2 clameter
@ 2007-08-28 19:22   ` Christoph Hellwig
  2007-08-28 19:56     ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-08-28 19:22 UTC (permalink / raw)
  To: clameter
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Tue, Aug 28, 2007 at 12:06:26PM -0700, clameter@sgi.com wrote:
> Hmmm... Actually there is nothing additional to be done after the earlier
> cleanup of the macros. So just modify copyright.

So you get a copyright line for some trivial macro cleanups?  Please
drop this patch and rather put your copyright into places where you
actually did major work..



* Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
  2007-08-28 19:05 ` [07/36] Use page_cache_xxx in mm/filemap_xip.c clameter
@ 2007-08-28 19:49   ` Jörn Engel
  2007-08-28 19:55     ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Jörn Engel @ 2007-08-28 19:49 UTC (permalink / raw)
  To: clameter
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, Eric W. Biederman

On Tue, 28 August 2007 12:05:58 -0700, clameter@sgi.com wrote:
>  
> -	index = *ppos >> PAGE_CACHE_SHIFT;
> -	offset = *ppos & ~PAGE_CACHE_MASK;
> +	index = page_cache_index(mapping, *ppos);
> +	offset = page_cache_offset(mapping, *ppos);

Part of me feels inclined to merge this patch now because it makes the
code more readable, even if page_cache_index() is implemented as
#define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)

I know there is little use in yet another global search'n'replace
wankfest and Andrew might wash my mouth just for mentioning it.  Still,
hard to dislike this part of your patch.

Jörn

-- 
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu


* Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
  2007-08-28 19:49   ` Jörn Engel
@ 2007-08-28 19:55     ` Christoph Hellwig
  2007-08-28 23:49       ` Nick Piggin
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-08-28 19:55 UTC (permalink / raw)
  To: Jörn Engel
  Cc: clameter, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Jens Axboe, Badari Pulavarty, Maxim Levitsky,
	Fengguang Wu, swin wang, totty.lu, H. Peter Anvin,
	Eric W. Biederman

On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> On Tue, 28 August 2007 12:05:58 -0700, clameter@sgi.com wrote:
> >  
> > -	index = *ppos >> PAGE_CACHE_SHIFT;
> > -	offset = *ppos & ~PAGE_CACHE_MASK;
> > +	index = page_cache_index(mapping, *ppos);
> > +	offset = page_cache_offset(mapping, *ppos);
> 
> Part of me feels inclined to merge this patch now because it makes the
> code more readable, even if page_cache_index() is implemented as
> #define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
> 
> I know there is little use in yet another global search'n'replace
> wankfest and Andrew might wash my mouth just for mentioning it.  Still,
> hard to dislike this part of your patch.

Yes, I suggested that before.  Andrew seems to somehow hate this
patchset, but even if we don't get it in, the lowercase macros are much
much better than the current PAGE_CACHE_* confusion.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [00/36] Large Blocksize Support V6
  2007-08-28 19:20 ` [00/36] Large Blocksize Support V6 Christoph Hellwig
@ 2007-08-28 19:55   ` Christoph Lameter
  2007-09-01  1:11     ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-28 19:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: torvalds, linux-fsdevel, linux-kernel, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Tue, 28 Aug 2007, Christoph Hellwig wrote:

> one patch per file is the most braindead and most unacceptable way
> to split a series.  Please stop whatever you're doing right now and
> correct it and send out a patch that has one patch per logical change
> for the whole tree.  This means people can actually read the patch,
> and it's bisectable.

The patches are per logical change aside from the first patches that 
introduce the page cache functions all over the kernel. It would be 
unacceptably big and difficult to merge if I put them all together.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [35/36] Large blocksize support for ext2
  2007-08-28 19:22   ` Christoph Hellwig
@ 2007-08-28 19:56     ` Christoph Lameter
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-28 19:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: torvalds, linux-fsdevel, linux-kernel, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Tue, 28 Aug 2007, Christoph Hellwig wrote:

> On Tue, Aug 28, 2007 at 12:06:26PM -0700, clameter@sgi.com wrote:
> > Hmmm... Actually there is nothing additional to be done after the earlier
> > cleanup of the macros. So just modify copyright.
> 
> So you get a copyright line for some trivial macro cleanups?  Please
> drop this patch and rather put your copyright into places where you
> actually did major work.

Ok.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [07/36] Use page_cache_xxx in mm/filemap_xip.c
  2007-08-28 19:55     ` Christoph Hellwig
@ 2007-08-28 23:49       ` Nick Piggin
  0 siblings, 0 replies; 124+ messages in thread
From: Nick Piggin @ 2007-08-28 23:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jörn Engel, clameter, torvalds, linux-fsdevel, linux-kernel,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, Eric W. Biederman

Christoph Hellwig wrote:
> On Tue, Aug 28, 2007 at 09:49:38PM +0200, Jörn Engel wrote:
> 
>>On Tue, 28 August 2007 12:05:58 -0700, clameter@sgi.com wrote:
>>
>>> 
>>>-	index = *ppos >> PAGE_CACHE_SHIFT;
>>>-	offset = *ppos & ~PAGE_CACHE_MASK;
>>>+	index = page_cache_index(mapping, *ppos);
>>>+	offset = page_cache_offset(mapping, *ppos);
>>
>>Part of me feels inclined to merge this patch now because it makes the
>>code more readable, even if page_cache_index() is implemented as
>>#define page_cache_index(mapping, pos) ((pos) >> PAGE_CACHE_SHIFT)
>>
>>I know there is little use in yet another global search'n'replace
>>wankfest and Andrew might wash my mouth just for mentioning it.  Still,
>>hard to dislike this part of your patch.
> 
> 
> Yes, I suggested that before.  Andrew seems to somehow hate this
> patchset, but even if we don't get it in, the lowercase macros are much
> much better than the current PAGE_CACHE_* confusion.

I don't mind the change either. The open coded macros are very
recognisable, but it isn't hard to have a typo and get one
slightly wrong.

If it goes upstream now it wouldn't have the mapping argument
though, would it? Or the need to replace PAGE_CACHE_SIZE I guess.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [31/36] Large Blocksize: Core piece
  2007-08-28 19:06 ` [31/36] Large Blocksize: Core piece clameter
@ 2007-08-30  0:11   ` Mingming Cao
  2007-08-30  0:12     ` Christoph Lameter
                       ` (4 more replies)
  0 siblings, 5 replies; 124+ messages in thread
From: Mingming Cao @ 2007-08-30  0:11 UTC (permalink / raw)
  To: clameter
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Tue, 2007-08-28 at 12:06 -0700, clameter@sgi.com wrote:
> plain text document attachment (0031-Large-Blocksize-Core-piece.patch)
> Provide an alternate definition for the page_cache_xxx(mapping, ...)
> functions that can determine the current page size from the mapping
> and generate the appropriate shifts, sizes and mask for the page cache
> operations. Change the basic functions that allocate pages for the
> page cache to be able to handle higher order allocations.
> 
> Provide a new function
> 
> mapping_setup(struct address_space *, gfp_t mask, int order)
> 
> that allows the setup of a mapping of any compound page order.
> 
> mapping_set_gfp_mask() is still provided but it sets mappings to order 0.
> Calls to mapping_set_gfp_mask() must be converted to mapping_setup() in
> order for the filesystem to be able to use larger pages. For some key block
> devices and filesystems the conversion is done here.
> 
> mapping_setup() for higher order is only allowed if the mapping does not
> use DMA mappings or HIGHMEM since we do not support bouncing at the moment.
> Thus BUG() on DMA mappings and clear the highmem bit of higher order mappings.
> 
> Modify the set_blocksize() function so that an arbitrary blocksize can be set.
> Blocksizes up to MAX_ORDER - 1 can be set. This is typically 8MB on many
> platforms (order 11). Typically file systems are not only limited by the core
> VM but also by the structure of their internal data structures. The core VM
> limitations fall away with this patch. The functionality provided here
> can do nothing about the internal limitations of filesystems.
> 
> Known internal limitations:
> 
> Ext2            64k
> XFS             64k
> Reiserfs        8k
> Ext3            4k (rumor has it that changing a constant can remove the limit)
> Ext4            4k
> 

There are patches originally worked on by Takashi Sato to support large
block sizes (up to 64k) in ext2/3/4, which addressed the directory issue
as well. I just forward ported them and will post them in a separate
thread. Haven't gotten a chance to integrate them with your patch yet
(next step).

thanks,
Mingming
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  block/Kconfig               |   17 ++++++
>  drivers/block/rd.c          |    6 ++-
>  fs/block_dev.c              |   29 +++++++---
>  fs/buffer.c                 |    4 +-
>  fs/inode.c                  |    7 ++-
>  fs/xfs/linux-2.6/xfs_buf.c  |    3 +-
>  include/linux/buffer_head.h |   12 ++++-
>  include/linux/fs.h          |    5 ++
>  include/linux/pagemap.h     |  121 ++++++++++++++++++++++++++++++++++++++++--
>  mm/filemap.c                |   17 ++++--
>  10 files changed, 192 insertions(+), 29 deletions(-)
> 
> Index: linux-2.6/block/Kconfig
> ===================================================================
> --- linux-2.6.orig/block/Kconfig	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/block/Kconfig	2007-08-27 21:16:38.000000000 -0700
> @@ -62,6 +62,20 @@ config BLK_DEV_BSG
>  	protocols (e.g. Task Management Functions and SMP in Serial
>  	Attached SCSI).
> 
> +#
> +# The functions to switch on larger pages in a filesystem will return an error
> +# if the gfp flags for a mapping require only DMA pages. Highmem will always
> +# be switched off for higher order mappings.
> +#
> +config LARGE_BLOCKSIZE
> +	bool "Support blocksizes larger than page size"
> +	default n
> +	depends on EXPERIMENTAL
> +	help
> +	  Allows the page cache to support higher orders of pages. Higher
> +	  order page cache pages may be useful to increase I/O performance
> +	  and support special devices like CD or DVDs and Flash.
> +
>  endif # BLOCK
> 
>  source block/Kconfig.iosched
> Index: linux-2.6/drivers/block/rd.c
> ===================================================================
> --- linux-2.6.orig/drivers/block/rd.c	2007-08-27 20:59:27.000000000 -0700
> +++ linux-2.6/drivers/block/rd.c	2007-08-27 21:10:38.000000000 -0700
> @@ -121,7 +121,8 @@ static void make_page_uptodate(struct pa
>  			}
>  		} while ((bh = bh->b_this_page) != head);
>  	} else {
> -		memset(page_address(page), 0, page_cache_size(page_mapping(page)));
> +		memset(page_address(page), 0,
> +			page_cache_size(page_mapping(page)));
>  	}
>  	flush_dcache_page(page);
>  	SetPageUptodate(page);
> @@ -380,7 +381,8 @@ static int rd_open(struct inode *inode, 
>  		gfp_mask = mapping_gfp_mask(mapping);
>  		gfp_mask &= ~(__GFP_FS|__GFP_IO);
>  		gfp_mask |= __GFP_HIGH;
> -		mapping_set_gfp_mask(mapping, gfp_mask);
> +		mapping_setup(mapping, gfp_mask,
> +			page_cache_blkbits_to_order(inode->i_blkbits));
>  	}
> 
>  	return 0;
> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/block_dev.c	2007-08-27 21:10:38.000000000 -0700
> @@ -63,36 +63,46 @@ static void kill_bdev(struct block_devic
>  		return;
>  	invalidate_bh_lrus();
>  	truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
> -}	
> +}
> 
>  int set_blocksize(struct block_device *bdev, int size)
>  {
> -	/* Size must be a power of two, and between 512 and PAGE_SIZE */
> -	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
> +	int order;
> +
> +	if (size > (PAGE_SIZE << (MAX_ORDER - 1)) ||
> +			size < 512 || !is_power_of_2(size))
>  		return -EINVAL;
> 
>  	/* Size cannot be smaller than the size supported by the device */
>  	if (size < bdev_hardsect_size(bdev))
>  		return -EINVAL;
> 
> +	order = page_cache_blocksize_to_order(size);
> +
>  	/* Don't change the size if it is same as current */
>  	if (bdev->bd_block_size != size) {
> +		int bits = blksize_bits(size);
> +		struct address_space *mapping =
> +			bdev->bd_inode->i_mapping;
> +
>  		sync_blockdev(bdev);
> -		bdev->bd_block_size = size;
> -		bdev->bd_inode->i_blkbits = blksize_bits(size);
>  		kill_bdev(bdev);
> +		bdev->bd_block_size = size;
> +		bdev->bd_inode->i_blkbits = bits;
> +		mapping_setup(mapping, GFP_NOFS, order);
>  	}
>  	return 0;
>  }
> -
>  EXPORT_SYMBOL(set_blocksize);
> 
>  int sb_set_blocksize(struct super_block *sb, int size)
>  {
>  	if (set_blocksize(sb->s_bdev, size))
>  		return 0;
> -	/* If we get here, we know size is power of two
> -	 * and it's value is between 512 and PAGE_SIZE */
> +	/*
> +	 * If we get here, we know size is power of two
> + * and its value is valid for the page cache
> +	 */
>  	sb->s_blocksize = size;
>  	sb->s_blocksize_bits = blksize_bits(size);
>  	return sb->s_blocksize;
> @@ -574,7 +584,8 @@ struct block_device *bdget(dev_t dev)
>  		inode->i_rdev = dev;
>  		inode->i_bdev = bdev;
>  		inode->i_data.a_ops = &def_blk_aops;
> -		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> +		mapping_setup(&inode->i_data, GFP_USER,
> +			page_cache_blkbits_to_order(inode->i_blkbits));
>  		inode->i_data.backing_dev_info = &default_backing_dev_info;
>  		spin_lock(&bdev_lock);
>  		list_add(&bdev->bd_list, &all_bdevs);
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c	2007-08-27 21:09:19.000000000 -0700
> +++ linux-2.6/fs/buffer.c	2007-08-27 21:10:38.000000000 -0700
> @@ -1090,7 +1090,7 @@ __getblk_slow(struct block_device *bdev,
>  {
>  	/* Size must be multiple of hard sectorsize */
>  	if (unlikely(size & (bdev_hardsect_size(bdev)-1) ||
> -			(size < 512 || size > PAGE_SIZE))) {
> +		size < 512 || size > (PAGE_SIZE << (MAX_ORDER - 1)))) {
>  		printk(KERN_ERR "getblk(): invalid block size %d requested\n",
>  					size);
>  		printk(KERN_ERR "hardsect size: %d\n",
> @@ -1811,7 +1811,7 @@ static int __block_prepare_write(struct 
>  				if (block_end > to || block_start < from)
>  					zero_user_segments(page,
>  						to, block_end,
> -						block_start, from)
> +						block_start, from);
>  				continue;
>  			}
>  		}
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/inode.c	2007-08-27 21:10:38.000000000 -0700
> @@ -145,7 +145,8 @@ static struct inode *alloc_inode(struct 
>  		mapping->a_ops = &empty_aops;
>   		mapping->host = inode;
>  		mapping->flags = 0;
> -		mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
> +		mapping_setup(mapping, GFP_HIGHUSER_PAGECACHE,
> +				page_cache_blkbits_to_order(inode->i_blkbits));
>  		mapping->assoc_mapping = NULL;
>  		mapping->backing_dev_info = &default_backing_dev_info;
> 
> @@ -243,7 +244,7 @@ void clear_inode(struct inode *inode)
>  {
>  	might_sleep();
>  	invalidate_inode_buffers(inode);
> -       
> +
>  	BUG_ON(inode->i_data.nrpages);
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(inode->i_state & I_CLEAR);
> @@ -528,7 +529,7 @@ repeat:
>   *	for allocations related to inode->i_mapping is GFP_HIGHUSER_PAGECACHE.
>   *	If HIGHMEM pages are unsuitable or it is known that pages allocated
>   *	for the page cache are not reclaimable or migratable,
> - *	mapping_set_gfp_mask() must be called with suitable flags on the
> + *	mapping_setup() must be called with suitable flags and bits on the
>   *	newly created inode's mapping
>   *
>   */
> Index: linux-2.6/fs/xfs/linux-2.6/xfs_buf.c
> ===================================================================
> --- linux-2.6.orig/fs/xfs/linux-2.6/xfs_buf.c	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/fs/xfs/linux-2.6/xfs_buf.c	2007-08-27 21:10:38.000000000 -0700
> @@ -1547,7 +1547,8 @@ xfs_mapping_buftarg(
>  	mapping = &inode->i_data;
>  	mapping->a_ops = &mapping_aops;
>  	mapping->backing_dev_info = bdi;
> -	mapping_set_gfp_mask(mapping, GFP_NOFS);
> +	mapping_setup(mapping, GFP_NOFS,
> +		page_cache_blkbits_to_order(inode->i_blkbits));
>  	btp->bt_mapping = mapping;
>  	return 0;
>  }
> Index: linux-2.6/include/linux/buffer_head.h
> ===================================================================
> --- linux-2.6.orig/include/linux/buffer_head.h	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/include/linux/buffer_head.h	2007-08-27 21:10:38.000000000 -0700
> @@ -129,7 +129,17 @@ BUFFER_FNS(Ordered, ordered)
>  BUFFER_FNS(Eopnotsupp, eopnotsupp)
>  BUFFER_FNS(Unwritten, unwritten)
> 
> -#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
> +static inline unsigned long bh_offset(struct buffer_head *bh)
> +{
> +	/*
> +	 * No mapping available. Use page struct to obtain
> +	 * order.
> +	 */
> +	unsigned long mask = compound_size(bh->b_page) - 1;
> +
> +	return (unsigned long)bh->b_data & mask;
> +}
> +
>  #define touch_buffer(bh)	mark_page_accessed(bh->b_page)
> 
>  /* If we *know* page->private refers to buffer_heads */
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h	2007-08-27 19:22:13.000000000 -0700
> +++ linux-2.6/include/linux/fs.h	2007-08-27 21:10:38.000000000 -0700
> @@ -446,6 +446,11 @@ struct address_space {
>  	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
>  	unsigned int		truncate_count;	/* Cover race condition with truncate */
>  	unsigned long		nrpages;	/* number of total pages */
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> +	loff_t			offset_mask;	/* Mask to get to offset bits */
> +	unsigned int		order;		/* Page order of the pages in here */
> +	unsigned int		shift;		/* Shift of index */
> +#endif
>  	pgoff_t			writeback_index;/* writeback starts here */
>  	const struct address_space_operations *a_ops;	/* methods */
>  	unsigned long		flags;		/* error bits/gfp mask */
> Index: linux-2.6/include/linux/pagemap.h
> ===================================================================
> --- linux-2.6.orig/include/linux/pagemap.h	2007-08-27 19:29:55.000000000 -0700
> +++ linux-2.6/include/linux/pagemap.h	2007-08-27 21:15:58.000000000 -0700
> @@ -39,10 +39,35 @@ static inline gfp_t mapping_gfp_mask(str
>   * This is non-atomic.  Only to be used before the mapping is activated.
>   * Probably needs a barrier...
>   */
> -static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> +static inline void mapping_setup(struct address_space *m,
> +					gfp_t mask, int order)
>  {
>  	m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) |
>  				(__force unsigned long)mask;
> +
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> +	m->order = order;
> +	m->shift = order + PAGE_SHIFT;
> +	m->offset_mask = (PAGE_SIZE << order) - 1;
> +	if (order) {
> +		/*
> +		 * Bouncing is not supported. Requests for DMA
> +		 * memory will not work
> +		 */
> +		BUG_ON(m->flags & (__GFP_DMA|__GFP_DMA32));
> +		/*
> +		 * Bouncing not supported. We cannot use HIGHMEM
> +		 */
> +		m->flags &= ~__GFP_HIGHMEM;
> +		m->flags |= __GFP_COMP;
> +		/*
> +		 * If we could raise the kswapd order then it should be
> +		 * done here.
> +		 *
> +		 * raise_kswapd_order(order);
> +		 */
> +	}
> +#endif
>  }
> 
>  /*
> @@ -62,6 +87,78 @@ static inline void mapping_set_gfp_mask(
>  #define PAGE_CACHE_ALIGN(addr)	(((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
> 
>  /*
> + * The next set of functions allow to write code that is capable of dealing
> + * with multiple page sizes.
> + */
> +#ifdef CONFIG_LARGE_BLOCKSIZE
> +/*
> + * Determine page order from the blkbits in the inode structure
> + */
> +static inline int page_cache_blkbits_to_order(int shift)
> +{
> +	BUG_ON(shift < 9);
> +
> +	if (shift < PAGE_SHIFT)
> +		return 0;
> +
> +	return shift - PAGE_SHIFT;
> +}
> +
> +/*
> + * Determine page order from a given blocksize
> + */
> +static inline int page_cache_blocksize_to_order(unsigned long size)
> +{
> +	return page_cache_blkbits_to_order(ilog2(size));
> +}
> +
> +static inline int mapping_order(struct address_space *a)
> +{
> +	return a->order;
> +}
> +
> +static inline int page_cache_shift(struct address_space *a)
> +{
> +	return a->shift;
> +}
> +
> +static inline unsigned int page_cache_size(struct address_space *a)
> +{
> +	return a->offset_mask + 1;
> +}
> +
> +static inline loff_t page_cache_mask(struct address_space *a)
> +{
> +	return ~a->offset_mask;
> +}
> +
> +static inline unsigned int page_cache_offset(struct address_space *a,
> +		loff_t pos)
> +{
> +	return pos & a->offset_mask;
> +}
> +#else
> +/*
> + * Kernel configured for a fixed PAGE_SIZEd page cache
> + */
> +static inline int page_cache_blkbits_to_order(int shift)
> +{
> +	if (shift < 9)
> +		return -EINVAL;
> +	if (shift > PAGE_SHIFT)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static inline int page_cache_blocksize_to_order(unsigned long size)
> +{
> +	if (size >= 512 && size <= PAGE_SIZE)
> +		return 0;
> +
> +	return -EINVAL;
> +}
> +
> +/*
>   * Functions that are currently setup for a fixed PAGE_SIZEd. The use of
>   * these will allow a variable page size pagecache in the future.
>   */
> @@ -90,6 +187,7 @@ static inline unsigned int page_cache_of
>  {
>  	return pos & ~PAGE_MASK;
>  }
> +#endif
> 
>  static inline pgoff_t page_cache_index(struct address_space *a,
>  		loff_t pos)
> @@ -112,27 +210,37 @@ static inline loff_t page_cache_pos(stru
>  	return ((loff_t)index << page_cache_shift(a)) + offset;
>  }
> 
> +/*
> + * Legacy function. Only supports order 0 pages.
> + */
> +static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> +{
> +	BUG_ON(mapping_order(m));
> +	mapping_setup(m, mask, 0);
> +}
> +
>  #define page_cache_get(page)		get_page(page)
>  #define page_cache_release(page)	put_page(page)
>  void release_pages(struct page **pages, int nr, int cold);
> 
>  #ifdef CONFIG_NUMA
> -extern struct page *__page_cache_alloc(gfp_t gfp);
> +extern struct page *__page_cache_alloc(gfp_t gfp, int);
>  #else
> -static inline struct page *__page_cache_alloc(gfp_t gfp)
> +static inline struct page *__page_cache_alloc(gfp_t gfp, int order)
>  {
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
>  }
>  #endif
> 
>  static inline struct page *page_cache_alloc(struct address_space *x)
>  {
> -	return __page_cache_alloc(mapping_gfp_mask(x));
> +	return __page_cache_alloc(mapping_gfp_mask(x), mapping_order(x));
>  }
> 
>  static inline struct page *page_cache_alloc_cold(struct address_space *x)
>  {
> -	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
> +	return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD,
> +				mapping_order(x));
>  }
> 
>  typedef int filler_t(void *, struct page *);
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c	2007-08-27 21:09:19.000000000 -0700
> +++ linux-2.6/mm/filemap.c	2007-08-27 21:14:55.000000000 -0700
> @@ -471,13 +471,13 @@ int add_to_page_cache_lru(struct page *p
>  }
> 
>  #ifdef CONFIG_NUMA
> -struct page *__page_cache_alloc(gfp_t gfp)
> +struct page *__page_cache_alloc(gfp_t gfp, int order)
>  {
>  	if (cpuset_do_page_mem_spread()) {
>  		int n = cpuset_mem_spread_node();
> -		return alloc_pages_node(n, gfp, 0);
> +		return alloc_pages_node(n, gfp, order);
>  	}
> -	return alloc_pages(gfp, 0);
> +	return alloc_pages(gfp, order);
>  }
>  EXPORT_SYMBOL(__page_cache_alloc);
>  #endif
> @@ -678,7 +678,7 @@ repeat:
>  	if (!page) {
>  		if (!cached_page) {
>  			cached_page =
> -				__page_cache_alloc(gfp_mask);
> +				__page_cache_alloc(gfp_mask, mapping_order(mapping));
>  			if (!cached_page)
>  				return NULL;
>  		}
> @@ -818,7 +818,8 @@ grab_cache_page_nowait(struct address_sp
>  		page_cache_release(page);
>  		return NULL;
>  	}
> -	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
> +	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS,
> +				mapping_order(mapping));
>  	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
>  		page_cache_release(page);
>  		page = NULL;
> @@ -1479,6 +1480,12 @@ int generic_file_mmap(struct file * file
>  {
>  	struct address_space *mapping = file->f_mapping;
> 
> +	/*
> +	 * Forbid mmap access to higher order mappings.
> +	 */
> +	if (mapping_order(mapping))
> +		return -ENOSYS;
> +
>  	if (!mapping->a_ops->readpage)
>  		return -ENOEXEC;
>  	file_accessed(file);
> 


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [31/36] Large Blocksize: Core piece
  2007-08-30  0:11   ` Mingming Cao
@ 2007-08-30  0:12     ` Christoph Lameter
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-30  0:12 UTC (permalink / raw)
  To: Mingming Cao
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Wed, 29 Aug 2007, Mingming Cao wrote:

> > Known internal limitations:
> > 
> > Ext2            64k
> > XFS             64k
> > Reiserfs        8k
> > Ext3            4k (rumor has it that changing a constant can remove the limit)
> > Ext4            4k
> > 
> 
> There are patches originally worked on by Takashi Sato to support large
> block sizes (up to 64k) in ext2/3/4, which addressed the directory issue
> as well. I just forward ported them and will post them in a separate
> thread. Haven't gotten a chance to integrate them with your patch yet
> (next step).

Ahh. Great. Keep me posted.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* [RFC 1/4] Large Blocksize support for Ext2/3/4
  2007-08-30  0:11   ` Mingming Cao
  2007-08-30  0:12     ` Christoph Lameter
@ 2007-08-30  0:47     ` Mingming Cao
  2007-08-30  0:59       ` Christoph Lameter
                         ` (9 more replies)
  2007-08-30  0:47     ` [RFC 2/4]ext2: fix " Mingming Cao
                       ` (2 subsequent siblings)
  4 siblings, 10 replies; 124+ messages in thread
From: Mingming Cao @ 2007-08-30  0:47 UTC (permalink / raw)
  To: clameter, linux-fsdevel
  Cc: adilger, sho, ext4 development, linux-fsdevel, linux-kernel

The next 4 patches support large block size (up to PAGESIZE, max 64KB)
for ext2/3/4, originally from Takashi Sato.
http://marc.info/?l=linux-ext4&m=115768873518400&w=2


It's quite simple to support large block size in ext2/3/4, mostly just
enlarge the block size limit.  But it is NOT possible to have 64kB
blocksize on ext2/3/4 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have a
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is to put 2 empty records in such
a directory, or to special-case an impossible value like rec_len =
0xffff to handle this. 
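The overflow, and the proposed special-case, can be demonstrated in a few lines of standalone C (an illustration of the proposal above, not code from the patches; the encode/decode pair mirrors the suggested rec_len = 0xffff workaround):

```c
/*
 * Illustration of the rec_len problem: with a 64 KiB block, a single
 * directory entry spanning the whole block needs rec_len == 65536,
 * which wraps to 0 in a 16-bit on-disk field. The helpers below sketch
 * the "impossible value" workaround: store 0xffff to mean "whole block".
 */
#include <assert.h>

typedef unsigned short u16;           /* 16-bit on-disk rec_len field */

#define BLOCK_SIZE 65536U             /* 64 KiB blocksize */
#define REC_LEN_FULL_BLOCK 0xffffU    /* sentinel: entry covers the block */

static u16 encode_rec_len(unsigned int len)
{
	if (len == BLOCK_SIZE)
		return REC_LEN_FULL_BLOCK;  /* 65536 itself cannot be stored */
	return (u16)len;
}

static unsigned int decode_rec_len(u16 stored)
{
	if (stored == REC_LEN_FULL_BLOCK)
		return BLOCK_SIZE;
	return stored;
}
```

Since 0xffff is not 4-byte aligned it can never be a legal rec_len, so the sentinel is unambiguous.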


The Patch-set consists of the following 4 patches.
  [1/4]  ext2/3/4: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/4]  ext2: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

  [3/4]  ext3: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

  [4/4]  ext4: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

Just rebased to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. 

Next steps:
Need e2fsprogs changes to be able to test this feature, as mkfs needs to
be educated not to assume rec_len to be blocksize all the time.
Will try it with Christoph Lameter's large block patch next.


Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/ext2/super.c         |    2 +-
 fs/ext3/super.c         |    5 ++++-
 fs/ext4/super.c         |    5 +++++
 include/linux/ext2_fs.h |    4 ++--
 include/linux/ext3_fs.h |    4 ++--
 include/linux/ext4_fs.h |    4 ++--
 6 files changed, 16 insertions(+), 8 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/super.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/super.c	2007-08-29 15:22:29.000000000 -0700
@@ -775,7 +775,7 @@ static int ext2_fill_super(struct super_
 		brelse(bh);
 
 		if (!sb_set_blocksize(sb, blocksize)) {
-			printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+			printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n", blocksize);
 			goto failed_sbi;
 		}
 
Index: linux-2.6.23-rc3/fs/ext3/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/super.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/super.c	2007-08-29 15:22:29.000000000 -0700
@@ -1549,7 +1549,10 @@ static int ext3_fill_super (struct super
 		}
 
 		brelse (bh);
-		sb_set_blocksize(sb, blocksize);
+		if (!sb_set_blocksize(sb, blocksize)) {
+			printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n", blocksize);
+			goto out_fail;
+		}
 		logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
 		offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
 		bh = sb_bread(sb, logic_sb_block);
Index: linux-2.6.23-rc3/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/super.c	2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/super.c	2007-08-29 15:24:08.000000000 -0700
@@ -1626,6 +1626,11 @@ static int ext4_fill_super (struct super
 		goto out_fail;
 	}
 
+	if (!sb_set_blocksize(sb, blocksize)) {
+		printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+		goto out_fail;
+	}
+
 	/*
 	 * The ext4 superblock will not be buffer aligned for other than 1kB
 	 * block sizes.  We need to calculate the offset from buffer start.
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h	2007-08-29 15:22:29.000000000 -0700
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_
  * Macro-instructions used to manage several block sizes
  */
 #define EXT2_MIN_BLOCK_SIZE		1024
-#define	EXT2_MAX_BLOCK_SIZE		4096
-#define EXT2_MIN_BLOCK_LOG_SIZE		  10
+#define EXT2_MAX_BLOCK_SIZE		65536
+#define EXT2_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT2_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else
Index: linux-2.6.23-rc3/include/linux/ext3_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext3_fs.h	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext3_fs.h	2007-08-29 15:22:29.000000000 -0700
@@ -76,8 +76,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT3_MIN_BLOCK_SIZE		1024
-#define	EXT3_MAX_BLOCK_SIZE		4096
-#define EXT3_MIN_BLOCK_LOG_SIZE		  10
+#define	EXT3_MAX_BLOCK_SIZE		65536
+#define EXT3_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT3_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else
Index: linux-2.6.23-rc3/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext4_fs.h	2007-08-28 11:09:40.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext4_fs.h	2007-08-29 15:22:29.000000000 -0700
@@ -104,8 +104,8 @@ struct ext4_allocation_request {
  * Macro-instructions used to manage several block sizes
  */
 #define EXT4_MIN_BLOCK_SIZE		1024
-#define	EXT4_MAX_BLOCK_SIZE		4096
-#define EXT4_MIN_BLOCK_LOG_SIZE		  10
+#define	EXT4_MAX_BLOCK_SIZE		65536
+#define EXT4_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else



^ permalink raw reply	[flat|nested] 124+ messages in thread

* [RFC 2/4]ext2: fix rec_len overflow with 64KB block size
  2007-08-30  0:11   ` Mingming Cao
  2007-08-30  0:12     ` Christoph Lameter
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
@ 2007-08-30  0:47     ` Mingming Cao
  2007-08-30  0:48     ` [RFC 3/4] ext3: " Mingming Cao
  2007-08-30  0:48     ` [RFC 4/4]ext4: " Mingming Cao
  4 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-08-30  0:47 UTC (permalink / raw)
  To: clameter, linux-fsdevel; +Cc: linux-kernel, adilger, sho, ext4 development

[2/4]  ext2: fix rec_len overflow
         - prevent rec_len from overflowing with a 64KB blocksize


Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/ext2/dir.c           |   46 ++++++++++++++++++++++++++++++++++++----------
 include/linux/ext2_fs.h |   13 +++++++++++++
 2 files changed, 49 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc3/fs/ext2/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext2/dir.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext2/dir.c	2007-08-29 15:29:51.000000000 -0700
@@ -94,9 +94,9 @@ static void ext2_check_page(struct page 
 			goto out;
 	}
 	for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
+		offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
 		p = (ext2_dirent *)(kaddr + offs);
 		rec_len = le16_to_cpu(p->rec_len);
-
 		if (rec_len < EXT2_DIR_REC_LEN(1))
 			goto Eshort;
 		if (rec_len & 3)
@@ -108,6 +108,7 @@ static void ext2_check_page(struct page 
 		if (le32_to_cpu(p->inode) > max_inumber)
 			goto Einumber;
 	}
+	offs = EXT2_DIR_ADJUST_TAIL_OFFS(offs, chunk_size);
 	if (offs != limit)
 		goto Eend;
 out:
@@ -283,6 +284,7 @@ ext2_readdir (struct file * filp, void *
 		de = (ext2_dirent *)(kaddr+offset);
 		limit = kaddr + ext2_last_byte(inode, n) - EXT2_DIR_REC_LEN(1);
 		for ( ;(char*)de <= limit; de = ext2_next_entry(de)) {
+			de = EXT2_DIR_ADJUST_TAIL_ADDR(kaddr, de, sb->s_blocksize);
 			if (de->rec_len == 0) {
 				ext2_error(sb, __FUNCTION__,
 					"zero-length directory entry");
@@ -305,8 +307,10 @@ ext2_readdir (struct file * filp, void *
 					return 0;
 				}
 			}
+			filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 			filp->f_pos += le16_to_cpu(de->rec_len);
 		}
+		filp->f_pos = EXT2_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 		ext2_put_page(page);
 	}
 	return 0;
@@ -343,13 +347,14 @@ struct ext2_dir_entry_2 * ext2_find_entr
 		start = 0;
 	n = start;
 	do {
-		char *kaddr;
+		char *kaddr, *page_start;
 		page = ext2_get_page(dir, n);
 		if (!IS_ERR(page)) {
-			kaddr = page_address(page);
+			kaddr = page_start = page_address(page);
 			de = (ext2_dirent *) kaddr;
 			kaddr += ext2_last_byte(dir, n) - reclen;
 			while ((char *) de <= kaddr) {
+				de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, dir->i_sb->s_blocksize);
 				if (de->rec_len == 0) {
 					ext2_error(dir->i_sb, __FUNCTION__,
 						"zero-length directory entry");
@@ -416,6 +421,7 @@ void ext2_set_link(struct inode *dir, st
 	unsigned to = from + le16_to_cpu(de->rec_len);
 	int err;
 
+	to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
 	lock_page(page);
 	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
 	BUG_ON(err);
@@ -446,6 +452,7 @@ int ext2_add_link (struct dentry *dentry
 	char *kaddr;
 	unsigned from, to;
 	int err;
+	char *page_start = NULL;
 
 	/*
 	 * We take care of directory expansion in the same loop.
@@ -460,16 +467,28 @@ int ext2_add_link (struct dentry *dentry
 		if (IS_ERR(page))
 			goto out;
 		lock_page(page);
-		kaddr = page_address(page);
+		kaddr = page_start = page_address(page);
 		dir_end = kaddr + ext2_last_byte(dir, n);
 		de = (ext2_dirent *)kaddr;
-		kaddr += PAGE_CACHE_SIZE - reclen;
+		if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+			kaddr += PAGE_CACHE_SIZE - reclen;
+		} else {
+			kaddr += PAGE_CACHE_SIZE - 
+				(chunk_size - EXT2_DIR_MAX_REC_LEN) - reclen;
+		}
 		while ((char *)de <= kaddr) {
+			de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, chunk_size);	
 			if ((char *)de == dir_end) {
 				/* We hit i_size */
 				name_len = 0;
-				rec_len = chunk_size;
-				de->rec_len = cpu_to_le16(chunk_size);
+				if (chunk_size  < EXT2_DIR_MAX_REC_LEN) {
+					rec_len = chunk_size;
+					de->rec_len = cpu_to_le16(chunk_size);
+				} else {
+					rec_len = EXT2_DIR_MAX_REC_LEN;
+					de->rec_len =
+					cpu_to_le16(EXT2_DIR_MAX_REC_LEN);
+				}
 				de->inode = 0;
 				goto got_it;
 			}
@@ -499,6 +518,7 @@ int ext2_add_link (struct dentry *dentry
 got_it:
 	from = (char*)de - (char*)page_address(page);
 	to = from + rec_len;
+	to = EXT2_DIR_ADJUST_TAIL_OFFS(to, chunk_size);
 	err = page->mapping->a_ops->prepare_write(NULL, page, from, to);
 	if (err)
 		goto out_unlock;
@@ -541,6 +561,7 @@ int ext2_delete_entry (struct ext2_dir_e
 	ext2_dirent * de = (ext2_dirent *) (kaddr + from);
 	int err;
 
+	to = EXT2_DIR_ADJUST_TAIL_OFFS(to, inode->i_sb->s_blocksize);
 	while ((char*)de < (char*)dir) {
 		if (de->rec_len == 0) {
 			ext2_error(inode->i_sb, __FUNCTION__,
@@ -598,7 +619,11 @@ int ext2_make_empty(struct inode *inode,
 
 	de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
 	de->name_len = 2;
-	de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+	if (chunk_size < EXT2_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+	} else {
+		de->rec_len = cpu_to_le16(EXT2_DIR_MAX_REC_LEN - EXT2_DIR_REC_LEN(1));
+	}
 	de->inode = cpu_to_le32(parent->i_ino);
 	memcpy (de->name, "..\0", 4);
 	ext2_set_de_type (de, inode);
@@ -618,18 +643,19 @@ int ext2_empty_dir (struct inode * inode
 	unsigned long i, npages = dir_pages(inode);
 
 	for (i = 0; i < npages; i++) {
-		char *kaddr;
+		char *kaddr, *page_start;
 		ext2_dirent * de;
 		page = ext2_get_page(inode, i);
 
 		if (IS_ERR(page))
 			continue;
 
-		kaddr = page_address(page);
+		kaddr = page_start = page_address(page);
 		de = (ext2_dirent *)kaddr;
 		kaddr += ext2_last_byte(inode, i) - EXT2_DIR_REC_LEN(1);
 
 		while ((char *)de <= kaddr) {
+			de = EXT2_DIR_ADJUST_TAIL_ADDR(page_start, de, inode->i_sb->s_blocksize);
 			if (de->rec_len == 0) {
 				ext2_error(inode->i_sb, __FUNCTION__,
 					"zero-length directory entry");
Index: linux-2.6.23-rc3/include/linux/ext2_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext2_fs.h	2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext2_fs.h	2007-08-29 15:29:51.000000000 -0700
@@ -557,5 +557,18 @@ enum {
 #define EXT2_DIR_ROUND 			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND) & \
 					 ~EXT2_DIR_ROUND)
+#define	EXT2_DIR_MAX_REC_LEN		65532
+
+/*
+ * Align a tail offset(address) to the end of a directory block
+ */
+#define EXT2_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+	((((offs) & ((bsize) -1)) == EXT2_DIR_MAX_REC_LEN) ? \
+	((offs) + (bsize) - EXT2_DIR_MAX_REC_LEN):(offs))
+
+#define EXT2_DIR_ADJUST_TAIL_ADDR(page, de, bsize) \
+	(((((char*)(de) - (page)) & ((bsize) - 1)) == EXT2_DIR_MAX_REC_LEN) ? \
+	((ext2_dirent*)((char*)(de) + (bsize) - EXT2_DIR_MAX_REC_LEN)):(de))
 
 #endif	/* _LINUX_EXT2_FS_H */
+




* [RFC 3/4] ext3: fix rec_len overflow with 64KB block size
  2007-08-30  0:11   ` Mingming Cao
                       ` (2 preceding siblings ...)
  2007-08-30  0:47     ` [RFC 2/4]ext2: fix " Mingming Cao
@ 2007-08-30  0:48     ` Mingming Cao
  2007-08-30  0:48     ` [RFC 4/4]ext4: " Mingming Cao
  4 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-08-30  0:48 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, adilger, sho, ext4 development

[3/4]  ext3: fix rec_len overflow
         - prevent rec_len from overflowing with a 64KB blocksize

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/ext3/dir.c           |   13 ++++---
 fs/ext3/namei.c         |   88 +++++++++++++++++++++++++++++++++++++++---------
 include/linux/ext3_fs.h |    9 ++++
 3 files changed, 91 insertions(+), 19 deletions(-)

Index: linux-2.6.23-rc3/fs/ext3/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/dir.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/dir.c	2007-08-29 15:40:06.000000000 -0700
@@ -100,10 +100,11 @@ static int ext3_readdir(struct file * fi
 	unsigned long offset;
 	int i, stored;
 	struct ext3_dir_entry_2 *de;
-	struct super_block *sb;
 	int err;
 	struct inode *inode = filp->f_path.dentry->d_inode;
 	int ret = 0;
+	struct super_block *sb = inode->i_sb;
+	unsigned tail = sb->s_blocksize;
 
 	sb = inode->i_sb;
 
@@ -167,8 +168,11 @@ revalidate:
 		 * readdir(2), then we might be pointing to an invalid
 		 * dirent right now.  Scan from the start of the block
 		 * to make sure. */
-		if (filp->f_version != inode->i_version) {
-			for (i = 0; i < sb->s_blocksize && i < offset; ) {
+		if (tail >  EXT3_DIR_MAX_REC_LEN) {
+			tail = EXT3_DIR_MAX_REC_LEN;
+		}
+                if (filp->f_version != inode->i_version) {
+			for (i = 0; i < tail && i < offset; ) {
 				de = (struct ext3_dir_entry_2 *)
 					(bh->b_data + i);
 				/* It's too expensive to do a full
@@ -189,7 +193,7 @@ revalidate:
 		}
 
 		while (!error && filp->f_pos < inode->i_size
-		       && offset < sb->s_blocksize) {
+		       && offset < tail) {
 			de = (struct ext3_dir_entry_2 *) (bh->b_data + offset);
 			if (!ext3_check_dir_entry ("ext3_readdir", inode, de,
 						   bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
 			}
 			filp->f_pos += le16_to_cpu(de->rec_len);
 		}
+		filp->f_pos = EXT3_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 		offset = 0;
 		brelse (bh);
 	}
Index: linux-2.6.23-rc3/fs/ext3/namei.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext3/namei.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext3/namei.c	2007-08-29 15:30:10.000000000 -0700
@@ -262,9 +262,13 @@ static struct stats dx_show_leaf(struct 
 	unsigned names = 0, space = 0;
 	char *base = (char *) de;
 	struct dx_hash_info h = *hinfo;
+	unsigned tail = size;
 
 	printk("names: ");
-	while ((char *) de < base + size)
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
+	while ((char *) de < base + tail)
 	{
 		if (de->inode)
 		{
@@ -677,8 +681,12 @@ static int dx_make_map (struct ext3_dir_
 	int count = 0;
 	char *base = (char *) de;
 	struct dx_hash_info h = *hinfo;
+	unsigned tail = size;
 
-	while ((char *) de < base + size)
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
+	while ((char *) de < base + tail)
 	{
 		if (de->name_len && de->inode) {
 			ext3fs_dirhash(de->name, de->name_len, &h);
@@ -775,9 +783,13 @@ static inline int search_dirblock(struct
 	int de_len;
 	const char *name = dentry->d_name.name;
 	int namelen = dentry->d_name.len;
+	unsigned tail = dir->i_sb->s_blocksize;
 
 	de = (struct ext3_dir_entry_2 *) bh->b_data;
-	dlimit = bh->b_data + dir->i_sb->s_blocksize;
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
+	dlimit = bh->b_data + tail;
 	while ((char *) de < dlimit) {
 		/* this code is executed quadratically often */
 		/* do minimal checking `by hand' */
@@ -1115,6 +1127,9 @@ static struct ext3_dir_entry_2* dx_pack_
 	unsigned rec_len = 0;
 
 	prev = to = de;
+	if (size > EXT3_DIR_MAX_REC_LEN) {
+		size = EXT3_DIR_MAX_REC_LEN;
+	}
 	while ((char*)de < base + size) {
 		next = (struct ext3_dir_entry_2 *) ((char *) de +
 						    le16_to_cpu(de->rec_len));
@@ -1180,8 +1195,15 @@ static struct ext3_dir_entry_2 *do_split
 	/* Fancy dance to stay within two buffers */
 	de2 = dx_move_dirents(data1, data2, map + split, count - split);
 	de = dx_pack_dirents(data1,blocksize);
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
-	de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+		de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	} else {
+		de->rec_len = cpu_to_le16(data1 + EXT3_DIR_MAX_REC_LEN -
+							(char *) de);
+		de2->rec_len = cpu_to_le16(data2 + EXT3_DIR_MAX_REC_LEN -
+							(char *) de2);
+	}
 	dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data1, blocksize, 1));
 	dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data2, blocksize, 1));
 
@@ -1236,11 +1258,15 @@ static int add_dirent_to_buf(handle_t *h
 	unsigned short	reclen;
 	int		nlen, rlen, err;
 	char		*top;
+	unsigned	tail = dir->i_sb->s_blocksize;
 
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
 	reclen = EXT3_DIR_REC_LEN(namelen);
 	if (!de) {
 		de = (struct ext3_dir_entry_2 *)bh->b_data;
-		top = bh->b_data + dir->i_sb->s_blocksize - reclen;
+		top = bh->b_data + tail - reclen;
 		while ((char *) de <= top) {
 			if (!ext3_check_dir_entry("ext3_add_entry", dir, de,
 						  bh, offset)) {
@@ -1354,13 +1380,21 @@ static int make_indexed_dir(handle_t *ha
 	/* The 0th block becomes the root, move the dirents out */
 	fde = &root->dotdot;
 	de = (struct ext3_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
-	len = ((char *) root) + blocksize - (char *) de;
+	if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+		len = ((char *) root) + blocksize - (char *) de;
+	} else {
+		len = ((char *) root) + EXT3_DIR_MAX_REC_LEN - (char *) de;
+	}
 	memcpy (data1, de, len);
 	de = (struct ext3_dir_entry_2 *) data1;
 	top = data1 + len;
 	while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
 		de = de2;
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	} else {
+		de->rec_len = cpu_to_le16(data1 + EXT3_DIR_MAX_REC_LEN - (char *) de);
+	}
 	/* Initialize the root; the dot dirents already exist */
 	de = (struct ext3_dir_entry_2 *) (&root->dotdot);
 	de->rec_len = cpu_to_le16(blocksize - EXT3_DIR_REC_LEN(2));
@@ -1450,7 +1484,11 @@ static int ext3_add_entry (handle_t *han
 		return retval;
 	de = (struct ext3_dir_entry_2 *) bh->b_data;
 	de->inode = 0;
-	de->rec_len = cpu_to_le16(blocksize);
+	if (blocksize < EXT3_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(blocksize);
+	} else {
+		de->rec_len = cpu_to_le16(EXT3_DIR_MAX_REC_LEN);
+	}
 	return add_dirent_to_buf(handle, dentry, inode, de, bh);
 }
 
@@ -1514,7 +1552,12 @@ static int ext3_dx_add_entry(handle_t *h
 			goto cleanup;
 		node2 = (struct dx_node *)(bh2->b_data);
 		entries2 = node2->entries;
-		node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		if (sb->s_blocksize < EXT3_DIR_MAX_REC_LEN) {
+			node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		} else {
+			node2->fake.rec_len =
+				cpu_to_le16(EXT3_DIR_MAX_REC_LEN);
+		}
 		node2->fake.inode = 0;
 		BUFFER_TRACE(frame->bh, "get_write_access");
 		err = ext3_journal_get_write_access(handle, frame->bh);
@@ -1602,11 +1645,15 @@ static int ext3_delete_entry (handle_t *
 {
 	struct ext3_dir_entry_2 * de, * pde;
 	int i;
+	unsigned tail = bh->b_size;
 
 	i = 0;
 	pde = NULL;
 	de = (struct ext3_dir_entry_2 *) bh->b_data;
-	while (i < bh->b_size) {
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
+	while (i < tail) {
 		if (!ext3_check_dir_entry("ext3_delete_entry", dir, de, bh, i))
 			return -EIO;
 		if (de == de_del)  {
@@ -1766,7 +1813,11 @@ retry:
 	de = (struct ext3_dir_entry_2 *)
 			((char *) de + le16_to_cpu(de->rec_len));
 	de->inode = cpu_to_le32(dir->i_ino);
-	de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+	if (inode->i_sb->s_blocksize < EXT3_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+	} else {
+		de->rec_len = cpu_to_le16(EXT3_DIR_MAX_REC_LEN-EXT3_DIR_REC_LEN(1));
+	}
 	de->name_len = 2;
 	strcpy (de->name, "..");
 	ext3_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1801,10 +1852,10 @@ static int empty_dir (struct inode * ino
 	unsigned long offset;
 	struct buffer_head * bh;
 	struct ext3_dir_entry_2 * de, * de1;
-	struct super_block * sb;
+	struct super_block * sb = inode->i_sb;
 	int err = 0;
+	unsigned tail = sb->s_blocksize;
 
-	sb = inode->i_sb;
 	if (inode->i_size < EXT3_DIR_REC_LEN(1) + EXT3_DIR_REC_LEN(2) ||
 	    !(bh = ext3_bread (NULL, inode, 0, 0, &err))) {
 		if (err)
@@ -1831,11 +1882,17 @@ static int empty_dir (struct inode * ino
 		return 1;
 	}
 	offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
+	if (offset == EXT3_DIR_MAX_REC_LEN) {
+		offset += sb->s_blocksize - EXT3_DIR_MAX_REC_LEN;
+	}
 	de = (struct ext3_dir_entry_2 *)
 			((char *) de1 + le16_to_cpu(de1->rec_len));
+	if (tail > EXT3_DIR_MAX_REC_LEN) {
+		tail = EXT3_DIR_MAX_REC_LEN;
+	}
 	while (offset < inode->i_size ) {
 		if (!bh ||
-			(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
+			(void *) de >= (void *) (bh->b_data + tail)) {
 			err = 0;
 			brelse (bh);
 			bh = ext3_bread (NULL, inode,
@@ -1862,6 +1919,7 @@ static int empty_dir (struct inode * ino
 			return 0;
 		}
 		offset += le16_to_cpu(de->rec_len);
+		offset = EXT3_DIR_ADJUST_TAIL_OFFS(offset, sb->s_blocksize);
 		de = (struct ext3_dir_entry_2 *)
 				((char *) de + le16_to_cpu(de->rec_len));
 	}
Index: linux-2.6.23-rc3/include/linux/ext3_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext3_fs.h	2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext3_fs.h	2007-08-29 15:30:10.000000000 -0700
@@ -660,6 +660,15 @@ struct ext3_dir_entry_2 {
 #define EXT3_DIR_ROUND			(EXT3_DIR_PAD - 1)
 #define EXT3_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT3_DIR_ROUND) & \
 					 ~EXT3_DIR_ROUND)
+#define EXT3_DIR_MAX_REC_LEN		65532
+
+/*
+ * Align a tail offset to the end of a directory block
+ */
+#define EXT3_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+	((((offs) & ((bsize) -1)) == EXT3_DIR_MAX_REC_LEN) ? \
+	((offs) + (bsize) - EXT3_DIR_MAX_REC_LEN):(offs))
+
 /*
  * Hash Tree Directory indexing
  * (c) Daniel Phillips, 2001




* [RFC 4/4]ext4: fix rec_len overflow with 64KB block size
  2007-08-30  0:11   ` Mingming Cao
                       ` (3 preceding siblings ...)
  2007-08-30  0:48     ` [RFC 3/4] ext3: " Mingming Cao
@ 2007-08-30  0:48     ` Mingming Cao
  4 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-08-30  0:48 UTC (permalink / raw)
  To: clameter; +Cc: linux-kernel, adilger, sho, ext4 development

[4/4]  ext4: fix rec_len overflow
         - prevent rec_len from overflowing with a 64KB blocksize


Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/ext4/dir.c           |   11 ++++--
 fs/ext4/namei.c         |   88 +++++++++++++++++++++++++++++++++++++++---------
 include/linux/ext4_fs.h |    9 ++++
 3 files changed, 90 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc3/fs/ext4/dir.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/dir.c	2007-08-12 21:25:24.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/dir.c	2007-08-29 15:33:19.000000000 -0700
@@ -100,10 +100,11 @@ static int ext4_readdir(struct file * fi
 	unsigned long offset;
 	int i, stored;
 	struct ext4_dir_entry_2 *de;
-	struct super_block *sb;
 	int err;
 	struct inode *inode = filp->f_path.dentry->d_inode;
+	struct super_block *sb = inode->i_sb;
 	int ret = 0;
+	unsigned tail = sb->s_blocksize;
 
 	sb = inode->i_sb;
 
@@ -166,8 +167,11 @@ revalidate:
 		 * readdir(2), then we might be pointing to an invalid
 		 * dirent right now.  Scan from the start of the block
 		 * to make sure. */
+		if (tail >  EXT4_DIR_MAX_REC_LEN) {
+			tail = EXT4_DIR_MAX_REC_LEN;
+		}
 		if (filp->f_version != inode->i_version) {
-			for (i = 0; i < sb->s_blocksize && i < offset; ) {
+			for (i = 0; i < tail && i < offset; ) {
 				de = (struct ext4_dir_entry_2 *)
 					(bh->b_data + i);
 				/* It's too expensive to do a full
@@ -188,7 +192,7 @@ revalidate:
 		}
 
 		while (!error && filp->f_pos < inode->i_size
-		       && offset < sb->s_blocksize) {
+		       && offset < tail) {
 			de = (struct ext4_dir_entry_2 *) (bh->b_data + offset);
 			if (!ext4_check_dir_entry ("ext4_readdir", inode, de,
 						   bh, offset)) {
@@ -225,6 +229,7 @@ revalidate:
 			}
 			filp->f_pos += le16_to_cpu(de->rec_len);
 		}
+		filp->f_pos = EXT4_DIR_ADJUST_TAIL_OFFS(filp->f_pos, sb->s_blocksize);
 		offset = 0;
 		brelse (bh);
 	}
Index: linux-2.6.23-rc3/fs/ext4/namei.c
===================================================================
--- linux-2.6.23-rc3.orig/fs/ext4/namei.c	2007-08-28 11:08:48.000000000 -0700
+++ linux-2.6.23-rc3/fs/ext4/namei.c	2007-08-29 15:30:22.000000000 -0700
@@ -262,9 +262,13 @@ static struct stats dx_show_leaf(struct 
 	unsigned names = 0, space = 0;
 	char *base = (char *) de;
 	struct dx_hash_info h = *hinfo;
+	unsigned tail = size;
 
 	printk("names: ");
-	while ((char *) de < base + size)
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
+	while ((char *) de < base + tail)
 	{
 		if (de->inode)
 		{
@@ -677,8 +681,12 @@ static int dx_make_map (struct ext4_dir_
 	int count = 0;
 	char *base = (char *) de;
 	struct dx_hash_info h = *hinfo;
+	unsigned tail = size;
 
-	while ((char *) de < base + size)
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
+	while ((char *) de < base + tail)
 	{
 		if (de->name_len && de->inode) {
 			ext4fs_dirhash(de->name, de->name_len, &h);
@@ -773,9 +781,13 @@ static inline int search_dirblock(struct
 	int de_len;
 	const char *name = dentry->d_name.name;
 	int namelen = dentry->d_name.len;
+	unsigned tail = dir->i_sb->s_blocksize;
 
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
-	dlimit = bh->b_data + dir->i_sb->s_blocksize;
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
+	dlimit = bh->b_data + tail;
 	while ((char *) de < dlimit) {
 		/* this code is executed quadratically often */
 		/* do minimal checking `by hand' */
@@ -1113,6 +1125,9 @@ static struct ext4_dir_entry_2* dx_pack_
 	unsigned rec_len = 0;
 
 	prev = to = de;
+	if (size > EXT4_DIR_MAX_REC_LEN) {
+		size = EXT4_DIR_MAX_REC_LEN;
+	}
 	while ((char*)de < base + size) {
 		next = (struct ext4_dir_entry_2 *) ((char *) de +
 						    le16_to_cpu(de->rec_len));
@@ -1178,8 +1193,15 @@ static struct ext4_dir_entry_2 *do_split
 	/* Fancy dance to stay within two buffers */
 	de2 = dx_move_dirents(data1, data2, map + split, count - split);
 	de = dx_pack_dirents(data1,blocksize);
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
-	de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+		de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	} else {
+		de->rec_len = cpu_to_le16(data1 + EXT4_DIR_MAX_REC_LEN -
+							(char *) de);
+		de2->rec_len = cpu_to_le16(data2 + EXT4_DIR_MAX_REC_LEN -
+							(char *) de2);
+	}
 	dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
 	dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
 
@@ -1234,11 +1256,15 @@ static int add_dirent_to_buf(handle_t *h
 	unsigned short	reclen;
 	int		nlen, rlen, err;
 	char		*top;
+	unsigned        tail = dir->i_sb->s_blocksize;
 
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
 	reclen = EXT4_DIR_REC_LEN(namelen);
 	if (!de) {
 		de = (struct ext4_dir_entry_2 *)bh->b_data;
-		top = bh->b_data + dir->i_sb->s_blocksize - reclen;
+		top = bh->b_data + tail - reclen;
 		while ((char *) de <= top) {
 			if (!ext4_check_dir_entry("ext4_add_entry", dir, de,
 						  bh, offset)) {
@@ -1351,13 +1377,21 @@ static int make_indexed_dir(handle_t *ha
 	/* The 0th block becomes the root, move the dirents out */
 	fde = &root->dotdot;
 	de = (struct ext4_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
-	len = ((char *) root) + blocksize - (char *) de;
+	if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+		len = ((char *) root) + blocksize - (char *) de;
+	} else {
+		len = ((char *) root) + EXT4_DIR_MAX_REC_LEN - (char *) de;
+	}
 	memcpy (data1, de, len);
 	de = (struct ext4_dir_entry_2 *) data1;
 	top = data1 + len;
 	while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
 		de = de2;
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	} else {
+		de->rec_len = cpu_to_le16(data1 + EXT4_DIR_MAX_REC_LEN - (char *) de);
+	}
 	/* Initialize the root; the dot dirents already exist */
 	de = (struct ext4_dir_entry_2 *) (&root->dotdot);
 	de->rec_len = cpu_to_le16(blocksize - EXT4_DIR_REC_LEN(2));
@@ -1447,7 +1481,11 @@ static int ext4_add_entry (handle_t *han
 		return retval;
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
 	de->inode = 0;
-	de->rec_len = cpu_to_le16(blocksize);
+	if (blocksize < EXT4_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(blocksize);
+	} else {
+		de->rec_len = cpu_to_le16(EXT4_DIR_MAX_REC_LEN);
+	}
 	return add_dirent_to_buf(handle, dentry, inode, de, bh);
 }
 
@@ -1511,7 +1549,12 @@ static int ext4_dx_add_entry(handle_t *h
 			goto cleanup;
 		node2 = (struct dx_node *)(bh2->b_data);
 		entries2 = node2->entries;
-		node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		if (sb->s_blocksize < EXT4_DIR_MAX_REC_LEN) {
+			node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		} else {
+			node2->fake.rec_len =
+				cpu_to_le16(EXT4_DIR_MAX_REC_LEN);
+		}
 		node2->fake.inode = 0;
 		BUFFER_TRACE(frame->bh, "get_write_access");
 		err = ext4_journal_get_write_access(handle, frame->bh);
@@ -1599,11 +1642,15 @@ static int ext4_delete_entry (handle_t *
 {
 	struct ext4_dir_entry_2 * de, * pde;
 	int i;
+	unsigned tail = bh->b_size;
 
 	i = 0;
 	pde = NULL;
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
-	while (i < bh->b_size) {
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
+	while (i < tail) {
 		if (!ext4_check_dir_entry("ext4_delete_entry", dir, de, bh, i))
 			return -EIO;
 		if (de == de_del)  {
@@ -1791,7 +1838,11 @@ retry:
 	de = (struct ext4_dir_entry_2 *)
 			((char *) de + le16_to_cpu(de->rec_len));
 	de->inode = cpu_to_le32(dir->i_ino);
-	de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+	if (inode->i_sb->s_blocksize < EXT4_DIR_MAX_REC_LEN) {
+		de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+	} else  {   
+		de->rec_len = cpu_to_le16(EXT4_DIR_MAX_REC_LEN-EXT4_DIR_REC_LEN(1));
+	}
 	de->name_len = 2;
 	strcpy (de->name, "..");
 	ext4_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1826,10 +1877,10 @@ static int empty_dir (struct inode * ino
 	unsigned long offset;
 	struct buffer_head * bh;
 	struct ext4_dir_entry_2 * de, * de1;
-	struct super_block * sb;
+	struct super_block * sb = inode->i_sb;
 	int err = 0;
+	unsigned tail = sb->s_blocksize;
 
-	sb = inode->i_sb;
 	if (inode->i_size < EXT4_DIR_REC_LEN(1) + EXT4_DIR_REC_LEN(2) ||
 	    !(bh = ext4_bread (NULL, inode, 0, 0, &err))) {
 		if (err)
@@ -1856,11 +1907,17 @@ static int empty_dir (struct inode * ino
 		return 1;
 	}
 	offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
+	if (offset == EXT4_DIR_MAX_REC_LEN) {
+		offset += sb->s_blocksize - EXT4_DIR_MAX_REC_LEN;
+	}
 	de = (struct ext4_dir_entry_2 *)
 			((char *) de1 + le16_to_cpu(de1->rec_len));
+	if (tail > EXT4_DIR_MAX_REC_LEN) {
+		tail = EXT4_DIR_MAX_REC_LEN;
+	}
 	while (offset < inode->i_size ) {
 		if (!bh ||
-			(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
+			(void *) de >= (void *) (bh->b_data + tail)) {
 			err = 0;
 			brelse (bh);
 			bh = ext4_bread (NULL, inode,
@@ -1887,6 +1944,7 @@ static int empty_dir (struct inode * ino
 			return 0;
 		}
 		offset += le16_to_cpu(de->rec_len);
+		offset = EXT4_DIR_ADJUST_TAIL_OFFS(offset, sb->s_blocksize);
 		de = (struct ext4_dir_entry_2 *)
 				((char *) de + le16_to_cpu(de->rec_len));
 	}
Index: linux-2.6.23-rc3/include/linux/ext4_fs.h
===================================================================
--- linux-2.6.23-rc3.orig/include/linux/ext4_fs.h	2007-08-29 15:22:29.000000000 -0700
+++ linux-2.6.23-rc3/include/linux/ext4_fs.h	2007-08-29 15:30:22.000000000 -0700
@@ -834,6 +834,15 @@ struct ext4_dir_entry_2 {
 #define EXT4_DIR_ROUND			(EXT4_DIR_PAD - 1)
 #define EXT4_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT4_DIR_ROUND) & \
 					 ~EXT4_DIR_ROUND)
+#define	EXT4_DIR_MAX_REC_LEN		65532
+
+/*
+ * Align a tail offset to the end of a directory block
+ */
+#define EXT4_DIR_ADJUST_TAIL_OFFS(offs, bsize) \
+	((((offs) & ((bsize) -1)) == EXT4_DIR_MAX_REC_LEN) ? \
+	((offs) + (bsize) - EXT4_DIR_MAX_REC_LEN):(offs))
+
 /*
  * Hash Tree Directory indexing
  * (c) Daniel Phillips, 2001




* Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
@ 2007-08-30  0:59       ` Christoph Lameter
  2007-09-01  0:01       ` Mingming Cao
                         ` (8 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-30  0:59 UTC (permalink / raw)
  To: Mingming Cao; +Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel

On Wed, 29 Aug 2007, Mingming Cao wrote:

> It's quite simple to support large block size in ext2/3/4, mostly just
> enlarge the block size limit.  But it is NOT possible to have 64kB
> blocksize on ext2/3/4 without some changes to the directory handling
> code.  The reason is that an empty 64kB directory block would have a
> rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
> the filesystem.  The proposed solution is to put 2 empty records in such
> a directory, or to special-case an impossible value like rec_len =
> 0xffff to handle this. 

Ahh. Good.

I could add the patch to the large blocksize patchset?



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-28 19:06 ` [11/36] Use page_cache_xxx in fs/buffer.c clameter
@ 2007-08-30  9:20   ` Dmitry Monakhov
  2007-08-30 18:14     ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Dmitry Monakhov @ 2007-08-30  9:20 UTC (permalink / raw)
  To: clameter
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On 12:06 Tue 28 Aug     , clameter@sgi.com wrote:
> Use page_cache_xxx in fs/buffer.c.
submit_bh() wasn't changed, which means that bio pages may have a huge size
that disregards the queue restrictions (q->max_hw_segments, etc.).
At least drivers/md/raid0 will be broken by your patch.
> 
> We have a special situation in set_bh_page() since reiserfs calls that
> function before setting up the mapping. So retrieve the page size
> from the page struct rather than the mapping.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> ---
>  fs/buffer.c |  110 +++++++++++++++++++++++++++++++++---------------------------
>  1 file changed, 62 insertions(+), 48 deletions(-)
> 
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c	2007-08-28 11:37:13.000000000 -0700
> +++ linux-2.6/fs/buffer.c	2007-08-28 11:37:58.000000000 -0700
> @@ -257,7 +257,7 @@ __find_get_block_slow(struct block_devic
>  	struct page *page;
>  	int all_mapped = 1;
>  
> -	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
> +	index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits);
>  	page = find_get_page(bd_mapping, index);
>  	if (!page)
>  		goto out;
> @@ -697,7 +697,7 @@ static int __set_page_dirty(struct page 
>  
>  		if (mapping_cap_account_dirty(mapping)) {
>  			__inc_zone_page_state(page, NR_FILE_DIRTY);
> -			task_io_account_write(PAGE_CACHE_SIZE);
> +			task_io_account_write(page_cache_size(mapping));
>  		}
>  		radix_tree_tag_set(&mapping->page_tree,
>  				page_index(page), PAGECACHE_TAG_DIRTY);
> @@ -891,10 +891,11 @@ struct buffer_head *alloc_page_buffers(s
>  {
>  	struct buffer_head *bh, *head;
>  	long offset;
> +	unsigned int page_size = page_cache_size(page->mapping);
>  
>  try_again:
>  	head = NULL;
> -	offset = PAGE_SIZE;
> +	offset = page_size;
>  	while ((offset -= size) >= 0) {
>  		bh = alloc_buffer_head(GFP_NOFS);
>  		if (!bh)
> @@ -1426,7 +1427,7 @@ void set_bh_page(struct buffer_head *bh,
>  		struct page *page, unsigned long offset)
>  {
>  	bh->b_page = page;
> -	BUG_ON(offset >= PAGE_SIZE);
> +	BUG_ON(offset >= compound_size(page));
>  	if (PageHighMem(page))
>  		/*
>  		 * This catches illegal uses and preserves the offset:
> @@ -1605,6 +1606,7 @@ static int __block_write_full_page(struc
>  	struct buffer_head *bh, *head;
>  	const unsigned blocksize = 1 << inode->i_blkbits;
>  	int nr_underway = 0;
> +	struct address_space *mapping = inode->i_mapping;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -1625,7 +1627,8 @@ static int __block_write_full_page(struc
>  	 * handle that here by just cleaning them.
>  	 */
>  
> -	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> +	block = (sector_t)page->index <<
> +		(page_cache_shift(mapping) - inode->i_blkbits);
>  	head = page_buffers(page);
>  	bh = head;
>  
> @@ -1742,7 +1745,7 @@ recover:
>  	} while ((bh = bh->b_this_page) != head);
>  	SetPageError(page);
>  	BUG_ON(PageWriteback(page));
> -	mapping_set_error(page->mapping, err);
> +	mapping_set_error(mapping, err);
>  	set_page_writeback(page);
>  	do {
>  		struct buffer_head *next = bh->b_this_page;
> @@ -1767,8 +1770,8 @@ static int __block_prepare_write(struct 
>  	struct buffer_head *bh, *head, *wait[2], **wait_bh=wait;
>  
>  	BUG_ON(!PageLocked(page));
> -	BUG_ON(from > PAGE_CACHE_SIZE);
> -	BUG_ON(to > PAGE_CACHE_SIZE);
> +	BUG_ON(from > page_cache_size(inode->i_mapping));
> +	BUG_ON(to > page_cache_size(inode->i_mapping));
>  	BUG_ON(from > to);
>  
>  	blocksize = 1 << inode->i_blkbits;
> @@ -1777,7 +1780,8 @@ static int __block_prepare_write(struct 
>  	head = page_buffers(page);
>  
>  	bbits = inode->i_blkbits;
> -	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> +	block = (sector_t)page->index <<
> +		(page_cache_shift(inode->i_mapping) - bbits);
>  
>  	for(bh = head, block_start = 0; bh != head || !block_start;
>  	    block++, block_start=block_end, bh = bh->b_this_page) {
> @@ -1921,7 +1925,8 @@ int block_read_full_page(struct page *pa
>  		create_empty_buffers(page, blocksize, 0);
>  	head = page_buffers(page);
>  
> -	iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> +	iblock = (sector_t)page->index <<
> +		(page_cache_shift(page->mapping) - inode->i_blkbits);
>  	lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits;
>  	bh = head;
>  	nr = 0;
> @@ -2045,7 +2050,7 @@ int generic_cont_expand(struct inode *in
>  	pgoff_t index;
>  	unsigned int offset;
>  
> -	offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */
> +	offset = page_cache_offset(inode->i_mapping, size); /* Within page */
>  
>  	/* ugh.  in prepare/commit_write, if from==to==start of block, we
>  	** skip the prepare.  make sure we never send an offset for the start
> @@ -2055,7 +2060,7 @@ int generic_cont_expand(struct inode *in
>  		/* caller must handle this extra byte. */
>  		offset++;
>  	}
> -	index = size >> PAGE_CACHE_SHIFT;
> +	index = page_cache_index(inode->i_mapping, size);
>  
>  	return __generic_cont_expand(inode, size, index, offset);
>  }
> @@ -2063,8 +2068,8 @@ int generic_cont_expand(struct inode *in
>  int generic_cont_expand_simple(struct inode *inode, loff_t size)
>  {
>  	loff_t pos = size - 1;
> -	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> -	unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1;
> +	pgoff_t index = page_cache_index(inode->i_mapping, pos);
> +	unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1;
>  
>  	/* prepare/commit_write can handle even if from==to==start of block. */
>  	return __generic_cont_expand(inode, size, index, offset);
> @@ -2086,28 +2091,28 @@ int cont_prepare_write(struct page *page
>  	unsigned zerofrom;
>  	unsigned blocksize = 1 << inode->i_blkbits;
>  
> -	while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) {
> +	while (page->index > (pgpos = page_cache_index(mapping, *bytes))) {
>  		status = -ENOMEM;
>  		new_page = grab_cache_page(mapping, pgpos);
>  		if (!new_page)
>  			goto out;
>  		/* we might sleep */
> -		if (*bytes>>PAGE_CACHE_SHIFT != pgpos) {
> +		if (page_cache_index(mapping, *bytes) != pgpos) {
>  			unlock_page(new_page);
>  			page_cache_release(new_page);
>  			continue;
>  		}
> -		zerofrom = *bytes & ~PAGE_CACHE_MASK;
> +		zerofrom = page_cache_offset(mapping, *bytes);
>  		if (zerofrom & (blocksize-1)) {
>  			*bytes |= (blocksize-1);
>  			(*bytes)++;
>  		}
>  		status = __block_prepare_write(inode, new_page, zerofrom,
> -						PAGE_CACHE_SIZE, get_block);
> +					page_cache_size(mapping), get_block);
>  		if (status)
>  			goto out_unmap;
> -		zero_user_segment(new_page, zerofrom, PAGE_CACHE_SIZE);
> -		generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE);
> +		zero_user_segment(new_page, zerofrom, page_cache_size(mapping));
> +		generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping));
>  		unlock_page(new_page);
>  		page_cache_release(new_page);
>  	}
> @@ -2117,7 +2122,7 @@ int cont_prepare_write(struct page *page
>  		zerofrom = offset;
>  	} else {
>  		/* page covers the boundary, find the boundary offset */
> -		zerofrom = *bytes & ~PAGE_CACHE_MASK;
> +		zerofrom = page_cache_offset(mapping, *bytes);
>  
>  		/* if we will expand the thing last block will be filled */
>  		if (to > zerofrom && (zerofrom & (blocksize-1))) {
> @@ -2169,8 +2174,9 @@ int block_commit_write(struct page *page
>  int generic_commit_write(struct file *file, struct page *page,
>  		unsigned from, unsigned to)
>  {
> -	struct inode *inode = page->mapping->host;
> -	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +	struct address_space *mapping = page->mapping;
> +	struct inode *inode = mapping->host;
> +	loff_t pos = page_cache_pos(mapping, page->index, to);
>  	__block_commit_write(inode,page,from,to);
>  	/*
>  	 * No need to use i_size_read() here, the i_size
> @@ -2206,20 +2212,22 @@ block_page_mkwrite(struct vm_area_struct
>  	unsigned long end;
>  	loff_t size;
>  	int ret = -EINVAL;
> +	struct address_space *mapping;
>  
>  	lock_page(page);
> +	mapping = page->mapping;
>  	size = i_size_read(inode);
> -	if ((page->mapping != inode->i_mapping) ||
> +	if ((mapping != inode->i_mapping) ||
>  	    (page_offset(page) > size)) {
>  		/* page got truncated out from underneath us */
>  		goto out_unlock;
>  	}
>  
>  	/* page is wholly or partially inside EOF */
> -	if (((page->index + 1) << PAGE_CACHE_SHIFT) > size)
> -		end = size & ~PAGE_CACHE_MASK;
> +	if (page_cache_pos(mapping, page->index + 1, 0) > size)
> +		end = page_cache_offset(mapping, size);
>  	else
> -		end = PAGE_CACHE_SIZE;
> +		end = page_cache_size(mapping);
>  
>  	ret = block_prepare_write(page, 0, end, get_block);
>  	if (!ret)
> @@ -2258,6 +2266,7 @@ static void end_buffer_read_nobh(struct 
>  int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
>  			get_block_t *get_block)
>  {
> +	struct address_space *mapping = page->mapping;
>  	struct inode *inode = page->mapping->host;
>  	const unsigned blkbits = inode->i_blkbits;
>  	const unsigned blocksize = 1 << blkbits;
> @@ -2265,6 +2274,7 @@ int nobh_prepare_write(struct page *page
>  	struct buffer_head *read_bh[MAX_BUF_PER_PAGE];
>  	unsigned block_in_page;
>  	unsigned block_start;
> +	unsigned page_size = page_cache_size(mapping);
>  	sector_t block_in_file;
>  	int nr_reads = 0;
>  	int i;
> @@ -2274,7 +2284,8 @@ int nobh_prepare_write(struct page *page
>  	if (PageMappedToDisk(page))
>  		return 0;
>  
> -	block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits);
> +	block_in_file = (sector_t)page->index <<
> +			(page_cache_shift(mapping) - blkbits);
>  	map_bh.b_page = page;
>  
>  	/*
> @@ -2283,7 +2294,7 @@ int nobh_prepare_write(struct page *page
>  	 * page is fully mapped-to-disk.
>  	 */
>  	for (block_start = 0, block_in_page = 0;
> -		  block_start < PAGE_CACHE_SIZE;
> +		  block_start < page_size;
>  		  block_in_page++, block_start += blocksize) {
>  		unsigned block_end = block_start + blocksize;
>  		int create;
> @@ -2372,7 +2383,7 @@ failed:
>  	 * Error recovery is pretty slack.  Clear the page and mark it dirty
>  	 * so we'll later zero out any blocks which _were_ allocated.
>  	 */
> -	zero_user(page, 0, PAGE_CACHE_SIZE);
> +	zero_user(page, 0, page_size);
>  	SetPageUptodate(page);
>  	set_page_dirty(page);
>  	return ret;
> @@ -2386,8 +2397,9 @@ EXPORT_SYMBOL(nobh_prepare_write);
>  int nobh_commit_write(struct file *file, struct page *page,
>  		unsigned from, unsigned to)
>  {
> -	struct inode *inode = page->mapping->host;
> -	loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
> +	struct address_space *mapping = page->mapping;
> +	struct inode *inode = mapping->host;
> +	loff_t pos = page_cache_pos(mapping, page->index, to);
>  
>  	SetPageUptodate(page);
>  	set_page_dirty(page);
> @@ -2407,9 +2419,10 @@ EXPORT_SYMBOL(nobh_commit_write);
>  int nobh_writepage(struct page *page, get_block_t *get_block,
>  			struct writeback_control *wbc)
>  {
> -	struct inode * const inode = page->mapping->host;
> +	struct address_space *mapping = page->mapping;
> +	struct inode * const inode = mapping->host;
>  	loff_t i_size = i_size_read(inode);
> -	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> +	const pgoff_t end_index = page_cache_index(mapping, i_size);
>  	unsigned offset;
>  	int ret;
>  
> @@ -2418,7 +2431,7 @@ int nobh_writepage(struct page *page, ge
>  		goto out;
>  
>  	/* Is the page fully outside i_size? (truncate in progress) */
> -	offset = i_size & (PAGE_CACHE_SIZE-1);
> +	offset = page_cache_offset(mapping, i_size);
>  	if (page->index >= end_index+1 || !offset) {
>  		/*
>  		 * The page may have dirty, unmapped buffers.  For example,
> @@ -2441,7 +2454,7 @@ int nobh_writepage(struct page *page, ge
>  	 * the  page size, the remaining memory is zeroed when mapped, and
>  	 * writes to that region are not written out to the file."
>  	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +	zero_user_segment(page, offset, page_cache_size(mapping));
>  out:
>  	ret = mpage_writepage(page, get_block, wbc);
>  	if (ret == -EAGAIN)
> @@ -2457,8 +2470,8 @@ int nobh_truncate_page(struct address_sp
>  {
>  	struct inode *inode = mapping->host;
>  	unsigned blocksize = 1 << inode->i_blkbits;
> -	pgoff_t index = from >> PAGE_CACHE_SHIFT;
> -	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> +	pgoff_t index = page_cache_index(mapping, from);
> +	unsigned offset = page_cache_offset(mapping, from);
>  	unsigned to;
>  	struct page *page;
>  	const struct address_space_operations *a_ops = mapping->a_ops;
> @@ -2475,7 +2488,7 @@ int nobh_truncate_page(struct address_sp
>  	to = (offset + blocksize) & ~(blocksize - 1);
>  	ret = a_ops->prepare_write(NULL, page, offset, to);
>  	if (ret == 0) {
> -		zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +		zero_user_segment(page, offset, page_cache_size(mapping));
>  		/*
>  		 * It would be more correct to call aops->commit_write()
>  		 * here, but this is more efficient.
> @@ -2493,8 +2506,8 @@ EXPORT_SYMBOL(nobh_truncate_page);
>  int block_truncate_page(struct address_space *mapping,
>  			loff_t from, get_block_t *get_block)
>  {
> -	pgoff_t index = from >> PAGE_CACHE_SHIFT;
> -	unsigned offset = from & (PAGE_CACHE_SIZE-1);
> +	pgoff_t index = page_cache_index(mapping, from);
> +	unsigned offset = page_cache_offset(mapping, from);
>  	unsigned blocksize;
>  	sector_t iblock;
>  	unsigned length, pos;
> @@ -2511,8 +2524,8 @@ int block_truncate_page(struct address_s
>  		return 0;
>  
>  	length = blocksize - length;
> -	iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
> -	
> +	iblock = (sector_t)index <<
> +			(page_cache_shift(mapping) - inode->i_blkbits);
>  	page = grab_cache_page(mapping, index);
>  	err = -ENOMEM;
>  	if (!page)
> @@ -2571,9 +2584,10 @@ out:
>  int block_write_full_page(struct page *page, get_block_t *get_block,
>  			struct writeback_control *wbc)
>  {
> -	struct inode * const inode = page->mapping->host;
> +	struct address_space *mapping = page->mapping;
> +	struct inode * const inode = mapping->host;
>  	loff_t i_size = i_size_read(inode);
> -	const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
> +	const pgoff_t end_index = page_cache_index(mapping, i_size);
>  	unsigned offset;
>  
>  	/* Is the page fully inside i_size? */
> @@ -2581,7 +2595,7 @@ int block_write_full_page(struct page *p
>  		return __block_write_full_page(inode, page, get_block, wbc);
>  
>  	/* Is the page fully outside i_size? (truncate in progress) */
> -	offset = i_size & (PAGE_CACHE_SIZE-1);
> +	offset = page_cache_offset(mapping, i_size);
>  	if (page->index >= end_index+1 || !offset) {
>  		/*
>  		 * The page may have dirty, unmapped buffers.  For example,
> @@ -2600,7 +2614,7 @@ int block_write_full_page(struct page *p
>  	 * the  page size, the remaining memory is zeroed when mapped, and
>  	 * writes to that region are not written out to the file."
>  	 */
> -	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
> +	zero_user_segment(page, offset, page_cache_size(mapping));
>  	return __block_write_full_page(inode, page, get_block, wbc);
>  }
>  
> @@ -2854,7 +2868,7 @@ int try_to_free_buffers(struct page *pag
>  	 * dirty bit from being lost.
>  	 */
>  	if (ret)
> -		cancel_dirty_page(page, PAGE_CACHE_SIZE);
> +		cancel_dirty_page(page, page_cache_size(mapping));
>  	spin_unlock(&mapping->private_lock);
>  out:
>  	if (buffers_to_free) {
> 
> -- 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-30  9:20   ` Dmitry Monakhov
@ 2007-08-30 18:14     ` Christoph Lameter
  2007-08-31  1:47       ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-30 18:14 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Thu, 30 Aug 2007, Dmitry Monakhov wrote:

> On 12:06 Tue 28 Aug     , clameter@sgi.com wrote:
> > Use page_cache_xxx in fs/buffer.c.
> submit_bh() wasn't changed, which means that bio pages may have a huge
> size without respecting the queue restrictions (q->max_hw_segments, etc.).
> At least driver/md/raid0 will be broken by your patch.

Hmmm... So we need to check the page size and generate multiple requests 
in submit_bh?


* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-30 18:14     ` Christoph Lameter
@ 2007-08-31  1:47       ` Christoph Lameter
  2007-08-31  6:56         ` Jens Axboe
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31  1:47 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: torvalds, linux-fsdevel, linux-kernel, Christoph Hellwig,
	Mel Gorman, William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman

This may already be handled?

submit_bh() calls submit_bio() which calls __generic_make_request() and 
there we do:

                if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
                        printk("bio too big device %s (%u > %u)\n",
                                bdevname(bio->bi_bdev, b),
                                bio_sectors(bio),
                                q->max_hw_sectors);
                        goto end_io;
                }

So if we try to push a too large buffer down with submit_bh() we get a 
failure.



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  1:47       ` Christoph Lameter
@ 2007-08-31  6:56         ` Jens Axboe
  2007-08-31  7:03           ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2007-08-31  6:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Thu, Aug 30 2007, Christoph Lameter wrote:
> This may already be handled?
> 
> submit_bh() calls submit_bio() which calls __generic_make_request() and 
> there we do:
> 
>                 if (unlikely(bio_sectors(bio) > q->max_hw_sectors)) {
>                         printk("bio too big device %s (%u > %u)\n",
>                                 bdevname(bio->bi_bdev, b),
>                                 bio_sectors(bio),
>                                 q->max_hw_sectors);
>                         goto end_io;
>                 }
> 
> So if we try to push a too large buffer down with submit_bh() we get a 
> failure.

Only partly, you may be violating a number of other restrictions (size
is many things, not just length of the data).

-- 
Jens Axboe



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  6:56         ` Jens Axboe
@ 2007-08-31  7:03           ` Christoph Lameter
  2007-08-31  7:11             ` Jens Axboe
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31  7:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Jens Axboe wrote:

> > So if we try to push a too large buffer down with submit_bh() we get a 
> > failure.
> 
> Only partly, you may be violating a number of other restrictions (size
> is many things, not just length of the data).

Could you be more specific?



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:03           ` Christoph Lameter
@ 2007-08-31  7:11             ` Jens Axboe
  2007-08-31  7:17               ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2007-08-31  7:11 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > > So if we try to push a too large buffer down with submit_bh() we get a 
> > > failure.
> > 
> > Only partly, you may be violating a number of other restrictions (size
> > is many things, not just length of the data).
> 
> Could you be more specific?

Size of a single segment, for instance. Or if the bio crosses a dma
boundary. If your block is 64kb and the maximum segment size is 32kb,
then you would need to clone the bio and split it into two.

Things like that. This isn't a problem with single page requests, as we
based the lower possible boundaries on that.

-- 
Jens Axboe



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:11             ` Jens Axboe
@ 2007-08-31  7:17               ` Christoph Lameter
  2007-08-31  7:26                 ` Jens Axboe
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31  7:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Jens Axboe wrote:

> > Could you be more specific?
> 
> Size of a single segment, for instance. Or if the bio crosses a dma
> boundary. If your block is 64kb and the maximum segment size is 32kb,
> then you would need to clone the bio and split it into two.

A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to 
the power of two boundaries and the page allocator will not create pages 
that cross the zone boundaries.

It looks like the code will correctly signal a failure if you try to write 
a 64k block on a device with a maximum segment size of 32k. Isn't this 
okay? One would not want to use a larger block size than supported by the 
underlying hardware?

> Things like that. This isn't a problem with single page requests, as we
> based the lower possible boundaries on that.

submit_bh() is used to submit a single buffer and I think that was our 
main concern here.


* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:17               ` Christoph Lameter
@ 2007-08-31  7:26                 ` Jens Axboe
  2007-08-31  7:33                   ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2007-08-31  7:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > > Could you be more specific?
> > 
> > Size of a single segment, for instance. Or if the bio crosses a dma
> > boundary. If your block is 64kb and the maximum segment size is 32kb,
> > then you would need to clone the bio and split it into two.
> 
> A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to 
> the power of two boundaries and the page allocator will not create pages 
> that cross the zone boundaries.

With a 64k page and a dma boundary of 0x7fff, that's two segments.

> It looks like the code will correctly signal a failure if you try to write 
> a 64k block on a device with a maximum segment size of 32k. Isn't this 
> okay? One would not want to use a larger block size than supported by the 
> underlying hardware?

That's just the size-in-sectors limitation again. And that also needs to
be handled; the fact that it currently errors out is reassuring, but it
is definitely not a long-term solution. You don't want to knowingly set
up such a system where the fs block size is larger than what the hardware
would want, but it should work. You could be moving hardware around, for
recovery or otherwise.

> > Things like that. This isn't a problem with single page requests, as we
> > based the lower possible boundaries on that.
> 
> submit_bh() is used to submit a single buffer and I think that was our 
> main concern here.

And how large can that be?

-- 
Jens Axboe



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:26                 ` Jens Axboe
@ 2007-08-31  7:33                   ` Christoph Lameter
  2007-08-31  7:43                     ` Jens Axboe
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31  7:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Jens Axboe wrote:

> > A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to 
> > the power of two boundaries and the page allocator will not create pages 
> > that cross the zone boundaries.
> 
> With a 64k page and a dma boundary of 0x7fff, that's two segments.

OK, so DMA memory restrictions that do not conform to the DMA zones? The 
example is a bit weird. DMA only to the first 32k of memory? If the limit 
were higher, say 16MB, then we would not have an issue. Is there really 
a device that can only do I/O to the first 32k of memory?

How do we split that up today? We could add processing to submit_bio to 
check for the boundary and create two bios.

> > submit_bh() is used to submit a single buffer and I think that was our 
> > main concern here.
> 
> And how large can that be?

As large as mkxxxfs allowed it to be. For XFS and extX with the current 
patchset 32k is the limit (64k with the fixes to ext2) but a new 
filesystem could theoretically use a larger blocksize.




* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:33                   ` Christoph Lameter
@ 2007-08-31  7:43                     ` Jens Axboe
  2007-08-31  7:52                       ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2007-08-31  7:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > > A DMA boundary cannot be crossed AFAIK. The compound pages are aligned to 
> > > the power of two boundaries and the page allocator will not create pages 
> > > that cross the zone boundaries.
> > 
> > With a 64k page and a dma boundary of 0x7fff, that's two segments.
> 
> > OK, so DMA memory restrictions that do not conform to the DMA zones? The 
> > example is a bit weird. DMA only to the first 32k of memory? If the limit 
> > were higher, say 16MB, then we would not have an issue. Is there really 
> > a device that can only do I/O to the first 32k of memory?

They have nothing to do with each other, you are mixing things up. It
has nothing to do with the device being able to dma into that memory or
not, we have fine existing infrastructure to handle that. But different
hardware have different characteristics on what a single segment is. You
can say "a single segment cannot cross a 32kb boundary". So from the
example above, your single 64k page may need to be split into two
segments. Or it could have a maximum segment size of 32k, in which case
it would have to be split as well.

Do you see what I mean now?

> How do we split that up today? We could add processing to submit_bio
> to check for the boundary and create two bios.

But we do not split them up today - see what I wrote! Today we impose
the restriction that a device must be able to handle a single "normal"
page, and if it can't do that, it has to split it up itself.

But yes, you would have to create some out-of-line function to use
bio_split() until you have chopped things down enough. It's not a good
thing for performance naturally, but if we consider this a "just make it
work" fallback, I don't think it's too bad. You want to make a note of
that it is happening though, so people realize that it is happening.

> > > submit_bh() is used to submit a single buffer and I think that was
> > > our main concern here.
> > 
> > And how large can that be?
> 
> As large as mkxxxfs allowed it to be. For XFS and extX with the
> current patchset 32k is the limit (64k with the fixes to ext2) but a
> new filesystem could theoretically use a larger blocksize.

OK, since it goes direct to bio anyway, it can be handled there.

-- 
Jens Axboe



* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:43                     ` Jens Axboe
@ 2007-08-31  7:52                       ` Christoph Lameter
  2007-08-31  8:12                         ` Jens Axboe
  2007-08-31  8:36                         ` Dmitry Monakhov
  0 siblings, 2 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31  7:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Jens Axboe wrote:

> They have nothing to do with each other, you are mixing things up. It
> has nothing to do with the device being able to dma into that memory or
> not, we have fine existing infrastructure to handle that. But different
> hardware have different characteristics on what a single segment is. You
> can say "a single segment cannot cross a 32kb boundary". So from the
> example above, your single 64k page may need to be split into two
> segments. Or it could have a maximum segment size of 32k, in which case
> it would have to be split as well.
> 
> Do you see what I mean now?

Ok. So another solution maybe to limit the blocksizes that can be used 
with a device?

> > How do we split that up today? We could add processing to submit_bio
> > to check for the boundary and create two bios.
> 
> But we do not split them up today - see what I wrote! Today we impose
> the restriction that a device must be able to handle a single "normal"
> page, and if it can't do that, it has to split it up itself.
> 
> But yes, you would have to create some out-of-line function to use
> bio_split() until you have chopped things down enough. It's not a good
> thing for performance naturally, but if we consider this a "just make it
> work" fallback, I don't think it's too bad. You want to make a note of
> that it is happening though, so people realize that it is happening.

Hmmm... We could also keep the existing scheme and require that device 
drivers split things up if they are too large. Isn't it possible today
to create a huge bio of 2M for huge pages and send it to a device?


* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:52                       ` Christoph Lameter
@ 2007-08-31  8:12                         ` Jens Axboe
  2007-08-31 15:22                           ` Christoph Lameter
  2007-08-31  8:36                         ` Dmitry Monakhov
  1 sibling, 1 reply; 124+ messages in thread
From: Jens Axboe @ 2007-08-31  8:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > They have nothing to do with each other, you are mixing things up. It
> > has nothing to do with the device being able to dma into that memory or
> > not, we have fine existing infrastructure to handle that. But different
> > hardware have different characteristics on what a single segment is. You
> > can say "a single segment cannot cross a 32kb boundary". So from the
> > example above, your single 64k page may need to be split into two
> > segments. Or it could have a maximum segment size of 32k, in which case
> > it would have to be split as well.
> > 
> > Do you see what I mean now?
> 
> Ok. So another solution maybe to limit the blocksizes that can be used 
> with a device?

That'd work for creation, but not for moving things around.

> > > How do we split that up today? We could add processing to submit_bio
> > > to check for the boundary and create two bios.
> > 
> > But we do not split them up today - see what I wrote! Today we impose
> > the restriction that a device must be able to handle a single "normal"
> > page, and if it can't do that, it has to split it up itself.
> > 
> > But yes, you would have to create some out-of-line function to use
> > bio_split() until you have chopped things down enough. It's not a good
> > thing for performance naturally, but if we consider this a "just make it
> > work" fallback, I don't think it's too bad. You want to make a note of
> > that it is happening though, so people realize that it is happening.
> 
> Hmmmm.. We could keep the existing scheme too and check that device 
> drivers split things up if they are too large? Isn't it possible today
> to create a huge bio of 2M for huge pages and send it to a device?

Not sure, aren't the constituents of compound pages the basis for IO?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  7:52                       ` Christoph Lameter
  2007-08-31  8:12                         ` Jens Axboe
@ 2007-08-31  8:36                         ` Dmitry Monakhov
  2007-08-31 15:28                           ` Christoph Lameter
  1 sibling, 1 reply; 124+ messages in thread
From: Dmitry Monakhov @ 2007-08-31  8:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jens Axboe, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On 00:52 Fri 31 Aug, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > They have nothing to do with each other, you are mixing things up. It
> > has nothing to do with the device being able to dma into that memory or
> > not, we have fine existing infrastructure to handle that. But different
> > hardware have different characteristics on what a single segment is. You
> > can say "a single segment cannot cross a 32kb boundary". So from the
> > example above, your single 64k page may need to be split into two
> > segments. Or it could have a maximum segment size of 32k, in which case
> > it would have to be split as well.
> > 
> > Do you see what I mean now?
> 
> Ok. So another solution maybe to limit the blocksizes that can be used 
> with a device?
IMHO it is not good because after an fs is created with a big blksize, its
image can't be used on other devices.
We may just rewrite submit_bh similar to drivers/md/dm-io.c:do_region
with the following pseudocode:

remaining = super_page_size();
while (remaining) {
	init_bio(bio);
	/* Try and add as many pages as possible */
	while (remaining) {
		dp->get_page(dp, &page, &len, &offset);
		len = min(len, to_bytes(remaining));
		if (!bio_add_page(bio, page, len, offset))
			break;
		offset = 0;
		remaining -= to_sector(len);
		dp->next_page(dp);
	}
	atomic_inc(&io->count);
	submit_bio(rw, bio);
}
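A compilable userspace rendering of the loop above, with a toy bio whose fixed capacity stands in for bio_add_page() failing once a bio is full (all toy_* names and sizes are illustrative assumptions, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	4096u
#define TOY_BIO_MAX	(8 * TOY_PAGE_SIZE)	/* toy capacity per bio */

/* Stand-in for a bio: bio_add_page() fails once the bio is full. */
struct toy_bio { size_t bytes; };

static int toy_bio_add(struct toy_bio *bio, size_t len)
{
	if (bio->bytes + len > TOY_BIO_MAX)
		return 0;
	bio->bytes += len;
	return 1;
}

/*
 * Submit `total` bytes as a sequence of bios, page by page, in the
 * style of drivers/md/dm-io.c:do_region. Returns the number of bios
 * "submitted".
 */
static size_t submit_in_bios(size_t total)
{
	size_t remaining = total, nbios = 0;

	while (remaining) {
		struct toy_bio bio = { 0 };

		/* Try and add as many pages as possible. */
		while (remaining) {
			size_t len = remaining < TOY_PAGE_SIZE ?
					remaining : TOY_PAGE_SIZE;
			if (!toy_bio_add(&bio, len))
				break;
			remaining -= len;
		}
		nbios++;	/* submit_bio(rw, bio) would go here */
	}
	return nbios;
}
```

Each outer iteration consumes at least one page, so a 64k request with a 32k-capacity bio comes out as two bios.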
> > > How do we split that up today? We could add processing to submit_bio
> > > to check for the boundary and create two bios.
> > 
> > But we do not split them up today - see what I wrote! Today we impose
> > the restriction that a device must be able to handle a single "normal"
> > page, and if it can't do that, it has to split it up itself.
> > 
> > But yes, you would have to create some out-of-line function to use
> > bio_split() until you have chopped things down enough. It's not a good
> > thing for performance naturally, but if we consider this a "just make it
> > work" fallback, I don't think it's too bad. You want to make a note of
> > that it is happening though, so people realize that it is happening.
> 
> Hmmmm.. We could keep the existing scheme too and check that device 
> drivers split things up if they are too large? Isn't it possible today
> to create a huge bio of 2M for huge pages and send it to a device?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  8:12                         ` Jens Axboe
@ 2007-08-31 15:22                           ` Christoph Lameter
  2007-08-31 16:35                             ` Jörn Engel
  2007-08-31 19:00                             ` Jens Axboe
  0 siblings, 2 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31 15:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Jens Axboe wrote:

> > Ok. So another solution maybe to limit the blocksizes that can be used 
> > with a device?
> 
> That'd work for creation, but not for moving things around.

What do you mean by moving things around? Creation binds a filesystem to a 
device.

> > Hmmmm.. We could keep the existing scheme too and check that device 
> > drivers split things up if they are too large? Isn't it possible today
> > to create a huge bio of 2M for huge pages and send it to a device?
> 
> Not sure, aren't the constituents of compound pages the basis for IO?

get_user_pages() serializes compound pages into the base pages. But doesn't
the I/O layer coalesce these later into 2M chunks again?

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31  8:36                         ` Dmitry Monakhov
@ 2007-08-31 15:28                           ` Christoph Lameter
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-08-31 15:28 UTC (permalink / raw)
  To: Dmitry Monakhov
  Cc: Jens Axboe, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, 31 Aug 2007, Dmitry Monakhov wrote:

> > Ok. So another solution maybe to limit the blocksizes that can be used 
> > with a device?
> IMHO it is not good because after an fs is created with a big blksize, its
> image can't be used on other devices.

Ok so a raw copy of the partition would do this?

> We may just rewrite submit_bh similar to drivers/md/dm-io.c:do_region
> with the following pseudocode:
> 
> remaining = super_page_size();

That would be compound_size(page)

> while (remaining) {
> 	init_bio(bio);
> 	/*Try and add as many pages as possible*/

This seems to be doing the same as get_user_pages() serializing the 
compound page.

> 	while (remaining) {
> 		dp->get_page(dp, &page, &len, &offset);
> 		len = min(len, to_bytes(remaining));
> 		if (!bio_add_page(bio, page, len, offset))
> 			break;
> 		offset = 0;
> 		remaining -= to_sector(len);
> 		dp->next_page(dp);
> 	}
> 	atomic_inc(&io->count);
> 	submit_bio(rw, bio);
> }

Another solution may be to not serialize but instead determine the maximum 
segment length and generate bios that reference various subsections of the
compound page of that length? That way you do not serialize and later 
coalesce again.
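That subsection idea can be sketched the same way: slice the compound page into max-segment-sized (offset, len) pairs directly, so there is no serialization into base pages and no later coalescing (illustrative userspace code; the function name and interface are assumptions):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Generate (offset, len) slices covering a compound page of `size`
 * bytes, each slice at most `max_seg` bytes, so that each bio can
 * reference one subsection of the compound page directly. Returns
 * the number of slices; writes them into offs[]/lens[], which are
 * assumed large enough. Illustrative helper, not the kernel API.
 */
static size_t slice_compound(size_t size, size_t max_seg,
			     size_t *offs, size_t *lens)
{
	size_t off = 0, n = 0;

	while (off < size) {
		size_t len = size - off;

		if (len > max_seg)
			len = max_seg;
		offs[n] = off;
		lens[n] = len;
		off += len;
		n++;
	}
	return n;
}
```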

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31 15:22                           ` Christoph Lameter
@ 2007-08-31 16:35                             ` Jörn Engel
  2007-08-31 19:00                             ` Jens Axboe
  1 sibling, 0 replies; 124+ messages in thread
From: Jörn Engel @ 2007-08-31 16:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Jens Axboe, Dmitry Monakhov, torvalds, linux-fsdevel,
	linux-kernel, Christoph Hellwig, Mel Gorman,
	William Lee Irwin III, David Chinner, Badari Pulavarty,
	Maxim Levitsky, Fengguang Wu, swin wang, totty.lu,
	H. Peter Anvin, Eric W. Biederman

On Fri, 31 August 2007 08:22:45 -0700, Christoph Lameter wrote:
> 
> What do you mean by moving things around? Creation binds a filesystem to a 
> device.

Create the filesystem on a USB key, then move it to the next machine,
I suppose.

Or on any other movable medium, including disks, nbd, iSCSI,...

Jörn

-- 
Doubt is not a pleasant condition, but certainty is an absurd one.
-- Voltaire

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [11/36] Use page_cache_xxx in fs/buffer.c
  2007-08-31 15:22                           ` Christoph Lameter
  2007-08-31 16:35                             ` Jörn Engel
@ 2007-08-31 19:00                             ` Jens Axboe
  1 sibling, 0 replies; 124+ messages in thread
From: Jens Axboe @ 2007-08-31 19:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dmitry Monakhov, torvalds, linux-fsdevel, linux-kernel,
	Christoph Hellwig, Mel Gorman, William Lee Irwin III,
	David Chinner, Badari Pulavarty, Maxim Levitsky, Fengguang Wu,
	swin wang, totty.lu, H. Peter Anvin, joern, Eric W. Biederman

On Fri, Aug 31 2007, Christoph Lameter wrote:
> On Fri, 31 Aug 2007, Jens Axboe wrote:
> 
> > > Ok. So another solution maybe to limit the blocksizes that can be used 
> > > with a device?
> > 
> > That'd work for creation, but not for moving things around.
> 
> What do you mean by moving things around? Creation binds a filesystem to a 
> device.

Only the bottom part. Change controller, move disk, whatever. There are
lots of ways to change part of the IO path.

> > > Hmmmm.. We could keep the existing scheme too and check that device 
> > > drivers split things up if they are too large? Isn't it possible today
> > > to create a huge bio of 2M for huge pages and send it to a device?
> > 
> > Not sure, aren't the constituents of compound pages the basis for IO?
> 
> get_user_pages() serializes compound pages into the base pages. But doesn't
> the I/O layer coalesce these later into 2M chunks again?

You pretty much hit the nail on the head there yourself. The IO layer
_may_ coalesce them all together, but it may also stop at an arbitrary
point and put the remainder in another request.

This situation is different from submitting one huge piece, the above is
what has always happened regardless of whether the origin happens to be
a compound page or not.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/4] Large Blocksize support for Ext2/3/4
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
  2007-08-30  0:59       ` Christoph Lameter
@ 2007-09-01  0:01       ` Mingming Cao
  2007-09-01  0:12       ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
                         ` (7 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-01  0:01 UTC (permalink / raw)
  To: clameter; +Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel

On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:

> Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested only. 
> 
> Next steps:
> Need a e2fsprogs changes to able test this feature. As mkfs needs to be
> educated not assuming rec_len to be blocksize all the time.
> Will try it with Christoph Lameter's large block patch next.
> 

Two problems were found when testing largeblock on ext3.  Patches to
follow. 

Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. fsx test runs fine for an hour on ext3 with 16k blocksize on
x86:-)

Mingming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
  2007-08-30  0:59       ` Christoph Lameter
  2007-09-01  0:01       ` Mingming Cao
@ 2007-09-01  0:12       ` Mingming Cao
  2007-09-01 18:39         ` Christoph Hellwig
  2007-09-01  0:12       ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
                         ` (6 subsequent siblings)
  9 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-09-01  0:12 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: adilger, sho, ext4 development, linux-kernel, clameter

From clameter:
Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c	2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 11:01:18.000000000 -0700
@@ -1627,16 +1627,17 @@ void * __jbd_kmalloc (const char *where,
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size)  get_order((size) << (PAGE_SHIFT - 10))
 
 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
+	"jbd_1k", "jbd_2k", "jbd_4k", "jbd_8k",
+	"jbd_16k", "jbd_32k", "jbd_64k"
 };
 
 static void journal_destroy_jbd_slabs(void)
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c	2007-08-30 18:40:02.000000000 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 11:04:37.000000000 -0700
@@ -1639,16 +1639,18 @@ void * __jbd2_kmalloc (const char *where
  * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
  * and allocate frozen and commit buffers from these slabs.
  *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
+ * (Note: We only seem to need the definitions here for the SLAB_DEBUG
+ * case. In non debug operations SLUB will find the corresponding kmalloc
+ * cache and create an alias. --clameter)
  */
 
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
+#define JBD_MAX_SLABS 7
+#define JBD_SLAB_INDEX(size)  get_order((size) << (PAGE_SHIFT - 10))
 
 static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
 static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
+	"jbd2_1k", "jbd2_2k", "jbd2_4k", "jbd2_8k",
+        "jbd2_16k", "jbd2_32k", "jbd2_64k"
 };
 
 static void jbd2_journal_destroy_jbd_slabs(void)
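Assuming 4k pages, the patched JBD_SLAB_INDEX() maps each supported block size (1k through 64k) to a consecutive slab slot. A quick userspace check of that mapping, with get_order() reimplemented here for illustration (the toy_* names are assumptions):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assume 4k pages for the check */

/* Userspace reimplementation of the kernel's get_order(): the
 * smallest order such that (1 << order) pages hold `size` bytes. */
static int toy_get_order(size_t size)
{
	int order = 0;

	size = (size - 1) >> TOY_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* The patched JBD_SLAB_INDEX(size) from the hunk above. */
static int jbd_slab_index(size_t size)
{
	return toy_get_order(size << (TOY_PAGE_SHIFT - 10));
}
```

Each block size from 1k to 64k lands on indices 0 through 6, which is why JBD_MAX_SLABS grows to 7.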



^ permalink raw reply	[flat|nested] 124+ messages in thread

* [RFC 2/2] JBD: blocks reservation fix for large block support
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (2 preceding siblings ...)
  2007-09-01  0:12       ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
@ 2007-09-01  0:12       ` Mingming Cao
  2007-10-02  0:34       ` [PATCH 1/2] ext4: Support large blocksize up to PAGESIZE Mingming Cao
                         ` (5 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-01  0:12 UTC (permalink / raw)
  To: ext4 development; +Cc: adilger, sho, linux-kernel, clameter, linux-fsdevel

The number of blocks per page can be less than or equal to 1 with large block
support in the VM. This patch fixes the calculation of the number of blocks to
reserve in the journal in the case blocksize > pagesize.



Signed-off-by: Mingming Cao <cmm@us.ibm.com>

Index: my2.6/fs/jbd/journal.c
===================================================================
--- my2.6.orig/fs/jbd/journal.c	2007-08-31 13:27:16.000000000 -0700
+++ my2.6/fs/jbd/journal.c	2007-08-31 13:28:18.000000000 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
 
 int journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }
 
 /*
Index: my2.6/fs/jbd2/journal.c
===================================================================
--- my2.6.orig/fs/jbd2/journal.c	2007-08-31 13:32:21.000000000 -0700
+++ my2.6/fs/jbd2/journal.c	2007-08-31 13:32:30.000000000 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
 
 int jbd2_journal_blocks_per_page(struct inode *inode)
 {
-	return 1 << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
+	int bits = PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits;
+
+	if (bits > 0)
+		return 1 << bits;
+	else
+		return 1;
 }
 
 /*
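The guarded computation above can be checked in isolation; a userspace sketch assuming a 4k page cache page (PAGE_CACHE_SHIFT == 12; the toy_* names are assumptions):

```c
#include <assert.h>

#define TOY_PAGE_CACHE_SHIFT 12	/* assume 4k page cache pages */

/* The patched journal_blocks_per_page() computation: when
 * blocksize > pagesize the shift count goes non-positive, and the
 * unpatched `1 << bits` would be a negative shift; the guard
 * returns 1 block per page instead. */
static int toy_blocks_per_page(int blocksize_bits)
{
	int bits = TOY_PAGE_CACHE_SHIFT - blocksize_bits;

	if (bits > 0)
		return 1 << bits;
	return 1;
}
```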



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [00/36] Large Blocksize Support V6
  2007-08-28 19:55   ` Christoph Lameter
@ 2007-09-01  1:11     ` Christoph Lameter
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-09-01  1:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: torvalds, linux-fsdevel, linux-kernel, Mel Gorman,
	William Lee Irwin III, David Chinner, Jens Axboe,
	Badari Pulavarty, Maxim Levitsky, Fengguang Wu, swin wang,
	totty.lu, H. Peter Anvin, joern, Eric W. Biederman, Mingming Cao

Thanks to some help from Mingming Cao we now have support for extX with up
to 64k blocksize. There were several issues in the jbd layer.... (The ext2
patch that Christoph complained about was dropped).

The patchset can be tested (assuming one has a current git tree)

git checkout -b largeblock
git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git largeblock

... Fiddle around with large blocksize functionality....

git checkout master

... Back to Linus' tree.

git branch -D largeblock

... Get rid of it.


commit ed541c23b8e71a0217fd96d1b421992fdd7519df
Author: Mingming Cao <cmm@us.ibm.com>

    JBD: blocks reservation fix for large block support

commit a1eaa33cf1600f18e961f1cf5c87820bca44df08
Author: Christoph Lameter <clameter@sgi.com>

    Teach jbd/jbd2 slab management to support >8k block size.

commit 8199976e04333d66202edcaec6cef46771ed194e
Author: Christoph Lameter <clameter@sgi.com>

    Do not use f_mapping in simple_prepare_write()

commit ac4d742ff3b3526d4c22d5b42e9f9fcc99881a8c
Author: Mingming Cao <cmm@us.ibm.com>

    ext4: fix rec_len overflow with 64KB block size

commit f336a2d00e7c79500ff30fad40f6e3090319cbe7
Author: Mingming Cao <cmm@us.ibm.com>

    ext3: fix rec_len overflow with 64KB block size

commit b0c1b74d42cce96c592f8d13b7b842a3e07b0273
Author: Christoph Lameter <christoph@qirst.com>

    ext2: fix rec_len overflow with 64KB block size

commit 01229e6a2e84178a8b8467930c113a0096c069f2
Author: Mingming Cao <cmm@us.ibm.com>

    Large Blocksize support for Ext2/3/4



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-01  0:12       ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
@ 2007-09-01 18:39         ` Christoph Hellwig
  2007-09-02 11:40           ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-01 18:39 UTC (permalink / raw)
  To: Mingming Cao
  Cc: linux-fsdevel, adilger, sho, ext4 development, linux-kernel, clameter

On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> >From clameter:
> Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.


But the real fix is to kill this code.  We can't send slab pages
down the block layer without breaking iscsi or aoe.  And this code is
only used in such rare cases that all the normal testing won't hit it.
Very bad combination.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [00/36] Large Blocksize Support V6
  2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
                   ` (36 preceding siblings ...)
  2007-08-28 19:20 ` [00/36] Large Blocksize Support V6 Christoph Hellwig
@ 2007-09-01 19:17 ` Peter Zijlstra
  2007-09-02 11:44   ` Christoph Lameter
  37 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2007-09-01 19:17 UTC (permalink / raw)
  To: clameter; +Cc: torvalds, linux-fsdevel, linux-kernel, Andrea Arcangeli


On Tue, 2007-08-28 at 12:05 -0700, clameter@sgi.com wrote:

> Todo/Issues:

 - reclaim

by mixing large order pages into the regular lru, page aging gets rather
unfair.

One possible solution to this is address_space based reclaim, something
which might also be beneficial for containers.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-01 18:39         ` Christoph Hellwig
@ 2007-09-02 11:40           ` Christoph Lameter
  2007-09-02 15:28             ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-09-02 11:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
	linux-kernel

On Sat, 1 Sep 2007, Christoph Hellwig wrote:

> On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > >From clameter:
> > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
> 
> 
> But the real fix is to kill this code.  We can't send slab pages
> down the block layer without breaking iscsi or aoe.  And this code is
> only used in such rare cases that all the normal testing won't hit it.
> Very bad combination.

We are doing what you describe right now. So the current code is broken?



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [00/36] Large Blocksize Support V6
  2007-09-01 19:17 ` Peter Zijlstra
@ 2007-09-02 11:44   ` Christoph Lameter
  0 siblings, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-09-02 11:44 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: torvalds, linux-fsdevel, linux-kernel, Andrea Arcangeli

On Sat, 1 Sep 2007, Peter Zijlstra wrote:

> by mixing large order pages into the regular lru, page aging gets rather
> unfair.

Not in general, only for particular loads. On average this is okay. It is 
consistent to age the whole block and not just a part of it.





^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-02 11:40           ` Christoph Lameter
@ 2007-09-02 15:28             ` Christoph Hellwig
  2007-09-03  7:55               ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-02 15:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
	ext4 development, linux-kernel

On Sun, Sep 02, 2007 at 04:40:21AM -0700, Christoph Lameter wrote:
> On Sat, 1 Sep 2007, Christoph Hellwig wrote:
> 
> > On Fri, Aug 31, 2007 at 05:12:18PM -0700, Mingming Cao wrote:
> > > >From clameter:
> > > Teach jbd/jbd2 slab management to support >8k block size. Without this, it refused to mount on >8k ext3.
> > 
> > 
> > But the real fix is to kill this code.  We can't send slab pages
> > down the block layer without breaking iscsi or aoe.  And this code is
> > only used in such rare cases that all the normal testing won't hit it.
> > Very bad combination.
> 
> We are doing what you describe right now. So the current code is broken?

Yes.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-02 15:28             ` Christoph Hellwig
@ 2007-09-03  7:55               ` Christoph Lameter
  2007-09-03 13:40                 ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-09-03  7:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
	linux-kernel

On Sun, 2 Sep 2007, Christoph Hellwig wrote:

> > We are doing what you describe right now. So the current code is broken?
> Yes.

How about getting rid of the slabs there and using kmalloc? Kmalloc in mm 
(and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page 
allocator calls. Not sure what to do about the 1k and 2k requests though.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-03  7:55               ` Christoph Lameter
@ 2007-09-03 13:40                 ` Christoph Hellwig
  2007-09-03 19:31                   ` Christoph Lameter
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-03 13:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
	ext4 development, linux-kernel

On Mon, Sep 03, 2007 at 12:55:04AM -0700, Christoph Lameter wrote:
> On Sun, 2 Sep 2007, Christoph Hellwig wrote:
> 
> > > We are doing what you describe right now. So the current code is broken?
> > Yes.
> 
> How about getting rid of the slabs there and use kmalloc? Kmalloc in mm 
> > (and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page 
> allocator calls. Not sure what to do about the 1k and 2k requests though.

The problem is that we must never use kmalloc pages, so we always need
to request a page or more for these.  Better to use get_free_page directly,
that's how I fixed it in XFS a while ago.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-03 13:40                 ` Christoph Hellwig
@ 2007-09-03 19:31                   ` Christoph Lameter
  2007-09-03 19:33                     ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-09-03 19:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mingming Cao, linux-fsdevel, adilger, sho, ext4 development,
	linux-kernel

On Mon, 3 Sep 2007, Christoph Hellwig wrote:

> > How about getting rid of the slabs there and use kmalloc? Kmalloc in mm 
> > (and therefore hopefully 2.6.24) will convert kmallocs > PAGE_SIZE to page 
> > allocator calls. Not sure what to do about the 1k and 2k requests though.
> 
> The problem is that we must never use kmalloc pages, so we always need
> to request a page or more for these.  Better to use get_free_page directly,
> that's how I fixed it in XFS a while ago.

So you'd be fine with replacing the allocs with

get_free_pages(GFP_xxx, get_order(size)) ?
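One consequence for the 1k and 2k requests mentioned earlier: the page allocator hands back whole pages, so such a replacement consumes PAGE_SIZE << get_order(size) bytes per request. A userspace sketch of that arithmetic (get_order() reimplemented here for illustration; toy_* names are assumptions):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1u << TOY_PAGE_SHIFT)

/* Userspace stand-in for the kernel's get_order(). */
static int toy_get_order(size_t size)
{
	int order = 0;

	size = (size - 1) >> TOY_PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* Bytes actually consumed by get_free_pages(flags, get_order(size)):
 * sub-page requests round up to one full page. */
static size_t toy_alloc_bytes(size_t size)
{
	return (size_t)TOY_PAGE_SIZE << toy_get_order(size);
}
```

So a 1k buffer would occupy a full 4k page, the internal fragmentation behind the "not sure what to do about the 1k and 2k requests" remark above.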


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFC 1/2] JBD: slab management support for large block(>8k)
  2007-09-03 19:31                   ` Christoph Lameter
@ 2007-09-03 19:33                     ` Christoph Hellwig
  2007-09-14 18:53                       ` [PATCH] JBD slab cleanups Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-03 19:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Christoph Hellwig, Mingming Cao, linux-fsdevel, adilger, sho,
	ext4 development, linux-kernel

On Mon, Sep 03, 2007 at 12:31:49PM -0700, Christoph Lameter wrote:
> So you'd be fine with replacing the allocs with
> 
> get_free_pages(GFP_xxx, get_order(size)) ?

Yes.  And rip out all that code related to setting up the slabs.  I plan
to add WARN_ONs to bio_add_page and friends to detect further usage of
slab pages if there is any.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* [PATCH] JBD slab cleanups
  2007-09-03 19:33                     ` Christoph Hellwig
@ 2007-09-14 18:53                       ` Mingming Cao
  2007-09-14 18:58                         ` Christoph Lameter
  2007-09-17 19:29                         ` Mingming Cao
  0 siblings, 2 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-14 18:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Lameter, linux-fsdevel, ext4 development, linux-kernel

jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter <clameter@sgi.com>

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6 with fsx runs fine.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/checkpoint.c   |    2 
 fs/jbd/commit.c       |    6 +-
 fs/jbd/journal.c      |  107 ++++---------------------------------------------
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |    2 
 fs/jbd2/commit.c      |    6 +-
 fs/jbd2/journal.c     |  109 ++++----------------------------------------------
 fs/jbd2/transaction.c |   18 ++++----
 include/linux/jbd.h   |   23 +++++++++-
 include/linux/jbd2.h  |   28 ++++++++++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
 	/* Set up a default-sized revoke table for the new mount. */
 	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
 	if (err) {
-		kfree(journal);
+		jbd_kfree(journal);
 		goto fail;
 	}
 	return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		journal = NULL;
 		goto out;
 	}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}
 
@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
 	if (err) {
 		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
 		       __FUNCTION__);
-		kfree(journal);
+		jbd_kfree(journal);
 		return NULL;
 	}
 
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
 	kfree(journal->j_wbuf);
-	kfree(journal);
+	jbd_kfree(journal);
 }
 
 
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary.
-	 */
-	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
-				slab_size, slab_size, 0, NULL);
-	if (!jbd_slab[i]) {
-		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr,  size_t size)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
 				printk(KERN_WARNING "%s: freeing "
 						"b_frozen_data\n",
 						__FUNCTION__);
-				jbd_slab_free(jh->b_frozen_data, bh->b_size);
+				jbd_free(jh->b_frozen_data, bh->b_size);
 			}
 			if (jh->b_committed_data) {
 				printk(KERN_WARNING "%s: freeing "
 						"b_committed_data\n",
 						__FUNCTION__);
-				jbd_slab_free(jh->b_committed_data, bh->b_size);
+				jbd_free(jh->b_committed_data, bh->b_size);
 			}
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
 	journal_destroy_revoke_caches();
 	journal_destroy_journal_head_cache();
 	journal_destroy_handle_cache();
-	journal_destroy_jbd_slabs();
 }
 
 static int __init journal_init(void)
Index: linux-2.6.23-rc5/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd.h	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd.h	2007-09-13 13:42:27.000000000 -0700
@@ -71,9 +71,26 @@ extern int journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
+static inline void *__jbd_kmalloc(const char *where, size_t size,
+						gfp_t flags, int retry)
+{
+	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+
+static inline void jbd_kfree(void *ptr)
+{
+	return kfree(ptr);
+}
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+	return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+	free_pages((unsigned long)ptr, get_order(size));
+};
 
 #define jbd_kmalloc(size, flags) \
 	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
Index: linux-2.6.23-rc5/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc5.orig/include/linux/jbd2.h	2007-09-13 13:37:58.000000000 -0700
+++ linux-2.6.23-rc5/include/linux/jbd2.h	2007-09-13 13:51:49.000000000 -0700
@@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
+static inline void *__jbd2_kmalloc(const char *where, size_t size,
+						gfp_t flags, int retry)
+{
+	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
+}
+static inline void jbd2_kfree(void *ptr)
+{
+	return kfree(ptr);
+}
+
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+	return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+	free_pages((unsigned long)ptr, get_order(size));
+};
 
-#define jbd_kmalloc(size, flags) \
+#define jbd2_kmalloc(size, flags) \
 	__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
 #define jbd_rep_kmalloc(size, flags) \
 	__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
@@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
  */
 extern struct kmem_cache *jbd2_handle_cache;
 
-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
 {
 	return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
 }
 
-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
 {
 	kmem_cache_free(jbd2_handle_cache, handle);
 }
Index: linux-2.6.23-rc5/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/journal.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/journal.c	2007-09-13 14:00:17.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd2_slab_free(tmp, bh_in->b_size);
+			jbd2_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -680,7 +679,7 @@ static journal_t * journal_init_common (
 	/* Set up a default-sized revoke table for the new mount. */
 	err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
 	if (err) {
-		kfree(journal);
+		jbd2_kfree(journal);
 		goto fail;
 	}
 	return journal;
@@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd2_kfree(journal);
 		journal = NULL;
 		goto out;
 	}
@@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
-		kfree(journal);
+		jbd2_kfree(journal);
 		return NULL;
 	}
 
@@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
 	if (err) {
 		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
 		       __FUNCTION__);
-		kfree(journal);
+		jbd2_kfree(journal);
 		return NULL;
 	}
 
@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (jbd2_journal_recover(journal))
@@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
 	if (journal->j_revoke)
 		jbd2_journal_destroy_revoke(journal);
 	kfree(journal->j_wbuf);
-	kfree(journal);
+	jbd2_kfree(journal);
 }
 
 
@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary.
-	 */
-	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
-				slab_size, slab_size, 0, NULL);
-	if (!jbd_slab[i]) {
-		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr,  size_t size)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
 				printk(KERN_WARNING "%s: freeing "
 						"b_frozen_data\n",
 						__FUNCTION__);
-				jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+				jbd2_free(jh->b_frozen_data, bh->b_size);
 			}
 			if (jh->b_committed_data) {
 				printk(KERN_WARNING "%s: freeing "
 						"b_committed_data\n",
 						__FUNCTION__);
-				jbd2_slab_free(jh->b_committed_data, bh->b_size);
+				jbd2_free(jh->b_committed_data, bh->b_size);
 			}
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
 	jbd2_journal_destroy_revoke_caches();
 	jbd2_journal_destroy_jbd2_journal_head_cache();
 	jbd2_journal_destroy_handle_cache();
-	jbd2_journal_destroy_jbd_slabs();
 }
 
 static int __init journal_init(void)
Index: linux-2.6.23-rc5/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/commit.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/commit.c	2007-09-13 13:40:03.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
 			struct buffer_head *bh = jh2bh(jh);
 
 			jbd_lock_bh_state(bh);
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -792,14 +792,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}
 
Index: linux-2.6.23-rc5/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/commit.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/commit.c	2007-09-13 13:40:03.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
 			struct buffer_head *bh = jh2bh(jh);
 
 			jbd_lock_bh_state(bh);
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -801,14 +801,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd2_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}
 
Index: linux-2.6.23-rc5/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/transaction.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/transaction.c	2007-09-13 13:46:23.000000000 -0700
@@ -229,7 +229,7 @@ repeat_locked:
 	spin_unlock(&journal->j_state_lock);
 out:
 	if (unlikely(new_transaction))		/* It's usually NULL */
-		kfree(new_transaction);
+		jbd_kfree(new_transaction);
 	return ret;
 }
 
@@ -668,7 +668,7 @@ repeat:
 				JBUFFER_TRACE(jh, "allocate memory for buffer");
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
-					jbd_slab_alloc(jh2bh(jh)->b_size,
+					jbd_alloc(jh2bh(jh)->b_size,
 							 GFP_NOFS);
 				if (!frozen_buffer) {
 					printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
 
 out:
 	if (unlikely(frozen_buffer))	/* It's usually NULL */
-		jbd_slab_free(frozen_buffer, bh->b_size);
+		jbd_free(frozen_buffer, bh->b_size);
 
 	JBUFFER_TRACE(jh, "exit");
 	return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
 		if (!committed_data) {
 			printk(KERN_EMERG "%s: No memory for committed data\n",
 				__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
 out:
 	journal_put_journal_head(jh);
 	if (unlikely(committed_data))
-		jbd_slab_free(committed_data, bh->b_size);
+		jbd_free(committed_data, bh->b_size);
 	return err;
 }
 
Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/transaction.c	2007-09-13 13:59:20.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
+		new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
 						GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
@@ -229,14 +229,14 @@ repeat_locked:
 	spin_unlock(&journal->j_state_lock);
 out:
 	if (unlikely(new_transaction))		/* It's usually NULL */
-		kfree(new_transaction);
+		jbd2_kfree(new_transaction);
 	return ret;
 }
 
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
-	handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
 	if (!handle)
 		return NULL;
 	memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
 
 	err = start_this_handle(journal, handle);
 	if (err < 0) {
-		jbd_free_handle(handle);
+		jbd2_free_handle(handle);
 		current->journal_info = NULL;
 		handle = ERR_PTR(err);
 	}
@@ -668,7 +668,7 @@ repeat:
 				JBUFFER_TRACE(jh, "allocate memory for buffer");
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
-					jbd2_slab_alloc(jh2bh(jh)->b_size,
+					jbd2_alloc(jh2bh(jh)->b_size,
 							 GFP_NOFS);
 				if (!frozen_buffer) {
 					printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
 
 out:
 	if (unlikely(frozen_buffer))	/* It's usually NULL */
-		jbd2_slab_free(frozen_buffer, bh->b_size);
+		jbd2_free(frozen_buffer, bh->b_size);
 
 	JBUFFER_TRACE(jh, "exit");
 	return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
 		if (!committed_data) {
 			printk(KERN_EMERG "%s: No memory for committed data\n",
 				__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
 out:
 	jbd2_journal_put_journal_head(jh);
 	if (unlikely(committed_data))
-		jbd2_slab_free(committed_data, bh->b_size);
+		jbd2_free(committed_data, bh->b_size);
 	return err;
 }
 
@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
 		spin_unlock(&journal->j_state_lock);
 	}
 
-	jbd_free_handle(handle);
+	jbd2_free_handle(handle);
 	return err;
 }
 
Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd/checkpoint.c	2007-09-14 09:57:21.000000000 -0700
@@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
 	J_ASSERT(journal->j_running_transaction != transaction);
 
 	jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
-	kfree(transaction);
+	jbd_kfree(transaction);
 }
Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
===================================================================
--- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c	2007-09-13 13:37:57.000000000 -0700
+++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c	2007-09-14 09:57:03.000000000 -0700
@@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
 	J_ASSERT(journal->j_running_transaction != transaction);
 
 	jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
-	kfree(transaction);
+	jbd2_kfree(transaction);
 }



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-14 18:53                       ` [PATCH] JBD slab cleanups Mingming Cao
@ 2007-09-14 18:58                         ` Christoph Lameter
  2007-09-17 19:29                         ` Mingming Cao
  1 sibling, 0 replies; 124+ messages in thread
From: Christoph Lameter @ 2007-09-14 18:58 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Christoph Hellwig, linux-fsdevel, ext4 development, linux-kernel

Thanks Mingming.




* Re: [PATCH] JBD slab cleanups
  2007-09-14 18:53                       ` [PATCH] JBD slab cleanups Mingming Cao
  2007-09-14 18:58                         ` Christoph Lameter
@ 2007-09-17 19:29                         ` Mingming Cao
  2007-09-17 19:34                           ` Christoph Hellwig
  2007-09-17 22:01                           ` Badari Pulavarty
  1 sibling, 2 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-17 19:29 UTC (permalink / raw)
  To: Christoph Hellwig, pbadari
  Cc: Christoph Lameter, linux-fsdevel, ext4 development, linux-kernel

On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> jbd/jbd2: Replace slab allocations with page cache allocations
> 
> From: Christoph Lameter <clameter@sgi.com>
> 
> JBD should not pass slab pages down to the block layer.
> Use page allocator pages instead. This will also prepare
> JBD for the large blocksize patchset.
> 

Currently, memory allocation for committed_data (and frozen_buffer) for a
buffer head is done through jbd slab management. As Christoph Hellwig
pointed out, this is broken since jbd should not pass slab pages down
to the IO layer; he suggested using get_free_pages() directly.

The problem with this patch, as Andreas Dilger pointed out today in the
ext4 interlock call, is that for 1k/2k block size ext2/3/4,
get_free_pages() wastes 1/3-1/2 of the page.

What was the original intention behind setting up slabs for
committed_data (and frozen_buffer) in JBD? Why not use kmalloc?

Mingming

> Tested on 2.6.23-rc6 with fsx runs fine.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> ---
>  fs/jbd/checkpoint.c   |    2 
>  fs/jbd/commit.c       |    6 +-
>  fs/jbd/journal.c      |  107 ++++---------------------------------------------
>  fs/jbd/transaction.c  |   10 ++--
>  fs/jbd2/checkpoint.c  |    2 
>  fs/jbd2/commit.c      |    6 +-
>  fs/jbd2/journal.c     |  109 ++++----------------------------------------------
>  fs/jbd2/transaction.c |   18 ++++----
>  include/linux/jbd.h   |   23 +++++++++-
>  include/linux/jbd2.h  |   28 ++++++++++--
>  10 files changed, 83 insertions(+), 228 deletions(-)
> 
> Index: linux-2.6.23-rc5/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/journal.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/journal.c	2007-09-13 13:45:39.000000000 -0700
> @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
> 
>  static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
>  static void __journal_abort_soft (journal_t *journal, int errno);
> -static int journal_create_jbd_slab(size_t slab_size);
> 
>  /*
>   * Helper function used to manage commit timeouts
> @@ -334,10 +333,10 @@ repeat:
>  		char *tmp;
> 
>  		jbd_unlock_bh_state(bh_in);
> -		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
> +		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
>  		jbd_lock_bh_state(bh_in);
>  		if (jh_in->b_frozen_data) {
> -			jbd_slab_free(tmp, bh_in->b_size);
> +			jbd_free(tmp, bh_in->b_size);
>  			goto repeat;
>  		}
> 
> @@ -679,7 +678,7 @@ static journal_t * journal_init_common (
>  	/* Set up a default-sized revoke table for the new mount. */
>  	err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
>  	if (err) {
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		goto fail;
>  	}
>  	return journal;
> @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		journal = NULL;
>  		goto out;
>  	}
> @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		return NULL;
>  	}
> 
> @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
>  	if (err) {
>  		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
>  		       __FUNCTION__);
> -		kfree(journal);
> +		jbd_kfree(journal);
>  		return NULL;
>  	}
> 
> @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
>  		}
>  	}
> 
> -	/*
> -	 * Create a slab for this blocksize
> -	 */
> -	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> -	if (err)
> -		return err;
> -
>  	/* Let the recovery code check whether it needs to recover any
>  	 * data from the journal. */
>  	if (journal_recover(journal))
> @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
>  	if (journal->j_revoke)
>  		journal_destroy_revoke(journal);
>  	kfree(journal->j_wbuf);
> -	kfree(journal);
> +	jbd_kfree(journal);
>  }
> 
> 
> @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
>  }
> 
>  /*
> - * Simple support for retrying memory allocations.  Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> -	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size)  (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> -	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
> -};
> -
> -static void journal_destroy_jbd_slabs(void)
> -{
> -	int i;
> -
> -	for (i = 0; i < JBD_MAX_SLABS; i++) {
> -		if (jbd_slab[i])
> -			kmem_cache_destroy(jbd_slab[i]);
> -		jbd_slab[i] = NULL;
> -	}
> -}
> -
> -static int journal_create_jbd_slab(size_t slab_size)
> -{
> -	int i = JBD_SLAB_INDEX(slab_size);
> -
> -	BUG_ON(i >= JBD_MAX_SLABS);
> -
> -	/*
> -	 * Check if we already have a slab created for this size
> -	 */
> -	if (jbd_slab[i])
> -		return 0;
> -
> -	/*
> -	 * Create a slab and force alignment to be same as slabsize -
> -	 * this will make sure that allocations won't cross the page
> -	 * boundary.
> -	 */
> -	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> -				slab_size, slab_size, 0, NULL);
> -	if (!jbd_slab[i]) {
> -		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> -		return -ENOMEM;
> -	}
> -	return 0;
> -}
> -
> -void * jbd_slab_alloc(size_t size, gfp_t flags)
> -{
> -	int idx;
> -
> -	idx = JBD_SLAB_INDEX(size);
> -	BUG_ON(jbd_slab[idx] == NULL);
> -	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd_slab_free(void *ptr,  size_t size)
> -{
> -	int idx;
> -
> -	idx = JBD_SLAB_INDEX(size);
> -	BUG_ON(jbd_slab[idx] == NULL);
> -	kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
>   * Journal_head storage management
>   */
>  static struct kmem_cache *journal_head_cache;
> @@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
>  				printk(KERN_WARNING "%s: freeing "
>  						"b_frozen_data\n",
>  						__FUNCTION__);
> -				jbd_slab_free(jh->b_frozen_data, bh->b_size);
> +				jbd_free(jh->b_frozen_data, bh->b_size);
>  			}
>  			if (jh->b_committed_data) {
>  				printk(KERN_WARNING "%s: freeing "
>  						"b_committed_data\n",
>  						__FUNCTION__);
> -				jbd_slab_free(jh->b_committed_data, bh->b_size);
> +				jbd_free(jh->b_committed_data, bh->b_size);
>  			}
>  			bh->b_private = NULL;
>  			jh->b_bh = NULL;	/* debug, really */
> @@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
>  	journal_destroy_revoke_caches();
>  	journal_destroy_journal_head_cache();
>  	journal_destroy_handle_cache();
> -	journal_destroy_jbd_slabs();
>  }
> 
>  static int __init journal_init(void)
> Index: linux-2.6.23-rc5/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd.h	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd.h	2007-09-13 13:42:27.000000000 -0700
> @@ -71,9 +71,26 @@ extern int journal_enable_debug;
>  #define jbd_debug(f, a...)	/**/
>  #endif
> 
> -extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd_slab_free(void *ptr, size_t size);
> +static inline void *__jbd_kmalloc(const char *where, size_t size,
> +						gfp_t flags, int retry)
> +{
> +	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +
> +static inline void jbd_kfree(void *ptr)
> +{
> +	return kfree(ptr);
> +}
> +
> +static inline void *jbd_alloc(size_t size, gfp_t flags)
> +{
> +	return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd_free(void *ptr, size_t size)
> +{
> +	free_pages((unsigned long)ptr, get_order(size));
> +};
> 
>  #define jbd_kmalloc(size, flags) \
>  	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
> Index: linux-2.6.23-rc5/include/linux/jbd2.h
> ===================================================================
> --- linux-2.6.23-rc5.orig/include/linux/jbd2.h	2007-09-13 13:37:58.000000000 -0700
> +++ linux-2.6.23-rc5/include/linux/jbd2.h	2007-09-13 13:51:49.000000000 -0700
> @@ -71,11 +71,27 @@ extern u8 jbd2_journal_enable_debug;
>  #define jbd_debug(f, a...)	/**/
>  #endif
> 
> -extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
> -extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
> -extern void jbd2_slab_free(void *ptr, size_t size);
> +static inline void *__jbd2_kmalloc(const char *where, size_t size,
> +						gfp_t flags, int retry)
> +{
> +	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> +}
> +static inline void jbd2_kfree(void *ptr)
> +{
> +	return kfree(ptr);
> +}
> +
> +static inline void *jbd2_alloc(size_t size, gfp_t flags)
> +{
> +	return (void *)__get_free_pages(flags, get_order(size));
> +}
> +
> +static inline void jbd2_free(void *ptr, size_t size)
> +{
> +	free_pages((unsigned long)ptr, get_order(size));
> +};
> 
> -#define jbd_kmalloc(size, flags) \
> +#define jbd2_kmalloc(size, flags) \
>  	__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
>  #define jbd_rep_kmalloc(size, flags) \
>  	__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
> @@ -959,12 +975,12 @@ void jbd2_journal_put_journal_head(struc
>   */
>  extern struct kmem_cache *jbd2_handle_cache;
> 
> -static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
> +static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
>  {
>  	return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
>  }
> 
> -static inline void jbd_free_handle(handle_t *handle)
> +static inline void jbd2_free_handle(handle_t *handle)
>  {
>  	kmem_cache_free(jbd2_handle_cache, handle);
>  }
> Index: linux-2.6.23-rc5/fs/jbd2/journal.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/journal.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/journal.c	2007-09-13 14:00:17.000000000 -0700
> @@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
> 
>  static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
>  static void __journal_abort_soft (journal_t *journal, int errno);
> -static int jbd2_journal_create_jbd_slab(size_t slab_size);
> 
>  /*
>   * Helper function used to manage commit timeouts
> @@ -335,10 +334,10 @@ repeat:
>  		char *tmp;
> 
>  		jbd_unlock_bh_state(bh_in);
> -		tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
> +		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
>  		jbd_lock_bh_state(bh_in);
>  		if (jh_in->b_frozen_data) {
> -			jbd2_slab_free(tmp, bh_in->b_size);
> +			jbd2_free(tmp, bh_in->b_size);
>  			goto repeat;
>  		}
> 
> @@ -655,7 +654,7 @@ static journal_t * journal_init_common (
>  	journal_t *journal;
>  	int err;
> 
> -	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
> +	journal = jbd2_kmalloc(sizeof(*journal), GFP_KERNEL);
>  	if (!journal)
>  		goto fail;
>  	memset(journal, 0, sizeof(*journal));
> @@ -680,7 +679,7 @@ static journal_t * journal_init_common (
>  	/* Set up a default-sized revoke table for the new mount. */
>  	err = jbd2_journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
>  	if (err) {
> -		kfree(journal);
> +		jbd2_kfree(journal);
>  		goto fail;
>  	}
>  	return journal;
> @@ -729,7 +728,7 @@ journal_t * jbd2_journal_init_dev(struct
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd2_kfree(journal);
>  		journal = NULL;
>  		goto out;
>  	}
> @@ -783,7 +782,7 @@ journal_t * jbd2_journal_init_inode (str
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> -		kfree(journal);
> +		jbd2_kfree(journal);
>  		return NULL;
>  	}
> 
> @@ -792,7 +791,7 @@ journal_t * jbd2_journal_init_inode (str
>  	if (err) {
>  		printk(KERN_ERR "%s: Cannnot locate journal superblock\n",
>  		       __FUNCTION__);
> -		kfree(journal);
> +		jbd2_kfree(journal);
>  		return NULL;
>  	}
> 
> @@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
>  		}
>  	}
> 
> -	/*
> -	 * Create a slab for this blocksize
> -	 */
> -	err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
> -	if (err)
> -		return err;
> -
>  	/* Let the recovery code check whether it needs to recover any
>  	 * data from the journal. */
>  	if (jbd2_journal_recover(journal))
> @@ -1167,7 +1159,7 @@ void jbd2_journal_destroy(journal_t *jou
>  	if (journal->j_revoke)
>  		jbd2_journal_destroy_revoke(journal);
>  	kfree(journal->j_wbuf);
> -	kfree(journal);
> +	jbd2_kfree(journal);
>  }
> 
> 
> @@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
>  }
> 
>  /*
> - * Simple support for retrying memory allocations.  Introduced to help to
> - * debug different VM deadlock avoidance strategies.
> - */
> -void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
> -{
> -	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
> -}
> -
> -/*
> - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
> - * and allocate frozen and commit buffers from these slabs.
> - *
> - * Reason for doing this is to avoid, SLAB_DEBUG - since it could
> - * cause bh to cross page boundary.
> - */
> -
> -#define JBD_MAX_SLABS 5
> -#define JBD_SLAB_INDEX(size)  (size >> 11)
> -
> -static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
> -static const char *jbd_slab_names[JBD_MAX_SLABS] = {
> -	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
> -};
> -
> -static void jbd2_journal_destroy_jbd_slabs(void)
> -{
> -	int i;
> -
> -	for (i = 0; i < JBD_MAX_SLABS; i++) {
> -		if (jbd_slab[i])
> -			kmem_cache_destroy(jbd_slab[i]);
> -		jbd_slab[i] = NULL;
> -	}
> -}
> -
> -static int jbd2_journal_create_jbd_slab(size_t slab_size)
> -{
> -	int i = JBD_SLAB_INDEX(slab_size);
> -
> -	BUG_ON(i >= JBD_MAX_SLABS);
> -
> -	/*
> -	 * Check if we already have a slab created for this size
> -	 */
> -	if (jbd_slab[i])
> -		return 0;
> -
> -	/*
> -	 * Create a slab and force alignment to be same as slabsize -
> -	 * this will make sure that allocations won't cross the page
> -	 * boundary.
> -	 */
> -	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
> -				slab_size, slab_size, 0, NULL);
> -	if (!jbd_slab[i]) {
> -		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
> -		return -ENOMEM;
> -	}
> -	return 0;
> -}
> -
> -void * jbd2_slab_alloc(size_t size, gfp_t flags)
> -{
> -	int idx;
> -
> -	idx = JBD_SLAB_INDEX(size);
> -	BUG_ON(jbd_slab[idx] == NULL);
> -	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
> -}
> -
> -void jbd2_slab_free(void *ptr,  size_t size)
> -{
> -	int idx;
> -
> -	idx = JBD_SLAB_INDEX(size);
> -	BUG_ON(jbd_slab[idx] == NULL);
> -	kmem_cache_free(jbd_slab[idx], ptr);
> -}
> -
> -/*
>   * Journal_head storage management
>   */
>  static struct kmem_cache *jbd2_journal_head_cache;
> @@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
>  				printk(KERN_WARNING "%s: freeing "
>  						"b_frozen_data\n",
>  						__FUNCTION__);
> -				jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> +				jbd2_free(jh->b_frozen_data, bh->b_size);
>  			}
>  			if (jh->b_committed_data) {
>  				printk(KERN_WARNING "%s: freeing "
>  						"b_committed_data\n",
>  						__FUNCTION__);
> -				jbd2_slab_free(jh->b_committed_data, bh->b_size);
> +				jbd2_free(jh->b_committed_data, bh->b_size);
>  			}
>  			bh->b_private = NULL;
>  			jh->b_bh = NULL;	/* debug, really */
> @@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
>  	jbd2_journal_destroy_revoke_caches();
>  	jbd2_journal_destroy_jbd2_journal_head_cache();
>  	jbd2_journal_destroy_handle_cache();
> -	jbd2_journal_destroy_jbd_slabs();
>  }
> 
>  static int __init journal_init(void)
> Index: linux-2.6.23-rc5/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/commit.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/commit.c	2007-09-13 13:40:03.000000000 -0700
> @@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
>  			struct buffer_head *bh = jh2bh(jh);
> 
>  			jbd_lock_bh_state(bh);
> -			jbd_slab_free(jh->b_committed_data, bh->b_size);
> +			jbd_free(jh->b_committed_data, bh->b_size);
>  			jh->b_committed_data = NULL;
>  			jbd_unlock_bh_state(bh);
>  		}
> @@ -792,14 +792,14 @@ restart_loop:
>  		 * Otherwise, we can just throw away the frozen data now.
>  		 */
>  		if (jh->b_committed_data) {
> -			jbd_slab_free(jh->b_committed_data, bh->b_size);
> +			jbd_free(jh->b_committed_data, bh->b_size);
>  			jh->b_committed_data = NULL;
>  			if (jh->b_frozen_data) {
>  				jh->b_committed_data = jh->b_frozen_data;
>  				jh->b_frozen_data = NULL;
>  			}
>  		} else if (jh->b_frozen_data) {
> -			jbd_slab_free(jh->b_frozen_data, bh->b_size);
> +			jbd_free(jh->b_frozen_data, bh->b_size);
>  			jh->b_frozen_data = NULL;
>  		}
> 
> Index: linux-2.6.23-rc5/fs/jbd2/commit.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/commit.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/commit.c	2007-09-13 13:40:03.000000000 -0700
> @@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
>  			struct buffer_head *bh = jh2bh(jh);
> 
>  			jbd_lock_bh_state(bh);
> -			jbd2_slab_free(jh->b_committed_data, bh->b_size);
> +			jbd2_free(jh->b_committed_data, bh->b_size);
>  			jh->b_committed_data = NULL;
>  			jbd_unlock_bh_state(bh);
>  		}
> @@ -801,14 +801,14 @@ restart_loop:
>  		 * Otherwise, we can just throw away the frozen data now.
>  		 */
>  		if (jh->b_committed_data) {
> -			jbd2_slab_free(jh->b_committed_data, bh->b_size);
> +			jbd2_free(jh->b_committed_data, bh->b_size);
>  			jh->b_committed_data = NULL;
>  			if (jh->b_frozen_data) {
>  				jh->b_committed_data = jh->b_frozen_data;
>  				jh->b_frozen_data = NULL;
>  			}
>  		} else if (jh->b_frozen_data) {
> -			jbd2_slab_free(jh->b_frozen_data, bh->b_size);
> +			jbd2_free(jh->b_frozen_data, bh->b_size);
>  			jh->b_frozen_data = NULL;
>  		}
> 
> Index: linux-2.6.23-rc5/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/transaction.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/transaction.c	2007-09-13 13:46:23.000000000 -0700
> @@ -229,7 +229,7 @@ repeat_locked:
>  	spin_unlock(&journal->j_state_lock);
>  out:
>  	if (unlikely(new_transaction))		/* It's usually NULL */
> -		kfree(new_transaction);
> +		jbd_kfree(new_transaction);
>  	return ret;
>  }
> 
> @@ -668,7 +668,7 @@ repeat:
>  				JBUFFER_TRACE(jh, "allocate memory for buffer");
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
> -					jbd_slab_alloc(jh2bh(jh)->b_size,
> +					jbd_alloc(jh2bh(jh)->b_size,
>  							 GFP_NOFS);
>  				if (!frozen_buffer) {
>  					printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
> 
>  out:
>  	if (unlikely(frozen_buffer))	/* It's usually NULL */
> -		jbd_slab_free(frozen_buffer, bh->b_size);
> +		jbd_free(frozen_buffer, bh->b_size);
> 
>  	JBUFFER_TRACE(jh, "exit");
>  	return error;
> @@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
> 
>  repeat:
>  	if (!jh->b_committed_data) {
> -		committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> +		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
>  		if (!committed_data) {
>  			printk(KERN_EMERG "%s: No memory for committed data\n",
>  				__FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
>  out:
>  	journal_put_journal_head(jh);
>  	if (unlikely(committed_data))
> -		jbd_slab_free(committed_data, bh->b_size);
> +		jbd_free(committed_data, bh->b_size);
>  	return err;
>  }
> 
> Index: linux-2.6.23-rc5/fs/jbd2/transaction.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/transaction.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/transaction.c	2007-09-13 13:59:20.000000000 -0700
> @@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
> 
>  alloc_transaction:
>  	if (!journal->j_running_transaction) {
> -		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
> +		new_transaction = jbd2_kmalloc(sizeof(*new_transaction),
>  						GFP_NOFS);
>  		if (!new_transaction) {
>  			ret = -ENOMEM;
> @@ -229,14 +229,14 @@ repeat_locked:
>  	spin_unlock(&journal->j_state_lock);
>  out:
>  	if (unlikely(new_transaction))		/* It's usually NULL */
> -		kfree(new_transaction);
> +		jbd2_kfree(new_transaction);
>  	return ret;
>  }
> 
>  /* Allocate a new handle.  This should probably be in a slab... */
>  static handle_t *new_handle(int nblocks)
>  {
> -	handle_t *handle = jbd_alloc_handle(GFP_NOFS);
> +	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
>  	if (!handle)
>  		return NULL;
>  	memset(handle, 0, sizeof(*handle));
> @@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
> 
>  	err = start_this_handle(journal, handle);
>  	if (err < 0) {
> -		jbd_free_handle(handle);
> +		jbd2_free_handle(handle);
>  		current->journal_info = NULL;
>  		handle = ERR_PTR(err);
>  	}
> @@ -668,7 +668,7 @@ repeat:
>  				JBUFFER_TRACE(jh, "allocate memory for buffer");
>  				jbd_unlock_bh_state(bh);
>  				frozen_buffer =
> -					jbd2_slab_alloc(jh2bh(jh)->b_size,
> +					jbd2_alloc(jh2bh(jh)->b_size,
>  							 GFP_NOFS);
>  				if (!frozen_buffer) {
>  					printk(KERN_EMERG
> @@ -728,7 +728,7 @@ done:
> 
>  out:
>  	if (unlikely(frozen_buffer))	/* It's usually NULL */
> -		jbd2_slab_free(frozen_buffer, bh->b_size);
> +		jbd2_free(frozen_buffer, bh->b_size);
> 
>  	JBUFFER_TRACE(jh, "exit");
>  	return error;
> @@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
> 
>  repeat:
>  	if (!jh->b_committed_data) {
> -		committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
> +		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
>  		if (!committed_data) {
>  			printk(KERN_EMERG "%s: No memory for committed data\n",
>  				__FUNCTION__);
> @@ -908,7 +908,7 @@ repeat:
>  out:
>  	jbd2_journal_put_journal_head(jh);
>  	if (unlikely(committed_data))
> -		jbd2_slab_free(committed_data, bh->b_size);
> +		jbd2_free(committed_data, bh->b_size);
>  	return err;
>  }
> 
> @@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
>  		spin_unlock(&journal->j_state_lock);
>  	}
> 
> -	jbd_free_handle(handle);
> +	jbd2_free_handle(handle);
>  	return err;
>  }
> 
> Index: linux-2.6.23-rc5/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd/checkpoint.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd/checkpoint.c	2007-09-14 09:57:21.000000000 -0700
> @@ -693,5 +693,5 @@ void __journal_drop_transaction(journal_
>  	J_ASSERT(journal->j_running_transaction != transaction);
> 
>  	jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> -	kfree(transaction);
> +	jbd_kfree(transaction);
>  }
> Index: linux-2.6.23-rc5/fs/jbd2/checkpoint.c
> ===================================================================
> --- linux-2.6.23-rc5.orig/fs/jbd2/checkpoint.c	2007-09-13 13:37:57.000000000 -0700
> +++ linux-2.6.23-rc5/fs/jbd2/checkpoint.c	2007-09-14 09:57:03.000000000 -0700
> @@ -693,5 +693,5 @@ void __jbd2_journal_drop_transaction(jou
>  	J_ASSERT(journal->j_running_transaction != transaction);
> 
>  	jbd_debug(1, "Dropping transaction %d, all done\n", transaction->t_tid);
> -	kfree(transaction);
> +	jbd2_kfree(transaction);
>  }
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-17 19:29                         ` Mingming Cao
@ 2007-09-17 19:34                           ` Christoph Hellwig
  2007-09-17 22:01                           ` Badari Pulavarty
  1 sibling, 0 replies; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-17 19:34 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Christoph Hellwig, pbadari, Christoph Lameter, linux-fsdevel,
	ext4 development, linux-kernel

On Mon, Sep 17, 2007 at 12:29:51PM -0700, Mingming Cao wrote:
> The problem with this patch, as Andreas Dilger pointed out today in the
> ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> get_free_pages() wastes 1/3-1/2 of the page space.
> 
> What was the original intention in setting up slabs for committed_data
> (and frozen_buffer) in JBD? Why not use kmalloc?

kmalloc is using slabs :)

The intent was to avoid the wasted memory, but as we've repeated a gazillion
times, wasted memory on a rather rare codepath doesn't really matter when
the alternative is crashing random storage drivers.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-17 19:29                         ` Mingming Cao
  2007-09-17 19:34                           ` Christoph Hellwig
@ 2007-09-17 22:01                           ` Badari Pulavarty
  2007-09-17 22:57                             ` Mingming Cao
  1 sibling, 1 reply; 124+ messages in thread
From: Badari Pulavarty @ 2007-09-17 22:01 UTC (permalink / raw)
  To: cmm
  Cc: Christoph Hellwig, Christoph Lameter, linux-fsdevel,
	ext4 development, lkml

On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > jbd/jbd2: Replace slab allocations with page cache allocations
> > 
> > From: Christoph Lameter <clameter@sgi.com>
> > 
> > JBD should not pass slab pages down to the block layer.
> > Use page allocator pages instead. This will also prepare
> > JBD for the large blocksize patchset.
> > 
> 
> Currently, memory allocation for committed_data (and frozen_buffer) for a
> buffer head is done through JBD slab management. As Christoph Hellwig
> pointed out, this is broken because JBD should not pass slab pages down
> to the IO layer; he suggested using get_free_pages() directly.
> 
> The problem with this patch, as Andreas Dilger pointed out today in the
> ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> get_free_pages() wastes 1/3-1/2 of the page space.
> 
> What was the original intention in setting up slabs for committed_data
> (and frozen_buffer) in JBD? Why not use kmalloc?
> 
> Mingming

Looks good. Small suggestion is to get rid of all kmalloc() usages and
consistently use jbd_kmalloc() or jbd2_kmalloc().

Thanks,
Badari


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-17 22:01                           ` Badari Pulavarty
@ 2007-09-17 22:57                             ` Mingming Cao
  2007-09-18  9:04                               ` Christoph Hellwig
  0 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-09-17 22:57 UTC (permalink / raw)
  To: Badari Pulavarty
  Cc: Christoph Hellwig, Christoph Lameter, linux-fsdevel,
	ext4 development, lkml

On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
> On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
> > On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
> > > jbd/jbd2: Replace slab allocations with page cache allocations
> > > 
> > > From: Christoph Lameter <clameter@sgi.com>
> > > 
> > > JBD should not pass slab pages down to the block layer.
> > > Use page allocator pages instead. This will also prepare
> > > JBD for the large blocksize patchset.
> > > 
> > 
> > Currently, memory allocation for committed_data (and frozen_buffer) for a
> > buffer head is done through JBD slab management. As Christoph Hellwig
> > pointed out, this is broken because JBD should not pass slab pages down
> > to the IO layer; he suggested using get_free_pages() directly.
> > 
> > The problem with this patch, as Andreas Dilger pointed out today in the
> > ext4 interlock call, is that for 1k/2k block size ext2/3/4,
> > get_free_pages() wastes 1/3-1/2 of the page space.
> > 
> > What was the original intention in setting up slabs for committed_data
> > (and frozen_buffer) in JBD? Why not use kmalloc?
> > 
> > Mingming
> 
> Looks good. Small suggestion is to get rid of all kmalloc() usages and
> consistently use jbd_kmalloc() or jbd2_kmalloc().
> 
> Thanks,
> Badari
> 

Here is the small incremental cleanup patch.

Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.


Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c  |    8 +++++---
 fs/jbd/revoke.c   |   12 ++++++------
 fs/jbd2/journal.c |    8 +++++---
 fs/jbd2/revoke.c  |   12 ++++++------
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-17 14:32:16.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-17 14:33:59.000000000 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
 		iput(journal->j_inode);
 	if (journal->j_revoke)
 		journal_destroy_revoke(journal);
-	kfree(journal->j_wbuf);
+	jbd_kfree(journal->j_wbuf);
 	jbd_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-17 14:32:22.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-17 14:35:13.000000000 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ
 
 	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
 	if (!journal->j_revoke_table[1]) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		return -ENOMEM;
 	}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
 		journal->j_revoke = NULL;
@@ -280,7 +280,7 @@ void journal_destroy_revoke(journal_t *j
 		J_ASSERT (list_empty(hash_list));
 	}
 
-	kfree(table->hash_table);
+	jbd_kfree(table->hash_table);
 	kmem_cache_free(revoke_table_cache, table);
 	journal->j_revoke = NULL;
 
@@ -293,7 +293,7 @@ void journal_destroy_revoke(journal_t *j
 		J_ASSERT (list_empty(hash_list));
 	}
 
-	kfree(table->hash_table);
+	jbd_kfree(table->hash_table);
 	kmem_cache_free(revoke_table_cache, table);
 	journal->j_revoke = NULL;
 }
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-17 14:32:39.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-17 14:53:15.000000000 -0700
@@ -724,7 +724,8 @@ journal_t * jbd2_journal_init_dev(struct
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -778,7 +779,8 @@ journal_t * jbd2_journal_init_inode (str
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = jbd2_kmalloc(n * sizeof(struct buffer_head*),
+					GFP_KERNEL);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -1158,7 +1160,7 @@ void jbd2_journal_destroy(journal_t *jou
 		iput(journal->j_inode);
 	if (journal->j_revoke)
 		jbd2_journal_destroy_revoke(journal);
-	kfree(journal->j_wbuf);
+	jbd2_kfree(journal->j_wbuf);
 	jbd2_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c	2007-09-17 14:32:34.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c	2007-09-17 14:55:35.000000000 -0700
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -232,7 +232,7 @@ int jbd2_journal_init_revoke(journal_t *
 
 	journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
 	if (!journal->j_revoke_table[1]) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd2_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
 		return -ENOMEM;
 	}
@@ -247,9 +247,9 @@ int jbd2_journal_init_revoke(journal_t *
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		jbd2_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
 	if (!journal->j_revoke->hash_table) {
-		kfree(journal->j_revoke_table[0]->hash_table);
+		jbd2_kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[1]);
 		journal->j_revoke = NULL;
@@ -281,7 +281,7 @@ void jbd2_journal_destroy_revoke(journal
 		J_ASSERT (list_empty(hash_list));
 	}
 
-	kfree(table->hash_table);
+	jbd2_kfree(table->hash_table);
 	kmem_cache_free(jbd2_revoke_table_cache, table);
 	journal->j_revoke = NULL;
 
@@ -294,7 +294,7 @@ void jbd2_journal_destroy_revoke(journal
 		J_ASSERT (list_empty(hash_list));
 	}
 
-	kfree(table->hash_table);
+	jbd2_kfree(table->hash_table);
 	kmem_cache_free(jbd2_revoke_table_cache, table);
 	journal->j_revoke = NULL;
 }






^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-17 22:57                             ` Mingming Cao
@ 2007-09-18  9:04                               ` Christoph Hellwig
  2007-09-18 16:35                                 ` Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Christoph Hellwig @ 2007-09-18  9:04 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Badari Pulavarty, Christoph Hellwig, Christoph Lameter,
	linux-fsdevel, ext4 development, lkml

On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> Here is the small incremental cleanup patch.
> 
> Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.

Shouldn't we kill jbd_kmalloc instead?


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-18  9:04                               ` Christoph Hellwig
@ 2007-09-18 16:35                                 ` Mingming Cao
  2007-09-18 18:04                                   ` Dave Kleikamp
  0 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-09-18 16:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Badari Pulavarty, Christoph Lameter, linux-fsdevel,
	ext4 development, lkml

On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > Here is the small incremental cleanup patch.
> > 
> > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> 
> Shouldn't we kill jbd_kmalloc instead?
> 

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation (<page size) via kmalloc/kfree, so
in the future, if we need to change memory allocation in jbd (e.g. not
using kmalloc, or using a different flag), we don't need to touch every
place in the jbd code that calls jbd_kmalloc.

Regards,
Mingming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-18 16:35                                 ` Mingming Cao
@ 2007-09-18 18:04                                   ` Dave Kleikamp
  2007-09-19  1:00                                     ` Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Kleikamp @ 2007-09-18 18:04 UTC (permalink / raw)
  To: cmm
  Cc: Christoph Hellwig, Badari Pulavarty, Christoph Lameter,
	linux-fsdevel, ext4 development, lkml

On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > Here is the small incremental cleanup patch.
> > > 
> > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> > 
> > Shouldn't we kill jbd_kmalloc instead?
> > 
> 
> It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> places to handle memory (de)allocation (<page size) via kmalloc/kfree, so
> in the future, if we need to change memory allocation in jbd (e.g. not
> using kmalloc, or using a different flag), we don't need to touch every
> place in the jbd code that calls jbd_kmalloc.

I disagree.  Why would jbd need to globally change the way it allocates
memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
variety of structures.  Having to change one particular instance doesn't
necessarily mean we want to change all of them.  Adding unnecessary
wrappers only obfuscates the code, making it harder to understand.  You
wouldn't want every subsystem to have its own *_kmalloc() that took
different arguments.  Besides, there aren't that many calls to kmalloc
and kfree in the jbd code, so there wouldn't be much pain in changing
GFP flags or whatever, if it ever needed to be done.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-18 18:04                                   ` Dave Kleikamp
@ 2007-09-19  1:00                                     ` Mingming Cao
  2007-09-19  2:19                                       ` Andrew Morton
  0 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-09-19  1:00 UTC (permalink / raw)
  To: Dave Kleikamp, Andrew Morton
  Cc: Christoph Hellwig, Badari Pulavarty, Christoph Lameter,
	linux-fsdevel, ext4 development, lkml

On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
> On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
> > On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
> > > On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
> > > > Here is the small incremental cleanup patch.
> > > > 
> > > > Remove kmalloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_kmalloc.
> > > 
> > > Shouldn't we kill jbd_kmalloc instead?
> > > 
> > 
> > It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
> > places to handle memory (de)allocation (<page size) via kmalloc/kfree, so
> > in the future, if we need to change memory allocation in jbd (e.g. not
> > using kmalloc, or using a different flag), we don't need to touch every
> > place in the jbd code that calls jbd_kmalloc.
> 
> I disagree.  Why would jbd need to globally change the way it allocates
> memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
> variety of structures.  Having to change one particular instance doesn't
> necessarily mean we want to change all of them.  Adding unnecessary
> wrappers only obfuscates the code, making it harder to understand.  You
> wouldn't want every subsystem to have its own *_kmalloc() that took
> different arguments.  Besides, there aren't that many calls to kmalloc
> and kfree in the jbd code, so there wouldn't be much pain in changing
> GFP flags or whatever, if it ever needed to be done.
> 
> Shaggy

Okay, points taken. Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd entirely. This patch is intended to
replace the patch in the mm tree. Andrew, could you pick up this one
instead?

Thanks,

Mingming


jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter <clameter@sgi.com>

JBD: Replace slab allocations with page cache allocations

JBD allocates memory for committed_data and frozen_data from slab. However,
JBD should not pass slab pages down to the block layer. Use page allocator
pages instead. This will also prepare JBD for the large blocksize patchset.

This patch also cleans up jbd_kmalloc and replaces it with kmalloc directly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/jbd/commit.c       |    6 +--
 fs/jbd/journal.c      |   99 ++------------------------------------------------
 fs/jbd/transaction.c  |   12 +++---
 fs/jbd2/commit.c      |    6 +--
 fs/jbd2/journal.c     |   99 ++------------------------------------------------
 fs/jbd2/transaction.c |   18 ++++-----
 include/linux/jbd.h   |   18 +++++----
 include/linux/jbd2.h  |   21 +++++-----
 8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-18 17:51:21.000000000 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd_slab_free(tmp, bh_in->b_size);
+			jbd_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd_1k", "jbd_2k", "jbd_4k", NULL, "jbd_8k"
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary.
-	 */
-	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
-				slab_size, slab_size, 0, NULL);
-	if (!jbd_slab[i]) {
-		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-void * jbd_slab_alloc(size_t size, gfp_t flags)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd_slab_free(void *ptr,  size_t size)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *journal_head_cache;
@@ -1881,13 +1793,13 @@ static void __journal_remove_journal_hea
 				printk(KERN_WARNING "%s: freeing "
 						"b_frozen_data\n",
 						__FUNCTION__);
-				jbd_slab_free(jh->b_frozen_data, bh->b_size);
+				jbd_free(jh->b_frozen_data, bh->b_size);
 			}
 			if (jh->b_committed_data) {
 				printk(KERN_WARNING "%s: freeing "
 						"b_committed_data\n",
 						__FUNCTION__);
-				jbd_slab_free(jh->b_committed_data, bh->b_size);
+				jbd_free(jh->b_committed_data, bh->b_size);
 			}
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
@@ -2042,7 +1954,6 @@ static void journal_destroy_caches(void)
 	journal_destroy_revoke_caches();
 	journal_destroy_journal_head_cache();
 	journal_destroy_handle_cache();
-	journal_destroy_jbd_slabs();
 }
 
 static int __init journal_init(void)
Index: linux-2.6.23-rc6/include/linux/jbd.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd.h	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd.h	2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,16 @@ extern int journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd_slab_alloc(size_t size, gfp_t flags);
-extern void jbd_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd_kmalloc(__FUNCTION__, (size), (flags), 1)
+
+static inline void *jbd_alloc(size_t size, gfp_t flags)
+{
+	return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd_free(void *ptr, size_t size)
+{
+	free_pages((unsigned long)ptr, get_order(size));
+};
 
 #define JFS_MIN_JOURNAL_BLOCKS 1024
 
Index: linux-2.6.23-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h	2007-09-18 17:51:21.000000000 -0700
@@ -71,14 +71,15 @@ extern u8 jbd2_journal_enable_debug;
 #define jbd_debug(f, a...)	/**/
 #endif
 
-extern void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry);
-extern void * jbd2_slab_alloc(size_t size, gfp_t flags);
-extern void jbd2_slab_free(void *ptr, size_t size);
-
-#define jbd_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), journal_oom_retry)
-#define jbd_rep_kmalloc(size, flags) \
-	__jbd2_kmalloc(__FUNCTION__, (size), (flags), 1)
+static inline void *jbd2_alloc(size_t size, gfp_t flags)
+{
+	return (void *)__get_free_pages(flags, get_order(size));
+}
+
+static inline void jbd2_free(void *ptr, size_t size)
+{
+	free_pages((unsigned long)ptr, get_order(size));
+};
 
 #define JBD2_MIN_JOURNAL_BLOCKS 1024
 
@@ -959,12 +960,12 @@ void jbd2_journal_put_journal_head(struc
  */
 extern struct kmem_cache *jbd2_handle_cache;
 
-static inline handle_t *jbd_alloc_handle(gfp_t gfp_flags)
+static inline handle_t *jbd2_alloc_handle(gfp_t gfp_flags)
 {
 	return kmem_cache_alloc(jbd2_handle_cache, gfp_flags);
 }
 
-static inline void jbd_free_handle(handle_t *handle)
+static inline void jbd2_free_handle(handle_t *handle)
 {
 	kmem_cache_free(jbd2_handle_cache, handle);
 }
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-18 17:51:21.000000000 -0700
@@ -84,7 +84,6 @@ EXPORT_SYMBOL(jbd2_journal_force_commit)
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int jbd2_journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -335,10 +334,10 @@ repeat:
 		char *tmp;
 
 		jbd_unlock_bh_state(bh_in);
-		tmp = jbd2_slab_alloc(bh_in->b_size, GFP_NOFS);
+		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
 		jbd_lock_bh_state(bh_in);
 		if (jh_in->b_frozen_data) {
-			jbd2_slab_free(tmp, bh_in->b_size);
+			jbd2_free(tmp, bh_in->b_size);
 			goto repeat;
 		}
 
@@ -655,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -1096,13 +1095,6 @@ int jbd2_journal_load(journal_t *journal
 		}
 	}
 
-	/*
-	 * Create a slab for this blocksize
-	 */
-	err = jbd2_journal_create_jbd_slab(be32_to_cpu(sb->s_blocksize));
-	if (err)
-		return err;
-
 	/* Let the recovery code check whether it needs to recover any
 	 * data from the journal. */
 	if (jbd2_journal_recover(journal))
@@ -1627,86 +1619,6 @@ size_t journal_tag_bytes(journal_t *jour
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd2_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-	return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size >> 11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-	"jbd2_1k", "jbd2_2k", "jbd2_4k", NULL, "jbd2_8k"
-};
-
-static void jbd2_journal_destroy_jbd_slabs(void)
-{
-	int i;
-
-	for (i = 0; i < JBD_MAX_SLABS; i++) {
-		if (jbd_slab[i])
-			kmem_cache_destroy(jbd_slab[i]);
-		jbd_slab[i] = NULL;
-	}
-}
-
-static int jbd2_journal_create_jbd_slab(size_t slab_size)
-{
-	int i = JBD_SLAB_INDEX(slab_size);
-
-	BUG_ON(i >= JBD_MAX_SLABS);
-
-	/*
-	 * Check if we already have a slab created for this size
-	 */
-	if (jbd_slab[i])
-		return 0;
-
-	/*
-	 * Create a slab and force alignment to be same as slabsize -
-	 * this will make sure that allocations won't cross the page
-	 * boundary.
-	 */
-	jbd_slab[i] = kmem_cache_create(jbd_slab_names[i],
-				slab_size, slab_size, 0, NULL);
-	if (!jbd_slab[i]) {
-		printk(KERN_EMERG "JBD: no memory for jbd_slab cache\n");
-		return -ENOMEM;
-	}
-	return 0;
-}
-
-void * jbd2_slab_alloc(size_t size, gfp_t flags)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	return kmem_cache_alloc(jbd_slab[idx], flags | __GFP_NOFAIL);
-}
-
-void jbd2_slab_free(void *ptr,  size_t size)
-{
-	int idx;
-
-	idx = JBD_SLAB_INDEX(size);
-	BUG_ON(jbd_slab[idx] == NULL);
-	kmem_cache_free(jbd_slab[idx], ptr);
-}
-
-/*
  * Journal_head storage management
  */
 static struct kmem_cache *jbd2_journal_head_cache;
@@ -1893,13 +1805,13 @@ static void __journal_remove_journal_hea
 				printk(KERN_WARNING "%s: freeing "
 						"b_frozen_data\n",
 						__FUNCTION__);
-				jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+				jbd2_free(jh->b_frozen_data, bh->b_size);
 			}
 			if (jh->b_committed_data) {
 				printk(KERN_WARNING "%s: freeing "
 						"b_committed_data\n",
 						__FUNCTION__);
-				jbd2_slab_free(jh->b_committed_data, bh->b_size);
+				jbd2_free(jh->b_committed_data, bh->b_size);
 			}
 			bh->b_private = NULL;
 			jh->b_bh = NULL;	/* debug, really */
@@ -2040,7 +1952,6 @@ static void jbd2_journal_destroy_caches(
 	jbd2_journal_destroy_revoke_caches();
 	jbd2_journal_destroy_jbd2_journal_head_cache();
 	jbd2_journal_destroy_handle_cache();
-	jbd2_journal_destroy_jbd_slabs();
 }
 
 static int __init journal_init(void)
Index: linux-2.6.23-rc6/fs/jbd/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/commit.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/commit.c	2007-09-18 17:23:26.000000000 -0700
@@ -375,7 +375,7 @@ void journal_commit_transaction(journal_
 			struct buffer_head *bh = jh2bh(jh);
 
 			jbd_lock_bh_state(bh);
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -792,14 +792,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd_slab_free(jh->b_committed_data, bh->b_size);
+			jbd_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}
 
Index: linux-2.6.23-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/commit.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/commit.c	2007-09-18 17:23:26.000000000 -0700
@@ -384,7 +384,7 @@ void jbd2_journal_commit_transaction(jou
 			struct buffer_head *bh = jh2bh(jh);
 
 			jbd_lock_bh_state(bh);
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			jbd_unlock_bh_state(bh);
 		}
@@ -801,14 +801,14 @@ restart_loop:
 		 * Otherwise, we can just throw away the frozen data now.
 		 */
 		if (jh->b_committed_data) {
-			jbd2_slab_free(jh->b_committed_data, bh->b_size);
+			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
 			if (jh->b_frozen_data) {
 				jh->b_committed_data = jh->b_frozen_data;
 				jh->b_frozen_data = NULL;
 			}
 		} else if (jh->b_frozen_data) {
-			jbd2_slab_free(jh->b_frozen_data, bh->b_size);
+			jbd2_free(jh->b_frozen_data, bh->b_size);
 			jh->b_frozen_data = NULL;
 		}
 
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c	2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
@@ -668,7 +668,7 @@ repeat:
 				JBUFFER_TRACE(jh, "allocate memory for buffer");
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
-					jbd_slab_alloc(jh2bh(jh)->b_size,
+					jbd_alloc(jh2bh(jh)->b_size,
 							 GFP_NOFS);
 				if (!frozen_buffer) {
 					printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
 
 out:
 	if (unlikely(frozen_buffer))	/* It's usually NULL */
-		jbd_slab_free(frozen_buffer, bh->b_size);
+		jbd_free(frozen_buffer, bh->b_size);
 
 	JBUFFER_TRACE(jh, "exit");
 	return error;
@@ -881,7 +881,7 @@ int journal_get_undo_access(handle_t *ha
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS);
 		if (!committed_data) {
 			printk(KERN_EMERG "%s: No memory for committed data\n",
 				__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
 out:
 	journal_put_journal_head(jh);
 	if (unlikely(committed_data))
-		jbd_slab_free(committed_data, bh->b_size);
+		jbd_free(committed_data, bh->b_size);
 	return err;
 }
 
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c	2007-09-18 17:19:01.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c	2007-09-18 17:51:21.000000000 -0700
@@ -96,8 +96,8 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = jbd_kmalloc(sizeof(*new_transaction),
-						GFP_NOFS);
+		new_transaction = kmalloc(sizeof(*new_transaction),
+						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
@@ -236,7 +236,7 @@ out:
 /* Allocate a new handle.  This should probably be in a slab... */
 static handle_t *new_handle(int nblocks)
 {
-	handle_t *handle = jbd_alloc_handle(GFP_NOFS);
+	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
 	if (!handle)
 		return NULL;
 	memset(handle, 0, sizeof(*handle));
@@ -282,7 +282,7 @@ handle_t *jbd2_journal_start(journal_t *
 
 	err = start_this_handle(journal, handle);
 	if (err < 0) {
-		jbd_free_handle(handle);
+		jbd2_free_handle(handle);
 		current->journal_info = NULL;
 		handle = ERR_PTR(err);
 	}
@@ -668,7 +668,7 @@ repeat:
 				JBUFFER_TRACE(jh, "allocate memory for buffer");
 				jbd_unlock_bh_state(bh);
 				frozen_buffer =
-					jbd2_slab_alloc(jh2bh(jh)->b_size,
+					jbd2_alloc(jh2bh(jh)->b_size,
 							 GFP_NOFS);
 				if (!frozen_buffer) {
 					printk(KERN_EMERG
@@ -728,7 +728,7 @@ done:
 
 out:
 	if (unlikely(frozen_buffer))	/* It's usually NULL */
-		jbd2_slab_free(frozen_buffer, bh->b_size);
+		jbd2_free(frozen_buffer, bh->b_size);
 
 	JBUFFER_TRACE(jh, "exit");
 	return error;
@@ -881,7 +881,7 @@ int jbd2_journal_get_undo_access(handle_
 
 repeat:
 	if (!jh->b_committed_data) {
-		committed_data = jbd2_slab_alloc(jh2bh(jh)->b_size, GFP_NOFS);
+		committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
 		if (!committed_data) {
 			printk(KERN_EMERG "%s: No memory for committed data\n",
 				__FUNCTION__);
@@ -908,7 +908,7 @@ repeat:
 out:
 	jbd2_journal_put_journal_head(jh);
 	if (unlikely(committed_data))
-		jbd2_slab_free(committed_data, bh->b_size);
+		jbd2_free(committed_data, bh->b_size);
 	return err;
 }
 
@@ -1411,7 +1411,7 @@ int jbd2_journal_stop(handle_t *handle)
 		spin_unlock(&journal->j_state_lock);
 	}
 
-	jbd_free_handle(handle);
+	jbd2_free_handle(handle);
 	return err;
 }
 



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD slab cleanups
  2007-09-19  1:00                                     ` Mingming Cao
@ 2007-09-19  2:19                                       ` Andrew Morton
  2007-09-19 19:15                                         ` Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Andrew Morton @ 2007-09-19  2:19 UTC (permalink / raw)
  To: cmm
  Cc: Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
	Christoph Lameter, linux-fsdevel, ext4 development, lkml

On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <cmm@us.ibm.com> wrote:

> JBD: Replace slab allocations with page cache allocations
> 
> JBD allocate memory for committed_data and frozen_data from slab. However
> JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.
> 
> 
> Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

__GFP_NOFAIL should only be used when we have no way of recovering
from failure.  The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
there as a marker which says "we really shouldn't be doing this but
we don't know how to fix it").

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.


* Re: [PATCH] JBD slab cleanups
  2007-09-19  2:19                                       ` Andrew Morton
@ 2007-09-19 19:15                                         ` Mingming Cao
  2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
                                                             ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-19 19:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Kleikamp, Christoph Hellwig, Badari Pulavarty,
	Christoph Lameter, linux-fsdevel, ext4 development, lkml

On Tue, 2007-09-18 at 19:19 -0700, Andrew Morton wrote:
> On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > JBD: Replace slab allocations with page cache allocations
> > 
> > JBD allocate memory for committed_data and frozen_data from slab. However
> > JBD should not pass slab pages down to the block layer. Use page allocator pages instead. This will also prepare JBD for the large blocksize patchset.
> > 
> > 
> > Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly
> 
> __GFP_NOFAIL should only be used when we have no way of recovering
> from failure.  The allocation in journal_init_common() (at least)
> _can_ recover and hence really shouldn't be using __GFP_NOFAIL.
> 
> (Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
> there as a marker which says "we really shouldn't be doing this but
> we don't know how to fix it").
> 
> So sometime it'd be good if you could review all the __GFP_NOFAILs in
> there and see if we can remove some, thanks.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. In all
cases except one the caller handles memory allocation failure, so I got rid
of those __GFP_NOFAIL flags.

Also, shouldn't we use GFP_KERNEL instead of the GFP_NOFS flag for kmalloc
in jbd/jbd2? I will send a separate patch to clean that up.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c      |    2 +-
 fs/jbd/transaction.c  |    3 +--
 fs/jbd2/journal.c     |    2 +-
 fs/jbd2/transaction.c |    3 +--
 4 files changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:48:40.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c	2007-09-19 11:48:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c	2007-09-19 11:49:10.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
+		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 11:49:46.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c	2007-09-19 11:48:08.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c	2007-09-19 11:50:12.000000000 -0700
@@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
-						GFP_NOFS|__GFP_NOFAIL);
+		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
 		if (!new_transaction) {
 			ret = -ENOMEM;
 			goto out;






* [PATCH] JBD: use GFP_NOFS  in kmalloc
  2007-09-19 19:15                                         ` Mingming Cao
@ 2007-09-19 19:22                                           ` Mingming Cao
  2007-09-19 21:34                                             ` Andrew Morton
  2007-09-20  4:25                                             ` Andreas Dilger
  2007-09-19 19:26                                           ` [PATCH] JBD slab cleanups Dave Kleikamp
  2007-09-19 19:48                                           ` Andreas Dilger
  2 siblings, 2 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-19 19:22 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext4 development, lkml

Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of the kmalloc flags used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>

---
 fs/jbd/journal.c  |    6 +++---
 fs/jbd/revoke.c   |    8 ++++----
 fs/jbd2/journal.c |    6 +++---
 fs/jbd2/revoke.c  |    8 ++++----
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-19 11:52:34.000000000 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:52:48.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 11:53:12.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+	journal = kmalloc(sizeof(*journal), GFP_NOFS);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
 	journal->j_blocksize = blocksize;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
@@ -778,7 +778,7 @@ journal_t * jbd2_journal_init_inode (str
 	/* journal descriptor can store up to n blocks -bzzz */
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
 	journal->j_wbufsize = n;
-	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
 	if (!journal->j_wbuf) {
 		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
 			__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c	2007-09-19 11:52:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c	2007-09-19 11:53:32.000000000 -0700
@@ -207,7 +207,7 @@ int jbd2_journal_init_revoke(journal_t *
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -220,7 +220,7 @@ int jbd2_journal_init_revoke(journal_t *
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
 		journal->j_revoke = NULL;
@@ -230,7 +230,7 @@ int jbd2_journal_init_revoke(journal_t *
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(jbd2_revoke_table_cache, GFP_NOFS);
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);
@@ -247,7 +247,7 @@ int jbd2_journal_init_revoke(journal_t *
 	journal->j_revoke->hash_shift = shift;
 
 	journal->j_revoke->hash_table =
-		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
 	if (!journal->j_revoke->hash_table) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(jbd2_revoke_table_cache, journal->j_revoke_table[0]);




* Re: [PATCH] JBD slab cleanups
  2007-09-19 19:15                                         ` Mingming Cao
  2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
@ 2007-09-19 19:26                                           ` Dave Kleikamp
  2007-09-19 19:28                                             ` Dave Kleikamp
  2007-09-19 19:48                                           ` Andreas Dilger
  2 siblings, 1 reply; 124+ messages in thread
From: Dave Kleikamp @ 2007-09-19 19:26 UTC (permalink / raw)
  To: cmm
  Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
	Christoph Lameter, linux-fsdevel, ext4 development, lkml

On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:

> Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
> cases except one handles memory allocation failure so I get rid of those
> GFP_NOFAIL flags.
> 
> Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> in jbd/jbd2? I will send a separate patch to cleanup that.

No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
recursive calls back into the file system that could end up blocking on
jbd code.

Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH] JBD slab cleanups
  2007-09-19 19:26                                           ` [PATCH] JBD slab cleanups Dave Kleikamp
@ 2007-09-19 19:28                                             ` Dave Kleikamp
  2007-09-19 20:47                                               ` Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Dave Kleikamp @ 2007-09-19 19:28 UTC (permalink / raw)
  To: cmm
  Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
	Christoph Lameter, linux-fsdevel, ext4 development, lkml

On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> 
> > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. All
> > cases except one handle memory allocation failure, so I got rid of those
> > __GFP_NOFAIL flags.
> > 
> > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > in jbd/jbd2? I will send a separate patch to cleanup that.
> 
> No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
> recursive calls back into the file system that could end up blocking on
> jbd code.

Oh, I see your patch now.  You mean use GFP_NOFS instead of
GFP_KERNEL.  :-)  OK then.

> Shaggy
-- 
David Kleikamp
IBM Linux Technology Center



* Re: [PATCH] JBD slab cleanups
  2007-09-19 19:15                                         ` Mingming Cao
  2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
  2007-09-19 19:26                                           ` [PATCH] JBD slab cleanups Dave Kleikamp
@ 2007-09-19 19:48                                           ` Andreas Dilger
  2007-09-19 22:03                                             ` Mingming Cao
  2 siblings, 1 reply; 124+ messages in thread
From: Andreas Dilger @ 2007-09-19 19:48 UTC (permalink / raw)
  To: Mingming Cao
  Cc: Andrew Morton, Dave Kleikamp, Christoph Hellwig,
	Badari Pulavarty, Christoph Lameter, linux-fsdevel,
	ext4 development, lkml, Stephen C. Tweedie

On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
> @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
>  
>  alloc_transaction:
>  	if (!journal->j_running_transaction) {
> -		new_transaction = kmalloc(sizeof(*new_transaction),
> -						GFP_NOFS|__GFP_NOFAIL);
> +		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH] JBD slab cleanups
  2007-09-19 19:28                                             ` Dave Kleikamp
@ 2007-09-19 20:47                                               ` Mingming Cao
  0 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-19 20:47 UTC (permalink / raw)
  To: Dave Kleikamp
  Cc: Andrew Morton, Christoph Hellwig, Badari Pulavarty,
	Christoph Lameter, linux-fsdevel, ext4 development, lkml

On Wed, 2007-09-19 at 19:28 +0000, Dave Kleikamp wrote:
> On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
> > On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
> > 
> > > Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2. All
> > > cases except one handle memory allocation failure, so I got rid of those
> > > __GFP_NOFAIL flags.
> > > 
> > > Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
> > > in jbd/jbd2? I will send a separate patch to cleanup that.
> > 
> > No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
> > recursive calls back into the file system that could end up blocking on
> > jbd code.
> 
> Oh, I see your patch now.  You mean use GFP_NOFS instead of
> GFP_KERNEL.  :-)  OK then.
> 

Oops, I did mean what you said here. :-)

> > Shaggy



* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
  2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
@ 2007-09-19 21:34                                             ` Andrew Morton
  2007-09-19 21:55                                               ` Mingming Cao
  2007-09-20  4:25                                             ` Andreas Dilger
  1 sibling, 1 reply; 124+ messages in thread
From: Andrew Morton @ 2007-09-19 21:34 UTC (permalink / raw)
  To: cmm; +Cc: linux-fsdevel, ext4 development, lkml

On Wed, 19 Sep 2007 12:22:09 -0700
Mingming Cao <cmm@us.ibm.com> wrote:

> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
> 
> Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> 
> ---
>  fs/jbd/journal.c  |    6 +++---
>  fs/jbd/revoke.c   |    8 ++++----
>  fs/jbd2/journal.c |    6 +++---
>  fs/jbd2/revoke.c  |    8 ++++----
>  4 files changed, 14 insertions(+), 14 deletions(-)
> 
> Index: linux-2.6.23-rc6/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.000000000 -0700
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
>  	journal_t *journal;
>  	int err;
>  
> -	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> +	journal = kmalloc(sizeof(*journal), GFP_NOFS);
>  	if (!journal)
>  		goto fail;
>  	memset(journal, 0, sizeof(*journal));
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
>  	journal->j_blocksize = blocksize;
>  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
>  	journal->j_wbufsize = n;
> -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
>  	/* journal descriptor can store up to n blocks -bzzz */
>  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
>  	journal->j_wbufsize = n;
> -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
>  	if (!journal->j_wbuf) {
>  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
>  			__FUNCTION__);
> Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> ===================================================================
> --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.000000000 -0700
> +++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-19 11:52:34.000000000 -0700
> @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
>  	while((tmp >>= 1UL) != 0UL)
>  		shift++;
>  
> -	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> +	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
>  	if (!journal->j_revoke_table[0])
>  		return -ENOMEM;
>  	journal->j_revoke = journal->j_revoke_table[0];
> @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
>  	journal->j_revoke->hash_shift = shift;
>  
>  	journal->j_revoke->hash_table =
> -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
>  	if (!journal->j_revoke->hash_table) {
>  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
>  		journal->j_revoke = NULL;
> @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
>  	for (tmp = 0; tmp < hash_size; tmp++)
>  		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
>  
> -	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> +	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
>  	if (!journal->j_revoke_table[1]) {
>  		kfree(journal->j_revoke_table[0]->hash_table);
>  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
>  	journal->j_revoke->hash_shift = shift;
>  
>  	journal->j_revoke->hash_table =
> -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
>  	if (!journal->j_revoke->hash_table) {
>  		kfree(journal->j_revoke_table[0]->hash_table);
>  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);

These were all OK using GFP_KERNEL.

GFP_NOFS should only be used when the caller is holding some fs locks which
might cause a deadlock if that caller reentered the fs in ->writepage (and
maybe put_inode and such).  That isn't the case in any of the above code,
which is all mount time stuff (I think).

ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
a page locked, is holding i_mutex, etc.



* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
  2007-09-19 21:34                                             ` Andrew Morton
@ 2007-09-19 21:55                                               ` Mingming Cao
  0 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-19 21:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, ext4 development, lkml

On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
> On Wed, 19 Sep 2007 12:22:09 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> > with the rest of kmalloc flag used in the JBD/JBD2 layer.
> > 
> > Signed-off-by: Mingming Cao <cmm@us.ibm.com>
> > 
> > ---
> >  fs/jbd/journal.c  |    6 +++---
> >  fs/jbd/revoke.c   |    8 ++++----
> >  fs/jbd2/journal.c |    6 +++---
> >  fs/jbd2/revoke.c  |    8 ++++----
> >  4 files changed, 14 insertions(+), 14 deletions(-)
> > 
> > Index: linux-2.6.23-rc6/fs/jbd/journal.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:51:10.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 11:51:57.000000000 -0700
> > @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> >  	journal_t *journal;
> >  	int err;
> >  
> > -	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> > +	journal = kmalloc(sizeof(*journal), GFP_NOFS);
> >  	if (!journal)
> >  		goto fail;
> >  	memset(journal, 0, sizeof(*journal));
> > @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> >  	journal->j_blocksize = blocksize;
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> >  	/* journal descriptor can store up to n blocks -bzzz */
> >  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> >  	journal->j_wbufsize = n;
> > -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> > +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> >  	if (!journal->j_wbuf) {
> >  		printk(KERN_ERR "%s: Cant allocate bhs for commit thread\n",
> >  			__FUNCTION__);
> > Index: linux-2.6.23-rc6/fs/jbd/revoke.c
> > ===================================================================
> > --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c	2007-09-19 11:51:30.000000000 -0700
> > +++ linux-2.6.23-rc6/fs/jbd/revoke.c	2007-09-19 11:52:34.000000000 -0700
> > @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
> >  	while((tmp >>= 1UL) != 0UL)
> >  		shift++;
> >  
> > -	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[0])
> >  		return -ENOMEM;
> >  	journal->j_revoke = journal->j_revoke_table[0];
> > @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >  
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j_revoke->hash_table) {
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> >  		journal->j_revoke = NULL;
> > @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
> >  	for (tmp = 0; tmp < hash_size; tmp++)
> >  		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
> >  
> > -	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
> > +	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_NOFS);
> >  	if (!journal->j_revoke_table[1]) {
> >  		kfree(journal->j_revoke_table[0]->hash_table);
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> > @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
> >  	journal->j_revoke->hash_shift = shift;
> >  
> >  	journal->j_revoke->hash_table =
> > -		kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
> > +		kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
> >  	if (!journal->j_revoke->hash_table) {
> >  		kfree(journal->j_revoke_table[0]->hash_table);
> >  		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
> 
> These were all OK using GFP_KERNEL.
> 
> GFP_NOFS should only be used when the caller is holding some fs locks which
> might cause a deadlock if that caller reentered the fs in ->writepage (and
> maybe put_inode and such).  That isn't the case in any of the above code,
> which is all mount time stuff (I think).
> 

You are right, they all occur at initialization time.

> ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
> a page locked, is holding i_mutex, etc.
> 

Thanks for your feedback.

Mingming



* Re: [PATCH] JBD slab cleanups
  2007-09-19 19:48                                           ` Andreas Dilger
@ 2007-09-19 22:03                                             ` Mingming Cao
  2007-09-21 23:13                                               ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Mingming Cao
  0 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-09-19 22:03 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Andrew Morton, Dave Kleikamp, Christoph Hellwig,
	Badari Pulavarty, Christoph Lameter, linux-fsdevel,
	ext4 development, lkml, Stephen C. Tweedie

On Wed, 2007-09-19 at 13:48 -0600, Andreas Dilger wrote:
> On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
> > @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
> >  
> >  alloc_transaction:
> >  	if (!journal->j_running_transaction) {
> > -		new_transaction = kmalloc(sizeof(*new_transaction),
> > -						GFP_NOFS|__GFP_NOFAIL);
> > +		new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);
> 
> This should probably be a __GFP_NOFAIL if we are trying to start a new
> handle in truncate, as there is no way to propagate an error to the caller.
> 

Thanks, updated version.

Here is the patch to clean up the __GFP_NOFAIL flag in jbd/jbd2; in most
cases it is not needed.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/jbd/journal.c  |    2 +-
 fs/jbd2/journal.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-19 11:47:58.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-19 14:23:45.000000000 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-19 11:48:14.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-19 14:23:45.000000000 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
 	if (!journal)
 		goto fail;
 	memset(journal, 0, sizeof(*journal));




* Re: [PATCH] JBD: use GFP_NOFS in kmalloc
  2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
  2007-09-19 21:34                                             ` Andrew Morton
@ 2007-09-20  4:25                                             ` Andreas Dilger
  1 sibling, 0 replies; 124+ messages in thread
From: Andreas Dilger @ 2007-09-20  4:25 UTC (permalink / raw)
  To: Mingming Cao; +Cc: Andrew Morton, linux-fsdevel, ext4 development, lkml

On Sep 19, 2007  12:22 -0700, Mingming Cao wrote:
> Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
> with the rest of kmalloc flag used in the JBD/JBD2 layer.
> 
> @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
> -	journal = kmalloc(sizeof(*journal), GFP_KERNEL);
> +	journal = kmalloc(sizeof(*journal), GFP_NOFS);
> @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
> -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
> @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
> -	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
> +	journal->j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);

Is there a reason for this change except "it's in a filesystem, so it
should be GFP_NOFS"?  We are only doing journal setup during mount so
there shouldn't be any problem using GFP_KERNEL.  I don't think it will
inject any defect into the code, but I don't think it is needed either.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* [PATCH] JBD/ext34 cleanups: convert to kzalloc
  2007-09-19 22:03                                             ` Mingming Cao
@ 2007-09-21 23:13                                               ` Mingming Cao
  2007-09-21 23:32                                                 ` [PATCH] JBD2/ext4 naming cleanup Mingming Cao
  2007-09-26 19:54                                                 ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Andrew Morton
  0 siblings, 2 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-21 23:13 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ext4 development, lkml

Convert kmalloc to kzalloc() and get rid of the memset().

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/ext3/xattr.c       |    3 +--
 fs/ext4/xattr.c       |    3 +--
 fs/jbd/journal.c      |    3 +--
 fs/jbd/transaction.c  |    2 +-
 fs/jbd2/journal.c     |    3 +--
 fs/jbd2/transaction.c |    2 +-
 6 files changed, 6 insertions(+), 10 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c	2007-09-21 09:08:02.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c	2007-09-21 09:10:37.000000000 -0700
@@ -653,10 +653,9 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
-	memset(journal, 0, sizeof(*journal));
 
 	init_waitqueue_head(&journal->j_wait_transaction_locked);
 	init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd/transaction.c	2007-09-21 09:13:11.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd/transaction.c	2007-09-21 09:13:24.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
+		new_transaction = kzalloc(sizeof(*new_transaction),
 						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-21 09:10:53.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-21 09:11:13.000000000 -0700
@@ -654,10 +654,9 @@ static journal_t * journal_init_common (
 	journal_t *journal;
 	int err;
 
-	journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
+	journal = kzalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
 	if (!journal)
 		goto fail;
-	memset(journal, 0, sizeof(*journal));
 
 	init_waitqueue_head(&journal->j_wait_transaction_locked);
 	init_waitqueue_head(&journal->j_wait_logspace);
Index: linux-2.6.23-rc6/fs/jbd2/transaction.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/transaction.c	2007-09-21 09:12:46.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/transaction.c	2007-09-21 09:12:59.000000000 -0700
@@ -96,7 +96,7 @@ static int start_this_handle(journal_t *
 
 alloc_transaction:
 	if (!journal->j_running_transaction) {
-		new_transaction = kmalloc(sizeof(*new_transaction),
+		new_transaction = kzalloc(sizeof(*new_transaction),
 						GFP_NOFS|__GFP_NOFAIL);
 		if (!new_transaction) {
 			ret = -ENOMEM;
Index: linux-2.6.23-rc6/fs/ext3/xattr.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext3/xattr.c	2007-09-21 10:22:24.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext3/xattr.c	2007-09-21 10:24:19.000000000 -0700
@@ -741,12 +741,11 @@ ext3_xattr_block_set(handle_t *handle, s
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)
 			goto cleanup;
-		memset(s->base, 0, sb->s_blocksize);
 		header(s->base)->h_magic = cpu_to_le32(EXT3_XATTR_MAGIC);
 		header(s->base)->h_blocks = cpu_to_le32(1);
 		header(s->base)->h_refcount = cpu_to_le32(1);
Index: linux-2.6.23-rc6/fs/ext4/xattr.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/xattr.c	2007-09-21 10:20:21.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/xattr.c	2007-09-21 10:21:00.000000000 -0700
@@ -750,12 +750,11 @@ ext4_xattr_block_set(handle_t *handle, s
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kmalloc(sb->s_blocksize, GFP_KERNEL);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)
 			goto cleanup;
-		memset(s->base, 0, sb->s_blocksize);
 		header(s->base)->h_magic = cpu_to_le32(EXT4_XATTR_MAGIC);
 		header(s->base)->h_blocks = cpu_to_le32(1);
 		header(s->base)->h_refcount = cpu_to_le32(1);




* [PATCH] JBD2/ext4 naming cleanup
  2007-09-21 23:13                                               ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Mingming Cao
@ 2007-09-21 23:32                                                 ` Mingming Cao
  2007-09-26 19:54                                                 ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Andrew Morton
  1 sibling, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-21 23:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ext4 development, lkml

JBD2 naming cleanup

From: Mingming Cao <cmm@us.ibm.com>

Change macro names from JBD_XXX to JBD2_XXX in JBD2/ext4.

Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---
 fs/ext4/extents.c         |    2 +-
 fs/ext4/super.c           |    2 +-
 fs/jbd2/commit.c          |    2 +-
 fs/jbd2/journal.c         |    8 ++++----
 fs/jbd2/recovery.c        |    2 +-
 fs/jbd2/revoke.c          |    4 ++--
 include/linux/ext4_jbd2.h |    6 +++---
 include/linux/jbd2.h      |   30 +++++++++++++++---------------
 8 files changed, 28 insertions(+), 28 deletions(-)

Index: linux-2.6.23-rc6/fs/ext4/super.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/super.c	2007-09-21 16:27:31.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/super.c	2007-09-21 16:27:46.000000000 -0700
@@ -966,7 +966,7 @@ static int parse_options (char *options,
 			if (option < 0)
 				return 0;
 			if (option == 0)
-				option = JBD_DEFAULT_MAX_COMMIT_AGE;
+				option = JBD2_DEFAULT_MAX_COMMIT_AGE;
 			sbi->s_commit_interval = HZ * option;
 			break;
 		case Opt_data_journal:
Index: linux-2.6.23-rc6/include/linux/ext4_jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/ext4_jbd2.h	2007-09-10 19:50:29.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/ext4_jbd2.h	2007-09-21 16:27:46.000000000 -0700
@@ -12,8 +12,8 @@
  * Ext4-specific journaling extensions.
  */
 
-#ifndef _LINUX_EXT4_JBD_H
-#define _LINUX_EXT4_JBD_H
+#ifndef _LINUX_EXT4_JBD2_H
+#define _LINUX_EXT4_JBD2_H
 
 #include <linux/fs.h>
 #include <linux/jbd2.h>
@@ -228,4 +228,4 @@ static inline int ext4_should_writeback_
 	return 0;
 }
 
-#endif	/* _LINUX_EXT4_JBD_H */
+#endif	/* _LINUX_EXT4_JBD2_H */
Index: linux-2.6.23-rc6/include/linux/jbd2.h
===================================================================
--- linux-2.6.23-rc6.orig/include/linux/jbd2.h	2007-09-21 09:07:09.000000000 -0700
+++ linux-2.6.23-rc6/include/linux/jbd2.h	2007-09-21 16:27:46.000000000 -0700
@@ -13,8 +13,8 @@
  * filesystem journaling support.
  */
 
-#ifndef _LINUX_JBD_H
-#define _LINUX_JBD_H
+#ifndef _LINUX_JBD2_H
+#define _LINUX_JBD2_H
 
 /* Allow this file to be included directly into e2fsprogs */
 #ifndef __KERNEL__
@@ -37,26 +37,26 @@
 #define journal_oom_retry 1
 
 /*
- * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds
+ * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds
  * certain classes of error which can occur due to failed IOs.  Under
- * normal use we want ext3 to continue after such errors, because
+ * normal use we want ext4 to continue after such errors, because
  * hardware _can_ fail, but for debugging purposes when running tests on
  * known-good hardware we may want to trap these errors.
  */
-#undef JBD_PARANOID_IOFAIL
+#undef JBD2_PARANOID_IOFAIL
 
 /*
  * The default maximum commit age, in seconds.
  */
-#define JBD_DEFAULT_MAX_COMMIT_AGE 5
+#define JBD2_DEFAULT_MAX_COMMIT_AGE 5
 
 #ifdef CONFIG_JBD2_DEBUG
 /*
- * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
+ * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
  * consistency checks.  By default we don't do this unless
  * CONFIG_JBD2_DEBUG is on.
  */
-#define JBD_EXPENSIVE_CHECKING
+#define JBD2_EXPENSIVE_CHECKING
 extern u8 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...)						\
@@ -163,8 +163,8 @@ typedef struct journal_block_tag_s
 	__be32		t_blocknr_high; /* most-significant high 32bits. */
 } journal_block_tag_t;
 
-#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
-#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
 
 /*
  * The revoke descriptor: used on disk to describe a series of blocks to
@@ -256,8 +256,8 @@ typedef struct journal_superblock_s
 #include <linux/fs.h>
 #include <linux/sched.h>
 
-#define JBD_ASSERTIONS
-#ifdef JBD_ASSERTIONS
+#define JBD2_ASSERTIONS
+#ifdef JBD2_ASSERTIONS
 #define J_ASSERT(assert)						\
 do {									\
 	if (!(assert)) {						\
@@ -284,9 +284,9 @@ void buffer_assertion_failure(struct buf
 
 #else
 #define J_ASSERT(assert)	do { } while (0)
-#endif		/* JBD_ASSERTIONS */
+#endif		/* JBD2_ASSERTIONS */
 
-#if defined(JBD_PARANOID_IOFAIL)
+#if defined(JBD2_PARANOID_IOFAIL)
 #define J_EXPECT(expr, why...)		J_ASSERT(expr)
 #define J_EXPECT_BH(bh, expr, why...)	J_ASSERT_BH(bh, expr)
 #define J_EXPECT_JH(jh, expr, why...)	J_ASSERT_JH(jh, expr)
@@ -1104,4 +1104,4 @@ extern int jbd_blocks_per_page(struct in
 
 #endif	/* __KERNEL__ */
 
-#endif	/* _LINUX_JBD_H */
+#endif	/* _LINUX_JBD2_H */
Index: linux-2.6.23-rc6/fs/jbd2/commit.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/commit.c	2007-09-21 09:07:09.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/commit.c	2007-09-21 16:27:46.000000000 -0700
@@ -278,7 +278,7 @@ static inline void write_tag_block(int t
 				   unsigned long long block)
 {
 	tag->t_blocknr = cpu_to_be32(block & (u32)~0);
-	if (tag_bytes > JBD_TAG_SIZE32)
+	if (tag_bytes > JBD2_TAG_SIZE32)
 		tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
 }
 
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c	2007-09-21 16:25:46.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c	2007-09-21 16:27:46.000000000 -0700
@@ -670,7 +670,7 @@ static journal_t * journal_init_common (
 	spin_lock_init(&journal->j_list_lock);
 	spin_lock_init(&journal->j_state_lock);
 
-	journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
+	journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
 
 	/* The journal is marked for error until we succeed with recovery! */
 	journal->j_flags = JBD2_ABORT;
@@ -1612,9 +1612,9 @@ int jbd2_journal_blocks_per_page(struct 
 size_t journal_tag_bytes(journal_t *journal)
 {
 	if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
-		return JBD_TAG_SIZE64;
+		return JBD2_TAG_SIZE64;
 	else
-		return JBD_TAG_SIZE32;
+		return JBD2_TAG_SIZE32;
 }
 
 /*
@@ -1681,7 +1681,7 @@ static void journal_free_journal_head(st
 {
 #ifdef CONFIG_JBD2_DEBUG
 	atomic_dec(&nr_journal_heads);
-	memset(jh, JBD_POISON_FREE, sizeof(*jh));
+	memset(jh, JBD2_POISON_FREE, sizeof(*jh));
 #endif
 	kmem_cache_free(jbd2_journal_head_cache, jh);
 }
Index: linux-2.6.23-rc6/fs/jbd2/recovery.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/recovery.c	2007-09-21 09:07:05.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/recovery.c	2007-09-21 16:27:46.000000000 -0700
@@ -311,7 +311,7 @@ int jbd2_journal_skip_recovery(journal_t
 static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag_t *tag)
 {
 	unsigned long long block = be32_to_cpu(tag->t_blocknr);
-	if (tag_bytes > JBD_TAG_SIZE32)
+	if (tag_bytes > JBD2_TAG_SIZE32)
 		block |= (u64)be32_to_cpu(tag->t_blocknr_high) << 32;
 	return block;
 }
Index: linux-2.6.23-rc6/fs/jbd2/revoke.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/jbd2/revoke.c	2007-09-19 14:23:45.000000000 -0700
+++ linux-2.6.23-rc6/fs/jbd2/revoke.c	2007-09-21 16:27:46.000000000 -0700
@@ -352,7 +352,7 @@ int jbd2_journal_revoke(handle_t *handle
 		if (bh)
 			BUFFER_TRACE(bh, "found on hash");
 	}
-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
 	else {
 		struct buffer_head *bh2;
 
@@ -453,7 +453,7 @@ int jbd2_journal_cancel_revoke(handle_t 
 		}
 	}
 
-#ifdef JBD_EXPENSIVE_CHECKING
+#ifdef JBD2_EXPENSIVE_CHECKING
 	/* There better not be one left behind by now! */
 	record = find_revoke_record(journal, bh->b_blocknr);
 	J_ASSERT_JH(jh, record == NULL);
Index: linux-2.6.23-rc6/fs/ext4/extents.c
===================================================================
--- linux-2.6.23-rc6.orig/fs/ext4/extents.c	2007-09-21 09:07:04.000000000 -0700
+++ linux-2.6.23-rc6/fs/ext4/extents.c	2007-09-21 16:27:46.000000000 -0700
@@ -33,7 +33,7 @@
 #include <linux/fs.h>
 #include <linux/time.h>
 #include <linux/ext4_jbd2.h>
-#include <linux/jbd.h>
+#include <linux/jbd2.h>
 #include <linux/highuid.h>
 #include <linux/pagemap.h>
 #include <linux/quotaops.h>



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc
  2007-09-21 23:13                                               ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Mingming Cao
  2007-09-21 23:32                                                 ` [PATCH] JBD2/ext4 naming cleanup Mingming Cao
@ 2007-09-26 19:54                                                 ` Andrew Morton
  2007-09-26 21:05                                                   ` Mingming Cao
  1 sibling, 1 reply; 124+ messages in thread
From: Andrew Morton @ 2007-09-26 19:54 UTC (permalink / raw)
  To: cmm; +Cc: linux-ext4, linux-kernel

On Fri, 21 Sep 2007 16:13:56 -0700
Mingming Cao <cmm@us.ibm.com> wrote:

> Convert kmalloc to kzalloc() and get rid of the memset().

I split this into separate ext3/jbd and ext4/jbd2 patches.  It's generally
better to raise separate patches, please - the ext3 patches I'll merge
directly but the ext4 patches should go through (and be against) the ext4
devel tree.

I fixed lots of rejects against the already-pending changes to these
filesystems.

You forgot to remove the memsets in both start_this_handle()s.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [PATCH] JBD/ext34 cleanups: convert to kzalloc
  2007-09-26 19:54                                                 ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Andrew Morton
@ 2007-09-26 21:05                                                   ` Mingming Cao
  0 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-09-26 21:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-ext4, linux-kernel

On Wed, 2007-09-26 at 12:54 -0700, Andrew Morton wrote:
> On Fri, 21 Sep 2007 16:13:56 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > Convert kmalloc to kzalloc() and get rid of the memset().
> 
> I split this into separate ext3/jbd and ext4/jbd2 patches.  It's generally
> better to raise separate patches, please - the ext3 patches I'll merge
> directly but the ext4 patches should go through (and be against) the ext4
> devel tree.
> 
Sure. The patches (including ext3/jbd and ext4/jbd2) were already merged
into the ext4 devel tree; I will remove the ext3/jbd part from the ext4
devel tree.

> I fixed lots of rejects against the already-pending changes to these
> filesystems.
> 
> You forgot to remove the memsets in both start_this_handle()s.
> 
Thanks for catching this.

Mingming


^ permalink raw reply	[flat|nested] 124+ messages in thread

* [PATCH 1/2] ext4: Support large blocksize up to PAGESIZE
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (3 preceding siblings ...)
  2007-09-01  0:12       ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
@ 2007-10-02  0:34       ` Mingming Cao
  2007-10-02  0:35       ` [PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size Mingming Cao
                         ` (4 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:34 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, tytso

Support large blocksize up to PAGESIZE (max 64KB) for ext4. 

From: Takashi Sato <sho@tnes.nec.co.jp>

This patch set supports large block sizes (>4k, <=64k) in ext4,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext4 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is to represent a 64k rec_len
with an otherwise impossible on-disk value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
  [1/2]  ext4: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/2]  ext4: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext4dev filesystem and handle empty directory blocks.
Please consider these patches for merging into 2.6.24-rc1.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext4/super.c         |    5 +++++
 include/linux/ext4_fs.h |    4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 619db84..d8bb279 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1548,6 +1548,11 @@ static int ext4_fill_super (struct super_block *sb, void *data, int silent)
 		goto out_fail;
 	}
 
+	if (!sb_set_blocksize(sb, blocksize)) {
+		printk(KERN_ERR "EXT4-fs: bad blocksize %d.\n", blocksize);
+		goto out_fail;
+	}
+
 	/*
 	 * The ext4 superblock will not be buffer aligned for other than 1kB
 	 * block sizes.  We need to calculate the offset from buffer start.
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index f9881b6..d15a15e 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -77,8 +77,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT4_MIN_BLOCK_SIZE		1024
-#define	EXT4_MAX_BLOCK_SIZE		4096
-#define EXT4_MIN_BLOCK_LOG_SIZE		  10
+#define	EXT4_MAX_BLOCK_SIZE		65536
+#define EXT4_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT4_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (4 preceding siblings ...)
  2007-10-02  0:34       ` [PATCH 1/2] ext4: Support large blocksize up to PAGESIZE Mingming Cao
@ 2007-10-02  0:35       ` Mingming Cao
  2007-10-02  0:35       ` [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE Mingming Cao
                         ` (3 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:35 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, tytso

ext4: Avoid rec_len overflow with 64KB block size

From: Jan Kara <jack@suse.cz>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk. The patch also
converts some places to use ext4_next_entry() since we are changing them
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext4/dir.c           |   12 ++++---
 fs/ext4/namei.c         |   76 ++++++++++++++++++++++-------------------------
 include/linux/ext4_fs.h |   20 ++++++++++++
 3 files changed, 62 insertions(+), 46 deletions(-)


diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 3ab01c0..20b1e28 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -69,7 +69,7 @@ int ext4_check_dir_entry (const char * function, struct inode * dir,
 			  unsigned long offset)
 {
 	const char * error_msg = NULL;
-	const int rlen = le16_to_cpu(de->rec_len);
+	const int rlen = ext4_rec_len_from_disk(de->rec_len);
 
 	if (rlen < EXT4_DIR_REC_LEN(1))
 		error_msg = "rec_len is smaller than minimal";
@@ -176,10 +176,10 @@ revalidate:
 				 * least that it is non-zero.  A
 				 * failure will be detected in the
 				 * dirent test below. */
-				if (le16_to_cpu(de->rec_len) <
-						EXT4_DIR_REC_LEN(1))
+				if (ext4_rec_len_from_disk(de->rec_len)
+						< EXT4_DIR_REC_LEN(1))
 					break;
-				i += le16_to_cpu(de->rec_len);
+				i += ext4_rec_len_from_disk(de->rec_len);
 			}
 			offset = i;
 			filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
 				ret = stored;
 				goto out;
 			}
-			offset += le16_to_cpu(de->rec_len);
+			offset += ext4_rec_len_from_disk(de->rec_len);
 			if (le32_to_cpu(de->inode)) {
 				/* We might block in the next section
 				 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
 					goto revalidate;
 				stored ++;
 			}
-			filp->f_pos += le16_to_cpu(de->rec_len);
+			filp->f_pos += ext4_rec_len_from_disk(de->rec_len);
 		}
 		offset = 0;
 		brelse (bh);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5fdb862..96e8a85 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -281,7 +281,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext4_dir_ent
 			space += EXT4_DIR_REC_LEN(de->name_len);
 			names++;
 		}
-		de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		de = ext4_next_entry(de);
 	}
 	printk("(%i)\n", names);
 	return (struct stats) { names, space, 1 };
@@ -552,7 +552,8 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
  */
 static inline struct ext4_dir_entry_2 *ext4_next_entry(struct ext4_dir_entry_2 *p)
 {
-	return (struct ext4_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
+	return (struct ext4_dir_entry_2 *)((char*)p +
+		ext4_rec_len_from_disk(p->rec_len));
 }
 
 /*
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext4_dir_entry_2 *de, int size,
 			cond_resched();
 		}
 		/* XXX: do we need to check rec_len == 0 case? -Chris */
-		de = (struct ext4_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		de = ext4_next_entry(de);
 	}
 	return count;
 }
@@ -823,7 +824,7 @@ static inline int search_dirblock(struct buffer_head * bh,
 			return 1;
 		}
 		/* prevent looping on a bad block */
-		de_len = le16_to_cpu(de->rec_len);
+		de_len = ext4_rec_len_from_disk(de->rec_len);
 		if (de_len <= 0)
 			return -1;
 		offset += de_len;
@@ -1136,7 +1137,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
 		rec_len = EXT4_DIR_REC_LEN(de->name_len);
 		memcpy (to, de, rec_len);
 		((struct ext4_dir_entry_2 *) to)->rec_len =
-				cpu_to_le16(rec_len);
+				ext4_rec_len_to_disk(rec_len);
 		de->inode = 0;
 		map++;
 		to += rec_len;
@@ -1155,13 +1156,12 @@ static struct ext4_dir_entry_2* dx_pack_dirents(char *base, int size)
 
 	prev = to = de;
 	while ((char*)de < base + size) {
-		next = (struct ext4_dir_entry_2 *) ((char *) de +
-						    le16_to_cpu(de->rec_len));
+		next = ext4_next_entry(de);
 		if (de->inode && de->name_len) {
 			rec_len = EXT4_DIR_REC_LEN(de->name_len);
 			if (de > to)
 				memmove(to, de, rec_len);
-			to->rec_len = cpu_to_le16(rec_len);
+			to->rec_len = ext4_rec_len_to_disk(rec_len);
 			prev = to;
 			to = (struct ext4_dir_entry_2 *) (((char *) to) + rec_len);
 		}
@@ -1235,8 +1235,8 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 	/* Fancy dance to stay within two buffers */
 	de2 = dx_move_dirents(data1, data2, map + split, count - split);
 	de = dx_pack_dirents(data1,blocksize);
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
-	de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
+	de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2);
 	dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
 	dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
 
@@ -1307,7 +1307,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
 				return -EEXIST;
 			}
 			nlen = EXT4_DIR_REC_LEN(de->name_len);
-			rlen = le16_to_cpu(de->rec_len);
+			rlen = ext4_rec_len_from_disk(de->rec_len);
 			if ((de->inode? rlen - nlen: rlen) >= reclen)
 				break;
 			de = (struct ext4_dir_entry_2 *)((char *)de + rlen);
@@ -1326,11 +1326,11 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
 
 	/* By now the buffer is marked for journaling */
 	nlen = EXT4_DIR_REC_LEN(de->name_len);
-	rlen = le16_to_cpu(de->rec_len);
+	rlen = ext4_rec_len_from_disk(de->rec_len);
 	if (de->inode) {
 		struct ext4_dir_entry_2 *de1 = (struct ext4_dir_entry_2 *)((char *)de + nlen);
-		de1->rec_len = cpu_to_le16(rlen - nlen);
-		de->rec_len = cpu_to_le16(nlen);
+		de1->rec_len = ext4_rec_len_to_disk(rlen - nlen);
+		de->rec_len = ext4_rec_len_to_disk(nlen);
 		de = de1;
 	}
 	de->file_type = EXT4_FT_UNKNOWN;
@@ -1408,17 +1408,18 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
 
 	/* The 0th block becomes the root, move the dirents out */
 	fde = &root->dotdot;
-	de = (struct ext4_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
+	de = (struct ext4_dir_entry_2 *)((char *)fde +
+		ext4_rec_len_from_disk(fde->rec_len));
 	len = ((char *) root) + blocksize - (char *) de;
 	memcpy (data1, de, len);
 	de = (struct ext4_dir_entry_2 *) data1;
 	top = data1 + len;
-	while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
+	while ((char *)(de2 = ext4_next_entry(de)) < top)
 		de = de2;
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de);
 	/* Initialize the root; the dot dirents already exist */
 	de = (struct ext4_dir_entry_2 *) (&root->dotdot);
-	de->rec_len = cpu_to_le16(blocksize - EXT4_DIR_REC_LEN(2));
+	de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2));
 	memset (&root->info, 0, sizeof(root->info));
 	root->info.info_length = sizeof(root->info);
 	root->info.hash_version = EXT4_SB(dir->i_sb)->s_def_hash_version;
@@ -1505,7 +1506,7 @@ static int ext4_add_entry (handle_t *handle, struct dentry *dentry,
 		return retval;
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
 	de->inode = 0;
-	de->rec_len = cpu_to_le16(blocksize);
+	de->rec_len = ext4_rec_len_to_disk(blocksize);
 	return add_dirent_to_buf(handle, dentry, inode, de, bh);
 }
 
@@ -1569,7 +1570,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
 			goto cleanup;
 		node2 = (struct dx_node *)(bh2->b_data);
 		entries2 = node2->entries;
-		node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		node2->fake.rec_len = ext4_rec_len_to_disk(sb->s_blocksize);
 		node2->fake.inode = 0;
 		BUFFER_TRACE(frame->bh, "get_write_access");
 		err = ext4_journal_get_write_access(handle, frame->bh);
@@ -1668,9 +1669,9 @@ static int ext4_delete_entry (handle_t *handle,
 			BUFFER_TRACE(bh, "get_write_access");
 			ext4_journal_get_write_access(handle, bh);
 			if (pde)
-				pde->rec_len =
-					cpu_to_le16(le16_to_cpu(pde->rec_len) +
-						    le16_to_cpu(de->rec_len));
+				pde->rec_len = ext4_rec_len_to_disk(
+					ext4_rec_len_from_disk(pde->rec_len) +
+					ext4_rec_len_from_disk(de->rec_len));
 			else
 				de->inode = 0;
 			dir->i_version++;
@@ -1678,10 +1679,9 @@ static int ext4_delete_entry (handle_t *handle,
 			ext4_journal_dirty_metadata(handle, bh);
 			return 0;
 		}
-		i += le16_to_cpu(de->rec_len);
+		i += ext4_rec_len_from_disk(de->rec_len);
 		pde = de;
-		de = (struct ext4_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+		de = ext4_next_entry(de);
 	}
 	return -ENOENT;
 }
@@ -1844,13 +1844,12 @@ retry:
 	de = (struct ext4_dir_entry_2 *) dir_block->b_data;
 	de->inode = cpu_to_le32(inode->i_ino);
 	de->name_len = 1;
-	de->rec_len = cpu_to_le16(EXT4_DIR_REC_LEN(de->name_len));
+	de->rec_len = ext4_rec_len_to_disk(EXT4_DIR_REC_LEN(de->name_len));
 	strcpy (de->name, ".");
 	ext4_set_de_type(dir->i_sb, de, S_IFDIR);
-	de = (struct ext4_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+	de = ext4_next_entry(de);
 	de->inode = cpu_to_le32(dir->i_ino);
-	de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
+	de->rec_len = ext4_rec_len_to_disk(inode->i_sb->s_blocksize-EXT4_DIR_REC_LEN(1));
 	de->name_len = 2;
 	strcpy (de->name, "..");
 	ext4_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1902,8 +1901,7 @@ static int empty_dir (struct inode * inode)
 		return 1;
 	}
 	de = (struct ext4_dir_entry_2 *) bh->b_data;
-	de1 = (struct ext4_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+	de1 = ext4_next_entry(de);
 	if (le32_to_cpu(de->inode) != inode->i_ino ||
 			!le32_to_cpu(de1->inode) ||
 			strcmp (".", de->name) ||
@@ -1914,9 +1912,9 @@ static int empty_dir (struct inode * inode)
 		brelse (bh);
 		return 1;
 	}
-	offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
-	de = (struct ext4_dir_entry_2 *)
-			((char *) de1 + le16_to_cpu(de1->rec_len));
+	offset = ext4_rec_len_from_disk(de->rec_len) +
+		 ext4_rec_len_from_disk(de1->rec_len);
+	de = ext4_next_entry(de1);
 	while (offset < inode->i_size ) {
 		if (!bh ||
 			(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
@@ -1945,9 +1943,8 @@ static int empty_dir (struct inode * inode)
 			brelse (bh);
 			return 0;
 		}
-		offset += le16_to_cpu(de->rec_len);
-		de = (struct ext4_dir_entry_2 *)
-				((char *) de + le16_to_cpu(de->rec_len));
+		offset += ext4_rec_len_from_disk(de->rec_len);
+		de = ext4_next_entry(de);
 	}
 	brelse (bh);
 	return 1;
@@ -2302,8 +2299,7 @@ retry:
 }
 
 #define PARENT_INO(buffer) \
-	((struct ext4_dir_entry_2 *) ((char *) buffer + \
-	le16_to_cpu(((struct ext4_dir_entry_2 *) buffer)->rec_len)))->inode
+	(ext4_next_entry((struct ext4_dir_entry_2 *)(buffer))->inode)
 
 /*
  * Anybody can rename anything with this: the permission checks are left to the
diff --git a/include/linux/ext4_fs.h b/include/linux/ext4_fs.h
index d15a15e..e1caf0a 100644
--- a/include/linux/ext4_fs.h
+++ b/include/linux/ext4_fs.h
@@ -771,6 +771,26 @@ struct ext4_dir_entry_2 {
 #define EXT4_DIR_ROUND			(EXT4_DIR_PAD - 1)
 #define EXT4_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT4_DIR_ROUND) & \
 					 ~EXT4_DIR_ROUND)
+#define EXT4_MAX_REC_LEN		((1<<16)-1)
+
+static inline unsigned ext4_rec_len_from_disk(__le16 dlen)
+{
+	unsigned len = le16_to_cpu(dlen);
+
+	if (len == EXT4_MAX_REC_LEN)
+		return 1 << 16;
+	return len;
+}
+
+static inline __le16 ext4_rec_len_to_disk(unsigned len)
+{
+	if (len == (1 << 16))
+		return cpu_to_le16(EXT4_MAX_REC_LEN);
+	else if (len > (1 << 16))
+		BUG();
+	return cpu_to_le16(len);
+}
+
 /*
  * Hash Tree Directory indexing
  * (c) Daniel Phillips, 2001



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 1/2] ext2:  Support large blocksize up to PAGESIZE
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (5 preceding siblings ...)
  2007-10-02  0:35       ` [PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size Mingming Cao
@ 2007-10-02  0:35       ` Mingming Cao
  2007-10-02  0:35       ` [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size Mingming Cao
                         ` (2 subsequent siblings)
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:35 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, akpm

Support large blocksize up to PAGESIZE (max 64KB) for ext2

From: Takashi Sato <sho@tnes.nec.co.jp>

This patch set supports large block sizes (>4k, <=64k) in ext2,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext2 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is to represent a 64k rec_len
with an otherwise impossible on-disk value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
  [1/2]  ext2: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/2]  ext2: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext2 filesystem and handle empty directory blocks.

Please consider including it in the mm tree.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext2/super.c         |    3 ++-
 include/linux/ext2_fs.h |    4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)


diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 639a32c..765c805 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -775,7 +775,8 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		brelse(bh);
 
 		if (!sb_set_blocksize(sb, blocksize)) {
-			printk(KERN_ERR "EXT2-fs: blocksize too small for device.\n");
+			printk(KERN_ERR "EXT2-fs: bad blocksize %d.\n",
+				blocksize);
 			goto failed_sbi;
 		}
 
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 153d755..910a705 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -86,8 +86,8 @@ static inline struct ext2_sb_info *EXT2_SB(struct super_block *sb)
  * Macro-instructions used to manage several block sizes
  */
 #define EXT2_MIN_BLOCK_SIZE		1024
-#define	EXT2_MAX_BLOCK_SIZE		4096
-#define EXT2_MIN_BLOCK_LOG_SIZE		  10
+#define EXT2_MAX_BLOCK_SIZE		65536
+#define EXT2_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT2_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (6 preceding siblings ...)
  2007-10-02  0:35       ` [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE Mingming Cao
@ 2007-10-02  0:35       ` Mingming Cao
  2007-10-04 20:12         ` Andrew Morton
  2007-10-02  0:36       ` [PATCH 1/2] ext3: Support large blocksize up to PAGESIZE Mingming Cao
  2007-10-02  0:36       ` [PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size Mingming Cao
  9 siblings, 1 reply; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:35 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, akpm

ext2: Avoid rec_len overflow with 64KB block size

From: Jan Kara <jack@suse.cz>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext2/dir.c           |   43 +++++++++++++++++++++++++++++++------------
 include/linux/ext2_fs.h |    1 +
 2 files changed, 32 insertions(+), 12 deletions(-)


diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 2bf49d7..1329bdb 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -26,6 +26,24 @@
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+	unsigned len = le16_to_cpu(dlen);
+
+	if (len == EXT2_MAX_REC_LEN)
+		return 1 << 16;
+	return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+	if (len == (1 << 16))
+		return cpu_to_le16(EXT2_MAX_REC_LEN);
+	else if (len > (1 << 16))
+		BUG();
+	return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -95,7 +113,7 @@ static void ext2_check_page(struct page *page)
 	}
 	for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
 		p = (ext2_dirent *)(kaddr + offs);
-		rec_len = le16_to_cpu(p->rec_len);
+		rec_len = ext2_rec_len_from_disk(p->rec_len);
 
 		if (rec_len < EXT2_DIR_REC_LEN(1))
 			goto Eshort;
@@ -193,7 +211,8 @@ static inline int ext2_match (int len, const char * const name,
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-	return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+	return (ext2_dirent *)((char*)p +
+			ext2_rec_len_from_disk(p->rec_len));
 }
 
 static inline unsigned 
@@ -305,7 +324,7 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 					return 0;
 				}
 			}
-			filp->f_pos += le16_to_cpu(de->rec_len);
+			filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
 		}
 		ext2_put_page(page);
 	}
@@ -413,7 +432,7 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 			struct page *page, struct inode *inode)
 {
 	unsigned from = (char *) de - (char *) page_address(page);
-	unsigned to = from + le16_to_cpu(de->rec_len);
+	unsigned to = from + ext2_rec_len_from_disk(de->rec_len);
 	int err;
 
 	lock_page(page);
@@ -469,7 +488,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
 				/* We hit i_size */
 				name_len = 0;
 				rec_len = chunk_size;
-				de->rec_len = cpu_to_le16(chunk_size);
+				de->rec_len = ext2_rec_len_to_disk(chunk_size);
 				de->inode = 0;
 				goto got_it;
 			}
@@ -483,7 +502,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
 			if (ext2_match (namelen, name, de))
 				goto out_unlock;
 			name_len = EXT2_DIR_REC_LEN(de->name_len);
-			rec_len = le16_to_cpu(de->rec_len);
+			rec_len = ext2_rec_len_from_disk(de->rec_len);
 			if (!de->inode && rec_len >= reclen)
 				goto got_it;
 			if (rec_len >= name_len + reclen)
@@ -504,8 +523,8 @@ got_it:
 		goto out_unlock;
 	if (de->inode) {
 		ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-		de1->rec_len = cpu_to_le16(rec_len - name_len);
-		de->rec_len = cpu_to_le16(name_len);
+		de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+		de->rec_len = ext2_rec_len_to_disk(name_len);
 		de = de1;
 	}
 	de->name_len = namelen;
@@ -536,7 +555,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
 	struct inode *inode = mapping->host;
 	char *kaddr = page_address(page);
 	unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
-	unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+	unsigned to = ((char*)dir - kaddr) + ext2_rec_len_from_disk(dir->rec_len);
 	ext2_dirent * pde = NULL;
 	ext2_dirent * de = (ext2_dirent *) (kaddr + from);
 	int err;
@@ -557,7 +576,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
 	err = mapping->a_ops->prepare_write(NULL, page, from, to);
 	BUG_ON(err);
 	if (pde)
-		pde->rec_len = cpu_to_le16(to-from);
+		pde->rec_len = ext2_rec_len_to_disk(to-from);
 	dir->inode = 0;
 	err = ext2_commit_chunk(page, from, to);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
@@ -591,14 +610,14 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
 	memset(kaddr, 0, chunk_size);
 	de = (struct ext2_dir_entry_2 *)kaddr;
 	de->name_len = 1;
-	de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1));
+	de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
 	memcpy (de->name, ".\0\0", 4);
 	de->inode = cpu_to_le32(inode->i_ino);
 	ext2_set_de_type (de, inode);
 
 	de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
 	de->name_len = 2;
-	de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+	de->rec_len = ext2_rec_len_to_disk(chunk_size - EXT2_DIR_REC_LEN(1));
 	de->inode = cpu_to_le32(parent->i_ino);
 	memcpy (de->name, "..\0", 4);
 	ext2_set_de_type (de, inode);
diff --git a/include/linux/ext2_fs.h b/include/linux/ext2_fs.h
index 910a705..41063d5 100644
--- a/include/linux/ext2_fs.h
+++ b/include/linux/ext2_fs.h
@@ -557,5 +557,6 @@ enum {
 #define EXT2_DIR_ROUND 			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND) & \
 					 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN		((1<<16)-1)
 
 #endif	/* _LINUX_EXT2_FS_H */



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 1/2] ext3: Support large blocksize up to PAGESIZE
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (7 preceding siblings ...)
  2007-10-02  0:35       ` [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size Mingming Cao
@ 2007-10-02  0:36       ` Mingming Cao
  2007-10-02  0:36       ` [PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size Mingming Cao
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:36 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, akpm

Support large blocksize up to PAGESIZE (max 64KB) for ext3

From: Takashi Sato <sho@tnes.nec.co.jp>

This patch set supports large block sizes (>4k, <=64k) in ext3,
just enlarging the block size limit. But it is NOT possible to have a 64kB
blocksize on ext3 without some changes to the directory handling
code.  The reason is that an empty 64kB directory block would have
rec_len == (__u16)2^16 == 0, and this would cause an error to be hit in
the filesystem.  The proposed solution is to represent a 64k rec_len
with an otherwise impossible on-disk value, rec_len = 0xffff.

The Patch-set consists of the following 2 patches.
  [1/2]  ext3: enlarge blocksize
         - Allow blocksize up to pagesize

  [2/2]  ext3: fix rec_len overflow
         - prevent rec_len from overflow with 64KB blocksize

With this patch set, a ppc64 box with 64k pages can create a 64k
block size ext3 filesystem and handle empty directory blocks.

Signed-off-by: Takashi Sato <sho@tnes.nec.co.jp>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext3/super.c         |    6 +++++-
 include/linux/ext3_fs.h |    4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)


diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 9537316..b4bfd36 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1549,7 +1549,11 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
 		}
 
 		brelse (bh);
-		sb_set_blocksize(sb, blocksize);
+		if (!sb_set_blocksize(sb, blocksize)) {
+			printk(KERN_ERR "EXT3-fs: bad blocksize %d.\n",
+				blocksize);
+			goto out_fail;
+		}
 		logic_sb_block = (sb_block * EXT3_MIN_BLOCK_SIZE) / blocksize;
 		offset = (sb_block * EXT3_MIN_BLOCK_SIZE) % blocksize;
 		bh = sb_bread(sb, logic_sb_block);
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index ece49a8..7aa5556 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -76,8 +76,8 @@
  * Macro-instructions used to manage several block sizes
  */
 #define EXT3_MIN_BLOCK_SIZE		1024
-#define	EXT3_MAX_BLOCK_SIZE		4096
-#define EXT3_MIN_BLOCK_LOG_SIZE		  10
+#define	EXT3_MAX_BLOCK_SIZE		65536
+#define EXT3_MIN_BLOCK_LOG_SIZE		10
 #ifdef __KERNEL__
 # define EXT3_BLOCK_SIZE(s)		((s)->s_blocksize)
 #else



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size
  2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
                         ` (8 preceding siblings ...)
  2007-10-02  0:36       ` [PATCH 1/2] ext3: Support large blocksize up to PAGESIZE Mingming Cao
@ 2007-10-02  0:36       ` Mingming Cao
  9 siblings, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-02  0:36 UTC (permalink / raw)
  To: ext4 development, linux-kernel; +Cc: sho, Jan Kara, clameter, akpm

ext3: Avoid rec_len overflow with 64KB block size

From: Jan Kara <jack@suse.cz>

With a 64KB blocksize, a directory entry can have size 64KB, which does not
fit into the 16 bits we have for the entry length. So we store 0xffff instead
and convert the value when it is read from / written to disk. The patch also
converts some places to use ext3_next_entry() since we are changing them
anyway.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
---

 fs/ext3/dir.c           |   10 +++--
 fs/ext3/namei.c         |   90 ++++++++++++++++++++++-------------------------
 include/linux/ext3_fs.h |   20 ++++++++++
 3 files changed, 68 insertions(+), 52 deletions(-)


diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c00723a..3c4c43a 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -69,7 +69,7 @@ int ext3_check_dir_entry (const char * function, struct inode * dir,
 			  unsigned long offset)
 {
 	const char * error_msg = NULL;
-	const int rlen = le16_to_cpu(de->rec_len);
+	const int rlen = ext3_rec_len_from_disk(de->rec_len);
 
 	if (rlen < EXT3_DIR_REC_LEN(1))
 		error_msg = "rec_len is smaller than minimal";
@@ -177,10 +177,10 @@ revalidate:
 				 * least that it is non-zero.  A
 				 * failure will be detected in the
 				 * dirent test below. */
-				if (le16_to_cpu(de->rec_len) <
+				if (ext3_rec_len_from_disk(de->rec_len) <
 						EXT3_DIR_REC_LEN(1))
 					break;
-				i += le16_to_cpu(de->rec_len);
+				i += ext3_rec_len_from_disk(de->rec_len);
 			}
 			offset = i;
 			filp->f_pos = (filp->f_pos & ~(sb->s_blocksize - 1))
@@ -201,7 +201,7 @@ revalidate:
 				ret = stored;
 				goto out;
 			}
-			offset += le16_to_cpu(de->rec_len);
+			offset += ext3_rec_len_from_disk(de->rec_len);
 			if (le32_to_cpu(de->inode)) {
 				/* We might block in the next section
 				 * if the data destination is
@@ -223,7 +223,7 @@ revalidate:
 					goto revalidate;
 				stored ++;
 			}
-			filp->f_pos += le16_to_cpu(de->rec_len);
+			filp->f_pos += ext3_rec_len_from_disk(de->rec_len);
 		}
 		offset = 0;
 		brelse (bh);
diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index c1fa190..2c38eb6 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -144,6 +144,15 @@ struct dx_map_entry
 	u16 size;
 };
 
+/*
+ * p is at least 6 bytes before the end of page
+ */
+static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
+{
+	return (struct ext3_dir_entry_2 *)((char*)p +
+		ext3_rec_len_from_disk(p->rec_len));
+}
+
 #ifdef CONFIG_EXT3_INDEX
 static inline unsigned dx_get_block (struct dx_entry *entry);
 static void dx_set_block (struct dx_entry *entry, unsigned value);
@@ -281,7 +290,7 @@ static struct stats dx_show_leaf(struct dx_hash_info *hinfo, struct ext3_dir_ent
 			space += EXT3_DIR_REC_LEN(de->name_len);
 			names++;
 		}
-		de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		de = ext3_next_entry(de);
 	}
 	printk("(%i)\n", names);
 	return (struct stats) { names, space, 1 };
@@ -548,14 +557,6 @@ static int ext3_htree_next_block(struct inode *dir, __u32 hash,
 
 
 /*
- * p is at least 6 bytes before the end of page
- */
-static inline struct ext3_dir_entry_2 *ext3_next_entry(struct ext3_dir_entry_2 *p)
-{
-	return (struct ext3_dir_entry_2 *)((char*)p + le16_to_cpu(p->rec_len));
-}
-
-/*
  * This function fills a red-black tree with information from a
  * directory block.  It returns the number directory entries loaded
  * into the tree.  If there is an error it is returned in err.
@@ -721,7 +722,7 @@ static int dx_make_map (struct ext3_dir_entry_2 *de, int size,
 			cond_resched();
 		}
 		/* XXX: do we need to check rec_len == 0 case? -Chris */
-		de = (struct ext3_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+		de = ext3_next_entry(de);
 	}
 	return count;
 }
@@ -825,7 +826,7 @@ static inline int search_dirblock(struct buffer_head * bh,
 			return 1;
 		}
 		/* prevent looping on a bad block */
-		de_len = le16_to_cpu(de->rec_len);
+		de_len = ext3_rec_len_from_disk(de->rec_len);
 		if (de_len <= 0)
 			return -1;
 		offset += de_len;
@@ -1138,7 +1139,7 @@ dx_move_dirents(char *from, char *to, struct dx_map_entry *map, int count)
 		rec_len = EXT3_DIR_REC_LEN(de->name_len);
 		memcpy (to, de, rec_len);
 		((struct ext3_dir_entry_2 *) to)->rec_len =
-				cpu_to_le16(rec_len);
+				ext3_rec_len_to_disk(rec_len);
 		de->inode = 0;
 		map++;
 		to += rec_len;
@@ -1157,13 +1158,12 @@ static struct ext3_dir_entry_2* dx_pack_dirents(char *base, int size)
 
 	prev = to = de;
 	while ((char*)de < base + size) {
-		next = (struct ext3_dir_entry_2 *) ((char *) de +
-						    le16_to_cpu(de->rec_len));
+		next = ext3_next_entry(de);
 		if (de->inode && de->name_len) {
 			rec_len = EXT3_DIR_REC_LEN(de->name_len);
 			if (de > to)
 				memmove(to, de, rec_len);
-			to->rec_len = cpu_to_le16(rec_len);
+			to->rec_len = ext3_rec_len_to_disk(rec_len);
 			prev = to;
 			to = (struct ext3_dir_entry_2 *) (((char *) to) + rec_len);
 		}
@@ -1237,8 +1237,8 @@ static struct ext3_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
 	/* Fancy dance to stay within two buffers */
 	de2 = dx_move_dirents(data1, data2, map + split, count - split);
 	de = dx_pack_dirents(data1,blocksize);
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
-	de2->rec_len = cpu_to_le16(data2 + blocksize - (char *) de2);
+	de->rec_len = ext3_rec_len_to_disk(data1 + blocksize - (char *) de);
+	de2->rec_len = ext3_rec_len_to_disk(data2 + blocksize - (char *) de2);
 	dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data1, blocksize, 1));
 	dxtrace(dx_show_leaf (hinfo, (struct ext3_dir_entry_2 *) data2, blocksize, 1));
 
@@ -1309,7 +1309,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
 				return -EEXIST;
 			}
 			nlen = EXT3_DIR_REC_LEN(de->name_len);
-			rlen = le16_to_cpu(de->rec_len);
+			rlen = ext3_rec_len_from_disk(de->rec_len);
 			if ((de->inode? rlen - nlen: rlen) >= reclen)
 				break;
 			de = (struct ext3_dir_entry_2 *)((char *)de + rlen);
@@ -1328,11 +1328,11 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
 
 	/* By now the buffer is marked for journaling */
 	nlen = EXT3_DIR_REC_LEN(de->name_len);
-	rlen = le16_to_cpu(de->rec_len);
+	rlen = ext3_rec_len_from_disk(de->rec_len);
 	if (de->inode) {
 		struct ext3_dir_entry_2 *de1 = (struct ext3_dir_entry_2 *)((char *)de + nlen);
-		de1->rec_len = cpu_to_le16(rlen - nlen);
-		de->rec_len = cpu_to_le16(nlen);
+		de1->rec_len = ext3_rec_len_to_disk(rlen - nlen);
+		de->rec_len = ext3_rec_len_to_disk(nlen);
 		de = de1;
 	}
 	de->file_type = EXT3_FT_UNKNOWN;
@@ -1410,17 +1410,18 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
 
 	/* The 0th block becomes the root, move the dirents out */
 	fde = &root->dotdot;
-	de = (struct ext3_dir_entry_2 *)((char *)fde + le16_to_cpu(fde->rec_len));
+	de = (struct ext3_dir_entry_2 *)((char *)fde +
+			ext3_rec_len_from_disk(fde->rec_len));
 	len = ((char *) root) + blocksize - (char *) de;
 	memcpy (data1, de, len);
 	de = (struct ext3_dir_entry_2 *) data1;
 	top = data1 + len;
-	while ((char *)(de2=(void*)de+le16_to_cpu(de->rec_len)) < top)
+	while ((char *)(de2 = ext3_next_entry(de)) < top)
 		de = de2;
-	de->rec_len = cpu_to_le16(data1 + blocksize - (char *) de);
+	de->rec_len = ext3_rec_len_to_disk(data1 + blocksize - (char *) de);
 	/* Initialize the root; the dot dirents already exist */
 	de = (struct ext3_dir_entry_2 *) (&root->dotdot);
-	de->rec_len = cpu_to_le16(blocksize - EXT3_DIR_REC_LEN(2));
+	de->rec_len = ext3_rec_len_to_disk(blocksize - EXT3_DIR_REC_LEN(2));
 	memset (&root->info, 0, sizeof(root->info));
 	root->info.info_length = sizeof(root->info);
 	root->info.hash_version = EXT3_SB(dir->i_sb)->s_def_hash_version;
@@ -1507,7 +1508,7 @@ static int ext3_add_entry (handle_t *handle, struct dentry *dentry,
 		return retval;
 	de = (struct ext3_dir_entry_2 *) bh->b_data;
 	de->inode = 0;
-	de->rec_len = cpu_to_le16(blocksize);
+	de->rec_len = ext3_rec_len_to_disk(blocksize);
 	return add_dirent_to_buf(handle, dentry, inode, de, bh);
 }
 
@@ -1571,7 +1572,7 @@ static int ext3_dx_add_entry(handle_t *handle, struct dentry *dentry,
 			goto cleanup;
 		node2 = (struct dx_node *)(bh2->b_data);
 		entries2 = node2->entries;
-		node2->fake.rec_len = cpu_to_le16(sb->s_blocksize);
+		node2->fake.rec_len = ext3_rec_len_to_disk(sb->s_blocksize);
 		node2->fake.inode = 0;
 		BUFFER_TRACE(frame->bh, "get_write_access");
 		err = ext3_journal_get_write_access(handle, frame->bh);
@@ -1670,9 +1671,9 @@ static int ext3_delete_entry (handle_t *handle,
 			BUFFER_TRACE(bh, "get_write_access");
 			ext3_journal_get_write_access(handle, bh);
 			if (pde)
-				pde->rec_len =
-					cpu_to_le16(le16_to_cpu(pde->rec_len) +
-						    le16_to_cpu(de->rec_len));
+				pde->rec_len = ext3_rec_len_to_disk(
+					ext3_rec_len_from_disk(pde->rec_len) +
+					ext3_rec_len_from_disk(de->rec_len));
 			else
 				de->inode = 0;
 			dir->i_version++;
@@ -1680,10 +1681,9 @@ static int ext3_delete_entry (handle_t *handle,
 			ext3_journal_dirty_metadata(handle, bh);
 			return 0;
 		}
-		i += le16_to_cpu(de->rec_len);
+		i += ext3_rec_len_from_disk(de->rec_len);
 		pde = de;
-		de = (struct ext3_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+		de = ext3_next_entry(de);
 	}
 	return -ENOENT;
 }
@@ -1817,13 +1817,12 @@ retry:
 	de = (struct ext3_dir_entry_2 *) dir_block->b_data;
 	de->inode = cpu_to_le32(inode->i_ino);
 	de->name_len = 1;
-	de->rec_len = cpu_to_le16(EXT3_DIR_REC_LEN(de->name_len));
+	de->rec_len = ext3_rec_len_to_disk(EXT3_DIR_REC_LEN(de->name_len));
 	strcpy (de->name, ".");
 	ext3_set_de_type(dir->i_sb, de, S_IFDIR);
-	de = (struct ext3_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+	de = ext3_next_entry(de);
 	de->inode = cpu_to_le32(dir->i_ino);
-	de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
+	de->rec_len = ext3_rec_len_to_disk(inode->i_sb->s_blocksize-EXT3_DIR_REC_LEN(1));
 	de->name_len = 2;
 	strcpy (de->name, "..");
 	ext3_set_de_type(dir->i_sb, de, S_IFDIR);
@@ -1875,8 +1874,7 @@ static int empty_dir (struct inode * inode)
 		return 1;
 	}
 	de = (struct ext3_dir_entry_2 *) bh->b_data;
-	de1 = (struct ext3_dir_entry_2 *)
-			((char *) de + le16_to_cpu(de->rec_len));
+	de1 = ext3_next_entry(de);
 	if (le32_to_cpu(de->inode) != inode->i_ino ||
 			!le32_to_cpu(de1->inode) ||
 			strcmp (".", de->name) ||
@@ -1887,9 +1885,9 @@ static int empty_dir (struct inode * inode)
 		brelse (bh);
 		return 1;
 	}
-	offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
-	de = (struct ext3_dir_entry_2 *)
-			((char *) de1 + le16_to_cpu(de1->rec_len));
+	offset = ext3_rec_len_from_disk(de->rec_len) +
+			ext3_rec_len_from_disk(de1->rec_len);
+	de = ext3_next_entry(de1);
 	while (offset < inode->i_size ) {
 		if (!bh ||
 			(void *) de >= (void *) (bh->b_data+sb->s_blocksize)) {
@@ -1918,9 +1916,8 @@ static int empty_dir (struct inode * inode)
 			brelse (bh);
 			return 0;
 		}
-		offset += le16_to_cpu(de->rec_len);
-		de = (struct ext3_dir_entry_2 *)
-				((char *) de + le16_to_cpu(de->rec_len));
+		offset += ext3_rec_len_from_disk(de->rec_len);
+		de = ext3_next_entry(de);
 	}
 	brelse (bh);
 	return 1;
@@ -2274,8 +2271,7 @@ retry:
 }
 
 #define PARENT_INO(buffer) \
-	((struct ext3_dir_entry_2 *) ((char *) buffer + \
-	le16_to_cpu(((struct ext3_dir_entry_2 *) buffer)->rec_len)))->inode
+	(ext3_next_entry((struct ext3_dir_entry_2 *)(buffer))->inode)
 
 /*
  * Anybody can rename anything with this: the permission checks are left to the
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 7aa5556..d9e378d 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -660,6 +660,26 @@ struct ext3_dir_entry_2 {
 #define EXT3_DIR_ROUND			(EXT3_DIR_PAD - 1)
 #define EXT3_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT3_DIR_ROUND) & \
 					 ~EXT3_DIR_ROUND)
+#define EXT3_MAX_REC_LEN		((1<<16)-1)
+
+static inline unsigned ext3_rec_len_from_disk(__le16 dlen)
+{
+	unsigned len = le16_to_cpu(dlen);
+
+	if (len == EXT3_MAX_REC_LEN)
+		return 1 << 16;
+	return len;
+}
+
+static inline __le16 ext3_rec_len_to_disk(unsigned len)
+{
+	if (len == (1 << 16))
+		return cpu_to_le16(EXT3_MAX_REC_LEN);
+	else if (len > (1 << 16))
+		BUG();
+	return cpu_to_le16(len);
+}
+
 /*
  * Hash Tree Directory indexing
  * (c) Daniel Phillips, 2001




* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-02  0:35       ` [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size Mingming Cao
@ 2007-10-04 20:12         ` Andrew Morton
  2007-10-04 22:40           ` Andreas Dilger
                             ` (2 more replies)
  0 siblings, 3 replies; 124+ messages in thread
From: Andrew Morton @ 2007-10-04 20:12 UTC (permalink / raw)
  To: cmm; +Cc: linux-ext4, linux-kernel, sho, jack, clameter

On Mon, 01 Oct 2007 17:35:46 -0700
Mingming Cao <cmm@us.ibm.com> wrote:

> ext2: Avoid rec_len overflow with 64KB block size
> 
> From: Jan Kara <jack@suse.cz>
> 
> With 64KB blocksize, a directory entry can have size 64KB which does not fit
> into 16 bits we have for entry length. So we store 0xffff instead and convert
> value when read from / written to disk.

This patch clashes in non-trivial ways with
ext2-convert-to-new-aops-fix.patch and perhaps other things which are
already queued for 2.6.24 inclusion, so I'll need to ask for an updated
patch, please.

Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
additional testing and checking will be needed.



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-04 20:12         ` Andrew Morton
@ 2007-10-04 22:40           ` Andreas Dilger
  2007-10-04 23:11             ` Andrew Morton
  2007-10-08 13:02           ` Jan Kara
  2007-10-11 11:18           ` Jan Kara
  2 siblings, 1 reply; 124+ messages in thread
From: Andreas Dilger @ 2007-10-04 22:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, linux-ext4, linux-kernel, sho, jack, clameter

On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> > ext2: Avoid rec_len overflow with 64KB block size
> > 
> > into 16 bits we have for entry length. So we store 0xffff instead and
> > convert value when read from / written to disk.
> 
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.

If the rec_len overflow patch isn't going to make it, then we also need
to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
to allow this to be up to 32768 w/o the rec_len overflow fix however.

Yes, this does imply that those patches were in the wrong order in the
patch series, and I apologize for that, even if it isn't my fault.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-04 22:40           ` Andreas Dilger
@ 2007-10-04 23:11             ` Andrew Morton
  2007-10-11 10:30               ` Jan Kara
  0 siblings, 1 reply; 124+ messages in thread
From: Andrew Morton @ 2007-10-04 23:11 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: cmm, linux-ext4, linux-kernel, sho, jack, clameter

On Thu, 4 Oct 2007 16:40:44 -0600
Andreas Dilger <adilger@clusterfs.com> wrote:

> On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> > On Mon, 01 Oct 2007 17:35:46 -0700
> > > ext2: Avoid rec_len overflow with 64KB block size
> > > 
> > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > convert value when read from / written to disk.
> > 
> > This patch clashes in non-trivial ways with
> > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > patch, please.
> 
> If the rec_len overflow patch isn't going to make it, then we also need
> to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
> to allow this to be up to 32768 w/o the rec_len overflow fix however.
> 

Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
ext2-support-large-blocksize-up-to-pagesize.patch.



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-04 20:12         ` Andrew Morton
  2007-10-04 22:40           ` Andreas Dilger
@ 2007-10-08 13:02           ` Jan Kara
  2007-10-11 11:18           ` Jan Kara
  2 siblings, 0 replies; 124+ messages in thread
From: Jan Kara @ 2007-10-08 13:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, linux-ext4, linux-kernel, sho, jack, clameter

On Thu 04-10-07 13:12:07, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > ext2: Avoid rec_len overflow with 64KB block size
> > 
> > From: Jan Kara <jack@suse.cz>
> > 
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry length. So we store 0xffff instead and convert
> > value when read from / written to disk.
> 
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.
> 
> Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
> we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
> additional testing and checking will be needed.
  OK, I'll fixup those rejects and send a new patch.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-11 10:30               ` Jan Kara
@ 2007-10-11 10:14                 ` Andrew Morton
  0 siblings, 0 replies; 124+ messages in thread
From: Andrew Morton @ 2007-10-11 10:14 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, cmm, linux-ext4, linux-kernel, sho, clameter

On Thu, 11 Oct 2007 12:30:03 +0200 Jan Kara <jack@suse.cz> wrote:

> On Thu 04-10-07 16:11:21, Andrew Morton wrote:
> > On Thu, 4 Oct 2007 16:40:44 -0600
> > Andreas Dilger <adilger@clusterfs.com> wrote:
> > 
> > > On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> > > > On Mon, 01 Oct 2007 17:35:46 -0700
> > > > > ext2: Avoid rec_len overflow with 64KB block size
> > > > > 
> > > > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > > > convert value when read from / written to disk.
> > > > 
> > > > This patch clashes in non-trivial ways with
> > > > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > > > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > > > patch, please.
> > > 
> > > If the rec_len overflow patch isn't going to make it, then we also need
> > > to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
> > > to allow this to be up to 32768 w/o the rec_len overflow fix however.
> > > 
> > 
> > Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
> > ext2-support-large-blocksize-up-to-pagesize.patch.
>   Sorry for the delayed answer, but I had some urgent bugs to fix...

You exceeded my memory span.

> Why did you drop ext3-support-large-blocksize-up-to-pagesize.patch?

I forget.  I'll bring it back and see what happens.

> As far
> as I understand your previous email (and also as I've checked against
> 2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2...
>   I'll send you an updated patch for ext2 in a moment.

ok..  I'm basically not applying anything any more - the whole thing
is a teetering wreck.   I need to go through the input queue delicately
adding things which look important or relatively non-injurious.


* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-04 23:11             ` Andrew Morton
@ 2007-10-11 10:30               ` Jan Kara
  2007-10-11 10:14                 ` Andrew Morton
  0 siblings, 1 reply; 124+ messages in thread
From: Jan Kara @ 2007-10-11 10:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andreas Dilger, cmm, linux-ext4, linux-kernel, sho, clameter

On Thu 04-10-07 16:11:21, Andrew Morton wrote:
> On Thu, 4 Oct 2007 16:40:44 -0600
> Andreas Dilger <adilger@clusterfs.com> wrote:
> 
> > On Oct 04, 2007  13:12 -0700, Andrew Morton wrote:
> > > On Mon, 01 Oct 2007 17:35:46 -0700
> > > > ext2: Avoid rec_len overflow with 64KB block size
> > > > 
> > > > into 16 bits we have for entry length. So we store 0xffff instead and
> > > > convert value when read from / written to disk.
> > > 
> > > This patch clashes in non-trivial ways with
> > > ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> > > already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> > > patch, please.
> > 
> > If the rec_len overflow patch isn't going to make it, then we also need
> > to revert the EXT*_MAX_BLOCK_SIZE change to 65536.  It would be possible
> > to allow this to be up to 32768 w/o the rec_len overflow fix however.
> > 
> 
> Ok, thanks, I dropped ext3-support-large-blocksize-up-to-pagesize.patch and
> ext2-support-large-blocksize-up-to-pagesize.patch.
  Sorry for the delayed answer, but I had some urgent bugs to fix...
Why did you drop ext3-support-large-blocksize-up-to-pagesize.patch? As far
as I understand your previous email (and also as I've checked against
2.6.23-rc8-mm2), the patch fixing rec_len overflow clashes only for ext2...
  I'll send you an updated patch for ext2 in a moment.

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-04 20:12         ` Andrew Morton
  2007-10-04 22:40           ` Andreas Dilger
  2007-10-08 13:02           ` Jan Kara
@ 2007-10-11 11:18           ` Jan Kara
  2007-10-18  4:07             ` Andrew Morton
  2007-10-18  4:09             ` Andrew Morton
  2 siblings, 2 replies; 124+ messages in thread
From: Jan Kara @ 2007-10-11 11:18 UTC (permalink / raw)
  To: Andrew Morton; +Cc: cmm, linux-ext4, linux-kernel, sho, jack, clameter

On Thu 04-10-07 13:12:07, Andrew Morton wrote:
> On Mon, 01 Oct 2007 17:35:46 -0700
> Mingming Cao <cmm@us.ibm.com> wrote:
> 
> > ext2: Avoid rec_len overflow with 64KB block size
> > 
> > From: Jan Kara <jack@suse.cz>
> > 
> > With 64KB blocksize, a directory entry can have size 64KB which does not fit
> > into 16 bits we have for entry length. So we store 0xffff instead and convert
> > value when read from / written to disk.
> 
> This patch clashes in non-trivial ways with
> ext2-convert-to-new-aops-fix.patch and perhaps other things which are
> already queued for 2.6.24 inclusion, so I'll need to ask for an updated
> patch, please.
> 
> Also, I'm planning on merging the ext2 reservations code into 2.6.24, so if
> we're aiming for complete support of 64k blocksize in 2.6.24's ext2,
> additional testing and checking will be needed.
  OK, attached is a patch diffed against 2.6.23-rc9-mm2 - does that work
fine for you?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

------

With 64KB blocksize, a directory entry can have a size of 64KB, which does not fit
into the 16 bits we have for the entry length. So we store 0xffff instead and
convert the value when it is read from / written to disk.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/fs/ext2/dir.c linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c
--- linux-2.6.23-mm/fs/ext2/dir.c	2007-10-11 12:08:16.000000000 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/fs/ext2/dir.c	2007-10-11 12:14:24.000000000 +0200
@@ -28,6 +28,24 @@
 
 typedef struct ext2_dir_entry_2 ext2_dirent;
 
+static inline unsigned ext2_rec_len_from_disk(__le16 dlen)
+{
+	unsigned len = le16_to_cpu(dlen);
+
+	if (len == EXT2_MAX_REC_LEN)
+		return 1 << 16;
+	return len;
+}
+
+static inline __le16 ext2_rec_len_to_disk(unsigned len)
+{
+	if (len == (1 << 16))
+		return cpu_to_le16(EXT2_MAX_REC_LEN);
+	else if (len > (1 << 16))
+		BUG();
+	return cpu_to_le16(len);
+}
+
 /*
  * ext2 uses block-sized chunks. Arguably, sector-sized ones would be
  * more robust, but we have what we have
@@ -106,7 +124,7 @@ static void ext2_check_page(struct page 
 	}
 	for (offs = 0; offs <= limit - EXT2_DIR_REC_LEN(1); offs += rec_len) {
 		p = (ext2_dirent *)(kaddr + offs);
-		rec_len = le16_to_cpu(p->rec_len);
+		rec_len = ext2_rec_len_from_disk(p->rec_len);
 
 		if (rec_len < EXT2_DIR_REC_LEN(1))
 			goto Eshort;
@@ -204,7 +222,8 @@ static inline int ext2_match (int len, c
  */
 static inline ext2_dirent *ext2_next_entry(ext2_dirent *p)
 {
-	return (ext2_dirent *)((char*)p + le16_to_cpu(p->rec_len));
+	return (ext2_dirent *)((char*)p +
+			ext2_rec_len_from_disk(p->rec_len));
 }
 
 static inline unsigned 
@@ -316,7 +335,7 @@ ext2_readdir (struct file * filp, void *
 					return 0;
 				}
 			}
-			filp->f_pos += le16_to_cpu(de->rec_len);
+			filp->f_pos += ext2_rec_len_from_disk(de->rec_len);
 		}
 		ext2_put_page(page);
 	}
@@ -425,7 +444,7 @@ void ext2_set_link(struct inode *dir, st
 {
 	loff_t pos = page_offset(page) +
 			(char *) de - (char *) page_address(page);
-	unsigned len = le16_to_cpu(de->rec_len);
+	unsigned len = ext2_rec_len_from_disk(de->rec_len);
 	int err;
 
 	lock_page(page);
@@ -482,7 +501,7 @@ int ext2_add_link (struct dentry *dentry
 				/* We hit i_size */
 				name_len = 0;
 				rec_len = chunk_size;
-				de->rec_len = cpu_to_le16(chunk_size);
+				de->rec_len = ext2_rec_len_to_disk(chunk_size);
 				de->inode = 0;
 				goto got_it;
 			}
@@ -496,7 +515,7 @@ int ext2_add_link (struct dentry *dentry
 			if (ext2_match (namelen, name, de))
 				goto out_unlock;
 			name_len = EXT2_DIR_REC_LEN(de->name_len);
-			rec_len = le16_to_cpu(de->rec_len);
+			rec_len = ext2_rec_len_from_disk(de->rec_len);
 			if (!de->inode && rec_len >= reclen)
 				goto got_it;
 			if (rec_len >= name_len + reclen)
@@ -518,8 +537,8 @@ got_it:
 		goto out_unlock;
 	if (de->inode) {
 		ext2_dirent *de1 = (ext2_dirent *) ((char *) de + name_len);
-		de1->rec_len = cpu_to_le16(rec_len - name_len);
-		de->rec_len = cpu_to_le16(name_len);
+		de1->rec_len = ext2_rec_len_to_disk(rec_len - name_len);
+		de->rec_len = ext2_rec_len_to_disk(name_len);
 		de = de1;
 	}
 	de->name_len = namelen;
@@ -550,7 +569,8 @@ int ext2_delete_entry (struct ext2_dir_e
 	struct inode *inode = mapping->host;
 	char *kaddr = page_address(page);
 	unsigned from = ((char*)dir - kaddr) & ~(ext2_chunk_size(inode)-1);
-	unsigned to = ((char*)dir - kaddr) + le16_to_cpu(dir->rec_len);
+	unsigned to = ((char*)dir - kaddr) +
+				ext2_rec_len_from_disk(dir->rec_len);
 	loff_t pos;
 	ext2_dirent * pde = NULL;
 	ext2_dirent * de = (ext2_dirent *) (kaddr + from);
@@ -574,7 +594,7 @@ int ext2_delete_entry (struct ext2_dir_e
 							&page, NULL);
 	BUG_ON(err);
 	if (pde)
-		pde->rec_len = cpu_to_le16(to - from);
+		pde->rec_len = ext2_rec_len_to_disk(to - from);
 	dir->inode = 0;
 	err = ext2_commit_chunk(page, pos, to - from);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME_SEC;
@@ -610,14 +630,14 @@ int ext2_make_empty(struct inode *inode,
 	memset(kaddr, 0, chunk_size);
 	de = (struct ext2_dir_entry_2 *)kaddr;
 	de->name_len = 1;
-	de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(1));
+	de->rec_len = ext2_rec_len_to_disk(EXT2_DIR_REC_LEN(1));
 	memcpy (de->name, ".\0\0", 4);
 	de->inode = cpu_to_le32(inode->i_ino);
 	ext2_set_de_type (de, inode);
 
 	de = (struct ext2_dir_entry_2 *)(kaddr + EXT2_DIR_REC_LEN(1));
 	de->name_len = 2;
-	de->rec_len = cpu_to_le16(chunk_size - EXT2_DIR_REC_LEN(1));
+	de->rec_len = ext2_rec_len_to_disk(chunk_size - EXT2_DIR_REC_LEN(1));
 	de->inode = cpu_to_le32(parent->i_ino);
 	memcpy (de->name, "..\0", 4);
 	ext2_set_de_type (de, inode);
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-mm/include/linux/ext2_fs.h linux-2.6.23-mm-1-ext2_64k_rec_len/include/linux/ext2_fs.h
--- linux-2.6.23-mm/include/linux/ext2_fs.h	2007-10-11 12:08:34.000000000 +0200
+++ linux-2.6.23-mm-1-ext2_64k_rec_len/include/linux/ext2_fs.h	2007-10-11 12:11:22.000000000 +0200
@@ -561,6 +561,7 @@ enum {
 #define EXT2_DIR_ROUND 			(EXT2_DIR_PAD - 1)
 #define EXT2_DIR_REC_LEN(name_len)	(((name_len) + 8 + EXT2_DIR_ROUND) & \
 					 ~EXT2_DIR_ROUND)
+#define EXT2_MAX_REC_LEN		((1<<16)-1)
 
 static inline ext2_fsblk_t
 ext2_group_first_block_no(struct super_block *sb, unsigned long group_no)


* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-11 11:18           ` Jan Kara
@ 2007-10-18  4:07             ` Andrew Morton
  2007-10-18  4:09             ` Andrew Morton
  1 sibling, 0 replies; 124+ messages in thread
From: Andrew Morton @ 2007-10-18  4:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: cmm, linux-ext4, linux-kernel, sho, clameter

On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <jack@suse.cz> wrote:

> +static inline __le16 ext2_rec_len_to_disk(unsigned len)
> +{
> +	if (len == (1 << 16))
> +		return cpu_to_le16(EXT2_MAX_REC_LEN);
> +	else if (len > (1 << 16))
> +		BUG();
> +	return cpu_to_le16(len);
> +}

Of course, ext2 shouldn't be trying to write a bad record length into a
directory entry.  But are we sure that there is no way in which this
situation could occur if the on-disk data was _already_ bad?

Because it is very bad for a filesystem to go BUG in response to unexpected
data on the disk.



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-11 11:18           ` Jan Kara
  2007-10-18  4:07             ` Andrew Morton
@ 2007-10-18  4:09             ` Andrew Morton
  2007-10-18  9:03               ` Christoph Lameter
  2007-10-19  2:05               ` Mingming Cao
  1 sibling, 2 replies; 124+ messages in thread
From: Andrew Morton @ 2007-10-18  4:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: cmm, linux-ext4, linux-kernel, sho, clameter

On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <jack@suse.cz> wrote:

> With 64KB blocksize, a directory entry can have size 64KB which does not fit
> into 16 bits we have for entry length. So we store 0xffff instead and convert
> value when read from / written to disk.

btw, this changes ext2's on-disk format.

a) is the ext2 format documented anywhere?  If so, that document will
   need updating.

b) what happens when an old ext2 driver tries to read and/or write this
   directory entry?  Do we need a compat flag for it?

c) what happens when old and new ext3 or ext4 try to read/write this
   directory entry?




* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-18  4:09             ` Andrew Morton
@ 2007-10-18  9:03               ` Christoph Lameter
  2007-10-18  9:11                 ` Andrew Morton
  2007-10-19  2:05               ` Mingming Cao
  1 sibling, 1 reply; 124+ messages in thread
From: Christoph Lameter @ 2007-10-18  9:03 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, cmm, linux-ext4, linux-kernel, sho

On Wed, 17 Oct 2007, Andrew Morton wrote:

> b) what happens when an old ext2 driver tries to read and/or write this
>    directory entry?  Do we need a compat flag for it?

Old ext2 only supports up to 4k:

include/linux/ext2_fs.h:

#define EXT2_MIN_BLOCK_SIZE             1024
#define EXT2_MAX_BLOCK_SIZE             4096
#define EXT2_MIN_BLOCK_LOG_SIZE           10

Should fail to mount the volume since the block size is too large.



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-18  9:03               ` Christoph Lameter
@ 2007-10-18  9:11                 ` Andrew Morton
  0 siblings, 0 replies; 124+ messages in thread
From: Andrew Morton @ 2007-10-18  9:11 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Jan Kara, cmm, linux-ext4, linux-kernel, sho

On Thu, 18 Oct 2007 02:03:39 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote:

> On Wed, 17 Oct 2007, Andrew Morton wrote:
> 
> > b) what happens when an old ext2 driver tries to read and/or write this
> >    directory entry?  Do we need a compat flag for it?
> 
> Old ext2 only supports up to 4k
> 
> include/linux/ext2_fs.h:
> 
> #define EXT2_MIN_BLOCK_SIZE             1024
> #define EXT2_MAX_BLOCK_SIZE             4096
> #define EXT2_MIN_BLOCK_LOG_SIZE           10
> 
> Should fail to mount the volume since the block size is too large.

should, but does it?

box:/usr/src/25> grep MAX_BLOCK_SIZE fs/ext2/*.[ch] include/linux/ext2*
include/linux/ext2_fs.h:#define EXT2_MAX_BLOCK_SIZE             4096
box:/usr/src/25> 
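The grep shows that EXT2_MAX_BLOCK_SIZE is defined but never referenced in fs/ext2, i.e. nothing enforces it at mount time. A hedged sketch of the missing sanity check (hypothetical helper; the real ext2_fill_super() derives the block size from the superblock's s_log_block_size field):

```c
#include <assert.h>

#define EXT2_MIN_BLOCK_SIZE 1024
#define EXT2_MAX_BLOCK_SIZE 4096

/* Reject block sizes outside the range the old driver understands;
 * a 64KB-block volume would fail this check and refuse to mount. */
static int ext2_blocksize_valid(unsigned blocksize)
{
    return blocksize >= EXT2_MIN_BLOCK_SIZE &&
           blocksize <= EXT2_MAX_BLOCK_SIZE;
}
```

In practice the mount still fails for large blocks for a different reason: sb_set_blocksize() refuses any block size larger than PAGE_SIZE.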



* Re: [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size
  2007-10-18  4:09             ` Andrew Morton
  2007-10-18  9:03               ` Christoph Lameter
@ 2007-10-19  2:05               ` Mingming Cao
  1 sibling, 0 replies; 124+ messages in thread
From: Mingming Cao @ 2007-10-19  2:05 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jan Kara, linux-ext4, linux-kernel, sho, clameter

On Wed, 2007-10-17 at 21:09 -0700, Andrew Morton wrote:
> On Thu, 11 Oct 2007 13:18:49 +0200 Jan Kara <jack@suse.cz> wrote:
> 
> > With 64KB blocksize, a directory entry can have size 64KB, which does not fit
> > into the 16 bits we have for the entry length. So we store 0xffff instead and convert
> > the value when reading from / writing to disk.
> 
> btw, this changes ext2's on-disk format.
> 
Just to clarify: this only changes the directory entry format on ext2/3/4
filesystems with a 64KB block size. Currently, without kernel changes,
ext2/3/4 does not support a 64KB block size.

> a) is the ext2 format documented anywhere?  If so, that document will
>    need updating.
> 

e2fsprogs needs to be changed to stay in sync with this change.

Ted has a paper from a while back that shows the ext2 disk format:
http://web.mit.edu/tytso/www/linux/ext2intro.html

Documentation/filesystems/ext2.txt doesn't have the ext2 format
documented. That document is outdated and needs to be reviewed and
cleaned up.
 
> b) what happens when an old ext2 driver tries to read and/or write this
>    directory entry?  Do we need a compat flag for it?
> 
> c) what happens when old and new ext3 or ext4 try to read/write this
>    directory entry?
> 

Without the first patch in this series (the ext2 large blocksize support
patches), mounting an ext2 filesystem with a 64KB block size fails:

[PATCH 1/2] ext2:  Support large blocksize up to PAGESIZE
http://lkml.org/lkml/2007/10/1/361

So an old ext2/3/4 driver will never get to access a directory entry in
the changed 64KB-block-size format.


Regards,

Mingming

> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



end of thread, other threads:[~2007-10-19  2:06 UTC | newest]

Thread overview: 124+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-28 19:05 [00/36] Large Blocksize Support V6 clameter
2007-08-28 19:05 ` [01/36] Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user clameter
2007-08-28 19:05 ` [02/36] Define functions for page cache handling clameter
2007-08-28 19:05 ` [03/36] Use page_cache_xxx functions in mm/filemap.c clameter
2007-08-28 19:05 ` [04/36] Use page_cache_xxx in mm/page-writeback.c clameter
2007-08-28 19:05 ` [05/36] Use page_cache_xxx in mm/truncate.c clameter
2007-08-28 19:05 ` [06/36] Use page_cache_xxx in mm/rmap.c clameter
2007-08-28 19:05 ` [07/36] Use page_cache_xxx in mm/filemap_xip.c clameter
2007-08-28 19:49   ` Jörn Engel
2007-08-28 19:55     ` Christoph Hellwig
2007-08-28 23:49       ` Nick Piggin
2007-08-28 19:05 ` [08/36] Use page_cache_xxx in mm/migrate.c clameter
2007-08-28 19:06 ` [09/36] Use page_cache_xxx in fs/libfs.c clameter
2007-08-28 19:06 ` [10/36] Use page_cache_xxx in fs/sync clameter
2007-08-28 19:06 ` [11/36] Use page_cache_xxx in fs/buffer.c clameter
2007-08-30  9:20   ` Dmitry Monakhov
2007-08-30 18:14     ` Christoph Lameter
2007-08-31  1:47       ` Christoph Lameter
2007-08-31  6:56         ` Jens Axboe
2007-08-31  7:03           ` Christoph Lameter
2007-08-31  7:11             ` Jens Axboe
2007-08-31  7:17               ` Christoph Lameter
2007-08-31  7:26                 ` Jens Axboe
2007-08-31  7:33                   ` Christoph Lameter
2007-08-31  7:43                     ` Jens Axboe
2007-08-31  7:52                       ` Christoph Lameter
2007-08-31  8:12                         ` Jens Axboe
2007-08-31 15:22                           ` Christoph Lameter
2007-08-31 16:35                             ` Jörn Engel
2007-08-31 19:00                             ` Jens Axboe
2007-08-31  8:36                         ` Dmitry Monakhov
2007-08-31 15:28                           ` Christoph Lameter
2007-08-28 19:06 ` [12/36] Use page_cache_xxx in mm/mpage.c clameter
2007-08-28 19:06 ` [13/36] Use page_cache_xxx in mm/fadvise.c clameter
2007-08-28 19:06 ` [14/36] Use page_cache_xxx in fs/splice.c clameter
2007-08-28 19:06 ` [15/36] Use page_cache_xxx functions in fs/ext2 clameter
2007-08-28 19:06 ` [16/36] Use page_cache_xxx in fs/ext3 clameter
2007-08-28 19:06 ` [17/36] Use page_cache_xxx in fs/ext4 clameter
2007-08-28 19:06 ` [18/36] Use page_cache_xxx in fs/reiserfs clameter
2007-08-28 19:06 ` [19/36] Use page_cache_xxx for fs/xfs clameter
2007-08-28 19:06 ` [20/36] Use page_cache_xxx in drivers/block/rd.c clameter
2007-08-28 19:06 ` [21/36] compound pages: PageHead/PageTail instead of PageCompound clameter
2007-08-28 19:06 ` [22/36] compound pages: Add new support functions clameter
2007-08-28 19:06 ` [23/36] compound pages: vmstat support clameter
2007-08-28 19:06 ` [24/36] compound pages: Use new compound vmstat functions in SLUB clameter
2007-08-28 19:06 ` [25/36] compound pages: Allow use of get_page_unless_zero with compound pages clameter
2007-08-28 19:06 ` [26/36] compound pages: Allow freeing of compound pages via pagevec clameter
2007-08-28 19:06 ` [27/36] Compound page zeroing and flushing clameter
2007-08-28 19:06 ` [28/36] Fix PAGE SIZE assumption in miscellaneous places clameter
2007-08-28 19:06 ` [29/36] Fix up reclaim counters clameter
2007-08-28 19:06 ` [30/36] Add VM_BUG_ONs to check for correct page order clameter
2007-08-28 19:06 ` [31/36] Large Blocksize: Core piece clameter
2007-08-30  0:11   ` Mingming Cao
2007-08-30  0:12     ` Christoph Lameter
2007-08-30  0:47     ` [RFC 1/4] Large Blocksize support for Ext2/3/4 Mingming Cao
2007-08-30  0:59       ` Christoph Lameter
2007-09-01  0:01       ` Mingming Cao
2007-09-01  0:12       ` [RFC 1/2] JBD: slab management support for large block(>8k) Mingming Cao
2007-09-01 18:39         ` Christoph Hellwig
2007-09-02 11:40           ` Christoph Lameter
2007-09-02 15:28             ` Christoph Hellwig
2007-09-03  7:55               ` Christoph Lameter
2007-09-03 13:40                 ` Christoph Hellwig
2007-09-03 19:31                   ` Christoph Lameter
2007-09-03 19:33                     ` Christoph Hellwig
2007-09-14 18:53                       ` [PATCH] JBD slab cleanups Mingming Cao
2007-09-14 18:58                         ` Christoph Lameter
2007-09-17 19:29                         ` Mingming Cao
2007-09-17 19:34                           ` Christoph Hellwig
2007-09-17 22:01                           ` Badari Pulavarty
2007-09-17 22:57                             ` Mingming Cao
2007-09-18  9:04                               ` Christoph Hellwig
2007-09-18 16:35                                 ` Mingming Cao
2007-09-18 18:04                                   ` Dave Kleikamp
2007-09-19  1:00                                     ` Mingming Cao
2007-09-19  2:19                                       ` Andrew Morton
2007-09-19 19:15                                         ` Mingming Cao
2007-09-19 19:22                                           ` [PATCH] JBD: use GFP_NOFS in kmalloc Mingming Cao
2007-09-19 21:34                                             ` Andrew Morton
2007-09-19 21:55                                               ` Mingming Cao
2007-09-20  4:25                                             ` Andreas Dilger
2007-09-19 19:26                                           ` [PATCH] JBD slab cleanups Dave Kleikamp
2007-09-19 19:28                                             ` Dave Kleikamp
2007-09-19 20:47                                               ` Mingming Cao
2007-09-19 19:48                                           ` Andreas Dilger
2007-09-19 22:03                                             ` Mingming Cao
2007-09-21 23:13                                               ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Mingming Cao
2007-09-21 23:32                                                 ` [PATCH] JBD2/ext4 naming cleanup Mingming Cao
2007-09-26 19:54                                                 ` [PATCH] JBD/ext34 cleanups: convert to kzalloc Andrew Morton
2007-09-26 21:05                                                   ` Mingming Cao
2007-09-01  0:12       ` [RFC 2/2] JBD: blocks reservation fix for large block support Mingming Cao
2007-10-02  0:34       ` [PATCH 1/2] ext4: Support large blocksize up to PAGESIZE Mingming Cao
2007-10-02  0:35       ` [PATCH 2/2] ext4: Avoid rec_len overflow with 64KB block size Mingming Cao
2007-10-02  0:35       ` [PATCH 1/2] ext2: Support large blocksize up to PAGESIZE Mingming Cao
2007-10-02  0:35       ` [PATCH 2/2] ext2: Avoid rec_len overflow with 64KB block size Mingming Cao
2007-10-04 20:12         ` Andrew Morton
2007-10-04 22:40           ` Andreas Dilger
2007-10-04 23:11             ` Andrew Morton
2007-10-11 10:30               ` Jan Kara
2007-10-11 10:14                 ` Andrew Morton
2007-10-08 13:02           ` Jan Kara
2007-10-11 11:18           ` Jan Kara
2007-10-18  4:07             ` Andrew Morton
2007-10-18  4:09             ` Andrew Morton
2007-10-18  9:03               ` Christoph Lameter
2007-10-18  9:11                 ` Andrew Morton
2007-10-19  2:05               ` Mingming Cao
2007-10-02  0:36       ` [PATCH 1/2] ext3: Support large blocksize up to PAGESIZE Mingming Cao
2007-10-02  0:36       ` [PATCH 2/2] ext3: Avoid rec_len overflow with 64KB block size Mingming Cao
2007-08-30  0:47     ` [RFC 2/4]ext2: fix " Mingming Cao
2007-08-30  0:48     ` [RFC 3/4] ext3: " Mingming Cao
2007-08-30  0:48     ` [RFC 4/4]ext4: " Mingming Cao
2007-08-28 19:06 ` [32/36] Readahead changes to support large blocksize clameter
2007-08-28 19:06 ` [33/36] Large blocksize support in ramfs clameter
2007-08-28 19:06 ` [34/36] Large blocksize support in XFS clameter
2007-08-28 19:06 ` [35/36] Large blocksize support for ext2 clameter
2007-08-28 19:22   ` Christoph Hellwig
2007-08-28 19:56     ` Christoph Lameter
2007-08-28 19:06 ` [36/36] Reiserfs: Fix up for mapping_set_gfp_mask clameter
2007-08-28 19:20 ` [00/36] Large Blocksize Support V6 Christoph Hellwig
2007-08-28 19:55   ` Christoph Lameter
2007-09-01  1:11     ` Christoph Lameter
2007-09-01 19:17 ` Peter Zijlstra
2007-09-02 11:44   ` Christoph Lameter
