All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/8] remove i_alloc_sem
@ 2011-06-20 20:15 Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                   ` (9 more replies)
  0 siblings, 10 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

i_alloc_sem has always been a bit of an odd "lock".  It's the only remaining
rw_semaphore that can be released by a different thread than the one that
locked it, and it's use case in the core direct I/O code is more like a
counter given that the writers already have external serialization.

This series removes it in favour of a simpler counter scheme, thus getting
rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

The only nasty bit is that two filesystems (fat and ext4) have started
abusing the lock for their own purposes.  I've added a new rw_semaphore
to their filesystem-specific inode structures to keep the current behaviour,
but I suspect the maintainers might have smarted ideas to archive the same
goal.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/8] far: remove i_alloc_sem abuse
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Add a new rw_semaphore to protect bmap against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
 	sector_t blocknr;
 
 	/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-	down_read(&mapping->host->i_alloc_sem);
+	down_read(&MSDOS_I(mapping->host)->truncate_lock);
 	blocknr = generic_block_bmap(mapping, block, fat_get_block);
-	up_read(&mapping->host->i_alloc_sem);
+	up_read(&MSDOS_I(mapping->host)->truncate_lock);
 
 	return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
 	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	init_rwsem(&ei->truncate_lock);
 	return &ei->vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
===================================================================
--- linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
 	int i_attrs;		/* unused attribute bits */
 	loff_t i_pos;		/* on-disk position of directory entry or 0 */
 	struct hlist_node i_fat_hash;	/* hash by i_location */
+	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
 	struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		down_write(&MSDOS_I(inode)->truncate_lock);
 		truncate_setsize(inode, attr->ia_size);
 		fat_truncate_blocks(inode, attr->ia_size);
+		up_write(&MSDOS_I(inode)->truncate_lock);
 	}
 
 	setattr_copy(inode, attr);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/8] far: remove i_alloc_sem abuse
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fat-avoid-i_alloc_sem --]
[-- Type: text/plain, Size: 2152 bytes --]

Add a new rw_semaphore to protect bmap against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fat/inode.c
===================================================================
--- linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
+++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
@@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
 	sector_t blocknr;
 
 	/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
-	down_read(&mapping->host->i_alloc_sem);
+	down_read(&MSDOS_I(mapping->host)->truncate_lock);
 	blocknr = generic_block_bmap(mapping, block, fat_get_block);
-	up_read(&mapping->host->i_alloc_sem);
+	up_read(&MSDOS_I(mapping->host)->truncate_lock);
 
 	return blocknr;
 }
@@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
 	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
 	if (!ei)
 		return NULL;
+
+	init_rwsem(&ei->truncate_lock);
 	return &ei->vfs_inode;
 }
 
Index: linux-2.6/fs/fat/fat.h
===================================================================
--- linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
+++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
@@ -109,6 +109,7 @@ struct msdos_inode_info {
 	int i_attrs;		/* unused attribute bits */
 	loff_t i_pos;		/* on-disk position of directory entry or 0 */
 	struct hlist_node i_fat_hash;	/* hash by i_location */
+	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
 	struct inode vfs_inode;
 };
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
@@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		down_write(&MSDOS_I(inode)->truncate_lock);
 		truncate_setsize(inode, attr->ia_size);
 		fat_truncate_blocks(inode, attr->ia_size);
+		up_write(&MSDOS_I(inode)->truncate_lock);
 	}
 
 	setattr_copy(inode, attr);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/8] ext4: remove i_alloc_sem abuse
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
@@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
 			if (attr->ia_size > sbi->s_bitmap_maxbytes)
 				return -EFBIG;
 		}
+		down_write(&(EXT4_I(inode)->truncate_lock));
 	}
 
 	if (S_ISREG(inode->i_mode) &&
@@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
 			ext4_truncate(inode);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		up_write(&(EXT4_I(inode)->truncate_lock));
+
 	if (!rc) {
 		setattr_copy(inode, attr);
 		mark_inode_dirty(inode);
@@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
 	struct address_space *mapping = inode->i_mapping;
 
 	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
+	 * Get truncate_lock to stop truncates messing with the inode. We
+	 * cannot get i_mutex because we are already holding mmap_sem.
 	 */
-	down_read(&inode->i_alloc_sem);
+	down_read(&(EXT4_I(inode)->truncate_lock));
 	size = i_size_read(inode);
 	if (page->mapping != mapping || size <= page_offset(page)
 	    || !PageUptodate(page)) {
@@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	lock_page(page);
 	wait_on_page_writeback(page);
 	if (PageMappedToDisk(page)) {
-		up_read(&inode->i_alloc_sem);
+		up_read(&(EXT4_I(inode)->truncate_lock));
 		return VM_FAULT_LOCKED;
 	}
 
@@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			up_read(&inode->i_alloc_sem);
+			up_read(&(EXT4_I(inode)->truncate_lock));
 			return VM_FAULT_LOCKED;
 		}
 	}
@@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
 	 */
 	lock_page(page);
 	wait_on_page_writeback(page);
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return VM_FAULT_LOCKED;
 out_unlock:
 	if (ret)
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return ret;
 }
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
+++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
@@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_aiodio_unwritten, 0);
+	init_rwsem(&ei->truncate_lock);
 
 	return &ei->vfs_inode;
 }
Index: linux-2.6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
+++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
@@ -844,6 +844,9 @@ struct ext4_inode_info {
 	/* on-disk additional length */
 	__u16 i_extra_isize;
 
+	/* protect page_mkwrite against truncates */
+	struct rw_semaphore truncate_lock;
+
 #ifdef CONFIG_QUOTA
 	/* quota space reservation, managed internally by quota code */
 	qsize_t i_reserved_quota;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 2/8] ext4: remove i_alloc_sem abuse
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: ext4-introduce-truncate-mutex --]
[-- Type: text/plain, Size: 3431 bytes --]

Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
i_alloc_sem was abused for this, but it's going away in this series.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
@@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
 			if (attr->ia_size > sbi->s_bitmap_maxbytes)
 				return -EFBIG;
 		}
+		down_write(&(EXT4_I(inode)->truncate_lock));
 	}
 
 	if (S_ISREG(inode->i_mode) &&
@@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
 			ext4_truncate(inode);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		up_write(&(EXT4_I(inode)->truncate_lock));
+
 	if (!rc) {
 		setattr_copy(inode, attr);
 		mark_inode_dirty(inode);
@@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
 	struct address_space *mapping = inode->i_mapping;
 
 	/*
-	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
-	 * get i_mutex because we are already holding mmap_sem.
+	 * Get truncate_lock to stop truncates messing with the inode. We
+	 * cannot get i_mutex because we are already holding mmap_sem.
 	 */
-	down_read(&inode->i_alloc_sem);
+	down_read(&(EXT4_I(inode)->truncate_lock));
 	size = i_size_read(inode);
 	if (page->mapping != mapping || size <= page_offset(page)
 	    || !PageUptodate(page)) {
@@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	lock_page(page);
 	wait_on_page_writeback(page);
 	if (PageMappedToDisk(page)) {
-		up_read(&inode->i_alloc_sem);
+		up_read(&(EXT4_I(inode)->truncate_lock));
 		return VM_FAULT_LOCKED;
 	}
 
@@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
 	if (page_has_buffers(page)) {
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
-			up_read(&inode->i_alloc_sem);
+			up_read(&(EXT4_I(inode)->truncate_lock));
 			return VM_FAULT_LOCKED;
 		}
 	}
@@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
 	 */
 	lock_page(page);
 	wait_on_page_writeback(page);
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return VM_FAULT_LOCKED;
 out_unlock:
 	if (ret)
 		ret = VM_FAULT_SIGBUS;
-	up_read(&inode->i_alloc_sem);
+	up_read(&(EXT4_I(inode)->truncate_lock));
 	return ret;
 }
Index: linux-2.6/fs/ext4/super.c
===================================================================
--- linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
+++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
@@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_ioend_count, 0);
 	atomic_set(&ei->i_aiodio_unwritten, 0);
+	init_rwsem(&ei->truncate_lock);
 
 	return &ei->vfs_inode;
 }
Index: linux-2.6/fs/ext4/ext4.h
===================================================================
--- linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
+++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
@@ -844,6 +844,9 @@ struct ext4_inode_info {
 	/* on-disk additional length */
 	__u16 i_extra_isize;
 
+	/* protect page_mkwrite against truncates */
+	struct rw_semaphore truncate_lock;
+
 #ifdef CONFIG_QUOTA
 	/* quota space reservation, managed internally by quota code */
 	qsize_t i_reserved_quota;


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-11 12:10:10.205165161 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-11 12:12:49.161823781 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 	}
 
+	/* watch out for a 0 len io from a tricksy fs */
+	if (rw == READ && end == offset)
+		return 0;
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
+		if (rw == READ) {
 			struct address_space *mapping =
 					iocb->ki_filp->f_mapping;
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-cleanup-zero-size-dio-reads --]
[-- Type: text/plain, Size: 983 bytes --]

Reject zero sized reads as soon as we know our I/O length, and don't
borther with locks or allocations that might have to be cleaned up
otherwise.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-11 12:10:10.205165161 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-11 12:12:49.161823781 +0200
@@ -1200,6 +1200,10 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 	}
 
+	/* watch out for a 0 len io from a tricksy fs */
+	if (rw == READ && end == offset)
+		return 0;
+
 	dio = kmalloc(sizeof(*dio), GFP_KERNEL);
 	retval = -ENOMEM;
 	if (!dio)
@@ -1213,8 +1217,7 @@ __blockdev_direct_IO(int rw, struct kioc
 
 	dio->flags = flags;
 	if (dio->flags & DIO_LOCKING) {
-		/* watch out for a 0 len io from a tricksy fs */
-		if (rw == READ && end > offset) {
+		if (rw == READ) {
 			struct address_space *mapping =
 					iocb->ki_filp->f_mapping;
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion.  It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:31.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
@@ -136,6 +136,27 @@ struct dio {
 };
 
 /*
+ * Wait for outstanding DIO requests to finish.  Must be locked against
+ * increments of i_dio_count by i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+	might_sleep();
+	while (atomic_read(&inode->i_dio_count)) {
+		wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
+			    TASK_UNINTERRUPTIBLE);
+	}
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+void inode_dio_wake(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_dio_count))
+		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wake);
+
+/*
  * How many pages are in the queue?
  */
 static inline unsigned dio_pages_present(struct dio *dio)
@@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio *
 	}
 
 	if (dio->flags & DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
-
+		inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -980,9 +999,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    For reads and writes i_alloc_sem is taken in shared mode and released
- *    on I/O completion (which may happen asynchronously after returning to
- *    the caller).
+ *    The i_dio_count counter keeps track of the number of outstanding
+ *    direct I/O requests, and truncate waits for it to reach zero.
+ *    New references to i_dio_count must only be grabbed with i_mutex
+ *    held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
- *    For reads and writes both i_mutex and i_alloc_sem are not held on
- *    entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 
 		/*
-		 * Will be released at I/O completion, possibly in a
-		 * different thread.
+		 * Will be decremented at I/O completion time.
 		 */
-		down_read_non_owner(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2011-06-20 14:19:27.019266696 +0200
+++ linux-2.6/mm/filemap.c	2011-06-20 14:55:34.605823617 +0200
@@ -78,9 +78,6 @@
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
- *  ->i_mutex
- *    ->i_alloc_sem             (various)
- *
  *  inode_wb_list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/rmap.c	2011-06-20 14:55:34.605823617 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
- *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 14:55:34.609156951 +0200
@@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry
 		return error;
 
 	if (ia_valid & ATTR_SIZE)
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(inode);
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
 		error = simple_setattr(dentry, attr);
 
-	if (ia_valid & ATTR_SIZE)
-		up_write(&dentry->d_inode->i_alloc_sem);
-
 	if (!error)
 		fsnotify_change(dentry, ia_valid);
 
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/file.c	2011-06-20 14:55:34.609156951 +0200
@@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write(
 	 * fails again.
 	 */
 	if (unlikely(NInoTruncateFailed(ni))) {
-		down_write(&vi->i_alloc_sem);
+		inode_dio_wait(vi);
 		err = ntfs_truncate(vi);
-		up_write(&vi->i_alloc_sem);
 		if (err || NInoTruncateFailed(ni)) {
 			if (!err)
 				err = -EIO;
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/reiserfs/xattr.c	2011-06-20 14:55:34.612490285 +0200
@@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf
 
 		reiserfs_write_unlock(inode->i_sb);
 		mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR);
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(dentry->d_inode);
 		reiserfs_write_lock(inode->i_sb);
 
 		err = reiserfs_setattr(dentry, &newattrs);
-		up_write(&dentry->d_inode->i_alloc_sem);
 		mutex_unlock(&dentry->d_inode->i_mutex);
 	} else
 		update_ctime(inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:55:34.615823619 +0200
@@ -776,7 +776,7 @@ struct inode {
 	struct timespec		i_ctime;
 	blkcnt_t		i_blocks;
 	unsigned short          i_bytes;
-	struct rw_semaphore	i_alloc_sem;
+	atomic_t		i_dio_count;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock	*i_flock;
 	struct address_space	*i_mapping;
@@ -1692,6 +1692,10 @@ struct super_operations {
  *			set during data writeback, and cleared with a wakeup
  *			on the bit address once it is done.
  *
+ * I_REFERENCED		Marks the inode as recently references on the LRU list.
+ *
+ * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1705,6 +1709,8 @@ struct super_operations {
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
+#define __I_DIO_WAKEUP		9
+#define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1815,7 +1821,6 @@ struct file_system_type {
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
 	struct lock_class_key i_mutex_dir_key;
-	struct lock_class_key i_alloc_sem_key;
 };
 
 extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
@@ -2367,6 +2372,8 @@ enum {
 };
 
 void dio_end_io(struct bio *bio, int error);
+void inode_dio_wait(struct inode *inode);
+void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-06-20 14:55:34.619156952 +0200
@@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode
 		return -ENOSYS;
 
 	mutex_lock(&inode->i_mutex);
-	down_write(&inode->i_alloc_sem);
+	inode_dio_wait(inode);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	truncate_inode_pages_range(mapping, offset, end);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	inode->i_op->truncate_range(inode, offset, end);
-	up_write(&inode->i_alloc_sem);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/inode.c	2011-06-20 14:55:34.625823618 +0200
@@ -176,8 +176,7 @@ int inode_init_always(struct super_block
 	mutex_init(&inode->i_mutex);
 	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
 
-	init_rwsem(&inode->i_alloc_sem);
-	lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key);
+	atomic_set(&inode->i_dio_count, 0);
 
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/inode.c	2011-06-20 14:55:34.629156951 +0200
@@ -2357,12 +2357,7 @@ static const char *es = "  Leaving incon
  *
  * Returns 0 on success or -errno on error.
  *
- * Called with ->i_mutex held.  In all but one case ->i_alloc_sem is held for
- * writing.  The only case in the kernel where ->i_alloc_sem is not held is
- * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called
- * with the current i_size as the offset.  The analogous place in NTFS is in
- * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again
- * without holding ->i_alloc_sem.
+ * Called with ->i_mutex held.
  */
 int ntfs_truncate(struct inode *vi)
 {
@@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi)
  * We also abort all changes of user, group, and mode as we do not implement
  * the NTFS ACLs yet.
  *
- * Called with ->i_mutex held.  For the ATTR_SIZE (i.e. ->truncate) case, also
- * called with ->i_alloc_sem held for writing.
+ * Called with ->i_mutex held.
  */
 int ntfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
@@ -551,9 +551,8 @@ bail:
 
 /*
  * ocfs2_dio_end_io is called by the dio core when a dio is finished.  We're
- * particularly interested in the aio/dio case.  Like the core uses
- * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from
- * truncation on another.
+ * particularly interested in the aio/dio case.  We use the rw_lock DLM lock
+ * to protect io on one node from truncation on another.
  */
 static void ocfs2_dio_end_io(struct kiocb *iocb,
 			     loff_t offset,
@@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
 	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:55:34.635823617 +0200
@@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru
 	ocfs2_iocb_clear_sem_locked(iocb);
 
 relock:
-	/* to match setattr's i_mutex -> i_alloc_sem -> rw_lock ordering */
+	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		down_read(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2290,7 +2290,7 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2361,8 +2361,7 @@ out_dio:
 	/*
 	 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
 	 * function pointer which is called when o_direct io completes so that
-	 * it can unlock our rw lock.  (it's the clustered equivalent of
-	 * i_alloc_sem; protects truncate from racing with pending ios).
+	 * it can unlock our rw lock.
 	 * Unfortunately there are error cases which call end_io and others
 	 * that don't.  so we don't have to unlock the rw_lock if either an
 	 * async dio is going to do it in the future or an end_io after an
@@ -2379,7 +2378,7 @@ out:
 
 out_sems:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
@@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc
 	 * need locks to protect pending reads from racing with truncate.
 	 */
 	if (filp->f_flags & O_DIRECT) {
-		down_read(&inode->i_alloc_sem);
 		have_alloc_sem = 1;
+		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc
 
 bail:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 	if (rw_level != -1)
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/madvise.c	2011-06-20 14:55:34.635823617 +0200
@@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are
 	endoff = (loff_t)(end - vma->vm_start - 1)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	/* vmtruncate_range needs to take i_mutex and i_alloc_sem */
+	/* vmtruncate_range needs to take i_mutex */
 	up_read(&current->mm->mmap_sem);
 	error = vmtruncate_range(mapping->host, offset, endoff);
 	down_read(&current->mm->mmap_sem);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 4/8] fs: kill i_alloc_sem
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-kill-i_alloc_sem --]
[-- Type: text/plain, Size: 14968 bytes --]

i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion.  It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.

Replace it with a hand-grown construct:

 - exclusion for truncates is already guaranteed by i_mutex, so it can
   simply fall way
 - the reader side is replaced by an i_dio_count member in struct inode
   that counts the number of pending direct I/O requests.  Truncate can't
   proceed as long as it's non-zero
 - when i_dio_count reaches non-zero we wake up a pending truncate using
   wake_up_bit on a new bit in i_flags
 - new references to i_dio_count can't appear while we are waiting for
   it to read zero because the direct I/O count always needs i_mutex
   (or an equivalent like XFS's i_iolock) for starting a new operation.

This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
system).

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:31.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
@@ -136,6 +136,27 @@ struct dio {
 };
 
 /*
+ * Wait for outstanding DIO requests to finish.  Must be locked against
+ * increments of i_dio_count by i_mutex.
+ */
+void inode_dio_wait(struct inode *inode)
+{
+	might_sleep();
+	while (atomic_read(&inode->i_dio_count)) {
+		wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
+			    TASK_UNINTERRUPTIBLE);
+	}
+}
+EXPORT_SYMBOL_GPL(inode_dio_wait);
+
+void inode_dio_wake(struct inode *inode)
+{
+	if (atomic_dec_and_test(&inode->i_dio_count))
+		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
+}
+EXPORT_SYMBOL_GPL(inode_dio_wake);
+
+/*
  * How many pages are in the queue?
  */
 static inline unsigned dio_pages_present(struct dio *dio)
@@ -254,9 +275,7 @@ static ssize_t dio_complete(struct dio *
 	}
 
 	if (dio->flags & DIO_LOCKING)
-		/* lockdep: non-owner release */
-		up_read_non_owner(&dio->inode->i_alloc_sem);
-
+		inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -980,9 +999,6 @@ out:
 	return ret;
 }
 
-/*
- * Releases both i_mutex and i_alloc_sem
- */
 static ssize_t
 direct_io_worker(int rw, struct kiocb *iocb, struct inode *inode, 
 	const struct iovec *iov, loff_t offset, unsigned long nr_segs, 
@@ -1146,15 +1162,14 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    For reads and writes i_alloc_sem is taken in shared mode and released
- *    on I/O completion (which may happen asynchronously after returning to
- *    the caller).
+ *    The i_dio_count counter keeps track of the number of outstanding
+ *    direct I/O requests, and truncate waits for it to reach zero.
+ *    New references to i_dio_count must only be grabbed with i_mutex
+ *    held.
  *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
- *    For reads and writes both i_mutex and i_alloc_sem are not held on
- *    entry and are never taken.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1234,10 +1249,9 @@ __blockdev_direct_IO(int rw, struct kioc
 		}
 
 		/*
-		 * Will be released at I/O completion, possibly in a
-		 * different thread.
+		 * Will be decremented at I/O completion time.
 		 */
-		down_read_non_owner(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2011-06-20 14:19:27.019266696 +0200
+++ linux-2.6/mm/filemap.c	2011-06-20 14:55:34.605823617 +0200
@@ -78,9 +78,6 @@
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
  *
- *  ->i_mutex
- *    ->i_alloc_sem             (various)
- *
  *  inode_wb_list_lock
  *    sb_lock			(fs/fs-writeback.c)
  *    ->mapping->tree_lock	(__sync_single_inode)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/rmap.c	2011-06-20 14:55:34.605823617 +0200
@@ -21,7 +21,6 @@
  * Lock ordering in mm:
  *
  * inode->i_mutex	(while writing or truncating, not reading or faulting)
- *   inode->i_alloc_sem (vmtruncate_range)
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 14:55:34.609156951 +0200
@@ -233,16 +233,13 @@ int notify_change(struct dentry * dentry
 		return error;
 
 	if (ia_valid & ATTR_SIZE)
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(inode);
 
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
 		error = simple_setattr(dentry, attr);
 
-	if (ia_valid & ATTR_SIZE)
-		up_write(&dentry->d_inode->i_alloc_sem);
-
 	if (!error)
 		fsnotify_change(dentry, ia_valid);
 
Index: linux-2.6/fs/ntfs/file.c
===================================================================
--- linux-2.6.orig/fs/ntfs/file.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/file.c	2011-06-20 14:55:34.609156951 +0200
@@ -1832,9 +1832,8 @@ static ssize_t ntfs_file_buffered_write(
 	 * fails again.
 	 */
 	if (unlikely(NInoTruncateFailed(ni))) {
-		down_write(&vi->i_alloc_sem);
+		inode_dio_wait(vi);
 		err = ntfs_truncate(vi);
-		up_write(&vi->i_alloc_sem);
 		if (err || NInoTruncateFailed(ni)) {
 			if (!err)
 				err = -EIO;
Index: linux-2.6/fs/reiserfs/xattr.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/xattr.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/reiserfs/xattr.c	2011-06-20 14:55:34.612490285 +0200
@@ -555,11 +555,10 @@ reiserfs_xattr_set_handle(struct reiserf
 
 		reiserfs_write_unlock(inode->i_sb);
 		mutex_lock_nested(&dentry->d_inode->i_mutex, I_MUTEX_XATTR);
-		down_write(&dentry->d_inode->i_alloc_sem);
+		inode_dio_wait(dentry->d_inode);
 		reiserfs_write_lock(inode->i_sb);
 
 		err = reiserfs_setattr(dentry, &newattrs);
-		up_write(&dentry->d_inode->i_alloc_sem);
 		mutex_unlock(&dentry->d_inode->i_mutex);
 	} else
 		update_ctime(inode);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:55:34.615823619 +0200
@@ -776,7 +776,7 @@ struct inode {
 	struct timespec		i_ctime;
 	blkcnt_t		i_blocks;
 	unsigned short          i_bytes;
-	struct rw_semaphore	i_alloc_sem;
+	atomic_t		i_dio_count;
 	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops */
 	struct file_lock	*i_flock;
 	struct address_space	*i_mapping;
@@ -1692,6 +1692,10 @@ struct super_operations {
  *			set during data writeback, and cleared with a wakeup
  *			on the bit address once it is done.
  *
+ * I_REFERENCED		Marks the inode as recently references on the LRU list.
+ *
+ * I_DIO_WAKEUP		Never set.  Only used as a key for wait_on_bit().
+ *
  * Q: What is the difference between I_WILL_FREE and I_FREEING?
  */
 #define I_DIRTY_SYNC		(1 << 0)
@@ -1705,6 +1709,8 @@ struct super_operations {
 #define __I_SYNC		7
 #define I_SYNC			(1 << __I_SYNC)
 #define I_REFERENCED		(1 << 8)
+#define __I_DIO_WAKEUP		9
+#define I_DIO_WAKEUP		(1 << I_DIO_WAKEUP)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
@@ -1815,7 +1821,6 @@ struct file_system_type {
 	struct lock_class_key i_lock_key;
 	struct lock_class_key i_mutex_key;
 	struct lock_class_key i_mutex_dir_key;
-	struct lock_class_key i_alloc_sem_key;
 };
 
 extern struct dentry *mount_ns(struct file_system_type *fs_type, int flags,
@@ -2367,6 +2372,8 @@ enum {
 };
 
 void dio_end_io(struct bio *bio, int error);
+void inode_dio_wait(struct inode *inode);
+void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/memory.c	2011-06-20 14:55:34.619156952 +0200
@@ -2811,12 +2811,11 @@ int vmtruncate_range(struct inode *inode
 		return -ENOSYS;
 
 	mutex_lock(&inode->i_mutex);
-	down_write(&inode->i_alloc_sem);
+	inode_dio_wait(inode);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	truncate_inode_pages_range(mapping, offset, end);
 	unmap_mapping_range(mapping, offset, (end - offset), 1);
 	inode->i_op->truncate_range(inode, offset, end);
-	up_write(&inode->i_alloc_sem);
 	mutex_unlock(&inode->i_mutex);
 
 	return 0;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/inode.c	2011-06-20 14:55:34.625823618 +0200
@@ -176,8 +176,7 @@ int inode_init_always(struct super_block
 	mutex_init(&inode->i_mutex);
 	lockdep_set_class(&inode->i_mutex, &sb->s_type->i_mutex_key);
 
-	init_rwsem(&inode->i_alloc_sem);
-	lockdep_set_class(&inode->i_alloc_sem, &sb->s_type->i_alloc_sem_key);
+	atomic_set(&inode->i_dio_count, 0);
 
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
Index: linux-2.6/fs/ntfs/inode.c
===================================================================
--- linux-2.6.orig/fs/ntfs/inode.c	2011-06-20 14:19:26.000000000 +0200
+++ linux-2.6/fs/ntfs/inode.c	2011-06-20 14:55:34.629156951 +0200
@@ -2357,12 +2357,7 @@ static const char *es = "  Leaving incon
  *
  * Returns 0 on success or -errno on error.
  *
- * Called with ->i_mutex held.  In all but one case ->i_alloc_sem is held for
- * writing.  The only case in the kernel where ->i_alloc_sem is not held is
- * mm/filemap.c::generic_file_buffered_write() where vmtruncate() is called
- * with the current i_size as the offset.  The analogous place in NTFS is in
- * fs/ntfs/file.c::ntfs_file_buffered_write() where we call vmtruncate() again
- * without holding ->i_alloc_sem.
+ * Called with ->i_mutex held.
  */
 int ntfs_truncate(struct inode *vi)
 {
@@ -2887,8 +2882,7 @@ void ntfs_truncate_vfs(struct inode *vi)
  * We also abort all changes of user, group, and mode as we do not implement
  * the NTFS ACLs yet.
  *
- * Called with ->i_mutex held.  For the ATTR_SIZE (i.e. ->truncate) case, also
- * called with ->i_alloc_sem held for writing.
+ * Called with ->i_mutex held.
  */
 int ntfs_setattr(struct dentry *dentry, struct iattr *attr)
 {
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
@@ -551,9 +551,8 @@ bail:
 
 /*
  * ocfs2_dio_end_io is called by the dio core when a dio is finished.  We're
- * particularly interested in the aio/dio case.  Like the core uses
- * i_alloc_sem, we use the rw_lock DLM lock to protect io on one node from
- * truncation on another.
+ * particularly interested in the aio/dio case.  We use the rw_lock DLM lock
+ * to protect io on one node from truncation on another.
  */
 static void ocfs2_dio_end_io(struct kiocb *iocb,
 			     loff_t offset,
@@ -569,7 +568,7 @@ static void ocfs2_dio_end_io(struct kioc
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
 	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:55:34.635823617 +0200
@@ -2236,9 +2236,9 @@ static ssize_t ocfs2_file_aio_write(stru
 	ocfs2_iocb_clear_sem_locked(iocb);
 
 relock:
-	/* to match setattr's i_mutex -> i_alloc_sem -> rw_lock ordering */
+	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		down_read(&inode->i_alloc_sem);
+		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2290,7 +2290,7 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2361,8 +2361,7 @@ out_dio:
 	/*
 	 * deep in g_f_a_w_n()->ocfs2_direct_IO we pass in a ocfs2_dio_end_io
 	 * function pointer which is called when o_direct io completes so that
-	 * it can unlock our rw lock.  (it's the clustered equivalent of
-	 * i_alloc_sem; protects truncate from racing with pending ios).
+	 * it can unlock our rw lock.
 	 * Unfortunately there are error cases which call end_io and others
 	 * that don't.  so we don't have to unlock the rw_lock if either an
 	 * async dio is going to do it in the future or an end_io after an
@@ -2379,7 +2378,7 @@ out:
 
 out_sems:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 
@@ -2531,8 +2530,8 @@ static ssize_t ocfs2_file_aio_read(struc
 	 * need locks to protect pending reads from racing with truncate.
 	 */
 	if (filp->f_flags & O_DIRECT) {
-		down_read(&inode->i_alloc_sem);
 		have_alloc_sem = 1;
+		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,7 +2574,7 @@ static ssize_t ocfs2_file_aio_read(struc
 
 bail:
 	if (have_alloc_sem) {
-		up_read(&inode->i_alloc_sem);
+		inode_dio_wake(inode);
 		ocfs2_iocb_clear_sem_locked(iocb);
 	}
 	if (rw_level != -1)
Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c	2011-06-20 14:19:27.000000000 +0200
+++ linux-2.6/mm/madvise.c	2011-06-20 14:55:34.635823617 +0200
@@ -218,7 +218,7 @@ static long madvise_remove(struct vm_are
 	endoff = (loff_t)(end - vma->vm_start - 1)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 
-	/* vmtruncate_range needs to take i_mutex and i_alloc_sem */
+	/* vmtruncate_range needs to take i_mutex */
 	up_read(&current->mm->mmap_sem);
 	error = vmtruncate_range(mapping->host, offset, endoff);
 	down_read(&current->mm->mmap_sem);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 5/8] fs: move inode_dio_wait calls into ->setattr
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks to prevent
new dio referenes from appearing can be held.  This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
 		if (status)
 			goto bail_unlock;
 
+		inode_dio_wait(inode);
+
 		if (i_size_read(inode) > attr->ia_size) {
 			if (ocfs2_should_order_data(inode)) {
 				status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 09:29:06.000000000 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
 	if (error)
 		return error;
 
-	if (ia_valid & ATTR_SIZE)
-		inode_dio_wait(inode);
-
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
 
+	inode_dio_wait(inode);
+
 	if (mapping_is_xip(inode->i_mapping))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
 		ext3_journal_stop(handle);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		inode_dio_wait(inode);
+
 	if (S_ISREG(inode->i_mode) &&
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
 		handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 	 * sequence.
 	 */
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (attr->ia_size > inode->i_size) {
 			error = fat_cont_expand(inode, attr->ia_size);
 			if (error || attr->ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
===================================================================
--- linux-2.6.orig/fs/gfs2/bmap.c	2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c	2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
 	if (ret)
 		return ret;
 
+	inode_dio_wait(inode);
+
 	oldsize = inode->i_size;
 	if (newsize >= oldsize)
 		return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-18 12:54:28.168273676 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2011-06-18 12:54:28.191607007 +0200
+++ linux-2.6/fs/jfs/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		rc = vmtruncate(inode, iattr->ia_size);
 		if (rc)
 			return rc;
Index: linux-2.6/fs/nilfs2/inode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-18 12:54:28.204940339 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry,
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		err = vmtruncate(inode, iattr->ia_size);
 		if (unlikely(err))
 			goto out_err;
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-18 12:54:28.218273673 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent
 			error = -EFBIG;
 			goto out;
 		}
+
+		inode_dio_wait(inode);
+
 		/* fill in hole pointers in the expanding truncate case. */
 		if (attr->ia_size > inode->i_size) {
 			error = generic_cont_expand_simple(inode, attr->ia_size);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 5/8] fs: move inode_dio_wait calls into ->setattr
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-move-dio_wait --]
[-- Type: text/plain, Size: 6289 bytes --]

Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand.  This means filesystem-specific locks to prevent
new dio referenes from appearing can be held.  This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 09:28:54.516815966 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 09:31:34.706807855 +0200
@@ -1142,6 +1142,8 @@ int ocfs2_setattr(struct dentry *dentry,
 		if (status)
 			goto bail_unlock;
 
+		inode_dio_wait(inode);
+
 		if (i_size_read(inode) > attr->ia_size) {
 			if (ocfs2_should_order_data(inode)) {
 				status = ocfs2_begin_ordered_truncate(inode,
Index: linux-2.6/fs/attr.c
===================================================================
--- linux-2.6.orig/fs/attr.c	2011-06-20 09:28:54.490149300 +0200
+++ linux-2.6/fs/attr.c	2011-06-20 09:29:06.000000000 +0200
@@ -232,9 +232,6 @@ int notify_change(struct dentry * dentry
 	if (error)
 		return error;
 
-	if (ia_valid & ATTR_SIZE)
-		inode_dio_wait(inode);
-
 	if (inode->i_op->setattr)
 		error = inode->i_op->setattr(dentry, attr);
 	else
Index: linux-2.6/fs/ext2/inode.c
===================================================================
--- linux-2.6.orig/fs/ext2/inode.c	2011-06-18 12:54:28.058273680 +0200
+++ linux-2.6/fs/ext2/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -1184,6 +1184,8 @@ static int ext2_setsize(struct inode *in
 	if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
 		return -EPERM;
 
+	inode_dio_wait(inode);
+
 	if (mapping_is_xip(inode->i_mapping))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c	2011-06-18 12:54:28.071607014 +0200
+++ linux-2.6/fs/ext3/inode.c	2011-06-20 09:29:06.500148692 +0200
@@ -3216,6 +3216,9 @@ int ext3_setattr(struct dentry *dentry,
 		ext3_journal_stop(handle);
 	}
 
+	if (attr->ia_valid & ATTR_SIZE)
+		inode_dio_wait(inode);
+
 	if (S_ISREG(inode->i_mode) &&
 	    attr->ia_valid & ATTR_SIZE && attr->ia_size < inode->i_size) {
 		handle_t *handle;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 09:28:54.506815967 +0200
+++ linux-2.6/fs/ext4/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -5351,6 +5351,8 @@ int ext4_setattr(struct dentry *dentry,
 	}
 
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
Index: linux-2.6/fs/fat/file.c
===================================================================
--- linux-2.6.orig/fs/fat/file.c	2011-06-18 12:54:28.118273678 +0200
+++ linux-2.6/fs/fat/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -397,6 +397,8 @@ int fat_setattr(struct dentry *dentry, s
 	 * sequence.
 	 */
 	if (attr->ia_valid & ATTR_SIZE) {
+		inode_dio_wait(inode);
+
 		if (attr->ia_size > inode->i_size) {
 			error = fat_cont_expand(inode, attr->ia_size);
 			if (error || attr->ia_valid == ATTR_SIZE)
Index: linux-2.6/fs/gfs2/bmap.c
===================================================================
--- linux-2.6.orig/fs/gfs2/bmap.c	2011-06-18 12:54:28.141607009 +0200
+++ linux-2.6/fs/gfs2/bmap.c	2011-06-20 09:29:06.510148693 +0200
@@ -1224,6 +1224,8 @@ int gfs2_setattr_size(struct inode *inod
 	if (ret)
 		return ret;
 
+	inode_dio_wait(inode);
+
 	oldsize = inode->i_size;
 	if (newsize >= oldsize)
 		return do_grow(inode, newsize);
Index: linux-2.6/fs/hfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hfs/inode.c	2011-06-18 12:54:28.154940342 +0200
+++ linux-2.6/fs/hfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -615,6 +615,8 @@ int hfs_inode_setattr(struct dentry *den
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/hfsplus/inode.c
===================================================================
--- linux-2.6.orig/fs/hfsplus/inode.c	2011-06-18 12:54:28.168273676 +0200
+++ linux-2.6/fs/hfsplus/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -296,6 +296,8 @@ static int hfsplus_setattr(struct dentry
 
 	if ((attr->ia_valid & ATTR_SIZE) &&
 	    attr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		error = vmtruncate(inode, attr->ia_size);
 		if (error)
 			return error;
Index: linux-2.6/fs/jfs/file.c
===================================================================
--- linux-2.6.orig/fs/jfs/file.c	2011-06-18 12:54:28.191607007 +0200
+++ linux-2.6/fs/jfs/file.c	2011-06-20 09:29:06.000000000 +0200
@@ -110,6 +110,8 @@ int jfs_setattr(struct dentry *dentry, s
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		rc = vmtruncate(inode, iattr->ia_size);
 		if (rc)
 			return rc;
Index: linux-2.6/fs/nilfs2/inode.c
===================================================================
--- linux-2.6.orig/fs/nilfs2/inode.c	2011-06-18 12:54:28.204940339 +0200
+++ linux-2.6/fs/nilfs2/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -778,6 +778,8 @@ int nilfs_setattr(struct dentry *dentry,
 
 	if ((iattr->ia_valid & ATTR_SIZE) &&
 	    iattr->ia_size != i_size_read(inode)) {
+		inode_dio_wait(inode);
+
 		err = vmtruncate(inode, iattr->ia_size);
 		if (unlikely(err))
 			goto out_err;
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c	2011-06-18 12:54:28.218273673 +0200
+++ linux-2.6/fs/reiserfs/inode.c	2011-06-20 09:29:06.000000000 +0200
@@ -3114,6 +3114,9 @@ int reiserfs_setattr(struct dentry *dent
 			error = -EFBIG;
 			goto out;
 		}
+
+		inode_dio_wait(inode);
+
 		/* fill in hole pointers in the expanding truncate case. */
 		if (attr->ia_size > inode->i_size) {
 			error = generic_cont_expand_simple(inode, attr->ia_size);


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 6/8] fs: always maintain i_dio_count
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code.  Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait,
which are all that never use DIO_LOCKING.

For ext4 behaviour changes with the dioread_nonlock option, which previous
was missing any protection between truncate and direct I/O reads.

For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code noew available.

As a result inode_dio_wake can now be made static in direct-io.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:57:24.575818051 +0200
@@ -149,12 +149,11 @@ void inode_dio_wait(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_dio_wait);
 
-void inode_dio_wake(struct inode *inode)
+static inline void inode_dio_wake(struct inode *inode)
 {
 	if (atomic_dec_and_test(&inode->i_dio_count))
 		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
-EXPORT_SYMBOL_GPL(inode_dio_wake);
 
 /*
  * How many pages are in the queue?
@@ -274,8 +273,7 @@ static ssize_t dio_complete(struct dio *
 		aio_complete(dio->iocb, ret, 0);
 	}
 
-	if (dio->flags & DIO_LOCKING)
-		inode_dio_wake(dio->inode);
+	inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -1162,14 +1160,16 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    The i_dio_count counter keeps track of the number of outstanding
- *    direct I/O requests, and truncate waits for it to reach zero.
- *    New references to i_dio_count must only be grabbed with i_mutex
- *    held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1247,14 +1247,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 			}
 		}
-
-		/*
-		 * Will be decremented at I/O completion time.
-		 */
-		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
+	 * Will be decremented at I/O completion time.
+	 */
+	atomic_inc(&inode->i_dio_count);
+
+	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
 	 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:56:59.259152666 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
 	/* this io's submitter should not have unlocked this before we could */
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		inode_dio_wake(inode);
+	if (ocfs2_iocb_is_sem_locked(iocb))
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:56:55.375819530 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:56:59.262485999 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
 	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
 		ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	mutex_unlock(&inode->i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 	 */
 	if (filp->f_flags & O_DIRECT) {
 		have_alloc_sem = 1;
-		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc
 	}
 
 bail:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
+
 	if (rw_level != -1)
 		ocfs2_rw_unlock(inode, rw_level);
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-20 14:57:08.582485528 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:57:10.099152117 +0200
@@ -2373,7 +2373,6 @@ enum {
 
 void dio_end_io(struct bio *bio, int error);
 void inode_dio_wait(struct inode *inode);
-void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 6/8] fs: always maintain i_dio_count
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: fs-generalize-dio_count --]
[-- Type: text/plain, Size: 5897 bytes --]

Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
This these filesystems to also protect truncate against direct I/O requests
by using common code.  Right now the only non-DIO_LOCKING filesystem that
appears to do so is XFS, which uses an opencoded variant of the i_dio_count
scheme.

Behaviour doesn't change for filesystems never calling inode_dio_wait,
which are all that never use DIO_LOCKING.

For ext4 behaviour changes with the dioread_nonlock option, which previous
was missing any protection between truncate and direct I/O reads.

For ocfs2 that handcrafted i_dio_count manipulations are replaced with
the common code noew available.

As a result inode_dio_wake can now be made static in direct-io.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
+++ linux-2.6/fs/direct-io.c	2011-06-20 14:57:24.575818051 +0200
@@ -149,12 +149,11 @@ void inode_dio_wait(struct inode *inode)
 }
 EXPORT_SYMBOL_GPL(inode_dio_wait);
 
-void inode_dio_wake(struct inode *inode)
+static inline void inode_dio_wake(struct inode *inode)
 {
 	if (atomic_dec_and_test(&inode->i_dio_count))
 		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
 }
-EXPORT_SYMBOL_GPL(inode_dio_wake);
 
 /*
  * How many pages are in the queue?
@@ -274,8 +273,7 @@ static ssize_t dio_complete(struct dio *
 		aio_complete(dio->iocb, ret, 0);
 	}
 
-	if (dio->flags & DIO_LOCKING)
-		inode_dio_wake(dio->inode);
+	inode_dio_wake(dio->inode);
 	return ret;
 }
 
@@ -1162,14 +1160,16 @@ direct_io_worker(int rw, struct kiocb *i
  *    For writes this function is called under i_mutex and returns with
  *    i_mutex held, for reads, i_mutex is not held on entry, but it is
  *    taken and dropped again before returning.
- *    The i_dio_count counter keeps track of the number of outstanding
- *    direct I/O requests, and truncate waits for it to reach zero.
- *    New references to i_dio_count must only be grabbed with i_mutex
- *    held.
- *
  *  - if the flags value does NOT contain DIO_LOCKING we don't use any
  *    internal locking but rather rely on the filesystem to synchronize
  *    direct I/O reads/writes versus each other and truncate.
+ *
+ * To help with locking against truncate we incremented the i_dio_count
+ * counter before starting direct I/O, and decrement it once we are done.
+ * Truncate can wait for it to reach zero to provide exclusion.  It is
+ * expected that filesystem provide exclusion between new direct I/O
+ * and truncates.  For DIO_LOCKING filesystems this is done by i_mutex,
+ * but other filesystems need to take care of this on their own.
  */
 ssize_t
 __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
@@ -1247,14 +1247,14 @@ __blockdev_direct_IO(int rw, struct kioc
 				goto out;
 			}
 		}
-
-		/*
-		 * Will be decremented at I/O completion time.
-		 */
-		atomic_inc(&inode->i_dio_count);
 	}
 
 	/*
+	 * Will be decremented at I/O completion time.
+	 */
+	atomic_inc(&inode->i_dio_count);
+
+	/*
 	 * For file extending writes updating i_size before data
 	 * writeouts complete can expose uninitialized blocks. So
 	 * even for AIO, we need to wait for i/o to complete before
Index: linux-2.6/fs/ocfs2/aops.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/aops.c	2011-06-20 14:55:34.629156951 +0200
+++ linux-2.6/fs/ocfs2/aops.c	2011-06-20 14:56:59.259152666 +0200
@@ -567,10 +567,8 @@ static void ocfs2_dio_end_io(struct kioc
 	/* this io's submitter should not have unlocked this before we could */
 	BUG_ON(!ocfs2_iocb_is_rw_locked(iocb));
 
-	if (ocfs2_iocb_is_sem_locked(iocb)) {
-		inode_dio_wake(inode);
+	if (ocfs2_iocb_is_sem_locked(iocb))
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	ocfs2_iocb_clear_rw_locked(iocb);
 
Index: linux-2.6/fs/ocfs2/file.c
===================================================================
--- linux-2.6.orig/fs/ocfs2/file.c	2011-06-20 14:56:55.375819530 +0200
+++ linux-2.6/fs/ocfs2/file.c	2011-06-20 14:56:59.262485999 +0200
@@ -2240,7 +2240,6 @@ static ssize_t ocfs2_file_aio_write(stru
 relock:
 	/* to match setattr's i_mutex -> rw_lock ordering */
 	if (direct_io) {
-		atomic_inc(&inode->i_dio_count);
 		have_alloc_sem = 1;
 		/* communicate with ocfs2_dio_end_io */
 		ocfs2_iocb_set_sem_locked(iocb);
@@ -2292,7 +2291,6 @@ relock:
 	 */
 	if (direct_io && !can_do_direct) {
 		ocfs2_rw_unlock(inode, rw_level);
-		inode_dio_wake(inode);
 
 		have_alloc_sem = 0;
 		rw_level = -1;
@@ -2379,10 +2377,8 @@ out:
 		ocfs2_rw_unlock(inode, rw_level);
 
 out_sems:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
 
 	mutex_unlock(&inode->i_mutex);
 
@@ -2533,7 +2529,6 @@ static ssize_t ocfs2_file_aio_read(struc
 	 */
 	if (filp->f_flags & O_DIRECT) {
 		have_alloc_sem = 1;
-		atomic_inc(&inode->i_dio_count);
 		ocfs2_iocb_set_sem_locked(iocb);
 
 		ret = ocfs2_rw_lock(inode, 0);
@@ -2575,10 +2570,9 @@ static ssize_t ocfs2_file_aio_read(struc
 	}
 
 bail:
-	if (have_alloc_sem) {
-		inode_dio_wake(inode);
+	if (have_alloc_sem)
 		ocfs2_iocb_clear_sem_locked(iocb);
-	}
+
 	if (rw_level != -1)
 		ocfs2_rw_unlock(inode, rw_level);
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2011-06-20 14:57:08.582485528 +0200
+++ linux-2.6/include/linux/fs.h	2011-06-20 14:57:10.099152117 +0200
@@ -2373,7 +2373,6 @@ enum {
 
 void dio_end_io(struct bio *bio, int error);
 void inode_dio_wait(struct inode *inode);
-void inode_dio_wake(struct inode *inode);
 
 ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
 	struct block_device *bdev, const struct iovec *iov, loff_t offset,


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 7/8] btrfs: wait for direct I/O requests in truncate
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Wait for all direct I/O requests to finish before performing a truncate.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2011-06-11 12:58:46.615017504 +0200
+++ linux-2.6/fs/btrfs/inode.c	2011-06-11 12:59:23.218348984 +0200
@@ -3550,6 +3550,8 @@ static int btrfs_setsize(struct inode *i
 	loff_t oldsize = i_size_read(inode);
 	int ret;
 
+	inode_dio_wait(inode);
+
 	if (newsize == oldsize)
 		return 0;
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 7/8] btrfs: wait for direct I/O requests in truncate
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: btrfs-call-dio_wait --]
[-- Type: text/plain, Size: 546 bytes --]

Wait for all direct I/O requests to finish before performing a truncate.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/btrfs/inode.c
===================================================================
--- linux-2.6.orig/fs/btrfs/inode.c	2011-06-11 12:58:46.615017504 +0200
+++ linux-2.6/fs/btrfs/inode.c	2011-06-11 12:59:23.218348984 +0200
@@ -3550,6 +3550,8 @@ static int btrfs_setsize(struct inode *i
 	loff_t oldsize = i_size_read(inode);
 	int ret;
 
+	inode_dio_wait(inode);
+
 	if (newsize == oldsize)
 		return 0;
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 8/8] rw_semaphore: remove up/down_read_non_owner
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-20 20:15   ` Christoph Hellwig
  2011-06-20 20:15   ` Christoph Hellwig
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

Now that the last users is gone these can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h	2011-06-20 14:58:15.449148809 +0200
+++ linux-2.6/include/linux/rwsem.h	2011-06-20 14:58:30.915814691 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
-# define down_read_non_owner(sem)		down_read(sem)
-# define up_read_non_owner(sem)			up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
===================================================================
--- linux-2.6.orig/kernel/rwsem.c	2011-06-20 14:58:34.509147842 +0200
+++ linux-2.6/kernel/rwsem.c	2011-06-20 14:58:47.479147187 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-	might_sleep();
-
-	__down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-	__up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 8/8] rw_semaphore: remove up/down_read_non_owner
@ 2011-06-20 20:15   ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:15 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

[-- Attachment #1: remove-rw_semaphore-non_owner --]
[-- Type: text/plain, Size: 1921 bytes --]

Now that the last users is gone these can be removed.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h	2011-06-20 14:58:15.449148809 +0200
+++ linux-2.6/include/linux/rwsem.h	2011-06-20 14:58:30.915814691 +0200
@@ -124,19 +124,9 @@ extern void downgrade_write(struct rw_se
  */
 extern void down_read_nested(struct rw_semaphore *sem, int subclass);
 extern void down_write_nested(struct rw_semaphore *sem, int subclass);
-/*
- * Take/release a lock when not the owner will release it.
- *
- * [ This API should be avoided as much as possible - the
- *   proper abstraction for this case is completions. ]
- */
-extern void down_read_non_owner(struct rw_semaphore *sem);
-extern void up_read_non_owner(struct rw_semaphore *sem);
 #else
 # define down_read_nested(sem, subclass)		down_read(sem)
 # define down_write_nested(sem, subclass)	down_write(sem)
-# define down_read_non_owner(sem)		down_read(sem)
-# define up_read_non_owner(sem)			up_read(sem)
 #endif
 
 #endif /* _LINUX_RWSEM_H */
Index: linux-2.6/kernel/rwsem.c
===================================================================
--- linux-2.6.orig/kernel/rwsem.c	2011-06-20 14:58:34.509147842 +0200
+++ linux-2.6/kernel/rwsem.c	2011-06-20 14:58:47.479147187 +0200
@@ -117,15 +117,6 @@ void down_read_nested(struct rw_semaphor
 
 EXPORT_SYMBOL(down_read_nested);
 
-void down_read_non_owner(struct rw_semaphore *sem)
-{
-	might_sleep();
-
-	__down_read(sem);
-}
-
-EXPORT_SYMBOL(down_read_non_owner);
-
 void down_write_nested(struct rw_semaphore *sem, int subclass)
 {
 	might_sleep();
@@ -136,13 +127,6 @@ void down_write_nested(struct rw_semapho
 
 EXPORT_SYMBOL(down_write_nested);
 
-void up_read_non_owner(struct rw_semaphore *sem)
-{
-	__up_read(sem);
-}
-
-EXPORT_SYMBOL(up_read_non_owner);
-
 #endif
 
 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
                   ` (7 preceding siblings ...)
  2011-06-20 20:15   ` Christoph Hellwig
@ 2011-06-20 20:32 ` Christoph Hellwig
  2011-06-21 23:54 ` Jan Kara
  9 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 20:32 UTC (permalink / raw)
  To: viro, tglx
  Cc: linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

On Mon, Jun 20, 2011 at 04:15:33PM -0400, Christoph Hellwig wrote:
> This series removes it in favour of a simpler counter scheme, thus getting
> rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
> same time shrinking the size of struct inode by 160 bytes on 64-bit systems.

This should be 160 bits, aka 20 bytes of course, sorry.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] fs: always maintain i_dio_count
  2011-06-20 20:15   ` Christoph Hellwig
  (?)
@ 2011-06-20 21:29   ` Joel Becker
  2011-06-20 22:23     ` Christoph Hellwig
  -1 siblings, 1 reply; 36+ messages in thread
From: Joel Becker @ 2011-06-20 21:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh

On Mon, Jun 20, 2011 at 04:15:39PM -0400, Christoph Hellwig wrote:
> Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING.
> This these filesystems to also protect truncate against direct I/O requests
> by using common code.  Right now the only non-DIO_LOCKING filesystem that
> appears to do so is XFS, which uses an opencoded variant of the i_dio_count
> scheme.
> 
> Behaviour doesn't change for filesystems never calling inode_dio_wait,
> which are all that never use DIO_LOCKING.
> 
> For ext4 behaviour changes with the dioread_nonlock option, which previous
> was missing any protection between truncate and direct I/O reads.
> 
> For ocfs2 that handcrafted i_dio_count manipulations are replaced with
> the common code noew available.

	Oh god you're making the world scary.  Are you guaranteeing that
all allocation changes are locked out by the time we get into
file_aio_write() and file_aio_read()?  This is not obvious to me.

Joel

-- 

"Glory is fleeting, but obscurity is forever."  
         - Napoleon Bonaparte

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-20 20:15   ` Christoph Hellwig
  (?)
@ 2011-06-20 21:32   ` Joel Becker
  2011-06-20 22:18     ` Christoph Hellwig
  -1 siblings, 1 reply; 36+ messages in thread
From: Joel Becker @ 2011-06-20 21:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh

On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:
> i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
> be released by a non-owner, and it's write side is always mirrored by
> real exclusion.  It's intended use it to wait for all pending direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>    simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>    that counts the number of pending direct I/O requests.  Truncate can't
>    proceed as long as it's non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>    wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can't appear while we are waiting for
>    it to read zero because the direct I/O count always needs i_mutex
>    (or an equivalent like XFS's i_iolock) for starting a new operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).

	Are we guaranteed that all allocation changes are locked out by
i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
assumes the state of a file's allocation when it holds i_alloc_sem.  I
feel like we lose that here. 

Joel

-- 

"I don't even butter my bread; I consider that cooking."
         - Katherine Cebrian

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-20 21:32   ` Joel Becker
@ 2011-06-20 22:18     ` Christoph Hellwig
  2011-07-01  2:58       ` Joel Becker
  0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 22:18 UTC (permalink / raw)
  To: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi

On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
> 	Are we guaranteed that all allocation changes are locked out by
> i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
> assumes the state of a file's allocation when it holds i_alloc_sem.  I
> feel like we lose that here. 

You aren't, neither with the old i_alloc_sem code, nor with the 1:1
replacement using i_dio_count.

Do a quick grep who gets i_alloc_sem exclusively (down_write): it's
really just the truncate code, and it's cut & paste duplicates in ntfs
and reiserfs.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] fs: always maintain i_dio_count
  2011-06-20 21:29   ` Joel Becker
@ 2011-06-20 22:23     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-20 22:23 UTC (permalink / raw)
  To: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh

On Mon, Jun 20, 2011 at 02:29:24PM -0700, Joel Becker wrote:
> 	Oh god you're making the world scary.  Are you guaranteeing that
> all allocation changes are locked out by the time we get into
> file_aio_write() and file_aio_read()?  This is not obvious to me.

I have no idea how ocfs2's internal allocator locking works, but this
patch doesn't change it.  What this patch touches is exclusion between
truncate and pending direct I/O requests, and even there only the
implementation and not the semantics.

The old and new semantics are that you may have either

	1 ongoing truncate

OR

	n (>= 0; <= ATOMIC_T_MAX) ongoing direct I/O reads or writes

before that was enforced using the i_alloc_sem rw_semaphore, including
non-owner releases of it from AIO code, in the new code it's done using
a combination of i_mutex which was already taken in the truncate path,
and when starting new direct I/O requests, and the new i_dio_count
counter.

> 
> Joel
> 
> -- 
> 
> "Glory is fleeting, but obscurity is forever."  
>          - Napoleon Bonaparte
> 
> 			http://www.jlbec.org/
> 			jlbec@evilplan.org
---end quoted text---

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-20 20:15   ` Christoph Hellwig
  (?)
  (?)
@ 2011-06-21  5:40   ` Dave Chinner
  2011-06-21  9:35     ` Christoph Hellwig
  -1 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2011-06-21  5:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi,
	mfasheh, jlbec

On Mon, Jun 20, 2011 at 04:15:37PM -0400, Christoph Hellwig wrote:
> i_alloc_sem is a rather special rw_semaphore.  It's the last one that may
> be released by a non-owner, and it's write side is always mirrored by
> real exclusion.  It's intended use it to wait for all pending direct I/O
> requests to finish before starting a truncate.
> 
> Replace it with a hand-grown construct:
> 
>  - exclusion for truncates is already guaranteed by i_mutex, so it can
>    simply fall way
>  - the reader side is replaced by an i_dio_count member in struct inode
>    that counts the number of pending direct I/O requests.  Truncate can't
>    proceed as long as it's non-zero
>  - when i_dio_count reaches non-zero we wake up a pending truncate using
>    wake_up_bit on a new bit in i_flags
>  - new references to i_dio_count can't appear while we are waiting for
>    it to read zero because the direct I/O count always needs i_mutex
>    (or an equivalent like XFS's i_iolock) for starting a new operation.
> 
> This scheme is much simpler, and saves the space of a spinlock_t and a
> struct list_head in struct inode (typically 160 bytes on a non-debug 64-bit
> system).
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c	2011-06-20 14:55:31.000000000 +0200
> +++ linux-2.6/fs/direct-io.c	2011-06-20 14:55:34.602490284 +0200
> @@ -136,6 +136,27 @@ struct dio {
>  };
>  
>  /*
> + * Wait for outstanding DIO requests to finish.  Must be locked against
> + * increments of i_dio_count by i_mutex.
> + */
> +void inode_dio_wait(struct inode *inode)
> +{
> +	might_sleep();
> +	while (atomic_read(&inode->i_dio_count)) {
> +		wait_on_bit(&inode->i_state, __I_DIO_WAKEUP, inode_wait,
> +			    TASK_UNINTERRUPTIBLE);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wait);
> +
> +void inode_dio_wake(struct inode *inode)
> +{
> +	if (atomic_dec_and_test(&inode->i_dio_count))
> +		wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
> +}
> +EXPORT_SYMBOL_GPL(inode_dio_wake);

Modification of inode->i_state is not safe outside the
inode->i_lock.

This probably needs to be implemented similar to the
__I_NEW/__wait_on_freeing_inode() and
__I_SYNC/inode_wait_for_writeback() pattern...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-21  5:40   ` Dave Chinner
@ 2011-06-21  9:35     ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-21  9:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi, mfasheh, jlbec

On Tue, Jun 21, 2011 at 03:40:56PM +1000, Dave Chinner wrote:
> Modification of inode->i_state is not safe outside the
> inode->i_lock.

We never actually set the new bit in i_state, we just use it as a key
for the hashed lookups.  Or rather we try to, as I misunderstood how
wait_on_bit works, so currently we busywait for i_dio_count to reach
zero.  I'll respin a version that actually works as expected.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] far: remove i_alloc_sem abuse
  2011-06-20 20:15   ` Christoph Hellwig
  (?)
@ 2011-06-21 15:57   ` OGAWA Hirofumi
  2011-06-21 16:09     ` OGAWA Hirofumi
  2011-06-21 16:09     ` Christoph Hellwig
  -1 siblings, 2 replies; 36+ messages in thread
From: OGAWA Hirofumi @ 2011-06-21 15:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, mfasheh, jlbec

Christoph Hellwig <hch@infradead.org> writes:

> Add a new rw_semaphore to protect bmap against truncate.  Previous
> i_alloc_sem was abused for this, but it's going away in this series.

In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
->i_mutex to call ->bmap. So, this was chosen instead.

I'm not checking current version yet though, if shmfs had change, we can
use ->i_mutex.

BTW, this patch looks good to me.

Thanks.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Index: linux-2.6/fs/fat/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/fat/inode.c	2011-06-20 21:28:19.707963855 +0200
> +++ linux-2.6/fs/fat/inode.c	2011-06-20 21:29:25.031293882 +0200
> @@ -224,9 +224,9 @@ static sector_t _fat_bmap(struct address
>  	sector_t blocknr;
>  
>  	/* fat_get_cluster() assumes the requested blocknr isn't truncated. */
> -	down_read(&mapping->host->i_alloc_sem);
> +	down_read(&MSDOS_I(mapping->host)->truncate_lock);
>  	blocknr = generic_block_bmap(mapping, block, fat_get_block);
> -	up_read(&mapping->host->i_alloc_sem);
> +	up_read(&MSDOS_I(mapping->host)->truncate_lock);
>  
>  	return blocknr;
>  }
> @@ -510,6 +510,8 @@ static struct inode *fat_alloc_inode(str
>  	ei = kmem_cache_alloc(fat_inode_cachep, GFP_NOFS);
>  	if (!ei)
>  		return NULL;
> +
> +	init_rwsem(&ei->truncate_lock);
>  	return &ei->vfs_inode;
>  }
>  
> Index: linux-2.6/fs/fat/fat.h
> ===================================================================
> --- linux-2.6.orig/fs/fat/fat.h	2011-06-20 21:28:19.724630522 +0200
> +++ linux-2.6/fs/fat/fat.h	2011-06-20 21:29:25.034627215 +0200
> @@ -109,6 +109,7 @@ struct msdos_inode_info {
>  	int i_attrs;		/* unused attribute bits */
>  	loff_t i_pos;		/* on-disk position of directory entry or 0 */
>  	struct hlist_node i_fat_hash;	/* hash by i_location */
> +	struct rw_semaphore truncate_lock; /* protect bmap against truncate */
>  	struct inode vfs_inode;
>  };
>  
> Index: linux-2.6/fs/fat/file.c
> ===================================================================
> --- linux-2.6.orig/fs/fat/file.c	2011-06-20 21:28:19.744630521 +0200
> +++ linux-2.6/fs/fat/file.c	2011-06-20 21:29:54.501292390 +0200
> @@ -429,8 +429,10 @@ int fat_setattr(struct dentry *dentry, s
>  	}
>  
>  	if (attr->ia_valid & ATTR_SIZE) {
> +		down_write(&MSDOS_I(inode)->truncate_lock);
>  		truncate_setsize(inode, attr->ia_size);
>  		fat_truncate_blocks(inode, attr->ia_size);
> +		up_write(&MSDOS_I(inode)->truncate_lock);
>  	}
>  
>  	setattr_copy(inode, attr);
>

-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] far: remove i_alloc_sem abuse
  2011-06-21 15:57   ` OGAWA Hirofumi
@ 2011-06-21 16:09     ` OGAWA Hirofumi
  2011-06-21 16:09     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: OGAWA Hirofumi @ 2011-06-21 16:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, mfasheh, jlbec

OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> writes:

> Christoph Hellwig <hch@infradead.org> writes:
>
>> Add a new rw_semaphore to protect bmap against truncate.  Previous
>> i_alloc_sem was abused for this, but it's going away in this series.
>
> In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
> ->i_mutex to call ->bmap. So, this was chosen instead.
>
> I'm not checking current version yet though, if shmfs had change, we can
> use ->i_mutex.

It was swapfile. And unfortunately it doesn't have change.

> BTW, this patch looks good to me.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/8] far: remove i_alloc_sem abuse
  2011-06-21 15:57   ` OGAWA Hirofumi
  2011-06-21 16:09     ` OGAWA Hirofumi
@ 2011-06-21 16:09     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-21 16:09 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, mfasheh, jlbec

On Wed, Jun 22, 2011 at 12:57:43AM +0900, OGAWA Hirofumi wrote:
> Christoph Hellwig <hch@infradead.org> writes:
> 
> > Add a new rw_semaphore to protect bmap against truncate.  Previous
> > i_alloc_sem was abused for this, but it's going away in this series.
> 
> In FAT case, ->i_mutex was better. But, last time I saw, shmfs was using
> ->i_mutex to call ->bmap. So, this was chosen instead.
> 
> I'm not checking current version yet though, if shmfs had change, we can
> use ->i_mutex.

I tried that first, but it gives an instant deadlock when using a
swapfile on fat.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
  2011-06-20 20:15   ` Christoph Hellwig
  (?)
@ 2011-06-21 16:34   ` Lukas Czerner
  2011-06-21 16:48     ` Lukas Czerner
  2011-06-21 17:16     ` Christoph Hellwig
  -1 siblings, 2 replies; 36+ messages in thread
From: Lukas Czerner @ 2011-06-21 16:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi,
	mfasheh, jlbec

On Mon, 20 Jun 2011, Christoph Hellwig wrote:

> Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
> i_alloc_sem was abused for this, but it's going away in this series.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/ext4/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
> +++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
> @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
>  			if (attr->ia_size > sbi->s_bitmap_maxbytes)
>  				return -EFBIG;
>  		}
> +		down_write(&(EXT4_I(inode)->truncate_lock));
>  	}
>  
>  	if (S_ISREG(inode->i_mode) &&
> @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
>  			ext4_truncate(inode);
>  	}
>  
> +	if (attr->ia_valid & ATTR_SIZE)
> +		up_write(&(EXT4_I(inode)->truncate_lock));
> +

Hi Christoph,

this looks like a bug fix, so we should rather do it in a separate
commit. Could you send it separately, or at least mention that in commit
description ?

Otherwise the patch looks good.

Thanks!
-Lukas

>  	if (!rc) {
>  		setattr_copy(inode, attr);
>  		mark_inode_dirty(inode);
> @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
>  	struct address_space *mapping = inode->i_mapping;
>  
>  	/*
> -	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> -	 * get i_mutex because we are already holding mmap_sem.
> +	 * Get truncate_lock to stop truncates messing with the inode. We
> +	 * cannot get i_mutex because we are already holding mmap_sem.
>  	 */
> -	down_read(&inode->i_alloc_sem);
> +	down_read(&(EXT4_I(inode)->truncate_lock));
>  	size = i_size_read(inode);
>  	if (page->mapping != mapping || size <= page_offset(page)
>  	    || !PageUptodate(page)) {
> @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
>  	lock_page(page);
>  	wait_on_page_writeback(page);
>  	if (PageMappedToDisk(page)) {
> -		up_read(&inode->i_alloc_sem);
> +		up_read(&(EXT4_I(inode)->truncate_lock));
>  		return VM_FAULT_LOCKED;
>  	}
>  
> @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
>  	if (page_has_buffers(page)) {
>  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
>  					ext4_bh_unmapped)) {
> -			up_read(&inode->i_alloc_sem);
> +			up_read(&(EXT4_I(inode)->truncate_lock));
>  			return VM_FAULT_LOCKED;
>  		}
>  	}
> @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
>  	 */
>  	lock_page(page);
>  	wait_on_page_writeback(page);
> -	up_read(&inode->i_alloc_sem);
> +	up_read(&(EXT4_I(inode)->truncate_lock));
>  	return VM_FAULT_LOCKED;
>  out_unlock:
>  	if (ret)
>  		ret = VM_FAULT_SIGBUS;
> -	up_read(&inode->i_alloc_sem);
> +	up_read(&(EXT4_I(inode)->truncate_lock));
>  	return ret;
>  }
> Index: linux-2.6/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
> +++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
> @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
>  	ei->i_datasync_tid = 0;
>  	atomic_set(&ei->i_ioend_count, 0);
>  	atomic_set(&ei->i_aiodio_unwritten, 0);
> +	init_rwsem(&ei->truncate_lock);
>  
>  	return &ei->vfs_inode;
>  }
> Index: linux-2.6/fs/ext4/ext4.h
> ===================================================================
> --- linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
> +++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
> @@ -844,6 +844,9 @@ struct ext4_inode_info {
>  	/* on-disk additional length */
>  	__u16 i_extra_isize;
>  
> +	/* protect page_mkwrite against truncates */
> +	struct rw_semaphore truncate_lock;
> +
>  #ifdef CONFIG_QUOTA
>  	/* quota space reservation, managed internally by quota code */
>  	qsize_t i_reserved_quota;
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

-- 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
  2011-06-21 16:34   ` Lukas Czerner
@ 2011-06-21 16:48     ` Lukas Czerner
  2011-06-21 17:16     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Lukas Czerner @ 2011-06-21 16:48 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi, mfasheh, jlbec

On Tue, 21 Jun 2011, Lukas Czerner wrote:

> On Mon, 20 Jun 2011, Christoph Hellwig wrote:
> 
> > Add a new rw_semaphore to protect page_mkwrite against truncate.  Previous
> > i_alloc_sem was abused for this, but it's going away in this series.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > Index: linux-2.6/fs/ext4/inode.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext4/inode.c	2011-06-20 14:25:33.779248128 +0200
> > +++ linux-2.6/fs/ext4/inode.c	2011-06-20 14:27:53.025907745 +0200
> > @@ -5357,6 +5357,7 @@ int ext4_setattr(struct dentry *dentry,
> >  			if (attr->ia_size > sbi->s_bitmap_maxbytes)
> >  				return -EFBIG;
> >  		}
> > +		down_write(&(EXT4_I(inode)->truncate_lock));
> >  	}
> >  
> >  	if (S_ISREG(inode->i_mode) &&
> > @@ -5405,6 +5406,9 @@ int ext4_setattr(struct dentry *dentry,
> >  			ext4_truncate(inode);
> >  	}
> >  
> > +	if (attr->ia_valid & ATTR_SIZE)
> > +		up_write(&(EXT4_I(inode)->truncate_lock));
> > +
> 
> Hi Christoph,
> 
> this looks like a bug fix, so we should rather do it in a separate
> commit. Could you send it separately, or at least mention that in commit
> description ?

That's stupid note, it is not a bugfix of course. Ignore it, sorry for
the noise.

> 
> Otherwise the patch looks good.
> 
> Thanks!
> -Lukas
> 
> >  	if (!rc) {
> >  		setattr_copy(inode, attr);
> >  		mark_inode_dirty(inode);
> > @@ -5850,10 +5854,10 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	struct address_space *mapping = inode->i_mapping;
> >  
> >  	/*
> > -	 * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> > -	 * get i_mutex because we are already holding mmap_sem.
> > +	 * Get truncate_lock to stop truncates messing with the inode. We
> > +	 * cannot get i_mutex because we are already holding mmap_sem.
> >  	 */
> > -	down_read(&inode->i_alloc_sem);
> > +	down_read(&(EXT4_I(inode)->truncate_lock));
> >  	size = i_size_read(inode);
> >  	if (page->mapping != mapping || size <= page_offset(page)
> >  	    || !PageUptodate(page)) {
> > @@ -5865,7 +5869,7 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	lock_page(page);
> >  	wait_on_page_writeback(page);
> >  	if (PageMappedToDisk(page)) {
> > -		up_read(&inode->i_alloc_sem);
> > +		up_read(&(EXT4_I(inode)->truncate_lock));
> >  		return VM_FAULT_LOCKED;
> >  	}
> >  
> > @@ -5883,7 +5887,7 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	if (page_has_buffers(page)) {
> >  		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> >  					ext4_bh_unmapped)) {
> > -			up_read(&inode->i_alloc_sem);
> > +			up_read(&(EXT4_I(inode)->truncate_lock));
> >  			return VM_FAULT_LOCKED;
> >  		}
> >  	}
> > @@ -5912,11 +5916,11 @@ int ext4_page_mkwrite(struct vm_area_str
> >  	 */
> >  	lock_page(page);
> >  	wait_on_page_writeback(page);
> > -	up_read(&inode->i_alloc_sem);
> > +	up_read(&(EXT4_I(inode)->truncate_lock));
> >  	return VM_FAULT_LOCKED;
> >  out_unlock:
> >  	if (ret)
> >  		ret = VM_FAULT_SIGBUS;
> > -	up_read(&inode->i_alloc_sem);
> > +	up_read(&(EXT4_I(inode)->truncate_lock));
> >  	return ret;
> >  }
> > Index: linux-2.6/fs/ext4/super.c
> > ===================================================================
> > --- linux-2.6.orig/fs/ext4/super.c	2011-06-20 14:25:33.795914793 +0200
> > +++ linux-2.6/fs/ext4/super.c	2011-06-20 14:27:06.269243443 +0200
> > @@ -871,6 +871,7 @@ static struct inode *ext4_alloc_inode(st
> >  	ei->i_datasync_tid = 0;
> >  	atomic_set(&ei->i_ioend_count, 0);
> >  	atomic_set(&ei->i_aiodio_unwritten, 0);
> > +	init_rwsem(&ei->truncate_lock);
> >  
> >  	return &ei->vfs_inode;
> >  }
> > Index: linux-2.6/fs/ext4/ext4.h
> > ===================================================================
> > --- linux-2.6.orig/fs/ext4/ext4.h	2011-06-20 14:25:33.752581461 +0200
> > +++ linux-2.6/fs/ext4/ext4.h	2011-06-20 14:27:06.272576777 +0200
> > @@ -844,6 +844,9 @@ struct ext4_inode_info {
> >  	/* on-disk additional length */
> >  	__u16 i_extra_isize;
> >  
> > +	/* protect page_mkwrite against truncates */
> > +	struct rw_semaphore truncate_lock;
> > +
> >  #ifdef CONFIG_QUOTA
> >  	/* quota space reservation, managed internally by quota code */
> >  	qsize_t i_reserved_quota;
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> 

-- 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/8] ext4: remove i_alloc_sem abuse
  2011-06-21 16:34   ` Lukas Czerner
  2011-06-21 16:48     ` Lukas Czerner
@ 2011-06-21 17:16     ` Christoph Hellwig
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-21 17:16 UTC (permalink / raw)
  To: Lukas Czerner
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi, mfasheh, jlbec

On Tue, Jun 21, 2011 at 06:34:50PM +0200, Lukas Czerner wrote:
> this looks like a bug fix, so we should rather do it in a separate
> commit. Could you send it separately, or at least mention that in commit
> description ?

What looks like a bugfix?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
                   ` (8 preceding siblings ...)
  2011-06-20 20:32 ` [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
@ 2011-06-21 23:54 ` Jan Kara
  2011-06-22  9:39   ` Christoph Hellwig
  2011-06-22 14:22   ` Ted Ts'o
  9 siblings, 2 replies; 36+ messages in thread
From: Jan Kara @ 2011-06-21 23:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi,
	mfasheh, jlbec, tytso

On Mon 20-06-11 16:15:33, Christoph Hellwig wrote:
> i_alloc_sem has always been a bit of an odd "lock".  It's the only remaining
> rw_semaphore that can be released by a different thread than the one that
> locked it, and it's use case in the core direct I/O code is more like a
> counter given that the writers already have external serialization.
> 
> This series removes it in favour of a simpler counter scheme, thus getting
> rid of the rw_semaphore non-owner APIs as requests by Thomas, while at the
> same time shrinking the size of struct inode by 160 bytes on 64-bit systems.
> 
> The only nasty bit is that two filesystems (fat and ext4) have started
> abusing the lock for their own purposes.  I've added a new rw_semaphore
  ext4 abuse should be gone when Ted merges my rewrite of
ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
it?

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-21 23:54 ` Jan Kara
@ 2011-06-22  9:39   ` Christoph Hellwig
  2011-06-22 14:22   ` Ted Ts'o
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-22  9:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi, mfasheh, jlbec, tytso

On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
>   ext4 abuse should be gone when Ted merges my rewrite of
> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> it?

So how should we coordinate merging the two? 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-21 23:54 ` Jan Kara
  2011-06-22  9:39   ` Christoph Hellwig
@ 2011-06-22 14:22   ` Ted Ts'o
  2011-06-22 18:13     ` Jan Kara
  1 sibling, 1 reply; 36+ messages in thread
From: Ted Ts'o @ 2011-06-22 14:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, viro, tglx, linux-fsdevel, linux-ext4,
	linux-btrfs, hirofumi, mfasheh, jlbec

On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> it?

So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
will do not dropping the page's refcount until after the workqueue
finishes its job), does your patch need changing, or is it a matter of
fixing the commit description?

In any case, this dragged got out, and it's late in -rc cycle for 3.0,
so I was planning on queuing this for the next merge window.  (Sorry
for that, this was mostly due to my not having enough time to really
dive into the issues involved.)

							- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-22 14:22   ` Ted Ts'o
@ 2011-06-22 18:13     ` Jan Kara
  2011-06-23 10:36       ` Christoph Hellwig
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2011-06-22 18:13 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Jan Kara, Christoph Hellwig, viro, tglx, linux-fsdevel,
	linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

On Wed 22-06-11 10:22:35, Ted Tso wrote:
> On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
> > ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> > it?
> 
> So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
> will do not dropping the page's refcount until after the workqueue
> finishes its job), does your patch need changing, or is it a matter of
> fixing the commit description?
  Looking into patchwork (http://patchwork.ozlabs.org/patch/97924/) I see
the discussion of two issues (page refcounting and page_mkwrite got mixed).
I see the patch just needs fixing commit description because it's no longer
accurate after "stable page under writeback" changes went it. I'll send you
a patch with updated changelog in a minute.

> In any case, this dragged got out, and it's late in -rc cycle for 3.0,
> so I was planning on queuing this for the next merge window.  (Sorry
> for that, this was mostly due to my not having enough time to really
> dive into the issues involved.)
  No problem. Just we have to somehow coordinate with Christoph... Either
he can avoid touching ext4 and merge his patch set after you merge my patch
or he can take my patch instead of his ext4 change. Since my patch touches
only ext4_page_mkwrite() there's not a high risk of collision. The latter
would seem easier for both of you to me but I don't really care.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] remove i_alloc_sem
  2011-06-22 18:13     ` Jan Kara
@ 2011-06-23 10:36       ` Christoph Hellwig
  0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2011-06-23 10:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Ts'o, Christoph Hellwig, viro, tglx, linux-fsdevel,
	linux-ext4, linux-btrfs, hirofumi, mfasheh, jlbec

On Wed, Jun 22, 2011 at 08:13:42PM +0200, Jan Kara wrote:
>   No problem. Just we have to somehow coordinate with Christoph... Either
> he can avoid touching ext4 and merge his patch set after you merge my patch
> or he can take my patch instead of his ext4 change. Since my patch touches
> only ext4_page_mkwrite() there's not a high risk of collision. The latter
> would seem easier for both of you to me but I don't really care.

For now I'll just take your repost into the next version of my series.
If you guys have any better idea for sorting out the merge issues I'm
happy to reshuffle it again.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] fs: kill i_alloc_sem
  2011-06-20 22:18     ` Christoph Hellwig
@ 2011-07-01  2:58       ` Joel Becker
  0 siblings, 0 replies; 36+ messages in thread
From: Joel Becker @ 2011-07-01  2:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: viro, tglx, linux-fsdevel, linux-ext4, linux-btrfs, hirofumi, mfasheh

On Mon, Jun 20, 2011 at 06:18:57PM -0400, Christoph Hellwig wrote:
> On Mon, Jun 20, 2011 at 02:32:03PM -0700, Joel Becker wrote:
> > 	Are we guaranteed that all allocation changes are locked out by
> > i_dio_count>0?  I don't think we are.  The ocfs2 code very strongly
> > assumes the state of a file's allocation when it holds i_alloc_sem.  I
> > feel like we lose that here. 
> 
> You aren't, neither with the old i_alloc_sem code, nor with the 1:1
> replacement using i_dio_count.
> 
> Do a quick grep who gets i_alloc_sem exclusively (down_write): it's
> really just the truncate code, and it's cut & paste duplicates in ntfs
> and reiserfs.

	Sorry, I confused this with our ip_alloc_sem.  I was tired.

Joel

-- 

Life's Little Instruction Book #24

	"Drink champagne for no reason at all."

			http://www.jlbec.org/
			jlbec@evilplan.org

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2011-07-01  2:58 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-20 20:15 [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
2011-06-20 20:15 ` [PATCH 1/8] far: remove i_alloc_sem abuse Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-21 15:57   ` OGAWA Hirofumi
2011-06-21 16:09     ` OGAWA Hirofumi
2011-06-21 16:09     ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 2/8] ext4: " Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-21 16:34   ` Lukas Czerner
2011-06-21 16:48     ` Lukas Czerner
2011-06-21 17:16     ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 3/8] fs: simpler handling of zero sized reads in __blockdev_direct_IO Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 4/8] fs: kill i_alloc_sem Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 21:32   ` Joel Becker
2011-06-20 22:18     ` Christoph Hellwig
2011-07-01  2:58       ` Joel Becker
2011-06-21  5:40   ` Dave Chinner
2011-06-21  9:35     ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 5/8] fs: move inode_dio_wait calls into ->setattr Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 6/8] fs: always maintain i_dio_count Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 21:29   ` Joel Becker
2011-06-20 22:23     ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 7/8] btrfs: wait for direct I/O requests in truncate Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 20:15 ` [PATCH 8/8] rw_semaphore: remove up/down_read_non_owner Christoph Hellwig
2011-06-20 20:15   ` Christoph Hellwig
2011-06-20 20:32 ` [PATCH 0/8] remove i_alloc_sem Christoph Hellwig
2011-06-21 23:54 ` Jan Kara
2011-06-22  9:39   ` Christoph Hellwig
2011-06-22 14:22   ` Ted Ts'o
2011-06-22 18:13     ` Jan Kara
2011-06-23 10:36       ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.