linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
@ 2012-02-26 18:19 Aneesh Kumar K.V
  2012-02-27 23:11 ` Andrew Morton
  0 siblings, 1 reply; 7+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-26 18:19 UTC (permalink / raw)
  To: linux-mm, mgorman, kamezawa.hiroyu, dhillf, akpm, viro, hughd
  Cc: linux-kernel, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Drop using inode->i_mutex from read, since that can result in deadlock with
mmap. Ideally we can extend the patch to make sure we don't increase i_size
in mmap. But that will break userspace, because application will have to now
use truncate(2) to increase i_size in hugetlbfs.

AFAIU i_mutex was added in hugetlbfs_read as per
http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/hugetlbfs/inode.c    |   13 +++++++++----
 include/linux/hugetlb.h |    1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1e85a7a..3d541dd 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -237,8 +237,9 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
 	unsigned long end_index;
 	loff_t isize;
 	ssize_t retval = 0;
+	struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode);
 
-	mutex_lock(&inode->i_mutex);
+	down_read(&hinfo->truncate_sem);
 
 	/* validate length */
 	if (len == 0)
@@ -308,7 +309,7 @@ static ssize_t hugetlbfs_read(struct file *filp, char __user *buf,
 	}
 out:
 	*ppos = ((loff_t)index << huge_page_shift(h)) + offset;
-	mutex_unlock(&inode->i_mutex);
+	up_read(&hinfo->truncate_sem);
 	return retval;
 }
 
@@ -407,16 +408,19 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	pgoff_t pgoff;
 	struct address_space *mapping = inode->i_mapping;
 	struct hstate *h = hstate_inode(inode);
+	struct hugetlbfs_inode_info *hinfo = HUGETLBFS_I(inode);
 
 	BUG_ON(offset & ~huge_page_mask(h));
 	pgoff = offset >> PAGE_SHIFT;
 
+	down_write(&hinfo->truncate_sem);
 	i_size_write(inode, offset);
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
 	mutex_unlock(&mapping->i_mmap_mutex);
 	truncate_hugepages(inode, offset);
+	up_write(&hinfo->truncate_sem);
 	return 0;
 }
 
@@ -694,9 +698,10 @@ static const struct address_space_operations hugetlbfs_aops = {
 
 static void init_once(void *foo)
 {
-	struct hugetlbfs_inode_info *ei = (struct hugetlbfs_inode_info *)foo;
+	struct hugetlbfs_inode_info *hinfo = (struct hugetlbfs_inode_info *)foo;
 
-	inode_init_once(&ei->vfs_inode);
+	init_rwsem(&hinfo->truncate_sem);
+	inode_init_once(&hinfo->vfs_inode);
 }
 
 const struct file_operations hugetlbfs_file_operations = {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8aef867..6d8469a 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -150,6 +150,7 @@ struct hugetlbfs_sb_info {
 
 struct hugetlbfs_inode_info {
 	struct shared_policy policy;
+	struct rw_semaphore truncate_sem;
 	struct inode vfs_inode;
 };
 
-- 
1.7.9


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-26 18:19 [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race Aneesh Kumar K.V
@ 2012-02-27 23:11 ` Andrew Morton
  2012-02-28  0:02   ` Al Viro
  2012-02-28 10:15   ` Aneesh Kumar K.V
  0 siblings, 2 replies; 7+ messages in thread
From: Andrew Morton @ 2012-02-27 23:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, viro, hughd, linux-kernel

On Sun, 26 Feb 2012 23:49:58 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:

> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> 
> Drop using inode->i_mutex from read, since that can result in deadlock with
> mmap. Ideally we can extend the patch to make sure we don't increase i_size
> in mmap. But that will break userspace, because application will have to now
> use truncate(2) to increase i_size in hugetlbfs.
> 
> AFAIU i_mutex was added in hugetlbfs_read as per
> http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html

This patch comes somewhat out of the blue and I'm unsure what's going on.

You say there's some (potential?) deadlock with mmap, but it is
undescribed.  Have people observed this deadlock?  Has it caused
lockdep warnings?  Please update the changelog to fully describe the
bug.

Also, the new truncate_sem is undoumented.  This leaves readers to work
out for themselves what it might be for.  Please let's add code
comments which completely describe the race, and how this lock prevents
it.

We should also document our locking rules.  When should code take this
lock?  What are its ranking rules with respect to i_mutex, i_mmap_mutex
and possibly others?



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-27 23:11 ` Andrew Morton
@ 2012-02-28  0:02   ` Al Viro
  2012-02-28 10:15   ` Aneesh Kumar K.V
  1 sibling, 0 replies; 7+ messages in thread
From: Al Viro @ 2012-02-28  0:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Aneesh Kumar K.V, linux-mm, mgorman, kamezawa.hiroyu, dhillf,
	hughd, linux-kernel

On Mon, Feb 27, 2012 at 03:11:35PM -0800, Andrew Morton wrote:

> This patch comes somewhat out of the blue and I'm unsure what's going on.
> 
> You say there's some (potential?) deadlock with mmap, but it is
> undescribed.  Have people observed this deadlock?  Has it caused
> lockdep warnings?  Please update the changelog to fully describe the
> bug.

There's one simple rule: never, ever take ->i_mutex under ->mmap_sem.
E.g.  in any ->mmap() (obvious - mmap(2) calls that under ->mmap_sem) or
any ->release() of mappable file (munmap(2) does fput() under ->mmap_sem
and that will call ->release() if no other references are still around).

Hugetlbfs is slightly unusual since it takes ->i_mutex in read() - usually
that's done in write(), while read() doesn't bother with that.  In either
case you do copying to/from userland buffer while holding ->i_mutex, which
nests ->mmap_sem within it.

> Also, the new truncate_sem is undoumented.  This leaves readers to work
> out for themselves what it might be for.  Please let's add code
> comments which completely describe the race, and how this lock prevents
> it.
> 
> We should also document our locking rules.

Hell, yes.  I've spent the last couple of weeks crawling through VM-related
code and locking in there is _scary_.  "Convoluted" doesn't even begin to
cover it, especially when it gets to "what locks are required when accessing
this field" ;-/  Got quite a catch out of that trawl by now...

>  When should code take this
> lock?  What are its ranking rules with respect to i_mutex, i_mmap_mutex
> and possibly others?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-27 23:11 ` Andrew Morton
  2012-02-28  0:02   ` Al Viro
@ 2012-02-28 10:15   ` Aneesh Kumar K.V
  2012-02-28 12:17     ` Hillf Danton
  1 sibling, 1 reply; 7+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-28 10:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, mgorman, kamezawa.hiroyu, dhillf, viro, hughd, linux-kernel

On Mon, 27 Feb 2012 15:11:35 -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Sun, 26 Feb 2012 23:49:58 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> wrote:
> 
> > From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
> > 
> > Drop using inode->i_mutex from read, since that can result in deadlock with
> > mmap. Ideally we can extend the patch to make sure we don't increase i_size
> > in mmap. But that will break userspace, because application will have to now
> > use truncate(2) to increase i_size in hugetlbfs.
> > 
> > AFAIU i_mutex was added in hugetlbfs_read as per
> > http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/3066.html
> 
> This patch comes somewhat out of the blue and I'm unsure what's going on.
> 
> You say there's some (potential?) deadlock with mmap, but it is
> undescribed.  Have people observed this deadlock?  Has it caused
> lockdep warnings?  Please update the changelog to fully describe the
> bug.

Viro explained the deadlock in detail here:

http://mid.gmane.org/20120217002726.GL23916@ZenIV.linux.org.uk

I will also update the commit message with this information.

> 
> Also, the new truncate_sem is undoumented.  This leaves readers to work
> out for themselves what it might be for.  Please let's add code
> comments which completely describe the race, and how this lock prevents
> it.
> 
> We should also document our locking rules.  When should code take this
> lock?  What are its ranking rules with respect to i_mutex, i_mmap_mutex
> and possibly others?
> 

Will update the patch with these details

Thanks
-aneesh


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-28 10:15   ` Aneesh Kumar K.V
@ 2012-02-28 12:17     ` Hillf Danton
  2012-02-29 11:04       ` Aneesh Kumar K.V
  0 siblings, 1 reply; 7+ messages in thread
From: Hillf Danton @ 2012-02-28 12:17 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, viro, hughd,
	linux-kernel

On Tue, Feb 28, 2012 at 6:15 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
>
> Will update the patch with these details
>

A scratch is cooked, based on the -next tree, for accelerating your redelivery,
if you like it, in which i_mutex is eliminated directly and page lock is used.

-hd


--- a/fs/hugetlbfs/inode.c	Tue Feb 28 19:43:32 2012
+++ b/fs/hugetlbfs/inode.c	Tue Feb 28 19:56:50 2012
@@ -245,17 +245,10 @@ static ssize_t hugetlbfs_read(struct fil
 	loff_t isize;
 	ssize_t retval = 0;

-	mutex_lock(&inode->i_mutex);
-
 	/* validate length */
 	if (len == 0)
 		goto out;

-	isize = i_size_read(inode);
-	if (!isize)
-		goto out;
-
-	end_index = (isize - 1) >> huge_page_shift(h);
 	for (;;) {
 		struct page *page;
 		unsigned long nr, ret;
@@ -263,6 +256,8 @@ static ssize_t hugetlbfs_read(struct fil

 		/* nr is the maximum number of bytes to copy from this page */
 		nr = huge_page_size(h);
+		isize = i_size_read(inode);
+		end_index = isize >> huge_page_shift(h);
 		if (index >= end_index) {
 			if (index > end_index)
 				goto out;
@@ -274,7 +269,7 @@ static ssize_t hugetlbfs_read(struct fil
 		nr = nr - offset;

 		/* Find the page */
-		page = find_get_page(mapping, index);
+		page = find_lock_page(mapping, index);
 		if (unlikely(page == NULL)) {
 			/*
 			 * We have a HOLE, zero out the user-buffer for the
@@ -286,17 +281,30 @@ static ssize_t hugetlbfs_read(struct fil
 			else
 				ra = 0;
 		} else {
+			unlock_page(page);
+
+			/* Without i_mutex held, check isize again */
+			nr = huge_page_size(h);
+			isize = i_size_read(inode);
+			end_index = isize >> huge_page_shift(h);
+			if (index == end_index) {
+				nr = isize & ~huge_page_mask(h);
+				if (nr <= offset) {
+					page_cache_release(page);
+					goto out;
+				}
+			}
+			nr -= offset;
 			/*
 			 * We have the page, copy it to user space buffer.
 			 */
 			ra = hugetlbfs_read_actor(page, offset, buf, len, nr);
 			ret = ra;
+			page_cache_release(page);
 		}
 		if (ra < 0) {
 			if (retval == 0)
 				retval = ra;
-			if (page)
-				page_cache_release(page);
 			goto out;
 		}

@@ -306,16 +314,12 @@ static ssize_t hugetlbfs_read(struct fil
 		index += offset >> huge_page_shift(h);
 		offset &= ~huge_page_mask(h);

-		if (page)
-			page_cache_release(page);
-
 		/* short read or no more work */
 		if ((ret != nr) || (len == 0))
 			break;
 	}
 out:
 	*ppos = ((loff_t)index << huge_page_shift(h)) + offset;
-	mutex_unlock(&inode->i_mutex);
 	return retval;
 }

--

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-28 12:17     ` Hillf Danton
@ 2012-02-29 11:04       ` Aneesh Kumar K.V
  2012-02-29 14:42         ` Hillf Danton
  0 siblings, 1 reply; 7+ messages in thread
From: Aneesh Kumar K.V @ 2012-02-29 11:04 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, viro, hughd,
	linux-kernel

On Tue, 28 Feb 2012 20:17:28 +0800, Hillf Danton <dhillf@gmail.com> wrote:
> On Tue, Feb 28, 2012 at 6:15 PM, Aneesh Kumar K.V
> <aneesh.kumar@linux.vnet.ibm.com> wrote:
> >
> > Will update the patch with these details
> >
> 
> A scratch is cooked, based on the -next tree, for accelerating your redelivery,
> if you like it, in which i_mutex is eliminated directly and page lock is used.
> 
> -hd


This looks much better than what I had. 

> 
> 
> --- a/fs/hugetlbfs/inode.c	Tue Feb 28 19:43:32 2012
> +++ b/fs/hugetlbfs/inode.c	Tue Feb 28 19:56:50 2012
> @@ -245,17 +245,10 @@ static ssize_t hugetlbfs_read(struct fil
>  	loff_t isize;
>  	ssize_t retval = 0;
> 
> -	mutex_lock(&inode->i_mutex);
> -
>  	/* validate length */
>  	if (len == 0)
>  		goto out;
> 
> -	isize = i_size_read(inode);
> -	if (!isize)
> -		goto out;
> -
> -	end_index = (isize - 1) >> huge_page_shift(h);
>  	for (;;) {
>  		struct page *page;
>  		unsigned long nr, ret;
> @@ -263,6 +256,8 @@ static ssize_t hugetlbfs_read(struct fil
> 
>  		/* nr is the maximum number of bytes to copy from this page */
>  		nr = huge_page_size(h);
> +		isize = i_size_read(inode);
> +		end_index = isize >> huge_page_shift(h);


Should that be (isize - 1) >> huget_page_shift(h) ?


>  		if (index >= end_index) {
>  			if (index > end_index)
>  				goto out;
> @@ -274,7 +269,7 @@ static ssize_t hugetlbfs_read(struct fil
>  		nr = nr - offset;
> 
>  		/* Find the page */
> -		page = find_get_page(mapping, index);
> +		page = find_lock_page(mapping, index);
>  		if (unlikely(page == NULL)) {
>  			/*
>  			 * We have a HOLE, zero out the user-buffer for the
> @@ -286,17 +281,30 @@ static ssize_t hugetlbfs_read(struct fil
>  			else
>  				ra = 0;
>  		} else {
> +			unlock_page(page);
> +
> +			/* Without i_mutex held, check isize again */
> +			nr = huge_page_size(h);
> +			isize = i_size_read(inode);
> +			end_index = isize >> huge_page_shift(h);

same here (isize - 1) ?

> +			if (index == end_index) {
> +				nr = isize & ~huge_page_mask(h);


Is that correct ?  We calculate nr differently in the earlier part of the function

> +				if (nr <= offset) {
> +					page_cache_release(page);
> +					goto out;
> +				}
> +			}
> +			nr -= offset;
>  			/*
>  			 * We have the page, copy it to user space buffer.
>  			 */
>  			ra = hugetlbfs_read_actor(page, offset, buf, len, nr);
>  			ret = ra;
> +			page_cache_release(page);
>  		}
>  		if (ra < 0) {
>  			if (retval == 0)
>  				retval = ra;
> -			if (page)
> -				page_cache_release(page);
>  			goto out;
>  		}
> 
> @@ -306,16 +314,12 @@ static ssize_t hugetlbfs_read(struct fil
>  		index += offset >> huge_page_shift(h);
>  		offset &= ~huge_page_mask(h);
> 
> -		if (page)
> -			page_cache_release(page);
> -
>  		/* short read or no more work */
>  		if ((ret != nr) || (len == 0))
>  			break;
>  	}
>  out:
>  	*ppos = ((loff_t)index << huge_page_shift(h)) + offset;
> -	mutex_unlock(&inode->i_mutex);
>  	return retval;
>  }
> 

I guess we need to closely look at the patch with respect to end-of-file
condition. I will also try to get some testing with the patch.

-aneesh


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race
  2012-02-29 11:04       ` Aneesh Kumar K.V
@ 2012-02-29 14:42         ` Hillf Danton
  0 siblings, 0 replies; 7+ messages in thread
From: Hillf Danton @ 2012-02-29 14:42 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Andrew Morton, linux-mm, mgorman, kamezawa.hiroyu, viro, hughd,
	linux-kernel

On Wed, Feb 29, 2012 at 7:04 PM, Aneesh Kumar K.V
<aneesh.kumar@linux.vnet.ibm.com> wrote:
> I guess we need to closely look at the patch with respect to end-of-file

Feel free to correct Hillf:)

> condition. I will also try to get some testing with the patch.
>
Look forward to seeing test results.

Best regards
-hd

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-02-29 14:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-26 18:19 [PATCH] hugetlbfs: Add new rw_semaphore to fix truncate/read race Aneesh Kumar K.V
2012-02-27 23:11 ` Andrew Morton
2012-02-28  0:02   ` Al Viro
2012-02-28 10:15   ` Aneesh Kumar K.V
2012-02-28 12:17     ` Hillf Danton
2012-02-29 11:04       ` Aneesh Kumar K.V
2012-02-29 14:42         ` Hillf Danton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).