* [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
From: Hugh Dickins @ 2014-07-15 10:28 UTC
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

Hi Andrew,

Here's my latest and hopefully last stab at fixing the trinity
hole-punch starvation issue that became known as CVE-2014-4171.

You may prefer to hear a testing update from Sasha and Vlastimil before
paying any attention to these, or you may prefer to add them into mmotm
for wider testing now: whichever you think appropriate.

Please throw away mmotm's
revert-shmem-fix-faulting-into-a-hole-while-its-punched.patch
and replace it by 1/2, which fixes that commit instead of reverting it.

Please throw away mmotm's
shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
and replace it by 2/2, which reworks the commit message and adds a fix.

Please keep the 3/3 I sent last time in mmotm
mm-fs-fix-pessimization-in-hole-punching-pagecache.patch
which remains valid.

In the end I decided that we had better look at it as two problems,
the trinity faulting starvation, and the indefinite punching loop,
so 1/2 and 2/2 present both solutions: belt and braces.

Which may be the best for fixing, but the worst for ease of backporting.
Vlastimil, I have prepared (and lightly tested) a 3.2.61-based version
of the combination of f00cdc6df7d7 and 1/2 and 2/2 (basically, I moved
vmtruncate_range from mm/truncate.c to mm/shmem.c, since nothing but
shmem ever implemented the truncate_range method).  It should give a
good hint for backports earlier and later: I'll send it privately to
you now, but keep in mind that it may need to be revised if today's
patches for 3.16 get revised again (I'll send it to Ben Hutchings
only when that's settled).

Thanks,
Hugh


* [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
From: Hugh Dickins @ 2014-07-15 10:31 UTC
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
the fault path is a no-no (write syscall may already hold i_mutex while
faulting user buffer).
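
Schematically, the deadlock lockdep warned about (assuming the write
syscall's user buffer is mmap'd from the same shmem file) would be:

	sys_write()
	  mutex_lock(&inode->i_mutex)		/* write path takes i_mutex */
	  copy_from_user()			/* faults the mmap'd buffer */
	    shmem_fault()
	      mutex_lock(&inode->i_mutex)	/* recursion on i_mutex */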

We tried a completely different approach (see following patch) but that
proved inadequate: good enough for a rational workload, but not good
enough against trinity - which forks off so many mappings of the object
that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
into serious starvation when concurrent faults force the puncher to fall
back to single-page unmap_mapping_range() searches of the i_mmap tree.

So return to the original umbrella approach, but keep away from i_mutex
this time.  We really don't want to bloat every shmem inode with a new
mutex or completion, just to protect this unlikely case from trinity.
So extend the original with wait_queue_head on stack at the hole-punch
end, and wait_queue item on the stack at the fault end.
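
In outline, condensed from the diff below (the i_private checking and
the vmf->flags handling omitted):

	/* hole-punch end, in shmem_fallocate() */
	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);

	shmem_falloc.waitq = &shmem_falloc_waitq;
	spin_lock(&inode->i_lock);
	inode->i_private = &shmem_falloc;
	spin_unlock(&inode->i_lock);

	shmem_truncate_range(inode, offset, offset + len - 1);

	spin_lock(&inode->i_lock);
	inode->i_private = NULL;
	wake_up_all(&shmem_falloc_waitq);	/* waiters leave before waitq goes out of scope */
	spin_unlock(&inode->i_lock);

	/* fault end, in shmem_fault(), when vmf->pgoff falls in the hole */
	wait_queue_head_t *shmem_falloc_waitq;
	DEFINE_WAIT(shmem_fault_wait);

	spin_lock(&inode->i_lock);
	shmem_falloc_waitq = shmem_falloc->waitq;
	prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait, TASK_KILLABLE);
	spin_unlock(&inode->i_lock);
	schedule();
	/* waitq may be gone by now: finish_wait() does not dereference it
	 * once removed from the list, but i_lock is needed lest we race
	 * with wake_up_all() */
	spin_lock(&inode->i_lock);
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
	spin_unlock(&inode->i_lock);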

This involves further use of i_lock to guard against the races: lockdep
has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
i_lock around wake_up_bit(), which is comparable to what we do here.
i_lock is more convenient, but we could switch to shmem's info->lock.

This issue has been tagged with CVE-2014-4171, which will require
f00cdc6df7d7 and this and the following patch to be backported: we
suggest to 3.1+, though in fact the trinity forkbomb effect might go
back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
Anyone running trinity on 3.0 and earlier?  I don't think we need care.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>	[3.1+]
---
Please replace mmotm's
revert-shmem-fix-faulting-into-a-hole-while-its-punched.patch
by this patch.

 mm/shmem.c |   78 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 26 deletions(-)

--- 3.16-rc5/mm/shmem.c	2014-07-06 13:25:19.688009119 -0700
+++ 3.16-rc5+/mm/shmem.c	2014-07-14 20:34:28.196153828 -0700
@@ -85,7 +85,7 @@ static struct vfsmount *shm_mnt;
  * a time): we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
-	int	mode;		/* FALLOC_FL mode currently operating */
+	wait_queue_head_t *waitq; /* faults into hole wait for punch to end */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
-			    !shmem_falloc->mode &&
+			    !shmem_falloc->waitq &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1248,38 +1248,58 @@ static int shmem_fault(struct vm_area_st
 	 * Trinity finds that probing a hole which tmpfs is punching can
 	 * prevent the hole-punch from ever completing: which in turn
 	 * locks writers out with its hold on i_mutex.  So refrain from
-	 * faulting pages into the hole while it's being punched, and
-	 * wait on i_mutex to be released if vmf->flags permits.
+	 * faulting pages into the hole while it's being punched.  Although
+	 * shmem_undo_range() does remove the additions, it may be unable to
+	 * keep up, as each new page needs its own unmap_mapping_range() call,
+	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
+	 *
+	 * It does not matter if we sometimes reach this check just before the
+	 * hole-punch begins, so that one fault then races with the punch:
+	 * we just need to make racing faults a rare case.
+	 *
+	 * The implementation below would be much simpler if we just used a
+	 * standard mutex or completion: but we cannot take i_mutex in fault,
+	 * and bloating every shmem inode for this unlikely case would be sad.
 	 */
 	if (unlikely(inode->i_private)) {
 		struct shmem_falloc *shmem_falloc;
 
 		spin_lock(&inode->i_lock);
 		shmem_falloc = inode->i_private;
-		if (!shmem_falloc ||
-		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
-		    vmf->pgoff < shmem_falloc->start ||
-		    vmf->pgoff >= shmem_falloc->next)
-			shmem_falloc = NULL;
-		spin_unlock(&inode->i_lock);
-		/*
-		 * i_lock has protected us from taking shmem_falloc seriously
-		 * once return from shmem_fallocate() went back up that stack.
-		 * i_lock does not serialize with i_mutex at all, but it does
-		 * not matter if sometimes we wait unnecessarily, or sometimes
-		 * miss out on waiting: we just need to make those cases rare.
-		 */
-		if (shmem_falloc) {
+		if (shmem_falloc &&
+		    shmem_falloc->waitq &&
+		    vmf->pgoff >= shmem_falloc->start &&
+		    vmf->pgoff < shmem_falloc->next) {
+			wait_queue_head_t *shmem_falloc_waitq;
+			DEFINE_WAIT(shmem_fault_wait);
+
+			ret = VM_FAULT_NOPAGE;
 			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
 			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
+				/* It's polite to up mmap_sem if we can */
 				up_read(&vma->vm_mm->mmap_sem);
-				mutex_lock(&inode->i_mutex);
-				mutex_unlock(&inode->i_mutex);
-				return VM_FAULT_RETRY;
+				ret = VM_FAULT_RETRY;
 			}
-			/* cond_resched? Leave that to GUP or return to user */
-			return VM_FAULT_NOPAGE;
+
+			shmem_falloc_waitq = shmem_falloc->waitq;
+			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
+					TASK_KILLABLE);
+			spin_unlock(&inode->i_lock);
+			schedule();
+
+			/*
+			 * shmem_falloc_waitq points into the shmem_fallocate()
+			 * stack of the hole-punching task: shmem_falloc_waitq
+			 * is usually invalid by the time we reach here, but
+			 * finish_wait() does not dereference it in that case;
+			 * though i_lock needed lest racing with wake_up_all().
+			 */
+			spin_lock(&inode->i_lock);
+			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
+			spin_unlock(&inode->i_lock);
+			return ret;
 		}
+		spin_unlock(&inode->i_lock);
 	}
 
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
@@ -1774,13 +1794,13 @@ static long shmem_fallocate(struct file
 
 	mutex_lock(&inode->i_mutex);
 
-	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
+		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
 
+		shmem_falloc.waitq = &shmem_falloc_waitq;
 		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
 		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
 		spin_lock(&inode->i_lock);
@@ -1792,8 +1812,13 @@ static long shmem_fallocate(struct file
 					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
+
+		spin_lock(&inode->i_lock);
+		inode->i_private = NULL;
+		wake_up_all(&shmem_falloc_waitq);
+		spin_unlock(&inode->i_lock);
 		error = 0;
-		goto undone;
+		goto out;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
@@ -1809,6 +1834,7 @@ static long shmem_fallocate(struct file
 		goto out;
 	}
 
+	shmem_falloc.waitq = NULL;
 	shmem_falloc.start = start;
 	shmem_falloc.next  = start;
 	shmem_falloc.nr_falloced = 0;


* [PATCH 2/2] shmem: fix splicing from a hole while it's punched
From: Hugh Dickins @ 2014-07-15 10:33 UTC
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted in
is one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.

But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.

shmem_file_splice_read() is the main other user of SGP_CACHE, which
can instantiate shmem pagecache pages in the read-only case (without
holding i_mutex, so perhaps concurrently with a hole-punch).  Probably
it's silly not to use SGP_READ already (using the ZERO_PAGE for holes):
which ought to be safe, but might bring surprises - not a change to be
rushed.

shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it
could be provoked to keep hole-punch busy or not.

We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use
over a range raises questions - should it actually be per page?  can
these get starved themselves?

The origin of this part of the problem is my v3.1 commit d0823576bf4b
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c.  It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.
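
Schematically (condensed from the diff below):

	/* before: pincer - repeat until a clean scan of the whole range */
	for ( ; ; ) {
		pvec.nr = find_get_entries(mapping, index, ..., indices);
		if (!pvec.nr) {
			if (index == start || unfalloc)
				break;
			index = start;	/* restart, even for hole-punch */
			continue;
		}
		...
	}

	/* after: one bounded pass; only truncation (end == -1) restarts */
	while (index < end) {
		pvec.nr = find_get_entries(mapping, index, ..., indices);
		if (!pvec.nr) {
			if (index == start || end != -1)
				break;	/* all gone, hole-punch or unfalloc */
			index = start;	/* truncating: make sure all gone */
			continue;
		}
		...
	}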

Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>	[3.1+]
---
Please replace mmotm's
shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
by this patch: which is substantially the same as that, but with
updated commit comment, and a page retry as indicated by Hannes.

 mm/shmem.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

--- 3.16-rc5+/mm/shmem.c	2014-07-14 20:34:28.196153828 -0700
+++ 3.16-rc5++/mm/shmem.c	2014-07-14 20:35:14.156154916 -0700
@@ -468,23 +468,20 @@ static void shmem_undo_range(struct inod
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -496,8 +493,12 @@ static void shmem_undo_range(struct inod
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 
@@ -506,6 +507,11 @@ static void shmem_undo_range(struct inod
 				if (page->mapping == mapping) {
 					VM_BUG_ON_PAGE(PageWriteback(page), page);
 					truncate_inode_page(mapping, page);
+				} else {
+					/* Page was replaced by swap: retry */
+					unlock_page(page);
+					index--;
+					break;
 				}
 			}
 			unlock_page(page);


* Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
From: Vlastimil Babka @ 2014-07-15 16:07 UTC
  To: Hugh Dickins, Andrew Morton
  Cc: Sasha Levin, Konstantin Khlebnikov, Johannes Weiner,
	Michel Lespinasse, Lukas Czerner, Dave Jones, linux-mm,
	linux-fsdevel, linux-kernel

On 07/15/2014 12:31 PM, Hugh Dickins wrote:
> f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
> buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
> the fault path is a no-no (write syscall may already hold i_mutex while
> faulting user buffer).
>
> We tried a completely different approach (see following patch) but that
> proved inadequate: good enough for a rational workload, but not good
> enough against trinity - which forks off so many mappings of the object
> that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
> into serious starvation when concurrent faults force the puncher to fall
> back to single-page unmap_mapping_range() searches of the i_mmap tree.
>
> So return to the original umbrella approach, but keep away from i_mutex
> this time.  We really don't want to bloat every shmem inode with a new
> mutex or completion, just to protect this unlikely case from trinity.
> So extend the original with wait_queue_head on stack at the hole-punch
> end, and wait_queue item on the stack at the fault end.

Hi, thanks a lot, I will definitely test it soon, although my reproducer
is rather limited - it already works fine with the current kernel.
Trinity will be more useful here. But there's something that caught my
eye so I thought I would raise the concern now.

> @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
>   			spin_lock(&inode->i_lock);
>   			shmem_falloc = inode->i_private;

Without ACCESS_ONCE, can shmem_falloc potentially become an alias on 
inode->i_private and later become re-read outside of the lock?

>   			if (shmem_falloc &&
> -			    !shmem_falloc->mode &&
> +			    !shmem_falloc->waitq &&
>   			    index >= shmem_falloc->start &&
>   			    index < shmem_falloc->next)
>   				shmem_falloc->nr_unswapped++;
> @@ -1248,38 +1248,58 @@ static int shmem_fault(struct vm_area_st
>   	 * Trinity finds that probing a hole which tmpfs is punching can
>   	 * prevent the hole-punch from ever completing: which in turn
>   	 * locks writers out with its hold on i_mutex.  So refrain from
> -	 * faulting pages into the hole while it's being punched, and
> -	 * wait on i_mutex to be released if vmf->flags permits.
> +	 * faulting pages into the hole while it's being punched.  Although
> +	 * shmem_undo_range() does remove the additions, it may be unable to
> +	 * keep up, as each new page needs its own unmap_mapping_range() call,
> +	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> +	 *
> +	 * It does not matter if we sometimes reach this check just before the
> +	 * hole-punch begins, so that one fault then races with the punch:
> +	 * we just need to make racing faults a rare case.
> +	 *
> +	 * The implementation below would be much simpler if we just used a
> +	 * standard mutex or completion: but we cannot take i_mutex in fault,
> +	 * and bloating every shmem inode for this unlikely case would be sad.
>   	 */
>   	if (unlikely(inode->i_private)) {
>   		struct shmem_falloc *shmem_falloc;
>
>   		spin_lock(&inode->i_lock);
>   		shmem_falloc = inode->i_private;

Same here.

> -		if (!shmem_falloc ||
> -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> -		    vmf->pgoff < shmem_falloc->start ||
> -		    vmf->pgoff >= shmem_falloc->next)
> -			shmem_falloc = NULL;
> -		spin_unlock(&inode->i_lock);
> -		/*
> -		 * i_lock has protected us from taking shmem_falloc seriously
> -		 * once return from shmem_fallocate() went back up that stack.
> -		 * i_lock does not serialize with i_mutex at all, but it does
> -		 * not matter if sometimes we wait unnecessarily, or sometimes
> -		 * miss out on waiting: we just need to make those cases rare.
> -		 */
> -		if (shmem_falloc) {
> +		if (shmem_falloc &&
> +		    shmem_falloc->waitq &&

Here it's operating outside of the lock.

> +		    vmf->pgoff >= shmem_falloc->start &&
> +		    vmf->pgoff < shmem_falloc->next) {
> +			wait_queue_head_t *shmem_falloc_waitq;
> +			DEFINE_WAIT(shmem_fault_wait);
> +
> +			ret = VM_FAULT_NOPAGE;
>   			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
>   			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +				/* It's polite to up mmap_sem if we can */
>   				up_read(&vma->vm_mm->mmap_sem);
> -				mutex_lock(&inode->i_mutex);
> -				mutex_unlock(&inode->i_mutex);
> -				return VM_FAULT_RETRY;
> +				ret = VM_FAULT_RETRY;
>   			}
> -			/* cond_resched? Leave that to GUP or return to user */
> -			return VM_FAULT_NOPAGE;
> +
> +			shmem_falloc_waitq = shmem_falloc->waitq;
> +			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +					TASK_KILLABLE);
> +			spin_unlock(&inode->i_lock);
> +			schedule();
> +
> +			/*
> +			 * shmem_falloc_waitq points into the shmem_fallocate()
> +			 * stack of the hole-punching task: shmem_falloc_waitq
> +			 * is usually invalid by the time we reach here, but
> +			 * finish_wait() does not dereference it in that case;
> +			 * though i_lock needed lest racing with wake_up_all().
> +			 */
> +			spin_lock(&inode->i_lock);
> +			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +			spin_unlock(&inode->i_lock);
> +			return ret;
>   		}
> +		spin_unlock(&inode->i_lock);
>   	}
>
>   	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);



* Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
From: Hugh Dickins @ 2014-07-15 19:26 UTC
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Sasha Levin, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, 15 Jul 2014, Vlastimil Babka wrote:
> On 07/15/2014 12:31 PM, Hugh Dickins wrote:
> > f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
> > buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
> > the fault path is a no-no (write syscall may already hold i_mutex while
> > faulting user buffer).
> > 
> > We tried a completely different approach (see following patch) but that
> > proved inadequate: good enough for a rational workload, but not good
> > enough against trinity - which forks off so many mappings of the object
> > that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
> > into serious starvation when concurrent faults force the puncher to fall
> > back to single-page unmap_mapping_range() searches of the i_mmap tree.
> > 
> > So return to the original umbrella approach, but keep away from i_mutex
> > this time.  We really don't want to bloat every shmem inode with a new
> > mutex or completion, just to protect this unlikely case from trinity.
> > So extend the original with wait_queue_head on stack at the hole-punch
> > end, and wait_queue item on the stack at the fault end.
> 
> Hi, thanks a lot, I will definitely test it soon, although my reproducer is
> rather limited - it already works fine with the current kernel. Trinity will
> be more useful here.

Yes, 2/2 (minus the page->swap addition) already proved good enough for
your (more realistic than trinity) testcase, and for mine.  And 1/2 (minus
the new waiting) already proved good enough for you too, just more awkward
to backport way back.  I agree that it's trinity we most need, to check
that I didn't mess up 1/2 - though your testing is welcome too, thanks.

> But there's something that caught my eye so I thought I
> would raise the concern now.

Thank you.

> 
> > @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
> >   			spin_lock(&inode->i_lock);
> >   			shmem_falloc = inode->i_private;
> 
> Without ACCESS_ONCE, can shmem_falloc potentially become an alias on
> inode->i_private and later become re-read outside of the lock?

No, it could be re-read inside the locked section (which is okay since
the locking ensures the same value would be re-read each time), but it
cannot be re-read after the unlock.  The unlock guarantees that (whereas
an assignment after the unlock might be moved up before the unlock).

I searched for a simple example (preferably not in code written by me!)
to convince you.  I thought it would be easy to find an example of

	spin_lock(&lock);
	thing_to_free = whatever;
	spin_unlock(&lock);
	if (thing_to_free)
		free(thing_to_free);

but everything I hit upon was actually a little more complicated than
that (e.g. involving whatever(), or setting whatever = NULL after),
and therefore less convincing.  Please hunt around to convince yourself.
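
For instance, a compilable userspace analogue of the same shape (with
pthread_mutex_unlock standing in for spin_unlock as the release that
later uses of the snapshot cannot be reordered across):

	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static int *shared;	/* plays the role of inode->i_private */

	static void consumer(void)
	{
		int *thing_to_free;

		pthread_mutex_lock(&lock);
		thing_to_free = shared;	/* snapshot taken under the lock */
		pthread_mutex_unlock(&lock);
		if (thing_to_free)	/* tests the snapshot: 'shared' */
			printf("%d\n", *thing_to_free);	/* is never re-read */
	}

	int main(void)
	{
		static int value = 42;

		shared = &value;
		consumer();
		return 0;
	}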

> 
> >   			if (shmem_falloc &&
> > -			    !shmem_falloc->mode &&
> > +			    !shmem_falloc->waitq &&
> >   			    index >= shmem_falloc->start &&
> >   			    index < shmem_falloc->next)
> >   				shmem_falloc->nr_unswapped++;
...
> >   	if (unlikely(inode->i_private)) {
> >   		struct shmem_falloc *shmem_falloc;
> > 
> >   		spin_lock(&inode->i_lock);
> >   		shmem_falloc = inode->i_private;
> 
> Same here.

Same here :)

> 
> > -		if (!shmem_falloc ||
> > -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> > -		    vmf->pgoff < shmem_falloc->start ||
> > -		    vmf->pgoff >= shmem_falloc->next)
> > -			shmem_falloc = NULL;
> > -		spin_unlock(&inode->i_lock);
> > -		/*
> > -		 * i_lock has protected us from taking shmem_falloc seriously
> > -		 * once return from shmem_fallocate() went back up that stack.
> > -		 * i_lock does not serialize with i_mutex at all, but it does
> > -		 * not matter if sometimes we wait unnecessarily, or sometimes
> > -		 * miss out on waiting: we just need to make those cases rare.
> > -		 */
> > -		if (shmem_falloc) {
> > +		if (shmem_falloc &&
> > +		    shmem_falloc->waitq &&
> 
> Here it's operating outside of the lock.

No, it's inside the lock: just easier to see from the patched source
than from the patch itself.

Hugh


* Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
From: Vlastimil Babka @ 2014-07-16  7:18 UTC
  To: Hugh Dickins
  Cc: Andrew Morton, Sasha Levin, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/15/2014 09:26 PM, Hugh Dickins wrote:
>> 
>> > @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
>> >   			spin_lock(&inode->i_lock);
>> >   			shmem_falloc = inode->i_private;
>> 
>> Without ACCESS_ONCE, can shmem_falloc potentially become an alias on
>> inode->i_private and later become re-read outside of the lock?
> 
> No, it could be re-read inside the locked section (which is okay since
> the locking ensures the same value would be re-read each time), but it
> cannot be re-read after the unlock.  The unlock guarantees that (whereas
> an assignment after the unlock might be moved up before the unlock).
> 
> I searched for a simple example (preferably not in code written by me!)
> to convince you.  I thought it would be easy to find an example of
> 
> 	spin_lock(&lock);
> 	thing_to_free = whatever;
> 	spin_unlock(&lock);
> 	if (thing_to_free)
> 		free(thing_to_free);
> 
> but everything I hit upon was actually a little more complicated than
> that (e.g. involving whatever(), or setting whatever = NULL after),
> and therefore less convincing.  Please hunt around to convince yourself.

Yeah, I thought to myself on the way home that this is probably the case.
I guess some recent bugs made me too paranoid. Sorry for the noise and the
time you spent explaining this :/

>> 
>> > -		if (!shmem_falloc ||
>> > -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
>> > -		    vmf->pgoff < shmem_falloc->start ||
>> > -		    vmf->pgoff >= shmem_falloc->next)
>> > -			shmem_falloc = NULL;
>> > -		spin_unlock(&inode->i_lock);
>> > -		/*
>> > -		 * i_lock has protected us from taking shmem_falloc seriously
>> > -		 * once return from shmem_fallocate() went back up that stack.
>> > -		 * i_lock does not serialize with i_mutex at all, but it does
>> > -		 * not matter if sometimes we wait unnecessarily, or sometimes
>> > -		 * miss out on waiting: we just need to make those cases rare.
>> > -		 */
>> > -		if (shmem_falloc) {
>> > +		if (shmem_falloc &&
>> > +		    shmem_falloc->waitq &&
>> 
>> Here it's operating outside of lock.
> 
> No, it's inside the lock: just easier to see from the patched source
> than from the patch itself.

Ah, right :/

> Hugh
> 



* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-15 10:28 ` Hugh Dickins
@ 2014-07-17 16:10   ` Vlastimil Babka
  -1 siblings, 0 replies; 42+ messages in thread
From: Vlastimil Babka @ 2014-07-17 16:10 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Sasha Levin, Konstantin Khlebnikov, Johannes Weiner,
	Michel Lespinasse, Lukas Czerner, Dave Jones, linux-mm,
	linux-fsdevel, linux-kernel

On 07/15/2014 12:28 PM, Hugh Dickins wrote:
> In the end I decided that we had better look at it as two problems,
> the trinity faulting starvation, and the indefinite punching loop,
> so 1/2 and 2/2 present both solutions: belt and braces.

I tested that with my reproducer and it was OK, but as I already said, 
it's not trinity so I didn't observe the new problems in the first place.

> Which may be the best for fixing, but the worst for ease of backporting.
> Vlastimil, I have prepared (and lightly tested) a 3.2.61-based version
> of the combination of f00cdc6df7d7 and 1/2 and 2/2 (basically, I moved
> vmtruncate_range from mm/truncate.c to mm/shmem.c, since nothing but
> shmem ever implemented the truncate_range method).  It should give a

I don't know how much stable kernel updates are supposed to care about
out-of-tree modules, but doesn't the change mean that an out-of-tree FS
supporting truncate_range (if such a thing exists) would effectively stop
supporting madvise(MADV_REMOVE) after this change? But hey, it's still
madvise, so maybe we don't need to care. And I suppose kernels where
FALLOC_FL_PUNCH_HOLE is supported can be backported normally.
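
(For context, an abridged paraphrase of the 3.2-era generic dispatch the
question refers to, not the verbatim source:)

	/* mm/truncate.c before the backport: any filesystem could opt in
	 * to madvise(MADV_REMOVE) by implementing ->truncate_range() */
	int vmtruncate_range(struct inode *inode, loff_t lstart, loff_t lend)
	{
		if (!inode->i_op->truncate_range)
			return -ENOSYS;		/* no hole-punch support */

		/* ... i_mutex and unmap_mapping_range() handling elided ... */
		inode->i_op->truncate_range(inode, lstart, lend);
		return 0;
	}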

> good hint for backports earlier and later: I'll send it privately to
> you now, but keep in mind that it may need to be revised if today's
> patches for 3.16 get revised again (I'll send it to Ben Hutchings
> only when that's settled).
>
> Thanks,
> Hugh
>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-17 16:10   ` Vlastimil Babka
@ 2014-07-17 16:12     ` Sasha Levin
  -1 siblings, 0 replies; 42+ messages in thread
From: Sasha Levin @ 2014-07-17 16:12 UTC (permalink / raw)
  To: Vlastimil Babka, Hugh Dickins, Andrew Morton
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On 07/17/2014 12:10 PM, Vlastimil Babka wrote:
> On 07/15/2014 12:28 PM, Hugh Dickins wrote:
>> In the end I decided that we had better look at it as two problems,
>> the trinity faulting starvation, and the indefinite punching loop,
>> so 1/2 and 2/2 present both solutions: belt and braces.
> 
> I tested that with my reproducer and it was OK, but as I already said, it's not trinity so I didn't observe the new problems in the first place.

I've started seeing a new hang in the lru code, but I'm not sure if
it's related to this patch or not (the locks are the same ones, but
the location is very different).

I'm looking into that.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-17 16:10   ` Vlastimil Babka
@ 2014-07-17 23:34     ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2014-07-17 23:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Sasha Levin, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On Thu, 17 Jul 2014, Vlastimil Babka wrote:
> On 07/15/2014 12:28 PM, Hugh Dickins wrote:
> > In the end I decided that we had better look at it as two problems,
> > the trinity faulting starvation, and the indefinite punching loop,
> > so 1/2 and 2/2 present both solutions: belt and braces.
> 
> I tested that with my reproducer and it was OK, but as I already said, it's
> not trinity so I didn't observe the new problems in the first place.

Yes, but thanks for doing so anyway.

> 
> > Which may be the best for fixing, but the worst for ease of backporting.
> > Vlastimil, I have prepared (and lightly tested) a 3.2.61-based version
> > of the combination of f00cdc6df7d7 and 1/2 and 2/2 (basically, I moved
> > vmtruncate_range from mm/truncate.c to mm/shmem.c, since nothing but
> > shmem ever implemented the truncate_range method).  It should give a
> 
> I don't know how much stable kernel updates are supposed to care about
> out-of-tree modules,

I suggest that stable kernel updates do not need to care about
out-of-tree modules: for so long as they are out of tree, they have
to look after their own compatibility from one version to another.
I have no desire to break them gratuitously, but it's not for me
to spend more time accommodating them.

Now, SLES and RHEL and other distros may have different priorities
from that: if they distribute additional filesystems, which happen to
support the ->truncate_range() method, or work with partners who supply
such filesystems, then they may want to rework the shmem-specific
vmtruncate_range() to allow for those - that's up to them.

> but doesn't the change mean that an out-of-tree FS
> supporting truncate_range (if such a thing exists) would effectively stop
> supporting madvise(MADV_REMOVE) after this change?

Yes, it would need to be reworked a little for them: I've not thought
through what more would need to be done.  But it seems odd to me that
an out-of-tree driver would support it, when it got no uptake at all
from in-tree filesystems, even from those which went on to support
hole-punching in fallocate() (until the tmpfs series brought them in).

Or perhaps MADV_REMOVE-support is their secret sauce :-?  In that case
I would expect them to support FALLOC_FL_PUNCH_HOLE already, and prefer
a backport of v3.5's merging of the madvise and fallocate routes.
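
(Roughly what that v3.5 merging looks like, an abridged paraphrase of
mm/madvise.c rather than the verbatim source: madvise_remove() stops going
through ->truncate_range() and is routed to the fallocate path instead:)

	/* MADV_REMOVE becomes a hole-punching fallocate() on the file */
	offset = (loff_t)(start - vma->vm_start)
			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
	error = do_fallocate(vma->vm_file,
			     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			     offset, end - start);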

> But hey, it's still madvise, so maybe we don't need to care.

That's an argument I would not use, not in Linus's kernel anyway:
users may have come to rely upon the behaviour of madvise(MADV_REMOVE):
never mind that it says "advise", I would not be happy to break them.

> And I suppose kernels where
> FALLOC_FL_PUNCH_HOLE is supported can be backported normally.

Yes.

Hugh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-17 23:34     ` Hugh Dickins
@ 2014-07-18  8:05       ` Vlastimil Babka
  -1 siblings, 0 replies; 42+ messages in thread
From: Vlastimil Babka @ 2014-07-18  8:05 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Sasha Levin, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/18/2014 01:34 AM, Hugh Dickins wrote:
> On Thu, 17 Jul 2014, Vlastimil Babka wrote:
>> On 07/15/2014 12:28 PM, Hugh Dickins wrote:
>> > In the end I decided that we had better look at it as two problems,
>> > the trinity faulting starvation, and the indefinite punching loop,
>> > so 1/2 and 2/2 present both solutions: belt and braces.
>> 
>> I tested that with my reproducer and it was OK, but as I already said, it's
>> not trinity so I didn't observe the new problems in the first place.
> 
> Yes, but thanks for doing so anyway.

I've now also tested vanilla 3.2.61, and it's OK too.

>> 
>> > Which may be the best for fixing, but the worst for ease of backporting.
>> > Vlastimil, I have prepared (and lightly tested) a 3.2.61-based version
>> > of the combination of f00cdc6df7d7 and 1/2 and 2/2 (basically, I moved
>> > vmtruncate_range from mm/truncate.c to mm/shmem.c, since nothing but
>> > shmem ever implemented the truncate_range method).  It should give a
>> 
>> I don't know how much stable kernel updates are supposed to care about
>> out-of-tree modules,
> 
> I suggest that stable kernel updates do not need to care about
> out-of-tree modules: for so long as they are out of tree, they have
> to look after their own compatibility from one version to another.
> I have no desire to break them gratuitously, but it's not for me
> to spend more time accommodating them.

Fair enough.

> Now, SLES and RHEL and other distros may have different priorities
> from that: if they distribute additional filesystems, which happen to
> support the ->truncate_range() method, or work with partners who supply
> such filesystems, then they may want to rework the shmem-specific
> vmtruncate_range() to allow for those - that's up to them.

Sure, it wasn't my intention to raise any enterprise-kernel-specific concerns here.

>> but doesn't the change mean that an out-of-tree FS
>> supporting truncate_range (if such a thing exists) would effectively stop
>> supporting madvise(MADV_REMOVE) after this change?
> 
> Yes, it would need to be reworked a little for them: I've not thought
> through what more would need to be done.  But it seems odd to me that
> an out-of-tree driver would support it, when it got no uptake at all
> from in-tree filesystems, even from those which went on to support
> hole-punching in fallocate() (until the tmpfs series brought them in).
> 
> Or perhaps MADV_REMOVE-support is their secret sauce :-?  In that case
> I would expect them to support FALLOC_FL_PUNCH_HOLE already, and prefer
> a backport of v3.5's merging of the madvise and fallocate routes.
> 
>> But hey, it's still madvise, so maybe we don't need to care.
> 
> That's an argument I would not use, not in Linus's kernel anyway:
> users may have come to rely upon the behaviour of madvise(MADV_REMOVE):
> never mind that it says "advise", I would not be happy to break them.

Right.

>> And I suppose kernels where
>> FALLOC_FL_PUNCH_HOLE is supported can be backported normally.
> 
> Yes.
> 
> Hugh
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-17 16:12     ` Sasha Levin
@ 2014-07-18 10:44       ` Sasha Levin
  -1 siblings, 0 replies; 42+ messages in thread
From: Sasha Levin @ 2014-07-18 10:44 UTC (permalink / raw)
  To: Vlastimil Babka, Hugh Dickins, Andrew Morton
  Cc: Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On 07/17/2014 12:12 PM, Sasha Levin wrote:
> On 07/17/2014 12:10 PM, Vlastimil Babka wrote:
>> > On 07/15/2014 12:28 PM, Hugh Dickins wrote:
>>> >> In the end I decided that we had better look at it as two problems,
>>> >> the trinity faulting starvation, and the indefinite punching loop,
>>> >> so 1/2 and 2/2 present both solutions: belt and braces.
>> > 
>> > I tested that with my reproducer and it was OK, but as I already said, it's not trinity so I didn't observe the new problems in the first place.
> I've started seeing a new hang in the lru code, but I'm not sure if
> it's related to this patch or not (the locks are the same ones, but
> the location is very different).
> 
> I'm looking into that.

Hi Hugh,

The new hang I'm seeing is much simpler to analyse (compared to shmem_fallocate) and
doesn't seem to be related. I'll send a separate mail and Cc you just in case, but
I don't think that this patchset has anything to do with it.

Otherwise, I've been unable to reproduce the shmem_fallocate hang.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-18 10:44       ` Sasha Levin
@ 2014-07-19 23:44         ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2014-07-19 23:44 UTC (permalink / raw)
  To: Sasha Levin, Andrew Morton
  Cc: Vlastimil Babka, Hugh Dickins, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On Fri, 18 Jul 2014, Sasha Levin wrote:
> On 07/17/2014 12:12 PM, Sasha Levin wrote:
> > On 07/17/2014 12:10 PM, Vlastimil Babka wrote:
> >> > On 07/15/2014 12:28 PM, Hugh Dickins wrote:
> >>> >> In the end I decided that we had better look at it as two problems,
> >>> >> the trinity faulting starvation, and the indefinite punching loop,
> >>> >> so 1/2 and 2/2 present both solutions: belt and braces.
> >> > 
> >> > I tested that with my reproducer and it was OK, but as I already said, it's not trinity so I didn't observe the new problems in the first place.
> > I've started seeing a new hang in the lru code, but I'm not sure if
> > it's related to this patch or not (the locks are the same ones, but
> > the location is very different).
> > 
> > I'm looking into that.
> 
> Hi Hugh,
> 
> The new hang I'm seeing is much simpler to analyse (compared to shmem_fallocate) and
> doesn't seem to be related. I'll send a separate mail and Cc you just in case, but
> I don't think that this patchset has anything to do with it.

Thanks for testing and following up, Sasha.  I agree, that one is
unrelated.  I've just sent a suggestion in your lru_add_drain_all thread.

> 
> Otherwise, I've been unable to reproduce the shmem_fallocate hang.

Great.  Andrew, I think we can say that it's now safe to send
1/2 shmem: fix faulting into a hole, not taking i_mutex
2/2 shmem: fix splicing from a hole while it's punched
on to Linus whenever suits you.

(You have some other patches in the mainline-later section of the
mmotm/series file: they're okay too, but not in doubt as these two were.)

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-19 23:44         ` Hugh Dickins
@ 2014-07-22  3:24           ` Sasha Levin
  -1 siblings, 0 replies; 42+ messages in thread
From: Sasha Levin @ 2014-07-22  3:24 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Vlastimil Babka, Konstantin Khlebnikov, Johannes Weiner,
	Michel Lespinasse, Lukas Czerner, Dave Jones, linux-mm,
	linux-fsdevel, linux-kernel

On 07/19/2014 07:44 PM, Hugh Dickins wrote:
>> Otherwise, I've been unable to reproduce the shmem_fallocate hang.
> Great.  Andrew, I think we can say that it's now safe to send
> 1/2 shmem: fix faulting into a hole, not taking i_mutex
> 2/2 shmem: fix splicing from a hole while it's punched
> on to Linus whenever suits you.
> 
> (You have some other patches in the mainline-later section of the
> mmotm/series file: they're okay too, but not in doubt as these two were.)

I think we may need to hold off on sending them...

It seems that this code in shmem_fault():

	/*
	 * shmem_falloc_waitq points into the shmem_fallocate()
	 * stack of the hole-punching task: shmem_falloc_waitq
	 * is usually invalid by the time we reach here, but
	 * finish_wait() does not dereference it in that case;
	 * though i_lock needed lest racing with wake_up_all().
	 */
	spin_lock(&inode->i_lock);
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
	spin_unlock(&inode->i_lock);

is problematic. I'm not sure what changed, but it seems to be causing everything
from NULL ptr derefs:

[  169.922536] BUG: unable to handle kernel NULL pointer dereference at 0000000000000631
[  169.925638] IP: __lock_acquire (./arch/x86/include/asm/atomic.h:92 kernel/locking/lockdep.c:3082)
[  169.927845] PGD 1d38af067 PUD 1d38b0067 PMD 0
[  169.929644] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  169.930082] Dumping ftrace buffer:
[  169.930082]    (ftrace buffer empty)
[  169.930082] Modules linked in:
[  169.930082] CPU: 14 PID: 8824 Comm: trinity-c53 Tainted: G        W      3.16.0-rc5-next-20140721-sasha-00051-g258dfea-dirty #925
[  169.930082] task: ffff8801d3893000 ti: ffff8801d38f8000 task.ti: ffff8801d38f8000
[  169.930082] RIP: __lock_acquire (./arch/x86/include/asm/atomic.h:92 kernel/locking/lockdep.c:3082)
[  169.930082] RSP: 0000:ffff8801d38fb6c0  EFLAGS: 00010006
[  169.930082] RAX: 0000000000000000 RBX: ffff8801d3893000 RCX: 0000000000000001
[  169.930082] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801b2b13d98
[  169.930082] RBP: ffff8801d38fb728 R08: 0000000000000001 R09: 0000000000000001
[  169.930082] R10: 0000000000000499 R11: 0000000000000001 R12: 0000000000000000
[  169.930082] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8801b2b13d98
[  169.930082] FS:  00007f9e6374a700(0000) GS:ffff880548e00000(0000) knlGS:0000000000000000
[  169.930082] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  169.930082] CR2: 0000000000000631 CR3: 00000001d38ae000 CR4: 00000000000006a0
[  169.930082] Stack:
[  169.930082]  ffff8801d3893000 ffff8801d3893000 ffffffffa6053bf0 0000000000000290
[  169.930082]  0000000000000000 ffff8801d38fb760 ffffffff9f1d0be2 ffffffff9f1cdbdb
[  169.930082]  ffff8801b2b13d80 0000000000000000 0000000000000000 0000000000000001
[  169.930082] Call Trace:
[  169.930082] ? __lock_acquire (kernel/locking/lockdep.c:3189)
[  169.930082] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[  169.930082] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[  169.930082] ? finish_wait (include/linux/list.h:144 kernel/sched/wait.c:251)
[  169.930082] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[  169.930082] ? finish_wait (include/linux/list.h:144 kernel/sched/wait.c:251)
[  169.930082] finish_wait (include/linux/list.h:144 kernel/sched/wait.c:251)
[  169.930082] shmem_fault (include/linux/spinlock.h:343 mm/shmem.c:1327)
[  169.930082] ? __wait_on_bit_lock (kernel/sched/wait.c:291)
[  169.930082] __do_fault (mm/memory.c:2713)
[  169.930082] do_read_fault.isra.40 (mm/memory.c:2905)
[  169.930082] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  169.930082] ? __lock_is_held (kernel/locking/lockdep.c:3516)
[  170.003723] __do_page_fault (arch/x86/mm/fault.c:1231)
[  170.003723] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  170.003723] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  170.003723] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  170.003723] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  170.003723] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  170.003723] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[  170.003723] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:137)
[  170.003723] ? copy_page_from_iter_iovec (mm/iov_iter.c:141)
[  170.003723] copy_page_from_iter (mm/iov_iter.c:668)
[  170.003723] process_vm_rw_core.isra.2 (mm/process_vm_access.c:50 mm/process_vm_access.c:114 mm/process_vm_access.c:213)
[  170.003723] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3769)
[  170.003723] ? might_fault (mm/memory.c:3770)
[  170.003723] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3769)
[  170.003723] ? rw_copy_check_uvector (fs/read_write.c:758)
[  170.003723] process_vm_rw (mm/process_vm_access.c:287)
[  170.003723] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  170.003723] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[  170.003723] ? vtime_account_user (kernel/sched/cputime.c:687)
[  170.003723] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[  170.003723] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  170.003723] ? syscall_trace_enter (include/trace/events/syscalls.h:16 arch/x86/kernel/ptrace.c:1488)
[  170.003723] SyS_process_vm_writev (mm/process_vm_access.c:302)
[  170.003723] tracesys (arch/x86/kernel/entry_64.S:541)
[ 170.003723] Code: 49 81 3f 00 3e 97 a5 b8 00 00 00 00 44 0f 44 c0 41 83 fe 01 0f 87 e5 fe ff ff 44 89 f0 4d 8b 54 c7 08 4d 85 d2 0f 84 d4 fe ff ff <f0> 41 ff 82 98 01 00 00 8b 8b f0 0c 00 00 83 f9 2f 76 0e 8b 05
All code
========
   0:   49 81 3f 00 3e 97 a5    cmpq   $0xffffffffa5973e00,(%r15)
   7:   b8 00 00 00 00          mov    $0x0,%eax
   c:   44 0f 44 c0             cmove  %eax,%r8d
  10:   41 83 fe 01             cmp    $0x1,%r14d
  14:   0f 87 e5 fe ff ff       ja     0xfffffffffffffeff
  1a:   44 89 f0                mov    %r14d,%eax
  1d:   4d 8b 54 c7 08          mov    0x8(%r15,%rax,8),%r10
  22:   4d 85 d2                test   %r10,%r10
  25:   0f 84 d4 fe ff ff       je     0xfffffffffffffeff
  2b:*  f0 41 ff 82 98 01 00    lock incl 0x198(%r10)           <-- trapping instruction
  32:   00
  33:   8b 8b f0 0c 00 00       mov    0xcf0(%rbx),%ecx
  39:   83 f9 2f                cmp    $0x2f,%ecx
  3c:   76 0e                   jbe    0x4c
  3e:   8b                      .byte 0x8b
  3f:   05                      .byte 0x5
        ...

Code starting with the faulting instruction
===========================================
   0:   f0 41 ff 82 98 01 00    lock incl 0x198(%r10)
   7:   00
   8:   8b 8b f0 0c 00 00       mov    0xcf0(%rbx),%ecx
   e:   83 f9 2f                cmp    $0x2f,%ecx
  11:   76 0e                   jbe    0x21
  13:   8b                      .byte 0x8b
  14:   05                      .byte 0x5
        ...
[  170.003723] RIP __lock_acquire (./arch/x86/include/asm/atomic.h:92 kernel/locking/lockdep.c:3082)
[  170.003723]  RSP <ffff8801d38fb6c0>
[  170.003723] CR2: 0000000000000631

To memory corruptions:

[ 1031.264226] BUG: spinlock bad magic on CPU#1, trinity-c99/25740
[ 1031.265632]  lock: 0xffff88038023fd80, .magic: ffff8802, .owner: %<C0><DA>/1711276032, .owner_cpu: 0
[ 1031.267000] CPU: 1 PID: 25740 Comm: trinity-c99 Tainted: G        W      3.16.0-rc5-next-20140721-sasha-00051-g258dfea-dirty #925
[ 1031.270013]  ffff88038023fd80 ffff88010d2a38c0 ffffffffa24c0712 ffffffff9f1a703d
[ 1031.270081]  ffff88010d2a38e0 ffffffff9f1d6d76 ffff88038023fd80 ffffffffa396a896
[ 1031.270081]  ffff88010d2a3900 ffffffff9f1d6df6 ffff88038023fd80 ffff88038023fd80
[ 1031.270081] Call Trace:
[ 1031.270081] dump_stack (lib/dump_stack.c:52)
[ 1031.270081] ? sched_clock_local (kernel/sched/clock.c:214)
[ 1031.270081] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
[ 1031.270081] spin_bug (kernel/locking/spinlock_debug.c:76)
[ 1031.270081] do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:165 kernel/locking/spinlock_debug.c:98 kernel/locking/spinlock_debug.c:158)
[ 1031.270081] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191)
[ 1031.270081] finish_wait (kernel/sched/wait.c:254)
[ 1031.270081] shmem_fault (include/linux/spinlock.h:343 mm/shmem.c:1327)
[ 1031.270081] ? __wait_on_bit_lock (kernel/sched/wait.c:291)
[ 1031.270081] __do_fault (mm/memory.c:2713)
[ 1031.270081] do_shared_fault (mm/memory.c:2985 (discriminator 8))
[ 1031.270081] handle_mm_fault (mm/memory.c:3097 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[ 1031.270081] __do_page_fault (arch/x86/mm/fault.c:1231)
[ 1031.270081] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1031.270081] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 1031.270081] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1031.270081] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[ 1031.270081] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[ 1031.270081] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[ 1031.270081] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[ 1031.270081] ? copy_page_to_iter_iovec (include/linux/pagemap.h:562 mm/iov_iter.c:27)
[ 1031.270081] ? vmsplice_to_user (fs/splice.c:1533)
[ 1031.270081] copy_page_to_iter (mm/iov_iter.c:658)
[ 1031.270081] ? pipe_lock (fs/pipe.c:69)
[ 1031.270081] ? preempt_count_sub (kernel/sched/core.c:2617)
[ 1031.270081] ? vmsplice_to_user (fs/splice.c:1533)
[ 1031.270081] pipe_to_user (fs/splice.c:1535)
[ 1031.270081] __splice_from_pipe (fs/splice.c:770 fs/splice.c:886)
[ 1031.270081] vmsplice_to_user (fs/splice.c:1573)
[ 1031.270081] ? rcu_read_lock_held (kernel/rcu/update.c:168)
[ 1031.270081] SyS_vmsplice (include/linux/file.h:38 fs/splice.c:1657 fs/splice.c:1638)
[ 1031.270081] tracesys (arch/x86/kernel/entry_64.S:541)

And hangs:

[  212.010020] INFO: rcu_preempt detected stalls on CPUs/tasks:
[  212.010020]  Tasks blocked on level-1 rcu_node (CPUs 0-15):
[  212.010020]  8: (136 GPs behind) idle=2b9/140000000000000/0 softirq=4/4 last_accelerate: 0000/dda2, nonlazy_posted: 0, .D
[  212.010020]  9: (136 GPs behind) idle=92e/0/0 softirq=3/3 last_accelerate: 0000/dda2, nonlazy_posted: 0, .D
[  212.010020]  (detected by 1, t=6502 jiffies, g=4645, c=4644, q=0)
[  212.010020] Task dump for CPU 8:
[  212.010020] trinity-c350    R  running task    13000  9101   8424 0x00080006
[  212.010020]  ffff880520f47d98 0000000000000296 ffff8805230cfb38 ffffffffb750ba04
[  212.010020]  ffffffffb41bc165 ffff8805230cfb88 ffff8805230cfba0 ffff880520f47d80
[  212.010020]  ffff8805230cfb68 ffffffffb41bc165 ffff880520f47d80 ffff8805230c8800
[  212.010020] Call Trace:
[  212.010020] ? _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[  212.010020] ? finish_wait (include/linux/list.h:144 kernel/sched/wait.c:251)
[  212.010020] ? finish_wait (include/linux/list.h:144 kernel/sched/wait.c:251)
[  212.010020] ? shmem_fault (include/linux/spinlock.h:343 mm/shmem.c:1327)
[  212.010020] ? __wait_on_bit_lock (kernel/sched/wait.c:291)
[  212.010020] ? __do_fault (mm/memory.c:2713)
[  212.010020] ? do_shared_fault (mm/memory.c:2985 (discriminator 8))
[  212.010020] ? handle_mm_fault (mm/memory.c:3097 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[  212.010020] ? __do_page_fault (arch/x86/mm/fault.c:1231)
[  212.010020] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  212.010020] ? __tick_nohz_task_switch (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/time/tick-sched.c:278 (discriminator 2))
[  212.010020] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  212.010020] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[  212.010020] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  212.010020] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[  212.010020] ? trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[  212.010020] ? do_async_page_fault (arch/x86/kernel/kvm.c:279)
[  212.010020] ? async_page_fault (arch/x86/kernel/entry_64.S:1321)
[  212.010020] ? copy_user_generic_unrolled (arch/x86/lib/copy_user_64.S:167)
[  212.010020] ? SyS_getcwd (./arch/x86/include/asm/uaccess.h:731 fs/dcache.c:3200 fs/dcache.c:3164)
[  212.010020] ? tracesys (arch/x86/kernel/entry_64.S:541)
[  212.010020] ? tracesys (arch/x86/kernel/entry_64.S:541)


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22  3:24           ` Sasha Levin
@ 2014-07-22  8:07             ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2014-07-22  8:07 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On Mon, 21 Jul 2014, Sasha Levin wrote:
> On 07/19/2014 07:44 PM, Hugh Dickins wrote:
> >> Otherwise, I've been unable to reproduce the shmem_fallocate hang.
> > Great.  Andrew, I think we can say that it's now safe to send
> > 1/2 shmem: fix faulting into a hole, not taking i_mutex
> > 2/2 shmem: fix splicing from a hole while it's punched
> > on to Linus whenever suits you.
> > 
> > (You have some other patches in the mainline-later section of the
> > mmotm/series file: they're okay too, but not in doubt as these two were.)
> 
> I think we may need to hold off on sending them...
> 
> It seems that this code in shmem_fault():
> 
> 	/*
> 	 * shmem_falloc_waitq points into the shmem_fallocate()
> 	 * stack of the hole-punching task: shmem_falloc_waitq
> 	 * is usually invalid by the time we reach here, but
> 	 * finish_wait() does not dereference it in that case;
> 	 * though i_lock needed lest racing with wake_up_all().
> 	 */
> 	spin_lock(&inode->i_lock);
> 	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> 	spin_unlock(&inode->i_lock);
> 
> Is problematic. I'm not sure what changed, but it seems to be causing everything
> from NULL ptr derefs:
> 
> [  169.922536] BUG: unable to handle kernel NULL pointer dereference at 0000000000000631
> 
> To memory corruptions:
> 
> [ 1031.264226] BUG: spinlock bad magic on CPU#1, trinity-c99/25740
> 
> And hangs:
> 
> [  212.010020] INFO: rcu_preempt detected stalls on CPUs/tasks:

Ugh.

I'm so tired of this, I'm flailing around here, and could use some help.

But there is one easy change which might do it: please would you try
changing the TASK_KILLABLE a few lines above to TASK_UNINTERRUPTIBLE.
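
To be concrete, it's only the state argument of the prepare_to_wait()
call in shmem_fault() that would change (a sketch against 1/2):

	prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
			TASK_UNINTERRUPTIBLE);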

I noticed when deciding on the i_lock'ing there, that a lot of the
difficulty in races between the two ends, came from allowing KILLABLE
at the faulting end.  Which was a nice courtesy, but one I'll gladly
give up on now, if it is responsible for these troubles.

Please give it a try, I don't know what else to suggest.  And I've
no idea why the problem should emerge only now.  If this change
does appear to fix it, please go back and forth with and without,
to gather confidence that the problem is still reproducible without
the fix.  If fix it turns out to be - touch wood.

Thanks,
Hugh


* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22  8:07             ` Hugh Dickins
@ 2014-07-22 10:06               ` Vlastimil Babka
  -1 siblings, 0 replies; 42+ messages in thread
From: Vlastimil Babka @ 2014-07-22 10:06 UTC (permalink / raw)
  To: Hugh Dickins, Sasha Levin
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner,
	Michel Lespinasse, Lukas Czerner, Dave Jones, linux-mm,
	linux-fsdevel, linux-kernel, Michal Hocko

On 07/22/2014 10:07 AM, Hugh Dickins wrote:
> On Mon, 21 Jul 2014, Sasha Levin wrote:
>> On 07/19/2014 07:44 PM, Hugh Dickins wrote:
>>>> Otherwise, I've been unable to reproduce the shmem_fallocate hang.
>>> Great.  Andrew, I think we can say that it's now safe to send
>>> 1/2 shmem: fix faulting into a hole, not taking i_mutex
>>> 2/2 shmem: fix splicing from a hole while it's punched
>>> on to Linus whenever suits you.
>>>
>>> (You have some other patches in the mainline-later section of the
>>> mmotm/series file: they're okay too, but not in doubt as these two were.)
>>
>> I think we may need to hold off on sending them...
>>
>> It seems that this code in shmem_fault():
>>
>> 	/*
>> 	 * shmem_falloc_waitq points into the shmem_fallocate()
>> 	 * stack of the hole-punching task: shmem_falloc_waitq
>> 	 * is usually invalid by the time we reach here, but
>> 	 * finish_wait() does not dereference it in that case;
>> 	 * though i_lock needed lest racing with wake_up_all().
>> 	 */
>> 	spin_lock(&inode->i_lock);
>> 	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
>> 	spin_unlock(&inode->i_lock);
>>
>> Is problematic. I'm not sure what changed, but it seems to be causing everything
>> from NULL ptr derefs:
>>
>> [  169.922536] BUG: unable to handle kernel NULL pointer dereference at 0000000000000631
>>
>> To memory corruptions:
>>
>> [ 1031.264226] BUG: spinlock bad magic on CPU#1, trinity-c99/25740
>>
>> And hangs:
>>
>> [  212.010020] INFO: rcu_preempt detected stalls on CPUs/tasks:
>
> Ugh.
>
> I'm so tired of this, I'm flailing around here, and could use some help.
>
> But there is one easy change which might do it: please would you try
> changing the TASK_KILLABLE a few lines above to TASK_UNINTERRUPTIBLE.

Hello,

I have discussed it with Michal Hocko, as he was recently fighting with 
wait queues, and indeed he found a possible race quickly. And yes, it's 
related to the TASK_KILLABLE state.

The problem would manifest when the waiting faulting task is woken by a 
signal (thanks to the TASK_KILLABLE), which will change its state to 
TASK_WAKING. Later it might become TASK_RUNNING again.
Either value means that a wake_up_all() from the punching task will skip 
the faulter's shmem_fault_wait when clearing up the wait queue (see the 
p->state & state check in try_to_wake_up()).

Then in the faulter's finish_wait(), &wait->task_list will not be empty, 
so it will try to access the wait queue, which might already be gone
from the puncher's stack...
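
For reference, finish_wait() here is roughly this (paraphrased from
kernel/sched/wait.c of this era):

	void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
	{
		unsigned long flags;

		__set_current_state(TASK_RUNNING);
		/* entry still queued: the wake_up_all() skipped this task */
		if (!list_empty_careful(&wait->task_list)) {
			/* q may point into the puncher's defunct stack frame */
			spin_lock_irqsave(&q->lock, flags);
			list_del_init(&wait->task_list);
			spin_unlock_irqrestore(&q->lock, flags);
		}
	}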

So if this is true, the change to TASK_UNINTERRUPTIBLE will avoid the 
problem, but it would be nicer to keep the KILLABLE state.
I think it could be done by testing if the wait queue still exists and 
is the same, before attempting finish_wait(). If it doesn't exist, that 
means the faulter can skip finish_wait() altogether because it must 
already be TASK_RUNNING.

shmem_falloc = inode->i_private;
if (shmem_falloc && shmem_falloc->waitq == shmem_falloc_waitq)
	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);

It might still be theoretically possible that although it has the same 
address, it's not the same wait queue, but that doesn't hurt 
correctness. I might be uselessly locking some other waitq's lock, but 
the inode->i_lock still protects me from other faulters that are in the 
same situation. The puncher is already gone.

However it's quite ugly and if there is some wait queue debugging mode 
(I hadn't checked) that e.g. checks if wait queues and wait objects are 
empty before destruction, it wouldn't like this at all...


> I noticed when deciding on the i_lock'ing there, that a lot of the
> difficulty in races between the two ends, came from allowing KILLABLE
> at the faulting end.  Which was a nice courtesy, but one I'll gladly
> give up on now, if it is responsible for these troubles.
>
> Please give it a try, I don't know what else to suggest.  And I've
> no idea why the problem should emerge only now.  If this change
> does appear to fix it, please go back and forth with and without,
> to gather confidence that the problem is still reproducible without
> the fix.  If fix it turns out to be - touch wood.
>
> Thanks,
> Hugh
>



* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22 10:06               ` Vlastimil Babka
@ 2014-07-22 12:09                 ` Vlastimil Babka
  -1 siblings, 0 replies; 42+ messages in thread
From: Vlastimil Babka @ 2014-07-22 12:09 UTC (permalink / raw)
  To: Hugh Dickins, Sasha Levin
  Cc: Andrew Morton, Konstantin Khlebnikov, Johannes Weiner,
	Michel Lespinasse, Lukas Czerner, Dave Jones, linux-mm,
	linux-fsdevel, linux-kernel, Michal Hocko

On 07/22/2014 12:06 PM, Vlastimil Babka wrote:
> So if this is true, the change to TASK_UNINTERRUPTIBLE will avoid the
> problem, but it would be nicer to keep the KILLABLE state.
> I think it could be done by testing if the wait queue still exists and
> is the same, before attempting finish wait. If it doesn't exist, that
> means the faulter can skip finish_wait altogether because it must be
> already TASK_RUNNING.
>
> shmem_falloc = inode->i_private;
> if (shmem_falloc && shmem_falloc->waitq == shmem_falloc_waitq)
> 	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
>
> It might still be theoretically possible that although it has the same
> address, it's not the same wait queue, but that doesn't hurt
> correctness. I might be uselessly locking some other waitq's lock, but
> the inode->i_lock still protects me from other faulters that are in the
> same situation. The puncher is already gone.

Actually, I was wrong and deleting from a different queue could corrupt 
the queue head. I don't know if trinity would be able to trigger this, 
but I wouldn't be comfortable knowing it's possible. Calling fallocate 
twice in quick succession from the same process could easily end up at 
the same address on the stack, no?

Another also somewhat ugly possibility is to make sure that the wait 
queue is empty before the puncher quits, regardless of the running state 
of the processes in the queue. I think the conditions here 
(serialization by i_lock) might allow us to do that without risking that 
we e.g. leave anyone sleeping. But it's bending the wait queue design...
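
Roughly like this (hypothetical and untested; shmem_falloc_waitq being 
the on-stack queue head in shmem_fallocate()):

	spin_lock(&inode->i_lock);
	inode->i_private = NULL;
	wake_up_all(&shmem_falloc_waitq);
	/* hypothetical: detach entries left behind by signal-woken waiters,
	 * so their finish_wait() finds the wait object empty and never
	 * touches this stack-based queue head; i_lock keeps those waiters
	 * out until we are done here */
	while (!list_empty(&shmem_falloc_waitq.task_list))
		list_del_init(shmem_falloc_waitq.task_list.next);
	spin_unlock(&inode->i_lock);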

> However it's quite ugly and if there is some wait queue debugging mode
> (I hadn't checked) that e.g. checks if wait queues and wait objects are
> empty before destruction, it wouldn't like this at all...



* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22 12:09                 ` Vlastimil Babka
@ 2014-07-22 18:42                   ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2014-07-22 18:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Sasha Levin, Andrew Morton, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel, Michal Hocko

On Tue, 22 Jul 2014, Vlastimil Babka wrote:
> On 07/22/2014 12:06 PM, Vlastimil Babka wrote:
> > So if this is true, the change to TASK_UNINTERRUPTIBLE will avoid the
> > problem, but it would be nicer to keep the KILLABLE state.
> > I think it could be done by testing if the wait queue still exists and
> > is the same, before attempting finish wait. If it doesn't exist, that
> > means the faulter can skip finish_wait altogether because it must be
> > already TASK_RUNNING.
> > 
> > shmem_falloc = inode->i_private;
> > if (shmem_falloc && shmem_falloc->waitq == shmem_falloc_waitq)
> > 	finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> > 
> > It might still be theoretically possible that although it has the same
> > address, it's not the same wait queue, but that doesn't hurt
> > correctness. I might be uselessly locking some other waitq's lock, but
> > the inode->i_lock still protects me from other faulters that are in the
> > same situation. The puncher is already gone.
> 
> Actually, I was wrong and deleting from a different queue could corrupt the
> queue head. I don't know if trinity would be able to trigger this, but I
> wouldn't be comfortable knowing it's possible. Calling fallocate twice in
> quick succession from the same process could easily end up at the same
> address on the stack, no?
> 
> Another also somewhat ugly possibility is to make sure that the wait queue is
> empty before the puncher quits, regardless of the running state of the
> processes in the queue. I think the conditions here (serialization by i_lock)
> might allow us to do that without risking that we e.g. leave anyone sleeping.
> But it's bending the wait queue design...
> 
> > However it's quite ugly and if there is some wait queue debugging mode
> > (I hadn't checked) that e.g. checks if wait queues and wait objects are
> > empty before destruction, it wouldn't like this at all...

Thanks a lot for confirming the TASK_KILLABLE scenario, Vlastimil:
that fits with how it looked to me last night, but it helps that you
and Michal have investigated that avenue much more thoroughly than I did.

As to refinements to retain TASK_KILLABLE: I am not at all tempted to
complicate or make it more subtle: please let's just agree that it was
a good impulse to try for TASK_KILLABLE, but having seen the problems,
much safer now to go for the simpler TASK_UNINTERRUPTIBLE instead. 

Who are the fault-while-hole-punching users who might be hurt by removing
that little killability window?  Trinity, and people testing fault versus
hole-punching?  Not a huge and deserving market segment, I judge.

Of course, if trinity goes on to pin some deadlock on lack of killability
there, then we shall have to address it.  I don't expect that, and the
fault path would not normally be killable, while waiting for memory.
But trinity does spring some surprises...

(Of course, we could simplify all this by extending the shmem inode,
but I remain strongly resistant to that, which would have an adverse
effect on all the shmem users, rather than just these testers.  Not that
I've computed the sizes and checked how they currently pack on slab and
slub: maybe there would be no adverse effect today, but a generic change
tomorrow would sooner push us over an edge to poorer object density.)

Hugh


* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22  8:07             ` Hugh Dickins
@ 2014-07-22 23:19               ` Sasha Levin
  -1 siblings, 0 replies; 42+ messages in thread
From: Sasha Levin @ 2014-07-22 23:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Vlastimil Babka, Konstantin Khlebnikov,
	Johannes Weiner, Michel Lespinasse, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/22/2014 04:07 AM, Hugh Dickins wrote:
> But there is one easy change which might do it: please would you try
> changing the TASK_KILLABLE a few lines above to TASK_UNINTERRUPTIBLE.

That seems to have done the trick, everything works fine.


Thanks,
Sasha


* Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3
  2014-07-22 23:19               ` Sasha Levin
@ 2014-07-22 23:58                 ` Hugh Dickins
  -1 siblings, 0 replies; 42+ messages in thread
From: Hugh Dickins @ 2014-07-22 23:58 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Hugh Dickins, Andrew Morton, Vlastimil Babka,
	Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On Tue, 22 Jul 2014, Sasha Levin wrote:
> On 07/22/2014 04:07 AM, Hugh Dickins wrote:
> > But there is one easy change which might do it: please would you try
> > changing the TASK_KILLABLE a few lines above to TASK_UNINTERRUPTIBLE.
> 
> That seems to have done the trick, everything works fine.

Super, thank you Sasha: patch to Andrew follows.

Hugh


* Re: [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex
  2014-07-15 10:31   ` Hugh Dickins
@ 2014-07-25 14:25     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-07-25 14:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Sasha Levin, Vlastimil Babka,
	Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On Tue 15-07-14 03:31:11, Hugh Dickins wrote:
> f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was
> buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in
> the fault path is a no-no (write syscall may already hold i_mutex while
> faulting user buffer).
> 
> We tried a completely different approach (see following patch) but that
> proved inadequate: good enough for a rational workload, but not good
> enough against trinity - which forks off so many mappings of the object
> that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
> into serious starvation when concurrent faults force the puncher to fall
> back to single-page unmap_mapping_range() searches of the i_mmap tree.
> 
> So return to the original umbrella approach, but keep away from i_mutex
> this time.  We really don't want to bloat every shmem inode with a new
> mutex or completion, just to protect this unlikely case from trinity.
> So extend the original with wait_queue_head on stack at the hole-punch
> end, and wait_queue item on the stack at the fault end.
> 
> This involves further use of i_lock to guard against the races: lockdep
> has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
> i_lock around wake_up_bit(), which is comparable to what we do here.
> i_lock is more convenient, but we could switch to shmem's info->lock.
> 
> This issue has been tagged with CVE-2014-4171, which will require
> f00cdc6df7d7 and this and the following patch to be backported: we
> suggest to 3.1+, though in fact the trinity forkbomb effect might go
> back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
> not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
> Anyone running trinity on 3.0 and earlier?  I don't think we need care.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: <stable@vger.kernel.org>	[3.1+]

Feel free to add
Reviewed-by: Michal Hocko <mhocko@suse.cz>

for the UNINTERRUPTIBLE sleep version.

> ---
> Please replace mmotm's
> revert-shmem-fix-faulting-into-a-hole-while-its-punched.patch
> by this patch.
> 
>  mm/shmem.c |   78 ++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 52 insertions(+), 26 deletions(-)
> 
> --- 3.16-rc5/mm/shmem.c	2014-07-06 13:25:19.688009119 -0700
> +++ 3.16-rc5+/mm/shmem.c	2014-07-14 20:34:28.196153828 -0700
> @@ -85,7 +85,7 @@ static struct vfsmount *shm_mnt;
>   * a time): we would prefer not to enlarge the shmem inode just for that.
>   */
>  struct shmem_falloc {
> -	int	mode;		/* FALLOC_FL mode currently operating */
> +	wait_queue_head_t *waitq; /* faults into hole wait for punch to end */
>  	pgoff_t start;		/* start of range currently being fallocated */
>  	pgoff_t next;		/* the next page offset to be fallocated */
>  	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
> @@ -760,7 +760,7 @@ static int shmem_writepage(struct page *
>  			spin_lock(&inode->i_lock);
>  			shmem_falloc = inode->i_private;
>  			if (shmem_falloc &&
> -			    !shmem_falloc->mode &&
> +			    !shmem_falloc->waitq &&
>  			    index >= shmem_falloc->start &&
>  			    index < shmem_falloc->next)
>  				shmem_falloc->nr_unswapped++;
> @@ -1248,38 +1248,58 @@ static int shmem_fault(struct vm_area_st
>  	 * Trinity finds that probing a hole which tmpfs is punching can
>  	 * prevent the hole-punch from ever completing: which in turn
>  	 * locks writers out with its hold on i_mutex.  So refrain from
> -	 * faulting pages into the hole while it's being punched, and
> -	 * wait on i_mutex to be released if vmf->flags permits.
> +	 * faulting pages into the hole while it's being punched.  Although
> +	 * shmem_undo_range() does remove the additions, it may be unable to
> +	 * keep up, as each new page needs its own unmap_mapping_range() call,
> +	 * and the i_mmap tree grows ever slower to scan if new vmas are added.
> +	 *
> +	 * It does not matter if we sometimes reach this check just before the
> +	 * hole-punch begins, so that one fault then races with the punch:
> +	 * we just need to make racing faults a rare case.
> +	 *
> +	 * The implementation below would be much simpler if we just used a
> +	 * standard mutex or completion: but we cannot take i_mutex in fault,
> +	 * and bloating every shmem inode for this unlikely case would be sad.
>  	 */
>  	if (unlikely(inode->i_private)) {
>  		struct shmem_falloc *shmem_falloc;
>  
>  		spin_lock(&inode->i_lock);
>  		shmem_falloc = inode->i_private;
> -		if (!shmem_falloc ||
> -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> -		    vmf->pgoff < shmem_falloc->start ||
> -		    vmf->pgoff >= shmem_falloc->next)
> -			shmem_falloc = NULL;
> -		spin_unlock(&inode->i_lock);
> -		/*
> -		 * i_lock has protected us from taking shmem_falloc seriously
> -		 * once return from shmem_fallocate() went back up that stack.
> -		 * i_lock does not serialize with i_mutex at all, but it does
> -		 * not matter if sometimes we wait unnecessarily, or sometimes
> -		 * miss out on waiting: we just need to make those cases rare.
> -		 */
> -		if (shmem_falloc) {
> +		if (shmem_falloc &&
> +		    shmem_falloc->waitq &&
> +		    vmf->pgoff >= shmem_falloc->start &&
> +		    vmf->pgoff < shmem_falloc->next) {
> +			wait_queue_head_t *shmem_falloc_waitq;
> +			DEFINE_WAIT(shmem_fault_wait);
> +
> +			ret = VM_FAULT_NOPAGE;
>  			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
>  			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> +				/* It's polite to up mmap_sem if we can */
>  				up_read(&vma->vm_mm->mmap_sem);
> -				mutex_lock(&inode->i_mutex);
> -				mutex_unlock(&inode->i_mutex);
> -				return VM_FAULT_RETRY;
> +				ret = VM_FAULT_RETRY;
>  			}
> -			/* cond_resched? Leave that to GUP or return to user */
> -			return VM_FAULT_NOPAGE;
> +
> +			shmem_falloc_waitq = shmem_falloc->waitq;
> +			prepare_to_wait(shmem_falloc_waitq, &shmem_fault_wait,
> +					TASK_KILLABLE);
> +			spin_unlock(&inode->i_lock);
> +			schedule();
> +
> +			/*
> +			 * shmem_falloc_waitq points into the shmem_fallocate()
> +			 * stack of the hole-punching task: shmem_falloc_waitq
> +			 * is usually invalid by the time we reach here, but
> +			 * finish_wait() does not dereference it in that case;
> +			 * though i_lock needed lest racing with wake_up_all().
> +			 */
> +			spin_lock(&inode->i_lock);
> +			finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
> +			spin_unlock(&inode->i_lock);
> +			return ret;
>  		}
> +		spin_unlock(&inode->i_lock);
>  	}
>  
>  	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
> @@ -1774,13 +1794,13 @@ static long shmem_fallocate(struct file
>  
>  	mutex_lock(&inode->i_mutex);
>  
> -	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
> -
>  	if (mode & FALLOC_FL_PUNCH_HOLE) {
>  		struct address_space *mapping = file->f_mapping;
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
> +		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
>  
> +		shmem_falloc.waitq = &shmem_falloc_waitq;
>  		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
>  		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
>  		spin_lock(&inode->i_lock);
> @@ -1792,8 +1812,13 @@ static long shmem_fallocate(struct file
>  					    1 + unmap_end - unmap_start, 0);
>  		shmem_truncate_range(inode, offset, offset + len - 1);
>  		/* No need to unmap again: hole-punching leaves COWed pages */
> +
> +		spin_lock(&inode->i_lock);
> +		inode->i_private = NULL;
> +		wake_up_all(&shmem_falloc_waitq);
> +		spin_unlock(&inode->i_lock);
>  		error = 0;
> -		goto undone;
> +		goto out;
>  	}
>  
>  	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
> @@ -1809,6 +1834,7 @@ static long shmem_fallocate(struct file
>  		goto out;
>  	}
>  
> +	shmem_falloc.waitq = NULL;
>  	shmem_falloc.start = start;
>  	shmem_falloc.next  = start;
>  	shmem_falloc.nr_falloced = 0;
> 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 2/2] shmem: fix splicing from a hole while it's punched
  2014-07-15 10:33   ` Hugh Dickins
@ 2014-07-25 14:33     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-07-25 14:33 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Sasha Levin, Vlastimil Babka,
	Konstantin Khlebnikov, Johannes Weiner, Michel Lespinasse,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On Tue 15-07-14 03:33:02, Hugh Dickins wrote:
> shmem_fault() is the actual culprit in trinity's hole-punch starvation,
> and the most significant cause of such problems: since a page faulted is
> one that then appears page_mapped(), needing unmap_mapping_range() and
> i_mmap_mutex to be unmapped again.
> 
> But it is not the only way in which a page can be brought into a hole in
> the radix_tree while that hole is being punched; and Vlastimil's testing
> implies that if enough other processors are busy filling in the hole,
> then shmem_undo_range() can be kept from completing indefinitely.
> 
> shmem_file_splice_read() is the main other user of SGP_CACHE, which
> can instantiate shmem pagecache pages in the read-only case (without
> holding i_mutex, so perhaps concurrently with a hole-punch).  Probably
> it's silly not to use SGP_READ already (using the ZERO_PAGE for holes):
> which ought to be safe, but might bring surprises - not a change to be
> rushed.
> 
> shmem_read_mapping_page_gfp() is an internal interface used by
> drivers/gpu/drm GEM (and next by uprobes): it should be okay.  And
> shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
> called internally by the kernel (perhaps for a stacking filesystem,
> which might rely on holes to be reserved): it's unclear whether it
> could be provoked to keep hole-punch busy or not.
> 
> We could apply the same umbrella as now used in shmem_fault() to
> shmem_file_splice_read() and the others; but it looks ugly, and use
> over a range raises questions - should it actually be per page?  can
> these get starved themselves?
> 
> The origin of this part of the problem is my v3.1 commit d0823576bf4b
> ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
> into shmem.c.  It seemed like a nice idea at the time, to ensure
> (barring RCU lookup fuzziness) that there's an instant when the entire
> hole is empty; but the indefinitely repeated scans to ensure that make
> it vulnerable.
> 
> Revert that "enhancement" to hole-punch from shmem_undo_range(), but
> retain the unproblematic rescanning when it's truncating; add a couple
> of comments there.
> 
> Remove the "indices[0] >= end" test: that is now handled satisfactorily
> by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
> to be worth avoiding here.
> 
> But if we do not always loop indefinitely, we do need to handle the case
> of swap swizzled back to page before shmem_free_swap() gets it: add a
> retry for that case, as suggested by Konstantin Khlebnikov; and for the
> case of page swizzled back to swap, as suggested by Johannes Weiner.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: <stable@vger.kernel.org>	[3.1+]

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
> Please replace mmotm's
> shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> by this patch: which is substantially the same as that, but with
> updated commit comment, and a page retry as indicated by Hannes.
> 
>  mm/shmem.c |   24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> --- 3.16-rc5+/mm/shmem.c	2014-07-14 20:34:28.196153828 -0700
> +++ 3.16-rc5++/mm/shmem.c	2014-07-14 20:35:14.156154916 -0700
> @@ -468,23 +468,20 @@ static void shmem_undo_range(struct inod
>  		return;
>  
>  	index = start;
> -	for ( ; ; ) {
> +	while (index < end) {
>  		cond_resched();
>  
>  		pvec.nr = find_get_entries(mapping, index,
>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>  				pvec.pages, indices);
>  		if (!pvec.nr) {
> -			if (index == start || unfalloc)
> +			/* If all gone or hole-punch or unfalloc, we're done */
> +			if (index == start || end != -1)
>  				break;
> +			/* But if truncating, restart to make sure all gone */
>  			index = start;
>  			continue;
>  		}
> -		if ((index == start || unfalloc) && indices[0] >= end) {
> -			pagevec_remove_exceptionals(&pvec);
> -			pagevec_release(&pvec);
> -			break;
> -		}
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
> @@ -496,8 +493,12 @@ static void shmem_undo_range(struct inod
>  			if (radix_tree_exceptional_entry(page)) {
>  				if (unfalloc)
>  					continue;
> -				nr_swaps_freed += !shmem_free_swap(mapping,
> -								index, page);
> +				if (shmem_free_swap(mapping, index, page)) {
> +					/* Swap was replaced by page: retry */
> +					index--;
> +					break;
> +				}
> +				nr_swaps_freed++;
>  				continue;
>  			}
>  
> @@ -506,6 +507,11 @@ static void shmem_undo_range(struct inod
>  				if (page->mapping == mapping) {
>  					VM_BUG_ON_PAGE(PageWriteback(page), page);
>  					truncate_inode_page(mapping, page);
> +				} else {
> +					/* Page was replaced by swap: retry */
> +					unlock_page(page);
> +					index--;
> +					break;
>  				}
>  			}
>  			unlock_page(page);
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2014-07-25 14:33 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-15 10:28 [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3 Hugh Dickins
2014-07-15 10:28 ` Hugh Dickins
2014-07-15 10:31 ` [PATCH 1/2] shmem: fix faulting into a hole, not taking i_mutex Hugh Dickins
2014-07-15 10:31   ` Hugh Dickins
2014-07-15 16:07   ` Vlastimil Babka
2014-07-15 16:07     ` Vlastimil Babka
2014-07-15 19:26     ` Hugh Dickins
2014-07-15 19:26       ` Hugh Dickins
2014-07-16  7:18       ` Vlastimil Babka
2014-07-16  7:18         ` Vlastimil Babka
2014-07-25 14:25   ` Michal Hocko
2014-07-25 14:25     ` Michal Hocko
2014-07-15 10:33 ` [PATCH 2/2] shmem: fix splicing from a hole while it's punched Hugh Dickins
2014-07-15 10:33   ` Hugh Dickins
2014-07-25 14:33   ` Michal Hocko
2014-07-25 14:33     ` Michal Hocko
2014-07-17 16:10 ` [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3 Vlastimil Babka
2014-07-17 16:10   ` Vlastimil Babka
2014-07-17 16:12   ` Sasha Levin
2014-07-17 16:12     ` Sasha Levin
2014-07-18 10:44     ` Sasha Levin
2014-07-18 10:44       ` Sasha Levin
2014-07-19 23:44       ` Hugh Dickins
2014-07-19 23:44         ` Hugh Dickins
2014-07-22  3:24         ` Sasha Levin
2014-07-22  3:24           ` Sasha Levin
2014-07-22  8:07           ` Hugh Dickins
2014-07-22  8:07             ` Hugh Dickins
2014-07-22 10:06             ` Vlastimil Babka
2014-07-22 10:06               ` Vlastimil Babka
2014-07-22 12:09               ` Vlastimil Babka
2014-07-22 12:09                 ` Vlastimil Babka
2014-07-22 18:42                 ` Hugh Dickins
2014-07-22 18:42                   ` Hugh Dickins
2014-07-22 23:19             ` Sasha Levin
2014-07-22 23:19               ` Sasha Levin
2014-07-22 23:58               ` Hugh Dickins
2014-07-22 23:58                 ` Hugh Dickins
2014-07-17 23:34   ` Hugh Dickins
2014-07-17 23:34     ` Hugh Dickins
2014-07-18  8:05     ` Vlastimil Babka
2014-07-18  8:05       ` Vlastimil Babka
