* [PATCH 1/3] Revert "shmem: fix faulting into a hole while it's punched"
@ 2014-07-02 19:09 ` Hugh Dickins
  0 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2014-07-02 19:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

This reverts commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb.

(a) It was buggy: Sasha sent a lockdep report to remind us that grabbing
i_mutex in the fault path is a no-no (write syscall may already hold
i_mutex while faulting user buffer), no matter that the patch took care
to drop mmap_sem first.

(b) It may be thought too elaborate: see the diffstat.

(c) Vlastimil proposed a preferred approach, better for backporting to
v3.1..v3.4, which had madvise hole-punch support before the fallocate
infrastructure used in that commit - backporting being required once
the issue fixed was tagged with CVE-2014-4171.

(d) Hugh noticed a further pessimization fix needed in the same area.
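
To illustrate (a), a minimal userspace sketch (illustrative only; the file
name and size are arbitrary) of why the fault path must not take i_mutex:
write() holds i_mutex while copying from the user buffer, and if that buffer
is a not-yet-faulted mapping of the same file, the copy faults with i_mutex
already held.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/a", O_CREAT | O_RDWR, 0600);
	char *buf;

	ftruncate(fd, 1 << 20);
	/* map the very file we are about to write() to */
	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/*
	 * write() takes i_mutex, then faults "buf" during the user copy;
	 * a fault handler that itself took i_mutex would deadlock here.
	 */
	write(fd, buf, 1 << 20);
	return 0;
}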

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
---

 mm/shmem.c |   56 +++------------------------------------------------
 1 file changed, 4 insertions(+), 52 deletions(-)

--- 3.16-rc3/mm/shmem.c	2014-06-29 15:22:10.592003936 -0700
+++ linux/mm/shmem.c	2014-07-02 03:31:12.956546569 -0700
@@ -80,12 +80,11 @@ static struct vfsmount *shm_mnt;
 #define SHORT_SYMLINK_LEN 128
 
 /*
- * shmem_fallocate communicates with shmem_fault or shmem_writepage via
- * inode->i_private (with i_mutex making sure that it has only one user at
- * a time): we would prefer not to enlarge the shmem inode just for that.
+ * shmem_fallocate and shmem_writepage communicate via inode->i_private
+ * (with i_mutex making sure that it has only one user at a time):
+ * we would prefer not to enlarge the shmem inode just for that.
  */
 struct shmem_falloc {
-	int	mode;		/* FALLOC_FL mode currently operating */
 	pgoff_t start;		/* start of range currently being fallocated */
 	pgoff_t next;		/* the next page offset to be fallocated */
 	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
@@ -760,7 +759,6 @@ static int shmem_writepage(struct page *
 			spin_lock(&inode->i_lock);
 			shmem_falloc = inode->i_private;
 			if (shmem_falloc &&
-			    !shmem_falloc->mode &&
 			    index >= shmem_falloc->start &&
 			    index < shmem_falloc->next)
 				shmem_falloc->nr_unswapped++;
@@ -1235,44 +1233,6 @@ static int shmem_fault(struct vm_area_st
 	int error;
 	int ret = VM_FAULT_LOCKED;
 
-	/*
-	 * Trinity finds that probing a hole which tmpfs is punching can
-	 * prevent the hole-punch from ever completing: which in turn
-	 * locks writers out with its hold on i_mutex.  So refrain from
-	 * faulting pages into the hole while it's being punched, and
-	 * wait on i_mutex to be released if vmf->flags permits.
-	 */
-	if (unlikely(inode->i_private)) {
-		struct shmem_falloc *shmem_falloc;
-
-		spin_lock(&inode->i_lock);
-		shmem_falloc = inode->i_private;
-		if (!shmem_falloc ||
-		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
-		    vmf->pgoff < shmem_falloc->start ||
-		    vmf->pgoff >= shmem_falloc->next)
-			shmem_falloc = NULL;
-		spin_unlock(&inode->i_lock);
-		/*
-		 * i_lock has protected us from taking shmem_falloc seriously
-		 * once return from shmem_fallocate() went back up that stack.
-		 * i_lock does not serialize with i_mutex at all, but it does
-		 * not matter if sometimes we wait unnecessarily, or sometimes
-		 * miss out on waiting: we just need to make those cases rare.
-		 */
-		if (shmem_falloc) {
-			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
-			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
-				up_read(&vma->vm_mm->mmap_sem);
-				mutex_lock(&inode->i_mutex);
-				mutex_unlock(&inode->i_mutex);
-				return VM_FAULT_RETRY;
-			}
-			/* cond_resched? Leave that to GUP or return to user */
-			return VM_FAULT_NOPAGE;
-		}
-	}
-
 	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
 	if (error)
 		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
@@ -1769,26 +1729,18 @@ static long shmem_fallocate(struct file
 
 	mutex_lock(&inode->i_mutex);
 
-	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
-
 	if (mode & FALLOC_FL_PUNCH_HOLE) {
 		struct address_space *mapping = file->f_mapping;
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
-		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
-		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
-		spin_lock(&inode->i_lock);
-		inode->i_private = &shmem_falloc;
-		spin_unlock(&inode->i_lock);
-
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
 		shmem_truncate_range(inode, offset, offset + len - 1);
 		/* No need to unmap again: hole-punching leaves COWed pages */
 		error = 0;
-		goto undone;
+		goto out;
 	}
 
 	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/3] shmem: fix faulting into a hole while it's punched, take 2
  2014-07-02 19:09 ` Hugh Dickins
@ 2014-07-02 19:11   ` Hugh Dickins
  -1 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2014-07-02 19:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the (killable) reader stops; with the puncher's
hold on i_mutex locking out all other writers until it can complete.
This issue was tagged with CVE-2014-4171.
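
As a point of reference, a minimal sketch of the kind of concurrent
punch-versus-fault access that triggers it (not the Trinity test itself;
file name and sizes are arbitrary choices):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ (64UL << 20)

int main(void)
{
	int fd = open("/dev/shm/punch", O_CREAT | O_RDWR, 0600);
	char *map;
	size_t off;

	ftruncate(fd, SZ);
	map = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (fork() == 0) {
		/* child: keep faulting pages back into the hole */
		for (;;)
			for (off = 0; off < SZ; off += 4096)
				map[off] = 1;
	}
	/* parent: keep punching the same range, taking i_mutex each time */
	for (;;)
		fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			  0, SZ);
}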

It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.  It is not a problem in truncation,
because there the SIGBUS beyond i_size stops pages from being appended.

The origin of this problem is my v3.1 commit d0823576bf4b ("mm: pincer
in truncate_inode_pages_range"), once it was duplicated into shmem.c.
It seemed like a nice idea at the time, to ensure (barring RCU lookup
fuzziness) that there's an instant when the entire hole is empty; but 
the indefinitely repeated scans to ensure that make it vulnerable.

Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.

Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.

But if we do not always loop indefinitely, we do need to handle the
case of swap swizzled back to page before shmem_free_swap() gets it:
add a retry for that case, as suggested by Konstantin Khlebnikov.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-and-Tested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: stable@vger.kernel.org # v3.1+
---

 mm/shmem.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- 3.16-rc3+/mm/shmem.c	2014-07-02 03:31:12.956546569 -0700
+++ linux/mm/shmem.c	2014-07-02 03:34:13.172550852 -0700
@@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
 		return;
 
 	index = start;
-	for ( ; ; ) {
+	while (index < end) {
 		cond_resched();
 
 		pvec.nr = find_get_entries(mapping, index,
 				min(end - index, (pgoff_t)PAGEVEC_SIZE),
 				pvec.pages, indices);
 		if (!pvec.nr) {
-			if (index == start || unfalloc)
+			/* If all gone or hole-punch or unfalloc, we're done */
+			if (index == start || end != -1)
 				break;
+			/* But if truncating, restart to make sure all gone */
 			index = start;
 			continue;
 		}
-		if ((index == start || unfalloc) && indices[0] >= end) {
-			pagevec_remove_exceptionals(&pvec);
-			pagevec_release(&pvec);
-			break;
-		}
 		mem_cgroup_uncharge_start();
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
@@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
 			if (radix_tree_exceptional_entry(page)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-								index, page);
+				if (shmem_free_swap(mapping, index, page)) {
+					/* Swap was replaced by page: retry */
+					index--;
+					break;
+				}
+				nr_swaps_freed++;
 				continue;
 			}
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/3] mm/fs: fix pessimization in hole-punching pagecache
  2014-07-02 19:09 ` Hugh Dickins
@ 2014-07-02 19:13   ` Hugh Dickins
  -1 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2014-07-02 19:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Sasha Levin, Vlastimil Babka, Konstantin Khlebnikov,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

I wanted to revert my v3.1 commit d0823576bf4b ("mm: pincer in
truncate_inode_pages_range"), to keep truncate_inode_pages_range()
in synch with shmem_undo_range(); but have stepped back - a change
to hole-punching in truncate_inode_pages_range() is a change to
hole-punching in every filesystem (except tmpfs) that supports it.

If there's a logical proof why no filesystem can depend for its own
correctness on the pincer guarantee in truncate_inode_pages_range() -
an instant when the entire hole is removed from pagecache - then let's
revisit later.  But the evidence is that only tmpfs suffered from the
livelock, and we have no intention of extending hole-punch to ramfs.
So for now just add a few comments (to match or differ from those in
shmem_undo_range()), and fix one silliness noticed in d0823576bf4b...

Its "index == start" addition to the hole-punch termination test was
incomplete: it opened a way for the end condition to be missed, and the
loop go on looking through the radix_tree, all the way to end of file.
Fix that pessimization by resetting index when detected in inner loop.

Note that it's actually hard to hit this case, without the obsessive
concurrent faulting that trinity does: normally all pages are removed
in the initial trylock_page() pass, and this loop finds nothing to do.
I had to "#if 0" out the initial pass to reproduce bug and test fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
---

 mm/truncate.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

--- 3.16-rc3/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
+++ linux/mm/truncate.c	2014-07-02 03:36:05.972553533 -0700
@@ -355,14 +355,16 @@ void truncate_inode_pages_range(struct a
 	for ( ; ; ) {
 		cond_resched();
 		if (!pagevec_lookup_entries(&pvec, mapping, index,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE),
-			indices)) {
+			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
+			/* If all gone from start onwards, we're done */
 			if (index == start)
 				break;
+			/* Otherwise restart to make sure all gone */
 			index = start;
 			continue;
 		}
 		if (index == start && indices[0] >= end) {
+			/* All gone out of hole to be punched, we're done */
 			pagevec_remove_exceptionals(&pvec);
 			pagevec_release(&pvec);
 			break;
@@ -373,8 +375,11 @@ void truncate_inode_pages_range(struct a
 
 			/* We rely upon deletion not changing page->index */
 			index = indices[i];
-			if (index >= end)
+			if (index >= end) {
+				/* Restart punch to make sure all gone */
+				index = start - 1;
 				break;
+			}
 
 			if (radix_tree_exceptional_entry(page)) {
 				clear_exceptional_entry(mapping, index, page);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/3] Revert "shmem: fix faulting into a hole while it's punched"
  2014-07-02 19:09 ` Hugh Dickins
@ 2014-07-03 12:54   ` Vlastimil Babka
  -1 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2014-07-03 12:54 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Sasha Levin, Konstantin Khlebnikov, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/02/2014 09:09 PM, Hugh Dickins wrote:
> This reverts commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb.
>
> (a) It was buggy: Sasha sent a lockdep report to remind us that grabbing
> i_mutex in the fault path is a no-no (write syscall may already hold
> i_mutex while faulting user buffer), no matter that the patch took care
> to drop mmap_sem first.
>
> (b) It may be thought too elaborate: see the diffstat.
>
> (c) Vlastimil proposed a preferred approach, better for backporting to
> v3.1..v3.4, which had madvise hole-punch support before the fallocate
> infrastructure used in that commit - backporting being required once
> the issue fixed was tagged with CVE-2014-4171.
>
> (d) Hugh noticed a further pessimization fix needed in the same area.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> ---
>
>   mm/shmem.c |   56 +++------------------------------------------------
>   1 file changed, 4 insertions(+), 52 deletions(-)
>
> --- 3.16-rc3/mm/shmem.c	2014-06-29 15:22:10.592003936 -0700
> +++ linux/mm/shmem.c	2014-07-02 03:31:12.956546569 -0700
> @@ -80,12 +80,11 @@ static struct vfsmount *shm_mnt;
>   #define SHORT_SYMLINK_LEN 128
>
>   /*
> - * shmem_fallocate communicates with shmem_fault or shmem_writepage via
> - * inode->i_private (with i_mutex making sure that it has only one user at
> - * a time): we would prefer not to enlarge the shmem inode just for that.
> + * shmem_fallocate and shmem_writepage communicate via inode->i_private
> + * (with i_mutex making sure that it has only one user at a time):
> + * we would prefer not to enlarge the shmem inode just for that.
>    */
>   struct shmem_falloc {
> -	int	mode;		/* FALLOC_FL mode currently operating */
>   	pgoff_t start;		/* start of range currently being fallocated */
>   	pgoff_t next;		/* the next page offset to be fallocated */
>   	pgoff_t nr_falloced;	/* how many new pages have been fallocated */
> @@ -760,7 +759,6 @@ static int shmem_writepage(struct page *
>   			spin_lock(&inode->i_lock);
>   			shmem_falloc = inode->i_private;
>   			if (shmem_falloc &&
> -			    !shmem_falloc->mode &&
>   			    index >= shmem_falloc->start &&
>   			    index < shmem_falloc->next)
>   				shmem_falloc->nr_unswapped++;
> @@ -1235,44 +1233,6 @@ static int shmem_fault(struct vm_area_st
>   	int error;
>   	int ret = VM_FAULT_LOCKED;
>
> -	/*
> -	 * Trinity finds that probing a hole which tmpfs is punching can
> -	 * prevent the hole-punch from ever completing: which in turn
> -	 * locks writers out with its hold on i_mutex.  So refrain from
> -	 * faulting pages into the hole while it's being punched, and
> -	 * wait on i_mutex to be released if vmf->flags permits.
> -	 */
> -	if (unlikely(inode->i_private)) {
> -		struct shmem_falloc *shmem_falloc;
> -
> -		spin_lock(&inode->i_lock);
> -		shmem_falloc = inode->i_private;
> -		if (!shmem_falloc ||
> -		    shmem_falloc->mode != FALLOC_FL_PUNCH_HOLE ||
> -		    vmf->pgoff < shmem_falloc->start ||
> -		    vmf->pgoff >= shmem_falloc->next)
> -			shmem_falloc = NULL;
> -		spin_unlock(&inode->i_lock);
> -		/*
> -		 * i_lock has protected us from taking shmem_falloc seriously
> -		 * once return from shmem_fallocate() went back up that stack.
> -		 * i_lock does not serialize with i_mutex at all, but it does
> -		 * not matter if sometimes we wait unnecessarily, or sometimes
> -		 * miss out on waiting: we just need to make those cases rare.
> -		 */
> -		if (shmem_falloc) {
> -			if ((vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
> -			   !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
> -				up_read(&vma->vm_mm->mmap_sem);
> -				mutex_lock(&inode->i_mutex);
> -				mutex_unlock(&inode->i_mutex);
> -				return VM_FAULT_RETRY;
> -			}
> -			/* cond_resched? Leave that to GUP or return to user */
> -			return VM_FAULT_NOPAGE;
> -		}
> -	}
> -
>   	error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
>   	if (error)
>   		return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
> @@ -1769,26 +1729,18 @@ static long shmem_fallocate(struct file
>
>   	mutex_lock(&inode->i_mutex);
>
> -	shmem_falloc.mode = mode & ~FALLOC_FL_KEEP_SIZE;
> -
>   	if (mode & FALLOC_FL_PUNCH_HOLE) {
>   		struct address_space *mapping = file->f_mapping;
>   		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>   		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>
> -		shmem_falloc.start = unmap_start >> PAGE_SHIFT;
> -		shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT;
> -		spin_lock(&inode->i_lock);
> -		inode->i_private = &shmem_falloc;
> -		spin_unlock(&inode->i_lock);
> -
>   		if ((u64)unmap_end > (u64)unmap_start)
>   			unmap_mapping_range(mapping, unmap_start,
>   					    1 + unmap_end - unmap_start, 0);
>   		shmem_truncate_range(inode, offset, offset + len - 1);
>   		/* No need to unmap again: hole-punching leaves COWed pages */
>   		error = 0;
> -		goto undone;
> +		goto out;
>   	}
>
>   	/* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 3/3] mm/fs: fix pessimization in hole-punching pagecache
  2014-07-02 19:13   ` Hugh Dickins
@ 2014-07-03 13:02     ` Vlastimil Babka
  -1 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2014-07-03 13:02 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Sasha Levin, Konstantin Khlebnikov, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/02/2014 09:13 PM, Hugh Dickins wrote:
> I wanted to revert my v3.1 commit d0823576bf4b ("mm: pincer in
> truncate_inode_pages_range"), to keep truncate_inode_pages_range()
> in synch with shmem_undo_range(); but have stepped back - a change
> to hole-punching in truncate_inode_pages_range() is a change to
> hole-punching in every filesystem (except tmpfs) that supports it.
>
> If there's a logical proof why no filesystem can depend for its own
> correctness on the pincer guarantee in truncate_inode_pages_range() -
> an instant when the entire hole is removed from pagecache - then let's
> revisit later.  But the evidence is that only tmpfs suffered from the
> livelock, and we have no intention of extending hole-punch to ramfs.
> So for now just add a few comments (to match or differ from those in
> shmem_undo_range()), and fix one silliness noticed in d0823576bf4b...
>
> Its "index == start" addition to the hole-punch termination test was
> incomplete: it opened a way for the end condition to be missed, and the
> loop go on looking through the radix_tree, all the way to end of file.
> Fix that pessimization by resetting index when detected in inner loop.
>
> Note that it's actually hard to hit this case, without the obsessive
> concurrent faulting that trinity does: normally all pages are removed
> in the initial trylock_page() pass, and this loop finds nothing to do.
> I had to "#if 0" out the initial pass to reproduce bug and test fix.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Sasha Levin <sasha.levin@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> ---
>
>   mm/truncate.c |   11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
>
> --- 3.16-rc3/mm/truncate.c	2014-06-08 11:19:54.000000000 -0700
> +++ linux/mm/truncate.c	2014-07-02 03:36:05.972553533 -0700
> @@ -355,14 +355,16 @@ void truncate_inode_pages_range(struct a
>   	for ( ; ; ) {
>   		cond_resched();
>   		if (!pagevec_lookup_entries(&pvec, mapping, index,
> -			min(end - index, (pgoff_t)PAGEVEC_SIZE),
> -			indices)) {
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE), indices)) {
> +			/* If all gone from start onwards, we're done */
>   			if (index == start)
>   				break;
> +			/* Otherwise restart to make sure all gone */
>   			index = start;
>   			continue;
>   		}
>   		if (index == start && indices[0] >= end) {
> +			/* All gone out of hole to be punched, we're done */
>   			pagevec_remove_exceptionals(&pvec);
>   			pagevec_release(&pvec);
>   			break;
> @@ -373,8 +375,11 @@ void truncate_inode_pages_range(struct a
>
>   			/* We rely upon deletion not changing page->index */
>   			index = indices[i];
> -			if (index >= end)
> +			if (index >= end) {
> +				/* Restart punch to make sure all gone */
> +				index = start - 1;
>   				break;
> +			}
>
>   			if (radix_tree_exceptional_entry(page)) {
>   				clear_exceptional_entry(mapping, index, page);
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] shmem: fix faulting into a hole while it's punched, take 2
  2014-07-02 19:11   ` Hugh Dickins
@ 2014-07-03 15:37     ` Vlastimil Babka
  -1 siblings, 0 replies; 14+ messages in thread
From: Vlastimil Babka @ 2014-07-03 15:37 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton
  Cc: Sasha Levin, Konstantin Khlebnikov, Lukas Czerner, Dave Jones,
	linux-mm, linux-fsdevel, linux-kernel

On 07/02/2014 09:11 PM, Hugh Dickins wrote:
> Trinity finds that mmap access to a hole while it's punched from shmem
> can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
> from completing, until the (killable) reader stops; with the puncher's
> hold on i_mutex locking out all other writers until it can complete.
> This issue was tagged with CVE-2014-4171.
>
> It appears that the tmpfs fault path is too light in comparison with its
> hole-punching path, lacking an i_data_sem to obstruct it; but we don't
> want to slow down the common case.  It is not a problem in truncation,
> because there the SIGBUS beyond i_size stops pages from being appended.
>
> The origin of this problem is my v3.1 commit d0823576bf4b ("mm: pincer
> in truncate_inode_pages_range"), once it was duplicated into shmem.c.
> It seemed like a nice idea at the time, to ensure (barring RCU lookup
> fuzziness) that there's an instant when the entire hole is empty; but
> the indefinitely repeated scans to ensure that make it vulnerable.
>
> Revert that "enhancement" to hole-punch from shmem_undo_range(), but
> retain the unproblematic rescanning when it's truncating; add a couple
> of comments there.
>
> Remove the "indices[0] >= end" test: that is now handled satisfactorily
> by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
> to be worth avoiding here.
>
> But if we do not always loop indefinitely, we do need to handle the
> case of swap swizzled back to page before shmem_free_swap() gets it:
> add a retry for that case, as suggested by Konstantin Khlebnikov.
>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Suggested-and-Tested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Cc: Lukas Czerner <lczerner@redhat.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: stable@vger.kernel.org # v3.1+
> ---
>
>   mm/shmem.c |   19 ++++++++++---------
>   1 file changed, 10 insertions(+), 9 deletions(-)
>
> --- 3.16-rc3+/mm/shmem.c	2014-07-02 03:31:12.956546569 -0700
> +++ linux/mm/shmem.c	2014-07-02 03:34:13.172550852 -0700
> @@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
>   		return;
>
>   	index = start;
> -	for ( ; ; ) {
> +	while (index < end) {
>   		cond_resched();
>
>   		pvec.nr = find_get_entries(mapping, index,
>   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>   				pvec.pages, indices);
>   		if (!pvec.nr) {
> -			if (index == start || unfalloc)
> +			/* If all gone or hole-punch or unfalloc, we're done */
> +			if (index == start || end != -1)
>   				break;
> +			/* But if truncating, restart to make sure all gone */
>   			index = start;
>   			continue;
>   		}
> -		if ((index == start || unfalloc) && indices[0] >= end) {
> -			pagevec_remove_exceptionals(&pvec);
> -			pagevec_release(&pvec);
> -			break;
> -		}
>   		mem_cgroup_uncharge_start();
>   		for (i = 0; i < pagevec_count(&pvec); i++) {
>   			struct page *page = pvec.pages[i];
> @@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
>   			if (radix_tree_exceptional_entry(page)) {
>   				if (unfalloc)
>   					continue;
> -				nr_swaps_freed += !shmem_free_swap(mapping,
> -								index, page);
> +				if (shmem_free_swap(mapping, index, page)) {
> +					/* Swap was replaced by page: retry */
> +					index--;
> +					break;
> +				}
> +				nr_swaps_freed++;
>   				continue;

Ugh, a warning to anyone trying to backport this. This hunk can match 
both instances of the same code in the function, and I've just seen 
patch picking the wrong one.
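
For anyone attempting that backport, the shape of the ambiguity, abridged
from memory rather than the exact upstream lines: both the initial scan and
the retry pass of shmem_undo_range() carry an almost identical block, so a
context-only match can land on either instance.

	if (radix_tree_exceptional_entry(page)) {	/* first pass */
		if (unfalloc)
			continue;
		nr_swaps_freed += !shmem_free_swap(mapping,
						index, page);
		continue;
	}
	/* ... rest of the first pass and the retry lookup elided ... */
	if (radix_tree_exceptional_entry(page)) {	/* retry pass: the one
							   the hunk should hit */
		if (unfalloc)
			continue;
		nr_swaps_freed += !shmem_free_swap(mapping,
						index, page);
		continue;
	}

After applying, checking that the "Swap was replaced by page: retry" comment
ended up in the second instance is a quick sanity test.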

>   			}
>
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] shmem: fix faulting into a hole while it's punched, take 2
  2014-07-03 15:37     ` Vlastimil Babka
@ 2014-07-03 19:14       ` Hugh Dickins
  -1 siblings, 0 replies; 14+ messages in thread
From: Hugh Dickins @ 2014-07-03 19:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hugh Dickins, Andrew Morton, Sasha Levin, Konstantin Khlebnikov,
	Lukas Czerner, Dave Jones, linux-mm, linux-fsdevel, linux-kernel

On Thu, 3 Jul 2014, Vlastimil Babka wrote:
> On 07/02/2014 09:11 PM, Hugh Dickins wrote:
> > 
> > --- 3.16-rc3+/mm/shmem.c	2014-07-02 03:31:12.956546569 -0700
> > +++ linux/mm/shmem.c	2014-07-02 03:34:13.172550852 -0700
> > @@ -467,23 +467,20 @@ static void shmem_undo_range(struct inod
> >   		return;
> > 
> >   	index = start;
> > -	for ( ; ; ) {
> > +	while (index < end) {
> >   		cond_resched();
> > 
> >   		pvec.nr = find_get_entries(mapping, index,
> >   				min(end - index, (pgoff_t)PAGEVEC_SIZE),
> >   				pvec.pages, indices);
> >   		if (!pvec.nr) {
> > -			if (index == start || unfalloc)
> > +			/* If all gone or hole-punch or unfalloc, we're done
> > */
> > +			if (index == start || end != -1)
> >   				break;
> > +			/* But if truncating, restart to make sure all gone
> > */
> >   			index = start;
> >   			continue;
> >   		}
> > -		if ((index == start || unfalloc) && indices[0] >= end) {
> > -			pagevec_remove_exceptionals(&pvec);
> > -			pagevec_release(&pvec);
> > -			break;
> > -		}
> >   		mem_cgroup_uncharge_start();
> >   		for (i = 0; i < pagevec_count(&pvec); i++) {
> >   			struct page *page = pvec.pages[i];
> > @@ -495,8 +492,12 @@ static void shmem_undo_range(struct inod
> >   			if (radix_tree_exceptional_entry(page)) {
> >   				if (unfalloc)
> >   					continue;
> > -				nr_swaps_freed += !shmem_free_swap(mapping,
> > -								index, page);
> > +				if (shmem_free_swap(mapping, index, page)) {
> > +					/* Swap was replaced by page: retry
> > */
> > +					index--;
> > +					break;
> > +				}
> > +				nr_swaps_freed++;
> >   				continue;
> 
> Ugh, a warning to anyone trying to backport this. This hunk can match both
> instances of the same code in the function, and I've just seen patch picking
> the wrong one.

Thanks for the warning.

Yes, as it ends up, there are only two hunks: so if the first fails
to apply (and down the releases there may be various trivial reasons
why it would fail to apply cleanly, although easily edited by hand),
patch might very well choose the first match to apply the second hunk.

I'm expecting to have to do (or at least to check) each -stable by
hand as it comes by.  I did just check mmotm, and it came out fine.

Hugh

^ permalink raw reply	[flat|nested] 14+ messages in thread
