All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] shmem/tmpfs: three late patches
@ 2012-07-09 22:35 ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

Here's three little shmem/tmpfs patches against v3.5-rc6.
Either the first should go in before v3.5 final, or it should not go
in at all.  The second and third are independent of it: I'd like them
in v3.5, but don't have a clinching argument: see what you think.

[PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
[PATCH 2/3] shmem: fix negative rss in memcg memory.stat
[PATCH 3/3] shmem: cleanup shmem_add_to_page_cache

 mm/shmem.c |  193 +++++++++++++++------------------------------------
 1 file changed, 58 insertions(+), 135 deletions(-)

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 0/3] shmem/tmpfs: three late patches
@ 2012-07-09 22:35 ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

Here's three little shmem/tmpfs patches against v3.5-rc6.
Either the first should go in before v3.5 final, or it should not go
in at all.  The second and third are independent of it: I'd like them
in v3.5, but don't have a clinching argument: see what you think.

[PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
[PATCH 2/3] shmem: fix negative rss in memcg memory.stat
[PATCH 3/3] shmem: cleanup shmem_add_to_page_cache

 mm/shmem.c |  193 +++++++++++++++------------------------------------
 1 file changed, 58 insertions(+), 135 deletions(-)

Thanks,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-09 22:35 ` Hugh Dickins
@ 2012-07-09 22:41   ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Josef Bacik, Andi Kleen, Andreas Dilger,
	Dave Chinner, Marco Stornelli, Jeff liu, Chris Mason, linux-mm,
	linux-fsdevel, linux-kernel

Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
I believe it's correct, and it's been nice to have from rc1 to rc6;
but as the original commit said:

I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs.  This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?

Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
to the dumb generic support for v3.5.  We can always reinstate it later
if useful, and anyone needing it in a hurry can just get it out of git.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Chris Mason <chris.mason@fusionio.com>
---
But if someone protests at this reversion, of course we can drop it.

 mm/shmem.c |   94 ---------------------------------------------------
 1 file changed, 1 insertion(+), 93 deletions(-)

--- 3.5-rc6/mm/shmem.c	2012-07-07 18:20:40.635328642 -0700
+++ linux/mm/shmem.c	2012-07-07 19:20:02.986950655 -0700
@@ -1692,98 +1692,6 @@ static ssize_t shmem_file_splice_read(st
 	return error;
 }
 
-/*
- * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
- */
-static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
-				    pgoff_t index, pgoff_t end, int origin)
-{
-	struct page *page;
-	struct pagevec pvec;
-	pgoff_t indices[PAGEVEC_SIZE];
-	bool done = false;
-	int i;
-
-	pagevec_init(&pvec, 0);
-	pvec.nr = 1;		/* start small: we may be there already */
-	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-					pvec.nr, pvec.pages, indices);
-		if (!pvec.nr) {
-			if (origin == SEEK_DATA)
-				index = end;
-			break;
-		}
-		for (i = 0; i < pvec.nr; i++, index++) {
-			if (index < indices[i]) {
-				if (origin == SEEK_HOLE) {
-					done = true;
-					break;
-				}
-				index = indices[i];
-			}
-			page = pvec.pages[i];
-			if (page && !radix_tree_exceptional_entry(page)) {
-				if (!PageUptodate(page))
-					page = NULL;
-			}
-			if (index >= end ||
-			    (page && origin == SEEK_DATA) ||
-			    (!page && origin == SEEK_HOLE)) {
-				done = true;
-				break;
-			}
-		}
-		shmem_deswap_pagevec(&pvec);
-		pagevec_release(&pvec);
-		pvec.nr = PAGEVEC_SIZE;
-		cond_resched();
-	}
-	return index;
-}
-
-static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin)
-{
-	struct address_space *mapping;
-	struct inode *inode;
-	pgoff_t start, end;
-	loff_t new_offset;
-
-	if (origin != SEEK_DATA && origin != SEEK_HOLE)
-		return generic_file_llseek_size(file, offset, origin,
-							MAX_LFS_FILESIZE);
-	mapping = file->f_mapping;
-	inode = mapping->host;
-	mutex_lock(&inode->i_mutex);
-	/* We're holding i_mutex so we can access i_size directly */
-
-	if (offset < 0)
-		offset = -EINVAL;
-	else if (offset >= inode->i_size)
-		offset = -ENXIO;
-	else {
-		start = offset >> PAGE_CACHE_SHIFT;
-		end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-		new_offset = shmem_seek_hole_data(mapping, start, end, origin);
-		new_offset <<= PAGE_CACHE_SHIFT;
-		if (new_offset > offset) {
-			if (new_offset < inode->i_size)
-				offset = new_offset;
-			else if (origin == SEEK_DATA)
-				offset = -ENXIO;
-			else
-				offset = inode->i_size;
-		}
-	}
-
-	if (offset >= 0 && offset != file->f_pos) {
-		file->f_pos = offset;
-		file->f_version = 0;
-	}
-	mutex_unlock(&inode->i_mutex);
-	return offset;
-}
-
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
@@ -2787,7 +2695,7 @@ static const struct address_space_operat
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 #ifdef CONFIG_TMPFS
-	.llseek		= shmem_file_llseek,
+	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= shmem_file_aio_read,

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-09 22:41   ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Josef Bacik, Andi Kleen, Andreas Dilger,
	Dave Chinner, Marco Stornelli, Jeff liu, Chris Mason, linux-mm,
	linux-fsdevel, linux-kernel

Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
I believe it's correct, and it's been nice to have from rc1 to rc6;
but as the original commit said:

I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
would be of any use to them on tmpfs.  This code adds 92 lines and 752
bytes on x86_64 - is that bloat or worthwhile?

Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
to the dumb generic support for v3.5.  We can always reinstate it later
if useful, and anyone needing it in a hurry can just get it out of git.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Josef Bacik <josef@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Marco Stornelli <marco.stornelli@gmail.com>
Cc: Jeff liu <jeff.liu@oracle.com>
Cc: Chris Mason <chris.mason@fusionio.com>
---
But if someone protests at this reversion, of course we can drop it.

 mm/shmem.c |   94 ---------------------------------------------------
 1 file changed, 1 insertion(+), 93 deletions(-)

--- 3.5-rc6/mm/shmem.c	2012-07-07 18:20:40.635328642 -0700
+++ linux/mm/shmem.c	2012-07-07 19:20:02.986950655 -0700
@@ -1692,98 +1692,6 @@ static ssize_t shmem_file_splice_read(st
 	return error;
 }
 
-/*
- * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
- */
-static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
-				    pgoff_t index, pgoff_t end, int origin)
-{
-	struct page *page;
-	struct pagevec pvec;
-	pgoff_t indices[PAGEVEC_SIZE];
-	bool done = false;
-	int i;
-
-	pagevec_init(&pvec, 0);
-	pvec.nr = 1;		/* start small: we may be there already */
-	while (!done) {
-		pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
-					pvec.nr, pvec.pages, indices);
-		if (!pvec.nr) {
-			if (origin == SEEK_DATA)
-				index = end;
-			break;
-		}
-		for (i = 0; i < pvec.nr; i++, index++) {
-			if (index < indices[i]) {
-				if (origin == SEEK_HOLE) {
-					done = true;
-					break;
-				}
-				index = indices[i];
-			}
-			page = pvec.pages[i];
-			if (page && !radix_tree_exceptional_entry(page)) {
-				if (!PageUptodate(page))
-					page = NULL;
-			}
-			if (index >= end ||
-			    (page && origin == SEEK_DATA) ||
-			    (!page && origin == SEEK_HOLE)) {
-				done = true;
-				break;
-			}
-		}
-		shmem_deswap_pagevec(&pvec);
-		pagevec_release(&pvec);
-		pvec.nr = PAGEVEC_SIZE;
-		cond_resched();
-	}
-	return index;
-}
-
-static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin)
-{
-	struct address_space *mapping;
-	struct inode *inode;
-	pgoff_t start, end;
-	loff_t new_offset;
-
-	if (origin != SEEK_DATA && origin != SEEK_HOLE)
-		return generic_file_llseek_size(file, offset, origin,
-							MAX_LFS_FILESIZE);
-	mapping = file->f_mapping;
-	inode = mapping->host;
-	mutex_lock(&inode->i_mutex);
-	/* We're holding i_mutex so we can access i_size directly */
-
-	if (offset < 0)
-		offset = -EINVAL;
-	else if (offset >= inode->i_size)
-		offset = -ENXIO;
-	else {
-		start = offset >> PAGE_CACHE_SHIFT;
-		end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
-		new_offset = shmem_seek_hole_data(mapping, start, end, origin);
-		new_offset <<= PAGE_CACHE_SHIFT;
-		if (new_offset > offset) {
-			if (new_offset < inode->i_size)
-				offset = new_offset;
-			else if (origin == SEEK_DATA)
-				offset = -ENXIO;
-			else
-				offset = inode->i_size;
-		}
-	}
-
-	if (offset >= 0 && offset != file->f_pos) {
-		file->f_pos = offset;
-		file->f_version = 0;
-	}
-	mutex_unlock(&inode->i_mutex);
-	return offset;
-}
-
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
@@ -2787,7 +2695,7 @@ static const struct address_space_operat
 static const struct file_operations shmem_file_operations = {
 	.mmap		= shmem_mmap,
 #ifdef CONFIG_TMPFS
-	.llseek		= shmem_file_llseek,
+	.llseek		= generic_file_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
 	.aio_read	= shmem_file_aio_read,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
  2012-07-09 22:35 ` Hugh Dickins
@ 2012-07-09 22:44   ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

When adding the page_private checks before calling shmem_replace_page(),
I did realize that there is a further race, but thought it too unlikely
to need a hurried fix.

But independently I've been chasing why a mem cgroup's memory.stat
sometimes shows negative rss after all tasks have gone: I expected it
to be a stats gathering bug, but actually it's shmem swapping's fault.

It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
the page may have been removed from swapcache before getting the lock; 
or it may have been freed and reused and be back in swapcache; and it
can even be using the same swap location as before (page_private same).

The swapoff case is already secure against this (swap cannot be reused
until the whole area has been swapped off, and a new swapped on); and
shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
for the expected radix_tree entry - but a little too late.

By that time, we might have already decided to shmem_replace_page():
I don't know of a problem from that, but I'd feel more at ease not to
do so spuriously.  And we have already done mem_cgroup_cache_charge(),
on perhaps the wrong mem cgroup: and this charge is not then undone on
the error path, because PageSwapCache ends up preventing that.

It's this last case which causes the occasional negative rss in
memory.stat: the page is charged here as cache, but (sometimes) found
to be anon when eventually it's uncharged - and in between, it's an
undeserved charge on the wrong memcg.

Fix this by adding an earlier check on the radix_tree entry: it's
inelegant to descend the tree twice, but swapping is not the fast path,
and a better solution would need a pair (try+commit) of memcg calls,
and a rework of shmem_replace_page() to keep out of the swapcache.

We can use the added shmem_confirm_swap() function to replace the
find_get_page+page_cache_release we were already doing on the error
path.  And add a comment on that -EEXIST: it seems a peculiar errno
to be using, but originates from its use in radix_tree_insert().

[It can be surprising to see positive rss left in a memcg's memory.stat
after all tasks have gone, since it is supposed to count anonymous but
not shmem.  Aside from sharing anon pages via fork with a task in some
other memcg, it often happens after swapping: because a swap page can't
be freed while under writeback, nor while locked.  So it's not an error,
and these residual pages are easily freed once pressure demands.]

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
I'd rather like this to go into v3.5, but it is late, and I don't have
a very strong argument for it: as you prefer.  And I've not marked it
for stable, since the patch won't apply to v3.4 as is; but I'd happily
supply a patch for v3.1 onwards if asked.

 mm/shmem.c |   41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

--- 3.5-rc6/mm/shmem.c	2012-07-07 19:20:02.986950655 -0700
+++ linux/mm/shmem.c	2012-07-07 19:20:52.026952048 -0700
@@ -264,6 +264,24 @@ static int shmem_radix_tree_replace(stru
 }
 
 /*
+ * Sometimes, before we decide whether to proceed or to fail, we must check
+ * that an entry was not already brought back from swap by a racing thread.
+ *
+ * Checking page is not enough: by the time a SwapCache page is locked, it
+ * might be reused, and again be SwapCache, using the same swap as before.
+ */
+static bool shmem_confirm_swap(struct address_space *mapping,
+			       pgoff_t index, swp_entry_t swap)
+{
+	void *item;
+
+	rcu_read_lock();
+	item = radix_tree_lookup(&mapping->page_tree, index);
+	rcu_read_unlock();
+	return item == swp_to_radix_entry(swap);
+}
+
+/*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
 static int shmem_add_to_page_cache(struct page *page,
@@ -1124,9 +1142,9 @@ repeat:
 		/* We have to do this with page locked to prevent races */
 		lock_page(page);
 		if (!PageSwapCache(page) || page_private(page) != swap.val ||
-		    page->mapping) {
+		    !shmem_confirm_swap(mapping, index, swap)) {
 			error = -EEXIST;	/* try again */
-			goto failed;
+			goto unlock;
 		}
 		if (!PageUptodate(page)) {
 			error = -EIO;
@@ -1142,9 +1160,12 @@ repeat:
 
 		error = mem_cgroup_cache_charge(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-		if (!error)
+		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
+			/* We already confirmed swap, and make no allocation */
+			VM_BUG_ON(error);
+		}
 		if (error)
 			goto failed;
 
@@ -1245,14 +1266,10 @@ decused:
 unacct:
 	shmem_unacct_blocks(info->flags, 1);
 failed:
-	if (swap.val && error != -EINVAL) {
-		struct page *test = find_get_page(mapping, index);
-		if (test && !radix_tree_exceptional_entry(test))
-			page_cache_release(test);
-		/* Have another try if the entry has changed */
-		if (test != swp_to_radix_entry(swap))
-			error = -EEXIST;
-	}
+	if (swap.val && error != -EINVAL &&
+	    !shmem_confirm_swap(mapping, index, swap))
+		error = -EEXIST;
+unlock:
 	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
@@ -1264,7 +1281,7 @@ failed:
 		spin_unlock(&info->lock);
 		goto repeat;
 	}
-	if (error == -EEXIST)
+	if (error == -EEXIST)	/* from above or from radix_tree_insert */
 		goto repeat;
 	return error;
 }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
@ 2012-07-09 22:44   ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

When adding the page_private checks before calling shmem_replace_page(),
I did realize that there is a further race, but thought it too unlikely
to need a hurried fix.

But independently I've been chasing why a mem cgroup's memory.stat
sometimes shows negative rss after all tasks have gone: I expected it
to be a stats gathering bug, but actually it's shmem swapping's fault.

It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
the page may have been removed from swapcache before getting the lock; 
or it may have been freed and reused and be back in swapcache; and it
can even be using the same swap location as before (page_private same).

The swapoff case is already secure against this (swap cannot be reused
until the whole area has been swapped off, and a new swapped on); and
shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
for the expected radix_tree entry - but a little too late.

By that time, we might have already decided to shmem_replace_page():
I don't know of a problem from that, but I'd feel more at ease not to
do so spuriously.  And we have already done mem_cgroup_cache_charge(),
on perhaps the wrong mem cgroup: and this charge is not then undone on
the error path, because PageSwapCache ends up preventing that.

It's this last case which causes the occasional negative rss in
memory.stat: the page is charged here as cache, but (sometimes) found
to be anon when eventually it's uncharged - and in between, it's an
undeserved charge on the wrong memcg.

Fix this by adding an earlier check on the radix_tree entry: it's
inelegant to descend the tree twice, but swapping is not the fast path,
and a better solution would need a pair (try+commit) of memcg calls,
and a rework of shmem_replace_page() to keep out of the swapcache.

We can use the added shmem_confirm_swap() function to replace the
find_get_page+page_cache_release we were already doing on the error
path.  And add a comment on that -EEXIST: it seems a peculiar errno
to be using, but originates from its use in radix_tree_insert().

[It can be surprising to see positive rss left in a memcg's memory.stat
after all tasks have gone, since it is supposed to count anonymous but
not shmem.  Aside from sharing anon pages via fork with a task in some
other memcg, it often happens after swapping: because a swap page can't
be freed while under writeback, nor while locked.  So it's not an error,
and these residual pages are easily freed once pressure demands.]

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
I'd rather like this to go into v3.5, but it is late, and I don't have
a very strong argument for it: as you prefer.  And I've not marked it
for stable, since the patch won't apply to v3.4 as is; but I'd happily
supply a patch for v3.1 onwards if asked.

 mm/shmem.c |   41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

--- 3.5-rc6/mm/shmem.c	2012-07-07 19:20:02.986950655 -0700
+++ linux/mm/shmem.c	2012-07-07 19:20:52.026952048 -0700
@@ -264,6 +264,24 @@ static int shmem_radix_tree_replace(stru
 }
 
 /*
+ * Sometimes, before we decide whether to proceed or to fail, we must check
+ * that an entry was not already brought back from swap by a racing thread.
+ *
+ * Checking page is not enough: by the time a SwapCache page is locked, it
+ * might be reused, and again be SwapCache, using the same swap as before.
+ */
+static bool shmem_confirm_swap(struct address_space *mapping,
+			       pgoff_t index, swp_entry_t swap)
+{
+	void *item;
+
+	rcu_read_lock();
+	item = radix_tree_lookup(&mapping->page_tree, index);
+	rcu_read_unlock();
+	return item == swp_to_radix_entry(swap);
+}
+
+/*
  * Like add_to_page_cache_locked, but error if expected item has gone.
  */
 static int shmem_add_to_page_cache(struct page *page,
@@ -1124,9 +1142,9 @@ repeat:
 		/* We have to do this with page locked to prevent races */
 		lock_page(page);
 		if (!PageSwapCache(page) || page_private(page) != swap.val ||
-		    page->mapping) {
+		    !shmem_confirm_swap(mapping, index, swap)) {
 			error = -EEXIST;	/* try again */
-			goto failed;
+			goto unlock;
 		}
 		if (!PageUptodate(page)) {
 			error = -EIO;
@@ -1142,9 +1160,12 @@ repeat:
 
 		error = mem_cgroup_cache_charge(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-		if (!error)
+		if (!error) {
 			error = shmem_add_to_page_cache(page, mapping, index,
 						gfp, swp_to_radix_entry(swap));
+			/* We already confirmed swap, and make no allocation */
+			VM_BUG_ON(error);
+		}
 		if (error)
 			goto failed;
 
@@ -1245,14 +1266,10 @@ decused:
 unacct:
 	shmem_unacct_blocks(info->flags, 1);
 failed:
-	if (swap.val && error != -EINVAL) {
-		struct page *test = find_get_page(mapping, index);
-		if (test && !radix_tree_exceptional_entry(test))
-			page_cache_release(test);
-		/* Have another try if the entry has changed */
-		if (test != swp_to_radix_entry(swap))
-			error = -EEXIST;
-	}
+	if (swap.val && error != -EINVAL &&
+	    !shmem_confirm_swap(mapping, index, swap))
+		error = -EEXIST;
+unlock:
 	if (page) {
 		unlock_page(page);
 		page_cache_release(page);
@@ -1264,7 +1281,7 @@ failed:
 		spin_unlock(&info->lock);
 		goto repeat;
 	}
-	if (error == -EEXIST)
+	if (error == -EEXIST)	/* from above or from radix_tree_insert */
 		goto repeat;
 	return error;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
  2012-07-09 22:35 ` Hugh Dickins
@ 2012-07-09 22:46   ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

shmem_add_to_page_cache() has three callsites, but only one of them
wants the radix_tree_preload() (an exceptional entry guarantees that
the radix tree node is present in the other cases), and only that site
can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
no-op in the other cases).  We did it this way originally to reflect
add_to_page_cache_locked(); but it's confusing now, so move the
radix_tree preloading and mem_cgroup uncharging to that one caller.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
This is just a cleanup: I'd prefer it to go in along with the fix 2/3,
but it can be delayed to v3.6 if you prefer.

 mm/shmem.c |   58 ++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

--- 3.5-rc6+/mm/shmem.c	2012-07-07 19:20:52.026952048 -0700
+++ linux/mm/shmem.c	2012-07-07 19:21:44.342952082 -0700
@@ -288,40 +288,31 @@ static int shmem_add_to_page_cache(struc
 				   struct address_space *mapping,
 				   pgoff_t index, gfp_t gfp, void *expected)
 {
-	int error = 0;
+	int error;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapBacked(page));
 
+	page_cache_get(page);
+	page->mapping = mapping;
+	page->index = index;
+
+	spin_lock_irq(&mapping->tree_lock);
 	if (!expected)
-		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+		error = radix_tree_insert(&mapping->page_tree, index, page);
+	else
+		error = shmem_radix_tree_replace(mapping, index, expected,
+								 page);
 	if (!error) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = index;
-
-		spin_lock_irq(&mapping->tree_lock);
-		if (!expected)
-			error = radix_tree_insert(&mapping->page_tree,
-							index, page);
-		else
-			error = shmem_radix_tree_replace(mapping, index,
-							expected, page);
-		if (!error) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			__inc_zone_page_state(page, NR_SHMEM);
-			spin_unlock_irq(&mapping->tree_lock);
-		} else {
-			page->mapping = NULL;
-			spin_unlock_irq(&mapping->tree_lock);
-			page_cache_release(page);
-		}
-		if (!expected)
-			radix_tree_preload_end();
+		mapping->nrpages++;
+		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_zone_page_state(page, NR_SHMEM);
+		spin_unlock_irq(&mapping->tree_lock);
+	} else {
+		page->mapping = NULL;
+		spin_unlock_irq(&mapping->tree_lock);
+		page_cache_release(page);
 	}
-	if (error)
-		mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
 
@@ -1202,11 +1193,18 @@ repeat:
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-		if (!error)
-			error = shmem_add_to_page_cache(page, mapping, index,
-						gfp, NULL);
 		if (error)
 			goto decused;
+		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+		if (!error) {
+			error = shmem_add_to_page_cache(page, mapping, index,
+							gfp, NULL);
+			radix_tree_preload_end();
+		}
+		if (error) {
+			mem_cgroup_uncharge_cache_page(page);
+			goto decused;
+		}
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
@ 2012-07-09 22:46   ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-09 22:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

shmem_add_to_page_cache() has three callsites, but only one of them
wants the radix_tree_preload() (an exceptional entry guarantees that
the radix tree node is present in the other cases), and only that site
can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
no-op in the other cases).  We did it this way originally to reflect
add_to_page_cache_locked(); but it's confusing now, so move the
radix_tree preloading and mem_cgroup uncharging to that one caller.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
This is just a cleanup: I'd prefer it to go in along with the fix 2/3,
but it can be delayed to v3.6 if you prefer.

 mm/shmem.c |   58 ++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 30 deletions(-)

--- 3.5-rc6+/mm/shmem.c	2012-07-07 19:20:52.026952048 -0700
+++ linux/mm/shmem.c	2012-07-07 19:21:44.342952082 -0700
@@ -288,40 +288,31 @@ static int shmem_add_to_page_cache(struc
 				   struct address_space *mapping,
 				   pgoff_t index, gfp_t gfp, void *expected)
 {
-	int error = 0;
+	int error;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(!PageSwapBacked(page));
 
+	page_cache_get(page);
+	page->mapping = mapping;
+	page->index = index;
+
+	spin_lock_irq(&mapping->tree_lock);
 	if (!expected)
-		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+		error = radix_tree_insert(&mapping->page_tree, index, page);
+	else
+		error = shmem_radix_tree_replace(mapping, index, expected,
+								 page);
 	if (!error) {
-		page_cache_get(page);
-		page->mapping = mapping;
-		page->index = index;
-
-		spin_lock_irq(&mapping->tree_lock);
-		if (!expected)
-			error = radix_tree_insert(&mapping->page_tree,
-							index, page);
-		else
-			error = shmem_radix_tree_replace(mapping, index,
-							expected, page);
-		if (!error) {
-			mapping->nrpages++;
-			__inc_zone_page_state(page, NR_FILE_PAGES);
-			__inc_zone_page_state(page, NR_SHMEM);
-			spin_unlock_irq(&mapping->tree_lock);
-		} else {
-			page->mapping = NULL;
-			spin_unlock_irq(&mapping->tree_lock);
-			page_cache_release(page);
-		}
-		if (!expected)
-			radix_tree_preload_end();
+		mapping->nrpages++;
+		__inc_zone_page_state(page, NR_FILE_PAGES);
+		__inc_zone_page_state(page, NR_SHMEM);
+		spin_unlock_irq(&mapping->tree_lock);
+	} else {
+		page->mapping = NULL;
+		spin_unlock_irq(&mapping->tree_lock);
+		page_cache_release(page);
 	}
-	if (error)
-		mem_cgroup_uncharge_cache_page(page);
 	return error;
 }
 
@@ -1202,11 +1193,18 @@ repeat:
 		__set_page_locked(page);
 		error = mem_cgroup_cache_charge(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-		if (!error)
-			error = shmem_add_to_page_cache(page, mapping, index,
-						gfp, NULL);
 		if (error)
 			goto decused;
+		error = radix_tree_preload(gfp & GFP_RECLAIM_MASK);
+		if (!error) {
+			error = shmem_add_to_page_cache(page, mapping, index,
+							gfp, NULL);
+			radix_tree_preload_end();
+		}
+		if (error) {
+			mem_cgroup_uncharge_cache_page(page);
+			goto decused;
+		}
 		lru_cache_add_anon(page);
 
 		spin_lock(&info->lock);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] shmem/tmpfs: three late patches
  2012-07-09 22:35 ` Hugh Dickins
@ 2012-07-09 23:39   ` Andrew Morton
  -1 siblings, 0 replies; 34+ messages in thread
From: Andrew Morton @ 2012-07-09 23:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

On Mon, 9 Jul 2012 15:35:26 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> Here's three little shmem/tmpfs patches against v3.5-rc6.
> Either the first should go in before v3.5 final, or it should not go
> in at all.  The second and third are independent of it: I'd like them
> in v3.5, but don't have a clinching argument: see what you think.

Thanks, I queued all three for 3.5.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/3] shmem/tmpfs: three late patches
@ 2012-07-09 23:39   ` Andrew Morton
  0 siblings, 0 replies; 34+ messages in thread
From: Andrew Morton @ 2012-07-09 23:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-mm, linux-kernel

On Mon, 9 Jul 2012 15:35:26 -0700 (PDT)
Hugh Dickins <hughd@google.com> wrote:

> Here's three little shmem/tmpfs patches against v3.5-rc6.
> Either the first should go in before v3.5 final, or it should not go
> in at all.  The second and third are independent of it: I'd like them
> in v3.5, but don't have a clinching argument: see what you think.

Thanks, I queued all three for 3.5.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
  2012-07-09 22:44   ` Hugh Dickins
@ 2012-07-10 12:41     ` Johannes Weiner
  -1 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2012-07-10 12:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Mon, Jul 09, 2012 at 03:44:24PM -0700, Hugh Dickins wrote:
> When adding the page_private checks before calling shmem_replace_page(),
> I did realize that there is a further race, but thought it too unlikely
> to need a hurried fix.
> 
> But independently I've been chasing why a mem cgroup's memory.stat
> sometimes shows negative rss after all tasks have gone: I expected it
> to be a stats gathering bug, but actually it's shmem swapping's fault.
> 
> It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
> the page may have been removed from swapcache before getting the lock; 
> or it may have been freed and reused and be back in swapcache; and it
> can even be using the same swap location as before (page_private same).
> 
> The swapoff case is already secure against this (swap cannot be reused
> until the whole area has been swapped off, and a new swapped on); and
> shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
> for the expected radix_tree entry - but a little too late.
> 
> By that time, we might have already decided to shmem_replace_page():
> I don't know of a problem from that, but I'd feel more at ease not to
> do so spuriously.  And we have already done mem_cgroup_cache_charge(),
> on perhaps the wrong mem cgroup: and this charge is not then undone on
> the error path, because PageSwapCache ends up preventing that.

I couldn't see anything wrong with shmem_replace_page(), either, but
maybe the comment in its error path could be updated as the callsite
does not rely on page_private alone anymore to confirm correct swap.

> It's this last case which causes the occasional negative rss in
> memory.stat: the page is charged here as cache, but (sometimes) found
> to be anon when eventually it's uncharged - and in between, it's an
> undeserved charge on the wrong memcg.
> 
> Fix this by adding an earlier check on the radix_tree entry: it's
> inelegant to descend the tree twice, but swapping is not the fast path,
> and a better solution would need a pair (try+commit) of memcg calls,
> and a rework of shmem_replace_page() to keep out of the swapcache.
> 
> We can use the added shmem_confirm_swap() function to replace the
> find_get_page+page_cache_release we were already doing on the error
> path.  And add a comment on that -EEXIST: it seems a peculiar errno
> to be using, but originates from its use in radix_tree_insert().
> 
> [It can be surprising to see positive rss left in a memcg's memory.stat
> after all tasks have gone, since it is supposed to count anonymous but
> not shmem.  Aside from sharing anon pages via fork with a task in some
> other memcg, it often happens after swapping: because a swap page can't
> be freed while under writeback, nor while locked.  So it's not an error,
> and these residual pages are easily freed once pressure demands.]
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
@ 2012-07-10 12:41     ` Johannes Weiner
  0 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2012-07-10 12:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Mon, Jul 09, 2012 at 03:44:24PM -0700, Hugh Dickins wrote:
> When adding the page_private checks before calling shmem_replace_page(),
> I did realize that there is a further race, but thought it too unlikely
> to need a hurried fix.
> 
> But independently I've been chasing why a mem cgroup's memory.stat
> sometimes shows negative rss after all tasks have gone: I expected it
> to be a stats gathering bug, but actually it's shmem swapping's fault.
> 
> It's an old surprise, that when you lock_page(lookup_swap_cache(swap)),
> the page may have been removed from swapcache before getting the lock; 
> or it may have been freed and reused and be back in swapcache; and it
> can even be using the same swap location as before (page_private same).
> 
> The swapoff case is already secure against this (swap cannot be reused
> until the whole area has been swapped off, and a new swapped on); and
> shmem_getpage_gfp() is protected by shmem_add_to_page_cache()'s check
> for the expected radix_tree entry - but a little too late.
> 
> By that time, we might have already decided to shmem_replace_page():
> I don't know of a problem from that, but I'd feel more at ease not to
> do so spuriously.  And we have already done mem_cgroup_cache_charge(),
> on perhaps the wrong mem cgroup: and this charge is not then undone on
> the error path, because PageSwapCache ends up preventing that.

I couldn't see anything wrong with shmem_replace_page(), either, but
maybe the comment in its error path could be updated as the callsite
does not rely on page_private alone anymore to confirm correct swap.

> It's this last case which causes the occasional negative rss in
> memory.stat: the page is charged here as cache, but (sometimes) found
> to be anon when eventually it's uncharged - and in between, it's an
> undeserved charge on the wrong memcg.
> 
> Fix this by adding an earlier check on the radix_tree entry: it's
> inelegant to descend the tree twice, but swapping is not the fast path,
> and a better solution would need a pair (try+commit) of memcg calls,
> and a rework of shmem_replace_page() to keep out of the swapcache.
> 
> We can use the added shmem_confirm_swap() function to replace the
> find_get_page+page_cache_release we were already doing on the error
> path.  And add a comment on that -EEXIST: it seems a peculiar errno
> to be using, but originates from its use in radix_tree_insert().
> 
> [It can be surprising to see positive rss left in a memcg's memory.stat
> after all tasks have gone, since it is supposed to count anonymous but
> not shmem.  Aside from sharing anon pages via fork with a task in some
> other memcg, it often happens after swapping: because a swap page can't
> be freed while under writeback, nor while locked.  So it's not an error,
> and these residual pages are easily freed once pressure demands.]
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
  2012-07-09 22:46   ` Hugh Dickins
@ 2012-07-10 13:01     ` Johannes Weiner
  -1 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2012-07-10 13:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Mon, Jul 09, 2012 at 03:46:53PM -0700, Hugh Dickins wrote:
> shmem_add_to_page_cache() has three callsites, but only one of them
> wants the radix_tree_preload() (an exceptional entry guarantees that
> the radix tree node is present in the other cases), and only that site
> can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
> no-op in the other cases).  We did it this way originally to reflect
> add_to_page_cache_locked(); but it's confusing now, so move the
> radix_tree preloading and mem_cgroup uncharging to that one caller.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I'm rebasing my (un)charge series on top of these, thanks.  It only
annihilates 3/11 and leaves the rest alone--line numbers aside--since
the rules did not change.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache
@ 2012-07-10 13:01     ` Johannes Weiner
  0 siblings, 0 replies; 34+ messages in thread
From: Johannes Weiner @ 2012-07-10 13:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Mon, Jul 09, 2012 at 03:46:53PM -0700, Hugh Dickins wrote:
> shmem_add_to_page_cache() has three callsites, but only one of them
> wants the radix_tree_preload() (an exceptional entry guarantees that
> the radix tree node is present in the other cases), and only that site
> can achieve mem_cgroup_uncharge_cache_page() (PageSwapCache makes it a
> no-op in the other cases).  We did it this way originally to reflect
> add_to_page_cache_locked(); but it's confusing now, so move the
> radix_tree preloading and mem_cgroup uncharging to that one caller.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

I'm rebasing my (un)charge series on top of these, thanks.  It only
annihilates 3/11 and leaves the rest alone--line numbers aside--since
the rules did not change.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-09 22:41   ` Hugh Dickins
@ 2012-07-11  6:07     ` Cong Wang
  -1 siblings, 0 replies; 34+ messages in thread
From: Cong Wang @ 2012-07-11  6:07 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-kernel

On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> I believe it's correct, and it's been nice to have from rc1 to rc6;
> but as the original commit said:
>
> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> would be of any use to them on tmpfs.  This code adds 92 lines and 752
> bytes on x86_64 - is that bloat or worthwhile?


I don't think 752 bytes matter much, especially for x86_64.

>
> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> to the dumb generic support for v3.5.  We can always reinstate it later
> if useful, and anyone needing it in a hurry can just get it out of git.
>

If you don't have burden to maintain it, I'd prefer to leave as it is,
I don't think 752-bytes is the reason we revert it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-11  6:07     ` Cong Wang
  0 siblings, 0 replies; 34+ messages in thread
From: Cong Wang @ 2012-07-11  6:07 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-fsdevel, linux-kernel

On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> I believe it's correct, and it's been nice to have from rc1 to rc6;
> but as the original commit said:
>
> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> would be of any use to them on tmpfs.  This code adds 92 lines and 752
> bytes on x86_64 - is that bloat or worthwhile?


I don't think 752 bytes matter much, especially for x86_64.

>
> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> to the dumb generic support for v3.5.  We can always reinstate it later
> if useful, and anyone needing it in a hurry can just get it out of git.
>

If you don't have burden to maintain it, I'd prefer to leave as it is,
I don't think 752-bytes is the reason we revert it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
  2012-07-10 12:41     ` Johannes Weiner
@ 2012-07-11 18:15       ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Tue, 10 Jul 2012, Johannes Weiner wrote:
> 
> I couldn't see anything wrong with shmem_replace_page(), either, but
> maybe the comment in its error path could be updated as the callsite
> does not rely on page_private alone anymore to confirm correct swap.

I went in to make an incremental fix to update that comment as you
suggest, but found that actually the comment should stay as is.

We're dealing with two different radix trees here: shmem_confirm_swap()
and shmem_add_to_page_cache() are operating on the tmpfs file radix tree,
but shmem_replace_page() is using shmem_radix_tree_replace() to operate
on the "swapper_space" radix tree, exchanging the page pointer there.

The preliminary page_private test (under page lock) should indeed be
guaranteeing that the page pointer found in the swapper_space radix
tree at that (swp_entry_t) offset is the one we expect there, as before.

Whereas the new shmem_confirm_swap() test doesn't help to guarantee that
part at all: it's for confirming that the swap entry is still being used
for the offset in the file that we're interested in.

The comment I would like to change is the "nice clean interface" one!
While it does make for a nice old-page-in/new-page-out interface to that
function, I've come to feel that it would be much better not to mess with
the swapcache at all there - leave the old page in the swapcache, and
remove it at the same time as inserting the new page into filecache.

But I'm also reluctant to mess with what's working: I'm in no rush
to change that around, I'd be sure to screw it up at first.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] shmem: fix negative rss in memcg memory.stat
@ 2012-07-11 18:15       ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Michal Hocko, linux-mm, linux-kernel

On Tue, 10 Jul 2012, Johannes Weiner wrote:
> 
> I couldn't see anything wrong with shmem_replace_page(), either, but
> maybe the comment in its error path could be updated as the callsite
> does not rely on page_private alone anymore to confirm correct swap.

I went in to make an incremental fix to update that comment as you
suggest, but found that actually the comment should stay as is.

We're dealing with two different radix trees here: shmem_confirm_swap()
and shmem_add_to_page_cache() are operating on the tmpfs file radix tree,
but shmem_replace_page() is using shmem_radix_tree_replace() to operate
on the "swapper_space" radix tree, exchanging the page pointer there.

The preliminary page_private test (under page lock) should indeed be
guaranteeing that the page pointer found in the swapper_space radix
tree at that (swp_entry_t) offset is the one we expect there, as before.

Whereas the new shmem_confirm_swap() test doesn't help to guarantee that
part at all: it's for confirming that the swap entry is still being used
for the offset in the file that we're interested in.

The comment I would like to change is the "nice clean interface" one!
While it does make for a nice old-page-in/new-page-out interface to that
function, I've come to feel that it would be much better not to mess with
the swapcache at all there - leave the old page in the swapcache, and
remove it at the same time as inserting the new page into filecache.

But I'm also reluctant to mess with what's working: I'm in no rush
to change that around, I'd be sure to screw it up at first.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks,
Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-11  6:07     ` Cong Wang
@ 2012-07-11 18:55       ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:55 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Wed, 11 Jul 2012, Cong Wang wrote:
> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > but as the original commit said:
> >
> > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > would be of any use to them on tmpfs.  This code adds 92 lines and 752
> > bytes on x86_64 - is that bloat or worthwhile?
> 
> 
> I don't think 752 bytes matter much, especially for x86_64.
> 
> >
> > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > to the dumb generic support for v3.5.  We can always reinstate it later
> > if useful, and anyone needing it in a hurry can just get it out of git.
> >
> 
> If you don't have burden to maintain it, I'd prefer to leave as it is,
> I don't think 752-bytes is the reason we revert it.

Thank you, your vote has been counted ;)
and I'll be glad if yours stimulates some agreement or disagreement.

But your vote would count for a lot more if you know of some app which
would really benefit from this functionality in tmpfs: I've heard of none.

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-11 18:55       ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-11 18:55 UTC (permalink / raw)
  To: Cong Wang; +Cc: linux-mm, linux-fsdevel, linux-kernel

On Wed, 11 Jul 2012, Cong Wang wrote:
> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > but as the original commit said:
> >
> > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > would be of any use to them on tmpfs.  This code adds 92 lines and 752
> > bytes on x86_64 - is that bloat or worthwhile?
> 
> 
> I don't think 752 bytes matter much, especially for x86_64.
> 
> >
> > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > to the dumb generic support for v3.5.  We can always reinstate it later
> > if useful, and anyone needing it in a hurry can just get it out of git.
> >
> 
> If you don't have burden to maintain it, I'd prefer to leave as it is,
> I don't think 752-bytes is the reason we revert it.

Thank you, your vote has been counted ;)
and I'll be glad if yours stimulates some agreement or disagreement.

But your vote would count for a lot more if you know of some app which
would really benefit from this functionality in tmpfs: I've heard of none.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-11 18:55       ` Hugh Dickins
@ 2012-07-11 23:01         ` Dave Chinner
  -1 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2012-07-11 23:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> On Wed, 11 Jul 2012, Cong Wang wrote:
> > On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > > but as the original commit said:
> > >
> > > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > > would be of any use to them on tmpfs.  This code adds 92 lines and 752
> > > bytes on x86_64 - is that bloat or worthwhile?
> > 
> > 
> > I don't think 752 bytes matter much, especially for x86_64.
> > 
> > >
> > > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > > to the dumb generic support for v3.5.  We can always reinstate it later
> > > if useful, and anyone needing it in a hurry can just get it out of git.
> > >
> > 
> > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > I don't think 752-bytes is the reason we revert it.
> 
> Thank you, your vote has been counted ;)
> and I'll be glad if yours stimulates some agreement or disagreement.
> 
> But your vote would count for a lot more if you know of some app which
> would really benefit from this functionality in tmpfs: I've heard of none.

So what? I've heard of no apps that use this functionality on XFS,
either, but I have heard of a lot of people asking for it to be
implemented over the past couple of years so they can use it.
There's been patches written to make coreutils (cp) make use of it
instead of parsing FIEMAP output to find holes, though I don't know
if that's gone beyond more than "here's some patches"....

Besides, given that you can punch holes in tmpfs files, it seems
strange to then say "we don't need a method of skipping holes to
find data quickly"....

Besides, seek-hole/data is still shiny new and lots of developers
aren't even aware of it's presence in recent kernels. Removing new
functionality saying "no-one is using it" is like smashing the egg
before the chicken hatches (or is it cutting of the chickes's head
before it lays the egg?).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-11 23:01         ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2012-07-11 23:01 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> On Wed, 11 Jul 2012, Cong Wang wrote:
> > On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
> > > Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
> > > I believe it's correct, and it's been nice to have from rc1 to rc6;
> > > but as the original commit said:
> > >
> > > I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
> > > would be of any use to them on tmpfs.  This code adds 92 lines and 752
> > > bytes on x86_64 - is that bloat or worthwhile?
> > 
> > 
> > I don't think 752 bytes matter much, especially for x86_64.
> > 
> > >
> > > Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
> > > to the dumb generic support for v3.5.  We can always reinstate it later
> > > if useful, and anyone needing it in a hurry can just get it out of git.
> > >
> > 
> > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > I don't think 752-bytes is the reason we revert it.
> 
> Thank you, your vote has been counted ;)
> and I'll be glad if yours stimulates some agreement or disagreement.
> 
> But your vote would count for a lot more if you know of some app which
> would really benefit from this functionality in tmpfs: I've heard of none.

So what? I've heard of no apps that use this functionality on XFS,
either, but I have heard of a lot of people asking for it to be
implemented over the past couple of years so they can use it.
There's been patches written to make coreutils (cp) make use of it
instead of parsing FIEMAP output to find holes, though I don't know
if that's gone beyond more than "here's some patches"....

Besides, given that you can punch holes in tmpfs files, it seems
strange to then say "we don't need a method of skipping holes to
find data quickly"....

Besides, seek-hole/data is still shiny new and lots of developers
aren't even aware of it's presence in recent kernels. Removing new
functionality saying "no-one is using it" is like smashing the egg
before the chicken hatches (or is it cutting of the chickes's head
before it lays the egg?).

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-11 23:01         ` Dave Chinner
@ 2012-07-12  2:50           ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-12  2:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Thu, 12 Jul 2012, Dave Chinner wrote:
> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> > On Wed, 11 Jul 2012, Cong Wang wrote:
> > > 
> > > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > > I don't think 752-bytes is the reason we revert it.
> > 
> > Thank you, your vote has been counted ;)
> > and I'll be glad if yours stimulates some agreement or disagreement.
> > 
> > But your vote would count for a lot more if you know of some app which
> > would really benefit from this functionality in tmpfs: I've heard of none.
> 
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.

I'd certainly not ask you to remove your support for it from XFS:
nobody would call XFS a minimal filesystem.

But tmpfs has a tradition and a duty to keep fairly small:
it needs to be useful, but it shouldn't be carrying unused baggage.

> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"....
> 
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....

tmpfs has been punching holes (via MADV_REMOVE) since 2.6.16 (and
that wasn't added on my whim, IBM wanted and did it).  But I haven't
heard of anybody asking for a method of skipping them in six years.

> 
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).

(You remind me of my chicken-and-egg sandwiches - you can't get them,
you see, it's chicken and egg.)

I'm not trying to remove SEEK_HOLE/SEEK_DATA support from the kernel:
I'm just saying that nobody has yet made the case for their usefulness
in tmpfs, so they're better removed from it before v3.5 is released.

Once we see how useful they have become in the grown-up filesystems,
or someone shows how useful they can be on tmpfs, then we reinstate.

Of course, I'm on both sides of this argument: I wrote that code,
I like it, I'll be glad to put it back when it's useful to someone.

Hugh

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-12  2:50           ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-12  2:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Thu, 12 Jul 2012, Dave Chinner wrote:
> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> > On Wed, 11 Jul 2012, Cong Wang wrote:
> > > 
> > > If you don't have burden to maintain it, I'd prefer to leave as it is,
> > > I don't think 752-bytes is the reason we revert it.
> > 
> > Thank you, your vote has been counted ;)
> > and I'll be glad if yours stimulates some agreement or disagreement.
> > 
> > But your vote would count for a lot more if you know of some app which
> > would really benefit from this functionality in tmpfs: I've heard of none.
> 
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.

I'd certainly not ask you to remove your support for it from XFS:
nobody would call XFS a minimal filesystem.

But tmpfs has a tradition and a duty to keep fairly small:
it needs to be useful, but it shouldn't be carrying unused baggage.

> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"....
> 
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....

tmpfs has been punching holes (via MADV_REMOVE) since 2.6.16 (and
that wasn't added on my whim, IBM wanted and did it).  But I haven't
heard of anybody asking for a method of skipping them in six years.

> 
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).

(You remind me of my chicken-and-egg sandwiches - you can't get them,
you see, it's chicken and egg.)

I'm not trying to remove SEEK_HOLE/SEEK_DATA support from the kernel:
I'm just saying that nobody has yet made the case for their usefulness
in tmpfs, so they're better removed from it before v3.5 is released.

Once we see how useful they have become in the grown-up filesystems,
or someone shows how useful they can be on tmpfs, then we reinstate.

Of course, I'm on both sides of this argument: I wrote that code,
I like it, I'll be glad to put it back when it's useful to someone.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-11 23:01         ` Dave Chinner
@ 2012-07-12  3:21           ` Jeff Liu
  -1 siblings, 0 replies; 34+ messages in thread
From: Jeff Liu @ 2012-07-12  3:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On 07/12/2012 07:01 AM, Dave Chinner wrote:

> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>> On Wed, 11 Jul 2012, Cong Wang wrote:
>>> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
>>>> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
>>>> I believe it's correct, and it's been nice to have from rc1 to rc6;
>>>> but as the original commit said:
>>>>
>>>> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
>>>> would be of any use to them on tmpfs.  This code adds 92 lines and 752
>>>> bytes on x86_64 - is that bloat or worthwhile?
>>>
>>>
>>> I don't think 752 bytes matter much, especially for x86_64.
>>>
>>>>
>>>> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
>>>> to the dumb generic support for v3.5.  We can always reinstate it later
>>>> if useful, and anyone needing it in a hurry can just get it out of git.
>>>>
>>>
>>> If you don't have burden to maintain it, I'd prefer to leave as it is,
>>> I don't think 752-bytes is the reason we revert it.
>>
>> Thank you, your vote has been counted ;)
>> and I'll be glad if yours stimulates some agreement or disagreement.
>>
>> But your vote would count for a lot more if you know of some app which
>> would really benefit from this functionality in tmpfs: I've heard of none.
> 
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.
> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"...

Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
I have implemented an extent-scan module to coreutils a few years ago,
http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html

It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.

Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
user of it, at least, GNU tar will definitely need it for sparse file backup.

> 
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....

So its deserve to keep this feature working on tmpfs considering hole punch. :)

Thanks,
-Jeff

> 
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).
> 
> Cheers,
> 
> Dave.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-12  3:21           ` Jeff Liu
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff Liu @ 2012-07-12  3:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hugh Dickins, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On 07/12/2012 07:01 AM, Dave Chinner wrote:

> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>> On Wed, 11 Jul 2012, Cong Wang wrote:
>>> On Mon, 09 Jul 2012 at 22:41 GMT, Hugh Dickins <hughd@google.com> wrote:
>>>> Revert 4fb5ef089b28 ("tmpfs: support SEEK_DATA and SEEK_HOLE").
>>>> I believe it's correct, and it's been nice to have from rc1 to rc6;
>>>> but as the original commit said:
>>>>
>>>> I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether it
>>>> would be of any use to them on tmpfs.  This code adds 92 lines and 752
>>>> bytes on x86_64 - is that bloat or worthwhile?
>>>
>>>
>>> I don't think 752 bytes matter much, especially for x86_64.
>>>
>>>>
>>>> Nobody asked for it, so I conclude that it's bloat: let's revert tmpfs
>>>> to the dumb generic support for v3.5.  We can always reinstate it later
>>>> if useful, and anyone needing it in a hurry can just get it out of git.
>>>>
>>>
>>> If you don't have burden to maintain it, I'd prefer to leave as it is,
>>> I don't think 752-bytes is the reason we revert it.
>>
>> Thank you, your vote has been counted ;)
>> and I'll be glad if yours stimulates some agreement or disagreement.
>>
>> But your vote would count for a lot more if you know of some app which
>> would really benefit from this functionality in tmpfs: I've heard of none.
> 
> So what? I've heard of no apps that use this functionality on XFS,
> either, but I have heard of a lot of people asking for it to be
> implemented over the past couple of years so they can use it.
> There's been patches written to make coreutils (cp) make use of it
> instead of parsing FIEMAP output to find holes, though I don't know
> if that's gone beyond more than "here's some patches"...

Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
I have implemented an extent-scan module to coreutils a few years ago,
http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html

It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.

Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
user of it, at least, GNU tar will definitely need it for sparse file backup.

> 
> Besides, given that you can punch holes in tmpfs files, it seems
> strange to then say "we don't need a method of skipping holes to
> find data quickly"....

So its deserve to keep this feature working on tmpfs considering hole punch. :)

Thanks,
-Jeff

> 
> Besides, seek-hole/data is still shiny new and lots of developers
> aren't even aware of it's presence in recent kernels. Removing new
> functionality saying "no-one is using it" is like smashing the egg
> before the chicken hatches (or is it cutting of the chickes's head
> before it lays the egg?).
> 
> Cheers,
> 
> Dave.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-12  3:21           ` Jeff Liu
@ 2012-07-16  9:28             ` Hugh Dickins
  -1 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-16  9:28 UTC (permalink / raw)
  To: Jeff Liu; +Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Thu, 12 Jul 2012, Jeff Liu wrote:
> On 07/12/2012 07:01 AM, Dave Chinner wrote:
> > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> >>
> >> But your vote would count for a lot more if you know of some app which
> >> would really benefit from this functionality in tmpfs: I've heard of none.
> > 
> > So what? I've heard of no apps that use this functionality on XFS,
> > either, but I have heard of a lot of people asking for it to be
> > implemented over the past couple of years so they can use it.
> > There's been patches written to make coreutils (cp) make use of it
> > instead of parsing FIEMAP output to find holes, though I don't know
> > if that's gone beyond more than "here's some patches"...
> 
> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
> I have implemented an extent-scan module to coreutils a few years ago,
> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html

Thanks for confirming Dave's pointer to cp.

Of course, tmpfs has never supported FIBMAP or FIEMAP;
but SEEK_DATA and SEEK_HOLE do fit it much more naturally.

> 
> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
> considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
> 
> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
> user of it, at least, GNU tar will definitely need it for sparse file backup.

Thanks for the info.  I confess I'm not hugely swayed by cp and sparse
file archive arguments - I doubt many people care, and I doubt those who
do care are using tmpfs for them.

But my doubts are just ignorance.  I was hoping to hear, not that we have
tools to copy sparse files efficiently (umm, over the network?), but
what apps are actually working live with those sparse files on tmpfs,
and now need to seek around them.  Some math or physics applications?

> > 
> > Besides, given that you can punch holes in tmpfs files, it seems
> > strange to then say "we don't need a method of skipping holes to
> > find data quickly"....
> 
> So its deserve to keep this feature working on tmpfs considering hole punch. :)

Well, thank you, as I said earlier I am on both sides of the argument.
(And feel uncomfortably like a prima donna waiting in the wings until
the audience has shouted long enough for the encore.)

It's now taken out of 3.5, but we can bring it back when there's more
demand.  Your extent-scan is itself waiting for ext4 to support it:
maybe get noisy at me when that's imminent.

Hugh

> 
> Thanks,
> -Jeff
> 
> > 
> > Besides, seek-hole/data is still shiny new and lots of developers
> > aren't even aware of it's presence in recent kernels. Removing new
> > functionality saying "no-one is using it" is like smashing the egg
> > before the chicken hatches (or is it cutting of the chickes's head
> > before it lays the egg?).
> > 
> > Cheers,
> > 
> > Dave.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-16  9:28             ` Hugh Dickins
  0 siblings, 0 replies; 34+ messages in thread
From: Hugh Dickins @ 2012-07-16  9:28 UTC (permalink / raw)
  To: Jeff Liu; +Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On Thu, 12 Jul 2012, Jeff Liu wrote:
> On 07/12/2012 07:01 AM, Dave Chinner wrote:
> > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> >>
> >> But your vote would count for a lot more if you know of some app which
> >> would really benefit from this functionality in tmpfs: I've heard of none.
> > 
> > So what? I've heard of no apps that use this functionality on XFS,
> > either, but I have heard of a lot of people asking for it to be
> > implemented over the past couple of years so they can use it.
> > There's been patches written to make coreutils (cp) make use of it
> > instead of parsing FIEMAP output to find holes, though I don't know
> > if that's gone beyond more than "here's some patches"...
> 
> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
> I have implemented an extent-scan module to coreutils a few years ago,
> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html

Thanks for confirming Dave's pointer to cp.

Of course, tmpfs has never supported FIBMAP or FIEMAP;
but SEEK_DATA and SEEK_HOLE do fit it much more naturally.

> 
> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
> considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
> 
> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
> user of it, at least, GNU tar will definitely need it for sparse file backup.

Thanks for the info.  I confess I'm not hugely swayed by cp and sparse
file archive arguments - I doubt many people care, and I doubt those who
do care are using tmpfs for them.

But my doubts are just ignorance.  I was hoping to hear, not that we have
tools to copy sparse files efficiently (umm, over the network?), but
what apps are actually working live with those sparse files on tmpfs,
and now need to seek around them.  Some math or physics applications?

> > 
> > Besides, given that you can punch holes in tmpfs files, it seems
> > strange to then say "we don't need a method of skipping holes to
> > find data quickly"....
> 
> So its deserve to keep this feature working on tmpfs considering hole punch. :)

Well, thank you, as I said earlier I am on both sides of the argument.
(And feel uncomfortably like a prima donna waiting in the wings until
the audience has shouted long enough for the encore.)

It's now taken out of 3.5, but we can bring it back when there's more
demand.  Your extent-scan is itself waiting for ext4 to support it:
maybe get noisy at me when that's imminent.

Hugh

> 
> Thanks,
> -Jeff
> 
> > 
> > Besides, seek-hole/data is still shiny new and lots of developers
> > aren't even aware of it's presence in recent kernels. Removing new
> > functionality saying "no-one is using it" is like smashing the egg
> > before the chicken hatches (or is it cutting of the chickes's head
> > before it lays the egg?).
> > 
> > Cheers,
> > 
> > Dave.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-16  9:28             ` Hugh Dickins
@ 2012-07-17  6:15               ` Jeff Liu
  -1 siblings, 0 replies; 34+ messages in thread
From: Jeff Liu @ 2012-07-17  6:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On 07/16/2012 05:28 PM, Hugh Dickins wrote:

> On Thu, 12 Jul 2012, Jeff Liu wrote:
>> On 07/12/2012 07:01 AM, Dave Chinner wrote:
>>> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>>>>
>>>> But your vote would count for a lot more if you know of some app which
>>>> would really benefit from this functionality in tmpfs: I've heard of none.
>>>
>>> So what? I've heard of no apps that use this functionality on XFS,
>>> either, but I have heard of a lot of people asking for it to be
>>> implemented over the past couple of years so they can use it.
>>> There's been patches written to make coreutils (cp) make use of it
>>> instead of parsing FIEMAP output to find holes, though I don't know
>>> if that's gone beyond more than "here's some patches"...
>>
>> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
>> I have implemented an extent-scan module to coreutils a few years ago,
>> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html
> 
> Thanks for confirming Dave's pointer to cp.
> 
> Of course, tmpfs has never supported FIBMAP or FIEMAP;
> but SEEK_DATA and SEEK_HOLE do fit it much more naturally.
> 
>>
>> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
>> considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
>>
>> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
>> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
>> user of it, at least, GNU tar will definitely need it for sparse file backup.
> 
> Thanks for the info.  I confess I'm not hugely swayed by cp and sparse
> file archive arguments - I doubt many people care, and I doubt those who
> do care are using tmpfs for them.
> 
> But my doubts are just ignorance.  I was hoping to hear, not that we have
> tools to copy sparse files efficiently (umm, over the network?), but
> what apps are actually working live with those sparse files on tmpfs,
> and now need to seek around them.  Some math or physics applications?
> 
>>>
>>> Besides, given that you can punch holes in tmpfs files, it seems
>>> strange to then say "we don't need a method of skipping holes to
>>> find data quickly"....
>>
>> So its deserve to keep this feature working on tmpfs considering hole punch. :)
> 
> Well, thank you, as I said earlier I am on both sides of the argument.
> (And feel uncomfortably like a prima donna waiting in the wings until
> the audience has shouted long enough for the encore.)

Oh, sorry, I missed you response to Dave before.

> 
> It's now taken out of 3.5, but we can bring it back when there's more
> demand.

Yep. :)

Thanks,
-Jeff

> Your extent-scan is itself waiting for ext4 to support it:
> maybe get noisy at me when that's imminent.

> 
> Hugh
> 
>>
>> Thanks,
>> -Jeff
>>
>>>
>>> Besides, seek-hole/data is still shiny new and lots of developers
>>> aren't even aware of it's presence in recent kernels. Removing new
>>> functionality saying "no-one is using it" is like smashing the egg
>>> before the chicken hatches (or is it cutting of the chickes's head
>>> before it lays the egg?).
>>>
>>> Cheers,
>>>
>>> Dave.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
@ 2012-07-17  6:15               ` Jeff Liu
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff Liu @ 2012-07-17  6:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Dave Chinner, Cong Wang, linux-mm, linux-fsdevel, linux-kernel

On 07/16/2012 05:28 PM, Hugh Dickins wrote:

> On Thu, 12 Jul 2012, Jeff Liu wrote:
>> On 07/12/2012 07:01 AM, Dave Chinner wrote:
>>> On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>>>>
>>>> But your vote would count for a lot more if you know of some app which
>>>> would really benefit from this functionality in tmpfs: I've heard of none.
>>>
>>> So what? I've heard of no apps that use this functionality on XFS,
>>> either, but I have heard of a lot of people asking for it to be
>>> implemented over the past couple of years so they can use it.
>>> There's been patches written to make coreutils (cp) make use of it
>>> instead of parsing FIEMAP output to find holes, though I don't know
>>> if that's gone beyond more than "here's some patches"...
>>
>> Yes, for apps, cp(1) will make use of it to replace the old FIEMAP for efficient sparse file copy.
>> I have implemented an extent-scan module to coreutils a few years ago,
>> http://fossies.org/dox/coreutils-8.17/extent-scan_8c_source.html
> 
> Thanks for confirming Dave's pointer to cp.
> 
> Of course, tmpfs has never supported FIBMAP or FIEMAP;
> but SEEK_DATA and SEEK_HOLE do fit it much more naturally.
> 
>>
>> It does extent scan through FIEMAP, however, SEEK_DATA/SEEK_HOLE is more convenient and easy to use
>> considering the call interface.  So FIEMAP will be replaced by SEEK_XXX once it got supported by EXT4.
>>
>> Moreover, I have discussed with Jim who is the coreutils maintainer previously, He would like to post
>> extent-scan module to Gnulib so that other GNU utilities which are relied on Gnulib might be a potential
>> user of it, at least, GNU tar will definitely need it for sparse file backup.
> 
> Thanks for the info.  I confess I'm not hugely swayed by cp and sparse
> file archive arguments - I doubt many people care, and I doubt those who
> do care are using tmpfs for them.
> 
> But my doubts are just ignorance.  I was hoping to hear, not that we have
> tools to copy sparse files efficiently (umm, over the network?), but
> what apps are actually working live with those sparse files on tmpfs,
> and now need to seek around them.  Some math or physics applications?
> 
>>>
>>> Besides, given that you can punch holes in tmpfs files, it seems
>>> strange to then say "we don't need a method of skipping holes to
>>> find data quickly"....
>>
>> So its deserve to keep this feature working on tmpfs considering hole punch. :)
> 
> Well, thank you, as I said earlier I am on both sides of the argument.
> (And feel uncomfortably like a prima donna waiting in the wings until
> the audience has shouted long enough for the encore.)

Oh, sorry, I missed you response to Dave before.

> 
> It's now taken out of 3.5, but we can bring it back when there's more
> demand.

Yep. :)

Thanks,
-Jeff

> Your extent-scan is itself waiting for ext4 to support it:
> maybe get noisy at me when that's imminent.

> 
> Hugh
> 
>>
>> Thanks,
>> -Jeff
>>
>>>
>>> Besides, seek-hole/data is still shiny new and lots of developers
>>> aren't even aware of it's presence in recent kernels. Removing new
>>> functionality saying "no-one is using it" is like smashing the egg
>>> before the chicken hatches (or is it cutting of the chickes's head
>>> before it lays the egg?).
>>>
>>> Cheers,
>>>
>>> Dave.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-16  9:28             ` Hugh Dickins
  (?)
  (?)
@ 2012-07-31 14:30             ` Jim Meyering
  2012-07-31 14:42               ` Jim Meyering
  2012-08-08  2:08               ` Hugh Dickins
  -1 siblings, 2 replies; 34+ messages in thread
From: Jim Meyering @ 2012-07-31 14:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Hugh Dickins

Hugh Dickins wrote:
> On Thu, 12 Jul 2012, Jeff Liu wrote:
> > On 07/12/2012 07:01 AM, Dave Chinner wrote:
> > > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> > >>
> > >> But your vote would count for a lot more if you know of some app which
> > >> would really benefit from this functionality in tmpfs: I've heard
> > >> of none.
...
[Jeff mentioned "cp"]

grep is another tool that would benefit.
I often put very large files (often sparse, too) on tmpfs file systems
and would like "grep -r PAT /tmp" to work well in spite of those files.

Please consider restoring SEEK_HOLE/SEEK_DATA support for tmpfs.

The lack of cross-FS support in SEEK_HOLE/SEEK_DATA support is a bit of a
thorn in our sides.  FIEMAP is not a viable option, and SEEK_HOLE support
works only if you happen to be using btrfs, xfs, ocfs2 or 3.5.0-rcN tmpfs.
Not something we can rely on for a feature whose lack can convert grep -r
into a memory-hogging apparently-hung job or OOM-killer-target.

What would you like to happen when you run
(deliberately or inadvertently) grep on a large sparse file?
I want it to search only the non-HOLE sections of that file,
especially when examining a hole involves accumulating a
"line" that may be so long that it exhausts virtual memory.
We're not quite there, but for now can at least avoid the
VM-abusing behavior with --binary-file=without-match option,
which says to treat "binary" (sparse) files as if they contain no match.
Sometimes.

With working SEEK_HOLE support, grep does the right thing here:

    (${AWK-awk} 'BEGIN{ for (i=0;i<1000;i++) printf "%080d\n", 0 }' < /dev/null
     echo x | dd bs=1024k seek=8000000
    ) >8T-or-so

    $ env time --format=%e grep x 8T-or-so
    0.00

But without SEEK_HOLE support, and with a lot of memory, grep takes a
long time to allocate all of that space before it finally chokes or is killed.
Here, it takes 46 seconds before running out of memory:

    $ env time grep --binary-file=without-match x 8T-or-so
    grep: memory exhausted
    3.15user 25.48system 0:46.46elapsed 61%CPU\
      (0avgtext+0avgdata 12583712maxresident)k
    0inputs+8outputs (0major+2733623minor)pagefaults 0swaps
    [Exit 2]

Until very recently, grep was trying to guess whether an input
has a hole using st_blocks and st_size, but with file systems now
using compression, that method it too subject to false-positives.

Ideally we would use SEEK_HOLE/SEEK_DATA, but until that is useful on
more linux file systems, I suspect we'll have to choose our method based
on the file system type (at the cost of a statvfs call for each st_dev),
possibly in combination with the linux kernel version.

Here's some background/discussion on the topic, including the
original report about the st_blocks-based heuristic not working:

    http://thread.gmane.org/gmane.comp.gnu.grep.bugs/4604/focus=4610

In case you want to see the SEEK_HOLE-using code, grep's file_is_binary
function is here:

    http://git.savannah.gnu.org/cgit/grep.git/tree/src/main.c#n439

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-31 14:30             ` Jim Meyering
@ 2012-07-31 14:42               ` Jim Meyering
  2012-08-08  2:08               ` Hugh Dickins
  1 sibling, 0 replies; 34+ messages in thread
From: Jim Meyering @ 2012-07-31 14:42 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Hugh Dickins

Jim Meyering wrote:
> Hugh Dickins wrote:
>> On Thu, 12 Jul 2012, Jeff Liu wrote:
>> > On 07/12/2012 07:01 AM, Dave Chinner wrote:
>> > > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
>> > >>
>> > >> But your vote would count for a lot more if you know of some app which
>> > >> would really benefit from this functionality in tmpfs: I've heard
>> > >> of none.
> ...
> [Jeff mentioned "cp"]
>
> grep is another tool that would benefit.
> I often put very large files (often sparse, too) on tmpfs file systems
> and would like "grep -r PAT /tmp" to work well in spite of those files.
>
> Please consider restoring SEEK_HOLE/SEEK_DATA support for tmpfs.

Also, lseek's SEEK_HOLE/SEEK_DATA support is slated to be required
by POSIX 2008: http://austingroupbugs.net/view.php?id=415

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-07-31 14:30             ` Jim Meyering
  2012-07-31 14:42               ` Jim Meyering
@ 2012-08-08  2:08               ` Hugh Dickins
  2012-08-14 17:03                 ` Paul Eggert
  1 sibling, 1 reply; 34+ messages in thread
From: Hugh Dickins @ 2012-08-08  2:08 UTC (permalink / raw)
  To: Jim Meyering; +Cc: Paul Eggert, Zheng Liu, linux-fsdevel

On Tue, 31 Jul 2012, Jim Meyering wrote:
> Hugh Dickins wrote:
> > On Thu, 12 Jul 2012, Jeff Liu wrote:
> > > On 07/12/2012 07:01 AM, Dave Chinner wrote:
> > > > On Wed, Jul 11, 2012 at 11:55:34AM -0700, Hugh Dickins wrote:
> > > >>
> > > >> But your vote would count for a lot more if you know of some app which
> > > >> would really benefit from this functionality in tmpfs: I've heard
> > > >> of none.
> ...
> [Jeff mentioned "cp"]
> 
> grep is another tool that would benefit.
> I often put very large files (often sparse, too) on tmpfs file systems
> and would like "grep -r PAT /tmp" to work well in spite of those files.
> 
> Please consider restoring SEEK_HOLE/SEEK_DATA support for tmpfs.

Many thanks, Jim, for taking the time to explain your case and provide
the links below.  I am sorry to have given you guys trouble by adding
and then reverting the tmpfs SEEK_HOLE before 3.5 final.

I really wanted to respond to you positively.  But now that (I think) I
understand the grep case better, I believe the best I can do is to keep
an eye on ext4 development, and reintroduce tmpfs SEEK_HOLE/SEEK_DATA
in the same release that proper support comes to ext4, whenever that is
(I notice that Zheng Liu has posted patches for it, but I have no idea
how close to ready they are).

> 
> The lack of cross-FS support in SEEK_HOLE/SEEK_DATA support is a bit of a
> thorn in our sides.  FIEMAP is not a viable option, and SEEK_HOLE support
> works only if you happen to be using btrfs, xfs, ocfs2 or 3.5.0-rcN tmpfs.
> Not something we can rely on for a feature whose lack can convert grep -r
> into a memory-hogging apparently-hung job or OOM-killer-target.
> 
> What would you like to happen when you run
> (deliberately or inadvertently) grep on a large sparse file?
> I want it to search only the non-HOLE sections of that file,

That will be a nice optimization once SEEK_DATA is more widely
supported, yes.

> especially when examining a hole involves accumulating a
> "line" that may be so long that it exhausts virtual memory.

I think you're conflating two different things there.

As I understand it, there's a bug (or silliness) in grep such that
it ends up exhausting memory when it cannot divide the file up into
sensibly lengthed lines.  I agree that you don't want grep to collapse
with "memory exhausted" in such a case; and you'll know better than me
what a sensible maximum length for a grep line should be.

But what has that got to do with holes?  I have not tried grepping a
larger-than-memory+swap file containing neither NULs nor newlines, so
please correct me if I'm wrong: but I expect it to fail grep's binary
file test (with or without proper SEEK_HOLE support), but then to hit
the memory exhausted error.

What it's got to do with holes is that simple text files don't have
NULs in them, and there tends to be a correlation between files
containing NULs and files not divided into lines of text (though
I'm blissfully ignorant of even the most common document formats).

There's also a correlation between files containing NULs and files
containing NULs in their first 32k.  grep's file_is_binary() looks for
a NUL in that first 32k (?); and the heuristic has been "enhanced" to
check for sparse files too (whether by st_blocks or now SEEK_HOLE) -
though no attempt to check for isolated NULs beyond the first 32k.
And there's even a "big-hole" test for this behaviour in the tree.

But wouldn't the developer's common case (object files amidst source
in the tree) usually be handled by that check on the first 32k?

And all this stuff about holes: shouldn't grep instead just be
checking for over-long lines instead of running out of memory?

I apologize, this is not the right forum to discuss grep code:
let's take it away from linux-fsdevel if followup is needed.

I'll happily reintroduce tmpfs SEEK_DATA and SEEK_HOLE
when ext4 joins btrfs and xfs in supporting it.

Thanks,
Hugh

> We're not quite there, but for now can at least avoid the
> VM-abusing behavior with --binary-file=without-match option,
> which says to treat "binary" (sparse) files as if they contain no match.
> Sometimes.
> 
> With working SEEK_HOLE support, grep does the right thing here:
> 
>     (${AWK-awk} 'BEGIN{ for (i=0;i<1000;i++) printf "%080d\n", 0 }' < /dev/null
>      echo x | dd bs=1024k seek=8000000
>     ) >8T-or-so
> 
>     $ env time --format=%e grep x 8T-or-so
>     0.00
> 
> But without SEEK_HOLE support, and with a lot of memory, grep takes a
> long time to allocate all of that space before it finally chokes or is killed.
> Here, it takes 46 seconds before running out of memory:
> 
>     $ env time grep --binary-file=without-match x 8T-or-so
>     grep: memory exhausted
>     3.15user 25.48system 0:46.46elapsed 61%CPU\
>       (0avgtext+0avgdata 12583712maxresident)k
>     0inputs+8outputs (0major+2733623minor)pagefaults 0swaps
>     [Exit 2]
> 
> Until very recently, grep was trying to guess whether an input
> has a hole using st_blocks and st_size, but with file systems now
> using compression, that method it too subject to false-positives.
> 
> Ideally we would use SEEK_HOLE/SEEK_DATA, but until that is useful on
> more linux file systems, I suspect we'll have to choose our method based
> on the file system type (at the cost of a statvfs call for each st_dev),
> possibly in combination with the linux kernel version.
> 
> Here's some background/discussion on the topic, including the
> original report about the st_blocks-based heuristic not working:
> 
>     http://thread.gmane.org/gmane.comp.gnu.grep.bugs/4604/focus=4610
> 
> In case you want to see the SEEK_HOLE-using code, grep's file_is_binary
> function is here:
> 
>     http://git.savannah.gnu.org/cgit/grep.git/tree/src/main.c#n439

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE
  2012-08-08  2:08               ` Hugh Dickins
@ 2012-08-14 17:03                 ` Paul Eggert
  0 siblings, 0 replies; 34+ messages in thread
From: Paul Eggert @ 2012-08-14 17:03 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Jim Meyering, Zheng Liu, linux-fsdevel

On 08/07/2012 07:08 PM, Hugh Dickins wrote:
> wouldn't the developer's common case (object files amidst source
> in the tree) usually be handled by that check on the first 32k?

Yes, but grep should also handle the less-common case
where the first 32K is text and there's a large hole later.
The particular case I'm worried about is a denial-of-service
attack, so it's irrelevant that this case is uncommon in
typical files.

> shouldn't grep instead just be
> checking for over-long lines instead of running out of memory?

GNU programs should not have arbitrary limits.  An arbitrary limit, such
as 100,000 bytes, that we put on line length, would cause grep to
not work on some valid inputs.

This is not to say that grep couldn't function better on files with
lots of nulls -- it can, and that's on our list of things to do --
but SEEK_HOLE is a big and obvious win in this area.

We also need SEEK_HOLE and SEEK_DATA for GNU 'tar', for the same reason
(denial-of-service attacks, mostly).

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2012-08-14 17:13 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-09 22:35 [PATCH 0/3] shmem/tmpfs: three late patches Hugh Dickins
2012-07-09 22:35 ` Hugh Dickins
2012-07-09 22:41 ` [PATCH 1/3] tmpfs: revert SEEK_DATA and SEEK_HOLE Hugh Dickins
2012-07-09 22:41   ` Hugh Dickins
2012-07-11  6:07   ` Cong Wang
2012-07-11  6:07     ` Cong Wang
2012-07-11 18:55     ` Hugh Dickins
2012-07-11 18:55       ` Hugh Dickins
2012-07-11 23:01       ` Dave Chinner
2012-07-11 23:01         ` Dave Chinner
2012-07-12  2:50         ` Hugh Dickins
2012-07-12  2:50           ` Hugh Dickins
2012-07-12  3:21         ` Jeff Liu
2012-07-12  3:21           ` Jeff Liu
2012-07-16  9:28           ` Hugh Dickins
2012-07-16  9:28             ` Hugh Dickins
2012-07-17  6:15             ` Jeff Liu
2012-07-17  6:15               ` Jeff Liu
2012-07-31 14:30             ` Jim Meyering
2012-07-31 14:42               ` Jim Meyering
2012-08-08  2:08               ` Hugh Dickins
2012-08-14 17:03                 ` Paul Eggert
2012-07-09 22:44 ` [PATCH 2/3] shmem: fix negative rss in memcg memory.stat Hugh Dickins
2012-07-09 22:44   ` Hugh Dickins
2012-07-10 12:41   ` Johannes Weiner
2012-07-10 12:41     ` Johannes Weiner
2012-07-11 18:15     ` Hugh Dickins
2012-07-11 18:15       ` Hugh Dickins
2012-07-09 22:46 ` [PATCH 3/3] shmem: cleanup shmem_add_to_page_cache Hugh Dickins
2012-07-09 22:46   ` Hugh Dickins
2012-07-10 13:01   ` Johannes Weiner
2012-07-10 13:01     ` Johannes Weiner
2012-07-09 23:39 ` [PATCH 0/3] shmem/tmpfs: three late patches Andrew Morton
2012-07-09 23:39   ` Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.