All of lore.kernel.org
 help / color / mirror / Atom feed
* POSIX_FADV_DONTNEED implemented wrong
@ 2013-02-22 19:57 Phillip Susi
  2013-02-22 20:29 ` Johannes Weiner
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2013-02-22 19:57 UTC (permalink / raw)
  To: linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I believe the current implementation for this is wrong.  For clean
pages, it immediately discards them from the cache, and for dirty
ones, it immediately tries to initiate writeout if the bdi is not
congested.  I believe this is wrong for three reasons:

1)  It is completely useless for writing files.  This hint should
allow a program generating lots of writes to files that will not
likely be read again to reduce the cache pressure that causes.

2)  When there is little to no cache pressure, this hint should not
cause the disk to spin up.

3)  This is supposed to be a hint that caching this data is unlikely
to do any good, so the cache should favor other data instead.  Just
because one process does not think it will be used again does not mean
it won't be, so when there is little to no cache pressure, we
shouldn't go discarding potentially useful data.

I'd like to change this to simply force the pages to the inactive
list, so they will be reclaimed sooner than other pages, but not
immediately discarded, or written out.

Also the related POSIX_FADV_NOREUSE is currently unimplemented, and
this should also cause the cache pages to skip the active list and go
straight to the inactive list.

Any thoughts or hints on how to go about doing this?

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRJ82ZAAoJEJrBOlT6nu75i3YIAMjrwhzL28m/WbsD4m2BQaX9
swz0OlO9AimoQLE0vvbYSRYFmlGAayQafIOJU1GiLSijPGmHqisOePZpWnCKbesP
PeoHFxC+jDNHGrmIDHGOgq7ELAX6DNh5yU1sBhvo7iSnDCfjdfvJP7wWNyzCD/bD
FT7bEgQ1vjd6bB3812Qj3PBs/UHvHUj8zAJDAiArqMJSW6LgxINzjyXs030NRqxS
A1RUVUJ/4ydJz7SS4uitFWmObrpImIt6oxpQnIb1SOzL67KNx/YwMgWq/hknAS3H
ravePc2VwH2aS/gcyo2VW3OLHlIXOxgbjhZWbKidkNv6KsccEqqY8yFeO+fCvjA=
=dsVO
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: POSIX_FADV_DONTNEED implemented wrong
  2013-02-22 19:57 POSIX_FADV_DONTNEED implemented wrong Phillip Susi
@ 2013-02-22 20:29 ` Johannes Weiner
  2013-02-22 21:52   ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2013-02-22 20:29 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm, Minchan Kim

On Fri, Feb 22, 2013 at 02:57:15PM -0500, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> I believe the current implementation for this is wrong.  For clean
> pages, it immediately discards them from the cache, and for dirty
> ones, it immediately tries to initiate writeout if the bdi is not
> congested.  I believe this is wrong for three reasons:
> 
> 1)  It is completely useless for writing files.  This hint should
> allow a program generating lots of writes to files that will not
> likely be read again to reduce the cache pressure that causes.

Wouldn't direct IO make more sense in that case?

> 2)  When there is little to no cache pressure, this hint should not
> cause the disk to spin up.

It's a hard problem to link memory pressure to writeback in a smart
way, we haven't been all too successful with it.  But maybe it makes
sense to remove the writeout in fadvise, since the user can do that up
front if the user is willing to give up throughput and energy for
memory.

> 3)  This is supposed to be a hint that caching this data is unlikely
> to do any good, so the cache should favor other data instead.  Just
> because one process does not think it will be used again does not mean
> it won't be, so when there is little to no cache pressure, we
> shouldn't go discarding potentially useful data.
> 
> I'd like to change this to simply force the pages to the inactive
> list, so they will be reclaimed sooner than other pages, but not
> immediately discarded, or written out.

Minchan worked on deactivating pages on truncation.  Maybe all it
takes is to implement deactivate_mapping_range() or something to
combine a page cache walk with deactivate_page().

While you are at it, madvise(MADV_DONTNEED) does not do anything to
the page cache, but it probably should.  :-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: POSIX_FADV_DONTNEED implemented wrong
  2013-02-22 20:29 ` Johannes Weiner
@ 2013-02-22 21:52   ` Phillip Susi
  2013-02-23 22:57     ` [PATCH 0/2] FADV_DONTNEED and FADV_NOREUSE Phillip Susi
                       ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-22 21:52 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, Minchan Kim

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/22/2013 3:29 PM, Johannes Weiner wrote:
>> 1)  It is completely useless for writing files.  This hint
>> should allow a program generating lots of writes to files that
>> will not likely be read again to reduce the cache pressure that
>> causes.
> 
> Wouldn't direct IO make more sense in that case?

It could, but doing direct aio is a lot more complicated than just using
posix_fadvise. Also it has problem #3: even if it is unlikely, it
*may* be used again, so it makes sense to cache it if we have plenty
of ram.

> Minchan worked on deactivating pages on truncation.  Maybe all it 
> takes is to implement deactivate_mapping_range() or something to 
> combine a page cache walk with deactivate_page().

Looks like a good idea!

> While you are at it, madvise(MADV_DONTNEED) does not do anything
> to the page cache, but it probably should.  :-)

It seems to be implemented by discarding the pages, even if dirty.
This also seems to be wrong.  According to posix, this is a hint that
it will not access the pages again any time *soon*, not that the data
will never be needed again and so it can be discarded.

It looks like MADV_SEQUENTIAL is missing the second part of its
implementation: making sure the pages will be discarded soon after
they are accessed.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRJ+i3AAoJEJrBOlT6nu759/IIAMbFAjJGmI7hkpt1GMtUfty8
n72BoygV3bDULpZ8BybJonRfiVw/ze9sVf6KhojB2dm9bvdZHl11DDGvf9Ro8OSr
dcjWUQxbHzuzGtUtnUEZPOAj6Ux6cetBtmjUxjnLyJrijK+W+cEHzWnzUZXWddlo
XcPe3IHNmj7YlTH+tcPevLCeTlzfFkjq/t4JXuZWmFW97MmMe5wTCScS0eiBYpHM
SVVL+VJ8TPG9Hnk/9oP0RqAyg+SjshGfaqhM8mTFvS4FtMbp/gXFz8GnewxG322h
ZdgwZqafiWsNeC8KitcTwKlxMU5fWFDLfHXKoFWgX2P7hVh8zJ9T6Ugi4KeaAN4=
=RFzN
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 0/2] FADV_DONTNEED and FADV_NOREUSE
  2013-02-22 21:52   ` Phillip Susi
@ 2013-02-23 22:57     ` Phillip Susi
  2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
  2013-02-23 22:58     ` [PATCH 2/2] mm: fadvise: implement POSIX_FADV_NOREUSE Phillip Susi
  2 siblings, 0 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-23 22:57 UTC (permalink / raw)
  To: linux-mm

How do the two of these look?

Phillip Susi (2):
  mm: fadvise: fix POSIX_FADV_DONTNEED
  mm: fadvise: implement POSIX_FADV_NOREUSE

 include/linux/fs.h |  5 +++++
 mm/fadvise.c       |  9 +++------
 mm/filemap.c       | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 59 insertions(+), 9 deletions(-)

-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-22 21:52   ` Phillip Susi
  2013-02-23 22:57     ` [PATCH 0/2] FADV_DONTNEED and FADV_NOREUSE Phillip Susi
@ 2013-02-23 22:58     ` Phillip Susi
  2013-02-24  1:46       ` Dave Hansen
                         ` (2 more replies)
  2013-02-23 22:58     ` [PATCH 2/2] mm: fadvise: implement POSIX_FADV_NOREUSE Phillip Susi
  2 siblings, 3 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-23 22:58 UTC (permalink / raw)
  To: linux-mm

The previous implementation initiated writeout for a non congested bdi, and
then discarded any clean pages.   This had 3 problems:

1) The writeout would spin up the disk unnecessarily
2) Discarding pages under low cache pressure is a waste
3) It was useless on files being written, and thus full of dirty pages

Now we just move the pages to the inactive list so they will be reclaimed
sooner.
---
 include/linux/fs.h |  2 ++
 mm/fadvise.c       |  8 ++------
 mm/filemap.c       | 43 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7d2e893..2abd193 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2198,6 +2198,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
+extern void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
+				     pgoff_t end);
 
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index a47f0f5..fbd58b0 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -112,17 +112,13 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			__filemap_fdatawrite_range(mapping, offset, endbyte,
-						   WB_SYNC_NONE);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);
 
 		if (end_index >= start_index)
-			invalidate_mapping_pages(mapping, start_index,
-						end_index);
+			filemap_deactivate_range(mapping, start_index,
+						 end_index);
 		break;
 	default:
 		ret = -EINVAL;
diff --git a/mm/filemap.c b/mm/filemap.c
index c610076..bcdcdbf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -217,7 +217,49 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	return ret;
 }
 
+/**
+ * filemap_deactivate_range - moves pages in range to the inactive list
+ * @mapping:	the address_space which holds the pages to deactivate
+ * @start:	offset where the range starts
+ * @end:	offset where the range ends (inclusive)
+ */
+void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
+			      pgoff_t end)
+{
+	struct pagevec pvec;
+	pgoff_t index = start;
+	int i;
+
+	/*
+	 * Note: this function may get called on a shmem/tmpfs mapping:
+	 * pagevec_lookup() might then return 0 prematurely (because it
+	 * got a gangful of swap entries); but it's hardly worth worrying
+	 * about - it can rarely have anything to free from such a mapping
+	 * (most pages are dirty), and already skips over any difficulties.
+	 */
+
+	pagevec_init(&pvec, 0);
+	while (index <= end && pagevec_lookup(&pvec, mapping, index,
+			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+		mem_cgroup_uncharge_start();
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			struct page *page = pvec.pages[i];
+
+			/* We rely upon deletion not changing page->index */
+			index = page->index;
+			if (index > end)
+				break;
+
+			WARN_ON(page->index != index);
+			deactivate_page(page);
+		}
+		pagevec_release(&pvec);
+		mem_cgroup_uncharge_end();
+		cond_resched();
+		index++;
+	}
+}
+
 static inline int __filemap_fdatawrite(struct address_space *mapping,
 	int sync_mode)
 {
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] mm: fadvise: implement POSIX_FADV_NOREUSE
  2013-02-22 21:52   ` Phillip Susi
  2013-02-23 22:57     ` [PATCH 0/2] FADV_DONTNEED and FADV_NOREUSE Phillip Susi
  2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
@ 2013-02-23 22:58     ` Phillip Susi
  2 siblings, 0 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-23 22:58 UTC (permalink / raw)
  To: linux-mm

This hint was a nop, now it causes the read/write path to deactivate pages
after reading/writing them, so they will be reclaimed sooner.
---
 include/linux/fs.h |  3 +++
 mm/fadvise.c       |  1 +
 mm/filemap.c       | 11 ++++++++---
 3 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2abd193..de26ee3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -121,6 +121,9 @@ typedef void (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File is opened with O_PATH; almost nothing can be done with it */
 #define FMODE_PATH		((__force fmode_t)0x4000)
 
+/* POSIX_FADV_NOREUSE has been set */
+#define FMODE_NOREUSE		((__force fmode_t)0x8000)
+
 /* File was opened by fanotify and shouldn't generate fanotify events */
 #define FMODE_NONOTIFY		((__force fmode_t)0x1000000)
 
diff --git a/mm/fadvise.c b/mm/fadvise.c
index fbd58b0..b42bf5b 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -110,6 +110,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
 					   nrpages);
 		break;
 	case POSIX_FADV_NOREUSE:
+		f.file->f_mode |= FMODE_NOREUSE;
 		break;
 	case POSIX_FADV_DONTNEED:
 		/* First and last FULL page! */
diff --git a/mm/filemap.c b/mm/filemap.c
index bcdcdbf..cba2f41 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1215,8 +1215,11 @@ page_ok:
 		 * When a sequential read accesses a page several times,
 		 * only mark it as accessed the first time.
 		 */
-		if (prev_index != index || offset != prev_offset)
-			mark_page_accessed(page);
+		if (prev_index != index || offset != prev_offset) {
+			if (filp->f_mode & FMODE_NOREUSE)
+				deactivate_page(page);
+			else mark_page_accessed(page);
+		}
 		prev_index = index;
 
 		/*
@@ -2378,7 +2381,9 @@ again:
 		pagefault_enable();
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
+		if (file->f_mode & FMODE_NOREUSE)
+			deactivate_page(page);
+		else mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
@ 2013-02-24  1:46       ` Dave Hansen
  2013-02-24  3:37         ` Phillip Susi
  2013-02-24  3:58       ` Zheng Liu
  2013-02-26  4:21       ` Minchan Kim
  2 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2013-02-24  1:46 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm

On 02/23/2013 02:58 PM, Phillip Susi wrote:
> 2) Discarding pages under low cache pressure is a waste
> 3) It was useless on files being written, and thus full of dirty pages
> 
> Now we just move the pages to the inactive list so they will be reclaimed
> sooner.

Folks actually use this in practice to flush the page cache out:

http://git.sr71.net/?p=eyefi-config.git;a=blob;f=eyefi-linux.c;h=b77a891995109f6caa288925a13985cc495d7b2d;hb=HEAD#l62

I have really good reasons for really wanting to be _rid_ of the page
cache no matter how much memory pressure there is.

I've seen people at IBM using this to ensure that they stay out of
memory reclaim completely.  I don't completely agree with the approach,
but this would completely ruin their performance since the VM-initiated
writeout is so relatively slow for them.

I think this patch is a really bad idea.  If you want the behavior
you're proposing, I'd suggest using another flag.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24  1:46       ` Dave Hansen
@ 2013-02-24  3:37         ` Phillip Susi
  2013-02-24 18:24           ` Dave Hansen
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2013-02-24  3:37 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2013 08:46 PM, Dave Hansen wrote:
> Folks actually use this in practice to flush the page cache out:
> 
> http://git.sr71.net/?p=eyefi-config.git;a=blob;f=eyefi-linux.c;h=b77a891995109f6caa288925a13985cc495d7b2d;hb=HEAD#l62
>
>  I have really good reasons for really wanting to be _rid_ of the
> page cache no matter how much memory pressure there is.
> 
> I've seen people at IBM using this to ensure that they stay out of 
> memory reclaim completely.  I don't completely agree with the
> approach, but this would completely ruin their performance since
> the VM-initiated writeout is so relatively slow for them.
> 
> I think this patch is a really bad idea.  If you want the behavior 
> you're proposing, I'd suggest using another flag.

This is the correct behavior prescribed by posix.  If you have been
using it for that purpose in the past, then you were using the wrong
syscall.  If you want to begin writeout now, then you should be using
sync_file_range().  As it was, it only initiated writeout if the
backing device was not already congested, which is going to no longer
be the case rather soon if you ( or other tasks ) are writing
significant amounts of data.

If you really want to stay out of memory reclaim entirely, then you
should be using O_DIRECT.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJRKYsKAAoJEJrBOlT6nu75nLAH/AyZetl9eFqSsXEXoSVsmimW
ih9Nwlhqy1g4zSuThHWIS41t2XQ6vrwh7NDkGdFSwJ0GWVoWIFu5E31LofbCQEYk
ApsTrUflUk/Cn/82oCCBzxv9G4RrmG+ywcz9SCG62uOHs3+e2525+aPzt0mPMsBR
672J5wPXV59NmEp2jNl2VFObnBQBWKxQR9xFfZ/jzvtjW+KtVvg+G4eG+3gFGfqi
gExlAnh6V05AS9ut7GUNDhWkJky/2qQl7sE53NbYC738f6I70vF38IMF68Taojcw
kWWW3gc8tZvhlYVnZqWqbK9Yz7+fBxca73ELtCI5i89gcV6VBekdFTqjq4HnWyg=
=7et5
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
  2013-02-24  1:46       ` Dave Hansen
@ 2013-02-24  3:58       ` Zheng Liu
  2013-02-24  4:04         ` Phillip Susi
  2013-02-26  4:21       ` Minchan Kim
  2 siblings, 1 reply; 18+ messages in thread
From: Zheng Liu @ 2013-02-24  3:58 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm

On Sat, Feb 23, 2013 at 05:58:00PM -0500, Phillip Susi wrote:
> The previous implementation initiated writeout for a non congested bdi, and
> then discarded any clean pages.   This had 3 problems:
> 
> 1) The writeout would spin up the disk unnecessarily
> 2) Discarding pages under low cache pressure is a waste
> 3) It was useless on files being written, and thus full of dirty pages
> 
> Now we just move the pages to the inactive list so they will be reclaimed
> sooner.

Hi Phillip,

I think we need to initiate writeout.  IIRC, when we try to free pages,
we would wait on page writeback.  That will cause a huge latency for
some applications.  If these pages have been written out, we just need
to invalidate them.  IMO we can move these pages to inactive list and
write them out.

Regards,
                                                - Zheng

> ---
>  include/linux/fs.h |  2 ++
>  mm/fadvise.c       |  8 ++------
>  mm/filemap.c       | 43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7d2e893..2abd193 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2198,6 +2198,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
>  				loff_t start, loff_t end, int sync_mode);
>  extern int filemap_fdatawrite_range(struct address_space *mapping,
>  				loff_t start, loff_t end);
> +extern void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> +				     pgoff_t end);
>  
>  extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
>  			   int datasync);
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index a47f0f5..fbd58b0 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -112,17 +112,13 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>  	case POSIX_FADV_NOREUSE:
>  		break;
>  	case POSIX_FADV_DONTNEED:
> -		if (!bdi_write_congested(mapping->backing_dev_info))
> -			__filemap_fdatawrite_range(mapping, offset, endbyte,
> -						   WB_SYNC_NONE);
> -
>  		/* First and last FULL page! */
>  		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
>  		end_index = (endbyte >> PAGE_CACHE_SHIFT);
>  
>  		if (end_index >= start_index)
> -			invalidate_mapping_pages(mapping, start_index,
> -						end_index);
> +			filemap_deactivate_range(mapping, start_index,
> +						 end_index);
>  		break;
>  	default:
>  		ret = -EINVAL;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c610076..bcdcdbf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -217,7 +217,49 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  	return ret;
>  }
>  
> +/**
> + * filemap_deactivate_range - moves pages in range to the inactive list
> + * @mapping:	the address_space which holds the pages to deactivate
> + * @start:	offset where the range starts
> + * @end:	offset where the range ends (inclusive)
> + */
> +void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> +			      pgoff_t end)
> +{
> +	struct pagevec pvec;
> +	pgoff_t index = start;
> +	int i;
> +
> +	/*
> +	 * Note: this function may get called on a shmem/tmpfs mapping:
> +	 * pagevec_lookup() might then return 0 prematurely (because it
> +	 * got a gangful of swap entries); but it's hardly worth worrying
> +	 * about - it can rarely have anything to free from such a mapping
> +	 * (most pages are dirty), and already skips over any difficulties.
> +	 */
> +
> +	pagevec_init(&pvec, 0);
> +	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +		mem_cgroup_uncharge_start();
> +		for (i = 0; i < pagevec_count(&pvec); i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* We rely upon deletion not changing page->index */
> +			index = page->index;
> +			if (index > end)
> +				break;
> +
> +			WARN_ON(page->index != index);
> +			deactivate_page(page);
> +		}
> +		pagevec_release(&pvec);
> +		mem_cgroup_uncharge_end();
> +		cond_resched();
> +		index++;
> +	}
> +}
> +
>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>  	int sync_mode)
>  {
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24  3:58       ` Zheng Liu
@ 2013-02-24  4:04         ` Phillip Susi
  0 siblings, 0 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-24  4:04 UTC (permalink / raw)
  To: Zheng Liu; +Cc: linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/23/2013 10:58 PM, Zheng Liu wrote:
> Hi Phillip,
> 
> I think we need to initiate writeout.  IIRC, when we try to free
> pages, we would wait on page writeback.  That will cause a huge
> latency for

Not really.  Most writes are initiated by the flush kernel thread.
The only way the old implementation would help applications avoid
delays in writing was if it slowly writes just the right amount of
data to use a significant amount of cache pages, but not quite enough
for the kernel flush thread to start writing it out, and then suddenly
tried to burst a lot of writes.

> some applications.  If these pages have been written out, we just
> need to invalidate them.  IMO we can move these pages to inactive
> list and write them out.

If you want to be sure writes start now, you should be using
sync_file_range().  If you combine that with posix_fadvise, then you
can be sure that writing starts now, and that the page cache will
prefer to discard that data ahead of other cached pages.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJRKZE4AAoJEJrBOlT6nu75rmcIAIBmhok62teELHqDSLUA2Mj7
bEM10Iosghsq+QqH4kWq2U7S/eA935JVN2xKduQAL0/doO0+qTHIha40Fl9b7Q2D
k17cush2Z26tk7qZcQ9zh1HKfKQ1mxReU7eSkdv7FWdkOo7DTA71yk/2Ej7Zuv+E
4Fl26HYhNQADK6t5Y1hyfpG+MebuTM/jFrfCD5RRO1cnDxrU8xK3NTEEmrooZsB1
buUfWE0Wfm9MaSvArft6YVMr0XJlCUEUwkV/0LDBGBQs+YjdawQ9wPYdNTLhiijP
Y7HEZJe5Oi9mzeQNcr9QwGIqA4dSVW6XqiGDHLMDNanXhgnEhH1S3CmsZXL37OM=
=JoEo
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24  3:37         ` Phillip Susi
@ 2013-02-24 18:24           ` Dave Hansen
  2013-02-24 20:40             ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2013-02-24 18:24 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm

On 02/23/2013 07:37 PM, Phillip Susi wrote:
> This is the correct behavior prescribed by posix.  If you have been
> using it for that purpose in the past, then you were using the wrong
> syscall.  If you want to begin writeout now, then you should be using
> sync_file_range().  As it was, it only initiated writeout if the
> backing device was not already congested, which is going to no longer
> be the case rather soon if you ( or other tasks ) are writing
> significant amounts of data.
> 
> If you really want to stay out of memory reclaim entirely, then you
> should be using O_DIRECT.

These are folks that want to use the page cache, but also want to be in
control of when it gets written out (sync_file_range() is used) and when
it goes away.  Sure, they can use O_DIRECT and do all of the buffering
internally, but that means changing the application.

I actually really like the concept behind your patch.  It looks like
very useful functionality.  I'm just saying that I know it will break
_existing_ users.

I'm actually in the process of _trying_ to extricate this particular app
from handling their own reclaim management entirely.  Your patch looks
like a nice part of the puzzle.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24 18:24           ` Dave Hansen
@ 2013-02-24 20:40             ` Phillip Susi
  2013-02-24 21:25               ` Dave Hansen
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2013-02-24 20:40 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/24/2013 01:24 PM, Dave Hansen wrote:
> These are folks that want to use the page cache, but also want to
> be in control of when it gets written out (sync_file_range() is
> used) and when it goes away.  Sure, they can use O_DIRECT and do
> all of the buffering internally, but that means changing the
> application.
> 
> I actually really like the concept behind your patch.  It looks
> like very useful functionality.  I'm just saying that I know it
> will break _existing_ users.

I'm not seeing how it will break anything.  Which aspect of the
current behavior is the app relying on?  If it is the immediate
removal of clean pages from the cache, then it should not care about
the new behavior since the pages will still be removed very soon when
under high cache pressure.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJRKnrEAAoJEJrBOlT6nu75swEIALnyhEwJ38Q6UUIfwFZcOgGm
J1HF6e0jvoDmcjqwC+bInmnaYVtsbeimGZSbugxOTHw+pwNiV7twPf+b6KOrPt6F
GzVpHtVP2dCrrnhsWwCjIcJYBDOlRx2lpVEiOWPE6WpH2O8/GmlTadCx+bWjndbg
0lIdbmhaBOIlI2jWaSen0xWVaJM9Peh5cA7hS8lZOYYSckiKbZ1fsLV378zc8ltp
yC39SzZ0JuAfJfYqGI56fWfOdwHLbZiyYB8VmKIRsGtHU89ITvWH8vF7h5pf9VaV
cwdrNa4d2aLrpy95O2gMW0V+G+0lFDrpUszZets0u5r6ihi9jjt/akyImIO4U58=
=qMk5
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24 20:40             ` Phillip Susi
@ 2013-02-24 21:25               ` Dave Hansen
  2013-02-24 22:38                 ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Hansen @ 2013-02-24 21:25 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm

On 02/24/2013 12:40 PM, Phillip Susi wrote:
>> > I actually really like the concept behind your patch.  It looks
>> > like very useful functionality.  I'm just saying that I know it
>> > will break _existing_ users.
> I'm not seeing how it will break anything.  Which aspect of the
> current behavior is the app relying on?  If it is the immediate
> removal of clean pages from the cache, then it should not care about
> the new behavior since the pages will still be removed very soon when
> under high cache pressure.

Essentially, they don't want any I/O initiated except that which is
initiated by the app.  If you let the system get in to reclaim, it'll
start doing dirty writeout for pages other than those the app is
interested in.

I'm also not sure how far the "just use O_DIRECT" argument is going to go:

	https://lkml.org/lkml/2007/1/10/233

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24 21:25               ` Dave Hansen
@ 2013-02-24 22:38                 ` Phillip Susi
  2013-02-25 17:50                   ` Dave Hansen
  0 siblings, 1 reply; 18+ messages in thread
From: Phillip Susi @ 2013-02-24 22:38 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 02/24/2013 04:25 PM, Dave Hansen wrote:
> Essentially, they don't want any I/O initiated except that which
> is initiated by the app.  If you let the system get in to reclaim,
> it'll start doing dirty writeout for pages other than those the app
> is interested in.

Are you talking about IO initiated by the app, or other tasks in the
system?  If the former then it won't be affected by this change.  For
the latter, you don't really have control over that now since the
pages very well may cause other writeouts before you get around to
calling fadvise.  The extent to which the fadvise call mitigates
pushing writes from other applications out of the cache is only
slightly affected by this patch.  Specifically it may cause other
pages that already got pushed to the inactive list to be flushed, but
they were already very close to that before.

If you use POSIX_FADV_NOREUSE instead, then you will end up at least
as well off as before, possibly better since it takes effect at write
time instead of some later call to fadvise, after having synced the pages.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJRKpZZAAoJEJrBOlT6nu75VpMH/1Lu37g/y6uRs4cfnAlKritI
P5Jk2cDO9Bc/DrdyHlbDxI45FnuOr/4KCQfRvWpbSArqdpIWZdvUI1uUJ8D71+MH
xKuJrdF4Z1tXXpcNAvGTN6bhTuD+mdDJOkQG+YvcIEKUPvlZHVpswsmddVkLmnRm
CZPwzEuZ52dU9YyLEAQu+XyirBwrLnTaGfwVtY6qkB8Ts5SxOMMrkq+X+sRxe/6e
fnuA+1hMpbXLHEJh+Q4xGWK9BnMahA96/0VJgRGzRmTHnenrjO3z+n7GKr63W+1U
2t3EYtgB9pHBUmrdv/AtFFe8ciTtgZuj9sMoFYJdOn+6BHy76jmiivQYJhXPqSQ=
=hrZY
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-24 22:38                 ` Phillip Susi
@ 2013-02-25 17:50                   ` Dave Hansen
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Hansen @ 2013-02-25 17:50 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm

On 02/24/2013 02:38 PM, Phillip Susi wrote:
> On 02/24/2013 04:25 PM, Dave Hansen wrote:
>> Essentially, they don't want any I/O initiated except that which
>> is initiated by the app.  If you let the system get in to reclaim,
>> it'll start doing dirty writeout for pages other than those the app
>> is interested in.
> 
> Are you talking about IO initiated by the app, or other tasks in the
> system?  If the former then it won't be affected by this change.

Once we go in to reclaim, we'll start writeback on dirty pages.  The
VM's writeback patterns are not as efficient as if the app itself was
initiating them from sync_file_range(), and we see massive throughput
loss on the disk.

>From the looks of your patch, deactivating all the pages (if clean of
course) will get them on the LRU, and should keep any direct-reclaiming
tasks from actually going and initiating any additional I/O.

It looks like a promising approach, at least theoretically.  I'll
definitely test and see if it works in practice.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
  2013-02-24  1:46       ` Dave Hansen
  2013-02-24  3:58       ` Zheng Liu
@ 2013-02-26  4:21       ` Minchan Kim
  2013-02-26 14:06         ` Andrea Righi
  2 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2013-02-26  4:21 UTC (permalink / raw)
  To: Phillip Susi; +Cc: linux-mm, andrea, Andrew Morton

Hello,

On Sat, Feb 23, 2013 at 05:58:00PM -0500, Phillip Susi wrote:
> The previous implementation initiated writeout for a non congested bdi, and
> then discarded any clean pages.   This had 3 problems:
> 
> 1) The writeout would spin up the disk unnecessarily
> 2) Discarding pages under low cache pressure is a waste
> 3) It was useless on files being written, and thus full of dirty pages
> 
> Now we just move the pages to the inactive list so they will be reclaimed
> sooner.
> ---
>  include/linux/fs.h |  2 ++
>  mm/fadvise.c       |  8 ++------
>  mm/filemap.c       | 43 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7d2e893..2abd193 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2198,6 +2198,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
>  				loff_t start, loff_t end, int sync_mode);
>  extern int filemap_fdatawrite_range(struct address_space *mapping,
>  				loff_t start, loff_t end);
> +extern void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> +				     pgoff_t end);
>  
>  extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
>  			   int datasync);
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index a47f0f5..fbd58b0 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -112,17 +112,13 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
>  	case POSIX_FADV_NOREUSE:
>  		break;
>  	case POSIX_FADV_DONTNEED:
> -		if (!bdi_write_congested(mapping->backing_dev_info))
> -			__filemap_fdatawrite_range(mapping, offset, endbyte,
> -						   WB_SYNC_NONE);
> -
>  		/* First and last FULL page! */
>  		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
>  		end_index = (endbyte >> PAGE_CACHE_SHIFT);
>  
>  		if (end_index >= start_index)
> -			invalidate_mapping_pages(mapping, start_index,
> -						end_index);
> +			filemap_deactivate_range(mapping, start_index,
> +						 end_index);
>  		break;
>  	default:
>  		ret = -EINVAL;
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c610076..bcdcdbf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -217,7 +217,49 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
>  	return ret;
>  }
>  
> +/**
> + * filemap_deactivate_range - moves pages in range to the inactive list
> + * @mapping:	the address_space which holds the pages to deactivate
> + * @start:	offset where the range starts
> + * @end:	offset where the range ends (inclusive)
> + */
> +void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> +			      pgoff_t end)
> +{
> +	struct pagevec pvec;
> +	pgoff_t index = start;
> +	int i;
> +
> +	/*
> +	 * Note: this function may get called on a shmem/tmpfs mapping:
> +	 * pagevec_lookup() might then return 0 prematurely (because it
> +	 * got a gangful of swap entries); but it's hardly worth worrying
> +	 * about - it can rarely have anything to free from such a mapping
> +	 * (most pages are dirty), and already skips over any difficulties.
> +	 */
> +
> +	pagevec_init(&pvec, 0);
> +	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> +		mem_cgroup_uncharge_start();
> +		for (i = 0; i < pagevec_count(&pvec); i++) {
> +			struct page *page = pvec.pages[i];
> +
> +			/* We rely upon deletion not changing page->index */
> +			index = page->index;
> +			if (index > end)
> +				break;
> +
> +			WARN_ON(page->index != index);
> +			deactivate_page(page);
> +		}
> +		pagevec_release(&pvec);
> +		mem_cgroup_uncharge_end();
> +		cond_resched();
> +		index++;
> +	}
> +}
> +
>  static inline int __filemap_fdatawrite(struct address_space *mapping,
>  	int sync_mode)
>  {
> -- 
> 1.8.1.2
> 

Just FYI,
there was a person tried to solve similar problem, Andrea Righi. Ccing him,

Personally, I like this but unfortunately maintainer didn't like it
due to breaking compatibility and I understand.
https://lkml.org/lkml/2011/6/28/468

You might need another round with akpm. :(

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-26  4:21       ` Minchan Kim
@ 2013-02-26 14:06         ` Andrea Righi
  2013-02-26 15:39           ` Phillip Susi
  0 siblings, 1 reply; 18+ messages in thread
From: Andrea Righi @ 2013-02-26 14:06 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Phillip Susi, linux-mm, Andrew Morton, Pádraig Brady

On Tue, Feb 26, 2013 at 01:21:23PM +0900, Minchan Kim wrote:
> Hello,
> 
> On Sat, Feb 23, 2013 at 05:58:00PM -0500, Phillip Susi wrote:
> > The previous implementation initiated writeout for a non congested bdi, and
> > then discarded any clean pages.   This had 3 problems:
> > 
> > 1) The writeout would spin up the disk unnecessarily
> > 2) Discarding pages under low cache pressure is a waste
> > 3) It was useless on files being written, and thus full of dirty pages
> > 
> > Now we just move the pages to the inactive list so they will be reclaimed
> > sooner.
> > ---
> >  include/linux/fs.h |  2 ++
> >  mm/fadvise.c       |  8 ++------
> >  mm/filemap.c       | 43 +++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 47 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 7d2e893..2abd193 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2198,6 +2198,8 @@ extern int __filemap_fdatawrite_range(struct address_space *mapping,
> >  				loff_t start, loff_t end, int sync_mode);
> >  extern int filemap_fdatawrite_range(struct address_space *mapping,
> >  				loff_t start, loff_t end);
> > +extern void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> > +				     pgoff_t end);
> >  
> >  extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
> >  			   int datasync);
> > diff --git a/mm/fadvise.c b/mm/fadvise.c
> > index a47f0f5..fbd58b0 100644
> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -112,17 +112,13 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >  	case POSIX_FADV_NOREUSE:
> >  		break;
> >  	case POSIX_FADV_DONTNEED:
> > -		if (!bdi_write_congested(mapping->backing_dev_info))
> > -			__filemap_fdatawrite_range(mapping, offset, endbyte,
> > -						   WB_SYNC_NONE);
> > -
> >  		/* First and last FULL page! */
> >  		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
> >  		end_index = (endbyte >> PAGE_CACHE_SHIFT);
> >  
> >  		if (end_index >= start_index)
> > -			invalidate_mapping_pages(mapping, start_index,
> > -						end_index);
> > +			filemap_deactivate_range(mapping, start_index,
> > +						 end_index);
> >  		break;
> >  	default:
> >  		ret = -EINVAL;
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c610076..bcdcdbf 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -217,7 +217,49 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> >  	return ret;
> >  }
> >  
> > +/**
> > + * filemap_deactivate_range - moves pages in range to the inactive list
> > + * @mapping:	the address_space which holds the pages to deactivate
> > + * @start:	offset where the range starts
> > + * @end:	offset where the range ends (inclusive)
> > + */
> > +void filemap_deactivate_range(struct address_space *mapping, pgoff_t start,
> > +			      pgoff_t end)
> > +{
> > +	struct pagevec pvec;
> > +	pgoff_t index = start;
> > +	int i;
> > +
> > +	/*
> > +	 * Note: this function may get called on a shmem/tmpfs mapping:
> > +	 * pagevec_lookup() might then return 0 prematurely (because it
> > +	 * got a gangful of swap entries); but it's hardly worth worrying
> > +	 * about - it can rarely have anything to free from such a mapping
> > +	 * (most pages are dirty), and already skips over any difficulties.
> > +	 */
> > +
> > +	pagevec_init(&pvec, 0);
> > +	while (index <= end && pagevec_lookup(&pvec, mapping, index,
> > +			min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
> > +		mem_cgroup_uncharge_start();
> > +		for (i = 0; i < pagevec_count(&pvec); i++) {
> > +			struct page *page = pvec.pages[i];
> > +
> > +			/* We rely upon deletion not changing page->index */
> > +			index = page->index;
> > +			if (index > end)
> > +				break;
> > +
> > +			WARN_ON(page->index != index);
> > +			deactivate_page(page);
> > +		}
> > +		pagevec_release(&pvec);
> > +		mem_cgroup_uncharge_end();
> > +		cond_resched();
> > +		index++;
> > +	}
> > +}
> > +
> >  static inline int __filemap_fdatawrite(struct address_space *mapping,
> >  	int sync_mode)
> >  {
> > -- 
> > 1.8.1.2
> > 
> 
> Just FYI,
> there was a person tried to solve similar problem, Andrea Righi. Ccing him,

Thanks, Minchan.

> 
> Personally, I like this but unfortunately maintainer didn't like it
> due to breaking compatibility and I understand.
> https://lkml.org/lkml/2011/6/28/468
> 
> You might need another round with akpm. :(

I also like this approach, it looks very similar to the one that I
proposed a long time ago. However, last time we ended up saying that the
next step should have been a proposal for a better page cache management
interface for the userland, adding more fadvise() flags, obviously
without breaking the current behavior.

We started with these ideal requirements, but unfortunately I didn't go
ahead with this project:
http://marc.info/?l=linux-kernel&m=130917619416123&w=2

About breaking the compatibility, keep in mind that even tools like dd,
for example, has been modified to support invalidating the cache for a
file via POSIX_FADV_DONTNEED:
http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=5f311553

And it expects to discard cache for the target pages, when possible,
even if POSIX just says that it will not access the pages again any time
soon.

So, in any case, I think that it would be safer if we don't touch the
current POSIX_FADV_DONTNEED implementation. But we could add add more
Linux-specific flags for example, and maybe call them LINUX_FADV_xxx
rather than POSIX_FADV_xxx.

Another idea would be to explore if it's possible to plug this feature
into the memory cgroup controller...

-Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED
  2013-02-26 14:06         ` Andrea Righi
@ 2013-02-26 15:39           ` Phillip Susi
  0 siblings, 0 replies; 18+ messages in thread
From: Phillip Susi @ 2013-02-26 15:39 UTC (permalink / raw)
  To: Andrea Righi; +Cc: Minchan Kim, linux-mm, Andrew Morton, Pádraig Brady

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 2/26/2013 9:06 AM, Andrea Righi wrote:
> I also like this approach, it looks very similar to the one that I 
> proposed a long time ago. However, last time we ended up saying
> that the next step should have been a proposal for a better page
> cache management interface for the userland, adding more fadvise()
> flags, obviously without breaking the current behavior.

If someone wants to add more flags, good for them, but how about we
get the ones we have right first? ;)

> We started with these ideal requirements, but unfortunately I
> didn't go ahead with this project: 
> http://marc.info/?l=linux-kernel&m=130917619416123&w=2
> 
> About breaking the compatibility, keep in mind that even tools like
> dd, for example, has been modified to support invalidating the
> cache for a file via POSIX_FADV_DONTNEED: 
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=5f311553

I
> 
don't see how dd would be harmed by this change.

> And it expects to discard cache for the target pages, when
> possible, even if POSIX just says that it will not access the pages
> again any time soon.

Other than the description for the human user, I don't see how it
actually has this expectation.

In fact, when under high cache pressure, the description would still
be essentially correct since the pages will be discarded, just not
necessarily by the time the syscall returns.


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRLNc8AAoJEJrBOlT6nu75yHIIAMQzRiTW0jgTTU+sICmWtMjE
klHGX0NtnXMirs9imkOUkSRhpCpS02dxrZUEm0GfMSbKBgYIQXUOChTzY9jBCghj
A4vJ697NS2UaLETtx1FXGRoaPvDD3VWYDL5gtzE4W05tnmim2QdjBGqfBPcHr9nL
RO586QUpiq66Fv15QdzIevMXrWEvBuyJKRQA/Hln2Sirmy8vZiEpa0O+qew35217
W7NgPsc37b/uGK2sEJsxP6tO6wnf7absk1laZJrCsHkNNGjGLYKBfY2ASs7OMsAB
xDXNap0eyFoWChSlMkbLaaBNdAHN/9EqkkeoN/WyiGA/ePYqAxISrb8EnSDVD1E=
=aFFY
-----END PGP SIGNATURE-----

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2013-02-26 15:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-22 19:57 POSIX_FADV_DONTNEED implemented wrong Phillip Susi
2013-02-22 20:29 ` Johannes Weiner
2013-02-22 21:52   ` Phillip Susi
2013-02-23 22:57     ` [PATCH 0/2] FADV_DONTNEED and FADV_NOREUSE Phillip Susi
2013-02-23 22:58     ` [PATCH 1/2] mm: fadvise: fix POSIX_FADV_DONTNEED Phillip Susi
2013-02-24  1:46       ` Dave Hansen
2013-02-24  3:37         ` Phillip Susi
2013-02-24 18:24           ` Dave Hansen
2013-02-24 20:40             ` Phillip Susi
2013-02-24 21:25               ` Dave Hansen
2013-02-24 22:38                 ` Phillip Susi
2013-02-25 17:50                   ` Dave Hansen
2013-02-24  3:58       ` Zheng Liu
2013-02-24  4:04         ` Phillip Susi
2013-02-26  4:21       ` Minchan Kim
2013-02-26 14:06         ` Andrea Righi
2013-02-26 15:39           ` Phillip Susi
2013-02-23 22:58     ` [PATCH 2/2] mm: fadvise: implement POSIX_FADV_NOREUSE Phillip Susi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.