linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Fadvise: Directory level page cache cleaning support
@ 2013-12-30 13:45 Li Wang
  2013-12-30 13:45 ` [PATCH 1/3] VFS: Add the declaration of shrink_pagecache_parent Li Wang
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Li Wang @ 2013-12-30 13:45 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-mm, linux-kernel, Andrew Morton, Cong Wang,
	Zefan Li, Matthew Wilcox, Li Wang

VFS relies on LRU-like page cache eviction algorithm
to reclaim cache space, such general and simple algorithm 
is good regarding its application independence, and is working 
for normal situations. However, sometimes it does not help much
for those applications which are performance sensitive or under 
heavy loads. Since LRU may incorrectly evict going-to-be referenced 
pages out, resulting in severe performance degradation due to 
cache thrashing. Applications have the most knowledge
about the things they are doing, they can always do better if
they are given a chance. This motivates to endow the applications 
more abilities to manipulate the page cache.

Currently, Linux support file system wide cache cleaing by virtue of
proc interface 'drop-caches', but it is very coarse granularity and
was originally proposed for debugging. The other is to do file-level
page cache cleaning through 'fadvise', however, this is sometimes less 
flexible and not easy to use especially in directory wide operations or 
under massive small-file situations.

This patch extends 'fadvise' to support directory level page cache
cleaning. The call to posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) 
with 'fd' referring to a directory will recursively reclaim page cache 
entries of files inside 'fd'. For secruity concern, those inodes
which the caller does not own appropriate permissions will not 
be manipulated.

It is easy to demonstrate the advantages of directory level page 
cache cleaning. We use a machine with a Pentium(R) Dual-Core CPU 
E5800 @ 3.20GHz, and with 2GB memory. Two directories named '1' 
and '3' are created, with each containing X (360 - 460) files, 
and each file with a size of 2MB. The test scripts are as follows,

The test scripts (without cache cleaning)
#!/bin/bash
cp -r 1 2
sync
cp -r 3 4
sync
time grep "data" 1/*

The time on 'grep "data" 1/*' is measured
with/without cache cleaning, under different file counts.
With cache cleaning, we clean all cache entries of files
in '2' before doing 'cp -r 3 4' by using pretty much
the following two statements,
fd = open("2", O_DIRECTORY, 0644);
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

The results are as follows (in seconds), 
X: Number of files inside each directory

 X       Without Cleaning     With Cleaning
360          2.385                1.361
380          3.159                1.466
400          3.972                1.558
420          4.823                1.548
440          5.798                1.702
460          6.888                2.197

The page cache is not large enough to buffer all the four
directories, so 'cp -r 3 4' will result in some
entries of '1' to be evicted (due to LRU). When re-accessing '1',
some entries need be reloaded from disk, which is time-consuming.
In this case, cleaning '2' before 'cp -r 3 4' enjoys a good
speedup. 
 
Li Wang (3):
  VFS: Add the declaration of shrink_pagecache_parent
  Add shrink_pagecache_parent
  Fadvise: Add the ability for directory level page cache cleaning

 fs/dcache.c            |   36 ++++++++++++++++++++++++++++++++++++
 include/linux/dcache.h |    1 +
 mm/fadvise.c           |    4 ++++
 3 files changed, 41 insertions(+)

-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] VFS: Add the declaration of shrink_pagecache_parent
  2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
@ 2013-12-30 13:45 ` Li Wang
  2013-12-30 13:45 ` [PATCH 2/3] Add shrink_pagecache_parent Li Wang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Li Wang @ 2013-12-30 13:45 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-mm, linux-kernel, Andrew Morton, Cong Wang,
	Zefan Li, Matthew Wilcox, Li Wang, Yunchuan Wen


Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
---
 include/linux/dcache.h |    1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index bf72e9a..6262171 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -249,6 +249,7 @@ extern struct dentry *d_find_any_alias(struct inode *inode);
 extern struct dentry * d_obtain_alias(struct inode *);
 extern void shrink_dcache_sb(struct super_block *);
 extern void shrink_dcache_parent(struct dentry *);
+extern void shrink_pagecache_parent(struct dentry *);
 extern void shrink_dcache_for_umount(struct super_block *);
 extern int d_invalidate(struct dentry *);
 
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/3] Add shrink_pagecache_parent
  2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
  2013-12-30 13:45 ` [PATCH 1/3] VFS: Add the declaration of shrink_pagecache_parent Li Wang
@ 2013-12-30 13:45 ` Li Wang
  2014-01-02 23:55   ` Andrew Morton
  2013-12-30 13:45 ` [PATCH 3/3] Fadvise: Add the ability for directory level page cache cleaning Li Wang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Li Wang @ 2013-12-30 13:45 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-mm, linux-kernel, Andrew Morton, Cong Wang,
	Zefan Li, Matthew Wilcox, Li Wang, Yunchuan Wen

Analogous to shrink_dcache_parent except that it collects inodes.
It is not very appropriate to be put in dcache.c, but d_walk can only
be invoked from here.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
---
 fs/dcache.c |   36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6055d61..0fc0f80 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
 }
 EXPORT_SYMBOL(shrink_dcache_parent);
 
+static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
+{
+	struct list_head *list = data;
+	struct inode *inode = dentry->d_inode;
+
+	if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
+				(!capable(CAP_SYS_ADMIN))))
+		goto out;
+	spin_lock(&inode->i_lock);
+	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
+		(inode->i_mapping->nrpages == 0) ||
+		(!list_empty(&inode->i_lru))) {
+			goto out_unlock;
+	}
+	__iget(inode);
+	list_add_tail(&inode->i_lru, list);
+out_unlock:
+	spin_unlock(&inode->i_lock);
+out:
+	return D_WALK_CONTINUE;
+}
+
+void shrink_pagecache_parent(struct dentry *parent)
+{
+	LIST_HEAD(list);
+	struct inode *inode, *next;
+
+	d_walk(parent, &list, gather_inode, NULL);
+	list_for_each_entry_safe(inode, next, &list, i_lru) {
+		list_del_init(&inode->i_lru);
+		invalidate_mapping_pages(inode->i_mapping, 0, -1);
+		iput(inode);
+	}
+}
+EXPORT_SYMBOL(shrink_pagecache_parent);
+
 static enum d_walk_ret umount_collect(void *_data, struct dentry *dentry)
 {
 	struct select_data *data = _data;
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/3] Fadvise: Add the ability for directory level page cache cleaning
  2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
  2013-12-30 13:45 ` [PATCH 1/3] VFS: Add the declaration of shrink_pagecache_parent Li Wang
  2013-12-30 13:45 ` [PATCH 2/3] Add shrink_pagecache_parent Li Wang
@ 2013-12-30 13:45 ` Li Wang
  2013-12-30 14:57 ` [PATCH 0/3] Fadvise: Directory level page cache cleaning support Matthew Wilcox
  2013-12-30 19:18 ` Dave Hansen
  4 siblings, 0 replies; 14+ messages in thread
From: Li Wang @ 2013-12-30 13:45 UTC (permalink / raw)
  To: Alexander Viro
  Cc: linux-fsdevel, linux-mm, linux-kernel, Andrew Morton, Cong Wang,
	Zefan Li, Matthew Wilcox, Li Wang, Yunchuan Wen


Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
---
 mm/fadvise.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..644d32d 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -113,6 +113,10 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
+		if (S_ISDIR(file_inode(f.file)->i_mode)) {
+			shrink_pagecache_parent(f.file->f_dentry);
+			goto out;
+		}
 		if (!bdi_write_congested(mapping->backing_dev_info))
 			__filemap_fdatawrite_range(mapping, offset, endbyte,
 						   WB_SYNC_NONE);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
                   ` (2 preceding siblings ...)
  2013-12-30 13:45 ` [PATCH 3/3] Fadvise: Add the ability for directory level page cache cleaning Li Wang
@ 2013-12-30 14:57 ` Matthew Wilcox
  2013-12-30 19:18 ` Dave Hansen
  4 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2013-12-30 14:57 UTC (permalink / raw)
  To: Li Wang
  Cc: Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Andrew Morton, Cong Wang, Zefan Li

On Mon, Dec 30, 2013 at 09:45:15PM +0800, Li Wang wrote:
> VFS relies on LRU-like page cache eviction algorithm

Why is this a series of three patches?  It just seems like you're trying
to game the patch count statistics.  One patch would have been fine for
this tiny change.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
                   ` (3 preceding siblings ...)
  2013-12-30 14:57 ` [PATCH 0/3] Fadvise: Directory level page cache cleaning support Matthew Wilcox
@ 2013-12-30 19:18 ` Dave Hansen
  2013-12-30 19:40   ` Andreas Dilger
  4 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2013-12-30 19:18 UTC (permalink / raw)
  To: Li Wang, Alexander Viro
  Cc: linux-fsdevel, linux-mm, linux-kernel, Andrew Morton, Cong Wang,
	Zefan Li, Matthew Wilcox

On 12/30/2013 05:45 AM, Li Wang wrote:
> This patch extends 'fadvise' to support directory level page cache
> cleaning. The call to posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) 
> with 'fd' referring to a directory will recursively reclaim page cache 
> entries of files inside 'fd'. For secruity concern, those inodes
> which the caller does not own appropriate permissions will not 
> be manipulated.

Why is this necessary to do in the kernel?  Why not leave it to
userspace to walk the filesystem(s)?


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2013-12-30 19:18 ` Dave Hansen
@ 2013-12-30 19:40   ` Andreas Dilger
  2013-12-30 21:33     ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Andreas Dilger @ 2013-12-30 19:40 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Li Wang, Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Andrew Morton, Cong Wang, Zefan Li, Matthew Wilcox

On Dec 30, 2013, at 12:18, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> Why is this necessary to do in the kernel?  Why not leave it to
> userspace to walk the filesystem(s)?

I would suspect that trying to do it in userspace would be quite bad. It would require traversing the whole directory tree to issue cache flushed for each subdirectory, but it doesn't know when to stop traversal. That would mean the "cache flush" would turn into "cache pollute" and cause a lot of disk IO for subdirectories not in cache to begin with. 

Cheers, Andreas
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2013-12-30 19:40   ` Andreas Dilger
@ 2013-12-30 21:33     ` Dave Hansen
  2014-01-02 12:44       ` Li Wang
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2013-12-30 21:33 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Li Wang, Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Andrew Morton, Cong Wang, Zefan Li, Matthew Wilcox

On 12/30/2013 11:40 AM, Andreas Dilger wrote:
> On Dec 30, 2013, at 12:18, Dave Hansen <dave.hansen@intel.com> wrote:
>> Why is this necessary to do in the kernel?  Why not leave it to
>> userspace to walk the filesystem(s)?
> 
> I would suspect that trying to do it in userspace would be quite bad. It would require traversing the whole directory tree to issue cache flushed for each subdirectory, but it doesn't know when to stop traversal. That would mean the "cache flush" would turn into "cache pollute" and cause a lot of disk IO for subdirectories not in cache to begin with. 

That makes sense for dentries at least and is a pretty good reason.
Probably good enough to to include some text in the patch description.
;)  Perhaps: "We need this interface because we have no way of
determining what is in the dcache from userspace, and we do not want
userspace to pollute the dcache going and looking for page cache to evict."

One other thing that bothers me: POSIX_FADV_DONTNEED on a directory
seems like it should do something with the _directory_.  It should undo
the kernel's caching that happens as a result of readdir().

Should this also be trying to drop the dentry/inode entries like "echo 2
> .../drop_caches" does?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2013-12-30 21:33     ` Dave Hansen
@ 2014-01-02 12:44       ` Li Wang
  2014-01-02 18:35         ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Li Wang @ 2014-01-02 12:44 UTC (permalink / raw)
  To: Dave Hansen, Andreas Dilger
  Cc: Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Andrew Morton, Cong Wang, Zefan Li, Matthew Wilcox

Do we really need clean dcache/icache at the current stage?
That will introduce more code work, so far, iput() will put
those unreferenced inodes into superblock lru list. To free
the inodes inside a specific directory, it seems we do not
have a handy API to use, and need
modify iput() to recognize our situation, and collect those
inodes into our list rather than superblock lru list. Maybe
we stay at current stage now, since it is simple and could
gain the major benefits, leave the dcache/icache cleaning
to do in the future?

On 2013/12/31 5:33, Dave Hansen wrote:
> On 12/30/2013 11:40 AM, Andreas Dilger wrote:
>> On Dec 30, 2013, at 12:18, Dave Hansen <dave.hansen@intel.com> wrote:
>>> Why is this necessary to do in the kernel?  Why not leave it to
>>> userspace to walk the filesystem(s)?
>>
>> I would suspect that trying to do it in userspace would be quite bad. It would require traversing the whole directory tree to issue cache flushed for each subdirectory, but it doesn't know when to stop traversal. That would mean the "cache flush" would turn into "cache pollute" and cause a lot of disk IO for subdirectories not in cache to begin with.
>
> That makes sense for dentries at least and is a pretty good reason.
> Probably good enough to to include some text in the patch description.
> ;)  Perhaps: "We need this interface because we have no way of
> determining what is in the dcache from userspace, and we do not want
> userspace to pollute the dcache going and looking for page cache to evict."
>
> One other thing that bothers me: POSIX_FADV_DONTNEED on a directory
> seems like it should do something with the _directory_.  It should undo
> the kernel's caching that happens as a result of readdir().
>
> Should this also be trying to drop the dentry/inode entries like "echo 2
>> .../drop_caches" does?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/3] Fadvise: Directory level page cache cleaning support
  2014-01-02 12:44       ` Li Wang
@ 2014-01-02 18:35         ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2014-01-02 18:35 UTC (permalink / raw)
  To: Li Wang, Andreas Dilger
  Cc: Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Andrew Morton, Cong Wang, Zefan Li, Matthew Wilcox

On 01/02/2014 04:44 AM, Li Wang wrote:
> Do we really need clean dcache/icache at the current stage?
> That will introduce more code work, so far, iput() will put
> those unreferenced inodes into superblock lru list. To free
> the inodes inside a specific directory, it seems we do not
> have a handy API to use, and need
> modify iput() to recognize our situation, and collect those
> inodes into our list rather than superblock lru list. Maybe
> we stay at current stage now, since it is simple and could
> gain the major benefits, leave the dcache/icache cleaning
> to do in the future?

<sigh> top posting....

I read your response as "that's the right thing to do, but it's too much
work".  Fair enough.  But if we're going to take the lazy hack approach
here, maybe we should do this through some other interface than a
syscall where we're stuck with the behavior.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] Add shrink_pagecache_parent
  2013-12-30 13:45 ` [PATCH 2/3] Add shrink_pagecache_parent Li Wang
@ 2014-01-02 23:55   ` Andrew Morton
  2014-01-06 13:30     ` Dave Chinner
  2014-01-08  2:06     ` Li Wang
  0 siblings, 2 replies; 14+ messages in thread
From: Andrew Morton @ 2014-01-02 23:55 UTC (permalink / raw)
  To: Li Wang
  Cc: Alexander Viro, linux-fsdevel, linux-mm, linux-kernel, Cong Wang,
	Zefan Li, Matthew Wilcox, Yunchuan Wen, Dave Chinner

On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <liwang@ubuntukylin.com> wrote:

> Analogous to shrink_dcache_parent except that it collects inodes.
> It is not very appropriate to be put in dcache.c, but d_walk can only
> be invoked from here.

Please cc Dave Chinner on future revisions.  He be da man.

The overall intent of the patchset seems reasonable and I agree that it
can't be efficiently done from userspace with the current kernel API. 
We *could* do it from userspace by providing facilities for userspace to
query the VFS caches: "is this pathname in the dentry cache" and "is
this inode in the inode cache".

> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
>  }
>  EXPORT_SYMBOL(shrink_dcache_parent);
>  
> +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
> +{
> +	struct list_head *list = data;
> +	struct inode *inode = dentry->d_inode;
> +
> +	if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
> +				(!capable(CAP_SYS_ADMIN))))
> +		goto out;
> +	spin_lock(&inode->i_lock);
> +	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||

It's unclear what rationale lies behind this particular group of tests.

> +		(inode->i_mapping->nrpages == 0) ||
> +		(!list_empty(&inode->i_lru))) {

arg, the "Inode locking rules" at the top of fs/inode.c needs a
refresh, I suspect.  It is too vague.

Formally, inode->i_lru is protected by
i_sb->s_inode_lru->node[nid].lock, not by ->i_lock.  I guess you can
just do a list_lru_add() and that will atomically add the inode to your
local list_lru if ->i_lru wasn't being used for anything else.

I *think* that your use of i_lock works OK, because code which fiddles
with i_lru and s_inode_lru also takes i_lock.  However we need to
decide which is the preferred and official lock.  ie: what is the
design here??

However...  most inodes will be on an LRU list, won't they?  Doesn't
this reuse of i_lru mean that many inodes will fail to be processed? 
If so, we might need to add a new list_head to the inode, which will be
problematic.


Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
taking i_sb->s_inode_lru->node[nid].lock.  Why doesn't this make a
concurrent s_inode_lru walker go oops??  Should we be using
list_lru_del() in there?  (which should have been called
list_lru_del_init(), sigh).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] Add shrink_pagecache_parent
  2014-01-02 23:55   ` Andrew Morton
@ 2014-01-06 13:30     ` Dave Chinner
  2014-01-08  2:06     ` Li Wang
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-01-06 13:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Li Wang, Alexander Viro, linux-fsdevel, linux-mm, linux-kernel,
	Cong Wang, Zefan Li, Matthew Wilcox, Yunchuan Wen

On Thu, Jan 02, 2014 at 03:55:34PM -0800, Andrew Morton wrote:
> On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <liwang@ubuntukylin.com> wrote:
> 
> > Analogous to shrink_dcache_parent except that it collects inodes.
> > It is not very appropriate to be put in dcache.c, but d_walk can only
> > be invoked from here.
> 
> Please cc Dave Chinner on future revisions.  He be da man.
> 
> The overall intent of the patchset seems reasonable and I agree that it
> can't be efficiently done from userspace with the current kernel API. 
> We *could* do it from userspace by providing facilities for userspace to
> query the VFS caches: "is this pathname in the dentry cache" and "is
> this inode in the inode cache".
> 
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
> >  }
> >  EXPORT_SYMBOL(shrink_dcache_parent);
> >  
> > +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
> > +{
> > +	struct list_head *list = data;
> > +	struct inode *inode = dentry->d_inode;
> > +
> > +	if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
> > +				(!capable(CAP_SYS_ADMIN))))
> > +		goto out;
> > +	spin_lock(&inode->i_lock);
> > +	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> 
> It's unclear what rationale lies behind this particular group of tests.
> 
> > +		(inode->i_mapping->nrpages == 0) ||
> > +		(!list_empty(&inode->i_lru))) {
> 
> arg, the "Inode locking rules" at the top of fs/inode.c needs a
> refresh, I suspect.  It is too vague.

Yes, it probably does need work.

> Formally, inode->i_lru is protected by
> i_sb->s_inode_lru->node[nid].lock, not by ->i_lock.  I guess you can
> just do a list_lru_add() and that will atomically add the inode to your
> local list_lru if ->i_lru wasn't being used for anything else.

There is no such thing as a "local" list_lru. If you need to put an
object on a local list, then just use a local struct list_head.
That's how we do dispose lists for the objects being removed from
the LRU...

However, the only way you can check if the i_lru is not in use is to
hold the relevant LRU lock, and that's something that should not be
directly accessed - the internal locking of the LRU is private,
subject to change and as such is only accessible in th places that
it is explicitly exposed. i.e. the ->isolate callback.

> I *think* that your use of i_lock works OK, because code which fiddles
> with i_lru and s_inode_lru also takes i_lock.  However we need to
> decide which is the preferred and official lock.  ie: what is the
> design here??

THe LRU lock nests inside the i_lock. The i_lock does not provide
exclusive access to i_lru if the inode is on the LRU; LRU list
manipulations can modify i_lru (e.g. removing an adjacent inode in
the LRU list) without holding i_lock....

> However...  most inodes will be on an LRU list, won't they?  Doesn't
> this reuse of i_lru mean that many inodes will fail to be processed? 
> If so, we might need to add a new list_head to the inode, which will be
> problematic.

Yes, yes, and yes, adding a new list head to the struct inode for
such an uncommon corner case is a non-starter.

> Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
> taking i_sb->s_inode_lru->node[nid].lock.  Why doesn't this make a
> concurrent s_inode_lru walker go oops??  Should we be using
> list_lru_del() in there? 

No, inode_lru_isoalte() is called with the lru lock held. The
specific list lock is passed as the lru_lock parameter, so it can be
droppped if a blocking operation needs to be done to prepare the
object for isolation.  So, calling list_lru_del() will deadlock on
the LRU lock.

> (which should have been called list_lru_del_init(), sigh).

That implies that removing the object from the LRU without
initialising the object being removed is a valid thing to do. It's
not - the lru_list code requires that an object not on an LRU is in
an intialised state so that list_empty() checks work correctly. i.e
list_lru_del(object); list_lru_add(object); needs to work, and that
is non-negotiable. So, no need for suffixes to define different
behaviours - there can be only one...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] Add shrink_pagecache_parent
  2014-01-02 23:55   ` Andrew Morton
  2014-01-06 13:30     ` Dave Chinner
@ 2014-01-08  2:06     ` Li Wang
  2014-01-15  0:22       ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Li Wang @ 2014-01-08  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, linux-fsdevel, linux-mm, linux-kernel, Cong Wang,
	Zefan Li, Matthew Wilcox, Yunchuan Wen, Dave Chinner

Hi,

On 01/03/2014 07:55 AM, Andrew Morton wrote:
> On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <liwang@ubuntukylin.com> wrote:
>
>> Analogous to shrink_dcache_parent except that it collects inodes.
>> It is not very appropriate to be put in dcache.c, but d_walk can only
>> be invoked from here.
>
> Please cc Dave Chinner on future revisions.  He be da man.
>
> The overall intent of the patchset seems reasonable and I agree that it
> can't be efficiently done from userspace with the current kernel API.
> We *could* do it from userspace by providing facilities for userspace to
> query the VFS caches: "is this pathname in the dentry cache" and "is
> this inode in the inode cache".
>
Even we have these available, i am afraid it will still introduce
non-negligible overhead due to frequent system calls for a directory
  walking operation, especially under massive small file situations.

>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -1318,6 +1318,42 @@ void shrink_dcache_parent(struct dentry *parent)
>>   }
>>   EXPORT_SYMBOL(shrink_dcache_parent);
>>
>> +static enum d_walk_ret gather_inode(void *data, struct dentry *dentry)
>> +{
>> +	struct list_head *list = data;
>> +	struct inode *inode = dentry->d_inode;
>> +
>> +	if ((inode == NULL) || ((!inode_owner_or_capable(inode)) &&
>> +				(!capable(CAP_SYS_ADMIN))))
>> +		goto out;
>> +	spin_lock(&inode->i_lock);
>> +	if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
>
> It's unclear what rationale lies behind this particular group of tests.
>
>> +		(inode->i_mapping->nrpages == 0) ||
>> +		(!list_empty(&inode->i_lru))) {
>
> arg, the "Inode locking rules" at the top of fs/inode.c needs a
> refresh, I suspect.  It is too vague.
>
> Formally, inode->i_lru is protected by
> i_sb->s_inode_lru->node[nid].lock, not by ->i_lock.  I guess you can
> just do a list_lru_add() and that will atomically add the inode to your
> local list_lru if ->i_lru wasn't being used for anything else.
>
> I *think* that your use of i_lock works OK, because code which fiddles
> with i_lru and s_inode_lru also takes i_lock.  However we need to
> decide which is the preferred and official lock.  ie: what is the
> design here??
>
> However...  most inodes will be on an LRU list, won't they?  Doesn't
> this reuse of i_lru mean that many inodes will fail to be processed?
> If so, we might need to add a new list_head to the inode, which will be
> problematic.
>
As far as I know, fix me if i am wrong, only when inode has zero
reference count, it will be put into superblock lru list. For most
situations, there is at least a dentry refers to it, so it will not
be on any lru list.

>
> Aside: inode_lru_isolate() fiddles directly with inode->i_lru without
> taking i_sb->s_inode_lru->node[nid].lock.  Why doesn't this make a
> concurrent s_inode_lru walker go oops??  Should we be using
> list_lru_del() in there?  (which should have been called
> list_lru_del_init(), sigh).
>
It seems inode_lru_isolate() only called by prune_icache_sb() as
a callback function. Before calling it, the caller has hold
the lock.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] Add shrink_pagecache_parent
  2014-01-08  2:06     ` Li Wang
@ 2014-01-15  0:22       ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2014-01-15  0:22 UTC (permalink / raw)
  To: Li Wang
  Cc: Andrew Morton, Alexander Viro, linux-fsdevel, linux-mm,
	linux-kernel, Cong Wang, Zefan Li, Matthew Wilcox, Yunchuan Wen

On Wed, Jan 08, 2014 at 10:06:31AM +0800, Li Wang wrote:
> Hi,
> 
> On 01/03/2014 07:55 AM, Andrew Morton wrote:
> >On Mon, 30 Dec 2013 21:45:17 +0800 Li Wang <liwang@ubuntukylin.com> wrote:
> >
> >>Analogous to shrink_dcache_parent except that it collects inodes.
> >>It is not very appropriate to be put in dcache.c, but d_walk can only
> >>be invoked from here.
> >However...  most inodes will be on an LRU list, won't they?  Doesn't
> >this reuse of i_lru mean that many inodes will fail to be processed?
> >If so, we might need to add a new list_head to the inode, which will be
> >problematic.
> >
> As far as I know, fix me if i am wrong, only when inode has zero
> reference count, it will be put into superblock lru list. For most
> situations, there is at least a dentry refers to it, so it will not
> be on any lru list.

Yes, that's when they get added to the LRU, but they don't get
removed if they are referenced again by a dentry. Hence dentries can
be reclaimed, which puts the inode on it's LRU, but then the
directory is read again and a new dentry allocated to point to it.
We do not remove the inode from the LRU at this point in time.
Hence you can have referenced inodes that are on the LRU, and in
many workloads most of the referenced inodes in the system are on
the LRU....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2014-01-15  0:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-30 13:45 [PATCH 0/3] Fadvise: Directory level page cache cleaning support Li Wang
2013-12-30 13:45 ` [PATCH 1/3] VFS: Add the declaration of shrink_pagecache_parent Li Wang
2013-12-30 13:45 ` [PATCH 2/3] Add shrink_pagecache_parent Li Wang
2014-01-02 23:55   ` Andrew Morton
2014-01-06 13:30     ` Dave Chinner
2014-01-08  2:06     ` Li Wang
2014-01-15  0:22       ` Dave Chinner
2013-12-30 13:45 ` [PATCH 3/3] Fadvise: Add the ability for directory level page cache cleaning Li Wang
2013-12-30 14:57 ` [PATCH 0/3] Fadvise: Directory level page cache cleaning support Matthew Wilcox
2013-12-30 19:18 ` Dave Hansen
2013-12-30 19:40   ` Andreas Dilger
2013-12-30 21:33     ` Dave Hansen
2014-01-02 12:44       ` Li Wang
2014-01-02 18:35         ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).