* [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
@ 2019-10-28 10:28 Konstantin Khlebnikov
  2019-10-28 11:10 ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-28 10:28 UTC (permalink / raw)
  To: linux-fsdevel, linux-mm, linux-kernel, linux-api
  Cc: Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Linus Torvalds, Roman Gushchin

This implements an fcntl() for getting the amount of resident memory in cache.
The kernel already maintains a counter for each inode; this patch just exposes
it to userspace. The returned size is in kilobytes, like values in procfs.

Alternatively, this could be implemented by mapping the file and collecting
a map of cached pages with mincore(), which is much slower and O(n*log n).
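
For comparison, the mincore() route mentioned above could look roughly like the sketch below. It is a minimal illustration, not part of the patch: the helper name resident_kb_mincore() is hypothetical, and it assumes a regular file that fits in the address space. Unlike the proposed fcntl, it only sees pages up to the end of file.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): estimate the resident page
 * cache of fd in kilobytes by mapping the file and asking mincore().
 * Returns -1 on error.
 */
static long resident_kb_mincore(int fd)
{
	struct stat st;
	long page_size = sysconf(_SC_PAGESIZE);
	long resident = -1;
	unsigned char *vec;
	size_t len, npages, i;
	void *addr;

	if (fstat(fd, &st) < 0)
		return -1;
	if (st.st_size == 0)
		return 0;	/* mmap() rejects zero-length mappings */

	len = (size_t)st.st_size;
	npages = (len + page_size - 1) / page_size;

	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* mincore() fills one byte per page; bit 0 means "resident". */
	vec = malloc(npages);
	if (vec && mincore(addr, len, vec) == 0) {
		resident = 0;
		for (i = 0; i < npages; i++)
			resident += vec[i] & 1;
	}

	free(vec);
	munmap(addr, len);
	if (resident < 0)
		return -1;
	return resident * (page_size / 1024);	/* pages -> kB */
}
```

The cost is one mmap(), one vector byte per page and a linear scan, which is the overhead a per-inode counter would avoid.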

The fincore() syscall was never implemented in Linux.
This fcntl() covers one of its use cases with a minimal footprint.

Unlike mincore(), this fcntl counts all pages, including those allocated but
not yet read (non-uptodate) and pages beyond the end of file.

This employs the same security model as mincore() and requires one of:
- the file is opened for writing
- the current user owns the inode
- the current user could open the inode for writing

Usage:
resident_kb = fcntl(fd, F_GET_RSS);

Error codes:
-EINVAL		- not supported
-EPERM		- file is not writable and caller is not the owner
-ENODATA	- special inode without cache

Notes:
A range of pages can be evicted from cache using POSIX_FADV_DONTNEED.
Populating with POSIX_FADV_WILLNEED is asynchronous and limited by the
disk's read_ahead_kb and max_sectors_kb. It seems the most effective way to
read data into cache synchronously is a sendfile() into /dev/null.
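
The two operations in these notes could be sketched from userspace as follows. The helper names cache_evict() and cache_populate() are hypothetical and the 1 MiB chunk size is an arbitrary choice; this is only an illustration under those assumptions, not code from the patch.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to drop cached pages of fd in
 * [offset, offset + len); len == 0 means "until end of file".
 * Dirty pages may survive unless they are written back first.
 * Returns 0 or an error number (posix_fadvise() does not set errno).
 */
static int cache_evict(int fd, off_t offset, off_t len)
{
	return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}

/*
 * Hypothetical helper: pull the whole file into the page cache
 * synchronously by sending it to /dev/null.
 * Returns the number of bytes read, or -1 on error.
 */
static long long cache_populate(int fd)
{
	int null_fd = open("/dev/null", O_WRONLY);
	long long total = 0;
	off_t off = 0;

	if (null_fd < 0)
		return -1;

	for (;;) {
		/* An explicit offset leaves fd's own file position untouched. */
		ssize_t n = sendfile(null_fd, fd, &off, 1 << 20);

		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)	/* end of file */
			break;
		total += n;
	}

	close(null_fd);
	return total;
}
```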

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 fs/fcntl.c                 |   30 ++++++++++++++++++++++++++++++
 include/uapi/linux/fcntl.h |    5 +++++
 2 files changed, 35 insertions(+)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3d40771e8e7c..b241d3c925db 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -25,6 +25,8 @@
 #include <linux/user_namespace.h>
 #include <linux/memfd.h>
 #include <linux/compat.h>
+#include <linux/dax.h>
+#include <linux/hugetlb.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -319,6 +321,31 @@ static long fcntl_rw_hint(struct file *file, unsigned int cmd,
 	}
 }
 
+static long fcntl_get_rss(struct file *filp)
+{
+	struct address_space *mapping = filp->f_mapping;
+	unsigned long pages;
+
+	if (!mapping)
+		return -ENODATA;
+
+	/* The same limitations as for sys_mincore() */
+	if (!(filp->f_mode & FMODE_WRITE) &&
+	    !inode_owner_or_capable(mapping->host) &&
+	    inode_permission(mapping->host, MAY_WRITE))
+		return -EPERM;
+
+	if (dax_mapping(mapping))
+		pages = READ_ONCE(mapping->nrexceptional);
+	else
+		pages = READ_ONCE(mapping->nrpages);
+
+	if (is_file_hugepages(filp))
+		pages <<= huge_page_order(hstate_file(filp));
+
+	return pages << (PAGE_SHIFT - 10);	/* page -> kb */
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 		struct file *filp)
 {
@@ -426,6 +453,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_SET_FILE_RW_HINT:
 		err = fcntl_rw_hint(filp, cmd, arg);
 		break;
+	case F_GET_RSS:
+		err = fcntl_get_rss(filp);
+		break;
 	default:
 		break;
 	}
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 1d338357df8a..d467f1dbfc67 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -54,6 +54,11 @@
 #define F_GET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 13)
 #define F_SET_FILE_RW_HINT	(F_LINUX_SPECIFIC_BASE + 14)
 
+/*
+ * Get amount of resident memory in file cache in kilobytes.
+ */
+#define F_GET_RSS		(F_LINUX_SPECIFIC_BASE + 15)
+
 /*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 10:28 [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS Konstantin Khlebnikov
@ 2019-10-28 11:10 ` Matthew Wilcox
  2019-10-28 11:20   ` Konstantin Khlebnikov
  2019-10-28 11:46   ` Florian Weimer
  2019-10-28 12:27   ` Linus Torvalds
  2 siblings, 1 reply; 11+ messages in thread
From: Matthew Wilcox @ 2019-10-28 11:10 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin

On Mon, Oct 28, 2019 at 01:28:09PM +0300, Konstantin Khlebnikov wrote:
> +	if (dax_mapping(mapping))
> +		pages = READ_ONCE(mapping->nrexceptional);
> +	else
> +		pages = READ_ONCE(mapping->nrpages);

I'm not sure this is the right calculation for DAX files.  We haven't
allocated any memory for DAX; we're just accessing storage directly.
The entries in the page cache are just translation from file offset to
physical address.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 11:10 ` Matthew Wilcox
@ 2019-10-28 11:20   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 11+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-28 11:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin

On 28/10/2019 14.10, Matthew Wilcox wrote:
> On Mon, Oct 28, 2019 at 01:28:09PM +0300, Konstantin Khlebnikov wrote:
>> +	if (dax_mapping(mapping))
>> +		pages = READ_ONCE(mapping->nrexceptional);
>> +	else
>> +		pages = READ_ONCE(mapping->nrpages);
> 
> I'm not sure this is the right calculation for DAX files.  We haven't
> allocated any memory for DAX; we're just accessing storage directly.
> The entries in the page cache are just translation from file offset to
> physical address.
> 

Yep, makes sense. If RSS is defined as memory usage, then this chunk must read
pages = READ_ONCE(mapping->nrpages) unconditionally and report 0 for DAX.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 10:28 [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS Konstantin Khlebnikov
@ 2019-10-28 11:46   ` Florian Weimer
  2019-10-28 11:46   ` Florian Weimer
  2019-10-28 12:27   ` Linus Torvalds
  2 siblings, 0 replies; 11+ messages in thread
From: Florian Weimer @ 2019-10-28 11:46 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin

* Konstantin Khlebnikov:

> This implements an fcntl() for getting the amount of resident memory in cache.
> The kernel already maintains a counter for each inode; this patch just exposes
> it to userspace. The returned size is in kilobytes, like values in procfs.

I think this needs a 32-bit compat implementation which clamps the
returned value to INT_MAX.
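
A minimal sketch of the clamping suggested here, assuming 4 KiB pages and a 32-bit int. EXAMPLE_PAGE_SHIFT, pages_to_kb() and clamp_kb_for_compat() are hypothetical names for illustration; the patch itself contains no such code. With the result in kilobytes, an int saturates just under 2 TiB.

```c
#include <limits.h>

/* Assumption for illustration: 4 KiB pages, i.e. a page shift of 12. */
#define EXAMPLE_PAGE_SHIFT 12

/* Same conversion as in the patch: pages -> kilobytes. */
static unsigned long long pages_to_kb(unsigned long long pages)
{
	return pages << (EXAMPLE_PAGE_SHIFT - 10);
}

/*
 * Hypothetical compat helper: fcntl() returns an int to userspace,
 * so saturate at INT_MAX instead of truncating or going negative.
 */
static int clamp_kb_for_compat(unsigned long long kb)
{
	return kb > INT_MAX ? INT_MAX : (int)kb;
}
```

Without the clamp, a count of 2^30 pages (4 TiB) would become 2^32 kB, which does not fit in the int return value.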

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 10:28 [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS Konstantin Khlebnikov
@ 2019-10-28 12:27   ` Linus Torvalds
  2019-10-28 11:46   ` Florian Weimer
  2019-10-28 12:27   ` Linus Torvalds
  2 siblings, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2019-10-28 12:27 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, Linux-MM, Linux Kernel Mailing List, Linux API,
	Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Roman Gushchin

On Mon, Oct 28, 2019 at 11:28 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> This implements an fcntl() for getting the amount of resident memory in cache.
> The kernel already maintains a counter for each inode; this patch just exposes
> it to userspace. The returned size is in kilobytes, like values in procfs.

This doesn't actually explain why anybody would want it, and what the
usage scenario is.

             Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 12:27   ` Linus Torvalds
  (?)
@ 2019-10-28 12:49   ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 11+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-28 12:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Linux-MM, Linux Kernel Mailing List, Linux API,
	Michal Hocko, Alexander Viro, Johannes Weiner, Andrew Morton,
	Roman Gushchin

On 28/10/2019 15.27, Linus Torvalds wrote:
> On Mon, Oct 28, 2019 at 11:28 AM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> This implements an fcntl() for getting the amount of resident memory in cache.
>> The kernel already maintains a counter for each inode; this patch just exposes
>> it to userspace. The returned size is in kilobytes, like values in procfs.
> 
> This doesn't actually explain why anybody would want it, and what the
> usage scenario is.
> 

This really helps to plot the memory usage distribution. Right now the file cache
has only total counters. Collecting statistics via mincore(), as implemented
in the page-types tool, is inefficient and very racy.

The usage scenario is the same as finding the top memory users among processes,
but among files, which are not always mapped anywhere.

For example, if somebody writes or reads logs too intensively, their file cache
could bloat and push more important data out of memory.

Also, a little bit of introspection wouldn't hurt.
Using this I've found unneeded pages beyond i_size.

>               Linus
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 11:46   ` Florian Weimer
  (?)
@ 2019-10-28 12:55   ` Konstantin Khlebnikov
  2019-10-28 13:05       ` Florian Weimer
  -1 siblings, 1 reply; 11+ messages in thread
From: Konstantin Khlebnikov @ 2019-10-28 12:55 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin

On 28/10/2019 14.46, Florian Weimer wrote:
> * Konstantin Khlebnikov:
> 
>> This implements an fcntl() for getting the amount of resident memory in cache.
>> The kernel already maintains a counter for each inode; this patch just exposes
>> it to userspace. The returned size is in kilobytes, like values in procfs.
> 
> I think this needs a 32-bit compat implementation which clamps the
> returned value to INT_MAX.
> 

A 32-bit machine couldn't hold more than 2TB of cache in one file.
Even the radix tree wouldn't fit into the low memory area.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH RFC] fs/fcntl: add fcntl F_GET_RSS
  2019-10-28 12:55   ` Konstantin Khlebnikov
@ 2019-10-28 13:05       ` Florian Weimer
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Weimer @ 2019-10-28 13:05 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: linux-fsdevel, linux-mm, linux-kernel, linux-api, Michal Hocko,
	Alexander Viro, Johannes Weiner, Andrew Morton, Linus Torvalds,
	Roman Gushchin

* Konstantin Khlebnikov:

> On 28/10/2019 14.46, Florian Weimer wrote:
>> * Konstantin Khlebnikov:
>> 
>>> This implements an fcntl() for getting the amount of resident memory in cache.
>>> The kernel already maintains a counter for each inode; this patch just exposes
>>> it to userspace. The returned size is in kilobytes, like values in procfs.
>> 
>> I think this needs a 32-bit compat implementation which clamps the
>> returned value to INT_MAX.
>> 
>
> A 32-bit machine couldn't hold more than 2TB of cache in one file.
> Even the radix tree wouldn't fit into the low memory area.

I meant a 32-bit process running on a 64-bit kernel.

^ permalink raw reply	[flat|nested] 11+ messages in thread
