All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] proc: report open files as size in stat() for /proc/pid/fd
@ 2022-09-16 23:08 Ivan Babrou
  2022-09-17  0:01 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-16 23:08 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-kernel, kernel-team, Ivan Babrou, Andrew Morton, Kalesh Singh

Many monitoring tools include open file count as a metric. Currently
the only way to get this number is to enumerate the files in /proc/pid/fd.

The problem with the current approach is that it does many things people
generally don't care about when they need one number for a metric.
In our tests for cadvisor, which reports open file counts per cgroup,
we observed that reading the number of open files is slow. Out of 35.23%
of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
`proc_fill_cache`, which is responsible for filling dentry info.
Some of this extra time is spinlock contention, but it's a contention
for the lock we don't want to take to begin with.

We considered putting the number of open files in /proc/pid/stat.
Unfortunately, counting the number of fds involves iterating the fdtable,
which means that it might slow down /proc/pid/stat for processes
with many open files. Instead we opted to put this info in /proc/pid/fd
as a size member of the stat syscall result. Previously the reported
number was zero, so there's very little risk of breaking anything,
while still providing a somewhat logical way to count the open files.

Previously:

```
$ sudo stat /proc/1/fd | head -n2
  File: /proc/1/fd
  Size: 0         	Blocks: 0          IO Block: 1024   directory
```

With this patch:

```
$ sudo stat /proc/1/fd | head -n2
  File: /proc/1/fd
  Size: 65        	Blocks: 0          IO Block: 1024   directory
```

Correctness check:

```
$ sudo ls /proc/1/fd | wc -l
65
```

There are two alternatives to this approach that I can see:

* Expose /proc/pid/fd_count with a count there
* Make fd count acces O(1) and expose it in /proc/pid/status

I can probably figure out how to do the former, but the latter
will require somebody with more experience in file code than myself.

Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
 fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..c7ac142500a8 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 	return 0;
 }
 
+static int proc_readfd_count(struct inode *inode)
+{
+	struct task_struct *p = get_proc_task(inode);
+	unsigned int fd = 0, count = 0;
+
+	if (!p)
+		return -ENOENT;
+
+	rcu_read_lock();
+	while (task_lookup_next_fd_rcu(p, &fd)) {
+		rcu_read_unlock();
+
+		count++;
+		fd++;
+
+		cond_resched();
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+	put_task_struct(p);
+	return count;
+}
+
 static int proc_readfd(struct file *file, struct dir_context *ctx)
 {
 	return proc_readfd_common(file, ctx, proc_fd_instantiate);
@@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
 	return rv;
 }
 
+int proc_fd_getattr(struct user_namespace *mnt_userns,
+			const struct path *path, struct kstat *stat,
+			u32 request_mask, unsigned int query_flags)
+{
+	struct inode *inode = d_inode(path->dentry);
+	struct proc_dir_entry *de = PDE(inode);
+
+	if (de) {
+		nlink_t nlink = READ_ONCE(de->nlink);
+
+		if (nlink > 0)
+			set_nlink(inode, nlink);
+	}
+
+	generic_fillattr(&init_user_ns, inode, stat);
+
+	/* If it's a directory, put the number of open fds there */
+	if (S_ISDIR(inode->i_mode))
+		stat->size = proc_readfd_count(inode);
+
+	return 0;
+}
+
 const struct inode_operations proc_fd_inode_operations = {
 	.lookup		= proc_lookupfd,
 	.permission	= proc_fd_permission,
+	.getattr	= proc_fd_getattr,
 	.setattr	= proc_setattr,
 };
 
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
@ 2022-09-17  0:01 ` Andrew Morton
  2022-09-17  0:28   ` Ivan Babrou
  2022-09-17  7:15   ` Alexey Dobriyan
  2022-09-17  3:22 ` kernel test robot
  2022-09-17 15:31 ` Theodore Ts'o
  2 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2022-09-17  0:01 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: linux-fsdevel, linux-kernel, kernel-team, Kalesh Singh,
	Alexey Dobriyan, Al Viro

(cc's added)

On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:

> Many monitoring tools include open file count as a metric. Currently
> the only way to get this number is to enumerate the files in /proc/pid/fd.
> 
> The problem with the current approach is that it does many things people
> generally don't care about when they need one number for a metric.
> In our tests for cadvisor, which reports open file counts per cgroup,
> we observed that reading the number of open files is slow. Out of 35.23%
> of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> `proc_fill_cache`, which is responsible for filling dentry info.
> Some of this extra time is spinlock contention, but it's a contention
> for the lock we don't want to take to begin with.
> 
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.

Documentation/filesystems/proc.rst would be an appropriate place to
document this ;)

> Previously:
> 
> ```
> $ sudo stat /proc/1/fd | head -n2
>   File: /proc/1/fd
>   Size: 0         	Blocks: 0          IO Block: 1024   directory
> ```
> 
> With this patch:
> 
> ```
> $ sudo stat /proc/1/fd | head -n2
>   File: /proc/1/fd
>   Size: 65        	Blocks: 0          IO Block: 1024   directory
> ```
> 
> Correctness check:
> 
> ```
> $ sudo ls /proc/1/fd | wc -l
> 65
> ```
> 
> There are two alternatives to this approach that I can see:
> 
> * Expose /proc/pid/fd_count with a count there
> * Make fd count acces O(1) and expose it in /proc/pid/status
> 
> I can probably figure out how to do the former, but the latter
> will require somebody with more experience in file code than myself.
> 
> Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
> ---
>  fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 47 insertions(+)
> 
> diff --git a/fs/proc/fd.c b/fs/proc/fd.c
> index 913bef0d2a36..c7ac142500a8 100644
> --- a/fs/proc/fd.c
> +++ b/fs/proc/fd.c
> @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
>  	return 0;
>  }
>  
> +static int proc_readfd_count(struct inode *inode)
> +{
> +	struct task_struct *p = get_proc_task(inode);
> +	unsigned int fd = 0, count = 0;
> +
> +	if (!p)
> +		return -ENOENT;
> +
> +	rcu_read_lock();
> +	while (task_lookup_next_fd_rcu(p, &fd)) {
> +		rcu_read_unlock();
> +
> +		count++;
> +		fd++;
> +
> +		cond_resched();
> +		rcu_read_lock();
> +	}
> +	rcu_read_unlock();
> +	put_task_struct(p);
> +	return count;
> +}
> +
>  static int proc_readfd(struct file *file, struct dir_context *ctx)
>  {
>  	return proc_readfd_common(file, ctx, proc_fd_instantiate);
> @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
>  	return rv;
>  }
>  
> +int proc_fd_getattr(struct user_namespace *mnt_userns,
> +			const struct path *path, struct kstat *stat,
> +			u32 request_mask, unsigned int query_flags)
> +{
> +	struct inode *inode = d_inode(path->dentry);
> +	struct proc_dir_entry *de = PDE(inode);
> +
> +	if (de) {
> +		nlink_t nlink = READ_ONCE(de->nlink);
> +
> +		if (nlink > 0)
> +			set_nlink(inode, nlink);
> +	}
> +
> +	generic_fillattr(&init_user_ns, inode, stat);
> +
> +	/* If it's a directory, put the number of open fds there */
> +	if (S_ISDIR(inode->i_mode))
> +		stat->size = proc_readfd_count(inode);
> +
> +	return 0;
> +}
> +
>  const struct inode_operations proc_fd_inode_operations = {
>  	.lookup		= proc_lookupfd,
>  	.permission	= proc_fd_permission,
> +	.getattr	= proc_fd_getattr,
>  	.setattr	= proc_setattr,
>  };
>  
> -- 
> 2.37.2

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17  0:01 ` Andrew Morton
@ 2022-09-17  0:28   ` Ivan Babrou
  2022-09-17  7:15   ` Alexey Dobriyan
  1 sibling, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17  0:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, linux-kernel, kernel-team, Kalesh Singh,
	Alexey Dobriyan, Al Viro

On Fri, Sep 16, 2022 at 5:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> (cc's added)
>
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
>
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> >
> > The problem with the current approach is that it does many things people
> > generally don't care about when they need one number for a metric.
> > In our tests for cadvisor, which reports open file counts per cgroup,
> > we observed that reading the number of open files is slow. Out of 35.23%
> > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > `proc_fill_cache`, which is responsible for filling dentry info.
> > Some of this extra time is spinlock contention, but it's a contention
> > for the lock we don't want to take to begin with.
> >
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)

I am more than happy to add the docs after there's a confirmation that
this is an appropriate approach to expose this information. I probably
should've mentioned this explicitly, that's on me. There are two
alternative approaches at the bottom of my original email that might
be considered.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
  2022-09-17  0:01 ` Andrew Morton
@ 2022-09-17  3:22 ` kernel test robot
  2022-09-17 15:31 ` Theodore Ts'o
  2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2022-09-17  3:22 UTC (permalink / raw)
  To: Ivan Babrou; +Cc: llvm, kbuild-all

Hi Ivan,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.0-rc5 next-20220916]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ivan-Babrou/proc-report-open-files-as-size-in-stat-for-proc-pid-fd/20220917-071056
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: hexagon-randconfig-r041-20220916 (https://download.01.org/0day-ci/archive/20220917/202209171129.dnkI0XiE-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 791a7ae1ba3efd6bca96338e10ffde557ba83920)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/58f6f7e4082a88ddcb4b732d386f9a4b0e1b03b4
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Ivan-Babrou/proc-report-open-files-as-size-in-stat-for-proc-pid-fd/20220917-071056
        git checkout 58f6f7e4082a88ddcb4b732d386f9a4b0e1b03b4
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/proc/

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/proc/fd.c:345:5: warning: no previous prototype for function 'proc_fd_getattr' [-Wmissing-prototypes]
   int proc_fd_getattr(struct user_namespace *mnt_userns,
       ^
   fs/proc/fd.c:345:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   int proc_fd_getattr(struct user_namespace *mnt_userns,
   ^
   static 
   1 warning generated.


vim +/proc_fd_getattr +345 fs/proc/fd.c

   344	
 > 345	int proc_fd_getattr(struct user_namespace *mnt_userns,
   346				const struct path *path, struct kstat *stat,
   347				u32 request_mask, unsigned int query_flags)
   348	{
   349		struct inode *inode = d_inode(path->dentry);
   350		struct proc_dir_entry *de = PDE(inode);
   351	
   352		if (de) {
   353			nlink_t nlink = READ_ONCE(de->nlink);
   354	
   355			if (nlink > 0)
   356				set_nlink(inode, nlink);
   357		}
   358	
   359		generic_fillattr(&init_user_ns, inode, stat);
   360	
   361		/* If it's a directory, put the number of open fds there */
   362		if (S_ISDIR(inode->i_mode))
   363			stat->size = proc_readfd_count(inode);
   364	
   365		return 0;
   366	}
   367	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17  0:01 ` Andrew Morton
  2022-09-17  0:28   ` Ivan Babrou
@ 2022-09-17  7:15   ` Alexey Dobriyan
  2022-09-17 18:32     ` Ivan Babrou
  2022-09-19  8:18     ` Christian Brauner
  1 sibling, 2 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-09-17  7:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Ivan Babrou, linux-fsdevel, linux-kernel, kernel-team,
	Kalesh Singh, Al Viro

On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> (cc's added)
> 
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
> 
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> > 
> > The problem with the current approach is that it does many things people
> > generally don't care about when they need one number for a metric.
> > In our tests for cadvisor, which reports open file counts per cgroup,
> > we observed that reading the number of open files is slow. Out of 35.23%
> > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > `proc_fill_cache`, which is responsible for filling dentry info.
> > Some of this extra time is spinlock contention, but it's a contention
> > for the lock we don't want to take to begin with.
> > 
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
> 
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)
> 
> > Previously:
> > 
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> >   File: /proc/1/fd
> >   Size: 0         	Blocks: 0          IO Block: 1024   directory
> > ```
> > 
> > With this patch:
> > 
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> >   File: /proc/1/fd
> >   Size: 65        	Blocks: 0          IO Block: 1024   directory

Yes. This is natural place.

> > ```
> > 
> > Correctness check:
> > 
> > ```
> > $ sudo ls /proc/1/fd | wc -l
> > 65
> > ```
> > 
> > There are two alternatives to this approach that I can see:
> > 
> > * Expose /proc/pid/fd_count with a count there

> > * Make fd count acces O(1) and expose it in /proc/pid/status

This is doable, next to FDSize.

Below is doable too.

> > --- a/fs/proc/fd.c
> > +++ b/fs/proc/fd.c
> > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> >  	return 0;
> >  }
> >  
> > +static int proc_readfd_count(struct inode *inode)
> > +{
> > +	struct task_struct *p = get_proc_task(inode);
> > +	unsigned int fd = 0, count = 0;
> > +
> > +	if (!p)
> > +		return -ENOENT;
> > +
> > +	rcu_read_lock();
> > +	while (task_lookup_next_fd_rcu(p, &fd)) {
> > +		rcu_read_unlock();
> > +
> > +		count++;
> > +		fd++;
> > +
> > +		cond_resched();
> > +		rcu_read_lock();
> > +	}
> > +	rcu_read_unlock();
> > +	put_task_struct(p);
> > +	return count;
> > +}
> > +
> >  static int proc_readfd(struct file *file, struct dir_context *ctx)
> >  {
> >  	return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> >  	return rv;
> >  }
> >  
> > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > +			const struct path *path, struct kstat *stat,
> > +			u32 request_mask, unsigned int query_flags)
> > +{
> > +	struct inode *inode = d_inode(path->dentry);
> > +	struct proc_dir_entry *de = PDE(inode);
> > +
> > +	if (de) {
> > +		nlink_t nlink = READ_ONCE(de->nlink);
> > +
> > +		if (nlink > 0)
> > +			set_nlink(inode, nlink);
> > +	}
> > +
> > +	generic_fillattr(&init_user_ns, inode, stat);
			 ^^^^^^^^^^^^^

Is this correct? I'm not userns guy at all.

> > +
> > +	/* If it's a directory, put the number of open fds there */
> > +	if (S_ISDIR(inode->i_mode))
> > +		stat->size = proc_readfd_count(inode);

ENOENT can get there. In principle this is OK, userspace can live with it.

> >  const struct inode_operations proc_fd_inode_operations = {
> >  	.lookup		= proc_lookupfd,
> >  	.permission	= proc_fd_permission,
> > +	.getattr	= proc_fd_getattr,
> >  	.setattr	= proc_setattr,

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
  2022-09-17  0:01 ` Andrew Morton
  2022-09-17  3:22 ` kernel test robot
@ 2022-09-17 15:31 ` Theodore Ts'o
  2022-09-17 18:32   ` Ivan Babrou
  2 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2022-09-17 15:31 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: linux-fsdevel, linux-kernel, kernel-team, Andrew Morton, Kalesh Singh

On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.

Instead of using the st_size of /proc/<pid>/fd, why not return that
value in st_nlink?  /proc/<pid>/fd is a directory, so having st_nlinks
return number of fd's plus 2 (for . and ..) would be much more natural.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17  7:15   ` Alexey Dobriyan
@ 2022-09-17 18:32     ` Ivan Babrou
  2022-09-19 12:30       ` Alexey Dobriyan
  2022-09-19  8:18     ` Christian Brauner
  1 sibling, 1 reply; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17 18:32 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
	Kalesh Singh, Al Viro

> > > * Make fd count acces O(1) and expose it in /proc/pid/status
>
> This is doable, next to FDSize.

It feels like a better solution, but maybe I'm missing some context
here. Let me know whether this is preferred.

That said, I've tried doing it, but failed. There's a noticeable
mismatch in the numbers:

* systemd:

ivan@vm:~$ sudo ls -l /proc/1/fd | wc -l
66
ivan@vm:~$ cat /proc/1/status | fgrep FD
FDSize: 256
FDUsed: 71

* journald:

ivan@vm:~$ sudo ls -l /proc/803/fd | wc -l
29
ivan@vm:~$ cat /proc/803/status | fgrep FD
FDSize: 128
FDUsed: 37

I'll see if I can make it work next week. I'm happy to receive tips as well.

Below is my attempt (link in case gmail breaks patch formatting):

* https://gist.githubusercontent.com/bobrik/acce40881d629d8cce2e55966b31a0a2/raw/716eb4724a8fe3afeeb76fd2a7a47ee13790a9e9/fdused.patch

diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..8bc0741cabf1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -85,6 +85,8 @@ static void copy_fdtable(struct fdtable *nfdt,
struct fdtable *ofdt)
  memset((char *)nfdt->fd + cpy, 0, set);

  copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+
+ atomic_set(&nfdt->count, atomic_read(&ofdt->count));
 }

 /*
@@ -105,6 +107,7 @@ static void copy_fdtable(struct fdtable *nfdt,
struct fdtable *ofdt)
 static struct fdtable * alloc_fdtable(unsigned int nr)
 {
  struct fdtable *fdt;
+ atomic_t count = ATOMIC_INIT(0);
  void *data;

  /*
@@ -148,6 +151,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
  fdt->close_on_exec = data;
  data += nr / BITS_PER_BYTE;
  fdt->full_fds_bits = data;
+ fdt->count = count;

  return fdt;

@@ -399,6 +403,8 @@ struct files_struct *dup_fd(struct files_struct
*oldf, unsigned int max_fds, int
  /* clear the remainder */
  memset(new_fds, 0, (new_fdt->max_fds - open_files) * sizeof(struct file *));

+ atomic_set(&new_fdt->count, atomic_read(&old_fdt->count));
+
  rcu_assign_pointer(newf->fdt, new_fdt);

  return newf;
@@ -474,6 +480,7 @@ struct files_struct init_files = {
  .close_on_exec = init_files.close_on_exec_init,
  .open_fds = init_files.open_fds_init,
  .full_fds_bits = init_files.full_fds_bits_init,
+ .count = ATOMIC_INIT(0),
  },
  .file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock),
  .resize_wait = __WAIT_QUEUE_HEAD_INITIALIZER(init_files.resize_wait),
@@ -613,6 +620,7 @@ void fd_install(unsigned int fd, struct file *file)
  BUG_ON(fdt->fd[fd] != NULL);
  rcu_assign_pointer(fdt->fd[fd], file);
  spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);
  return;
  }
  /* coupled with smp_wmb() in expand_fdtable() */
@@ -621,6 +629,7 @@ void fd_install(unsigned int fd, struct file *file)
  BUG_ON(fdt->fd[fd] != NULL);
  rcu_assign_pointer(fdt->fd[fd], file);
  rcu_read_unlock_sched();
+ atomic_inc(&fdt->count);
 }

 EXPORT_SYMBOL(fd_install);
@@ -646,6 +655,7 @@ static struct file *pick_file(struct files_struct
*files, unsigned fd)
  if (file) {
  rcu_assign_pointer(fdt->fd[fd], NULL);
  __put_unused_fd(files, fd);
+ atomic_dec(&fdt->count);
  }
  return file;
 }
@@ -844,6 +854,7 @@ void do_close_on_exec(struct files_struct *files)
  filp_close(file, files);
  cond_resched();
  spin_lock(&files->file_lock);
+ atomic_dec(&fdt->count);
  }

  }
@@ -1108,6 +1119,7 @@ __releases(&files->file_lock)
  else
  __clear_close_on_exec(fd, fdt);
  spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);

  if (tofree)
  filp_close(tofree, files);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 99fcbfda8e25..5847f077bfc3 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -153,7 +153,8 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
  struct task_struct *tracer;
  const struct cred *cred;
  pid_t ppid, tpid = 0, tgid, ngid;
- unsigned int max_fds = 0;
+ struct fdtable *fdt;
+ unsigned int max_fds = 0, open_fds = 0;

  rcu_read_lock();
  ppid = pid_alive(p) ?
@@ -170,8 +171,11 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
  task_lock(p);
  if (p->fs)
  umask = p->fs->umask;
- if (p->files)
- max_fds = files_fdtable(p->files)->max_fds;
+ if (p->files) {
+ fdt = files_fdtable(p->files);
+ max_fds = fdt->max_fds;
+ open_fds = atomic_read(&fdt->count);
+ }
  task_unlock(p);
  rcu_read_unlock();

@@ -194,6 +198,7 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
  seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
  seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
  seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
+ seq_put_decimal_ull(m, "\nFDUsed:\t", open_fds);

  seq_puts(m, "\nGroups:\t");
  group_info = cred->group_info;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..59aceb1e4bc6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -31,6 +31,7 @@ struct fdtable {
  unsigned long *open_fds;
  unsigned long *full_fds_bits;
  struct rcu_head rcu;
+ atomic_t count;
 };

 static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)


> > > +
> > > +   generic_fillattr(&init_user_ns, inode, stat);
>                          ^^^^^^^^^^^^^
>
> Is this correct? I'm not userns guy at all.

I mostly copied from here:

* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/generic.c#L150

Maybe it can be simplified even further to match this one:

* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/root.c#L317

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17 15:31 ` Theodore Ts'o
@ 2022-09-17 18:32   ` Ivan Babrou
  0 siblings, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17 18:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-fsdevel, linux-kernel, kernel-team, Andrew Morton, Kalesh Singh

On Sat, Sep 17, 2022 at 8:31 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Instead of using the st_size of /proc/<pid>/fd, why not return that
> value in st_nlink?  /proc/<pid>/fd is a directory, so having st_nlinks
> return number of fd's plus 2 (for . and ..) would be much more natural.

From what I see, st_nlinks is used for the number of subdirectories
and it doesn't include files. In /proc/fd we only have files (well,
symlinks really). I'm still happy to use that instead if that's
preferred.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17  7:15   ` Alexey Dobriyan
  2022-09-17 18:32     ` Ivan Babrou
@ 2022-09-19  8:18     ` Christian Brauner
  1 sibling, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2022-09-19  8:18 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Andrew Morton, Ivan Babrou, linux-fsdevel, linux-kernel,
	kernel-team, Kalesh Singh, Al Viro

On Sat, Sep 17, 2022 at 10:15:13AM +0300, Alexey Dobriyan wrote:
> On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> > (cc's added)
> > 
> > On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
> > 
> > > Many monitoring tools include open file count as a metric. Currently
> > > the only way to get this number is to enumerate the files in /proc/pid/fd.
> > > 
> > > The problem with the current approach is that it does many things people
> > > generally don't care about when they need one number for a metric.
> > > In our tests for cadvisor, which reports open file counts per cgroup,
> > > we observed that reading the number of open files is slow. Out of 35.23%
> > > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > > `proc_fill_cache`, which is responsible for filling dentry info.
> > > Some of this extra time is spinlock contention, but it's a contention
> > > for the lock we don't want to take to begin with.
> > > 
> > > We considered putting the number of open files in /proc/pid/stat.
> > > Unfortunately, counting the number of fds involves iterating the fdtable,
> > > which means that it might slow down /proc/pid/stat for processes
> > > with many open files. Instead we opted to put this info in /proc/pid/fd
> > > as a size member of the stat syscall result. Previously the reported
> > > number was zero, so there's very little risk of breaking anything,
> > > while still providing a somewhat logical way to count the open files.
> > 
> > Documentation/filesystems/proc.rst would be an appropriate place to
> > document this ;)
> > 
> > > Previously:
> > > 
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > >   File: /proc/1/fd
> > >   Size: 0         	Blocks: 0          IO Block: 1024   directory
> > > ```
> > > 
> > > With this patch:
> > > 
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > >   File: /proc/1/fd
> > >   Size: 65        	Blocks: 0          IO Block: 1024   directory
> 
> Yes. This is natural place.
> 
> > > ```
> > > 
> > > Correctness check:
> > > 
> > > ```
> > > $ sudo ls /proc/1/fd | wc -l
> > > 65
> > > ```
> > > 
> > > There are two alternatives to this approach that I can see:
> > > 
> > > * Expose /proc/pid/fd_count with a count there
> 
> > > * Make fd count acces O(1) and expose it in /proc/pid/status
> 
> This is doable, next to FDSize.
> 
> Below is doable too.
> 
> > > --- a/fs/proc/fd.c
> > > +++ b/fs/proc/fd.c
> > > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> > >  	return 0;
> > >  }
> > >  
> > > +static int proc_readfd_count(struct inode *inode)
> > > +{
> > > +	struct task_struct *p = get_proc_task(inode);
> > > +	unsigned int fd = 0, count = 0;
> > > +
> > > +	if (!p)
> > > +		return -ENOENT;
> > > +
> > > +	rcu_read_lock();
> > > +	while (task_lookup_next_fd_rcu(p, &fd)) {
> > > +		rcu_read_unlock();
> > > +
> > > +		count++;
> > > +		fd++;
> > > +
> > > +		cond_resched();
> > > +		rcu_read_lock();
> > > +	}
> > > +	rcu_read_unlock();
> > > +	put_task_struct(p);
> > > +	return count;
> > > +}
> > > +
> > >  static int proc_readfd(struct file *file, struct dir_context *ctx)
> > >  {
> > >  	return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> > >  	return rv;
> > >  }
> > >  
> > > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > > +			const struct path *path, struct kstat *stat,
> > > +			u32 request_mask, unsigned int query_flags)
> > > +{
> > > +	struct inode *inode = d_inode(path->dentry);
> > > +	struct proc_dir_entry *de = PDE(inode);
> > > +
> > > +	if (de) {
> > > +		nlink_t nlink = READ_ONCE(de->nlink);
> > > +
> > > +		if (nlink > 0)
> > > +			set_nlink(inode, nlink);
> > > +	}
> > > +
> > > +	generic_fillattr(&init_user_ns, inode, stat);
> 			 ^^^^^^^^^^^^^
> 
> Is this correct? I'm not userns guy at all.

This is correct. :)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-17 18:32     ` Ivan Babrou
@ 2022-09-19 12:30       ` Alexey Dobriyan
  2022-09-19 22:03         ` Ivan Babrou
  0 siblings, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2022-09-19 12:30 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
	Kalesh Singh, Al Viro

On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > * Make fd count acces O(1) and expose it in /proc/pid/status
> >
> > This is doable, next to FDSize.
> 
> It feels like a better solution, but maybe I'm missing some context
> here. Let me know whether this is preferred.

I don't know. I'd put it in st_size as you did initially.
/proc/*/status should be slow.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
  2022-09-19 12:30       ` Alexey Dobriyan
@ 2022-09-19 22:03         ` Ivan Babrou
  0 siblings, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-19 22:03 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
	Kalesh Singh, Al Viro

On Mon, Sep 19, 2022 at 5:30 AM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > > * Make fd count acces O(1) and expose it in /proc/pid/status
> > >
> > > This is doable, next to FDSize.
> >
> > It feels like a better solution, but maybe I'm missing some context
> > here. Let me know whether this is preferred.
>
> I don't know. I'd put it in st_size as you did initially.
> /proc/*/status should be slow.

Could you elaborate what you mean?

* Are you saying that having FDUsed in /proc/*/status _would_ be slow?
I would imagine that adding atomic_read() there shouldn't slow things
down too much.
* Are you saying that reading /proc/*/status is already slow and
reading the number of open files from there would be inefficient?

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-09-19 22:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
2022-09-17  0:01 ` Andrew Morton
2022-09-17  0:28   ` Ivan Babrou
2022-09-17  7:15   ` Alexey Dobriyan
2022-09-17 18:32     ` Ivan Babrou
2022-09-19 12:30       ` Alexey Dobriyan
2022-09-19 22:03         ` Ivan Babrou
2022-09-19  8:18     ` Christian Brauner
2022-09-17  3:22 ` kernel test robot
2022-09-17 15:31 ` Theodore Ts'o
2022-09-17 18:32   ` Ivan Babrou

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.