* [RFC] proc: report open files as size in stat() for /proc/pid/fd
@ 2022-09-16 23:08 Ivan Babrou
2022-09-17 0:01 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-16 23:08 UTC (permalink / raw)
To: linux-fsdevel
Cc: linux-kernel, kernel-team, Ivan Babrou, Andrew Morton, Kalesh Singh
Many monitoring tools include open file count as a metric. Currently
the only way to get this number is to enumerate the files in /proc/pid/fd.
The problem with the current approach is that it does many things people
generally don't care about when they need one number for a metric.
In our tests for cadvisor, which reports open file counts per cgroup,
we observed that reading the number of open files is slow. Out of 35.23%
of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
`proc_fill_cache`, which is responsible for filling dentry info.
Some of this extra time is spinlock contention, but it's a contention
for the lock we don't want to take to begin with.
We considered putting the number of open files in /proc/pid/stat.
Unfortunately, counting the number of fds involves iterating the fdtable,
which means that it might slow down /proc/pid/stat for processes
with many open files. Instead we opted to put this info in /proc/pid/fd
as a size member of the stat syscall result. Previously the reported
number was zero, so there's very little risk of breaking anything,
while still providing a somewhat logical way to count the open files.
Previously:
```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 0 Blocks: 0 IO Block: 1024 directory
```
With this patch:
```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 65 Blocks: 0 IO Block: 1024 directory
```
Correctness check:
```
$ sudo ls /proc/1/fd | wc -l
65
```
There are two alternatives to this approach that I can see:
* Expose /proc/pid/fd_count with a count there
* Make fd count acces O(1) and expose it in /proc/pid/status
I can probably figure out how to do the former, but the latter
will require somebody with more experience in file code than myself.
Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
---
fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..c7ac142500a8 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
return 0;
}
+static int proc_readfd_count(struct inode *inode)
+{
+ struct task_struct *p = get_proc_task(inode);
+ unsigned int fd = 0, count = 0;
+
+ if (!p)
+ return -ENOENT;
+
+ rcu_read_lock();
+ while (task_lookup_next_fd_rcu(p, &fd)) {
+ rcu_read_unlock();
+
+ count++;
+ fd++;
+
+ cond_resched();
+ rcu_read_lock();
+ }
+ rcu_read_unlock();
+ put_task_struct(p);
+ return count;
+}
+
static int proc_readfd(struct file *file, struct dir_context *ctx)
{
return proc_readfd_common(file, ctx, proc_fd_instantiate);
@@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
return rv;
}
+int proc_fd_getattr(struct user_namespace *mnt_userns,
+ const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = d_inode(path->dentry);
+ struct proc_dir_entry *de = PDE(inode);
+
+ if (de) {
+ nlink_t nlink = READ_ONCE(de->nlink);
+
+ if (nlink > 0)
+ set_nlink(inode, nlink);
+ }
+
+ generic_fillattr(&init_user_ns, inode, stat);
+
+ /* If it's a directory, put the number of open fds there */
+ if (S_ISDIR(inode->i_mode))
+ stat->size = proc_readfd_count(inode);
+
+ return 0;
+}
+
const struct inode_operations proc_fd_inode_operations = {
.lookup = proc_lookupfd,
.permission = proc_fd_permission,
+ .getattr = proc_fd_getattr,
.setattr = proc_setattr,
};
--
2.37.2
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
@ 2022-09-17 0:01 ` Andrew Morton
2022-09-17 0:28 ` Ivan Babrou
2022-09-17 7:15 ` Alexey Dobriyan
2022-09-17 3:22 ` kernel test robot
2022-09-17 15:31 ` Theodore Ts'o
2 siblings, 2 replies; 11+ messages in thread
From: Andrew Morton @ 2022-09-17 0:01 UTC (permalink / raw)
To: Ivan Babrou
Cc: linux-fsdevel, linux-kernel, kernel-team, Kalesh Singh,
Alexey Dobriyan, Al Viro
(cc's added)
On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
> Many monitoring tools include open file count as a metric. Currently
> the only way to get this number is to enumerate the files in /proc/pid/fd.
>
> The problem with the current approach is that it does many things people
> generally don't care about when they need one number for a metric.
> In our tests for cadvisor, which reports open file counts per cgroup,
> we observed that reading the number of open files is slow. Out of 35.23%
> of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> `proc_fill_cache`, which is responsible for filling dentry info.
> Some of this extra time is spinlock contention, but it's a contention
> for the lock we don't want to take to begin with.
>
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.
Documentation/filesystems/proc.rst would be an appropriate place to
document this ;)
> Previously:
>
> ```
> $ sudo stat /proc/1/fd | head -n2
> File: /proc/1/fd
> Size: 0 Blocks: 0 IO Block: 1024 directory
> ```
>
> With this patch:
>
> ```
> $ sudo stat /proc/1/fd | head -n2
> File: /proc/1/fd
> Size: 65 Blocks: 0 IO Block: 1024 directory
> ```
>
> Correctness check:
>
> ```
> $ sudo ls /proc/1/fd | wc -l
> 65
> ```
>
> There are two alternatives to this approach that I can see:
>
> * Expose /proc/pid/fd_count with a count there
> * Make fd count acces O(1) and expose it in /proc/pid/status
>
> I can probably figure out how to do the former, but the latter
> will require somebody with more experience in file code than myself.
>
> Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
> ---
> fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>
> diff --git a/fs/proc/fd.c b/fs/proc/fd.c
> index 913bef0d2a36..c7ac142500a8 100644
> --- a/fs/proc/fd.c
> +++ b/fs/proc/fd.c
> @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> return 0;
> }
>
> +static int proc_readfd_count(struct inode *inode)
> +{
> + struct task_struct *p = get_proc_task(inode);
> + unsigned int fd = 0, count = 0;
> +
> + if (!p)
> + return -ENOENT;
> +
> + rcu_read_lock();
> + while (task_lookup_next_fd_rcu(p, &fd)) {
> + rcu_read_unlock();
> +
> + count++;
> + fd++;
> +
> + cond_resched();
> + rcu_read_lock();
> + }
> + rcu_read_unlock();
> + put_task_struct(p);
> + return count;
> +}
> +
> static int proc_readfd(struct file *file, struct dir_context *ctx)
> {
> return proc_readfd_common(file, ctx, proc_fd_instantiate);
> @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> return rv;
> }
>
> +int proc_fd_getattr(struct user_namespace *mnt_userns,
> + const struct path *path, struct kstat *stat,
> + u32 request_mask, unsigned int query_flags)
> +{
> + struct inode *inode = d_inode(path->dentry);
> + struct proc_dir_entry *de = PDE(inode);
> +
> + if (de) {
> + nlink_t nlink = READ_ONCE(de->nlink);
> +
> + if (nlink > 0)
> + set_nlink(inode, nlink);
> + }
> +
> + generic_fillattr(&init_user_ns, inode, stat);
> +
> + /* If it's a directory, put the number of open fds there */
> + if (S_ISDIR(inode->i_mode))
> + stat->size = proc_readfd_count(inode);
> +
> + return 0;
> +}
> +
> const struct inode_operations proc_fd_inode_operations = {
> .lookup = proc_lookupfd,
> .permission = proc_fd_permission,
> + .getattr = proc_fd_getattr,
> .setattr = proc_setattr,
> };
>
> --
> 2.37.2
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 0:01 ` Andrew Morton
@ 2022-09-17 0:28 ` Ivan Babrou
2022-09-17 7:15 ` Alexey Dobriyan
1 sibling, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17 0:28 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-fsdevel, linux-kernel, kernel-team, Kalesh Singh,
Alexey Dobriyan, Al Viro
On Fri, Sep 16, 2022 at 5:01 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> (cc's added)
>
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
>
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> >
> > The problem with the current approach is that it does many things people
> > generally don't care about when they need one number for a metric.
> > In our tests for cadvisor, which reports open file counts per cgroup,
> > we observed that reading the number of open files is slow. Out of 35.23%
> > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > `proc_fill_cache`, which is responsible for filling dentry info.
> > Some of this extra time is spinlock contention, but it's a contention
> > for the lock we don't want to take to begin with.
> >
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)
I am more than happy to add the docs after there's a confirmation that
this is an appropriate approach to expose this information. I probably
should've mentioned this explicitly, that's on me. There are two
alternative approaches at the bottom of my original email that might
be considered.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
2022-09-17 0:01 ` Andrew Morton
@ 2022-09-17 3:22 ` kernel test robot
2022-09-17 15:31 ` Theodore Ts'o
2 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2022-09-17 3:22 UTC (permalink / raw)
To: Ivan Babrou; +Cc: llvm, kbuild-all
Hi Ivan,
[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.0-rc5 next-20220916]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Ivan-Babrou/proc-report-open-files-as-size-in-stat-for-proc-pid-fd/20220917-071056
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
config: hexagon-randconfig-r041-20220916 (https://download.01.org/0day-ci/archive/20220917/202209171129.dnkI0XiE-lkp@intel.com/config)
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 791a7ae1ba3efd6bca96338e10ffde557ba83920)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/58f6f7e4082a88ddcb4b732d386f9a4b0e1b03b4
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Ivan-Babrou/proc-report-open-files-as-size-in-stat-for-proc-pid-fd/20220917-071056
git checkout 58f6f7e4082a88ddcb4b732d386f9a4b0e1b03b4
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash fs/proc/
If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>
All warnings (new ones prefixed by >>):
>> fs/proc/fd.c:345:5: warning: no previous prototype for function 'proc_fd_getattr' [-Wmissing-prototypes]
int proc_fd_getattr(struct user_namespace *mnt_userns,
^
fs/proc/fd.c:345:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
int proc_fd_getattr(struct user_namespace *mnt_userns,
^
static
1 warning generated.
vim +/proc_fd_getattr +345 fs/proc/fd.c
344
> 345 int proc_fd_getattr(struct user_namespace *mnt_userns,
346 const struct path *path, struct kstat *stat,
347 u32 request_mask, unsigned int query_flags)
348 {
349 struct inode *inode = d_inode(path->dentry);
350 struct proc_dir_entry *de = PDE(inode);
351
352 if (de) {
353 nlink_t nlink = READ_ONCE(de->nlink);
354
355 if (nlink > 0)
356 set_nlink(inode, nlink);
357 }
358
359 generic_fillattr(&init_user_ns, inode, stat);
360
361 /* If it's a directory, put the number of open fds there */
362 if (S_ISDIR(inode->i_mode))
363 stat->size = proc_readfd_count(inode);
364
365 return 0;
366 }
367
--
0-DAY CI Kernel Test Service
https://01.org/lkp
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 0:01 ` Andrew Morton
2022-09-17 0:28 ` Ivan Babrou
@ 2022-09-17 7:15 ` Alexey Dobriyan
2022-09-17 18:32 ` Ivan Babrou
2022-09-19 8:18 ` Christian Brauner
1 sibling, 2 replies; 11+ messages in thread
From: Alexey Dobriyan @ 2022-09-17 7:15 UTC (permalink / raw)
To: Andrew Morton
Cc: Ivan Babrou, linux-fsdevel, linux-kernel, kernel-team,
Kalesh Singh, Al Viro
On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> (cc's added)
>
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
>
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> >
> > The problem with the current approach is that it does many things people
> > generally don't care about when they need one number for a metric.
> > In our tests for cadvisor, which reports open file counts per cgroup,
> > we observed that reading the number of open files is slow. Out of 35.23%
> > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > `proc_fill_cache`, which is responsible for filling dentry info.
> > Some of this extra time is spinlock contention, but it's a contention
> > for the lock we don't want to take to begin with.
> >
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)
>
> > Previously:
> >
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> > File: /proc/1/fd
> > Size: 0 Blocks: 0 IO Block: 1024 directory
> > ```
> >
> > With this patch:
> >
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> > File: /proc/1/fd
> > Size: 65 Blocks: 0 IO Block: 1024 directory
Yes. This is natural place.
> > ```
> >
> > Correctness check:
> >
> > ```
> > $ sudo ls /proc/1/fd | wc -l
> > 65
> > ```
> >
> > There are two alternatives to this approach that I can see:
> >
> > * Expose /proc/pid/fd_count with a count there
> > * Make fd count acces O(1) and expose it in /proc/pid/status
This is doable, next to FDSize.
Below is doable too.
> > --- a/fs/proc/fd.c
> > +++ b/fs/proc/fd.c
> > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> > return 0;
> > }
> >
> > +static int proc_readfd_count(struct inode *inode)
> > +{
> > + struct task_struct *p = get_proc_task(inode);
> > + unsigned int fd = 0, count = 0;
> > +
> > + if (!p)
> > + return -ENOENT;
> > +
> > + rcu_read_lock();
> > + while (task_lookup_next_fd_rcu(p, &fd)) {
> > + rcu_read_unlock();
> > +
> > + count++;
> > + fd++;
> > +
> > + cond_resched();
> > + rcu_read_lock();
> > + }
> > + rcu_read_unlock();
> > + put_task_struct(p);
> > + return count;
> > +}
> > +
> > static int proc_readfd(struct file *file, struct dir_context *ctx)
> > {
> > return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> > return rv;
> > }
> >
> > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > + const struct path *path, struct kstat *stat,
> > + u32 request_mask, unsigned int query_flags)
> > +{
> > + struct inode *inode = d_inode(path->dentry);
> > + struct proc_dir_entry *de = PDE(inode);
> > +
> > + if (de) {
> > + nlink_t nlink = READ_ONCE(de->nlink);
> > +
> > + if (nlink > 0)
> > + set_nlink(inode, nlink);
> > + }
> > +
> > + generic_fillattr(&init_user_ns, inode, stat);
^^^^^^^^^^^^^
Is this correct? I'm not userns guy at all.
> > +
> > + /* If it's a directory, put the number of open fds there */
> > + if (S_ISDIR(inode->i_mode))
> > + stat->size = proc_readfd_count(inode);
ENOENT can get there. In principle this is OK, userspace can live with it.
> > const struct inode_operations proc_fd_inode_operations = {
> > .lookup = proc_lookupfd,
> > .permission = proc_fd_permission,
> > + .getattr = proc_fd_getattr,
> > .setattr = proc_setattr,
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
2022-09-17 0:01 ` Andrew Morton
2022-09-17 3:22 ` kernel test robot
@ 2022-09-17 15:31 ` Theodore Ts'o
2022-09-17 18:32 ` Ivan Babrou
2 siblings, 1 reply; 11+ messages in thread
From: Theodore Ts'o @ 2022-09-17 15:31 UTC (permalink / raw)
To: Ivan Babrou
Cc: linux-fsdevel, linux-kernel, kernel-team, Andrew Morton, Kalesh Singh
On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.
Instead of using the st_size of /proc/<pid>/fd, why not return that
value in st_nlink? /proc/<pid>/fd is a directory, so having st_nlinks
return number of fd's plus 2 (for . and ..) would be much more natural.
Cheers,
- Ted
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 7:15 ` Alexey Dobriyan
@ 2022-09-17 18:32 ` Ivan Babrou
2022-09-19 12:30 ` Alexey Dobriyan
2022-09-19 8:18 ` Christian Brauner
1 sibling, 1 reply; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17 18:32 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
Kalesh Singh, Al Viro
> > > * Make fd count acces O(1) and expose it in /proc/pid/status
>
> This is doable, next to FDSize.
It feels like a better solution, but maybe I'm missing some context
here. Let me know whether this is preferred.
That said, I've tried doing it, but failed. There's a noticeable
mismatch in the numbers:
* systemd:
ivan@vm:~$ sudo ls -l /proc/1/fd | wc -l
66
ivan@vm:~$ cat /proc/1/status | fgrep FD
FDSize: 256
FDUsed: 71
* journald:
ivan@vm:~$ sudo ls -l /proc/803/fd | wc -l
29
ivan@vm:~$ cat /proc/803/status | fgrep FD
FDSize: 128
FDUsed: 37
I'll see if I can make it work next week. I'm happy to receive tips as well.
Below is my attempt (link in case gmail breaks patch formatting):
* https://gist.githubusercontent.com/bobrik/acce40881d629d8cce2e55966b31a0a2/raw/716eb4724a8fe3afeeb76fd2a7a47ee13790a9e9/fdused.patch
diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..8bc0741cabf1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -85,6 +85,8 @@ static void copy_fdtable(struct fdtable *nfdt,
struct fdtable *ofdt)
memset((char *)nfdt->fd + cpy, 0, set);
copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+
+ atomic_set(&nfdt->count, atomic_read(&ofdt->count));
}
/*
@@ -105,6 +107,7 @@ static void copy_fdtable(struct fdtable *nfdt,
struct fdtable *ofdt)
static struct fdtable * alloc_fdtable(unsigned int nr)
{
struct fdtable *fdt;
+ atomic_t count = ATOMIC_INIT(0);
void *data;
/*
@@ -148,6 +151,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
fdt->close_on_exec = data;
data += nr / BITS_PER_BYTE;
fdt->full_fds_bits = data;
+ fdt->count = count;
return fdt;
@@ -399,6 +403,8 @@ struct files_struct *dup_fd(struct files_struct
*oldf, unsigned int max_fds, int
/* clear the remainder */
memset(new_fds, 0, (new_fdt->max_fds - open_files) * sizeof(struct file *));
+ atomic_set(&new_fdt->count, atomic_read(&old_fdt->count));
+
rcu_assign_pointer(newf->fdt, new_fdt);
return newf;
@@ -474,6 +480,7 @@ struct files_struct init_files = {
.close_on_exec = init_files.close_on_exec_init,
.open_fds = init_files.open_fds_init,
.full_fds_bits = init_files.full_fds_bits_init,
+ .count = ATOMIC_INIT(0),
},
.file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock),
.resize_wait = __WAIT_QUEUE_HEAD_INITIALIZER(init_files.resize_wait),
@@ -613,6 +620,7 @@ void fd_install(unsigned int fd, struct file *file)
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);
return;
}
/* coupled with smp_wmb() in expand_fdtable() */
@@ -621,6 +629,7 @@ void fd_install(unsigned int fd, struct file *file)
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
rcu_read_unlock_sched();
+ atomic_inc(&fdt->count);
}
EXPORT_SYMBOL(fd_install);
@@ -646,6 +655,7 @@ static struct file *pick_file(struct files_struct
*files, unsigned fd)
if (file) {
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
+ atomic_dec(&fdt->count);
}
return file;
}
@@ -844,6 +854,7 @@ void do_close_on_exec(struct files_struct *files)
filp_close(file, files);
cond_resched();
spin_lock(&files->file_lock);
+ atomic_dec(&fdt->count);
}
}
@@ -1108,6 +1119,7 @@ __releases(&files->file_lock)
else
__clear_close_on_exec(fd, fdt);
spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);
if (tofree)
filp_close(tofree, files);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 99fcbfda8e25..5847f077bfc3 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -153,7 +153,8 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
- unsigned int max_fds = 0;
+ struct fdtable *fdt;
+ unsigned int max_fds = 0, open_fds = 0;
rcu_read_lock();
ppid = pid_alive(p) ?
@@ -170,8 +171,11 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
task_lock(p);
if (p->fs)
umask = p->fs->umask;
- if (p->files)
- max_fds = files_fdtable(p->files)->max_fds;
+ if (p->files) {
+ fdt = files_fdtable(p->files);
+ max_fds = fdt->max_fds;
+ open_fds = atomic_read(&fdt->count);
+ }
task_unlock(p);
rcu_read_unlock();
@@ -194,6 +198,7 @@ static inline void task_state(struct seq_file *m,
struct pid_namespace *ns,
seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
+ seq_put_decimal_ull(m, "\nFDUsed:\t", open_fds);
seq_puts(m, "\nGroups:\t");
group_info = cred->group_info;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..59aceb1e4bc6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -31,6 +31,7 @@ struct fdtable {
unsigned long *open_fds;
unsigned long *full_fds_bits;
struct rcu_head rcu;
+ atomic_t count;
};
static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
> > > +
> > > + generic_fillattr(&init_user_ns, inode, stat);
> ^^^^^^^^^^^^^
>
> Is this correct? I'm not userns guy at all.
I mostly copied from here:
* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/generic.c#L150
Maybe it can be simplified even further to match this one:
* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/root.c#L317
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 15:31 ` Theodore Ts'o
@ 2022-09-17 18:32 ` Ivan Babrou
0 siblings, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-17 18:32 UTC (permalink / raw)
To: Theodore Ts'o
Cc: linux-fsdevel, linux-kernel, kernel-team, Andrew Morton, Kalesh Singh
On Sat, Sep 17, 2022 at 8:31 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Instead of using the st_size of /proc/<pid>/fd, why not return that
> value in st_nlink? /proc/<pid>/fd is a directory, so having st_nlinks
> return number of fd's plus 2 (for . and ..) would be much more natural.
From what I see, st_nlinks is used for the number of subdirectories
and it doesn't include files. In /proc/fd we only have files (well,
symlinks really). I'm still happy to use that instead if that's
preferred.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 7:15 ` Alexey Dobriyan
2022-09-17 18:32 ` Ivan Babrou
@ 2022-09-19 8:18 ` Christian Brauner
1 sibling, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2022-09-19 8:18 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, Ivan Babrou, linux-fsdevel, linux-kernel,
kernel-team, Kalesh Singh, Al Viro
On Sat, Sep 17, 2022 at 10:15:13AM +0300, Alexey Dobriyan wrote:
> On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> > (cc's added)
> >
> > On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > > Many monitoring tools include open file count as a metric. Currently
> > > the only way to get this number is to enumerate the files in /proc/pid/fd.
> > >
> > > The problem with the current approach is that it does many things people
> > > generally don't care about when they need one number for a metric.
> > > In our tests for cadvisor, which reports open file counts per cgroup,
> > > we observed that reading the number of open files is slow. Out of 35.23%
> > > of CPU time spent in `proc_readfd_common`, we see 29.43% spent in
> > > `proc_fill_cache`, which is responsible for filling dentry info.
> > > Some of this extra time is spinlock contention, but it's a contention
> > > for the lock we don't want to take to begin with.
> > >
> > > We considered putting the number of open files in /proc/pid/stat.
> > > Unfortunately, counting the number of fds involves iterating the fdtable,
> > > which means that it might slow down /proc/pid/stat for processes
> > > with many open files. Instead we opted to put this info in /proc/pid/fd
> > > as a size member of the stat syscall result. Previously the reported
> > > number was zero, so there's very little risk of breaking anything,
> > > while still providing a somewhat logical way to count the open files.
> >
> > Documentation/filesystems/proc.rst would be an appropriate place to
> > document this ;)
> >
> > > Previously:
> > >
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > > File: /proc/1/fd
> > > Size: 0 Blocks: 0 IO Block: 1024 directory
> > > ```
> > >
> > > With this patch:
> > >
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > > File: /proc/1/fd
> > > Size: 65 Blocks: 0 IO Block: 1024 directory
>
> Yes. This is natural place.
>
> > > ```
> > >
> > > Correctness check:
> > >
> > > ```
> > > $ sudo ls /proc/1/fd | wc -l
> > > 65
> > > ```
> > >
> > > There are two alternatives to this approach that I can see:
> > >
> > > * Expose /proc/pid/fd_count with a count there
>
> > > * Make fd count acces O(1) and expose it in /proc/pid/status
>
> This is doable, next to FDSize.
>
> Below is doable too.
>
> > > --- a/fs/proc/fd.c
> > > +++ b/fs/proc/fd.c
> > > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> > > return 0;
> > > }
> > >
> > > +static int proc_readfd_count(struct inode *inode)
> > > +{
> > > + struct task_struct *p = get_proc_task(inode);
> > > + unsigned int fd = 0, count = 0;
> > > +
> > > + if (!p)
> > > + return -ENOENT;
> > > +
> > > + rcu_read_lock();
> > > + while (task_lookup_next_fd_rcu(p, &fd)) {
> > > + rcu_read_unlock();
> > > +
> > > + count++;
> > > + fd++;
> > > +
> > > + cond_resched();
> > > + rcu_read_lock();
> > > + }
> > > + rcu_read_unlock();
> > > + put_task_struct(p);
> > > + return count;
> > > +}
> > > +
> > > static int proc_readfd(struct file *file, struct dir_context *ctx)
> > > {
> > > return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> > > return rv;
> > > }
> > >
> > > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > > + const struct path *path, struct kstat *stat,
> > > + u32 request_mask, unsigned int query_flags)
> > > +{
> > > + struct inode *inode = d_inode(path->dentry);
> > > + struct proc_dir_entry *de = PDE(inode);
> > > +
> > > + if (de) {
> > > + nlink_t nlink = READ_ONCE(de->nlink);
> > > +
> > > + if (nlink > 0)
> > > + set_nlink(inode, nlink);
> > > + }
> > > +
> > > + generic_fillattr(&init_user_ns, inode, stat);
> ^^^^^^^^^^^^^
>
> Is this correct? I'm not userns guy at all.
This is correct. :)
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-17 18:32 ` Ivan Babrou
@ 2022-09-19 12:30 ` Alexey Dobriyan
2022-09-19 22:03 ` Ivan Babrou
0 siblings, 1 reply; 11+ messages in thread
From: Alexey Dobriyan @ 2022-09-19 12:30 UTC (permalink / raw)
To: Ivan Babrou
Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
Kalesh Singh, Al Viro
On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > * Make fd count acces O(1) and expose it in /proc/pid/status
> >
> > This is doable, next to FDSize.
>
> It feels like a better solution, but maybe I'm missing some context
> here. Let me know whether this is preferred.
I don't know. I'd put it in st_size as you did initially.
/proc/*/status should be slow.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd
2022-09-19 12:30 ` Alexey Dobriyan
@ 2022-09-19 22:03 ` Ivan Babrou
0 siblings, 0 replies; 11+ messages in thread
From: Ivan Babrou @ 2022-09-19 22:03 UTC (permalink / raw)
To: Alexey Dobriyan
Cc: Andrew Morton, linux-fsdevel, linux-kernel, kernel-team,
Kalesh Singh, Al Viro
On Mon, Sep 19, 2022 at 5:30 AM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > > * Make fd count acces O(1) and expose it in /proc/pid/status
> > >
> > > This is doable, next to FDSize.
> >
> > It feels like a better solution, but maybe I'm missing some context
> > here. Let me know whether this is preferred.
>
> I don't know. I'd put it in st_size as you did initially.
> /proc/*/status should be slow.
Could you elaborate what you mean?
* Are you saying that having FDUsed in /proc/*/status _would_ be slow?
I would imagine that adding atomic_read() there shouldn't slow things
down too much.
* Are you saying that reading /proc/*/status is already slow and
reading the number of open files from there would be inefficient?
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2022-09-19 22:04 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-16 23:08 [RFC] proc: report open files as size in stat() for /proc/pid/fd Ivan Babrou
2022-09-17 0:01 ` Andrew Morton
2022-09-17 0:28 ` Ivan Babrou
2022-09-17 7:15 ` Alexey Dobriyan
2022-09-17 18:32 ` Ivan Babrou
2022-09-19 12:30 ` Alexey Dobriyan
2022-09-19 22:03 ` Ivan Babrou
2022-09-19 8:18 ` Christian Brauner
2022-09-17 3:22 ` kernel test robot
2022-09-17 15:31 ` Theodore Ts'o
2022-09-17 18:32 ` Ivan Babrou
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.