bpf.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] bpf iterator for file-system
@ 2023-02-28  3:30 Hou Tao
  2023-02-28 19:59 ` Viacheslav Dubeyko
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Hou Tao @ 2023-02-28  3:30 UTC (permalink / raw)
  To: lsf-pc
  Cc: bpf, linux-fsdevel, Miklos Szeredi, Nhat Pham,
	Alexei Starovoitov, Yonghong Song

From time to time, new syscalls have been proposed to improve file-system
observability:

(1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
multiple values in a single syscall.
(2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
of a given file in a scalable way.

All of these proposals require adding a new syscall. Here I would like to
propose another solution for file-system observability: a bpf iterator for
file-system objects. The initial idea came when I was trying to implement a
filefrag-like page cache tool with support for multi-order folios, so that we
can know the number of multi-order folios and the orders of those folios in the
page cache. After developing a demo for it, I realized that we could use it to
provide more observability for file-system objects, e.g., dumping the per-cpu
iostat for a super-block [2], iterating over all inodes in a super-block to dump
info for specific inodes (e.g., unlinked but pinned inodes), or displaying the
flags of a specific mount.

The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
for kernel objects. It works by creating a bpf iterator file [4], which is a
seq-like read-only file whose content is determined by a previously loaded bpf
program, so userspace can read the bpf iterator file to get the information it
needs. However, there are some unresolved issues:
(1) The privilege.
Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
observability will only be available to privileged processes. Maybe we can load
the bpf program through a privileged process and make the bpf iterator file
readable by normal users.
(2) Prevent pinning the super-block
In the current naive implementation, the bpf iterator simply pins the
super-block of the passed fd and prevents the super-block from being destroyed.
Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
the filesystem is unmounted.

I hope to send out an RFC soon before LSF/MM/BPF for further discussion.

[0]:
https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
[1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
[2]:
https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
[4]: https://docs.kernel.org/bpf/bpf_iterators.html



* Re: [LSF/MM/BPF TOPIC] bpf iterator for file-system
  2023-02-28  3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
@ 2023-02-28 19:59 ` Viacheslav Dubeyko
  2023-03-08  0:31 ` Andrii Nakryiko
  2023-04-16  7:55 ` [Lsf-pc] " Amir Goldstein
  2 siblings, 0 replies; 6+ messages in thread
From: Viacheslav Dubeyko @ 2023-02-28 19:59 UTC (permalink / raw)
  To: Hou Tao
  Cc: lsf-pc, bpf, Linux FS Devel, Miklos Szeredi, Nhat Pham,
	Alexei Starovoitov, Yonghong Song



> On Feb 27, 2023, at 7:30 PM, Hou Tao <houtao@huaweicloud.com> wrote:
> 
> From time to time, new syscalls have been proposed to improve file-system
> observability:
> 
> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> multiple values in a single syscall.
> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> of a given file in a scalable way.
> 
> All of these proposals require adding a new syscall. Here I would like to
> propose another solution for file-system observability: a bpf iterator for
> file-system objects. The initial idea came when I was trying to implement a
> filefrag-like page cache tool with support for multi-order folios, so that we
> can know the number of multi-order folios and the orders of those folios in the
> page cache. After developing a demo for it, I realized that we could use it to
> provide more observability for file-system objects, e.g., dumping the per-cpu
> iostat for a super-block [2], iterating over all inodes in a super-block to dump
> info for specific inodes (e.g., unlinked but pinned inodes), or displaying the
> flags of a specific mount.
> 

Sounds like an interesting suggestion to me. :) Potentially, it could have more
applications.

> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
> 

It will be good to see the patchset. :)

Thanks,
Slava.


* Re: [LSF/MM/BPF TOPIC] bpf iterator for file-system
  2023-02-28  3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
  2023-02-28 19:59 ` Viacheslav Dubeyko
@ 2023-03-08  0:31 ` Andrii Nakryiko
  2023-04-16  7:55 ` [Lsf-pc] " Amir Goldstein
  2 siblings, 0 replies; 6+ messages in thread
From: Andrii Nakryiko @ 2023-03-08  0:31 UTC (permalink / raw)
  To: Hou Tao
  Cc: lsf-pc, bpf, linux-fsdevel, Miklos Szeredi, Nhat Pham,
	Alexei Starovoitov, Yonghong Song

On Mon, Feb 27, 2023 at 7:42 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> (1) The privilege.
> Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
> observability will only be available to privileged processes. Maybe we can load
> the bpf program through a privileged process and make the bpf iterator file
> readable by normal users.

That's possible today. Once you load the BPF iter program and pin it in
BPF FS, you can chown/chmod the pinned file to give unprivileged processes
access to it.


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
  2023-02-28  3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
  2023-02-28 19:59 ` Viacheslav Dubeyko
  2023-03-08  0:31 ` Andrii Nakryiko
@ 2023-04-16  7:55 ` Amir Goldstein
  2023-04-24  6:45   ` Hou Tao
  2 siblings, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2023-04-16  7:55 UTC (permalink / raw)
  To: Hou Tao
  Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
	linux-fsdevel, Yonghong Song, bpf

On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.

Hi Hou,

IIUC, there is not much value in making this a cross track session.
Seems like an FS track session that has not much to do with BPF
development.

Am I understanding correctly or are there any cross subsystem
interactions that need to be discussed?

Perhaps we can add you as a co-speaker for Miklos' traditional
"fsinfo" session?

Thanks,
Amir.


* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
  2023-04-16  7:55 ` [Lsf-pc] " Amir Goldstein
@ 2023-04-24  6:45   ` Hou Tao
  2023-04-27 15:54     ` Amir Goldstein
  0 siblings, 1 reply; 6+ messages in thread
From: Hou Tao @ 2023-04-24  6:45 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
	linux-fsdevel, Yonghong Song, bpf

Hi,

On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> Hi Hou,
>
> IIUC, there is not much value in making this a cross track session.
> Seems like an FS track session that has not much to do with BPF
> development.
>
> Am I understanding correctly or are there any cross subsystem
> interactions that need to be discussed?
Yes. Although the patchset for the file-system iterator is still not ready, I
think the BPF mechanisms for the file-system iterator are ready, so a
cross-track session may be unnecessary.
>
> Perhaps we can add you as a co-speaker for Miklos' traditional
> "fsinfo" session?
Thanks. I would be glad to be a co-speaker for the fsinfo session.
>
> Thanks,
> Amir.

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
  2023-04-24  6:45   ` Hou Tao
@ 2023-04-27 15:54     ` Amir Goldstein
  0 siblings, 0 replies; 6+ messages in thread
From: Amir Goldstein @ 2023-04-27 15:54 UTC (permalink / raw)
  To: Hou Tao
  Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
	linux-fsdevel, Yonghong Song, bpf

On Mon, Apr 24, 2023 at 9:45 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> > Hi Hou,
> >
> > IIUC, there is not much value in making this a cross track session.
> > Seems like an FS track session that has not much to do with BPF
> > development.
> >
> > Am I understanding correctly or are there any cross subsystem
> > interactions that need to be discussed?
> Yes. Although the patchset for the file-system iterator is still not ready, I
> think the BPF mechanisms for the file-system iterator are ready, so a
> cross-track session may be unnecessary.
> >
> > Perhaps we can add you as a co-speaker for Miklos' traditional
> > "fsinfo" session?
> Thanks. I would be glad to be a co-speaker for the fsinfo session.

All right. I put you down as a co-speaker with Miklos on the fsinfo session.

Thanks,
Amir.
