* [LSF/MM/BPF TOPIC] bpf iterator for file-system
@ 2023-02-28 3:30 Hou Tao
2023-02-28 19:59 ` Viacheslav Dubeyko
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Hou Tao @ 2023-02-28 3:30 UTC (permalink / raw)
To: lsf-pc
Cc: bpf, linux-fsdevel, Miklos Szeredi, Nhat Pham,
Alexei Starovoitov, Yonghong Song
From time to time, new syscalls have been proposed to gain more observability
for file systems:
(1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
multiple values in a single syscall.
(2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
of a given file in a scalable way.
All of these proposals require adding a new syscall. Here I would like to propose
another solution for file system observability: a bpf iterator for file system
objects. The initial idea came when I was trying to implement a filefrag-like
page cache tool with support for multi-order folios, so that we can know the
number of multi-order folios and the orders of those folios in the page cache.
After developing a demo for it, I realized that we could use it to provide more
observability for file system objects, e.g., dumping the per-cpu iostat for a
super block [2], iterating over all inodes in a super-block to dump info for
specific inodes (e.g., unlinked but pinned inodes), or displaying the flags of a
specific mount.
The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
for kernel objects. It works by creating a bpf iterator file [4], which is a
seq-like read-only file whose content is determined by a previously loaded bpf
program, so userspace can read the bpf iterator file to get the information it
needs. However, there are some unresolved issues:
(1) The privilege.
Loading the bpf program requires CAP_SYS_ADMIN or CAP_BPF. This means that the
observability will only be available to privileged processes. Maybe we can load
the bpf program through a privileged process and make the bpf iterator file
readable by normal users.
(2) Prevent pinning the super-block
In the current naive implementation, the bpf iterator simply pins the
super-block of the passed fd and prevents the super-block from being destroyed.
Perhaps fs-pin is a better choice, so that the bpf iterator can be deactivated
after the filesystem is unmounted.
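To illustrate the userspace side described above: reading a bpf iterator file
needs nothing beyond open() and read(). The following is only a minimal sketch,
not code from the demo. The helper name dump_iter_file() is made up for
illustration, and it assumes the iterator has already been loaded and pinned at
some path by a privileged process.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything that can be read from the file at 'path' to 'out'.
 * 'path' is assumed to be an already pinned bpf iterator file, but any
 * readable file behaves the same way from the caller's point of view.
 * Returns the number of bytes copied, or -1 on error with errno set.
 */
static long dump_iter_file(const char *path, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t n;
	int saved;
	int fd;

	assert(path != NULL && out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* read() returns 0 once the iterator has emitted all its output. */
	while ((n = read(fd, buf, sizeof(buf))) != 0) {
		if (n < 0) {
			if (errno == EINTR)
				continue;
			saved = errno;
			close(fd);
			errno = saved;
			return -1;
		}
		if (fwrite(buf, 1, (size_t)n, out) != (size_t)n) {
			close(fd);
			errno = EIO;
			return -1;
		}
		total += n;
	}

	close(fd);
	return total;
}
```

A caller would use it as dump_iter_file("/sys/fs/bpf/fs_iter", stdout), where
the pin path is hypothetical. What the output looks like is decided entirely by
the loaded bpf program.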
I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
[0]:
https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
[1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
[2]:
https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
[3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
[4]: https://docs.kernel.org/bpf/bpf_iterators.html
* Re: [LSF/MM/BPF TOPIC] bpf iterator for file-system
2023-02-28 3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
@ 2023-02-28 19:59 ` Viacheslav Dubeyko
2023-03-08 0:31 ` Andrii Nakryiko
2023-04-16 7:55 ` [Lsf-pc] " Amir Goldstein
2 siblings, 0 replies; 6+ messages in thread
From: Viacheslav Dubeyko @ 2023-02-28 19:59 UTC (permalink / raw)
To: Hou Tao
Cc: lsf-pc, bpf, Linux FS Devel, Miklos Szeredi, Nhat Pham,
Alexei Starovoitov, Yonghong Song
> On Feb 27, 2023, at 7:30 PM, Hou Tao <houtao@huaweicloud.com> wrote:
>
> From time to time, new syscalls have been proposed to gain more observability
> for file-system:
>
> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> multiple values in single syscall.
> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> of a given file in a scalable way.
>
> All these proposals requires adding a new syscall. Here I would like to propose
> another solution for file system observability: bpf iterator for file system
> object. The initial idea came when I was trying to implement a filefrag-like
> page cache tool with support for multi-order folio, so that we can know the
> number of multi-order folios and the orders of those folios in page cache. After
> developing a demo for it, I realized that we could use it to provide more
> observability for file system objects. e.g., dumping the per-cpu iostat for a
> super block [2], iterating all inodes in a super-block to dump info for
> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
> specific mount.
>
Sounds like an interesting suggestion to me. :) Potentially, it could have more
applications.
> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> for kernel objects. It works by creating bpf iterator file [4], which is a
> seq-like read-only file, and the content of the bpf iterator file is determined
> by a previously loaded bpf program, so userspace can read the bpf iterator file
> to get the information it needs. However there are some unresolved issues:
> (1) The privilege.
> Loading the bpf program requires CAP_ADMIN or CAP_BPF. This means that the
> observability will be available to the privileged process. Maybe we can load the
> bpf program through a privileged process and make the bpf iterator file being
> readable for normal users.
> (2) Prevent pinning the super-block
> In the current naive implementation, the bpf iterator simply pins the
> super-block of the passed fd and prevents the super-block from being destroyed.
> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> the filesystem is umounted.
>
> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
>
It will be good to see the patchset. :)
Thanks,
Slava.
> [0]:
> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
> [2]:
> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
> [4]: https://docs.kernel.org/bpf/bpf_iterators.html
>
* Re: [LSF/MM/BPF TOPIC] bpf iterator for file-system
2023-02-28 3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
2023-02-28 19:59 ` Viacheslav Dubeyko
@ 2023-03-08 0:31 ` Andrii Nakryiko
2023-04-16 7:55 ` [Lsf-pc] " Amir Goldstein
2 siblings, 0 replies; 6+ messages in thread
From: Andrii Nakryiko @ 2023-03-08 0:31 UTC (permalink / raw)
To: Hou Tao
Cc: lsf-pc, bpf, linux-fsdevel, Miklos Szeredi, Nhat Pham,
Alexei Starovoitov, Yonghong Song
On Mon, Feb 27, 2023 at 7:42 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From time to time, new syscalls have been proposed to gain more observability
> for file-system:
>
> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> multiple values in single syscall.
> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> of a given file in a scalable way.
>
> All these proposals requires adding a new syscall. Here I would like to propose
> another solution for file system observability: bpf iterator for file system
> object. The initial idea came when I was trying to implement a filefrag-like
> page cache tool with support for multi-order folio, so that we can know the
> number of multi-order folios and the orders of those folios in page cache. After
> developing a demo for it, I realized that we could use it to provide more
> observability for file system objects. e.g., dumping the per-cpu iostat for a
> super block [2], iterating all inodes in a super-block to dump info for
> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
> specific mount.
>
> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> for kernel objects. It works by creating bpf iterator file [4], which is a
> seq-like read-only file, and the content of the bpf iterator file is determined
> by a previously loaded bpf program, so userspace can read the bpf iterator file
> to get the information it needs. However there are some unresolved issues:
> (1) The privilege.
> Loading the bpf program requires CAP_ADMIN or CAP_BPF. This means that the
> observability will be available to the privileged process. Maybe we can load the
> bpf program through a privileged process and make the bpf iterator file being
> readable for normal users.
That's possible today. Once you load a BPF iter program and pin it in
BPF FS, you can chown/chmod the pinned file to give unprivileged
processes access to it.
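A minimal sketch of the chmod step, assuming the iterator is already pinned.
The helper name make_iter_world_readable() is hypothetical, and the caller must
own the pinned file or be otherwise privileged for chmod() to succeed.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Make an already pinned bpf iterator file readable (and only readable)
 * by everyone. Returns 0 on success, or -1 with errno set on failure.
 */
static int make_iter_world_readable(const char *pinned_path)
{
	int ret;

	assert(pinned_path != NULL);

	/* 0444: read permission for owner, group and others. */
	ret = chmod(pinned_path, S_IRUSR | S_IRGRP | S_IROTH);
	if (ret < 0)
		perror("chmod");
	return ret;
}
```

It would be called as make_iter_world_readable("/sys/fs/bpf/fs_iter"), with a
made-up pin path. The unprivileged reader still needs search permission on the
directories leading to the pinned file.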
> (2) Prevent pinning the super-block
> In the current naive implementation, the bpf iterator simply pins the
> super-block of the passed fd and prevents the super-block from being destroyed.
> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> the filesystem is umounted.
>
> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
>
> [0]:
> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
> [2]:
> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
> [4]: https://docs.kernel.org/bpf/bpf_iterators.html
>
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
2023-02-28 3:30 [LSF/MM/BPF TOPIC] bpf iterator for file-system Hou Tao
2023-02-28 19:59 ` Viacheslav Dubeyko
2023-03-08 0:31 ` Andrii Nakryiko
@ 2023-04-16 7:55 ` Amir Goldstein
2023-04-24 6:45 ` Hou Tao
2 siblings, 1 reply; 6+ messages in thread
From: Amir Goldstein @ 2023-04-16 7:55 UTC (permalink / raw)
To: Hou Tao
Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
linux-fsdevel, Yonghong Song, bpf
On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From time to time, new syscalls have been proposed to gain more observability
> for file-system:
>
> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> multiple values in single syscall.
> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> of a given file in a scalable way.
>
> All these proposals requires adding a new syscall. Here I would like to propose
> another solution for file system observability: bpf iterator for file system
> object. The initial idea came when I was trying to implement a filefrag-like
> page cache tool with support for multi-order folio, so that we can know the
> number of multi-order folios and the orders of those folios in page cache. After
> developing a demo for it, I realized that we could use it to provide more
> observability for file system objects. e.g., dumping the per-cpu iostat for a
> super block [2], iterating all inodes in a super-block to dump info for
> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
> specific mount.
>
> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> for kernel objects. It works by creating bpf iterator file [4], which is a
> seq-like read-only file, and the content of the bpf iterator file is determined
> by a previously loaded bpf program, so userspace can read the bpf iterator file
> to get the information it needs. However there are some unresolved issues:
> (1) The privilege.
> Loading the bpf program requires CAP_ADMIN or CAP_BPF. This means that the
> observability will be available to the privileged process. Maybe we can load the
> bpf program through a privileged process and make the bpf iterator file being
> readable for normal users.
> (2) Prevent pinning the super-block
> In the current naive implementation, the bpf iterator simply pins the
> super-block of the passed fd and prevents the super-block from being destroyed.
> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> the filesystem is umounted.
>
> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
Hi Hou,
IIUC, there is not much value in making this a cross track session.
Seems like an FS track session that has not much to do with BPF
development.
Am I understanding correctly or are there any cross subsystem
interactions that need to be discussed?
Perhaps we can join you as co-speaker for Miklos' traditional
"fsinfo" session?
Thanks,
Amir.
>
> [0]:
> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
> [2]:
> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
> [4]: https://docs.kernel.org/bpf/bpf_iterators.html
>
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
2023-04-16 7:55 ` [Lsf-pc] " Amir Goldstein
@ 2023-04-24 6:45 ` Hou Tao
2023-04-27 15:54 ` Amir Goldstein
0 siblings, 1 reply; 6+ messages in thread
From: Hou Tao @ 2023-04-24 6:45 UTC (permalink / raw)
To: Amir Goldstein
Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
linux-fsdevel, Yonghong Song, bpf
Hi,
On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From time to time, new syscalls have been proposed to gain more observability
>> for file-system:
>>
>> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
>> multiple values in single syscall.
>> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
>> of a given file in a scalable way.
>>
>> All these proposals requires adding a new syscall. Here I would like to propose
>> another solution for file system observability: bpf iterator for file system
>> object. The initial idea came when I was trying to implement a filefrag-like
>> page cache tool with support for multi-order folio, so that we can know the
>> number of multi-order folios and the orders of those folios in page cache. After
>> developing a demo for it, I realized that we could use it to provide more
>> observability for file system objects. e.g., dumping the per-cpu iostat for a
>> super block [2], iterating all inodes in a super-block to dump info for
>> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
>> specific mount.
>>
>> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
>> for kernel objects. It works by creating bpf iterator file [4], which is a
>> seq-like read-only file, and the content of the bpf iterator file is determined
>> by a previously loaded bpf program, so userspace can read the bpf iterator file
>> to get the information it needs. However there are some unresolved issues:
>> (1) The privilege.
>> Loading the bpf program requires CAP_ADMIN or CAP_BPF. This means that the
>> observability will be available to the privileged process. Maybe we can load the
>> bpf program through a privileged process and make the bpf iterator file being
>> readable for normal users.
>> (2) Prevent pinning the super-block
>> In the current naive implementation, the bpf iterator simply pins the
>> super-block of the passed fd and prevents the super-block from being destroyed.
>> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
>> the filesystem is umounted.
>>
>> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
> Hi Hou,
>
> IIUC, there is not much value in making this a cross track session.
> Seems like an FS track session that has not much to do with BPF
> development.
>
> Am I understanding correctly or are there any cross subsystem
> interactions that need to be discussed?
Yes. Although the patchset for the file-system iterator is still not ready, I
think the BPF mechanisms for the file-system iterator are ready, so a cross
track session may be unnecessary.
>
> Perhaps we can join you as co-speaker for Miklos' traditional
> "fsinfo" session?
Thanks. I am glad to be a co-speaker for the fsinfo session.
>
> Thanks,
> Amir.
>
>> [0]:
>> https://lore.kernel.org/linux-fsdevel/YnEeuw6fd1A8usjj@miu.piliscsaba.redhat.com/
>> [1]: https://lore.kernel.org/linux-mm/20230219073318.366189-1-nphamcs@gmail.com/
>> [2]:
>> https://lore.kernel.org/linux-fsdevel/CAJfpegsCKEx41KA1S2QJ9gX9BEBG4_d8igA0DT66GFH2ZanspA@mail.gmail.com/
>> [3]: https://lore.kernel.org/bpf/20200509175859.2474608-1-yhs@fb.com/
>> [4]: https://docs.kernel.org/bpf/bpf_iterators.html
>>
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] bpf iterator for file-system
2023-04-24 6:45 ` Hou Tao
@ 2023-04-27 15:54 ` Amir Goldstein
0 siblings, 0 replies; 6+ messages in thread
From: Amir Goldstein @ 2023-04-27 15:54 UTC (permalink / raw)
To: Hou Tao
Cc: lsf-pc, Nhat Pham, Miklos Szeredi, Alexei Starovoitov,
linux-fsdevel, Yonghong Song, bpf
On Mon, Apr 24, 2023 at 9:45 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 4/16/2023 3:55 PM, Amir Goldstein wrote:
> > On Tue, Feb 28, 2023 at 5:47 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From time to time, new syscalls have been proposed to gain more observability
> >> for file-system:
> >>
> >> (1) getvalues() [0]. It uses a hierarchical namespace API to gather and return
> >> multiple values in single syscall.
> >> (2) cachestat() [1]. It returns the cache status (e.g., number of dirty pages)
> >> of a given file in a scalable way.
> >>
> >> All these proposals requires adding a new syscall. Here I would like to propose
> >> another solution for file system observability: bpf iterator for file system
> >> object. The initial idea came when I was trying to implement a filefrag-like
> >> page cache tool with support for multi-order folio, so that we can know the
> >> number of multi-order folios and the orders of those folios in page cache. After
> >> developing a demo for it, I realized that we could use it to provide more
> >> observability for file system objects. e.g., dumping the per-cpu iostat for a
> >> super block [2], iterating all inodes in a super-block to dump info for
> >> specific inodes (e.g., unlinked but pinned inode), or displaying the flags of a
> >> specific mount.
> >>
> >> The BPF iterator was introduced in v5.8 [3] to support flexible content dumping
> >> for kernel objects. It works by creating bpf iterator file [4], which is a
> >> seq-like read-only file, and the content of the bpf iterator file is determined
> >> by a previously loaded bpf program, so userspace can read the bpf iterator file
> >> to get the information it needs. However there are some unresolved issues:
> >> (1) The privilege.
> >> Loading the bpf program requires CAP_ADMIN or CAP_BPF. This means that the
> >> observability will be available to the privileged process. Maybe we can load the
> >> bpf program through a privileged process and make the bpf iterator file being
> >> readable for normal users.
> >> (2) Prevent pinning the super-block
> >> In the current naive implementation, the bpf iterator simply pins the
> >> super-block of the passed fd and prevents the super-block from being destroyed.
> >> Perhaps fs-pin is a better choice, so the bpf iterator can be deactivated after
> >> the filesystem is umounted.
> >>
> >> I hope to send out an RFC soon before LSF/MM/BPF for further discussion.
> > Hi Hou,
> >
> > IIUC, there is not much value in making this a cross track session.
> > Seems like an FS track session that has not much to do with BPF
> > development.
> >
> > Am I understanding correctly or are there any cross subsystem
> > interactions that need to be discussed?
> Yes. Although the patchset for file-system iterator is still not ready, but I
> think the BPF mechanisms for file-system iterator is ready, so a cross track
> session maybe unnecessary.
> >
> > Perhaps we can join you as co-speaker for Miklos' traditional
> > "fsinfo" session?
> Thanks. I am glad to be a co-speaker for fsinfo session.
All right. I put you down as a co-speaker with Miklos on the fsinfo session.
Thanks,
Amir.