* Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? @ 2024-02-21 6:09 WANG Xuerui 2024-02-21 6:31 ` Xi Ruoyao 2024-02-24 11:51 ` Huacai Chen 0 siblings, 2 replies; 24+ messages in thread From: WANG Xuerui @ 2024-02-21 6:09 UTC (permalink / raw) To: linux-api Cc: Arnd Bergmann, Christian Brauner, Kees Cook, Huacai Chen, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List Hi, Recently, we -- community LoongArch porters -- have noticed a problem where the Chromium sandbox apparently wants to deny statx [^1] so it could properly inspect arguments after the sandboxed process later falls back to fstat. The reasoning behind the change was not clear in the patch; but we found out it's basically because there's currently not a "fd-only" version of statx, so that the sandbox has no way to ensure the path argument is empty without being able to peek into the sandboxed process's memory. For architectures able to do newfstatat though, the glibc falls back to newfstatat after getting -ENOSYS for statx, then the respective SIGSYS handler [^2] takes care of inspecting the path argument, transforming allowed newfstatat's into fstat instead which is allowed and has the same type of return value. But, as loongarch is the first architecture to not have fstat nor newfstatat, the LoongArch glibc does not attempt falling back at all when it gets -ENOSYS for statx -- and you see the problem there! Actually, back when the loongarch port was under review, people were aware of the same problem with sandboxing clone3 [^3], so clone was eventually kept. Unfortunately it seemed at that time no one had noticed statx, so besides restoring fstat/newfstatat to loongarch uapi (and postponing the problem further), it seems inevitable that we would need to tackle seccomp deep argument inspection; this is obviously a decision that shouldn't be taken lightly, so I'm posting this to restart the conversation to figure out a way forward together. We basically could do one of below: - just restore fstat and be done with it; - add a flag to statx so we can do the equivalent of just fstat(fd, &out) with statx, and ensuring an error happens if path is not empty in that case; - tackle the long-standing problem of seccomp deep argument inspection (!). Obviously, the simplest solution would be to "just restore fstat". But again, in my opinion this is not quite a solution but a workaround -- we have good reasons to keep just statx (mainly because its feature set is a strict superset of those of fstat/newfstatat), and we're not quite in a hurry to resolve this. Because one of the prerequisites for a new Chromium port is "inclusion in Debian stable" -- as the loong64 port [^4] is progressing and the next Debian release would not happen until 2025, we still have time for a more "elegant" solution. Alternatively, we could also introduce a new flag for statx, maybe named AT_STATX_NO_PATH or something like that, so that statx would ignore the path altogether or error on non-empty paths if called with flags containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp policy could allow statx calls only with such flags (that are passed via register and accessible) and maintain the same level of safety as with fstat. But there is also disadvantage to this approach: the flag would be useful only for sandboxes, because in other cases there's no need to avoid reading from &path. This is also more of a workaround to avoid the deep argument inspection problem. Lastly, should we decide to go the hardest way, according to a previous mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we probably would try the metadata-annotation-based and piece-by-piece approach, as it's expected to provide the most benefit and involve less code changes. The implementation, as I surmise, will involve modifying the generic syscall entrypoint for early copying of user data, and corresponding changes to seccomp plumbing so this information is properly exposed. I don't have a roadmap for non-generic-entry arches right now, and I also haven't started designing the new seccomp ABI for that. As a matter of fact, members of the LoongArch community (myself included) are still fresh to this area of expertise, so a bit more of your feedback will be appreciated. Thanks to Miao Wang from AOSC for doing much of the investigation. [^1]: https://chromium-review.googlesource.com/c/chromium/src/+/2823150 [^2]: https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355 [^3]: https://lore.kernel.org/linux-arch/20220511211231.GG7074@brightrain.aerifal.cx/ [^4]: https://wiki.debian.org/Ports/loong64 [^5]: https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/ [^6]: https://lwn.net/Articles/799557/ [^7]: https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf -- WANG "xen0n" Xuerui Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-21 6:09 Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? WANG Xuerui @ 2024-02-21 6:31 ` Xi Ruoyao 2024-02-21 10:31 ` Xi Ruoyao 2024-02-24 11:51 ` Huacai Chen 1 sibling, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2024-02-21 6:31 UTC (permalink / raw) To: WANG Xuerui, linux-api Cc: Arnd Bergmann, Christian Brauner, Kees Cook, Huacai Chen, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote: > - just restore fstat and be done with it; > - add a flag to statx so we can do the equivalent of just fstat(fd, > &out) with statx, and ensuring an error happens if path is not empty in > that case; It's worse than "just restore fstat" considering the performance. Read this thread: https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html > - tackle the long-standing problem of seccomp deep argument inspection (!). Frankly I'm never a fan of syscall blocklisting. When I develop the Online Judge system for the programming contest training in Xidian University I deliberately avoid using seccomp. This thing is very likely to break innocent programs with some system change innocent as well (for example Glibc or libstdc++ update). -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-21 6:31 ` Xi Ruoyao @ 2024-02-21 10:31 ` Xi Ruoyao 2024-02-21 10:49 ` WANG Xuerui 0 siblings, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2024-02-21 10:31 UTC (permalink / raw) To: WANG Xuerui, linux-api Cc: Arnd Bergmann, Christian Brauner, Kees Cook, Huacai Chen, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote: > On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote: > > > - just restore fstat and be done with it; > > - add a flag to statx so we can do the equivalent of just fstat(fd, > > &out) with statx, and ensuring an error happens if path is not empty in > > that case; > > It's worse than "just restore fstat" considering the performance. Read > this thread: > https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html Hmm, but it looks like statx already suffers the same performance issue. And in this libc-alpha discussion Linus said: If the user asked for 'fstat()', just give the user 'fstat()'. So to me we should just add fstat (and use it in Glibc for LoongArch, only falling back to statx if fstat returns -ENOSYS), or am I missing something? -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-21 10:31 ` Xi Ruoyao @ 2024-02-21 10:49 ` WANG Xuerui 2024-02-21 12:03 ` Xi Ruoyao 0 siblings, 1 reply; 24+ messages in thread From: WANG Xuerui @ 2024-02-21 10:49 UTC (permalink / raw) To: Xi Ruoyao, linux-api Cc: Arnd Bergmann, Christian Brauner, Kees Cook, Huacai Chen, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List On 2/21/24 18:31, Xi Ruoyao wrote: > On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote: >> On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote: >> >>> - just restore fstat and be done with it; >>> - add a flag to statx so we can do the equivalent of just fstat(fd, >>> &out) with statx, and ensuring an error happens if path is not empty in >>> that case; >> It's worse than "just restore fstat" considering the performance. Read >> this thread: >> https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html > Hmm, but it looks like statx already suffers the same performance issue. > And in this libc-alpha discussion Linus said: > > If the user asked for 'fstat()', just give the user 'fstat()'. > > So to me we should just add fstat (and use it in Glibc for LoongArch, only > falling back to statx if fstat returns -ENOSYS), or am I missing something? Or we could add a AT_STATX_NULL_PATH flag and mandate that `path` must be NULL if this flag is present -- then with simple checks we could have statx(fd, NULL, AT_STATX_NULL_PATH, STATX_BASIC_STATS, &out) that's both fstat-like and fast. -- WANG "xen0n" Xuerui Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-21 10:49 ` WANG Xuerui @ 2024-02-21 12:03 ` Xi Ruoyao 0 siblings, 0 replies; 24+ messages in thread From: Xi Ruoyao @ 2024-02-21 12:03 UTC (permalink / raw) To: WANG Xuerui, linux-api Cc: Arnd Bergmann, Christian Brauner, Kees Cook, Huacai Chen, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List On Wed, 2024-02-21 at 18:49 +0800, WANG Xuerui wrote: > > On 2/21/24 18:31, Xi Ruoyao wrote: > > On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote: > > > On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote: > > > > > > > - just restore fstat and be done with it; > > > > - add a flag to statx so we can do the equivalent of just fstat(fd, > > > > &out) with statx, and ensuring an error happens if path is not empty in > > > > that case; > > > It's worse than "just restore fstat" considering the performance. Read > > > this thread: > > > https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html > > Hmm, but it looks like statx already suffers the same performance issue. > > And in this libc-alpha discussion Linus said: > > > > If the user asked for 'fstat()', just give the user 'fstat()'. > > > > So to me we should just add fstat (and use it in Glibc for LoongArch, only > > falling back to statx if fstat returns -ENOSYS), or am I missing something? > > Or we could add a AT_STATX_NULL_PATH flag and mandate that `path` must > be NULL if this flag is present -- then with simple checks we could have > statx(fd, NULL, AT_STATX_NULL_PATH, STATX_BASIC_STATS, &out) that's both > fstat-like and fast. But then to take the advantage in Glibc fstat() implementation, we'll need to try AT_STATX_NULL_PATH first, then check for EFAULT (not ENOSYS!) and fall back to AT_EMPTY_PATH. The EFAULT checking seems nasty to me... -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-21 6:09 Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? WANG Xuerui 2024-02-21 6:31 ` Xi Ruoyao @ 2024-02-24 11:51 ` Huacai Chen 2024-02-25 6:51 ` Icenowy Zheng 1 sibling, 1 reply; 24+ messages in thread From: Huacai Chen @ 2024-02-24 11:51 UTC (permalink / raw) To: WANG Xuerui Cc: linux-api, Arnd Bergmann, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, Icenowy Zheng, loongarch, linux-arch, Linux Kernel Mailing List Hi, Xuerui, On Wed, Feb 21, 2024 at 2:10 PM WANG Xuerui <kernel@xen0n.name> wrote: > > Hi, > > Recently, we -- community LoongArch porters -- have noticed a problem > where the Chromium sandbox apparently wants to deny statx [^1] so it > could properly inspect arguments after the sandboxed process later falls > back to fstat. The reasoning behind the change was not clear in the > patch; but we found out it's basically because there's currently not a > "fd-only" version of statx, so that the sandbox has no way to ensure the > path argument is empty without being able to peek into the sandboxed > process's memory. For architectures able to do newfstatat though, the > glibc falls back to newfstatat after getting -ENOSYS for statx, then the > respective SIGSYS handler [^2] takes care of inspecting the path > argument, transforming allowed newfstatat's into fstat instead which is > allowed and has the same type of return value. > > But, as loongarch is the first architecture to not have fstat nor > newfstatat, the LoongArch glibc does not attempt falling back at all > when it gets -ENOSYS for statx -- and you see the problem there! > > Actually, back when the loongarch port was under review, people were > aware of the same problem with sandboxing clone3 [^3], so clone was > eventually kept. Unfortunately it seemed at that time no one had noticed > statx, so besides restoring fstat/newfstatat to loongarch uapi (and > postponing the problem further), it seems inevitable that we would need > to tackle seccomp deep argument inspection; this is obviously a decision > that shouldn't be taken lightly, so I'm posting this to restart the > conversation to figure out a way forward together. We basically could do > one of below: > > - just restore fstat and be done with it; > - add a flag to statx so we can do the equivalent of just fstat(fd, > &out) with statx, and ensuring an error happens if path is not empty in > that case; > - tackle the long-standing problem of seccomp deep argument inspection (!). From my point of view, I prefer to "restore fstat", because we need to use the Chrome sandbox everyday (even though it hasn't been upstream by now). But I also hope "seccomp deep argument inspection" can be solved in the future. Huacai > > Obviously, the simplest solution would be to "just restore fstat". But > again, in my opinion this is not quite a solution but a workaround -- we > have good reasons to keep just statx (mainly because its feature set is > a strict superset of those of fstat/newfstatat), and we're not quite in > a hurry to resolve this. Because one of the prerequisites for a new > Chromium port is "inclusion in Debian stable" -- as the loong64 port > [^4] is progressing and the next Debian release would not happen until > 2025, we still have time for a more "elegant" solution. > > Alternatively, we could also introduce a new flag for statx, maybe named > AT_STATX_NO_PATH or something like that, so that statx would ignore the > path altogether or error on non-empty paths if called with flags > containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp policy > could allow statx calls only with such flags (that are passed via > register and accessible) and maintain the same level of safety as with > fstat. But there is also disadvantage to this approach: the flag would > be useful only for sandboxes, because in other cases there's no need to > avoid reading from &path. This is also more of a workaround to avoid the > deep argument inspection problem. > > Lastly, should we decide to go the hardest way, according to a previous > mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we > probably would try the metadata-annotation-based and piece-by-piece > approach, as it's expected to provide the most benefit and involve less > code changes. The implementation, as I surmise, will involve modifying > the generic syscall entrypoint for early copying of user data, and > corresponding changes to seccomp plumbing so this information is > properly exposed. I don't have a roadmap for non-generic-entry arches > right now, and I also haven't started designing the new seccomp ABI for > that. As a matter of fact, members of the LoongArch community (myself > included) are still fresh to this area of expertise, so a bit more of > your feedback will be appreciated. > > Thanks to Miao Wang from AOSC for doing much of the investigation. > > [^1]: https://chromium-review.googlesource.com/c/chromium/src/+/2823150 > [^2]: > https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355 > [^3]: > https://lore.kernel.org/linux-arch/20220511211231.GG7074@brightrain.aerifal.cx/ > [^4]: https://wiki.debian.org/Ports/loong64 > [^5]: https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/ > [^6]: https://lwn.net/Articles/799557/ > [^7]: > https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf > > -- > WANG "xen0n" Xuerui > > Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/ > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-24 11:51 ` Huacai Chen @ 2024-02-25 6:51 ` Icenowy Zheng 2024-02-25 7:32 ` Xi Ruoyao 0 siblings, 1 reply; 24+ messages in thread From: Icenowy Zheng @ 2024-02-25 6:51 UTC (permalink / raw) To: Huacai Chen, WANG Xuerui Cc: linux-api, Arnd Bergmann, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, linux-arch, Linux Kernel Mailing List 在 2024-02-24星期六的 19:51 +0800,Huacai Chen写道: > Hi, Xuerui, > > On Wed, Feb 21, 2024 at 2:10 PM WANG Xuerui <kernel@xen0n.name> > wrote: > > > > Hi, > > > > Recently, we -- community LoongArch porters -- have noticed a > > problem > > where the Chromium sandbox apparently wants to deny statx [^1] so > > it > > could properly inspect arguments after the sandboxed process later > > falls > > back to fstat. The reasoning behind the change was not clear in the > > patch; but we found out it's basically because there's currently > > not a > > "fd-only" version of statx, so that the sandbox has no way to > > ensure the > > path argument is empty without being able to peek into the > > sandboxed > > process's memory. For architectures able to do newfstatat though, > > the > > glibc falls back to newfstatat after getting -ENOSYS for statx, > > then the > > respective SIGSYS handler [^2] takes care of inspecting the path > > argument, transforming allowed newfstatat's into fstat instead > > which is > > allowed and has the same type of return value. > > > > But, as loongarch is the first architecture to not have fstat nor > > newfstatat, the LoongArch glibc does not attempt falling back at > > all > > when it gets -ENOSYS for statx -- and you see the problem there! > > > > Actually, back when the loongarch port was under review, people > > were > > aware of the same problem with sandboxing clone3 [^3], so clone was > > eventually kept. Unfortunately it seemed at that time no one had > > noticed > > statx, so besides restoring fstat/newfstatat to loongarch uapi (and > > postponing the problem further), it seems inevitable that we would > > need > > to tackle seccomp deep argument inspection; this is obviously a > > decision > > that shouldn't be taken lightly, so I'm posting this to restart the > > conversation to figure out a way forward together. We basically > > could do > > one of below: > > > > - just restore fstat and be done with it; > > - add a flag to statx so we can do the equivalent of just fstat(fd, > > &out) with statx, and ensuring an error happens if path is not > > empty in > > that case; > > - tackle the long-standing problem of seccomp deep argument > > inspection (!). > From my point of view, I prefer to "restore fstat", because we need > to > use the Chrome sandbox everyday (even though it hasn't been upstream > by now). But I also hope "seccomp deep argument inspection" can be > solved in the future. My idea is this problem needs syscalls to be designed with deep argument inspection in mind; syscalls before this should be considered as historical error and get fixed by resotring old syscalls. > > > Huacai > > > > > Obviously, the simplest solution would be to "just restore fstat". > > But > > again, in my opinion this is not quite a solution but a workaround > > -- we > > have good reasons to keep just statx (mainly because its feature > > set is > > a strict superset of those of fstat/newfstatat), and we're not > > quite in > > a hurry to resolve this. Because one of the prerequisites for a new > > Chromium port is "inclusion in Debian stable" -- as the loong64 > > port > > [^4] is progressing and the next Debian release would not happen > > until > > 2025, we still have time for a more "elegant" solution. > > > > Alternatively, we could also introduce a new flag for statx, maybe > > named > > AT_STATX_NO_PATH or something like that, so that statx would ignore > > the > > path altogether or error on non-empty paths if called with flags > > containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp > > policy > > could allow statx calls only with such flags (that are passed via > > register and accessible) and maintain the same level of safety as > > with > > fstat. But there is also disadvantage to this approach: the flag > > would > > be useful only for sandboxes, because in other cases there's no > > need to > > avoid reading from &path. This is also more of a workaround to > > avoid the > > deep argument inspection problem. > > > > Lastly, should we decide to go the hardest way, according to a > > previous > > mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we > > probably would try the metadata-annotation-based and piece-by-piece > > approach, as it's expected to provide the most benefit and involve > > less > > code changes. The implementation, as I surmise, will involve > > modifying > > the generic syscall entrypoint for early copying of user data, and > > corresponding changes to seccomp plumbing so this information is > > properly exposed. I don't have a roadmap for non-generic-entry > > arches > > right now, and I also haven't started designing the new seccomp ABI > > for > > that. As a matter of fact, members of the LoongArch community > > (myself > > included) are still fresh to this area of expertise, so a bit more > > of > > your feedback will be appreciated. > > > > Thanks to Miao Wang from AOSC for doing much of the investigation. > > > > [^1]: > > https://chromium-review.googlesource.com/c/chromium/src/+/2823150 > > [^2]: > > https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355 > > [^3]: > > https://lore.kernel.org/linux-arch/20220511211231.GG7074@brightrain.aerifal.cx/ > > [^4]: https://wiki.debian.org/Ports/loong64 > > [^5]: > > https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/ > > [^6]: https://lwn.net/Articles/799557/ > > [^7]: > > https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf > > > > -- > > WANG "xen0n" Xuerui > > > > Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/ > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-25 6:51 ` Icenowy Zheng @ 2024-02-25 7:32 ` Xi Ruoyao 2024-02-26 6:03 ` Icenowy Zheng 0 siblings, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2024-02-25 7:32 UTC (permalink / raw) To: Icenowy Zheng, Huacai Chen, WANG Xuerui Cc: linux-api, Arnd Bergmann, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, linux-arch, Linux Kernel Mailing List On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > From my point of view, I prefer to "restore fstat", because we need > > to > > use the Chrome sandbox everyday (even though it hasn't been upstream > > by now). But I also hope "seccomp deep argument inspection" can be > > solved in the future. > > My idea is this problem needs syscalls to be designed with deep > argument inspection in mind; syscalls before this should be considered > as historical error and get fixed by resotring old syscalls. I'd not consider fstat an error as using statx for fstat has a performance impact (severe for some workflows), and Linus has concluded "if the user wants fstat, give them fstat" for the performance issue: https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html However we only want fstat (actually "newfstat" in fs/stat.c), and it seems we don't want to resurrect newstat, newlstat, newfstatat, etc. (or am I missing any benefit - performance or "just pleasing seccomp" - of them comparing to statx?) so we don't want to just define __ARCH_WANT_NEW_STAT. So it seems we need to add some new #if to fs/stat.c and include/uapi/asm-generic/unistd.h. And no, it's not a design issue of all other syscalls. It's just the design issue of seccomp. There's no way to design a syscall allowing seccomp to inspect a 100-character path in its argument unless refactoring seccomp entirely because we cannot fit a 100-character path into 8 registers. As at now people do use PTRACE_PEEKDATA for "deep inspection" (actually "debugging" the target process) but it obviously makes a very severe performance impact. <rant> Today the entire software industry is saying "do things in a declarative way" but seccomp is completely the opposite. It's auditing *how* the sandboxed application is doing things instead of *what* will be done. I've raised my against to seccomp and/or syscall allowlisting several times after seeing so many breakages like: - https://github.com/NetworkConfiguration/dhcpcd/issues/120 - https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/252 - https://blog.pintia.cn/2018/06/27/glibc-segmentation-fault/ - http://web.archive.org/web/20210126121421/http://acm.xidian.edu.cn/discuss/thread.php?tid=148&cid=# (comment 3) but people just keep telling me "you are wrong, you don't understand security". Some of them even complain "seccomp is broken" as well but still keep using it. </rant> -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-25 7:32 ` Xi Ruoyao @ 2024-02-26 6:03 ` Icenowy Zheng 2024-02-26 6:56 ` Arnd Bergmann 2024-02-26 8:26 ` Christian Brauner 0 siblings, 2 replies; 24+ messages in thread From: Icenowy Zheng @ 2024-02-26 6:03 UTC (permalink / raw) To: Xi Ruoyao, Huacai Chen, WANG Xuerui Cc: linux-api, Arnd Bergmann, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, linux-arch, Linux Kernel Mailing List 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > From my point of view, I prefer to "restore fstat", because we > > > need > > > to > > > use the Chrome sandbox everyday (even though it hasn't been > > > upstream > > > by now). But I also hope "seccomp deep argument inspection" can > > > be > > > solved in the future. > > > > My idea is this problem needs syscalls to be designed with deep > > argument inspection in mind; syscalls before this should be > > considered > > as historical error and get fixed by resotring old syscalls. > > I'd not consider fstat an error as using statx for fstat has a > performance impact (severe for some workflows), and Linus has > concluded Sorry for clearance, I mean statx is an error in ABI design, not fstat. > "if the user wants fstat, give them fstat" for the performance issue: > > https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html > > However we only want fstat (actually "newfstat" in fs/stat.c), and it > seems we don't want to resurrect newstat, newlstat, newfstatat, etc. > (or > am I missing any benefit - performance or "just pleasing seccomp" - > of > them comparing to statx?) so we don't want to just define > __ARCH_WANT_NEW_STAT. So it seems we need to add some new #if to > fs/stat.c and include/uapi/asm-generic/unistd.h. > > And no, it's not a design issue of all other syscalls. It's just the > design issue of seccomp. There's no way to design a syscall allowing > seccomp to inspect a 100-character path in its argument unless > refactoring seccomp entirely because we cannot fit a 100-character > path > into 8 registers. Well my meaning is that syscalls should be designed to be simple to prevent this kind of circumstance. > > As at now people do use PTRACE_PEEKDATA for "deep inspection" > (actually > "debugging" the target process) but it obviously makes a very severe > performance impact. > > <rant> > > Today the entire software industry is saying "do things in a > declarative > way" but seccomp is completely the opposite. It's auditing *how* the > sandboxed application is doing things instead of *what* will be done. > > I've raised my against to seccomp and/or syscall allowlisting several > times after seeing so many breakages like: > > - https://github.com/NetworkConfiguration/dhcpcd/issues/120 > - https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/252 > - https://blog.pintia.cn/2018/06/27/glibc-segmentation-fault/ > - > http://web.archive.org/web/20210126121421/http://acm.xidian.edu.cn/discuss/thread.php?tid=148&cid=# > (comment 3) > > but people just keep telling me "you are wrong, you don't understand > security". Some of them even complain "seccomp is broken" as well > but > still keep using it. > > </rant> > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 6:03 ` Icenowy Zheng @ 2024-02-26 6:56 ` Arnd Bergmann 2024-02-26 7:09 ` Xi Ruoyao 2024-02-26 8:26 ` Christian Brauner 1 sibling, 1 reply; 24+ messages in thread From: Arnd Bergmann @ 2024-02-26 6:56 UTC (permalink / raw) To: Icenowy Zheng, Xi Ruoyao, Huacai Chen, WANG Xuerui Cc: linux-api, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: >> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: >> > My idea is this problem needs syscalls to be designed with deep >> > argument inspection in mind; syscalls before this should be >> > considered >> > as historical error and get fixed by resotring old syscalls. >> >> I'd not consider fstat an error as using statx for fstat has a >> performance impact (severe for some workflows), and Linus has >> concluded > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. The same has been said about seccomp(). ;-) It's clear that the two don't go well together at the moment. >> "if the user wants fstat, give them fstat" for the performance issue: >> >> https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html >> >> However we only want fstat (actually "newfstat" in fs/stat.c), and it >> seems we don't want to resurrect newstat, newlstat, newfstatat, etc. >> (or >> am I missing any benefit - performance or "just pleasing seccomp" - >> of them comparing to statx?) so we don't want to just define >> __ARCH_WANT_NEW_STAT. So it seems we need to add some new #if to >> fs/stat.c and include/uapi/asm-generic/unistd.h. >> >> And no, it's not a design issue of all other syscalls. It's just the >> design issue of seccomp. There's no way to design a syscall allowing >> seccomp to inspect a 100-character path in its argument unless >> refactoring seccomp entirely because we cannot fit a 100-character >> path >> into 8 registers. > > Well my meaning is that syscalls should be designed to be simple to > prevent this kind of circumstance. The problem I see with the 'use use fstat' approach is that this does not work on 32-bit architectures, unless we define a new fstatat64_time64() syscall, which is one of the things that statx() was trying to avoid. Whichever solution we end up with should work on both loongarch64 and on armv7 at least. Arnd ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 6:56 ` Arnd Bergmann @ 2024-02-26 7:09 ` Xi Ruoyao 2024-02-26 9:20 ` Arnd Bergmann 0 siblings, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2024-02-26 7:09 UTC (permalink / raw) To: Arnd Bergmann, Icenowy Zheng, Huacai Chen, WANG Xuerui Cc: linux-api, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > > My idea is this problem needs syscalls to be designed with deep > > > > argument inspection in mind; syscalls before this should be > > > > considered > > > > as historical error and get fixed by resotring old syscalls. > > > > > > I'd not consider fstat an error as using statx for fstat has a > > > performance impact (severe for some workflows), and Linus has > > > concluded > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of "AT_NULL_PATH"/nullptr in the first place? > The same has been said about seccomp(). ;-) > > It's clear that the two don't go well together at the moment. > > > > "if the user wants fstat, give them fstat" for the performance issue: > > > > > > https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html > > > > > > However we only want fstat (actually "newfstat" in fs/stat.c), and it > > > seems we don't want to resurrect newstat, newlstat, newfstatat, etc. > > > (or > > > am I missing any benefit - performance or "just pleasing seccomp" - > > > of them comparing to statx?) so we don't want to just define > > > __ARCH_WANT_NEW_STAT. So it seems we need to add some new #if to > > > fs/stat.c and include/uapi/asm-generic/unistd.h. > > > > > > And no, it's not a design issue of all other syscalls. It's just the > > > design issue of seccomp. There's no way to design a syscall allowing > > > seccomp to inspect a 100-character path in its argument unless > > > refactoring seccomp entirely because we cannot fit a 100-character > > > path > > > into 8 registers. > > > > Well my meaning is that syscalls should be designed to be simple to > > prevent this kind of circumstance. But it's not irrational to pass a path to syscall, as long as we still have the concept of file system (maybe in 2371 or some year we'll use a 128-bit UUID instead of path). > The problem I see with the 'use use fstat' approach is that this > does not work on 32-bit architectures, unless we define a new > fstatat64_time64() syscall, which is one of the things that statx() "fstat64_time64". Using statx for fstatat should be just fine. Or maybe we can just introduce a new AT_something to make statx completely ignore pathname but behave like AT_EMPTY_PATH + "". > was trying to avoid. Oops. I thought "newstat" should be using 64-bit time but it seems the "new" is not what I'd expected... The "new" actually means "newer than Linux 0.9"! :( Let's not use "new" in future syscall names... > Whichever solution we end up with should work on both > loongarch64 and on armv7 at least. > > Arnd -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 7:09 ` Xi Ruoyao @ 2024-02-26 9:20 ` Arnd Bergmann 2024-02-26 11:57 ` Xi Ruoyao 2024-02-26 13:32 ` Christian Brauner 0 siblings, 2 replies; 24+ messages in thread From: Arnd Bergmann @ 2024-02-26 9:20 UTC (permalink / raw) To: Xi Ruoyao, Icenowy Zheng, Huacai Chen, WANG Xuerui Cc: linux-api, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: >> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: >> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: >> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: >> > > > My idea is this problem needs syscalls to be designed with deep >> > > > argument inspection in mind; syscalls before this should be >> > > > considered >> > > > as historical error and get fixed by resotring old syscalls. >> > > >> > > I'd not consider fstat an error as using statx for fstat has a >> > > performance impact (severe for some workflows), and Linus has >> > > concluded >> > >> > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > "AT_NULL_PATH"/nullptr in the first place? Not sure, but it's hard to change now since the libc implementation won't easily know whether using the NULL path is safe on a given kernel. It could check the kernel version number, but that adds another bit of complexity in the fast path and doesn't work on old kernels with the feature backported. > But it's not irrational to pass a path to syscall, as long as we still > have the concept of file system (maybe in 2371 or some year we'll use a > 128-bit UUID instead of path). > >> The problem I see with the 'use use fstat' approach is that this >> does not work on 32-bit architectures, unless we define a new >> fstatat64_time64() syscall, which is one of the things that statx() > > "fstat64_time64". Using statx for fstatat should be just fine. Right. It does feel wrong to have only an fstat() variant but not fstatat() if we go there. > Or maybe we can just introduce a new AT_something to make statx > completely ignore pathname but behave like AT_EMPTY_PATH + "". I think this is better than going back to fstat64_time64(), but it's still not great because - all the reserved flags on statx() are by definition incompatible with existing kernels that return -EINVAL for any flag they do not recognize. - you still need to convince libc developers to actually use the flag despite the backwards compatibility problem, either with a fallback to the current behavior or a version check. Using the NULL path as a fallback would solve the problem with seccomp, but it would not make the normal case any faster. >> was trying to avoid. > > Oops. I thought "newstat" should be using 64-bit time but it seems the > "new" is not what I'd expected... The "new" actually means "newer than > Linux 0.9"! :( > > Let's not use "new" in future syscall names... Right, we definitely can't ever succeed. On some architectures we even had "oldstat" and "stat" before "newstat" and "stat64", and on some architectures we mix them up. E.g. x86_64 has fstat() and fstatat64() with the same structure but doesn't define __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit timestamps unlike all other 64-bit architectures. statx() was intended to solve these problems once and for all, and it appears that we have failed again. Arnd ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 9:20 ` Arnd Bergmann @ 2024-02-26 11:57 ` Xi Ruoyao 2024-02-26 12:57 ` Christian Brauner 2024-02-26 13:32 ` Christian Brauner 1 sibling, 1 reply; 24+ messages in thread From: Xi Ruoyao @ 2024-02-26 11:57 UTC (permalink / raw) To: Arnd Bergmann, Icenowy Zheng, Huacai Chen, WANG Xuerui, Adhemerval Zanella, Rich Felker Cc: linux-api, Christian Brauner, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote: /* snip */ > > > Or maybe we can just introduce a new AT_something to make statx > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > I think this is better than going back to fstat64_time64(), but > it's still not great because > > - all the reserved flags on statx() are by definition incompatible > with existing kernels that return -EINVAL for any flag they do > not recognize. Oops, we are deeming passing undefined flags in "mask" undefined behavior but not "flags", thus "wild software" may be relying on EINVAL for invalid flags... We *might* make this new AT_xxx a bit in mask instead of flags but it would be very dirty IMO. > - you still need to convince libc developers to actually use > the flag despite the backwards compatibility problem, either > with a fallback to the current behavior or a version check. Let me ping some libc developers then... > Using the NULL path as a fallback would solve the problem with > seccomp, but it would not make the normal case any faster. But "wild software" may be relying on a EFAULT for NULL path too... /* snip */ > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > "new" is not what I'd expected... The "new" actually means "newer than > > Linux 0.9"! :( > > > > Let's not use "new" in future syscall names... > > Right, we definitely can't ever succeed. On some architectures > we even had "oldstat" and "stat" before "newstat" and "stat64", > and on some architectures we mix them up. E.g. x86_64 has fstat() > and fstatat64() with the same structure but doesn't define > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > timestamps unlike all other 64-bit architectures. > > statx() was intended to solve these problems once and for all, > and it appears that we have failed again. https://xkcd.com/927/ :( -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 11:57 ` Xi Ruoyao @ 2024-02-26 12:57 ` Christian Brauner 2024-02-26 14:33 ` Rich Felker 0 siblings, 1 reply; 24+ messages in thread From: Christian Brauner @ 2024-02-26 12:57 UTC (permalink / raw) To: Xi Ruoyao Cc: Arnd Bergmann, Icenowy Zheng, Huacai Chen, WANG Xuerui, Adhemerval Zanella, Rich Felker, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 07:57:56PM +0800, Xi Ruoyao wrote: > On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote: > > /* snip */ > > > > > > Or maybe we can just introduce a new AT_something to make statx > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". I'm not at all convinced about doing custom semantics for this. > > I think this is better than going back to fstat64_time64(), but > > it's still not great because > > > > - all the reserved flags on statx() are by definition incompatible > > with existing kernels that return -EINVAL for any flag they do > > not recognize. > > Oops, we are deeming passing undefined flags in "mask" undefined > behavior but not "flags", thus "wild software" may be relying on EINVAL > for invalid flags... We *might* make this new AT_xxx a bit in mask > instead of flags but it would be very dirty IMO. Uhm, no. AT_* flags have nothing to do in statx()'s mask argument at all. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 12:57 ` Christian Brauner @ 2024-02-26 14:33 ` Rich Felker 0 siblings, 0 replies; 24+ messages in thread From: Rich Felker @ 2024-02-26 14:33 UTC (permalink / raw) To: Christian Brauner Cc: Xi Ruoyao, Arnd Bergmann, Icenowy Zheng, Huacai Chen, WANG Xuerui, Adhemerval Zanella, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 01:57:55PM +0100, Christian Brauner wrote: > On Mon, Feb 26, 2024 at 07:57:56PM +0800, Xi Ruoyao wrote: > > On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote: > > > > /* snip */ > > > > > > > > > Or maybe we can just introduce a new AT_something to make statx > > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > I'm not at all convinced about doing custom semantics for this. I did not follow the entirety of this thread. I've been aware for a while that the need to use AT_EMPTY_PATH (on archs that don't have old syscalls) is a performance problem in addition to being a sandboxing problem, because the semantics for it were defined to use the string argument if present, thereby requiring the kernel to perform an additional string read from user memory. Unfortunately, I don't see any good fix. Even if we could add AT_STATX_NO_PATH/AT_STATX_NULL_PATH, libc would not be using it, because using them would incur EINVAL-then-fallback on kernels that don't support it. In regards to the Chromium sandbox, I think Chromium is just wrong here. Blocking statx is not safe (it also does not work on 32-bit archs -- it breaks time64 support! and riscv32 doesn't even have legacy stat either, just like loongarch64), and there is really no serious security risk from being able to stat arbitrary pathnames. Maybe it's annoying from a theoretical purity standpoint that you can't block that, but from a practical standpoint it doesn't really matter. I'd like to see a solution to this, but all the possible ones look bad. And it's all a consequence of poor consideration of how AT_EMPTY_PATH should work when it was first invented/added. >_< > > > I think this is better than going back to fstat64_time64(), but > > > it's still not great because > > > > > > - all the reserved flags on statx() are by definition incompatible > > > with existing kernels that return -EINVAL for any flag they do > > > not recognize. > > > > Oops, we are deeming passing undefined flags in "mask" undefined > > behavior but not "flags", thus "wild software" may be relying on EINVAL > > for invalid flags... We *might* make this new AT_xxx a bit in mask > > instead of flags but it would be very dirty IMO. > > Uhm, no. AT_* flags have nothing to do in statx()'s mask argument at all. They definitely cannot go in mask; that's semantically wrong. Rich ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 9:20 ` Arnd Bergmann 2024-02-26 11:57 ` Xi Ruoyao @ 2024-02-26 13:32 ` Christian Brauner 2024-02-26 13:46 ` Arnd Bergmann ` (2 more replies) 1 sibling, 3 replies; 24+ messages in thread From: Christian Brauner @ 2024-02-26 13:32 UTC (permalink / raw) To: Arnd Bergmann Cc: Xi Ruoyao, Icenowy Zheng, Huacai Chen, WANG Xuerui, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > >> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > >> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > >> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > >> > > > My idea is this problem needs syscalls to be designed with deep > >> > > > argument inspection in mind; syscalls before this should be > >> > > > considered > >> > > > as historical error and get fixed by resotring old syscalls. > >> > > > >> > > I'd not consider fstat an error as using statx for fstat has a > >> > > performance impact (severe for some workflows), and Linus has > >> > > concluded > >> > > >> > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > > "AT_NULL_PATH"/nullptr in the first place? > > Not sure, but it's hard to change now since the libc > implementation won't easily know whether using the NULL > path is safe on a given kernel. It could check the kernel > version number, but that adds another bit of complexity in > the fast path and doesn't work on old kernels with the > feature backported. > > > But it's not irrational to pass a path to syscall, as long as we still > > have the concept of file system (maybe in 2371 or some year we'll use a > > 128-bit UUID instead of path). > > > >> The problem I see with the 'use use fstat' approach is that this > >> does not work on 32-bit architectures, unless we define a new > >> fstatat64_time64() syscall, which is one of the things that statx() > > > > "fstat64_time64". Using statx for fstatat should be just fine. > > Right. It does feel wrong to have only an fstat() variant but not > fstatat() if we go there. > > > Or maybe we can just introduce a new AT_something to make statx > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > I think this is better than going back to fstat64_time64(), but > it's still not great because > > - all the reserved flags on statx() are by definition incompatible > with existing kernels that return -EINVAL for any flag they do > not recognize. > > - you still need to convince libc developers to actually use > the flag despite the backwards compatibility problem, either > with a fallback to the current behavior or a version check. > > Using the NULL path as a fallback would solve the problem with > seccomp, but it would not make the normal case any faster. > > >> was trying to avoid. > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > "new" is not what I'd expected... The "new" actually means "newer than > > Linux 0.9"! :( > > > > Let's not use "new" in future syscall names... > > Right, we definitely can't ever succeed. On some architectures > we even had "oldstat" and "stat" before "newstat" and "stat64", > and on some architectures we mix them up. E.g. x86_64 has fstat() > and fstatat64() with the same structure but doesn't define > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > timestamps unlike all other 64-bit architectures. > > statx() was intended to solve these problems once and for all, > and it appears that we have failed again. New apis don't invalidate old apis necessarily. That's just not going to work in an age where you have containerized workloads. statx() is just the beginning of this. A container may have aritrary seccomp profiles that return ENOSYS or even EPERM for whatever reason for any new api that exists. So not implementing fstat() might already break container workloads. Another example: You can't just skip on implementing mount() and only implement the new mount api for example. Because tools that look for api simplicity and don't need complex setup will _always_ continue to use mount() and have a right to do so. And fwiw, mount() isn't fully inspectable by seccomp since forever. The list goes on and on. But let's look at the original mail. Why are they denying statx() and what's that claim about it not being able to be rewritten to something safe? Looking at: intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, void* fs_denied_errno) { if (args.nr == __NR_fstatat_default) { if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), reinterpret_cast<default_stat_struct*>(args.args[2])); } return -reinterpret_cast<intptr_t>(fs_denied_errno); } What this does it to rewrite fstatat() to fstat() if it was made with AT_EMPTY_PATH and the path argument was "". That is easily doable for statx() because it has the exact same AT_EMPTY_PATH semantics that fstatat() has. Plus, they can even filter on mask and rewrite that to something that they think is safe. For example, STATX_BASIC_STATS which is equivalent to what any fstat() call returns. So it's pretty difficult to understand what their actual gripe with statx() is. It can't be that statx() passes a struct because fstatat() and fstat() do that too. So what exactly is that problem there? What this tells me without knowing the exact reason is that they thought "Oh, if we just return ENOSYS then the workload or glibc will just always be able to fallback to fstat() or fstatat()". Which ultimately is the exact same thing that containers often assume. So really, just skipping on various system calls isn't going to work. You can't just implement new system calls and forget about the rest unless you know exactly what workloads your architecure will run on. Please implement fstat() or fstatat() and stop inventing hacks for statx() to make weird sandboxing rules work, please. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 13:32 ` Christian Brauner @ 2024-02-26 13:46 ` Arnd Bergmann 2024-02-26 15:40 ` Christian Brauner 2024-02-26 13:46 ` Christian Brauner 2024-02-26 14:00 ` WANG Xuerui 2 siblings, 1 reply; 24+ messages in thread From: Arnd Bergmann @ 2024-02-26 13:46 UTC (permalink / raw) To: Christian Brauner Cc: Xi Ruoyao, Icenowy Zheng, Huacai Chen, WANG Xuerui, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024, at 14:32, Christian Brauner wrote: > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: >> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > What this tells me without knowing the exact reason is that they thought > "Oh, if we just return ENOSYS then the workload or glibc will just > always be able to fallback to fstat() or fstatat()". Which ultimately is > the exact same thing that containers often assume. > > So really, just skipping on various system calls isn't going to work. > You can't just implement new system calls and forget about the rest > unless you know exactly what workloads your architecure will run on. > > Please implement fstat() or fstatat() and stop inventing hacks for > statx() to make weird sandboxing rules work, please. Do you mean we should add fstat64_time64() for all architectures then? Would use use the same structure layout as statx for this, the 64-bit version of the 'struct stat' layout from include/uapi/asm-generic/stat.h, or something new that solves the same problems? I definitely don't want to see a new time32 API added to mips64 and the 32-bit architectures, so the existing stat64 interface won't work as a statx replacement. Arnd ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 13:46 ` Arnd Bergmann @ 2024-02-26 15:40 ` Christian Brauner 2024-02-26 16:49 ` Xi Ruoyao 0 siblings, 1 reply; 24+ messages in thread From: Christian Brauner @ 2024-02-26 15:40 UTC (permalink / raw) To: Arnd Bergmann Cc: Xi Ruoyao, Icenowy Zheng, Huacai Chen, WANG Xuerui, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 02:46:49PM +0100, Arnd Bergmann wrote: > On Mon, Feb 26, 2024, at 14:32, Christian Brauner wrote: > > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > >> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > > > What this tells me without knowing the exact reason is that they thought > > "Oh, if we just return ENOSYS then the workload or glibc will just > > always be able to fallback to fstat() or fstatat()". Which ultimately is > > the exact same thing that containers often assume. > > > > So really, just skipping on various system calls isn't going to work. > > You can't just implement new system calls and forget about the rest > > unless you know exactly what workloads your architecure will run on. > > > > Please implement fstat() or fstatat() and stop inventing hacks for > > statx() to make weird sandboxing rules work, please. > > Do you mean we should add fstat64_time64() for all architectures > then? Would use use the same structure layout as statx for this, > the 64-bit version of the 'struct stat' layout from > include/uapi/asm-generic/stat.h, or something new that solves > the same problems? > > I definitely don't want to see a new time32 API added to > mips64 and the 32-bit architectures, so the existing stat64 > interface won't work as a statx replacement. I don't specifically care but the same way you don't want to see newer time32 apis added to architectures I don't want to have hacks in our system calls that aren't even a clear solution to the problem outlined in this thread. Short of adding fstatx() the problem isn't solved by a new flag to statx() as explained in my other mails. But I'm probably missing something here because I find this notion of "design system calls for seccomp and the Chromium sandbox" to be an absurd notion and it makes me a bit impatient. And fwiw, once mseal() lands seccomp should be a lot easier to get deep argument inspection. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 15:40 ` Christian Brauner @ 2024-02-26 16:49 ` Xi Ruoyao 0 siblings, 0 replies; 24+ messages in thread From: Xi Ruoyao @ 2024-02-26 16:49 UTC (permalink / raw) To: Christian Brauner, Arnd Bergmann Cc: Icenowy Zheng, Huacai Chen, WANG Xuerui, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, 2024-02-26 at 16:40 +0100, Christian Brauner wrote: > > I definitely don't want to see a new time32 API added to > > mips64 and the 32-bit architectures, so the existing stat64 > > interface won't work as a statx replacement. > > I don't specifically care but the same way you don't want to see newer > time32 apis added to architectures I don't want to have hacks in our > system calls that aren't even a clear solution to the problem outlined > in this thread. So we should have a fstat_whatever64, IMO. > Short of adding fstatx() the problem isn't solved by a new flag to > statx() as explained in my other mails. But I'm probably missing > something here because I find this notion of "design system calls for > seccomp and the Chromium sandbox" to be an absurd notion and it makes me > a bit impatient. I'm sharing the feeling on seccomp and/or (mis)uses of it, but using statx() or fstatat() for fstat() has a performance impact as they must inspect path (do a uaccess) and make sure it's an empty string, and Linus concluded "if the user want fstat, you should give the user fstat" for this issue: https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html If it was just seccomp I'd not comment on this topic at all. -- Xi Ruoyao <xry111@xry111.site> School of Aerospace Science and Technology, Xidian University ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 13:32 ` Christian Brauner 2024-02-26 13:46 ` Arnd Bergmann @ 2024-02-26 13:46 ` Christian Brauner 2024-02-26 14:00 ` WANG Xuerui 2 siblings, 0 replies; 24+ messages in thread From: Christian Brauner @ 2024-02-26 13:46 UTC (permalink / raw) To: Arnd Bergmann Cc: Xi Ruoyao, Icenowy Zheng, Huacai Chen, WANG Xuerui, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 02:32:09PM +0100, Christian Brauner wrote: > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > > >> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > > >> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > >> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > >> > > > My idea is this problem needs syscalls to be designed with deep > > >> > > > argument inspection in mind; syscalls before this should be > > >> > > > considered > > >> > > > as historical error and get fixed by resotring old syscalls. > > >> > > > > >> > > I'd not consider fstat an error as using statx for fstat has a > > >> > > performance impact (severe for some workflows), and Linus has > > >> > > concluded > > >> > > > >> > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > > > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > > > "AT_NULL_PATH"/nullptr in the first place? > > > > Not sure, but it's hard to change now since the libc > > implementation won't easily know whether using the NULL > > path is safe on a given kernel. It could check the kernel > > version number, but that adds another bit of complexity in > > the fast path and doesn't work on old kernels with the > > feature backported. > > > > > But it's not irrational to pass a path to syscall, as long as we still > > > have the concept of file system (maybe in 2371 or some year we'll use a > > > 128-bit UUID instead of path). > > > > > >> The problem I see with the 'use use fstat' approach is that this > > >> does not work on 32-bit architectures, unless we define a new > > >> fstatat64_time64() syscall, which is one of the things that statx() > > > > > > "fstat64_time64". Using statx for fstatat should be just fine. > > > > Right. It does feel wrong to have only an fstat() variant but not > > fstatat() if we go there. > > > > > Or maybe we can just introduce a new AT_something to make statx > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > > > I think this is better than going back to fstat64_time64(), but > > it's still not great because > > > > - all the reserved flags on statx() are by definition incompatible > > with existing kernels that return -EINVAL for any flag they do > > not recognize. > > > > - you still need to convince libc developers to actually use > > the flag despite the backwards compatibility problem, either > > with a fallback to the current behavior or a version check. > > > > Using the NULL path as a fallback would solve the problem with > > seccomp, but it would not make the normal case any faster. > > > > >> was trying to avoid. > > > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > > "new" is not what I'd expected... The "new" actually means "newer than > > > Linux 0.9"! :( > > > > > > Let's not use "new" in future syscall names... > > > > Right, we definitely can't ever succeed. On some architectures > > we even had "oldstat" and "stat" before "newstat" and "stat64", > > and on some architectures we mix them up. E.g. x86_64 has fstat() > > and fstatat64() with the same structure but doesn't define > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > > timestamps unlike all other 64-bit architectures. > > > > statx() was intended to solve these problems once and for all, > > and it appears that we have failed again. > > New apis don't invalidate old apis necessarily. That's just not going to > work in an age where you have containerized workloads. > > statx() is just the beginning of this. A container may have aritrary > seccomp profiles that return ENOSYS or even EPERM for whatever reason > for any new api that exists. So not implementing fstat() might already > break container workloads. > > Another example: You can't just skip on implementing mount() and only > implement the new mount api for example. Because tools that look for api > simplicity and don't need complex setup will _always_ continue to use > mount() and have a right to do so. > > And fwiw, mount() isn't fully inspectable by seccomp since forever. The > list goes on and on. > > But let's look at the original mail. Why are they denying statx() and > what's that claim about it not being able to be rewritten to something > safe? Looking at: > > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, > void* fs_denied_errno) { > if (args.nr == __NR_fstatat_default) { > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), > reinterpret_cast<default_stat_struct*>(args.args[2])); > } > return -reinterpret_cast<intptr_t>(fs_denied_errno); > } > > What this does it to rewrite fstatat() to fstat() if it was made with > AT_EMPTY_PATH and the path argument was "". That is easily doable for > statx() because it has the exact same AT_EMPTY_PATH semantics that > fstatat() has. > > Plus, they can even filter on mask and rewrite that to something that > they think is safe. For example, STATX_BASIC_STATS which is equivalent > to what any fstat() call returns. So it's pretty difficult to understand > what their actual gripe with statx() is. > > It can't be that statx() passes a struct because fstatat() and fstat() > do that too. So what exactly is that problem there? > > What this tells me without knowing the exact reason is that they thought > "Oh, if we just return ENOSYS then the workload or glibc will just > always be able to fallback to fstat() or fstatat()". Which ultimately is > the exact same thing that containers often assume. > > So really, just skipping on various system calls isn't going to work. > You can't just implement new system calls and forget about the rest > unless you know exactly what workloads your architecure will run on. > > Please implement fstat() or fstatat() and stop inventing hacks for > statx() to make weird sandboxing rules work, please. And fwiw, if they rewrite fstatat() to fstat() then they must be worrying about another thread racing and putting something in the second argument (the path argument) after they have inspected the system call arguments but before the system call continue. But if that is their worry then a new flag to statx() won't help at all. Because then they really want to use fstat(). IOW, one would have to add fstatx() which brings us back to square one. I'm mostly going by that code snippet of course. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 13:32 ` Christian Brauner 2024-02-26 13:46 ` Arnd Bergmann 2024-02-26 13:46 ` Christian Brauner @ 2024-02-26 14:00 ` WANG Xuerui 2024-02-26 15:35 ` Christian Brauner 2 siblings, 1 reply; 24+ messages in thread From: WANG Xuerui @ 2024-02-26 14:00 UTC (permalink / raw) To: Christian Brauner, Arnd Bergmann Cc: Xi Ruoyao, Icenowy Zheng, Huacai Chen, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On 2/26/24 21:32, Christian Brauner wrote: > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: >> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: >>> On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: >>>> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: >>>>> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: >>>>>> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: >>>>>>> My idea is this problem needs syscalls to be designed with deep >>>>>>> argument inspection in mind; syscalls before this should be >>>>>>> considered >>>>>>> as historical error and get fixed by resotring old syscalls. >>>>>> >>>>>> I'd not consider fstat an error as using statx for fstat has a >>>>>> performance impact (severe for some workflows), and Linus has >>>>>> concluded >>>>> >>>>> Sorry for clearance, I mean statx is an error in ABI design, not fstat. >>> >>> I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of >>> "AT_NULL_PATH"/nullptr in the first place? >> >> Not sure, but it's hard to change now since the libc >> implementation won't easily know whether using the NULL >> path is safe on a given kernel. It could check the kernel >> version number, but that adds another bit of complexity in >> the fast path and doesn't work on old kernels with the >> feature backported. >> >>> But it's not irrational to pass a path to syscall, as long as we still >>> have the concept of file system (maybe in 2371 or some year we'll use a >>> 128-bit UUID instead of path). >>> >>>> The problem I see with the 'use use fstat' approach is that this >>>> does not work on 32-bit architectures, unless we define a new >>>> fstatat64_time64() syscall, which is one of the things that statx() >>> >>> "fstat64_time64". Using statx for fstatat should be just fine. >> >> Right. It does feel wrong to have only an fstat() variant but not >> fstatat() if we go there. >> >>> Or maybe we can just introduce a new AT_something to make statx >>> completely ignore pathname but behave like AT_EMPTY_PATH + "". >> >> I think this is better than going back to fstat64_time64(), but >> it's still not great because >> >> - all the reserved flags on statx() are by definition incompatible >> with existing kernels that return -EINVAL for any flag they do >> not recognize. >> >> - you still need to convince libc developers to actually use >> the flag despite the backwards compatibility problem, either >> with a fallback to the current behavior or a version check. >> >> Using the NULL path as a fallback would solve the problem with >> seccomp, but it would not make the normal case any faster. >> >>>> was trying to avoid. >>> >>> Oops. I thought "newstat" should be using 64-bit time but it seems the >>> "new" is not what I'd expected... The "new" actually means "newer than >>> Linux 0.9"! :( >>> >>> Let's not use "new" in future syscall names... >> >> Right, we definitely can't ever succeed. On some architectures >> we even had "oldstat" and "stat" before "newstat" and "stat64", >> and on some architectures we mix them up. E.g. x86_64 has fstat() >> and fstatat64() with the same structure but doesn't define >> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit >> timestamps unlike all other 64-bit architectures. >> >> statx() was intended to solve these problems once and for all, >> and it appears that we have failed again. > > New apis don't invalidate old apis necessarily. That's just not going to > work in an age where you have containerized workloads. > > statx() is just the beginning of this. A container may have aritrary > seccomp profiles that return ENOSYS or even EPERM for whatever reason > for any new api that exists. So not implementing fstat() might already > break container workloads. > > Another example: You can't just skip on implementing mount() and only > implement the new mount api for example. Because tools that look for api > simplicity and don't need complex setup will _always_ continue to use > mount() and have a right to do so. > > And fwiw, mount() isn't fully inspectable by seccomp since forever. The > list goes on and on. > > But let's look at the original mail. Why are they denying statx() and > what's that claim about it not being able to be rewritten to something > safe? Looking at: > > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, > void* fs_denied_errno) { > if (args.nr == __NR_fstatat_default) { > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), > reinterpret_cast<default_stat_struct*>(args.args[2])); > } > return -reinterpret_cast<intptr_t>(fs_denied_errno); > } > > What this does it to rewrite fstatat() to fstat() if it was made with > AT_EMPTY_PATH and the path argument was "". That is easily doable for > statx() because it has the exact same AT_EMPTY_PATH semantics that > fstatat() has. > > Plus, they can even filter on mask and rewrite that to something that > they think is safe. For example, STATX_BASIC_STATS which is equivalent > to what any fstat() call returns. So it's pretty difficult to understand > what their actual gripe with statx() is. > > It can't be that statx() passes a struct because fstatat() and fstat() > do that too. So what exactly is that problem there? From our investigation: For (new)fstatat calls that the sandboxed process may make, this SIGSYS handler either: * turns allowed calls (those looking at fd's) into fstat's that only have one argument (the fd) each, or * denies the call, so the sandbox only ever sees fstat calls and no (new)fstatat's, and the guarantee that only open fds can ever been stat'ed trivially holds. With statx, however, there's no way of guaranteeing "only look at fd" semantics without peeking into the path argument, because a non-empty path makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to use making it near-impossible to introduce new semantics in a backwards-compatible manner. > What this tells me without knowing the exact reason is that they thought > "Oh, if we just return ENOSYS then the workload or glibc will just > always be able to fallback to fstat() or fstatat()". Which ultimately is > the exact same thing that containers often assume. > > So really, just skipping on various system calls isn't going to work. > You can't just implement new system calls and forget about the rest > unless you know exactly what workloads your architecure will run on. > > Please implement fstat() or fstatat() and stop inventing hacks for > statx() to make weird sandboxing rules work, please. We have already provided fstat(at) on LoongArch for a while by unconditionally doing statx and translating the returned structure -- see the [glibc] and [golang] [golang-2] implementations for example -- without fallbacks because we know that the old syscalls don't exist. So if we effectively back out the kernel-side change (removing fstat) we must arrange for the various syscall makers to be able to fallback to fstat too. (The glibc code seems self-adapting to presence of newfstatat, but Go could need more care. Though a simple rebuild of libc is enough for the Chromium sandbox to work without code changes on their side.) [glibc]: https://sourceware.org/pipermail/libc-alpha/2022-May/138958.html [golang]: https://go.dev/cl/407694 [golang-2]: https://go.dev/cl/411378 -- WANG "xen0n" Xuerui Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 14:00 ` WANG Xuerui @ 2024-02-26 15:35 ` Christian Brauner 2024-02-26 17:38 ` WANG Xuerui 0 siblings, 1 reply; 24+ messages in thread From: Christian Brauner @ 2024-02-26 15:35 UTC (permalink / raw) To: WANG Xuerui Cc: Arnd Bergmann, Xi Ruoyao, Icenowy Zheng, Huacai Chen, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote: > On 2/26/24 21:32, Christian Brauner wrote: > > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: > > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: > > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: > > > > > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: > > > > > > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > > > > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > > > > > > My idea is this problem needs syscalls to be designed with deep > > > > > > > > argument inspection in mind; syscalls before this should be > > > > > > > > considered > > > > > > > > as historical error and get fixed by resotring old syscalls. > > > > > > > > > > > > > > I'd not consider fstat an error as using statx for fstat has a > > > > > > > performance impact (severe for some workflows), and Linus has > > > > > > > concluded > > > > > > > > > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. > > > > > > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of > > > > "AT_NULL_PATH"/nullptr in the first place? > > > > > > Not sure, but it's hard to change now since the libc > > > implementation won't easily know whether using the NULL > > > path is safe on a given kernel. It could check the kernel > > > version number, but that adds another bit of complexity in > > > the fast path and doesn't work on old kernels with the > > > feature backported. > > > > > > > But it's not irrational to pass a path to syscall, as long as we still > > > > have the concept of file system (maybe in 2371 or some year we'll use a > > > > 128-bit UUID instead of path). > > > > > > > > > The problem I see with the 'use use fstat' approach is that this > > > > > does not work on 32-bit architectures, unless we define a new > > > > > fstatat64_time64() syscall, which is one of the things that statx() > > > > > > > > "fstat64_time64". Using statx for fstatat should be just fine. > > > > > > Right. It does feel wrong to have only an fstat() variant but not > > > fstatat() if we go there. > > > > > > > Or maybe we can just introduce a new AT_something to make statx > > > > completely ignore pathname but behave like AT_EMPTY_PATH + "". > > > > > > I think this is better than going back to fstat64_time64(), but > > > it's still not great because > > > > > > - all the reserved flags on statx() are by definition incompatible > > > with existing kernels that return -EINVAL for any flag they do > > > not recognize. > > > > > > - you still need to convince libc developers to actually use > > > the flag despite the backwards compatibility problem, either > > > with a fallback to the current behavior or a version check. > > > > > > Using the NULL path as a fallback would solve the problem with > > > seccomp, but it would not make the normal case any faster. > > > > > > > > was trying to avoid. > > > > > > > > Oops. I thought "newstat" should be using 64-bit time but it seems the > > > > "new" is not what I'd expected... The "new" actually means "newer than > > > > Linux 0.9"! :( > > > > > > > > Let's not use "new" in future syscall names... > > > > > > Right, we definitely can't ever succeed. On some architectures > > > we even had "oldstat" and "stat" before "newstat" and "stat64", > > > and on some architectures we mix them up. E.g. x86_64 has fstat() > > > and fstatat64() with the same structure but doesn't define > > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit > > > timestamps unlike all other 64-bit architectures. > > > > > > statx() was intended to solve these problems once and for all, > > > and it appears that we have failed again. > > > > New apis don't invalidate old apis necessarily. That's just not going to > > work in an age where you have containerized workloads. > > > > statx() is just the beginning of this. A container may have aritrary > > seccomp profiles that return ENOSYS or even EPERM for whatever reason > > for any new api that exists. So not implementing fstat() might already > > break container workloads. > > > > Another example: You can't just skip on implementing mount() and only > > implement the new mount api for example. Because tools that look for api > > simplicity and don't need complex setup will _always_ continue to use > > mount() and have a right to do so. > > > > And fwiw, mount() isn't fully inspectable by seccomp since forever. The > > list goes on and on. > > > > But let's look at the original mail. Why are they denying statx() and > > what's that claim about it not being able to be rewritten to something > > safe? Looking at: > > > > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, > > void* fs_denied_errno) { > > if (args.nr == __NR_fstatat_default) { > > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > > return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), > > reinterpret_cast<default_stat_struct*>(args.args[2])); > > } > > return -reinterpret_cast<intptr_t>(fs_denied_errno); > > } > > > > What this does it to rewrite fstatat() to fstat() if it was made with > > AT_EMPTY_PATH and the path argument was "". That is easily doable for > > statx() because it has the exact same AT_EMPTY_PATH semantics that > > fstatat() has. > > > > Plus, they can even filter on mask and rewrite that to something that > > they think is safe. For example, STATX_BASIC_STATS which is equivalent > > to what any fstat() call returns. So it's pretty difficult to understand > > what their actual gripe with statx() is. > > > > It can't be that statx() passes a struct because fstatat() and fstat() > > do that too. So what exactly is that problem there? > > From our investigation: > > For (new)fstatat calls that the sandboxed process may make, this SIGSYS > handler either: > > * turns allowed calls (those looking at fd's) into fstat's that only have > one argument (the fd) each, or > * denies the call, Yes, but look at the filtering that they do: if (args.nr == __NR_fstatat_default) { if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { So if you have a statx() call instead of an fstatat() call this is trivially: if (args.nr == __NR_statx) { if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) { maybe if they care about it also simply check args.args[3] == STATX_BASIC_STATS. And then just as with fstatat() rewrite it to fstat(). > > so the sandbox only ever sees fstat calls and no (new)fstatat's, and the > guarantee that only open fds can ever been stat'ed trivially holds. > > With statx, however, there's no way of guaranteeing "only look at fd" > semantics without peeking into the path argument, because a non-empty path > makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to > use making it near-impossible to introduce new semantics in a > backwards-compatible manner. I don't understand. That's exactly the same thing as for fstatat(). My point is that you can turn statx() into fstat() just like you can turn fstatat() into fstat(). So if you add fstat()/fstat64() what's left to do? > > What this tells me without knowing the exact reason is that they thought > > "Oh, if we just return ENOSYS then the workload or glibc will just > > always be able to fallback to fstat() or fstatat()". Which ultimately is > > the exact same thing that containers often assume. > > > > So really, just skipping on various system calls isn't going to work. > > You can't just implement new system calls and forget about the rest > > unless you know exactly what workloads your architecure will run on. > > > > Please implement fstat() or fstatat() and stop inventing hacks for > > statx() to make weird sandboxing rules work, please. > > We have already provided fstat(at) on LoongArch for a while by > unconditionally doing statx and translating the returned structure -- see > the [glibc] and [golang] [golang-2] implementations for example -- without But you're doing that translation in userspace. I was talking about adding the fstat()/fstat64() system calls. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 15:35 ` Christian Brauner @ 2024-02-26 17:38 ` WANG Xuerui 0 siblings, 0 replies; 24+ messages in thread From: WANG Xuerui @ 2024-02-26 17:38 UTC (permalink / raw) To: Christian Brauner Cc: Arnd Bergmann, Xi Ruoyao, Icenowy Zheng, Huacai Chen, linux-api, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, Linux-Arch, Linux Kernel Mailing List On 2/26/24 23:35, Christian Brauner wrote: > On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote: >> On 2/26/24 21:32, Christian Brauner wrote: >>> On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote: >>>> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote: >>>>> On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote: >>>>>> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote: >>>>>>> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: >>>>>>>> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: >>>>>>>>> My idea is this problem needs syscalls to be designed with deep >>>>>>>>> argument inspection in mind; syscalls before this should be >>>>>>>>> considered >>>>>>>>> as historical error and get fixed by resotring old syscalls. >>>>>>>> I'd not consider fstat an error as using statx for fstat has a >>>>>>>> performance impact (severe for some workflows), and Linus has >>>>>>>> concluded >>>>>>> Sorry for clearance, I mean statx is an error in ABI design, not fstat. >>>>> I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of >>>>> "AT_NULL_PATH"/nullptr in the first place? >>>> Not sure, but it's hard to change now since the libc >>>> implementation won't easily know whether using the NULL >>>> path is safe on a given kernel. It could check the kernel >>>> version number, but that adds another bit of complexity in >>>> the fast path and doesn't work on old kernels with the >>>> feature backported. >>>> >>>>> But it's not irrational to pass a path to syscall, as long as we still >>>>> have the concept of file system (maybe in 2371 or some year we'll use a >>>>> 128-bit UUID instead of path). >>>>> >>>>>> The problem I see with the 'use use fstat' approach is that this >>>>>> does not work on 32-bit architectures, unless we define a new >>>>>> fstatat64_time64() syscall, which is one of the things that statx() >>>>> "fstat64_time64". Using statx for fstatat should be just fine. >>>> Right. It does feel wrong to have only an fstat() variant but not >>>> fstatat() if we go there. >>>> >>>>> Or maybe we can just introduce a new AT_something to make statx >>>>> completely ignore pathname but behave like AT_EMPTY_PATH + "". >>>> I think this is better than going back to fstat64_time64(), but >>>> it's still not great because >>>> >>>> - all the reserved flags on statx() are by definition incompatible >>>> with existing kernels that return -EINVAL for any flag they do >>>> not recognize. >>>> >>>> - you still need to convince libc developers to actually use >>>> the flag despite the backwards compatibility problem, either >>>> with a fallback to the current behavior or a version check. >>>> >>>> Using the NULL path as a fallback would solve the problem with >>>> seccomp, but it would not make the normal case any faster. >>>> >>>>>> was trying to avoid. >>>>> Oops. I thought "newstat" should be using 64-bit time but it seems the >>>>> "new" is not what I'd expected... The "new" actually means "newer than >>>>> Linux 0.9"! :( >>>>> >>>>> Let's not use "new" in future syscall names... >>>> Right, we definitely can't ever succeed. On some architectures >>>> we even had "oldstat" and "stat" before "newstat" and "stat64", >>>> and on some architectures we mix them up. E.g. x86_64 has fstat() >>>> and fstatat64() with the same structure but doesn't define >>>> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit >>>> timestamps unlike all other 64-bit architectures. >>>> >>>> statx() was intended to solve these problems once and for all, >>>> and it appears that we have failed again. >>> New apis don't invalidate old apis necessarily. That's just not going to >>> work in an age where you have containerized workloads. >>> >>> statx() is just the beginning of this. A container may have aritrary >>> seccomp profiles that return ENOSYS or even EPERM for whatever reason >>> for any new api that exists. So not implementing fstat() might already >>> break container workloads. >>> >>> Another example: You can't just skip on implementing mount() and only >>> implement the new mount api for example. Because tools that look for api >>> simplicity and don't need complex setup will _always_ continue to use >>> mount() and have a right to do so. >>> >>> And fwiw, mount() isn't fully inspectable by seccomp since forever. The >>> list goes on and on. >>> >>> But let's look at the original mail. Why are they denying statx() and >>> what's that claim about it not being able to be rewritten to something >>> safe? Looking at: >>> >>> intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args, >>> void* fs_denied_errno) { >>> if (args.nr == __NR_fstatat_default) { >>> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && >>> args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { >>> return syscall(__NR_fstat_default, static_cast<int>(args.args[0]), >>> reinterpret_cast<default_stat_struct*>(args.args[2])); >>> } >>> return -reinterpret_cast<intptr_t>(fs_denied_errno); >>> } >>> >>> What this does it to rewrite fstatat() to fstat() if it was made with >>> AT_EMPTY_PATH and the path argument was "". That is easily doable for >>> statx() because it has the exact same AT_EMPTY_PATH semantics that >>> fstatat() has. >>> >>> Plus, they can even filter on mask and rewrite that to something that >>> they think is safe. For example, STATX_BASIC_STATS which is equivalent >>> to what any fstat() call returns. So it's pretty difficult to understand >>> what their actual gripe with statx() is. >>> >>> It can't be that statx() passes a struct because fstatat() and fstat() >>> do that too. So what exactly is that problem there? >> From our investigation: >> >> For (new)fstatat calls that the sandboxed process may make, this SIGSYS >> handler either: >> >> * turns allowed calls (those looking at fd's) into fstat's that only have >> one argument (the fd) each, or >> * denies the call, > Yes, but look at the filtering that they do: > > if (args.nr == __NR_fstatat_default) { > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > > So if you have a statx() call instead of an fstatat() call this is > trivially: > > if (args.nr == __NR_statx) { > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' && > args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) { > > maybe if they care about it also simply check > args.args[3] == STATX_BASIC_STATS. > > And then just as with fstatat() rewrite it to fstat(). But fstat() and fstatat() share the same return value type i.e. struct stat, different from struct statx. And they are different enough that their existing seccomp policy can distinguish. In the statx-only case though, the seccomp policy cannot distinguish "statx actually called with empty path" from "statx called with AT_EMPTY_PATH but non-empty path" because in both cases the path would be a non-NULL pointer opaque to the policy cBPF program. >> so the sandbox only ever sees fstat calls and no (new)fstatat's, and the >> guarantee that only open fds can ever been stat'ed trivially holds. >> >> With statx, however, there's no way of guaranteeing "only look at fd" >> semantics without peeking into the path argument, because a non-empty path >> makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to >> use making it near-impossible to introduce new semantics in a >> backwards-compatible manner. > I don't understand. That's exactly the same thing as for fstatat(). My > point is that you can turn statx() into fstat() just like you can turn > fstatat() into fstat(). So if you add fstat()/fstat64() what's left to > do? Yes, once fstat is restored it's a matter of transforming every allowed statx into fstat, then translating struct stat back into struct statx. What we're seeking is a possible way forward without re-introducing that though, because we still have some time and don't have to rush. >>> What this tells me without knowing the exact reason is that they thought >>> "Oh, if we just return ENOSYS then the workload or glibc will just >>> always be able to fallback to fstat() or fstatat()". Which ultimately is >>> the exact same thing that containers often assume. >>> >>> So really, just skipping on various system calls isn't going to work. >>> You can't just implement new system calls and forget about the rest >>> unless you know exactly what workloads your architecure will run on. >>> >>> Please implement fstat() or fstatat() and stop inventing hacks for >>> statx() to make weird sandboxing rules work, please. >> We have already provided fstat(at) on LoongArch for a while by >> unconditionally doing statx and translating the returned structure -- see >> the [glibc] and [golang] [golang-2] implementations for example -- without > But you're doing that translation in userspace. I was talking about > adding the fstat()/fstat64() system calls. Hmm, yeah. I meant to provide some more context but I later realized that the sandbox is in charge of rewriting the syscalls from inside the sandbox, so the userland may not matter in the big picture after all. -- WANG "xen0n" Xuerui Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/ ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? 2024-02-26 6:03 ` Icenowy Zheng 2024-02-26 6:56 ` Arnd Bergmann @ 2024-02-26 8:26 ` Christian Brauner 1 sibling, 0 replies; 24+ messages in thread From: Christian Brauner @ 2024-02-26 8:26 UTC (permalink / raw) To: Icenowy Zheng Cc: Xi Ruoyao, Huacai Chen, WANG Xuerui, linux-api, Arnd Bergmann, Kees Cook, Xuefeng Li, Jianmin Lv, Xiaotian Wu, WANG Rui, Miao Wang, loongarch, linux-arch, Linux Kernel Mailing List On Mon, Feb 26, 2024 at 02:03:48PM +0800, Icenowy Zheng wrote: > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道: > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote: > > > > From my point of view, I prefer to "restore fstat", because we > > > > need > > > > to > > > > use the Chrome sandbox everyday (even though it hasn't been > > > > upstream > > > > by now). But I also hope "seccomp deep argument inspection" can > > > > be > > > > solved in the future. > > > > > > My idea is this problem needs syscalls to be designed with deep > > > argument inspection in mind; syscalls before this should be > > > considered > > > as historical error and get fixed by resotring old syscalls. > > > > I'd not consider fstat an error as using statx for fstat has a > > performance impact (severe for some workflows), and Linus has > > concluded > > Sorry for clearance, I mean statx is an error in ABI design, not fstat. We will not be limited arbitrarly in system call design by seccomp being unable to do deep argument inspection. That ship has sailed many years ago. And it's a bit laughable to disalow pointer arguments and structs in system calls because seccomp isn't able to inspect them. ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-02-26 17:38 UTC | newest] Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-02-21 6:09 Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again? WANG Xuerui 2024-02-21 6:31 ` Xi Ruoyao 2024-02-21 10:31 ` Xi Ruoyao 2024-02-21 10:49 ` WANG Xuerui 2024-02-21 12:03 ` Xi Ruoyao 2024-02-24 11:51 ` Huacai Chen 2024-02-25 6:51 ` Icenowy Zheng 2024-02-25 7:32 ` Xi Ruoyao 2024-02-26 6:03 ` Icenowy Zheng 2024-02-26 6:56 ` Arnd Bergmann 2024-02-26 7:09 ` Xi Ruoyao 2024-02-26 9:20 ` Arnd Bergmann 2024-02-26 11:57 ` Xi Ruoyao 2024-02-26 12:57 ` Christian Brauner 2024-02-26 14:33 ` Rich Felker 2024-02-26 13:32 ` Christian Brauner 2024-02-26 13:46 ` Arnd Bergmann 2024-02-26 15:40 ` Christian Brauner 2024-02-26 16:49 ` Xi Ruoyao 2024-02-26 13:46 ` Christian Brauner 2024-02-26 14:00 ` WANG Xuerui 2024-02-26 15:35 ` Christian Brauner 2024-02-26 17:38 ` WANG Xuerui 2024-02-26 8:26 ` Christian Brauner
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).