linux-kernel.vger.kernel.org archive mirror
* man-pages-6.06 released
@ 2024-02-12  1:44  2% Alejandro Colomar
  0 siblings, 0 replies; 200+ results
From: Alejandro Colomar @ 2024-02-12  1:44 UTC (permalink / raw)
  To: linux-man, linux-kernel, libc-alpha


Gidday!

I'm proud to announce:

tag man-pages-6.06
Tagger: Alejandro Colomar <alx@kernel.org>
Date:   Mon Feb 12 02:19:45 2024 +0100

man-pages-6.06 - manual pages for GNU/Linux

The following `make check` errors are known, and can be safely ignored
by touching all those files:

	$ make check -kj >/dev/null 2>&1;
	$ make check -i 2>/dev/null;
	GREP    .tmp/man/man1/memusage.1.check-catman.touch
	GREP    .tmp/man/man3/mallopt.3.check-catman.touch
	TROFF   .tmp/man/man3/unlocked_stdio.3.cat.set
	GROTTY  .tmp/man/man3/unlocked_stdio.3.cat
	COL     .tmp/man/man3/unlocked_stdio.3.cat.grep
	GREP    .tmp/man/man3/unlocked_stdio.3.check-catman.touch
	TROFF   .tmp/man/man4/console_codes.4.cat.set
	GROTTY  .tmp/man/man4/console_codes.4.cat
	COL     .tmp/man/man4/console_codes.4.cat.grep
	GREP    .tmp/man/man4/console_codes.4.check-catman.touch
	TROFF   .tmp/man/man4/lirc.4.cat.set
	GROTTY  .tmp/man/man4/lirc.4.cat
	COL     .tmp/man/man4/lirc.4.cat.grep
	GREP    .tmp/man/man4/lirc.4.check-catman.touch
	GREP    .tmp/man/man4/smartpqi.4.check-catman.touch
	GREP    .tmp/man/man4/veth.4.check-catman.touch
	GREP    .tmp/man/man5/proc_buddyinfo.5.check-catman.touch
	GREP    .tmp/man/man5/proc_pid_fdinfo.5.check-catman.touch
	GREP    .tmp/man/man5/proc_pid_maps.5.check-catman.touch
	GREP    .tmp/man/man5/proc_pid_mountinfo.5.check-catman.touch
	GREP    .tmp/man/man5/proc_pid_net.5.check-catman.touch
	TROFF   .tmp/man/man5/proc_pid_smaps.5.cat.set
	GROTTY  .tmp/man/man5/proc_pid_smaps.5.cat
	COL     .tmp/man/man5/proc_pid_smaps.5.cat.grep
	GREP    .tmp/man/man5/proc_pid_smaps.5.check-catman.touch
	GREP    .tmp/man/man5/proc_timer_stats.5.check-catman.touch
	GREP    .tmp/man/man5/proc_version.5.check-catman.touch
	GREP    .tmp/man/man5/slabinfo.5.check-catman.touch
	TROFF   .tmp/man/man5/tzfile.5.cat.set
	GROTTY  .tmp/man/man5/tzfile.5.cat
	COL     .tmp/man/man5/tzfile.5.cat.grep
	GREP    .tmp/man/man5/tzfile.5.check-catman.touch
	TROFF   .tmp/man/man7/ascii.7.cat.set
	GROTTY  .tmp/man/man7/ascii.7.cat
	COL     .tmp/man/man7/ascii.7.cat.grep
	GREP    .tmp/man/man7/ascii.7.check-catman.touch
	TROFF   .tmp/man/man7/bpf-helpers.7.cat.set
	GROTTY  .tmp/man/man7/bpf-helpers.7.cat
	COL     .tmp/man/man7/bpf-helpers.7.cat.grep
	GREP    .tmp/man/man7/bpf-helpers.7.check-catman.touch
	TROFF   .tmp/man/man7/charsets.7.cat.set
	GROTTY  .tmp/man/man7/charsets.7.cat
	COL     .tmp/man/man7/charsets.7.cat.grep
	GREP    .tmp/man/man7/charsets.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-1.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-1.7.cat
	COL     .tmp/man/man7/iso_8859-1.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-1.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-10.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-10.7.cat
	COL     .tmp/man/man7/iso_8859-10.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-10.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-11.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-11.7.cat
	COL     .tmp/man/man7/iso_8859-11.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-11.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-13.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-13.7.cat
	COL     .tmp/man/man7/iso_8859-13.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-13.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-14.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-14.7.cat
	COL     .tmp/man/man7/iso_8859-14.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-14.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-15.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-15.7.cat
	COL     .tmp/man/man7/iso_8859-15.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-15.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-16.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-16.7.cat
	COL     .tmp/man/man7/iso_8859-16.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-16.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-2.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-2.7.cat
	COL     .tmp/man/man7/iso_8859-2.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-2.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-3.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-3.7.cat
	COL     .tmp/man/man7/iso_8859-3.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-3.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-4.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-4.7.cat
	COL     .tmp/man/man7/iso_8859-4.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-4.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-5.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-5.7.cat
	COL     .tmp/man/man7/iso_8859-5.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-5.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-6.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-6.7.cat
	COL     .tmp/man/man7/iso_8859-6.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-6.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-7.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-7.7.cat
	COL     .tmp/man/man7/iso_8859-7.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-7.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-8.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-8.7.cat
	COL     .tmp/man/man7/iso_8859-8.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-8.7.check-catman.touch
	TROFF   .tmp/man/man7/iso_8859-9.7.cat.set
	GROTTY  .tmp/man/man7/iso_8859-9.7.cat
	COL     .tmp/man/man7/iso_8859-9.7.cat.grep
	GREP    .tmp/man/man7/iso_8859-9.7.check-catman.touch
	GREP    .tmp/man/man7/keyrings.7.check-catman.touch
	GREP    .tmp/man/man7/uri.7.check-catman.touch
	TROFF   .tmp/man/man8/tzselect.8.cat.set
	GROTTY  .tmp/man/man8/tzselect.8.cat
	COL     .tmp/man/man8/tzselect.8.cat.grep
	GREP    .tmp/man/man8/tzselect.8.check-catman.touch
	TROFF   .tmp/man/man8/zdump.8.cat.set
	GROTTY  .tmp/man/man8/zdump.8.cat
	COL     .tmp/man/man8/zdump.8.cat.grep
	GREP    .tmp/man/man8/zdump.8.check-catman.touch
	TROFF   .tmp/man/man8/zic.8.cat.set
	GROTTY  .tmp/man/man8/zic.8.cat
	COL     .tmp/man/man8/zic.8.cat.grep
	GREP    .tmp/man/man8/zic.8.check-catman.touch


Tarball download:
     <https://kernel.org/pub/linux/docs/man-pages/>
Git repository:
     <https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>
Online PDF book:
     <https://kernel.org/pub/linux/docs/man-pages/book/>

Thank you all for contributing!

Have a lovely night!
Alex

==================== Changes in man-pages-6.06 ====================

Released: 2024-02-12, Aldaya


Contributors
------------

The following people contributed patches/fixes, reports, notes,
ideas, and discussions that have been incorporated in changes in
this release:

"G. Branden Robinson" <branden@debian.org>
"G. Branden Robinson" <g.branden.robinson@gmail.com>
"Huang, Ying" <ying.huang@intel.com>
"Serge E. Hallyn" <serge@hallyn.com>
Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Alejandro Colomar <alx@kernel.org>
Alexander Kozhevnikov <mentalisttraceur@gmail.com>
Alexey Tikhonov <atikhono@redhat.com>
Amir Goldstein <amir73il@gmail.com>
Andreas Schwab <schwab@issan.informatik.uni-dortmund.de>
Andreas Schwab <schwab@linux-m68k.org>
Andreas Schwab <schwab@suse.de>
Andriy Utkin <andriy_utkin@fastmail.com>
Arnav Rawat <rawat.arnav@gmail.com>
Arnd Bergmann <arnd@arndb.de>
Aurelien Jarno <aurel32@debian.org>
Avinesh Kumar <akumar@suse.de>
Axel Rasmussen <axelrasmussen@google.com>
Brian Inglis <Brian.Inglis@Shaw.ca>
Bruno Haible <bruno@clisp.org>
Carlos O'Donell <carlos@redhat.com>
Catalin Marinas <catalin.marinas@arm.com>
Christian Brauner <brauner@kernel.org>
Christopher Lameter <cl@os.amperecomputing.com>
Colin Watson <cjwatson@debian.org>
DJ Delorie <dj@redhat.com>
David Mosberger <davidm@hpl.hp.com>
Deri James <deri@chuzzlewit.myzen.co.uk>
Don Brace <don.brace@microchip.com>
Elliott Hughes <enh@google.com>
Florent Revest <revest@chromium.org>
Florian Lehner <dev@der-flo.net>
Florian Weimer <fweimer@redhat.com>
G. Branden Robinson <g.branden.robinson@gmail.com>
Geoff Keating <geoffk@ozemail.com.au>
Gobinda Das <godas@redhat.com>
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Guillem Jover <guillem@hadrons.org>
Guo Ren <guoren@kernel.org>
Guo Ren <guoren@linux.alibaba.com>
Günther Noack <gnoack@google.com>
Hanno Böck <hanno@hboeck.de>
Helge Kreutzmann <debian@helgefjell.de>
Iker Pedrosa <ipedrosa@redhat.com>
Ingo Schwarze <schwarze@openbsd.org>
Ingo Schwarze <schwarze@usta.de>
Jakub Jelinek <jakub@redhat.com>
Jakub Wilk <jwilk@jwilk.net>
Jan Engelhardt <jengelh@inai.de>
Jan Kara <jack@suse.cz>
John Watts <contact@jookia.org>
Jonathan Wakely <jwakely@redhat.com>
Jonny Grant <jg@jguk.org>
Kees Cook <keescook@chromium.org>
Kevin Barnett <kevin.barnett@microchip.com>
Kuniyuki Iwashima <kuniyu@amazon.com>
Lee Griffiths <poddster@gmail.com>
Luis Chamberlain <mcgrof@kernel.org>
Maciej Żenczykowski <maze@google.com>
Mario Blaettermann <mario.blaettermann@gmail.com>
Matthew House <mattlloydhouse@gmail.com>
Matthias Gerstner <matthias.gerstner@suse.com>
Max Kellermann <max.kellermann@ionos.com>
Michael Kerrisk <mtk.manpages@gmail.com>
Miguel de Icaza <miguel@nuclecu.unam.mx>
Mike McGowen <mike.mcgowen@microchip.com>
Mike Rapoport (IBM) <rppt@kernel.org>
Morten Welinder <mwelinder@gmail.com>
Muhammad Usama Anjum <usama.anjum@collabora.com>
Oskari Pirhonen <xxc3ncoredxx@gmail.com>
Paul Eggert <eggert@cs.ucla.edu>
Paul Smith <psmith@gnu.org>
Peter Xu <peterx@redhat.com>
Petr Vorel <pvorel@suse.cz>
Philip Blundell <pb@nexus.co.uk>
Renzo Davoli <renzo@cs.unibo.it>
Ralf Corsepius <corsepiu@faw.uni-ulm.de>
Sam Roberts <sroberts@uniserve.com>
Richard Henderson <richard@gnu.ai.mit.edu>
Richard Henderson <rth@cygnus.com>
Richard Henderson <rth@tamu.edu>
Rik van Riel <riel@surriel.com>
Roland McGrath <roland@gnu.org>
Sam James <sam@gentoo.org>
Sambit Nayak <sambitnayak@gmail.com>
Samuel Thibault <samuel.thibault@ens-lyon.org>
Sargun Dhillon <sargun@sargun.me>
Sascha Grunert <saschagrunert@gmail.com>
Sascha Grunert <sgrunert@redhat.com>
Scott Benesh <scott.benesh@microchip.com>
Scott Teel <scott.teel@microchip.com>
Serge Hallyn <serge@hallyn.com>
Sergei Gromeniuk <sgromeni@redhat.com>
Shahab Ouraie <shahabouraie@gmail.com>
Shani Leviim <sleviim@redhat.com>
Stefan Puiu <stefan.puiu@gmail.com>
Thorsten Kukuk <kukuk@suse.com>
Tom Schwindl <schwindl@posteo.de>
Tomáš Golembiovský <tgolembi@redhat.com>
Ulrich Drepper <drepper@cygnus.com>
Ulrich Drepper <drepper@redhat.com>
Wolfram Gloger <wg@wolfram.dent.med.uni-muenchen.de>
Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Xavier Leroy <Xavier.Leroy@inria.fr>
Xi Ruoyao <xry111@xry111.site>
Yafang Shao <laoar.shao@gmail.com>
Yang Xu <xuyang2018.jy@fujitsu.com>
Zack Weinberg <zack@owlfolio.org>
Štěpán Němec <stepnem@smrk.net>
Дилян Палаузов <dilyan.palauzov@aegee.org>
наб <nabijaczleweli@nabijaczleweli.xyz>

Apologies if I missed anyone!


New and rewritten pages
-----------------------

man2/
	ioctl_pagemap_scan.2

man3/					(taken from glibc's linuxthreads)
	pthread_cond_init.3
	pthread_condattr_init.3
	pthread_key_create.3
	pthread_mutex_init.3
	pthread_mutexattr_setkind_np.3
	pthread_once.3

man5/
	proc.5				(split into many small pages)
	proc_apm.5
	proc_buddyinfo.5
	proc_bus.5
	proc_cgroups.5
	proc_cmdline.5
	proc_config.gz.5
	proc_cpuinfo.5
	proc_crypto.5
	proc_devices.5
	proc_diskstats.5
	proc_dma.5
	proc_driver.5
	proc_execdomains.5
	proc_fb.5
	proc_filesystems.5
	proc_fs.5
	proc_ide.5
	proc_interrupts.5
	proc_iomem.5
	proc_ioports.5
	proc_kallsyms.5
	proc_kcore.5
	proc_key-users.5
	proc_keys.5
	proc_kmsg.5
	proc_kpagecgroup.5
	proc_kpagecount.5
	proc_kpageflags.5
	proc_ksyms.5
	proc_loadavg.5
	proc_locks.5
	proc_malloc.5
	proc_meminfo.5
	proc_modules.5
	proc_mtrr.5
	proc_partitions.5
	proc_pci.5
	proc_pid.5
	proc_pid_attr.5
	proc_pid_autogroup.5
	proc_pid_auxv.5
	proc_pid_cgroup.5
	proc_pid_clear_refs.5
	proc_pid_cmdline.5
	proc_pid_comm.5
	proc_pid_coredump_filter.5
	proc_pid_cpuset.5
	proc_pid_cwd.5
	proc_pid_environ.5
	proc_pid_exe.5
	proc_pid_fd.5
	proc_pid_fdinfo.5
	proc_pid_io.5
	proc_pid_limits.5
	proc_pid_map_files.5
	proc_pid_maps.5
	proc_pid_mem.5
	proc_pid_mountinfo.5
	proc_pid_mounts.5
	proc_pid_mountstats.5
	proc_pid_net.5
	proc_pid_ns.5
	proc_pid_numa_maps.5
	proc_pid_oom_score.5
	proc_pid_oom_score_adj.5
	proc_pid_pagemap.5
	proc_pid_personality.5
	proc_pid_projid_map.5
	proc_pid_root.5
	proc_pid_seccomp.5
	proc_pid_setgroups.5
	proc_pid_smaps.5
	proc_pid_stack.5
	proc_pid_stat.5
	proc_pid_statm.5
	proc_pid_status.5
	proc_pid_syscall.5
	proc_pid_task.5
	proc_pid_timers.5
	proc_pid_timerslack_ns.5
	proc_pid_uid_map.5
	proc_pid_wchan.5
	proc_profile.5
	proc_scsi.5
	proc_slabinfo.5
	proc_stat.5
	proc_swaps.5
	proc_sys.5
	proc_sys_abi.5
	proc_sys_debug.5
	proc_sys_dev.5
	proc_sys_fs.5
	proc_sys_kernel.5
	proc_sys_net.5
	proc_sys_proc.5
	proc_sys_sunrpc.5
	proc_sys_user.5
	proc_sys_vm.5
	proc_sysrq-trigger.5
	proc_sysvipc.5
	proc_tid_children.5
	proc_timer_list.5
	proc_timer_stats.5
	proc_tty.5
	proc_uptime.5
	proc_version.5
	proc_vmstat.5
	proc_zoneinfo.5


Newly documented interfaces in existing pages
---------------------------------------------

man2/
	access.2
		AT_EMPTY_PATH

	execve.2
		E2BIG

	ioctl_userfaultfd.2
		UFFDIO_API handshake
		UFFDIO_POISON
		UFFD_FEATURE_WP_ASYNC

	mbind.2
		MPOL_F_NUMA_BALANCING

	prctl.2
		PR_SET_MDWE
		PR_GET_MDWE

	set_thread_area.2
		C-SKY

	utimensat.2
		AT_EMPTY_PATH

man3/
	stdio.3
		fmemopen(3)
		fopencookie(3)
		open_memstream(3)
		open_wmemstream(3)

man4/
	smartpqi.4
		ctrl_ready_timeout
		enable_stream_detection
		ssd_smart_path_enabled
		enable_r5_writes
		enable_r6_writes
		lunid
		unique_id
		path_info
		raid_bypass_cnt
		sas_ncq_prio_enable

man5/
	proc_pid_status.5		(previously, proc.5)
		Seccomp_filters

	tmpfs.5
		size/blocks=0
		nr_inodes=0

man8/
	ld.so.8
		--list-diagnostics
		--glibc-hwcaps-mask
		--glibc-hwcaps-prepend


New and changed links
---------------------

man5/
	proc_mounts.5			(proc_pid_mounts(5))
	proc_net.5			(proc_pid_net(5))
	proc_pid_gid_map.5		(proc_pid_uid_map(5))
	proc_pid_oom_adj.5		(proc_pid_oom_score_adj(5))
	proc_self.5			(proc_pid(5))
	proc_thread-self.5		(proc_pid_task(5))
	proc_tid.5			(proc_pid_task(5))


Removed links
-------------

man3/
	stpecpy.3
	stpecpyx.3
	ustpcpy.3
	ustr2stp.3
	zustr2stp.3
	zustr2ustp.3


Global changes
--------------

-  Build system
   -  Update PDF book for groff-1.23.0.
   -  Add targets to [un]install intro(*) pages separately.
   -  Support manual pages in other projects, so that our build system
      can be used, for example, to lint them.
   -  Reject non-GNU make(1).
   -  Add target to build the PDF book.

-  man*/
   -  Add some consistency in the use of man(7).
   -  Split proc(5) into many small pages.
   -  Import pages from old linuxthreads (glibc), with their git
      history (from both glibc and Debian).
   -  Rewrite a large part of the documentation for string-copying
      functions.
   -  Say ISO/IEC instead of ISO where appropriate, and be consistent in
      the formatting of names of ISO or ISO/IEC standards.


Changes to individual pages
---------------------------

The manual pages (and other files in the repository) have been improved
beyond what this changelog covers.  To learn more about changes applied
to individual pages, use git(1).

-- 
<https://www.alejandro-colomar.es/>
Looking for a remote C programming job at the moment.



* Re: [PATCH v2] vfs: add RWF_NOAPPEND flag for pwritev2
  2024-01-18 15:57  0%     ` Rich Felker
@ 2024-01-18 16:02  0%       ` Jens Axboe
  0 siblings, 0 replies; 200+ results
From: Jens Axboe @ 2024-01-18 16:02 UTC (permalink / raw)
  To: Rich Felker
  Cc: Jann Horn, Alexander Viro, linux-fsdevel, kernel list, Linux API,
	Pavel Begunkov, Christian Brauner

On 1/18/24 8:57 AM, Rich Felker wrote:
> On Mon, Aug 31, 2020 at 11:05:34AM -0600, Jens Axboe wrote:
>> On 8/31/20 9:46 AM, Jann Horn wrote:
>>> On Mon, Aug 31, 2020 at 5:32 PM Rich Felker <dalias@libc.org> wrote:
>>>> The pwrite function, originally defined by POSIX (thus the "p"), is
>>>> defined to ignore O_APPEND and write at the offset passed as its
>>>> argument. However, historically Linux honored O_APPEND if set and
>>>> ignored the offset. This cannot be changed due to stability policy,
>>>> but is documented in the man page as a bug.
>>>>
>>>> Now that there's a pwritev2 syscall providing a superset of the pwrite
>>>> functionality that has a flags argument, the conforming behavior can
>>>> be offered to userspace via a new flag. Since pwritev2 checks flag
>>>> validity (in kiocb_set_rw_flags) and reports unknown ones with
>>>> EOPNOTSUPP, callers will not get wrong behavior on old kernels that
>>>> don't support the new flag; the error is reported and the caller can
>>>> decide how to handle it.
>>>>
>>>> Signed-off-by: Rich Felker <dalias@libc.org>
>>>
>>> Reviewed-by: Jann Horn <jannh@google.com>
>>>
>>> Note that if this lands, Michael Kerrisk will probably be happy if you
>>> send a corresponding patch for the manpage man2/readv.2.
>>>
>>> Btw, I'm not really sure whose tree this should go through - VFS is
>>> normally Al Viro's turf, but it looks like the most recent
>>> modifications to this function have gone through Jens Axboe's tree?
>>
>> Should probably go through Al's tree, I've only carried them when
>> they've been associated with io_uring in some shape or form.
> 
> This appears to have slipped through the cracks. Do I need to send an
> updated rebase of it? Were there any objections to it I missed?

Let's add Christian.

-- 
Jens Axboe
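
The fallback described in the quoted patch text (the kernel rejects an
unknown flag with EOPNOTSUPP and the caller decides what to do) could be
sketched in C roughly as follows.  This is only a sketch under stated
assumptions: the thread merely proposes RWF_NOAPPEND, so the numeric
value below is an assumption taken from later kernel UAPI headers, and
pwrite_noappend() and noappend_demo() are hypothetical helper names, not
part of any library.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE		/* for pwritev2() in <sys/uio.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value as found in later kernel headers; older libc
   headers do not define the flag at all.  */
#ifndef RWF_NOAPPEND
# define RWF_NOAPPEND  0x00000020
#endif

/*
 * Hypothetical helper: write len bytes at offset off, asking the kernel
 * to ignore O_APPEND.  If the flag is rejected with EOPNOTSUPP, fall
 * back to plain pwrite(), which on Linux appends when O_APPEND is set.
 * *flag_honored reports which path was taken.
 */
static ssize_t
pwrite_noappend(int fd, const void *buf, size_t len, off_t off,
                int *flag_honored)
{
	struct iovec  iov = { .iov_base = (void *) buf, .iov_len = len };
	ssize_t       n;

	assert(buf != NULL && flag_honored != NULL);

	*flag_honored = 1;
	n = pwritev2(fd, &iov, 1, off, RWF_NOAPPEND);
	if (n != -1 || errno != EOPNOTSUPP)
		return n;

	/* Old kernel: the flag is unknown; the caller decides what to do.
	   This sketch simply retries without the flag.  */
	*flag_honored = 0;
	return pwrite(fd, buf, len, off);
}

/*
 * Hypothetical demo: returns 1 if the flag was honored, 0 if the
 * fallback ran, and -1 on any other error.
 */
static int
noappend_demo(const char *path)
{
	int  honored = -1;
	int  fd;

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
	if (fd == -1)
		return -1;

	if (write(fd, "AAAA", 4) != 4
	    || pwrite_noappend(fd, "bb", 2, 0, &honored) != 2)
		honored = -1;

	close(fd);
	unlink(path);
	return honored;
}
```

A caller that must not append silently would treat the fallback case as
an error instead of retrying with pwrite().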




* Re: [PATCH v2] vfs: add RWF_NOAPPEND flag for pwritev2
  @ 2024-01-18 15:57  0%     ` Rich Felker
  2024-01-18 16:02  0%       ` Jens Axboe
  0 siblings, 1 reply; 200+ results
From: Rich Felker @ 2024-01-18 15:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jann Horn, Alexander Viro, linux-fsdevel, kernel list, Linux API,
	Pavel Begunkov

On Mon, Aug 31, 2020 at 11:05:34AM -0600, Jens Axboe wrote:
> On 8/31/20 9:46 AM, Jann Horn wrote:
> > On Mon, Aug 31, 2020 at 5:32 PM Rich Felker <dalias@libc.org> wrote:
> >> The pwrite function, originally defined by POSIX (thus the "p"), is
> >> defined to ignore O_APPEND and write at the offset passed as its
> >> argument. However, historically Linux honored O_APPEND if set and
> >> ignored the offset. This cannot be changed due to stability policy,
> >> but is documented in the man page as a bug.
> >>
> >> Now that there's a pwritev2 syscall providing a superset of the pwrite
> >> functionality that has a flags argument, the conforming behavior can
> >> be offered to userspace via a new flag. Since pwritev2 checks flag
> >> validity (in kiocb_set_rw_flags) and reports unknown ones with
> >> EOPNOTSUPP, callers will not get wrong behavior on old kernels that
> >> don't support the new flag; the error is reported and the caller can
> >> decide how to handle it.
> >>
> >> Signed-off-by: Rich Felker <dalias@libc.org>
> > 
> > Reviewed-by: Jann Horn <jannh@google.com>
> > 
> > Note that if this lands, Michael Kerrisk will probably be happy if you
> > send a corresponding patch for the manpage man2/readv.2.
> > 
> > Btw, I'm not really sure whose tree this should go through - VFS is
> > normally Al Viro's turf, but it looks like the most recent
> > modifications to this function have gone through Jens Axboe's tree?
> 
> Should probably go through Al's tree, I've only carried them when
> they've been associated with io_uring in some shape or form.

This appears to have slipped through the cracks. Do I need to send an
updated rebase of it? Were there any objections to it I missed?

Rich


* Re: set_thread_area.2: csky architecture undocumented
  2023-10-14 23:20  0%     ` Alejandro Colomar
@ 2023-10-15 15:09  0%       ` Guo Ren
  0 siblings, 0 replies; 200+ results
From: Guo Ren @ 2023-10-15 15:09 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: linux-man, linux-kernel, linux-csky, Arnd Bergmann

On Sun, Oct 15, 2023 at 01:20:42AM +0200, Alejandro Colomar wrote:
> Hi Guo,
> 
> On Tue, Nov 24, 2020 at 08:07:07PM +0800, Guo Ren wrote:
> 
> Huh, 3 years already!  I've had this in my head for all this time; just
> didn't find the energy to act on it.
> 
> > Thx Michael & Alejandro,
> > 
> > Yes, the man page has no csky's.
> 
> I've applied a patch to add initial documentation for it:
> <https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=a63979eb24aaf73f4be5799cc9579f84a3874b7d>
> 
> > 
> > C-SKY has abiv1 and abiv2.
> > For abiv1: There is no register for saving tls; we use trap 3 to get
> > tls and use set_thread_area to init ti->tp_value.
> > For abiv2: r31 is the tls register. We can read r31 directly to
> > get tls and use set_thread_area to init the reg->tls value.
> > 
> > In glibc:
> > # ifdef __CSKYABIV2__
> > /* Define r31 as thread pointer register.  */
> > #  define READ_THREAD_POINTER() \
> >         mov r0, r31;
> > # else
> > #  define READ_THREAD_POINTER() \
> >         trap 3;
> > # endif
> > 
> > /* Code to initially initialize the thread pointer.  This might need
> >    special attention since 'errno' is not yet available and if the
> >    operation can cause a failure 'errno' must not be touched.  */
> > # define TLS_INIT_TP(tcbp) \
> >   ({ INTERNAL_SYSCALL_DECL (err);                                       \
> >      long result_var;                                                   \
> >      result_var = INTERNAL_SYSCALL (set_thread_area, err, 1,            \
> >                     (char *) (tcbp) + TLS_TCB_OFFSET);                  \
> >      INTERNAL_SYSCALL_ERROR_P (result_var, err)                         \
> >        ? "unknown error" : NULL; })
> > 
> > In kernel:
> > SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> > {
> >         struct thread_info *ti = task_thread_info(current);
> >         struct pt_regs *reg = current_pt_regs();
> > 
> >         reg->tls = addr;
> >         ti->tp_value = addr;
> > 
> >         return 0;
> > }
> > 
> > Any comments are welcome :)
> 
> I'm sorry, but I have little understanding of this syscall, and that
> sounds like gibberish to me :)
> 
> Feel free to send a patch to improve the documentation for csky.
Yeah, I've sent a patch for it; please review:
https://lore.kernel.org/linux-csky/20231015150732.1991997-1-guoren@kernel.org/

> 
> Cheers,
> Alex
> 
> > 
> > 
> > On Tue, Nov 24, 2020 at 5:51 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > >
> > > Hi Alex,
> > >
> > > On 11/23/20 10:31 PM, Alejandro Colomar (man-pages) wrote:
> > > > Hi Michael,
> > > >
> > > > SYNOPSIS
> > > >        #include <linux/unistd.h>
> > > >
> > > >        #if defined __i386__ || defined __x86_64__
> > > >        # include <asm/ldt.h>
> > > >
> > > >        int get_thread_area(struct user_desc *u_info);
> > > >        int set_thread_area(struct user_desc *u_info);
> > > >
> > > >        #elif defined __m68k__
> > > >
> > > >        int get_thread_area(void);
> > > >        int set_thread_area(unsigned long tp);
> > > >
> > > >        #elif defined __mips__
> > > >
> > > >        int set_thread_area(unsigned long addr);
> > > >
> > > >        #endif
> > > >
> > > >        Note: There are no glibc wrappers for these system  calls;  see
> > > >        NOTES.
> > > >
> > > >
> > > > $ grep -rn 'SYSCALL_DEFINE.*et_thread_area'
> > > > arch/csky/kernel/syscall.c:6:
> > > > SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> > > > arch/mips/kernel/syscall.c:86:
> > > > SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> > > > arch/x86/kernel/tls.c:191:
> > > > SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
> > > > arch/x86/kernel/tls.c:243:
> > > > SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, u_info)
> > > > arch/x86/um/tls_32.c:277:
> > > > SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, user_desc)
> > > > arch/x86/um/tls_32.c:325:
> > > > SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, user_desc)
> > > >
> > > >
> > > > See kernel commit 4859bfca11c7d63d55175bcd85a75d6cee4b7184
> > > >
> > > >
> > > > I'd change
> > > > -      #elif defined __mips__
> > > > > +      #elif defined __mips__ || defined __csky__
> > > >
> > > > and then change the rest of the text to add csky when appropriate.
> > > > Am I correct?
> > >
> > > AFAICT, you are correct. I think the reason that csky is missing is
> > > > that the architecture was added after this manual page was added.
> > >
> > > Thanks,
> > >
> > > Michael
> > >
> > >
> > > --
> > > Michael Kerrisk
> > > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > > Linux/UNIX System Programming Training: http://man7.org/training/
> > 
> > 
> > 
> > --
> > Best Regards
> >  Guo Ren
> > 
> > ML: https://lore.kernel.org/linux-csky/
> 
> -- 
> <https://www.alejandro-colomar.es/>
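
The per-architecture prototypes discussed in this thread can be selected
with preprocessor conditionals; the `defined` operator takes a single
macro name, so each architecture macro gets its own `defined` test.  The
following is a minimal sketch, assuming the prototypes quoted in the
SYNOPSIS and the C-SKY kernel definition above; the enum and function
names are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Hypothetical classification of the set_thread_area() variants.  */
enum tp_abi {
	TP_ABI_USER_DESC,	/* i386, x86-64: struct user_desc *u_info */
	TP_ABI_M68K,		/* m68k: unsigned long tp */
	TP_ABI_ADDR,		/* MIPS, C-SKY: unsigned long addr */
	TP_ABI_OTHER		/* anything this sketch does not cover */
};

/* Pick the variant for the architecture being compiled for.  */
static enum tp_abi
thread_area_abi(void)
{
#if defined __i386__ || defined __x86_64__
	return TP_ABI_USER_DESC;
#elif defined __m68k__
	return TP_ABI_M68K;
#elif defined __mips__ || defined __csky__
	return TP_ABI_ADDR;
#else
	return TP_ABI_OTHER;
#endif
}

/* Return the prototype text for a variant, as quoted in the thread.  */
static const char *
thread_area_prototype(enum tp_abi abi)
{
	static const char *const proto[] = {
		[TP_ABI_USER_DESC] = "int set_thread_area(struct user_desc *u_info);",
		[TP_ABI_M68K]      = "int set_thread_area(unsigned long tp);",
		[TP_ABI_ADDR]      = "int set_thread_area(unsigned long addr);",
		[TP_ABI_OTHER]     = "(not covered by this sketch)",
	};

	assert((unsigned int) abi <= TP_ABI_OTHER);
	return proto[abi];
}
```

Calling thread_area_prototype(thread_area_abi()) yields the prototype
that applies to the build target.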




* Re: set_thread_area.2: csky architecture undocumented
  @ 2023-10-14 23:20  0%     ` Alejandro Colomar
  2023-10-15 15:09  0%       ` Guo Ren
  0 siblings, 1 reply; 200+ results
From: Alejandro Colomar @ 2023-10-14 23:20 UTC (permalink / raw)
  To: Guo Ren; +Cc: linux-man, linux-kernel, linux-csky, Arnd Bergmann


Hi Guo,

On Tue, Nov 24, 2020 at 08:07:07PM +0800, Guo Ren wrote:

Huh, 3 years already!  I've had this in my head for all this time; just
didn't find the energy to act on it.

> Thx Michael & Alejandro,
> 
> Yes, the man page has no csky's.

I've applied a patch to add initial documentation for it:
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=a63979eb24aaf73f4be5799cc9579f84a3874b7d>

> 
> C-SKY has abiv1 and abiv2.
> For abiv1: There is no register for saving tls; we use trap 3 to get
> tls and use set_thread_area to init ti->tp_value.
> For abiv2: r31 is the tls register. We can read r31 directly to
> get tls and use set_thread_area to init the reg->tls value.
> 
> In glibc:
> # ifdef __CSKYABIV2__
> /* Define r31 as thread pointer register.  */
> #  define READ_THREAD_POINTER() \
>         mov r0, r31;
> # else
> #  define READ_THREAD_POINTER() \
>         trap 3;
> # endif
> 
> /* Code to initially initialize the thread pointer.  This might need
>    special attention since 'errno' is not yet available and if the
>    operation can cause a failure 'errno' must not be touched.  */
> # define TLS_INIT_TP(tcbp) \
>   ({ INTERNAL_SYSCALL_DECL (err);                                       \
>      long result_var;                                                   \
>      result_var = INTERNAL_SYSCALL (set_thread_area, err, 1,            \
>                     (char *) (tcbp) + TLS_TCB_OFFSET);                  \
>      INTERNAL_SYSCALL_ERROR_P (result_var, err)                         \
>        ? "unknown error" : NULL; })
> 
> In kernel:
> SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> {
>         struct thread_info *ti = task_thread_info(current);
>         struct pt_regs *reg = current_pt_regs();
> 
>         reg->tls = addr;
>         ti->tp_value = addr;
> 
>         return 0;
> }
> 
> Any comments are welcome :)

I'm sorry, but I have little understanding of this syscall, and that
sounds like gibberish to me :)

Feel free to send a patch to improve the documentation for csky.

Cheers,
Alex

> 
> 
> On Tue, Nov 24, 2020 at 5:51 PM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
> >
> > Hi Alex,
> >
> > On 11/23/20 10:31 PM, Alejandro Colomar (man-pages) wrote:
> > > Hi Michael,
> > >
> > > SYNOPSIS
> > >        #include <linux/unistd.h>
> > >
> > >        #if defined __i386__ || defined __x86_64__
> > >        # include <asm/ldt.h>
> > >
> > >        int get_thread_area(struct user_desc *u_info);
> > >        int set_thread_area(struct user_desc *u_info);
> > >
> > >        #elif defined __m68k__
> > >
> > >        int get_thread_area(void);
> > >        int set_thread_area(unsigned long tp);
> > >
> > >        #elif defined __mips__
> > >
> > >        int set_thread_area(unsigned long addr);
> > >
> > >        #endif
> > >
> > >        Note: There are no glibc wrappers for these system  calls;  see
> > >        NOTES.
> > >
> > >
> > > $ grep -rn 'SYSCALL_DEFINE.*et_thread_area'
> > > arch/csky/kernel/syscall.c:6:
> > > SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> > > arch/mips/kernel/syscall.c:86:
> > > SYSCALL_DEFINE1(set_thread_area, unsigned long, addr)
> > > arch/x86/kernel/tls.c:191:
> > > SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
> > > arch/x86/kernel/tls.c:243:
> > > SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, u_info)
> > > arch/x86/um/tls_32.c:277:
> > > SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, user_desc)
> > > arch/x86/um/tls_32.c:325:
> > > SYSCALL_DEFINE1(get_thread_area, struct user_desc __user *, user_desc)
> > >
> > >
> > > See kernel commit 4859bfca11c7d63d55175bcd85a75d6cee4b7184
> > >
> > >
> > > I'd change
> > > -      #elif defined __mips__
> > > > +      #elif defined __mips__ || defined __csky__
> > >
> > > and then change the rest of the text to add csky when appropriate.
> > > Am I correct?
> >
> > AFAICT, you are correct. I think the reason that csky is missing is
> > > that the architecture was added after this manual page was added.
> >
> > Thanks,
> >
> > Michael
> >
> >
> > --
> > Michael Kerrisk
> > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> > Linux/UNIX System Programming Training: http://man7.org/training/
> 
> 
> 
> --
> Best Regards
>  Guo Ren
> 
> ML: https://lore.kernel.org/linux-csky/

-- 
<https://www.alejandro-colomar.es/>



* Re: man-pages-6.05.01 released
  @ 2023-08-04  3:40  5%   ` Luna Jernberg
  0 siblings, 0 replies; 200+ results
From: Luna Jernberg @ 2023-08-04  3:40 UTC (permalink / raw)
  To: Alejandro Colomar, andyrtr, Luna Jernberg
  Cc: linux-man, LKML, GNU C Library, Sam James, Jonathan Corbet,
	Michael Kerrisk, Marcos Fouces


Hello!

Here comes an updated PKGBUILD for Arch Linux; sorry it took a while,
I was watching Fedora Flock 2023 yesterday.

On Thu, 3 Aug 2023 at 00:32, Alejandro Colomar <alx@kernel.org> wrote:
>
> Gidday!
>
> On 2023-08-01 15:19, Alejandro Colomar wrote:
> > Gidday!
> >
> > I'm proud to announce:
> >
> >       man-pages-6.05 - manual pages for GNU/Linux
> >
> > The release tarball is already available at <kernel.org>
> >
> > Tarball download:
> >       <https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
> > Git repository:
> >       <https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>
>
> There was a small problem while packaging for Debian.  quilt(1)
> produces a .pc/ dir in the root of the repository, and the patches
> stored in there confuse the build system to try to lint those patches
> as if they were manual pages.  If you successfully packaged 6.05
> without noticing this issue, you can safely ignore this bugfix
> release.  If you noticed the issue, or haven't yet started, I suggest
> you package 6.05.01.
>
> Changes since man-pages-6.05:
>
> man-pages-6.05.01:
>
> -  Build system:
>    -  Ignore dot-dirs within $MANDIR
>
>
> Cheers,
> Alex
>
> --
> <http://www.alejandro-colomar.es/>
> GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5
>

[-- Attachment #2: PKGBUILD --]
[-- Type: application/octet-stream, Size: 1743 bytes --]

# Maintainer: Andreas Radke <andyrtr@archlinux.org>

pkgname=man-pages
pkgver=6.05.01
_posixver=2017-a
pkgrel=1
pkgdesc="Linux man pages"
arch=('any')
license=('GPL' 'custom')
url="https://www.kernel.org/doc/man-pages/"
makedepends=('man2html' 'git')
# https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/
source=(https://www.kernel.org/pub/linux/docs/man-pages/$pkgname-$pkgver.tar.{xz,sign}
        https://www.kernel.org/pub/linux/docs/man-pages/man-pages-posix/$pkgname-posix-${_posixver}.tar.{xz,sign})
# https://www.kernel.org/pub/linux/docs/man-pages/sha256sums.asc
sha256sums=('b96ab6b44a688c91d1b572e52fece519e1cfd2bb4c33fe7014fc3fd1ef3f9cae'
            'SKIP'
            'ce67bb25b5048b20dad772e405a83f4bc70faf051afa289361c81f9660318bc3'
            'SKIP')
validpgpkeys=('E522595B52EDA4E6BFCCCB5E856199113A35CE5E') # Michael Kerrisk (Linux man-pages maintainer) <mtk.manpages@gmail.com>
# + for posix tarball
validpgpkeys+=('A9348594CE31283A826FBDD8D57633D441E25BB5') # Alejandro Colomar Andres <alx.manpages@gmail.com>

prepare() {
  cd "${srcdir}"/$pkgname-$pkgver

  # included in shadow
  rm man5/passwd.5
  rm man3/getspnam.3
  # included in tzdata
  rm man5/tzfile.5 man8/{tzselect,zdump,zic}.8
  # included in libxcrypt
  rm man3/crypt*.3
}

package() {
  cd "${srcdir}"/$pkgname-$pkgver

  # install man-pages
  make DESTDIR="${pkgdir}" prefix=/usr install 

  # install posix pages
  pushd "${srcdir}"/$pkgname-posix-${_posixver%-*}
  make DESTDIR="${pkgdir}" install 
  popd
  
  # posix pages have a custom license
  install -m755 -d "${pkgdir}/usr/share/licenses/${pkgname}"
  install -m644 "${srcdir}"/$pkgname-posix-${_posixver%-*}/POSIX-COPYRIGHT "${pkgdir}/usr/share/licenses/${pkgname}/POSIX-COPYRIGHT"
}

^ permalink raw reply	[relevance 5%]

* Re: man-pages-6.05 released
  2023-08-01 13:19  3% man-pages-6.05 released Alejandro Colomar
@ 2023-08-02  4:19  5% ` Luna Jernberg
    1 sibling, 0 replies; 200+ results
From: Luna Jernberg @ 2023-08-02  4:19 UTC (permalink / raw)
  To: Alejandro Colomar, andyrtr, Luna Jernberg
  Cc: linux-man, LKML, GNU C Library, Sam James, Jonathan Corbet,
	Michael Kerrisk, Marcos Fouces

[-- Attachment #1: Type: text/plain, Size: 9139 bytes --]

Updated PKGBUILD for Arch Linux

On Tue, 1 Aug 2023 at 15:30, Alejandro Colomar <alx@kernel.org> wrote:
>
> Gidday!
>
> I'm proud to announce:
>
>         man-pages-6.05 - manual pages for GNU/Linux
>
> The release tarball is already available at <kernel.org>
>
> Tarball download:
>         <https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
> Git repository:
>         <https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>
>
> A change from man-pages-6.04 merits a mention in this release, as it
> wasn't properly documented in the previous release notes:
>
>    -  Add make(1) 'check' target.  This has been split from 'lint'.
>       'lint' will check the source code, and 'check' will check the
>       rendered pages (as a user will read them).  There are currently
>       several pages that fail this `make check`, and distributors that
>       depend on this can work around it by touching a few files:
>
>       $ make check -k -j >/dev/null 2>/dev/null;
>       $ make check -k 2>/dev/null;
>       GREP      .tmp/man/man1/memusage.1.check-catman.touch
>       TROFF     .tmp/man/man2/fanotify_init.2.cat.set
>       TROFF     .tmp/man/man2/gettimeofday.2.cat.set
>       TROFF     .tmp/man/man2/s390_sthyi.2.cat.set
>       GREP      .tmp/man/man3/mallopt.3.check-catman.touch
>       TROFF     .tmp/man/man3/unlocked_stdio.3.cat.set
>       TROFF     .tmp/man/man4/console_codes.4.cat.set
>       TROFF     .tmp/man/man4/lirc.4.cat.set
>       GREP      .tmp/man/man4/smartpqi.4.check-catman.touch
>       GREP      .tmp/man/man4/veth.4.check-catman.touch
>       TROFF     .tmp/man/man5/proc.5.cat.set
>       GREP      .tmp/man/man5/slabinfo.5.check-catman.touch
>       TROFF     .tmp/man/man5/tzfile.5.cat.set
>       TROFF     .tmp/man/man7/address_families.7.cat.set
>       TROFF     .tmp/man/man7/ascii.7.cat.set
>       TROFF     .tmp/man/man7/bpf-helpers.7.cat.set
>       GREP      .tmp/man/man7/keyrings.7.check-catman.touch
>       GREP      .tmp/man/man7/uri.7.check-catman.touch
>       TROFF     .tmp/man/man8/tzselect.8.cat.set
>       TROFF     .tmp/man/man8/zdump.8.cat.set
>       TROFF     .tmp/man/man8/zic.8.cat.set
>
>       After touching the previous files, `make check` will succeed:
>
>       $ make check -k 2>/dev/null | awk '{print $2}' | xargs touch;
>       $ make check -j >/dev/null;
>       $ echo $?
>       0
>
> The most notable changes in this release (man-pages-6.05) are:
>
> New and rewritten pages
> -----------------------
>
> man2/
>         ioctl_pipe.2
>
> man3/
>         regex.3
>
> man5/
>         erofs.5
>
> Newly documented interfaces in existing pages
> ---------------------------------------------
>
> bpf.2
>         EAGAIN
>
> ioctl_userfaultfd.2
>         UFFD_FEATURE_EXACT_ADDRESS
>
> prctl.2
>         PR_GET_AUXV
>
> recv.2
>         MSG_CMSG_CLOEXEC
>
> statx.2
>         STAT_ATTR_MOUNT_ROOT
>
> syscall.2
>         ENOSYS
>
> resolv.conf.5
>         no-aaaa
>         RES_NOAAAA
>
> tmpfs.5
>         CONFIG_TRANSPARENT_HUGEPAGE
>
> ip.7
>         IP_LOCAL_PORT_RANGE
>
> rtnetlink.7
>         IFLA_PERM_ADDRESS
>
> New and changed links
> ---------------------
>
> man3type/
>         regex_t.3type                           (regex(3))
>         regmatch_t.3type                        (regex(3))
>         regoff_t.3type                          (regex(3))
>
> Global changes
> --------------
>
> -  Types:
>    -  Document functions using off64_t as if they used off_t (except
>       for lseek64()).
>
> -  Build system:
>    -  Keep file modes in the release tarball.
>    -  Fix symlink installation (`make install LINK_PAGES=symlink`).
>    -  Add support for using bzip2(1), lzip(1), and xz(1) when installing
>       pages and creating release tarballs.
>    -  Create reproducible release tarballs.
>    -  Move makefiles from lib/ to share/mk/.
>    -  Support mdoc(7) pages.
>    -  Relicense Makefiles as GPL-3.0-or-later.
>    -  Build PostScript and PDF manual pages.
>    -  Add support for running our build system on arbitrary source
>       trees; this makes it possible to run our linters on another
>       project's manual pages as easily as `make lint MANDIR=~/src/groff`.
>
> -  Licenses:
>    -  Relicense ddp.7 from VERBATIM_ONE_PARA to Linux-man-pages-copyleft.
>    -  Relicense dir_colors.5 from LDPv1 to GPL-2.0-or-later.
>    -  Use new SPDX license identifiers:
>       -  Linux-man-pages-1-para                 (was VERBATIM_ONE_PARA)
>       -  Linux-man-pages-copyleft-2-para        (was VERBATIM_TWO_PARA)
>       -  Linux-man-pages-copyleft-var           (was VERBATIM_PROF)
>
> -  ffix:
>    -  use `\%`
>    -  un-bracket tbl(1) tables
>
>
> Contributors
> ------------
>
> The following people contributed patches/fixes, reports, notes,
> ideas, and discussions that have been incorporated in changes in
> this release:
>
> "David S. Miller" <davem@davemloft.net>
> "G. Branden Robinson" <g.branden.robinson@gmail.com>
> A. Wilcox <AWilcox@wilcox-tech.com>
> Adam Dobes <adobes@redhat.com>
> Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
> Alan Cox <alan@llwyncelyn.cymru>
> Alejandro Colomar <alx@kernel.org>
> Alexei Starovoitov <ast@kernel.org>
> Andreas Schwab <schwab@suse.de>
> Andrew Clayton <andrew@digital-domain.net>
> Andrew Morton <akpm@linux-foundation.org>
> Avinesh Kumar <akumar@suse.de>
> Bastien Roucariès <rouca@debian.org>
> Bjarni Ingi Gislason <bjarniig@simnet.is>
> Brian Inglis <Brian.Inglis@Shaw.ca>
> Bruno Haible <bruno@clisp.org>
> Carsten Grohmann <carstengrohmann@gmx.de>
> Colin Watson <cjwatson@debian.org>
> Cyril Hrubis <chrubis@suse.cz>
> DJ Delorie <dj@redhat.com>
> Daniel Verkamp <daniel@drv.nu>
> David Howells <dhowells@redhat.com>
> Dirk Gouders <dirk@gouders.net>
> Dmitry Goncharov <dgoncharov@users.sf.net>
> Eli Zaretskii <eliz@gnu.org>
> Elliott Hughes <enh@google.com>
> Eric Biggers <ebiggers@google.com>
> Eric Blake <eblake@redhat.com>
> Eric Wong <e@80x24.org>
> Fangrui Song <maskray@google.com>
> Florian Weimer <fweimer@redhat.com>
> Gavin Smith <gavinsmith0123@gmail.com>
> Guillem Jover <guillem@hadrons.org>
> Günther Noack <gnoack@google.com>
> Helge Kreutzmann <debian@helgefjell.de>
> Igor Sysoev <igor@sysoev.ru>
> Ingo Schwarze <schwarze@openbsd.org>
> Jakub Jelinek <jakub@redhat.com>
> Jakub Sitnicki <jakub@cloudflare.com>
> Jakub Wilk <jwilk@jwilk.net>
> Johannes Weiner <hannes@cmpxchg.org>
> John Gilmore <gnu@toad.com>
> John Hubbard <jhubbard@nvidia.com>
> John Scott <jscott@posteo.net>
> Jonathan Corbet <corbet@lwn.net>
> Jonathan Wakely <jwakely@redhat.com>
> Joseph Myers <joseph@codesourcery.com>
> Josh Triplett <josh@joshtriplett.org>
> Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Larry McVoy <lm@mcvoy.com>
> Lennart Jablonka <humm@ljabl.com>
> Linus Heckemann <git@sphalerite.org>
> Lukas Javorsky <ljavorsk@redhat.com>
> Marcos Fouces <marcos@debian.org>
> Mario Blaettermann <mario.blaettermann@gmail.com>
> Martin (Joey) Schulze <joey@infodrom.org>
> Masami Hiramatsu <mhiramat@kernel.org>
> Masatake YAMATO <yamato@redhat.com>
> Matthew House <mattlloydhouse@gmail.com>
> Matthew Wilcox (Oracle) <willy@infradead.org>
> Michael Kerrisk <mtk.manpages@gmail.com>
> Michael Weiß <michael.weiss@aisec.fraunhofer.de>
> Mickaël Salaün <mic@digikod.net>
> Mike Frysinger <vapier@gentoo.org>
> Mike Kravetz <mike.kravetz@oracle.com>
> Mingye Wang <arthur200126@gmail.com>
> Nadav Amit <namit@vmware.com>
> Nick Desaulniers <ndesaulniers@google.com>
> Oskari Pirhonen <xxc3ncoredxx@gmail.com>
> Paul E. McKenney <paulmck@kernel.org>
> Paul Eggert <eggert@cs.ucla.edu>
> Paul Floyd <pjfloyd@wanadoo.fr>
> Paul Smith <psmith@gnu.org>
> Philip Guenther <guenther@gmail.com>
> Ralph Corderoy <ralph@inputplus.co.uk>
> Reuben Thomas <rrt@sc3d.org>
> Rich Felker <dalias@libc.org>
> Richard Biener <richard.guenther@gmail.com>
> Sam James <sam@gentoo.org>
> Serge Hallyn <serge@hallyn.com>
> Seth David Schoen <schoen@loyalty.org>
> Siddhesh Poyarekar <siddhesh@gotplt.org>
> Simon Horman <simon.horman@corigine.com>
> Stefan Puiu <stefan.puiu@gmail.com>
> Steffen Nurpmeso <steffen@sdaoden.eu>
> Szabolcs Nagy <nsz@port70.net>
> Thomas Weißschuh <thomas@t-8ch.de>
> Tom Schwindl <schwindl@posteo.de>
> Tomáš Golembiovský <tgolembi@redhat.com>
> Torbjorn SVENSSON <torbjorn.svensson@foss.st.com>
> Ulrich Drepper <drepper@redhat.com>
> Vahid Noormofidi <vnoormof@nvidia.com>
> Vlastimil Babka <vbabka@suse.cz>
> Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Xi Ruoyao <xry111@xry111.site>
> Yang Xu <xuyang2018.jy@fujitsu.com>
> Yedidyah Bar David <didi@redhat.com>
> Zack Weinberg <zack@owlfolio.org>
> Zijun Zhao <zijunzhao@google.com>
>
> Apologies if I missed anyone!
>
>
> Thank you all for contributing!
>
> Cheers,
> Alex
>
> --
> <http://www.alejandro-colomar.es/>
> GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

[-- Attachment #2: PKGBUILD --]
[-- Type: application/octet-stream, Size: 1740 bytes --]

# Maintainer: Andreas Radke <andyrtr@archlinux.org>

pkgname=man-pages
pkgver=6.05
_posixver=2017-a
pkgrel=1
pkgdesc="Linux man pages"
arch=('any')
license=('GPL' 'custom')
url="https://www.kernel.org/doc/man-pages/"
makedepends=('man2html' 'git')
# https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/
source=(https://www.kernel.org/pub/linux/docs/man-pages/$pkgname-$pkgver.tar.{xz,sign}
        https://www.kernel.org/pub/linux/docs/man-pages/man-pages-posix/$pkgname-posix-${_posixver}.tar.{xz,sign})
# https://www.kernel.org/pub/linux/docs/man-pages/sha256sums.asc
sha256sums=('89b1445cfe2e3de8bd139758c78f08b37813cff217b9fb1c8df55fd9407875a6'
            'SKIP'
            'ce67bb25b5048b20dad772e405a83f4bc70faf051afa289361c81f9660318bc3'
            'SKIP')
validpgpkeys=('E522595B52EDA4E6BFCCCB5E856199113A35CE5E') # Michael Kerrisk (Linux man-pages maintainer) <mtk.manpages@gmail.com>
# + for posix tarball
validpgpkeys+=('A9348594CE31283A826FBDD8D57633D441E25BB5') # Alejandro Colomar Andres <alx.manpages@gmail.com>

prepare() {
  cd "${srcdir}"/$pkgname-$pkgver

  # included in shadow
  rm man5/passwd.5
  rm man3/getspnam.3
  # included in tzdata
  rm man5/tzfile.5 man8/{tzselect,zdump,zic}.8
  # included in libxcrypt
  rm man3/crypt*.3
}

package() {
  cd "${srcdir}"/$pkgname-$pkgver

  # install man-pages
  make DESTDIR="${pkgdir}" prefix=/usr install 

  # install posix pages
  pushd "${srcdir}"/$pkgname-posix-${_posixver%-*}
  make DESTDIR="${pkgdir}" install 
  popd
  
  # posix pages have a custom license
  install -m755 -d "${pkgdir}/usr/share/licenses/${pkgname}"
  install -m644 "${srcdir}"/$pkgname-posix-${_posixver%-*}/POSIX-COPYRIGHT "${pkgdir}/usr/share/licenses/${pkgname}/POSIX-COPYRIGHT"
}

^ permalink raw reply	[relevance 5%]

* man-pages-6.05 released
@ 2023-08-01 13:19  3% Alejandro Colomar
  2023-08-02  4:19  5% ` Luna Jernberg
    0 siblings, 2 replies; 200+ results
From: Alejandro Colomar @ 2023-08-01 13:19 UTC (permalink / raw)
  To: linux-man
  Cc: LKML, GNU C Library, Sam James, Jonathan Corbet, Michael Kerrisk,
	Marcos Fouces


[-- Attachment #1.1: Type: text/plain, Size: 8528 bytes --]

Gidday!

I'm proud to announce:

	man-pages-6.05 - manual pages for GNU/Linux

The release tarball is already available at <kernel.org>

Tarball download:
	<https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
Git repository:
	<https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>

A change from man-pages-6.04 merits a mention in this release, as it
wasn't properly documented in the previous release notes:

   -  Add make(1) 'check' target.  This has been split from 'lint'.
      'lint' will check the source code, and 'check' will check the
      rendered pages (as a user will read them).  There are currently
      several pages that fail this `make check`, and distributors that
      depend on this can work around it by touching a few files:

      $ make check -k -j >/dev/null 2>/dev/null;
      $ make check -k 2>/dev/null;
      GREP      .tmp/man/man1/memusage.1.check-catman.touch
      TROFF     .tmp/man/man2/fanotify_init.2.cat.set
      TROFF     .tmp/man/man2/gettimeofday.2.cat.set
      TROFF     .tmp/man/man2/s390_sthyi.2.cat.set
      GREP      .tmp/man/man3/mallopt.3.check-catman.touch
      TROFF     .tmp/man/man3/unlocked_stdio.3.cat.set
      TROFF     .tmp/man/man4/console_codes.4.cat.set
      TROFF     .tmp/man/man4/lirc.4.cat.set
      GREP      .tmp/man/man4/smartpqi.4.check-catman.touch
      GREP      .tmp/man/man4/veth.4.check-catman.touch
      TROFF     .tmp/man/man5/proc.5.cat.set
      GREP      .tmp/man/man5/slabinfo.5.check-catman.touch
      TROFF     .tmp/man/man5/tzfile.5.cat.set
      TROFF     .tmp/man/man7/address_families.7.cat.set
      TROFF     .tmp/man/man7/ascii.7.cat.set
      TROFF     .tmp/man/man7/bpf-helpers.7.cat.set
      GREP      .tmp/man/man7/keyrings.7.check-catman.touch
      GREP      .tmp/man/man7/uri.7.check-catman.touch
      TROFF     .tmp/man/man8/tzselect.8.cat.set
      TROFF     .tmp/man/man8/zdump.8.cat.set
      TROFF     .tmp/man/man8/zic.8.cat.set

      After touching the previous files, `make check` will succeed:

      $ make check -k 2>/dev/null | awk '{print $2}' | xargs touch;
      $ make check -j >/dev/null;
      $ echo $?
      0

The most notable changes in this release (man-pages-6.05) are:

New and rewritten pages
-----------------------

man2/
        ioctl_pipe.2

man3/
        regex.3

man5/
        erofs.5

Newly documented interfaces in existing pages
---------------------------------------------

bpf.2
        EAGAIN

ioctl_userfaultfd.2
        UFFD_FEATURE_EXACT_ADDRESS

prctl.2
        PR_GET_AUXV

recv.2
        MSG_CMSG_CLOEXEC

statx.2
        STAT_ATTR_MOUNT_ROOT

syscall.2
        ENOSYS

resolv.conf.5
        no-aaaa
        RES_NOAAAA

tmpfs.5
        CONFIG_TRANSPARENT_HUGEPAGE

ip.7
        IP_LOCAL_PORT_RANGE

rtnetlink.7
        IFLA_PERM_ADDRESS

New and changed links
---------------------

man3type/
        regex_t.3type                           (regex(3))
        regmatch_t.3type                        (regex(3))
        regoff_t.3type                          (regex(3))

Global changes
--------------

-  Types:
   -  Document functions using off64_t as if they used off_t (except
      for lseek64()).

-  Build system:
   -  Keep file modes in the release tarball.
   -  Fix symlink installation (`make install LINK_PAGES=symlink`).
   -  Add support for using bzip2(1), lzip(1), and xz(1) when installing
      pages and creating release tarballs.
   -  Create reproducible release tarballs.
   -  Move makefiles from lib/ to share/mk/.
   -  Support mdoc(7) pages.
   -  Relicense Makefiles as GPL-3.0-or-later.
   -  Build PostScript and PDF manual pages.
   -  Add support for running our build system on arbitrary source
      trees; this makes it possible to run our linters on another
      project's manual pages as easily as `make lint MANDIR=~/src/groff`.

-  Licenses:
   -  Relicense ddp.7 from VERBATIM_ONE_PARA to Linux-man-pages-copyleft.
   -  Relicense dir_colors.5 from LDPv1 to GPL-2.0-or-later.
   -  Use new SPDX license identifiers:
      -  Linux-man-pages-1-para                 (was VERBATIM_ONE_PARA)
      -  Linux-man-pages-copyleft-2-para        (was VERBATIM_TWO_PARA)
      -  Linux-man-pages-copyleft-var           (was VERBATIM_PROF)

-  ffix:
   -  use `\%`
   -  un-bracket tbl(1) tables


Contributors
------------

The following people contributed patches/fixes, reports, notes,
ideas, and discussions that have been incorporated in changes in
this release:

"David S. Miller" <davem@davemloft.net>
"G. Branden Robinson" <g.branden.robinson@gmail.com>
A. Wilcox <AWilcox@wilcox-tech.com>
Adam Dobes <adobes@redhat.com>
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Alan Cox <alan@llwyncelyn.cymru>
Alejandro Colomar <alx@kernel.org>
Alexei Starovoitov <ast@kernel.org>
Andreas Schwab <schwab@suse.de>
Andrew Clayton <andrew@digital-domain.net>
Andrew Morton <akpm@linux-foundation.org>
Avinesh Kumar <akumar@suse.de>
Bastien Roucariès <rouca@debian.org>
Bjarni Ingi Gislason <bjarniig@simnet.is>
Brian Inglis <Brian.Inglis@Shaw.ca>
Bruno Haible <bruno@clisp.org>
Carsten Grohmann <carstengrohmann@gmx.de>
Colin Watson <cjwatson@debian.org>
Cyril Hrubis <chrubis@suse.cz>
DJ Delorie <dj@redhat.com>
Daniel Verkamp <daniel@drv.nu>
David Howells <dhowells@redhat.com>
Dirk Gouders <dirk@gouders.net>
Dmitry Goncharov <dgoncharov@users.sf.net>
Eli Zaretskii <eliz@gnu.org>
Elliott Hughes <enh@google.com>
Eric Biggers <ebiggers@google.com>
Eric Blake <eblake@redhat.com>
Eric Wong <e@80x24.org>
Fangrui Song <maskray@google.com>
Florian Weimer <fweimer@redhat.com>
Gavin Smith <gavinsmith0123@gmail.com>
Guillem Jover <guillem@hadrons.org>
Günther Noack <gnoack@google.com>
Helge Kreutzmann <debian@helgefjell.de>
Igor Sysoev <igor@sysoev.ru>
Ingo Schwarze <schwarze@openbsd.org>
Jakub Jelinek <jakub@redhat.com>
Jakub Sitnicki <jakub@cloudflare.com>
Jakub Wilk <jwilk@jwilk.net>
Johannes Weiner <hannes@cmpxchg.org>
John Gilmore <gnu@toad.com>
John Hubbard <jhubbard@nvidia.com>
John Scott <jscott@posteo.net>
Jonathan Corbet <corbet@lwn.net>
Jonathan Wakely <jwakely@redhat.com>
Joseph Myers <joseph@codesourcery.com>
Josh Triplett <josh@joshtriplett.org>
Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Larry McVoy <lm@mcvoy.com>
Lennart Jablonka <humm@ljabl.com>
Linus Heckemann <git@sphalerite.org>
Lukas Javorsky <ljavorsk@redhat.com>
Marcos Fouces <marcos@debian.org>
Mario Blaettermann <mario.blaettermann@gmail.com>
Martin (Joey) Schulze <joey@infodrom.org>
Masami Hiramatsu <mhiramat@kernel.org>
Masatake YAMATO <yamato@redhat.com>
Matthew House <mattlloydhouse@gmail.com>
Matthew Wilcox (Oracle) <willy@infradead.org>
Michael Kerrisk <mtk.manpages@gmail.com>
Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Mickaël Salaün <mic@digikod.net>
Mike Frysinger <vapier@gentoo.org>
Mike Kravetz <mike.kravetz@oracle.com>
Mingye Wang <arthur200126@gmail.com>
Nadav Amit <namit@vmware.com>
Nick Desaulniers <ndesaulniers@google.com>
Oskari Pirhonen <xxc3ncoredxx@gmail.com>
Paul E. McKenney <paulmck@kernel.org>
Paul Eggert <eggert@cs.ucla.edu>
Paul Floyd <pjfloyd@wanadoo.fr>
Paul Smith <psmith@gnu.org>
Philip Guenther <guenther@gmail.com>
Ralph Corderoy <ralph@inputplus.co.uk>
Reuben Thomas <rrt@sc3d.org>
Rich Felker <dalias@libc.org>
Richard Biener <richard.guenther@gmail.com>
Sam James <sam@gentoo.org>
Serge Hallyn <serge@hallyn.com>
Seth David Schoen <schoen@loyalty.org>
Siddhesh Poyarekar <siddhesh@gotplt.org>
Simon Horman <simon.horman@corigine.com>
Stefan Puiu <stefan.puiu@gmail.com>
Steffen Nurpmeso <steffen@sdaoden.eu>
Szabolcs Nagy <nsz@port70.net>
Thomas Weißschuh <thomas@t-8ch.de>
Tom Schwindl <schwindl@posteo.de>
Tomáš Golembiovský <tgolembi@redhat.com>
Torbjorn SVENSSON <torbjorn.svensson@foss.st.com>
Ulrich Drepper <drepper@redhat.com>
Vahid Noormofidi <vnoormof@nvidia.com>
Vlastimil Babka <vbabka@suse.cz>
Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Xi Ruoyao <xry111@xry111.site>
Yang Xu <xuyang2018.jy@fujitsu.com>
Yedidyah Bar David <didi@redhat.com>
Zack Weinberg <zack@owlfolio.org>
Zijun Zhao <zijunzhao@google.com>

Apologies if I missed anyone!


Thank you all for contributing!

Cheers,
Alex

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 3%]

* Re: [PATCH] proc.5: Clarify that boot arguments can be embedded in image
  2023-07-05 20:33  0%   ` Paul E. McKenney
@ 2023-07-08 17:19  0%     ` Alejandro Colomar
  0 siblings, 0 replies; 200+ results
From: Alejandro Colomar @ 2023-07-08 17:19 UTC (permalink / raw)
  To: paulmck, Masami Hiramatsu
  Cc: mtk.manpages, corbet, akpm, ndesaulniers, vbabka, hannes,
	linux-doc, linux-kernel, linux-man


[-- Attachment #1.1: Type: text/plain, Size: 1844 bytes --]

Hi Paul!

On 7/5/23 22:33, Paul E. McKenney wrote:
> On Tue, Jul 04, 2023 at 09:59:32PM +0900, Masami Hiramatsu wrote:
>> On Fri, 30 Jun 2023 16:33:28 -0700
>> "Paul E. McKenney" <paulmck@kernel.org> wrote:
>>
>>> With the advent of the CONFIG_BOOT_CONFIG Kconfig option, kernel boot
>>> arguments can now be embedded in the kernel image, either attached
>>> to the end of initramfs or embedded in the kernel itself.  Document
>>> this possibility in the /proc/cmdline entry of proc.5.
>>
>> Thanks for the update!
>>
>> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
> 

Thanks for the review!  Tag added.

> Thank you, Masami!
> 
> Adding Alejandro and linux-man on CC.
> 
> 							Thanx, Paul
> 
>>> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>>> Cc: Masami Hiramatsu <mhiramat@kernel.org>
>>> Cc: Jonathan Corbet <corbet@lwn.net>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Nick Desaulniers <ndesaulniers@google.com>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>>

Thanks!  Patch applied.

Cheers,
Alex

>>> diff --git a/man5/proc.5 b/man5/proc.5
>>> index c6684620e..141a2983c 100644
>>> --- a/man5/proc.5
>>> +++ b/man5/proc.5
>>> @@ -3100,6 +3100,9 @@ Often done via a boot manager such as
>>>   .BR lilo (8)
>>>   or
>>>   .BR grub (8).
>>> +Any arguments embedded in the kernel image or initramfs via
>>> +.B CONFIG_BOOT_CONFIG
>>> +will also be displayed.
>>>   .TP
>>>   .IR /proc/config.gz " (since Linux 2.6)"
>>>   This file exposes the configuration options that were used
>>
>>
>> -- 
>> Masami Hiramatsu (Google) <mhiramat@kernel.org>

-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] proc.5: Clarify that boot arguments can be embedded in image
  2023-07-04 12:59  0% ` Masami Hiramatsu
@ 2023-07-05 20:33  0%   ` Paul E. McKenney
  2023-07-08 17:19  0%     ` Alejandro Colomar
  0 siblings, 1 reply; 200+ results
From: Paul E. McKenney @ 2023-07-05 20:33 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: mtk.manpages, corbet, akpm, ndesaulniers, vbabka, hannes,
	linux-doc, linux-kernel, alx, linux-man

On Tue, Jul 04, 2023 at 09:59:32PM +0900, Masami Hiramatsu wrote:
> On Fri, 30 Jun 2023 16:33:28 -0700
> "Paul E. McKenney" <paulmck@kernel.org> wrote:
> 
> > With the advent of the CONFIG_BOOT_CONFIG Kconfig option, kernel boot
> > arguments can now be embedded in the kernel image, either attached
> > to the end of initramfs or embedded in the kernel itself.  Document
> > this possibility in the /proc/cmdline entry of proc.5.
> 
> Thanks for the update!
> 
> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

Thank you, Masami!

Adding Alejandro and linux-man on CC.

							Thanx, Paul

> > Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> > Cc: Masami Hiramatsu <mhiramat@kernel.org>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Nick Desaulniers <ndesaulniers@google.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > 
> > diff --git a/man5/proc.5 b/man5/proc.5
> > index c6684620e..141a2983c 100644
> > --- a/man5/proc.5
> > +++ b/man5/proc.5
> > @@ -3100,6 +3100,9 @@ Often done via a boot manager such as
> >  .BR lilo (8)
> >  or
> >  .BR grub (8).
> > +Any arguments embedded in the kernel image or initramfs via 
> > +.B CONFIG_BOOT_CONFIG
> > +will also be displayed.
> >  .TP
> >  .IR /proc/config.gz " (since Linux 2.6)"
> >  This file exposes the configuration options that were used
> 
> 
> -- 
> Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] proc.5: Clarify that boot arguments can be embedded in image
  2023-06-30 23:33  5% [PATCH] proc.5: Clarify that boot arguments can be embedded in image Paul E. McKenney
@ 2023-07-04 12:59  0% ` Masami Hiramatsu
  2023-07-05 20:33  0%   ` Paul E. McKenney
  0 siblings, 1 reply; 200+ results
From: Masami Hiramatsu @ 2023-07-04 12:59 UTC (permalink / raw)
  To: paulmck
  Cc: mtk.manpages, mhiramat, corbet, akpm, ndesaulniers, vbabka,
	hannes, linux-doc, linux-kernel

On Fri, 30 Jun 2023 16:33:28 -0700
"Paul E. McKenney" <paulmck@kernel.org> wrote:

> With the advent of the CONFIG_BOOT_CONFIG Kconfig option, kernel boot
> arguments can now be embedded in the kernel image, either attached
> to the end of initramfs or embedded in the kernel itself.  Document
> this possibility in the /proc/cmdline entry of proc.5.

Thanks for the update!

Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>

> 
> Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nick Desaulniers <ndesaulniers@google.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> 
> diff --git a/man5/proc.5 b/man5/proc.5
> index c6684620e..141a2983c 100644
> --- a/man5/proc.5
> +++ b/man5/proc.5
> @@ -3100,6 +3100,9 @@ Often done via a boot manager such as
>  .BR lilo (8)
>  or
>  .BR grub (8).
> +Any arguments embedded in the kernel image or initramfs via 
> +.B CONFIG_BOOT_CONFIG
> +will also be displayed.
>  .TP
>  .IR /proc/config.gz " (since Linux 2.6)"
>  This file exposes the configuration options that were used


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

^ permalink raw reply	[relevance 0%]

* [PATCH] proc.5: Clarify that boot arguments can be embedded in image
@ 2023-06-30 23:33  5% Paul E. McKenney
  2023-07-04 12:59  0% ` Masami Hiramatsu
  0 siblings, 1 reply; 200+ results
From: Paul E. McKenney @ 2023-06-30 23:33 UTC (permalink / raw)
  To: mtk.manpages
  Cc: mhiramat, corbet, akpm, ndesaulniers, vbabka, hannes, linux-doc,
	linux-kernel

With the advent of the CONFIG_BOOT_CONFIG Kconfig option, kernel boot
arguments can now be embedded in the kernel image, either attached
to the end of initramfs or embedded in the kernel itself.  Document
this possibility in the /proc/cmdline entry of proc.5.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>

diff --git a/man5/proc.5 b/man5/proc.5
index c6684620e..141a2983c 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -3100,6 +3100,9 @@ Often done via a boot manager such as
 .BR lilo (8)
 or
 .BR grub (8).
+Any arguments embedded in the kernel image or initramfs via 
+.B CONFIG_BOOT_CONFIG
+will also be displayed.
 .TP
 .IR /proc/config.gz " (since Linux 2.6)"
 This file exposes the configuration options that were used

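The effect this patch documents can be observed from user space by
reading /proc/cmdline.  What follows is only a minimal sketch, not part
of the patch: read_boot_args() is a hypothetical helper whose name and
signature are made up for illustration, and the path is a parameter so
that the function can also be pointed at an ordinary test file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): read the kernel command line
 * from 'path' (normally "/proc/cmdline") into 'buf' as a NUL-terminated
 * string, dropping any trailing newlines.
 * Returns the resulting string length, or -1 on error.
 */
long read_boot_args(const char *path, char *buf, size_t len)
{
	FILE *fp;
	size_t n;

	if (len == 0)
		return -1;

	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;

	/* Leave room for the terminating NUL; longer input is truncated. */
	n = fread(buf, 1, len - 1, fp);
	if (ferror(fp)) {
		fclose(fp);
		return -1;
	}
	fclose(fp);

	buf[n] = '\0';
	while (n > 0 && buf[n - 1] == '\n')
		buf[--n] = '\0';

	return (long)n;
}
```

Calling read_boot_args("/proc/cmdline", buf, sizeof(buf)) on a kernel
built with CONFIG_BOOT_CONFIG should, going by the wording added above,
return the embedded arguments along with those passed by the boot
manager.  The sketch does not try to tell the two sources apart.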
^ permalink raw reply related	[relevance 5%]

* Re: [patch 12/20] posix-timers: Document sys_clock_getoverrun()
  2023-04-25 18:49  5% ` [patch 12/20] posix-timers: Document sys_clock_getoverrun() Thomas Gleixner
@ 2023-06-01 11:06  0%   ` Frederic Weisbecker
  0 siblings, 0 replies; 200+ results
From: Frederic Weisbecker @ 2023-06-01 11:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Anna-Maria Behnsen, Peter Zijlstra, Michael Kerrisk,
	Sebastian Siewior, syzbot+5c54bd3eb218bb595aa9, Dmitry Vyukov

On Tue, Apr 25, 2023 at 08:49:14PM +0200, Thomas Gleixner wrote:
> Document the syscall in detail and with coherent sentences.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[relevance 0%]

* [patch 12/20] posix-timers: Document sys_clock_getoverrun()
    2023-04-25 18:49  4% ` [patch 10/20] posix-timers: Document sys_clock_getres() correctly Thomas Gleixner
@ 2023-04-25 18:49  5% ` Thomas Gleixner
  2023-06-01 11:06  0%   ` Frederic Weisbecker
  1 sibling, 1 reply; 200+ results
From: Thomas Gleixner @ 2023-04-25 18:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Anna-Maria Behnsen, Peter Zijlstra,
	Michael Kerrisk, Sebastian Siewior, syzbot+5c54bd3eb218bb595aa9,
	Dmitry Vyukov

Document the syscall in detail and with coherent sentences.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
 kernel/time/posix-timers.c |   25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -782,14 +782,23 @@ SYSCALL_DEFINE2(timer_gettime32, timer_t
 
 #endif
 
-/*
- * Get the number of overruns of a POSIX.1b interval timer.  This is to
- * be the overrun of the timer last delivered.  At the same time we are
- * accumulating overruns on the next timer.  The overrun is frozen when
- * the signal is delivered, either at the notify time (if the info block
- * is not queued) or at the actual delivery time (as we are informed by
- * the call back to posixtimer_rearm().  So all we need to do is
- * to pick up the frozen overrun.
+/**
+ * sys_timer_getoverrun - Get the number of overruns of a POSIX.1b interval timer
+ * @timer_id:	The timer ID which identifies the timer
+ *
+ * The "overrun count" of a timer is one plus the number of expiration
+ * intervals which have elapsed between the first expiry, which queues the
+ * signal and the actual signal delivery. On signal delivery the "overrun
+ * count" is calculated and cached, so it can be returned directly here.
+ *
+ * As this is relative to the last queued signal the returned overrun count
+ * is meaningless outside of the signal delivery path and even there it
+ * does not accurately reflect the current state when user space evaluates
+ * it.
+ *
+ * Returns:
+ *	-EINVAL		@timer_id is invalid
+ *	1..INT_MAX	The number of overruns related to the last delivered signal
  */
 SYSCALL_DEFINE1(timer_getoverrun, timer_t, timer_id)
 {
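
A user-space sketch of the behaviour the new comment describes: the overrun
count refers to the last delivered signal, so it is read right after that
signal has been accepted. The example blocks a real-time signal, lets a 1 ms
interval timer expire for roughly 20 ms, accepts the signal with
sigwaitinfo() and then calls timer_getoverrun(). The helper name
demo_overrun(), the use of SIGRTMIN and the timing values are assumptions
made for illustration; they are not part of the patch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <time.h>

/*
 * Arm a 1 ms periodic timer while its signal is blocked, wait ~20 ms,
 * accept the signal and return the overrun count for that delivery.
 * Returns -1 if any setup step fails.
 */
int demo_overrun(void)
{
	struct timespec pause_for = { .tv_sec = 0, .tv_nsec = 20 * 1000 * 1000 };
	struct itimerspec its;
	struct sigevent sev;
	siginfo_t info;
	sigset_t set;
	timer_t timer;
	int overrun;

	/* Keep the signal pending instead of running a handler. */
	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);
	if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
		return -1;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_SIGNAL;
	sev.sigev_signo = SIGRTMIN;
	if (timer_create(CLOCK_MONOTONIC, &sev, &timer) != 0)
		return -1;

	/* First expiry after 1 ms, then every 1 ms. */
	memset(&its, 0, sizeof(its));
	its.it_value.tv_nsec = 1000 * 1000;
	its.it_interval = its.it_value;
	if (timer_settime(timer, 0, &its, NULL) != 0) {
		timer_delete(timer);
		return -1;
	}

	/* Let several intervals elapse while the signal stays queued. */
	nanosleep(&pause_for, NULL);

	/* Accepting the signal is the delivery point the count refers to. */
	if (sigwaitinfo(&set, &info) < 0) {
		timer_delete(timer);
		return -1;
	}

	overrun = timer_getoverrun(timer);
	timer_delete(timer);
	return overrun;
}
```

Calling demo_overrun() from main() and printing the result should show a
count in the region of the number of intervals that elapsed during the
pause. The exact value depends on timer resolution and scheduling. With
glibc older than 2.34 the program has to be linked with -lrt.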


^ permalink raw reply	[relevance 5%]

* [patch 10/20] posix-timers: Document sys_clock_getres() correctly
  @ 2023-04-25 18:49  4% ` Thomas Gleixner
  2023-04-25 18:49  5% ` [patch 12/20] posix-timers: Document sys_clock_getoverrun() Thomas Gleixner
  1 sibling, 0 replies; 200+ results
From: Thomas Gleixner @ 2023-04-25 18:49 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Anna-Maria Behnsen, Peter Zijlstra,
	Michael Kerrisk, Sebastian Siewior, syzbot+5c54bd3eb218bb595aa9,
	Dmitry Vyukov

The decades old comment about Posix clock resolution is confusing at best.

Remove it and add a proper explanation to sys_clock_getres().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
 kernel/time/posix-timers.c |   81 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 73 insertions(+), 8 deletions(-)

--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -67,14 +67,6 @@ static const struct k_clock clock_realti
  *	    to implement others.  This structure defines the various
  *	    clocks.
  *
- * RESOLUTION: Clock resolution is used to round up timer and interval
- *	    times, NOT to report clock times, which are reported with as
- *	    much resolution as the system can muster.  In some cases this
- *	    resolution may depend on the underlying clock hardware and
- *	    may not be quantifiable until run time, and only then is the
- *	    necessary code is written.	The standard says we should say
- *	    something about this issue in the documentation...
- *
  * FUNCTIONS: The CLOCKs structure defines possible functions to
  *	    handle various clock functions.
  *
@@ -1204,6 +1196,79 @@ SYSCALL_DEFINE2(clock_adjtime, const clo
 	return err;
 }
 
+/**
+ * sys_clock_getres - Get the resolution of a clock
+ * @which_clock:	The clock to get the resolution for
+ * @tp:			Pointer to a user space timespec64 for storage
+ *
+ * POSIX defines:
+ *
+ * "The clock_getres() function shall return the resolution of any
+ * clock. Clock resolutions are implementation-defined and cannot be set by
+ * a process. If the argument res is not NULL, the resolution of the
+ * specified clock shall be stored in the location pointed to by res. If
+ * res is NULL, the clock resolution is not returned. If the time argument
+ * of clock_settime() is not a multiple of res, then the value is truncated
+ * to a multiple of res."
+ *
+ * Due to the various hardware constraints the real resolution can vary
+ * wildly and even change during runtime when the underlying devices are
+ * replaced. The kernel also can use hardware devices with different
+ * resolutions for reading the time and for arming timers.
+ *
+ * The kernel therefore deviates from the POSIX spec in various aspects:
+ *
+ * 1) The resolution returned to user space
+ *
+ *    For CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_BOOTTIME, CLOCK_TAI,
+ *    CLOCK_REALTIME_ALARM, CLOCK_BOOTTIME_ALARM and CLOCK_MONOTONIC_RAW
+ *    the kernel differentiates only two cases:
+ *
+ *    I)  Low resolution mode:
+ *
+ *	  When high resolution timers are disabled at compile or runtime
+ *	  the resolution returned is nanoseconds per tick, which represents
+ *	  the precision at which timers expire.
+ *
+ *    II) High resolution mode:
+ *
+ *	  When high resolution timers are enabled the resolution returned
+ *	  is always one nanosecond independent of the actual resolution of
+ *	  the underlying hardware devices.
+ *
+ *	  For CLOCK_*_ALARM the actual resolution depends on system
+ *	  state. When the system is running the resolution is the same as the
+ *	  resolution of the other clocks. During suspend the actual
+ *	  resolution is the resolution of the underlying RTC device which
+ *	  might be way less precise than the clockevent device used during
+ *	  running state.
+ *
+ *   For CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE the resolution
+ *   returned is always nanoseconds per tick.
+ *
+ *   For CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID the resolution
+ *   returned is always one nanosecond under the assumption that the
+ *   underlying scheduler clock has a better resolution than nanoseconds
+ *   per tick.
+ *
+ *   For dynamic POSIX clocks (PTP devices) the resolution returned is
+ *   always one nanosecond.
+ *
+ * 2) Effect on sys_clock_settime()
+ *
+ *    The kernel does not truncate the time which is handed in to
+ *    sys_clock_settime(). The kernel internal timekeeping is always using
+ *    nanoseconds precision independent of the clocksource device which is
+ *    used to read the time from. The resolution of that device only
+ *    affects the precision of the time returned by sys_clock_gettime().
+ *
+ * Returns:
+ *	0		Success. @tp contains the resolution
+ *	-EINVAL		@which_clock is not a valid clock ID
+ *	-EFAULT		Copying the resolution to @tp faulted
+ *	-ENODEV		Dynamic POSIX clock is not backed by a device
+ *	-EOPNOTSUPP	Dynamic POSIX clock does not support getres()
+ */
 SYSCALL_DEFINE2(clock_getres, const clockid_t, which_clock,
 		struct __kernel_timespec __user *, tp)
 {
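
To check which of the cases in the comment applies on a given machine, the
reported resolutions can be read from user space. This is a minimal sketch
under stated assumptions: reported_resolution_ns() and print_resolutions()
are hypothetical helper names, and the list of clocks is just a selection of
the ones named in the comment.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>

/* Return the resolution reported for clk in nanoseconds, or -1 on error. */
long long reported_resolution_ns(clockid_t clk)
{
	struct timespec res;

	if (clock_getres(clk, &res) != 0)
		return -1;
	return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Print the reported resolution for a selection of clocks. */
void print_resolutions(void)
{
	static const struct {
		clockid_t id;
		const char *name;
	} clocks[] = {
		{ CLOCK_REALTIME,           "CLOCK_REALTIME" },
		{ CLOCK_MONOTONIC,          "CLOCK_MONOTONIC" },
		{ CLOCK_MONOTONIC_RAW,      "CLOCK_MONOTONIC_RAW" },
		{ CLOCK_BOOTTIME,           "CLOCK_BOOTTIME" },
		{ CLOCK_REALTIME_COARSE,    "CLOCK_REALTIME_COARSE" },
		{ CLOCK_MONOTONIC_COARSE,   "CLOCK_MONOTONIC_COARSE" },
		{ CLOCK_PROCESS_CPUTIME_ID, "CLOCK_PROCESS_CPUTIME_ID" },
		{ CLOCK_THREAD_CPUTIME_ID,  "CLOCK_THREAD_CPUTIME_ID" },
	};
	size_t i;

	for (i = 0; i < sizeof(clocks) / sizeof(clocks[0]); i++)
		printf("%-26s %lld ns\n", clocks[i].name,
		       reported_resolution_ns(clocks[i].id));
}
```

Per the comment, a kernel running with high resolution timers is expected to
report one nanosecond for the non-COARSE clocks and nanoseconds per tick for
the COARSE ones. The actual hardware precision is not visible through this
interface.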


^ permalink raw reply	[relevance 4%]

* [PATCH] docs/sp_SP: Add translation of process/adding-syscalls
@ 2023-03-15 14:35  7% Carlos Bilbao
  0 siblings, 0 replies; 200+ results
From: Carlos Bilbao @ 2023-03-15 14:35 UTC (permalink / raw)
  To: corbet; +Cc: linux-kernel, linux-doc, mauriciofb, Carlos Bilbao

Translate Documentation/process/adding-syscalls.rst into Spanish.

Co-developed-by: Mauricio Fuentes <mauriciofb@gmail.com>
Signed-off-by: Mauricio Fuentes <mauriciofb@gmail.com>
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 .../sp_SP/process/adding-syscalls.rst         | 632 ++++++++++++++++++
 .../translations/sp_SP/process/index.rst      |   1 +
 2 files changed, 633 insertions(+)
 create mode 100644 Documentation/translations/sp_SP/process/adding-syscalls.rst

diff --git a/Documentation/translations/sp_SP/process/adding-syscalls.rst b/Documentation/translations/sp_SP/process/adding-syscalls.rst
new file mode 100644
index 000000000000..f21504c612b2
--- /dev/null
+++ b/Documentation/translations/sp_SP/process/adding-syscalls.rst
@@ -0,0 +1,632 @@
+.. include:: ../disclaimer-sp.rst
+
+:Original: :ref:`Documentation/process/adding-syscalls.rst <addsyscalls>`
+:Translator: Mauricio Fuentes <mauriciofb@gmail.com>
+
+.. _sp_addsyscalls:
+
+Agregando una Nueva Llamada del Sistema
+=======================================
+
+Este documento describe qué involucra agregar una nueva llamada del sistema
+al kernel Linux, más allá de la presentación y consejos normales en
+:ref:`Documentation/process/submitting-patches.rst <submittingpatches>` que
+también puede encontrar traducido a este idioma.
+
+Alternativas a Llamadas del Sistema
+-----------------------------------
+
+La primera cosa a considerar cuando se agrega una llamada al sistema es si
+alguna alternativa es adecuada en su lugar. Aunque las llamadas al sistema
+son los puntos de interacción entre el userspace y el kernel más obvios y
+tradicionales, existen otras posibilidades -- elija la que mejor se adecúe
+a su interfaz.
+
+ - Si se puede hacer que la operación se parezca a un objeto filesystem,
+   podría tener más sentido crear un nuevo sistema de ficheros o
+   dispositivo. Esto también hará más fácil encapsular la nueva
+   funcionalidad en un módulo del kernel en vez de requerir que sea
+   construido junto al kernel principal.
+
+     - Si la nueva funcionalidad involucra operaciones donde el kernel
+       notifica al userspace que algo ha pasado, entonces retornar un nuevo
+       descriptor de archivo para el objeto relevante permite al userspace
+       usar ``poll``/``select``/``epoll`` para recibir esta notificación.
+
+     - Sin embargo, operaciones que no mapean a operaciones similares a
+       :manpage:`read(2)`/:manpage:`write(2)` tienen que ser implementadas
+       como solicitudes :manpage:`ioctl(2)`, las cuales pueden llevar a un
+       API algo opaca.
+
+ - Si sólo está exponiendo información del runtime, un nuevo nodo en sysfs
+   (mire ``Documentation/filesystems/sysfs.rst``) o el filesystem ``/proc``
+   podría ser más adecuado. Sin embargo, acceder a estos mecanismos
+   requiere que el filesystem relevante esté montado, lo que podría no ser
+   siempre el caso (e.g. en un ambiente namespaced/sandboxed/chrooted).
+   Evite agregar cualquier API a debugfs, ya que no se considera una
+   interfaz (interface) de 'producción' para el userspace.
+
+ - Si la operación es específica a un archivo o descriptor de archivo
+   específico, entonces la opción de comando adicional :manpage:`fcntl(2)`
+   podría ser más apropiada. Sin embargo, :manpage:`fcntl(2)` es una
+   llamada al sistema multiplexada que esconde mucha complejidad, así que
+   esta opción es mejor cuando la nueva función es análogamente cercana a
+   la funcionalidad existente :manpage:`fcntl(2)`, o la nueva funcionalidad
+   es muy simple (por ejemplo, definir/obtener un flag simple relacionado a
+   un descriptor de archivo).
+
+ - Si la operación es específica a un proceso o tarea particular, entonces
+   un comando adicional :manpage:`prctl(2)` podría ser más apropiado. Tal
+   como con :manpage:`fcntl(2)`, esta llamada al sistema es un multiplexor
+   complicado así que está reservado para comandos análogamente cercanos
+   del existente ``prctl()`` u obtener/definir un flag simple relacionado a
+   un proceso.
+
+Diseñando el API: Planeando para extensiones
+--------------------------------------------
+
+Una nueva llamada del sistema forma parte del API del kernel, y tiene que
+ser soportada indefinidamente. Como tal, es una muy buena idea discutir
+explícitamente el interface en las listas de correo del kernel, y es
+importante planear para futuras extensiones del interface.
+
+(La tabla syscall está poblada con ejemplos históricos donde esto no se
+hizo, junto con los correspondientes seguimientos de los system calls --
+``eventfd``/``eventfd2``, ``dup2``/``dup3``, ``inotify_init``/``inotify_init1``,
+``pipe``/``pipe2``, ``renameat``/``renameat2`` -- así que aprenda de la
+historia del kernel y planee extensiones desde el inicio.)
+
+Para llamadas al sistema más simples que sólo toman un par de argumentos,
+la forma preferida de permitir futuras extensiones es incluir un argumento
+flag a la llamada al sistema. Para asegurarse que el userspace pueda usar
+de forma segura estos flags entre versiones del kernel, revise si los flags
+contienen cualquier flag desconocido, y rechace la llamada al sistema (con
+``EINVAL``) si ocurre::
+
+    if (flags & ~(THING_FLAG1 | THING_FLAG2 | THING_FLAG3))
+        return -EINVAL;
+
+(Si no hay valores de flags usados aún, revise que los argumentos del flag
+sean cero.)
+
+Para llamadas al sistema más sofisticadas que involucran un gran número de
+argumentos, es preferible encapsular la mayoría de los argumentos en una
+estructura que sea pasada a través de un puntero. Tal estructura puede
+hacer frente a futuras extensiones mediante la inclusión de un argumento de
+tamaño en la estructura::
+
+    struct xyzzy_params {
+        u32 size; /* userspace define p->size = sizeof(struct xyzzy_params) */
+        u32 param_1;
+        u64 param_2;
+        u64 param_3;
+    };
+
+Siempre que cualquier campo añadido subsecuente, digamos ``param_4``, sea
+diseñado de forma tal que un valor cero, devuelva el comportamiento previo,
+entonces permite versiones no coincidentes en ambos sentidos:
+
+ - Para hacer frente a programas del userspace más modernos, haciendo
+   llamadas a un kernel más antiguo, el código del kernel debe revisar que
+   cualquier memoria más allá del tamaño de la estructura sea cero (revisar
+   de manera efectiva que ``param_4 == 0``).
+ - Para hacer frente a programas antiguos del userspace haciendo llamadas a
+   un kernel más nuevo, el código del kernel puede extender con ceros, una
+   instancia más pequeña de la estructura (definiendo efectivamente
+   ``param_4 == 0``).
+
+Revise :manpage:`perf_event_open(2)` y la función ``perf_copy_attr()`` (en
+``kernel/events/core.c``) para un ejemplo de esta aproximación.
+
+
+Diseñando el API: Otras consideraciones
+---------------------------------------
+
+Si su nueva llamada al sistema permite al userspace hacer referencia a un
+objeto del kernel, esta debería usar un descriptor de archivo como el
+manipulador de ese objeto -- no invente un nuevo tipo de objeto manipulador
+userspace cuando el kernel ya tiene mecanismos y semánticas bien definidas
+para usar los descriptores de archivos.
+
+Si su nueva llamada a sistema :manpage:`xyzzy(2)` retorna un nuevo
+descriptor de archivo, entonces el argumento flag debe incluir un valor que
+sea equivalente a definir ``O_CLOEXEC`` en el nuevo FD. Esto hace posible
+al userspace acortar la brecha de tiempo entre ``xyzzy()`` y la llamada a
+``fcntl(fd, F_SETFD, FD_CLOEXEC)``, donde un ``fork()`` inesperado y
+``execve()`` en otro hilo podrían filtrar un descriptor al programa
+ejecutado. (Sin embargo, resista la tentación de reusar el valor actual de
+la constante ``O_CLOEXEC``, ya que es específica de la arquitectura y es
+parte de un espacio numerado de flags ``O_*`` que está bastante lleno.)
+
+Si su llamada de sistema retorna un nuevo descriptor de archivo, debería
+considerar también que significa usar la familia de llamadas de sistema
+:manpage:`poll(2)` en ese descriptor de archivo. Hacer un descriptor de
+archivo listo para leer o escribir es la forma normal para que el kernel
+indique al espacio de usuario que un evento ha ocurrido en el
+correspondiente objeto del kernel.
+
+Si su nueva llamada de sistema :manpage:`xyzzy(2)` involucra algún nombre
+de archivo como argumento::
+
+    int sys_xyzzy(const char __user *path, ..., unsigned int flags);
+
+debería considerar también si una versión :manpage:`xyzzyat(2)` es más
+apropiada::
+
+    int sys_xyzzyat(int dfd, const char __user *path, ..., unsigned int flags);
+
+Esto permite más flexibilidad en como el userspace especifica el archivo en
+cuestión; en particular esto permite al userspace pedir la funcionalidad a
+un descriptor de archivo ya abierto usando el flag ``AT_EMPTY_PATH``,
+efectivamente dando una operación :manpage:`fxyzzy(3)` gratis::
+
+ - xyzzyat(AT_FDCWD, path, ..., 0) es equivalente a xyzzy(path, ...)
+ - xyzzyat(fd, "", ..., AT_EMPTY_PATH) es equivalente a fxyzzy(fd, ...)
+
+(Para más detalles sobre la explicación racional de las llamadas \*at(),
+revise el man page :manpage:`openat(2)`; para un ejemplo de AT_EMPTY_PATH,
+mire el man page :manpage:`fstatat(2)`.)
+
+Si su nueva llamada de sistema :manpage:`xyzzy(2)` involucra un parámetro
+que describe un desplazamiento (offset) dentro de un archivo, hágalo de
+tipo ``loff_t`` para que desplazamientos de 64-bit puedan ser soportados
+incluso en arquitecturas de 32-bit.
+
+Si su nueva llamada de sistema :manpage:`xyzzy(2)` involucra una
+funcionalidad privilegiada, esta necesita ser gobernada por el bit de
+capability de Linux apropiado (revisado con una llamada a ``capable()``),
+como se describe en el man page :manpage:`capabilities(7)`. Elija un bit de
+capability existente que gobierne funcionalidades relacionadas, pero trate
+de evitar combinar muchas funciones sólo relacionadas vagamente bajo el
+mismo bit, ya que va en contra del propósito de las capabilities de
+dividir el poder del usuario root. En particular, evite agregar nuevos usos
+de la capability ``CAP_SYS_ADMIN``, que ya es demasiado general.
+
+Si su nueva llamada de sistema :manpage:`xyzzy(2)` manipula un proceso que
+no es el proceso invocador, esta debería ser restringida (usando una llamada
+a ``ptrace_may_access()``) de forma que sólo un proceso invocador con los
+mismos permisos del proceso objetivo, o con las capacidades (capabilities)
+necesarias, pueda manipular el proceso objetivo.
+
+Finalmente, debe ser consciente de que algunas arquitecturas no-x86 tienen
+un manejo más sencillo si los parámetros que son explícitamente 64-bit
+caen en argumentos de número impar (i.e. parámetros 1, 3, 5), para
+permitir el uso de pares contiguos de registros 32-bits. (Este cuidado no
+aplica si el argumento es parte de una estructura que se pasa a través de
+un puntero.)
+
+Proponiendo el API
+------------------
+
+Para hacer una nueva llamada al sistema fácil de revisar, es mejor dividir
+el patchset (conjunto de parches) en trozos separados. Estos deberían
+incluir al menos los siguientes items como commits distintos (cada uno de
+los cuales se describirá más abajo):
+
+ - La implementación central de la llamada al sistema, junto con
+   prototipos, numeración genérica, cambios Kconfig e implementaciones de
+   rutinas de respaldo (fallback stub)
+ - Conectar la nueva llamada a sistema a una arquitectura particular,
+   usualmente x86 (incluyendo todas las x86_64, x86_32 y x32).
+ - Una demostración del uso de la nueva llamada a sistema en el userspace
+   vía un selftest en ``tools/testing/selftests/``.
+ - Un borrador de man-page para la nueva llamada a sistema, ya sea como
+   texto plano en la carta de presentación, o como un parche (separado)
+   para el repositorio man-pages.
+
+Nuevas propuestas de llamadas de sistema, como cualquier cambio al API del
+kernel, deberían siempre ser copiadas a linux-api@vger.kernel.org.
+
+
+Implementación de Llamada de Sistema Genérica
+---------------------------------------------
+
+La entrada principal a su nueva llamada de sistema :manpage:`xyzzy(2)` será
+llamada ``sys_xyzzy()``, pero incluya este punto de entrada con la macro
+``SYSCALL_DEFINEn()`` apropiada en vez de explícitamente. El 'n' indica el
+número de argumentos de la llamada de sistema, y la macro toma el nombre de
+la llamada de sistema seguida por el par (tipo, nombre) para los parámetros
+como argumentos. Usar esta macro permite a la metadata de la nueva llamada
+de sistema estar disponible para otras herramientas.
+
+El nuevo punto de entrada también necesita un prototipo de función
+correspondiente en ``include/linux/syscalls.h``,  marcado como asmlinkage
+para calzar en la manera en que las llamadas de sistema son invocadas::
+
+    asmlinkage long sys_xyzzy(...);
+
+Algunas arquitecturas (e.g. x86) tienen sus propias tablas de syscall
+específicas para la arquitectura, pero muchas otras arquitecturas comparten
+una tabla de syscall genérica. Agregue su nueva llamada de sistema a la
+lista genérica agregando una entrada a la lista en
+``include/uapi/asm-generic/unistd.h``::
+
+    #define __NR_xyzzy 292
+    __SYSCALL(__NR_xyzzy, sys_xyzzy)
+
+También actualice el conteo de __NR_syscalls para reflejar la llamada de
+sistema adicional, y note que si múltiples llamadas de sistema nuevas son
+añadidas en la misma ventana de integración, el número de su nueva llamada
+de sistema podría tener que ser ajustado para resolver conflictos.
+
+El archivo ``kernel/sys_ni.c`` provee una implementación fallback stub
+(rutina de respaldo) para cada llamada de sistema, retornando ``-ENOSYS``.
+Incluya su nueva llamada a sistema aquí también::
+
+    COND_SYSCALL(xyzzy);
+
+Su nueva funcionalidad del kernel, y la llamada de sistema que la controla,
+debería normalmente ser opcional, así que incluya una opción ``CONFIG``
+(típicamente en ``init/Kconfig``) para ella. Como es usual para opciones
+``CONFIG`` nuevas:
+
+ - Incluya una descripción para la nueva funcionalidad y llamada al sistema
+   controlada por la opción.
+ - Haga la opción dependiendo de EXPERT si esta debe estar escondida de los
+   usuarios normales.
+ - Haga que cualquier nuevo archivo fuente que implemente la función
+   dependa de la opción CONFIG en el Makefile (e.g.
+   ``obj-$(CONFIG_XYZZY_SYSCALL) += xyzzy.o``).
+ - Revise dos veces que el kernel se siga compilando con la nueva opción
+   CONFIG apagada.
+
+Para resumir, necesita un commit que incluya:
+
+ - una opción ``CONFIG`` para la nueva función, normalmente en ``init/Kconfig``
+ - ``SYSCALL_DEFINEn(xyzzy, ...)`` para el punto de entrada
+ - El correspondiente prototipo en ``include/linux/syscalls.h``
+ - Una entrada genérica en ``include/uapi/asm-generic/unistd.h``
+ - fallback stub en ``kernel/sys_ni.c``
+
+
+Implementación de Llamada de Sistema x86
+----------------------------------------
+
+Para conectar su nueva llamada de sistema a plataformas x86, necesita
+actualizar las tablas maestras syscall. Asumiendo que su nueva llamada de
+sistema no es especial de alguna manera (revise abajo), esto involucra una
+entrada "común" (para x86_64 y x32) en
+arch/x86/entry/syscalls/syscall_64.tbl::
+
+    333   common   xyzzy    sys_xyzzy
+
+y una entrada "i386" en ``arch/x86/entry/syscalls/syscall_32.tbl``::
+
+    380   i386     xyzzy    sys_xyzzy
+
+De nuevo, estos números son propensos a ser cambiados si hay conflictos en
+la ventana de integración relevante.
+
+
+Compatibilidad de Llamadas de Sistema (Genérica)
+------------------------------------------------
+
+Para la mayoría de llamadas al sistema la misma implementación 64-bit puede
+ser invocada incluso cuando el programa de userspace es en sí mismo 32-bit;
+incluso si los parámetros de la llamada de sistema incluyen un puntero
+explícito, esto es manipulado de forma transparente.
+
+Sin embargo, existe un par de situaciones donde se necesita una capa de
+compatibilidad para lidiar con las diferencias de tamaño entre 32-bit y
+64-bit.
+
+La primera es si el kernel 64-bit también soporta programas del userspace
+32-bit, y por lo tanto necesita analizar áreas de memoria (``__user``)
+que podrían tener valores tanto 32-bit como 64-bit. En particular esto se
+necesita siempre que un argumento de la llamada a sistema es:
+
+ - un puntero a un puntero
+ - un puntero a un struct conteniendo un puntero (por ejemplo
+   ``struct iovec __user *``)
+ - un puntero a un tipo entero de tamaño variable (``time_t``,
+   ``off_t``, ``long``, ...)
+ - un puntero a un struct conteniendo un tipo entero de tamaño variable.
+
+La segunda situación que requiere una capa de compatibilidad es cuando uno
+de los argumentos de la llamada a sistema tiene un argumento que es
+explícitamente 64-bit incluso sobre arquitectura 32-bit, por ejemplo
+``loff_t`` o ``__u64``. En este caso, el valor que llega a un kernel 64-bit
+desde una aplicación de 32-bit se separará en dos valores de 32-bit, los
+que luego necesitan ser reensamblados en la capa de compatibilidad.
+
+(Note que un argumento de una llamada a sistema que sea un puntero a un
+tipo explícitamente de 64-bit **no** necesita una capa de compatibilidad;
+por ejemplo, los argumentos de :manpage:`splice(2)` del tipo
+``loff_t __user *`` no significan la necesidad de una llamada a sistema
+``compat_``.)
+
+La versión compatible de la llamada de sistema se llama
+``compat_sys_xyzzy()``, y se agrega con la macro
+``COMPAT_SYSCALL_DEFINEn``, de manera análoga a SYSCALL_DEFINEn. Esta
+versión de la implementación se ejecuta como parte de un kernel de 64-bit,
+pero espera recibir parámetros con valores 32-bit y hace lo que tenga que
+hacer para tratar con ellos. (Típicamente, la versión ``compat_sys_``
+convierte los valores a versiones de 64 bits y llama a la versión ``sys_``
+o ambas llaman a una función de implementación interna común.)
+
+El punto de entrada compat también necesita un prototipo de función
+correspondiente, en ``include/linux/compat.h``, marcado como asmlinkage
+para igualar la forma en que las llamadas al sistema son invocadas::
+
+    asmlinkage long compat_sys_xyzzy(...);
+
+Si la nueva llamada al sistema involucra una estructura que se dispone
+de forma distinta en sistema de 32-bit y 64-bit, digamos
+``struct xyzzy_args``, entonces el archivo de cabecera
+include/linux/compat.h también debería incluir una versión compatible de la
+estructura (``struct compat_xyzzy_args``) donde cada campo de tamaño
+variable tiene el tipo ``compat_`` apropiado que corresponde al tipo en
+``struct xyzzy_args``. La rutina ``compat_sys_xyzzy()`` puede entonces usar
+esta estructura ``compat_`` para analizar los argumentos de una invocación
+de 32-bit.
+
+Por ejemplo, si hay campos::
+
+    struct xyzzy_args {
+      const char __user *ptr;
+      __kernel_long_t varying_val;
+      u64 fixed_val;
+      /* ... */
+    };
+
+en struct xyzzy_args, entonces struct compat_xyzzy_args debe tener::
+
+    struct compat_xyzzy_args {
+      compat_uptr_t ptr;
+      compat_long_t varying_val;
+      u64 fixed_val;
+      /* ... */
+    };
+
+La lista genérica de llamadas al sistema también necesita ajustes para
+permitir la versión compat; la entrada en
+``include/uapi/asm-generic/unistd.h`` debería usar ``__SC_COMP`` en vez de
+``__SYSCALL``::
+
+    #define __NR_xyzzy 292
+    __SC_COMP(__NR_xyzzy, sys_xyzzy, compat_sys_xyzzy)
+
+Para resumir, necesita:
+
+  - una ``COMPAT_SYSCALL_DEFINEn(xyzzy, ...)`` para el punto de entrada de compat.
+  - el prototipo correspondiente en ``include/linux/compat.h``
+  - (en caso de ser necesario) un struct de mapeo de 32-bit en ``include/linux/compat.h``
+  - una instancia de ``__SC_COMP`` no ``__SYSCALL`` en ``include/uapi/asm-generic/unistd.h``
+
+Compatibilidad de Llamadas de Sistema (x86)
+-------------------------------------------
+
+Para conectar la arquitectura x86 de una llamada al sistema con una versión
+de compatibilidad, las entradas en las tablas de syscall deben ser
+ajustadas.
+
+Primero, la entrada en ``arch/x86/entry/syscalls/syscall_32.tbl`` recibe
+una columna extra para indicar que un programa del userspace de 32-bit
+corriendo en un kernel de 64-bit debe llegar al punto de entrada compat::
+
+    380  i386     xyzzy      sys_xyzzy    __ia32_compat_sys_xyzzy
+
+Segundo, tienes que averiguar qué debería pasar para la versión x32 ABI de
+la nueva llamada al sistema. Aquí hay una elección: el diseño de los
+argumentos debería coincidir con la versión de 64-bit o la versión de
+32-bit.
+
+Si hay involucrado un puntero-a-puntero, la decisión es fácil: x32 es
+ILP32, por lo que el diseño debe coincidir con la versión 32-bit, y la
+entrada en ``arch/x86/entry/syscalls/syscall_64.tbl`` se divide para que
+programas x32 lleguen al envoltorio de compatibilidad::
+
+    333   64        xyzzy       sys_xyzzy
+    ...
+    555   x32       xyzzy       __x32_compat_sys_xyzzy
+
+Si no hay punteros involucrados, entonces es preferible reutilizar el system
+call 64-bit para el x32 ABI  (y consecuentemente la entrada en
+arch/x86/entry/syscalls/syscall_64.tbl no se cambia).
+
+En cualquier caso, debe revisar que los tipos involucrados en su diseño de
+argumentos de hecho se correspondan exactamente de x32 (-mx32) a sus
+equivalentes de 32-bit (-m32) o de 64-bit (-m64).
+
+
+Llamadas de Sistema Retornando a Otros Lugares
+----------------------------------------------
+
+Para la mayoría de las llamadas al sistema, una vez que la llamada al
+sistema se ha completado el programa de usuario continúa exactamente donde
+quedó -- en la siguiente instrucción, con el stack igual y la mayoría de
+los registros igual que antes de la llamada al sistema, y con el mismo
+espacio en la memoria virtual.
+
+Sin embargo, unas pocas llamadas al sistema hacen las cosas diferente.
+Estas podrían retornar a una ubicación distinta (``rt_sigreturn``) o
+cambiar el espacio de memoria (``fork``/``vfork``/``clone``) o incluso de
+arquitectura (``execve``/``execveat``) del programa.
+
+Para permitir esto, la implementación del kernel de la llamada al sistema
+podría necesitar guardar y restaurar registros adicionales al stack del
+kernel, brindándole control completo de dónde y cómo la ejecución continúa
+después de la llamada a sistema.
+
+Esto es arch-specific, pero típicamente involucra definir puntos de entrada
+assembly que guardan/restauran registros adicionales e invocan el punto de
+entrada real de la llamada a sistema.
+
+Para x86_64, esto es implementado como un punto de entrada ``stub_xyzzy``
+en ``arch/x86/entry/entry_64.S``, y la entrada en la tabla syscall
+(``arch/x86/entry/syscalls/syscall_64.tbl``) es ajustada para calzar::
+
+    333   common  xyzzy     stub_xyzzy
+
+El equivalente para programas 32-bit corriendo en un kernel 64-bit es
+normalmente llamado ``stub32_xyzzy`` e implementado en
+``arch/x86/entry/entry_64_compat.S``, con el correspondiente ajuste en la
+tabla syscall en ``arch/x86/entry/syscalls/syscall_32.tbl``::
+
+    380    i386       xyzzy     sys_xyzzy     stub32_xyzzy
+
+Si la llamada a sistema necesita una capa de compatibilidad (como en la
+sección anterior) entonces la versión ``stub32_`` necesita llamar a la
+versión ``compat_sys_`` de la llamada a sistema, en vez de la versión
+nativa de 64-bit. También, si la implementación de la versión x32 ABI no es
+común con la versión x86_64, entonces su tabla syscall también necesitará
+invocar un stub que llame a la versión ``compat_sys_``.
+
+Para completar, también es agradable configurar un mapeo de modo que el
+user-mode linux todavía funcione -- su tabla syscall referenciará
+stub_xyzzy, pero el UML construido no incluye una implementación
+``arch/x86/entry/entry_64.S``. Arreglar esto es tan simple como agregar un
+#define a ``arch/x86/um/sys_call_table_64.c``::
+
+    #define stub_xyzzy sys_xyzzy
+
+
+Otros detalles
+--------------
+
+La mayoría del kernel trata las llamadas a sistema de manera genérica, pero
+está la excepción ocasional que pueda requerir actualización para su
+llamada a sistema particular.
+
+El subsistema de auditoría es un caso especial; este incluye funciones
+(arch-specific) que clasifican algunos tipos especiales de llamadas al
+sistema -- específicamente file open (``open``/``openat``), program
+execution (``execve``/``execveat``) u operaciones multiplexoras de socket
+(``socketcall``). Si su nueva llamada de sistema es análoga a alguna de
+estas, entonces el sistema auditor debe ser actualizado.
+
+Más generalmente, si existe una llamada al sistema que sea análoga a su
+nueva llamada al sistema, entonces vale la pena hacer un grep a todo el
+kernel de la llamada a sistema existente, para revisar que no exista otro
+caso especial.
+
+
+Testing
+-------
+
+Una nueva llamada al sistema debe obviamente ser probada; también es útil
+proveer a los revisores con una demostración de cómo los programas del
+userspace usarán la llamada al sistema. Una buena forma de combinar estos
+objetivos es incluir un simple programa self-test en un nuevo directorio
+bajo ``tools/testing/selftests/``.
+
+Para una nueva llamada al sistema, obviamente no habrá una función
+envoltorio libc por lo que el test necesitará ser invocado usando
+``syscall()``; también, si la llamada al sistema involucra una nueva
+estructura userspace-visible, el encabezado correspondiente necesitará ser
+instalado para compilar el test.
+
+Asegure que selftest corra satisfactoriamente en todas las arquitecturas
+soportadas. Por ejemplo, revise si funciona cuando es compilado como un
+programa ABI x86_64 (-m64), x86_32 (-m32) y x32 (-mx32).
+
+Para pruebas más amplias y exhaustivas de la nueva funcionalidad, también
+debería considerar agregar tests al Linux Test Project, o al proyecto
+xfstests para cambios filesystem-related:
+
+  - https://linux-test-project.github.io/
+  - git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
+
+
+Man Page
+--------
+
+Toda llamada al sistema nueva debe venir con un man page completo,
+idealmente usando groff markup, pero texto plano también funciona. Si se
+usa groff, es útil incluir una versión ASCII pre-renderizada del man-page
+en el cover del email para el patchset, para la conveniencia de los
+revisores.
+
+El man page debe ser cc'do a linux-man@vger.kernel.org
+Para más detalles, revise https://www.kernel.org/doc/man-pages/patches.html
+
+
+No invoque las llamadas de sistemas en el kernel
+------------------------------------------------
+
+Las llamadas al sistema son, como se declaró más arriba, puntos de
+interacción entre el userspace y el kernel. Por lo tanto, las funciones de
+llamada al sistema como ``sys_xyzzy()`` o ``compat_sys_xyzzy()`` deberían
+ser llamadas sólo desde el userspace vía la tabla de syscall, pero no de
+otro lugar en el kernel. Si la funcionalidad syscall es útil para ser usada
+dentro del kernel, necesita ser compartida entre syscalls nuevas o
+antiguas, o necesita ser compartida entre una syscall y su variante de
+compatibilidad, esta debería ser implementada mediante una función "helper"
+(como ``ksys_xyzzy()``). Esta función del kernel puede ahora ser llamada
+dentro del syscall stub (``sys_xyzzy()``), la syscall stub de
+compatibilidad (``compat_sys_xyzzy()``), y/o otro código del kernel.
+
+Al menos en 64-bit x86, será un requerimiento duro desde la v4.17 en
+adelante no invocar funciones de llamada al sistema (system call) en el
+kernel. Este usa una convención de llamada diferente para llamadas al
+sistema donde ``struct pt_regs`` es decodificado on-the-fly en un
+envoltorio syscall que luego entrega el procesamiento al syscall real. Esto
+significa que sólo aquellos parámetros que son realmente necesarios para
+una syscall específica son pasados durante la entrada del syscall, en vez
+de llenar en seis registros de CPU con contenido random del userspace todo
+el tiempo (los cuales podrían causar serios problemas bajando la cadena de
+llamadas).
+
+Más aún, reglas sobre cómo se debería acceder a la data pueden diferir
+entre la data del kernel y la data de usuario. Esta es otra razón por la
+cual llamar a ``sys_xyzzy()`` es generalmente una mala idea.
+
+Excepciones a esta regla están permitidas solamente en overrides
+específicos de arquitectura, envoltorios de compatibilidad específicos de
+arquitectura, u otro código en arch/.
+
+
+Referencias y fuentes
+---------------------
+
+ - Artículo LWN de Michael Kerrisk sobre el uso de argumentos flags en llamadas al
+   sistema:
+   https://lwn.net/Articles/585415/
+ - Artículo LWN de Michael Kerrisk sobre cómo manejar flags desconocidos en una
+   llamada al sistema: https://lwn.net/Articles/588444/
+ - Artículo LWN de Jake Edge describiendo restricciones en argumentos en
+   64-bit system call: https://lwn.net/Articles/311630/
+ - Par de artículos LWN de David Drysdale que describen la ruta de implementación
+   de llamadas al sistema en detalle para v3.14:
+
+    - https://lwn.net/Articles/604287/
+    - https://lwn.net/Articles/604515/
+
+ - Requerimientos arquitectura-específicos para llamadas al sistema son discutidos en el
+   :manpage:`syscall(2)` man-page:
+   http://man7.org/linux/man-pages/man2/syscall.2.html#NOTES
+ - Recopilación de emails de Linus Torvalds discutiendo problemas con ``ioctl()``:
+   https://yarchive.net/comp/linux/ioctl.html
+ - "How to not invent kernel interfaces", Arnd Bergmann,
+   https://www.ukuug.org/events/linux2007/2007/papers/Bergmann.pdf
+ - Artículo LWN de Michael Kerrisk sobre evitar nuevos usos de CAP_SYS_ADMIN:
+   https://lwn.net/Articles/486306/
+ - Recomendaciones de Andrew Morton que toda la información relacionada a una nueva
+   llamada al sistema debe venir en el mismo hilo de correos:
+   https://lore.kernel.org/r/20140724144747.3041b208832bbdf9fbce5d96@linux-foundation.org
+ - Recomendaciones de Michael Kerrisk que una nueva llamada al sistema debe venir
+   con un man-page: https://lore.kernel.org/r/CAKgNAkgMA39AfoSoA5Pe1r9N+ZzfYQNvNPvcRN7tOvRb8+v06Q@mail.gmail.com
+ - Sugerencias de Thomas Gleixner que conexiones x86 deben ir en commits
+   separados: https://lore.kernel.org/r/alpine.DEB.2.11.1411191249560.3909@nanos
+ - Sugerencias de Greg Kroah-Hartman que es bueno para las nuevas llamadas al sistema
+   que vengan con man-page y selftest: https://lore.kernel.org/r/20140320025530.GA25469@kroah.com
+ - Discusión de Michael Kerrisk de nuevas system call vs. extensiones :manpage:`prctl(2)`:
+   https://lore.kernel.org/r/CAHO5Pa3F2MjfTtfNxa8LbnkeeU8=YJ+9tDqxZpw7Gz59E-4AUg@mail.gmail.com
+ - Sugerencias de Ingo Molnar que llamadas al sistema que involucran múltiples
+   argumentos deben encapsular estos argumentos en una estructura, la cual incluye
+   un campo de tamaño para futura extensibilidad: https://lore.kernel.org/r/20150730083831.GA22182@gmail.com
+ - Enumerando rarezas por la (re-)utilización de O_* numbering space flags:
+
+    - commit 75069f2b5bfb ("vfs: renumber FMODE_NONOTIFY and add to uniqueness
+      check")
+    - commit 12ed2e36c98a ("fanotify: FMODE_NONOTIFY and __O_SYNC in sparc
+      conflict")
+    - commit bb458c644a59 ("Safer ABI for O_TMPFILE")
+
+ - Discusión de Matthew Wilcox sobre las restricciones en argumentos 64-bit:
+   https://lore.kernel.org/r/20081212152929.GM26095@parisc-linux.org
+ - Recomendaciones de Greg Kroah-Hartman sobre flags desconocidos deben ser
+   vigilados: https://lore.kernel.org/r/20140717193330.GB4703@kroah.com
+ - Recomendaciones de Linus Torvalds que las llamadas al sistema x32 deben favorecer
+   compatibilidad con versiones 64-bit sobre versiones 32-bit:
+   https://lore.kernel.org/r/CA+55aFxfmwfB7jbbrXxa=K7VBYPfAvmu3XOkGrLbB1UFjX1+Ew@mail.gmail.com
diff --git a/Documentation/translations/sp_SP/process/index.rst b/Documentation/translations/sp_SP/process/index.rst
index 351bcd3921ba..a0ff2e132c54 100644
--- a/Documentation/translations/sp_SP/process/index.rst
+++ b/Documentation/translations/sp_SP/process/index.rst
@@ -18,3 +18,4 @@
    email-clients
    programming-language
    deprecated
+   adding-syscalls
-- 
2.34.1



* [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler
  @ 2023-02-27 22:29  2% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2023-02-27 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing under this situation which is
potentially recoverable.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---
v7:
 - Adjust alignment of WARN statement

v6:
 - Split into separate patches (Kees)
 - Change to "x86/shstk" in commit log (Boris)

v5:
 - Move to separate file to avoid ifdeffery (Boris)
 - Improvements to commit log (Boris)
 - Rename control_protection_err (Boris)
 - Move comment from end of line in IBT fault handler (Boris)

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.
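
Illustration (not part of the patch): the commit message says the handler
delivers SIGSEGV with the si_code SEGV_CPERR. The sketch below shows how a
userspace program might tell such a fault apart from other SIGSEGV causes.
It is only a sketch under assumptions: the helper names
describe_segv_code(), segv_handler() and install_segv_handler() are
hypothetical, and the fallback value 10 is copied from the siginfo.h hunk
in this patch for libc headers that do not define SEGV_CPERR yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Value taken from the siginfo.h hunk in this patch; older headers may lack it. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Hypothetical helper: map a SIGSEGV si_code to a short description. */
static const char *describe_segv_code(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control-protection fault";
	default:
		return "other";
	}
}

/* Hypothetical handler: report the cause using only async-signal-safe calls. */
static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	const char *msg = describe_segv_code(info->si_code);

	(void)sig;
	(void)ucontext;
	if (write(STDERR_FILENO, msg, strlen(msg)) < 0 ||
	    write(STDERR_FILENO, "\n", 1) < 0)
		_exit(2);
	_exit(1);
}

/* Install the handler; returns 0 on success and -1 on error, like sigaction(). */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call install_segv_handler() early in main(). The handler
exits instead of returning, since returning from a SIGSEGV handler would
normally re-execute the faulting instruction.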
---
 arch/arm/kernel/signal.c                 |  2 +-
 arch/arm64/kernel/signal.c               |  2 +-
 arch/arm64/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal_64.c            |  2 +-
 arch/x86/include/asm/disabled-features.h |  8 +-
 arch/x86/include/asm/idtentry.h          |  2 +-
 arch/x86/include/asm/traps.h             | 12 +++
 arch/x86/kernel/cet.c                    | 94 +++++++++++++++++++++---
 arch/x86/kernel/idt.c                    |  2 +-
 arch/x86/kernel/signal_32.c              |  2 +-
 arch/x86/kernel/signal_64.c              |  2 +-
 arch/x86/kernel/traps.c                  | 12 ---
 arch/x86/xen/enlighten_pv.c              |  2 +-
 arch/x86/xen/xen-asm.S                   |  2 +-
 include/uapi/asm-generic/siginfo.h       |  3 +-
 16 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 06a02707f488..19b6b292892c 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1341,7 +1341,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 7ad22b705b64..cc10d8be9d74 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -4,10 +4,6 @@
 #include <asm/bugs.h>
 #include <asm/traps.h>
 
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -20,15 +16,80 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		  user_mode(regs) ? "user mode" : "kernel mode",
+		  cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -74,3 +135,18 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cc223e60aba2..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1



* [PATCH v6 08/41] x86/shstk: Add user control-protection fault handler
  @ 2023-02-18 21:14  2% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2023-02-18 21:14 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe,
	david, debug
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing under this situation which is
potentially recoverable.

The control-protection fault handler works in a similar way as the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---
v6:
 - Split into separate patches (Kees)
 - Change to "x86/shstk" in commit log (Boris)

v5:
 - Move to separate file to avoid ifdeffery (Boris)
 - Improvements to commit log (Boris)
 - Rename control_protection_err (Boris)
 - Move comment from end of line in IBT fault handler (Boris)

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.
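
Illustration (not part of the patch): the v3 and v2 notes mention the 2d
array for error code strings and the extra "unknown" entry. The standalone
sketch below mirrors the lookup that the cet.c hunk adds, so the
bounds handling can be tried in userspace. The table contents and the CP_EC
and CP_ENCL values are copied from the diff; ARRAY_SIZE is redefined
locally here as an assumption, because the kernel macro is not available
outside the kernel tree.

```c
#include <stddef.h>

/* Low 15 bits carry the error code; bit 15 flags an enclave (values from the diff). */
#define CP_EC   ((1UL << 15) - 1)
#define CP_ENCL (1UL << 15)

/* Local stand-in for the kernel's ARRAY_SIZE macro. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Fixed-width rows: the longest string, "endbranch", needs 9 chars plus NUL. */
static const char cp_err[][10] = {
	[0] = "unknown",
	[1] = "near ret",
	[2] = "far/iret",
	[3] = "endbranch",
	[4] = "rstorssp",
	[5] = "setssbsy",
};

/* Mask off the enclave bit, then map out-of-range codes to "unknown". */
static const char *cp_err_string(unsigned long error_code)
{
	unsigned int cpec = error_code & CP_EC;

	if (cpec >= ARRAY_SIZE(cp_err))
		cpec = 0;
	return cp_err[cpec];
}
```

Because the enclave bit is masked before indexing, an error code such as
(CP_ENCL | 1) still resolves to the "near ret" row, and any code beyond the
table falls back to row 0.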
---
 arch/arm/kernel/signal.c                 |  2 +-
 arch/arm64/kernel/signal.c               |  2 +-
 arch/arm64/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal32.c             |  2 +-
 arch/sparc/kernel/signal_64.c            |  2 +-
 arch/x86/include/asm/disabled-features.h |  8 +-
 arch/x86/include/asm/idtentry.h          |  2 +-
 arch/x86/include/asm/traps.h             | 12 +++
 arch/x86/kernel/cet.c                    | 94 +++++++++++++++++++++---
 arch/x86/kernel/idt.c                    |  2 +-
 arch/x86/kernel/signal_32.c              |  2 +-
 arch/x86/kernel/signal_64.c              |  2 +-
 arch/x86/kernel/traps.c                  | 12 ---
 arch/x86/xen/enlighten_pv.c              |  2 +-
 arch/x86/xen/xen-asm.S                   |  2 +-
 include/uapi/asm-generic/siginfo.h       |  3 +-
 16 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index be279fd48248..4bced22213d5 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1176,7 +1176,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
index 7ad22b705b64..33d7d119be26 100644
--- a/arch/x86/kernel/cet.c
+++ b/arch/x86/kernel/cet.c
@@ -4,10 +4,6 @@
 #include <asm/bugs.h>
 #include <asm/traps.h>
 
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -20,15 +16,80 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		     user_mode(regs) ? "user mode" : "kernel mode",
+		     cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -74,3 +135,18 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cc223e60aba2..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1


^ permalink raw reply related	[relevance 2%]

* Re: [PATCH 1/1] rseq.2: New man page for the rseq(2) API
  2023-02-15  2:21  5%       ` G. Branden Robinson
@ 2023-02-15  3:07  0%         ` Alejandro Colomar
  0 siblings, 0 replies; 200+ results
From: Alejandro Colomar @ 2023-02-15  3:07 UTC (permalink / raw)
  To: G. Branden Robinson; +Cc: Mathieu Desnoyers, linux-kernel, linux-api, linux-man


[-- Attachment #1.1: Type: text/plain, Size: 1842 bytes --]

Hi Branden,

On 2/15/23 03:21, G. Branden Robinson wrote:
> At 2023-02-15T02:52:03+0100, Alejandro Colomar wrote:
>> On 2/15/23 02:20, G. Branden Robinson wrote:
>>> [CC list violently trimmed; for those who remain, this is mostly man
>>> page style issues]
>>
>> Ironically, you trimmed linux-man@  :D
> 
> I didn't!  It wasn't present in the mail to which I replied.

Hmm, you're right, Mathieu didn't CC linux-man@.  I guessed somewhere
in that big list it would be there, but it wasn't.  Thanks for CCing it.

> 
> This did puzzle me.  I guess it was an oversight.  You might want to
> re-send that message of yours, and/or Mathieu's, if it lacked it too.
> 
> Or maybe it doesn't matter because lore.kernel.org finds all.  I just
> used it to track down an exchange between Michael Kerrisk and me that
> GMail refused to find even though it was in my inbox.  It showed me only
> one thread, didn't highlight the specific message that it thought
> matched, and showed me the _wrong_ thread on top of everything else.
> The word "constraint" was in the thread I wanted, not in the one I
> didn't, and even when I quoted it I was served up an incorrect match.

Which reminds me that I hate searching in the groff@ archives.  It's not
because of the search engine, but because of the thread view.  You are
artificially restricted to a given month, and you can't see entire threads
in the search engine.  Is there anything similar to lore for groff@?
Other GNU projects can now be searched at <https://inbox.sourceware.org/>
such as <https://inbox.sourceware.org/libc-alpha/>, but groff@ isn't there
:(

Cheers,

Alex

> 
> Clearly their AI efforts are going swimmingly.
> 
> Regards,
> Branden
-- 
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 1/1] rseq.2: New man page for the rseq(2) API
  @ 2023-02-15  2:21  5%       ` G. Branden Robinson
  2023-02-15  3:07  0%         ` Alejandro Colomar
  0 siblings, 1 reply; 200+ results
From: G. Branden Robinson @ 2023-02-15  2:21 UTC (permalink / raw)
  To: Alejandro Colomar; +Cc: Mathieu Desnoyers, linux-kernel, linux-api, linux-man

[-- Attachment #1: Type: text/plain, Size: 1018 bytes --]

At 2023-02-15T02:52:03+0100, Alejandro Colomar wrote:
> On 2/15/23 02:20, G. Branden Robinson wrote:
> > [CC list violently trimmed; for those who remain, this is mostly man
> > page style issues]
> 
> Ironically, you trimmed linux-man@  :D

I didn't!  It wasn't present in the mail to which I replied.

This did puzzle me.  I guess it was an oversight.  You might want to
re-send that message of yours, and/or Mathieu's, if it lacked it too.

Or maybe it doesn't matter because lore.kernel.org finds all.  I just
used it to track down an exchange between Michael Kerrisk and me that
GMail refused to find even though it was in my inbox.  It showed me only
one thread, didn't highlight the specific message that it thought
matched, and showed me the _wrong_ thread on top of everything else.
The word "constraint" was in the thread I wanted, not in the one I
didn't, and even when I quoted it I was served up an incorrect match.

Clearly their AI efforts are going swimmingly.

Regards,
Branden

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 5%]

* [PATCH v5 07/39] x86: Add user control-protection fault handler
  @ 2023-01-19 21:22  2% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2023-01-19 21:22 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT faults. Refactor this fault handler into separate user and kernel
handlers, like the page fault handler. Add a control-protection handler
for usermode. To avoid ifdeffery, put them both in a new file cet.c, which
is compiled in the case of either of the two CET features supported in the
kernel: kernel IBT or user mode shadow stack. Move some static inline
functions from traps.c into a header so they can be used in cet.c.
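
As a reading aid, the user/kernel split described above can be modelled
outside the kernel. This is only a sketch: the enum and route_cp_fault()
are hypothetical names, and the three booleans stand in for user_mode(regs)
and the two cpu_feature_enabled() checks in the patch.

```c
#include <stdbool.h>

/* Hypothetical labels for the three outcomes of the #CP entry point. */
enum cp_route {
	ROUTE_USER_FAULT,	/* corresponds to do_user_cp_fault() */
	ROUTE_KERNEL_FAULT,	/* corresponds to do_kernel_cp_fault() */
	ROUTE_UNEXPECTED,	/* corresponds to do_unexpected_cp() */
};

/*
 * Model of the dispatch: a fault from user mode is only expected when user
 * shadow stack is enabled, and a fault from kernel mode is only expected
 * when kernel IBT is enabled. Everything else is "unexpected".
 */
static enum cp_route route_cp_fault(bool from_user, bool user_shstk,
				    bool kernel_ibt)
{
	if (from_user)
		return user_shstk ? ROUTE_USER_FAULT : ROUTE_UNEXPECTED;

	return kernel_ibt ? ROUTE_KERNEL_FAULT : ROUTE_UNEXPECTED;
}
```

Note that the feature for the other privilege level never makes a fault
"expected": a user mode #CP with only kernel IBT enabled still takes the
unexpected path.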

Opportunistically fix a comment in the kernel IBT part of the fault
handler that is on the end of the line instead of preceding it.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when the feature
is missing. This unifies the behavior with the new shadow stack code, and
also prevents the kernel from crashing in this situation, which is
potentially recoverable.

The control-protection fault handler works similarly to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
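
For illustration, a userspace program could tell this fault apart from other
SIGSEGV causes roughly as follows. This is a sketch, not part of the patch:
segv_reason(), on_segv() and install_segv_handler() are hypothetical names,
and the fallback definition assumes the value 10 that this patch adds to
siginfo.h, for libc headers that do not know SEGV_CPERR yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Older libc headers may lack it; 10 is the value added by this patch. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Hypothetical helper: map a SIGSEGV si_code to a short description. */
static const char *segv_reason(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control protection fault";
	default:
		return "other";
	}
}

/* SA_SIGINFO handler: report the reason and exit. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
	const char *msg = segv_reason(info->si_code);
	ssize_t r;

	(void)sig;
	(void)ucontext;

	r = write(STDERR_FILENO, msg, strlen(msg));
	r = write(STDERR_FILENO, "\n", 1);
	(void)r;
	_exit(1);
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGSEGV, &sa, NULL);
}
```

The handler does not inspect si_addr, since the patch passes a null address
to force_sig_fault() for this si_code.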

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---

v5:
 - Move to separate file to avoid ifdeffery (Boris)
 - Improvements to commit log (Boris)
 - Rename control_protection_err (Boris)
 - Move comment from end of line in IBT fault handler (Boris)

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

 arch/arm/kernel/signal.c                 |   2 +-
 arch/arm64/kernel/signal.c               |   2 +-
 arch/arm64/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal_64.c            |   2 +-
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   2 +-
 arch/x86/include/asm/traps.h             |  12 ++
 arch/x86/kernel/Makefile                 |   2 +
 arch/x86/kernel/cet.c                    | 152 +++++++++++++++++++++++
 arch/x86/kernel/idt.c                    |   2 +-
 arch/x86/kernel/signal_32.c              |   2 +-
 arch/x86/kernel/signal_64.c              |   2 +-
 arch/x86/kernel/traps.c                  |  87 -------------
 arch/x86/xen/enlighten_pv.c              |   2 +-
 arch/x86/xen/xen-asm.S                   |   2 +-
 include/uapi/asm-generic/siginfo.h       |   3 +-
 17 files changed, 186 insertions(+), 100 deletions(-)
 create mode 100644 arch/x86/kernel/cet.c

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index be279fd48248..4bced22213d5 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1176,7 +1176,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 505f78ddca82..652e366b68a0 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK20	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 21)
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 47ecfff2c83d..75e0dabf0c45 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -47,4 +47,16 @@ void __noreturn handle_stack_overflow(struct pt_regs *regs,
 				      struct stack_info *info);
 #endif
 
+static inline void cond_local_irq_enable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_enable();
+}
+
+static inline void cond_local_irq_disable(struct pt_regs *regs)
+{
+	if (regs->flags & X86_EFLAGS_IF)
+		local_irq_disable();
+}
+
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..92446f1dedd7 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -144,6 +144,8 @@ obj-$(CONFIG_CFI_CLANG)			+= cfi.o
 
 obj-$(CONFIG_CALL_THUNKS)		+= callthunks.o
 
+obj-$(CONFIG_X86_CET)			+= cet.o
+
 ###
 # 64 bit specific files
 ifeq ($(CONFIG_X86_64),y)
diff --git a/arch/x86/kernel/cet.c b/arch/x86/kernel/cet.c
new file mode 100644
index 000000000000..33d7d119be26
--- /dev/null
+++ b/arch/x86/kernel/cet.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/ptrace.h>
+#include <asm/bugs.h>
+#include <asm/traps.h>
+
+enum cp_error_code {
+	CP_EC        = (1 << 15) - 1,
+
+	CP_RET       = 1,
+	CP_IRET      = 2,
+	CP_ENDBR     = 3,
+	CP_RSTRORSSP = 4,
+	CP_SETSSBSY  = 5,
+
+	CP_ENCL	     = 1 << 15,
+};
+
+static const char cp_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(cp_err))
+		cpec = 0;
+	return cp_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		     user_mode(regs) ? "user mode" : "kernel mode",
+		     cp_err_string(error_code));
+}
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+
+static __ro_after_init bool ibt_fatal = true;
+
+/* code label defined in asm below */
+extern void ibt_selftest_ip(void);
+
+static void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
+		return;
+	}
+
+	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
+		regs->ax = 0;
+		return;
+	}
+
+	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
+	if (!ibt_fatal) {
+		printk(KERN_DEFAULT CUT_HERE);
+		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
+		return;
+	}
+	BUG();
+}
+
+/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
+noinline bool ibt_selftest(void)
+{
+	unsigned long ret;
+
+	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
+	     ANNOTATE_RETPOLINE_SAFE
+	     "	jmp *%%rax\n\t"
+	     "ibt_selftest_ip:\n\t"
+	     UNWIND_HINT_FUNC
+	     ANNOTATE_NOENDBR
+	     "	nop\n\t"
+
+	     : "=a" (ret) : : "memory");
+
+	return !ret;
+}
+
+static int __init ibt_setup(char *str)
+{
+	if (!strcmp(str, "off"))
+		setup_clear_cpu_cap(X86_FEATURE_IBT);
+
+	if (!strcmp(str, "warn"))
+		ibt_fatal = false;
+
+	return 1;
+}
+
+__setup("ibt=", ibt_setup);
+
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_32.c b/arch/x86/kernel/signal_32.c
index 9027fc088f97..c12624bc82a3 100644
--- a/arch/x86/kernel/signal_32.c
+++ b/arch/x86/kernel/signal_32.c
@@ -402,7 +402,7 @@ int ia32_setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/signal_64.c b/arch/x86/kernel/signal_64.c
index 13a1e6083837..0e808c72bf7e 100644
--- a/arch/x86/kernel/signal_64.c
+++ b/arch/x86/kernel/signal_64.c
@@ -403,7 +403,7 @@ void sigaction_compat_abi(struct k_sigaction *act, struct k_sigaction *oact)
 */
 static_assert(NSIGILL  == 11);
 static_assert(NSIGFPE  == 15);
-static_assert(NSIGSEGV == 9);
+static_assert(NSIGSEGV == 10);
 static_assert(NSIGBUS  == 5);
 static_assert(NSIGTRAP == 6);
 static_assert(NSIGCHLD == 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d317dc3d06a3..18fb9d620824 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -77,18 +77,6 @@
 
 DECLARE_BITMAP(system_vectors, NR_VECTORS);
 
-static inline void cond_local_irq_enable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_enable();
-}
-
-static inline void cond_local_irq_disable(struct pt_regs *regs)
-{
-	if (regs->flags & X86_EFLAGS_IF)
-		local_irq_disable();
-}
-
 __always_inline int is_valid_bugaddr(unsigned long addr)
 {
 	if (addr < TASK_SIZE_MAX)
@@ -213,81 +201,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
-enum cp_error_code {
-	CP_EC        = (1 << 15) - 1,
-
-	CP_RET       = 1,
-	CP_IRET      = 2,
-	CP_ENDBR     = 3,
-	CP_RSTRORSSP = 4,
-	CP_SETSSBSY  = 5,
-
-	CP_ENCL	     = 1 << 15,
-};
-
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
-{
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
-	}
-
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
-
-	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
-		regs->ax = 0;
-		return;
-	}
-
-	pr_err("Missing ENDBR: %pS\n", (void *)instruction_pointer(regs));
-	if (!ibt_fatal) {
-		printk(KERN_DEFAULT CUT_HERE);
-		__warn(__FILE__, __LINE__, (void *)regs->ip, TAINT_WARN, regs, NULL);
-		return;
-	}
-	BUG();
-}
-
-/* Must be noinline to ensure uniqueness of ibt_selftest_ip. */
-noinline bool ibt_selftest(void)
-{
-	unsigned long ret;
-
-	asm ("	lea ibt_selftest_ip(%%rip), %%rax\n\t"
-	     ANNOTATE_RETPOLINE_SAFE
-	     "	jmp *%%rax\n\t"
-	     "ibt_selftest_ip:\n\t"
-	     UNWIND_HINT_FUNC
-	     ANNOTATE_NOENDBR
-	     "	nop\n\t"
-
-	     : "=a" (ret) : : "memory");
-
-	return !ret;
-}
-
-static int __init ibt_setup(char *str)
-{
-	if (!strcmp(str, "off"))
-		setup_clear_cpu_cap(X86_FEATURE_IBT);
-
-	if (!strcmp(str, "warn"))
-		ibt_fatal = false;
-
-	return 1;
-}
-
-__setup("ibt=", ibt_setup);
-
-#endif /* CONFIG_X86_KERNEL_IBT */
-
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index bb59cc6ddb2d..9c29cd5393cc 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -640,7 +640,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1


^ permalink raw reply related	[relevance 2%]

* man-pages-6.02 released
@ 2022-12-22 19:39  3% Alejandro Colomar
  0 siblings, 0 replies; 200+ results
From: Alejandro Colomar @ 2022-12-22 19:39 UTC (permalink / raw)
  To: linux-man
  Cc: linux-kernel, GNU C Library, groff, Michael Kerrisk,
	Jonathan Corbet, Sam James, Marcos Fouces


[-- Attachment #1.1: Type: text/plain, Size: 5975 bytes --]

Gidday!

I'm proud to announce:

	man-pages-6.02 - manual pages for GNU/Linux

The release tarball is already available on <kernel.org>.

Tarball download:
      <https://mirrors.edge.kernel.org/pub/linux/docs/man-pages/>
Git repository:
      <https://git.kernel.org/cgit/docs/man-pages/man-pages.git/>

The most notable changes in this release are the following:

-  Rewritten pages for string-copying functions.  These now use
    consistent language.  Also added a new string_copying(7) page that
    serves as an overview of all such functions, compares them, and
    details which is appropriate for which uses.

-  Use _Nullable for documenting which functions accept NULL as a
    meaningful value in the function prototypes in the SYNOPSIS.

-  Use VLA syntax for documenting function parameters that are treated
    as arrays.  This uses syntax not accepted by compilers.

-  Rewritten repository documentation (README, CONTRIBUTING, INSTALL, ...).

-  Documentation for new APIs, such as MADV_COLLAPSE in madvise(2).
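
To illustrate the VLA item above: as far as I recall, the SYNOPSIS notation
looks like the prototype in the comment below, where the leading dot refers
to a parameter declared later. Standard C can express a similar hint only
when the size parameter comes first, as in the hypothetical count_zeros()
helper, which is not taken from any manual page.

```c
#include <stddef.h>

/*
 * Man-page SYNOPSIS notation (documentation only, not accepted by
 * compilers), quoted from memory:
 *
 *     void *memcpy(void dest[restrict .n], const void src[restrict .n],
 *                  size_t n);
 *
 * Compilable counterpart: the size must be declared before the array
 * parameter that uses it.
 */
static size_t count_zeros(size_t n, const unsigned char buf[n])
{
	size_t zeros = 0;

	for (size_t i = 0; i < n; i++) {
		if (buf[i] == 0)
			zeros++;
	}

	return zeros;
}
```

The array form is still adjusted to a pointer by the compiler; it only
documents the relationship between the buffer and its size.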

Thank you all for contributing.

-  There's also a repository change that is not part of this release:
    Historic versions of the project going back to man-pages-1.0 have
    been added to the git repository in a 'prehistory' branch.

Cheers,

Alex

==================== Changes in man-pages-6.02 ====================

Released: 2022-12-22, Aldaya


Contributors
------------

The following people contributed patches/fixes, reports, notes,
ideas, and discussions that have been incorporated in changes in
this release:


"G. Branden Robinson" <g.branden.robinson@gmail.com>
1092615079 <1092615079@qq.com>
Aaron Schrab <aaron@schrab.com>
Agostino Sarubbo <ago@gentoo.org>
Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Alejandro Colomar <alx@kernel.org>
Alex Colomar <alx.manpages@gmail.com>
Amir Goldstein <amir73il@gmail.com>
Andrew Clayton <andrew@digital-domain.net>
Andrew Pinski <pinskia@gmail.com>
Andries E. Brouwer <aeb@cwi.nl>
Darrick J. Wong <djwong@kernel.org>
Douglas McIlroy <douglas.mcilroy@dartmouth.edu>
Eric Biggers <ebiggers@google.com>
Florian Weimer <fweimer@redhat.com>
G. Branden Robinson <g.branden.robinson@gmail.com>
Grigoriy <grigoriyremvar@protonmail.com>
Grzegorz Szymaszek <gszymaszek@short.pl>
Helge Kreutzmann <debian@helgefjell.de>
Ian Abbott <abbotti@mev.co.uk>
Iker Pedrosa <ipedrosa@redhat.com>
Ingo Schwarze <schwarze@openbsd.org>
Jakub Wilk <jwilk@jwilk.net>
Jan Kara <jack@suse.cz>
JeanHeyd Meneide <wg14@soasis.org>
Jun Ishiguro <algon.0320@gmail.com>
Luca Versari <veluca93@gmail.com>
Luis Javier Merino <ninjalj@gmail.com>
Mario Blättermann <mario.blaettermann@gmail.com>
Martin Sebor <msebor@redhat.com>
Martin Uecker <uecker@tugraz.at>
Matthew Bobrowski <repnop@google.com>
Michael Kerrisk <mtk.manpages@gmail.com>
Michael Tokarev <mjt@tls.msk.ru>
Mike Frysinger <vapier@gentoo.org>
Mike Gilbert <floppym@gentoo.org>
Minchan Kim <minchan@kernel.org>
Nicolás A. Ortega Froysa <nicolas@ortegas.org>
Pali Rohár <pali@kernel.org>
Pierre Labastie <pierre.labastie@neuf.fr>
Sam James <sam@gentoo.org>
Serge Hallyn <serge@hallyn.com>
Stefan Puiu <stefan.puiu@gmail.com>
Steve Izma <sizma@golden.net>
Suren Baghdasaryan <surenb@google.com>
Thomas Voss <mail@thomasvoss.com>
Tycho Andersen <tycho@tycho.pizza>
Xi Ruoyao <xry111@xry111.site>
Zach O'Keefe <zokeefe@google.com>
Zack Weinberg <zack@owlfolio.org>


Apologies if I missed anyone!


New and rewritten pages
-----------------------

man3/
	static_assert.3
	strcpy.3
	stpncpy.3
	strncat.3

man3const/
	EOF.3const
	EXIT_SUCCESS.3const

man7/
	string_copying.7


Newly documented interfaces in existing pages
---------------------------------------------

ioctl_tty.2
	TIOCSERGETLSR
	TIOCSER_TEMT

madvise.2
	MADV_COLLAPSE

syscall.2
	loongarch


New and changed links
---------------------

man3/
	_Static_assert.3	(static_assert(3))
	stpcpy.3		(strcpy(3))
	strcat.3		(strcpy(3))
	strncpy.3		(stpncpy(3))
	stpecpy.3		(string_copying(7))
	stpecpyx.3		(string_copying(7))
	ustpcpy.3		(string_copying(7))
	ustr2stp.3		(string_copying(7))
	zustr2stp.3		(string_copying(7))
	zustr2ustp.3		(string_copying(7))

man3const/
	EXIT_FAILURE.3const	(EXIT_SUCCESS(3const))


Global changes
--------------

-  Use correct letter case in manual page titles, instead of uppercase.

-  Use \" t comments when appropriate (Lintian needs this).

-  SYNOPSIS:

    -  Add _Nullable for functions that receive NULL as a meaningful
       input.

    -  Use VLA syntax to clarify the meaning of size parameters, rather
       than hiding it in possibly-confusing text.  This syntax is not
       accepted by any compilers, though.

    -  Use [[noreturn]] instead of noreturn, which will be deprecated
       soon.

-  Repository documentation:

    -  Added significant documentation about the repository and the
       project in the root of the repository in different files.
       Starting from the README, anyone passing by should be able to
       understand how the project works and be directed to other
       documentation files.  These files also document the release
       process.

    -  Michael has been busy lately, and he is no longer maintaining
       the project.  The in-repository documentation mentioned above has
       been updated to reflect that.
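
As a rough illustration of the SYNOPSIS conventions listed above, the
declarations below show the three notations side by side.  The function
names (fill_buf, lookup_or_default, die) are hypothetical and are not
real interfaces; this is a sketch of the notation only, not text copied
from any page.  As noted above, compilers do not accept the array-size
syntax.

```c
/* Hypothetical prototypes, written in manual-page SYNOPSIS notation. */

/* VLA-style size: ".size" says that buf has at least size elements. */
void fill_buf(char buf[.size], size_t size);

/* _Nullable: passing NULL for def is a meaningful input. */
const char *lookup_or_default(const char *key, const char *_Nullable def);

/* [[noreturn]] attribute instead of the noreturn keyword. */
[[noreturn]] void die(const char *msg);
```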


Changes to individual pages
---------------------------

copy_file_range.2
	Fix wrong kernel version information

process_madvise.2
	Fix capability and ptrace requirements

madvise.2
	Update Transparent Huge Pages file/shmem documentation for
	Linux 5.4+.


The manual pages (and other files in the repository) have been improved
beyond what this changelog covers.  To learn more about changes applied
to individual pages, use git(1).


-- 
<http://www.alejandro-colomar.es/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* [PATCH v4 07/39] x86: Add user control-protection fault handler
  @ 2022-12-03  0:35  2% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2022-12-03  0:35 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Weijiang Yang, Kirill A . Shutemov, John Allen, kcc, eranian,
	rppt, jamorris, dethoma, akpm, Andrew.Cooper3, christina.schimpe
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT. Refactor this fault handler into separate user and kernel handlers,
like the page fault handler. Add a control-protection handler for usermode.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when
!cpu_feature_enabled(). This unifies the behavior with the new shadow stack
code, and also prevents the kernel from crashing in this situation, which
is potentially recoverable.

The control-protection fault handler works in a similar way to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

Yu-cheng v29:
 - Remove pr_emerg() since it is followed by die().
 - Change boot_cpu_has() to cpu_feature_enabled().

 arch/arm/kernel/signal.c                 |   2 +-
 arch/arm64/kernel/signal.c               |   2 +-
 arch/arm64/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal_64.c            |   2 +-
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   2 +-
 arch/x86/kernel/idt.c                    |   2 +-
 arch/x86/kernel/signal_compat.c          |   2 +-
 arch/x86/kernel/traps.c                  | 107 ++++++++++++++++++++---
 arch/x86/xen/enlighten_pv.c              |   2 +-
 arch/x86/xen/xen-asm.S                   |   2 +-
 include/uapi/asm-generic/siginfo.h       |   3 +-
 13 files changed, 114 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 9ad911f1647c..81b13a21046e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1166,7 +1166,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 7a2954a16cb7..b7646b471537 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -105,6 +105,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -128,7 +134,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 879ef8c72f5c..d441804443d5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8b83d8fbce71..e35c70dc1afb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -213,12 +213,7 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
+#ifdef CONFIG_X86_CET
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -231,15 +226,87 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char control_protection_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(control_protection_err))
+		cpec = 0;
+	return control_protection_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		     user_mode(regs) ? "user mode" : "kernel mode",
+		     cp_err_string(error_code));
+}
+#endif /* CONFIG_X86_CET */
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_KERNEL_IBT
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -285,9 +352,25 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
-
 #endif /* CONFIG_X86_KERNEL_IBT */
 
+#ifdef CONFIG_X86_CET
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
+#endif /* CONFIG_X86_CET */
+
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index a7d83c7800e4..e58d6cd30853 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -639,7 +639,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 4a184f6e4e4d..7cdcb4ce6976 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1



* [PATCH v3 07/37] x86/cet: Add user control-protection fault handler
  @ 2022-11-04 22:35  2% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2022-11-04 22:35 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov, John Allen,
	kcc, eranian, rppt, jamorris, dethoma, akpm
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT. Refactor this fault handler into separate user and kernel handlers,
like the page fault handler. Add a control-protection handler for usermode.

Keep the same behavior for the kernel side of the fault handler, except for
converting a BUG to a WARN in the case of a #CP happening when
!cpu_feature_enabled(). This unifies the behavior with the new shadow stack
code, and also prevents the kernel from crashing in this situation, which
is potentially recoverable.

The control-protection fault handler works in a similar way to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.
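
The error-code handling added to traps.c in the diff below can be
mirrored in user space to see how the bits are split. The following is
only a sketch that copies the string table and masks from the patch;
cp_in_enclave() is a hypothetical helper that does not exist in the
kernel.

```c
/* Masks as defined by enum cp_error_code in the patch. */
#define CP_EC   ((1UL << 15) - 1)	/* low 15 bits: error kind */
#define CP_ENCL (1UL << 15)		/* bit 15: fault inside an enclave */

/* Same table as control_protection_err[] in the patch. */
static const char cp_err[][10] = {
	[0] = "unknown",
	[1] = "near ret",
	[2] = "far/iret",
	[3] = "endbranch",
	[4] = "rstorssp",
	[5] = "setssbsy",
};

/* Map the low bits to a name; out-of-range values become "unknown". */
const char *cp_err_string(unsigned long error_code)
{
	unsigned long cpec = error_code & CP_EC;

	if (cpec >= sizeof(cp_err) / sizeof(cp_err[0]))
		cpec = 0;
	return cp_err[cpec];
}

/* Hypothetical helper: nonzero if the enclave bit is set. */
int cp_in_enclave(unsigned long error_code)
{
	return (error_code & CP_ENCL) != 0;
}
```

Clamping unrecognised codes to index 0 is what lets the table lookup stay
in bounds without array_index_nospec(), at the cost of reporting any
future error kind as "unknown".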

Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---

v3:
 - Shorten user/kernel #CP handler function names (peterz)
 - Restore CP_ENDBR check to kernel handler (peterz)
 - Utilize CONFIG_X86_CET (Kees)
 - Unify "unexpected" warnings (Andrew Cooper)
 - Use 2d array for error code chars (Andrew Cooper)
 - Add comment about why to read SSP MSR before enabling interrupts

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

Yu-cheng v29:
 - Remove pr_emerg() since it is followed by die().
 - Change boot_cpu_has() to cpu_feature_enabled().

 arch/arm/kernel/signal.c                 |   2 +-
 arch/arm64/kernel/signal.c               |   2 +-
 arch/arm64/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal32.c             |   2 +-
 arch/sparc/kernel/signal_64.c            |   2 +-
 arch/x86/include/asm/disabled-features.h |   8 +-
 arch/x86/include/asm/idtentry.h          |   2 +-
 arch/x86/kernel/idt.c                    |   2 +-
 arch/x86/kernel/signal_compat.c          |   2 +-
 arch/x86/kernel/traps.c                  | 107 ++++++++++++++++++++---
 arch/x86/xen/enlighten_pv.c              |   2 +-
 arch/x86/xen/xen-asm.S                   |   2 +-
 include/uapi/asm-generic/siginfo.h       |   3 +-
 13 files changed, 114 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index e07f359254c3..9a3c9de5ac5e 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 9ad911f1647c..81b13a21046e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1166,7 +1166,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 30cd12905499..5ff93b8165ed 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -93,6 +93,12 @@
 #define DISABLE_USER_SHSTK	(1 << (X86_FEATURE_USER_SHSTK & 31))
 #endif
 
+#ifdef CONFIG_X86_KERNEL_IBT
+#define DISABLE_IBT	0
+#else
+#define DISABLE_IBT	(1 << (X86_FEATURE_IBT & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -116,7 +122,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_IBT)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..69e26f48d027 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..5074b8420359 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 879ef8c72f5c..d441804443d5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 178015a820f0..1ba42c6118ce 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -212,12 +212,7 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
+#ifdef CONFIG_X86_CET
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -230,15 +225,87 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+static const char control_protection_err[][10] = {
+	[0] = "unknown",
+	[1] = "near ret",
+	[2] = "far/iret",
+	[3] = "endbranch",
+	[4] = "rstorssp",
+	[5] = "setssbsy",
+};
+
+static const char *cp_err_string(unsigned long error_code)
+{
+	unsigned int cpec = error_code & CP_EC;
+
+	if (cpec >= ARRAY_SIZE(control_protection_err))
+		cpec = 0;
+	return control_protection_err[cpec];
+}
+
+static void do_unexpected_cp(struct pt_regs *regs, unsigned long error_code)
+{
+	WARN_ONCE(1, "Unexpected %s #CP, error_code: %s\n",
+		     user_mode(regs) ? "user mode" : "kernel mode",
+		     cp_err_string(error_code));
+}
+#endif /* CONFIG_X86_CET */
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_USER_SHADOW_STACK
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+void do_user_cp_fault(struct pt_regs *regs, unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/*
+	 * An exception was just taken from userspace. Since interrupts are disabled
+	 * here, no scheduling should have messed with the registers yet and they
+	 * will be whatever is live in userspace. So read the SSP before enabling
+	 * interrupts so locking the fpregs to do it later is not required.
+	 */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 cp_err_string(error_code),
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code);
+
+#ifdef CONFIG_X86_KERNEL_IBT
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
+
+void do_kernel_cp_fault(struct pt_regs *regs, unsigned long error_code)
+{
+	if ((error_code & CP_EC) != CP_ENDBR) {
+		do_unexpected_cp(regs, error_code);
 		return;
+	}
 
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
@@ -284,9 +351,25 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
-
 #endif /* CONFIG_X86_KERNEL_IBT */
 
+#ifdef CONFIG_X86_CET
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (user_mode(regs)) {
+		if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
+			do_user_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	} else {
+		if (cpu_feature_enabled(X86_FEATURE_IBT))
+			do_kernel_cp_fault(regs, error_code);
+		else
+			do_unexpected_cp(regs, error_code);
+	}
+}
+#endif /* CONFIG_X86_CET */
+
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f82857e48815..cf4ee15e956e 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -638,7 +638,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 6b4fdf6b9542..32f1b05b7a3c 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#ifdef CONFIG_X86_CET
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1


^ permalink raw reply related	[relevance 2%]

* [PATCH v3 2/2] Documentation: Add HOWTO Spanish translation into rst based build system
  @ 2022-10-24 14:55  2% ` Carlos Bilbao
  0 siblings, 0 replies; 200+ results
From: Carlos Bilbao @ 2022-10-24 14:55 UTC (permalink / raw)
  To: corbet
  Cc: linux-doc, linux-kernel, bilbao, bagasdotme, willy, akiyks,
	miguel.ojeda.sandonis, Carlos Bilbao

Add Spanish translation of HOWTO document into rst based documentation
build system.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 Documentation/translations/sp_SP/howto.rst | 617 +++++++++++++++++++++
 Documentation/translations/sp_SP/index.rst |   8 +
 2 files changed, 625 insertions(+)
 create mode 100644 Documentation/translations/sp_SP/howto.rst

diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
new file mode 100644
index 000000000000..f1375651a1a8
--- /dev/null
+++ b/Documentation/translations/sp_SP/howto.rst
@@ -0,0 +1,617 @@
+.. include:: ./disclaimer-sp.rst
+
+:Original: :ref:`Documentation/process/howto.rst <process_howto>`
+:Translator: Carlos Bilbao <carlos.bilbao@amd.com>
+
+.. _sp_process_howto:
+
+Cómo participar en el desarrollo del kernel de Linux
+====================================================
+
+Este documento es el principal punto de partida. Contiene instrucciones
+sobre cómo convertirse en desarrollador del kernel de Linux y explica cómo
+trabajar con él y en su desarrollo. El documento no tratará ningún aspecto
+técnico relacionado con la programación del kernel, pero le ayudará
+guiándole por el camino correcto.
+
+Si algo en este documento quedara obsoleto, envíe parches al maintainer de
+este archivo, que se encuentra en la parte superior del documento.
+
+Introducción
+------------
+¿De modo que quiere descubrir cómo convertirse en un/a desarrollador/a del
+kernel de Linux? Tal vez su jefe le haya dicho, "Escriba un driver de
+Linux para este dispositivo." El objetivo de este documento es enseñarle
+todo cuanto necesita para conseguir esto, describiendo el proceso por el
+que debe pasar, y con indicaciones de cómo trabajar con la comunidad.
+También trata de explicar las razones por las cuales la comunidad trabaja
+de la forma en que lo hace.
+
+El kernel está principalmente escrito en C, con algunas partes que son
+dependientes de la arquitectura en ensamblador. Un buen conocimiento de C
+es necesario para desarrollar en el kernel. El lenguaje ensamblador (en
+cualquier arquitectura) no es necesario a menos que planee realizar
+desarrollo de bajo nivel para dicha arquitectura. Aunque no es un perfecto
+sustituto para una educación sólida en C y/o años de experiencia, los
+siguientes libros sirven, como mínimo, como referencia:
+
+- "The C Programming Language" de Kernighan y Ritchie [Prentice Hall]
+- "Practical C Programming" de Steve Oualline [O'Reilly]
+- "C:  A Reference Manual" de Harbison y Steele [Prentice Hall]
+
+El kernel está escrito usando GNU C y la cadena de herramientas GNU. Si
+bien se adhiere al estándar ISO C89, utiliza una serie de extensiones que
+no aparecen en dicho estándar. El kernel usa un C independiente de entorno,
+sin depender de la biblioteca C estándar, por lo que algunas partes del
+estándar C no son compatibles. No se permiten divisiones arbitrarias de
+long long ni coma flotante. En ocasiones, puede ser difícil
+entender las suposiciones que el kernel hace respecto a la cadena de
+herramientas y las extensiones que usa, y desafortunadamente no hay
+referencia definitiva para estas. Consulte las páginas de información de
+gcc (`info gcc`) para obtener información al respecto.
+
+Recuerde que está tratando de aprender a trabajar con una comunidad de
+desarrollo existente. Es un grupo diverso de personas, con altos estándares
+de código, estilo y procedimiento. Estas normas han sido creadas a lo
+largo del tiempo en función de lo que se ha encontrado que funciona mejor
+para un equipo tan grande y geográficamente disperso. Trate de aprender
+tanto como le sea posible acerca de estos estándares antes de tiempo, ya
+que están bien documentados; no espere que la gente se adapte a usted o a
+la forma de hacer las cosas en su empresa.
+
+Cuestiones legales
+------------------
+El código fuente del kernel de Linux se publica bajo licencia GPL. Por
+favor, revise el archivo COPYING, presente en la carpeta principal del
+código fuente, para detalles de la licencia. Si tiene alguna otra pregunta
+sobre licencias, contacte a un abogado, no pregunte en listas de discusión
+del kernel de Linux. Los miembros de estas listas no son abogados, y no
+debe confiar en sus opiniones en materia legal.
+
+Para preguntas y respuestas más frecuentes sobre la licencia GPL, consulte:
+
+	https://www.gnu.org/licenses/gpl-faq.html
+
+Documentación
+--------------
+El código fuente del kernel de Linux tiene una gran variedad de documentos
+que son increíblemente valiosos para aprender a interactuar con la
+comunidad del kernel. Cuando se agregan nuevas funciones al kernel, se
+recomienda que se incluyan nuevos archivos de documentación que expliquen
+cómo usar la función. Cuando un cambio en el kernel hace que la interfaz
+que el kernel expone al espacio de usuario cambie, se recomienda que envíe la
+información o un parche en las páginas del manual que expliquen el cambio
+a mtk.manpages@gmail.com, y CC la lista linux-api@vger.kernel.org.
+
+Esta es la lista de archivos que están en el código fuente del kernel y son
+de obligada lectura:
+
+  :ref:`Documentation/admin-guide/README.rst <readme>`
+    Este archivo ofrece una breve descripción del kernel de Linux y
+    describe lo que es necesario hacer para configurar y compilar el
+    kernel. Quienes sean nuevos en el kernel deben comenzar aquí.
+
+  :ref:`Documentation/process/changes.rst <changes>`
+    Este archivo proporciona una lista de los niveles mínimos de varios
+    paquetes que son necesarios para construir y ejecutar el kernel
+    exitosamente.
+
+  :ref:`Documentation/process/coding-style.rst <codingstyle>`
+    Esto describe el estilo de código del kernel de Linux y algunas de las
+    razones detrás de esto. Se espera que todo el código nuevo siga las
+    directrices de este documento. La mayoría de los maintainers solo
+    aceptarán parches si se siguen estas reglas, y muchas personas solo
+    revisan el código si tiene el estilo adecuado.
+
+  :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
+    Este archivo describe en gran detalle cómo crear con éxito y enviar un
+    parche, que incluye (pero no se limita a):
+
+       - Contenidos del correo electrónico (email)
+       - Formato del email
+       - A quién se debe enviar
+
+    Seguir estas reglas no garantiza el éxito (ya que todos los parches
+    están sujetos a escrutinio de contenido y estilo), pero en caso de no
+    seguir dichas reglas, el fracaso está prácticamente garantizado.
+    Otras excelentes descripciones de cómo crear parches correctamente son:
+
+	"The Perfect Patch"
+		https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+	"Linux kernel patch submission format"
+		https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html
+
+  :ref:`Documentation/process/stable-api-nonsense.rst <stable_api_nonsense>`
+    Este archivo describe la lógica detrás de la decisión consciente de
+    no tener una API estable dentro del kernel, incluidas cosas como:
+
+      - Capas intermedias del subsistema (¿por compatibilidad?)
+      - Portabilidad de drivers entre sistemas operativos
+      - Mitigar el cambio rápido dentro del árbol de fuentes del kernel (o
+        prevenir cambios rápidos)
+
+    Este documento es crucial para comprender la filosofía del desarrollo
+    de Linux y es muy importante para las personas que se mudan a Linux
+    tras desarrollar otros sistemas operativos.
+
+  :ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>`
+    Si cree que ha encontrado un problema de seguridad en el kernel de
+    Linux, siga los pasos de este documento para ayudar a notificar a los
+    desarrolladores del kernel y ayudar a resolver el problema.
+
+  :ref:`Documentation/process/management-style.rst <managementstyle>`
+    Este documento describe cómo operan los maintainers del kernel de Linux
+    y los valores compartidos detrás de sus metodologías. Esta es una
+    lectura importante para cualquier persona nueva en el desarrollo del
+    kernel (o cualquier persona que simplemente sienta curiosidad por
+    el campo IT), ya que clarifica muchos conceptos erróneos y confusiones
+    comunes sobre el comportamiento único de los maintainers del kernel.
+
+  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+    Este archivo describe las reglas sobre cómo se suceden las versiones
+    del kernel estable, y qué hacer si desea obtener un cambio en una de
+    estas publicaciones.
+
+  :ref:`Documentation/process/kernel-docs.rst <kernel_docs>`
+    Una lista de documentación externa relativa al desarrollo del kernel.
+    Por favor consulte esta lista si no encuentra lo que está buscando
+    dentro de la documentación del kernel.
+
+  :ref:`Documentation/process/applying-patches.rst <applying_patches>`
+    Una buena introducción que describe exactamente qué es un parche y cómo
+    aplicarlo a las diferentes ramas de desarrollo del kernel.
+
+El kernel también tiene una gran cantidad de documentos que pueden ser
+generados automáticamente desde el propio código fuente o desde
+ReStructuredText markups (ReST), como este. Esto incluye una descripción
+completa de la API en el kernel y reglas sobre cómo manejar cerrojos
+(locking) correctamente.
+
+Todos estos documentos se pueden generar como PDF o HTML ejecutando::
+
+	make pdfdocs
+	make htmldocs
+
+respectivamente desde el directorio fuente principal del kernel.
+
+Los documentos que utilizan el markup ReST se generarán en
+Documentation/output. También se pueden generar en formatos LaTeX y ePub
+con::
+
+	make latexdocs
+	make epubdocs
+
+Convertirse en un/a desarrollador/a de kernel
+---------------------------------------------
+
+Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
+el proyecto Linux KernelNewbies:
+
+	https://kernelnewbies.org
+
+Consiste en una útil lista de correo donde puede preguntar casi cualquier
+tipo de pregunta básica de desarrollo del kernel (asegúrese de buscar en
+los archivos primero, antes de preguntar algo que ya ha sido respondido en
+el pasado.) También tiene un canal IRC que puede usar para hacer preguntas
+en tiempo real, y una gran cantidad de documentación útil para ir
+aprendiendo sobre el desarrollo del kernel de Linux.
+
+El sitio web tiene información básica sobre la organización del código,
+subsistemas, y proyectos actuales (tanto dentro como fuera del árbol).
+También ofrece información logística básica, por ejemplo cómo compilar
+un kernel y aplicar un parche.
+
+Si no sabe por dónde quiere empezar, pero quiere buscar alguna tarea que
+comenzar a hacer para unirse a la comunidad de desarrollo del kernel,
+acuda al proyecto Linux Kernel Janitor:
+
+	https://kernelnewbies.org/KernelJanitors
+
+Es un gran lugar para comenzar. Describe una lista de problemas
+relativamente simples que deben limpiarse y corregirse dentro del árbol
+de fuentes del kernel de Linux. Trabajando con los
+desarrolladores a cargo de este proyecto, aprenderá los conceptos básicos
+para incluir su parche en el árbol del kernel de Linux, y posiblemente
+descubrirá en qué dirección trabajar a continuación, si no tiene ya
+una idea.
+
+Antes de realizar cualquier modificación real al código del kernel de
+Linux, es imperativo entender cómo funciona el código en cuestión. Para
+este propósito, nada es mejor que leerlo directamente (lo más complicado
+está bien comentado), tal vez incluso con la ayuda de herramientas
+especializadas. Una de esas herramientas que se recomienda especialmente
+es el proyecto Linux Cross-Reference, que es capaz de presentar el código
+fuente en un formato de página web indexada y autorreferencial. Una
+excelente puesta al día del repositorio del código del kernel se puede
+encontrar en:
+
+	https://elixir.bootlin.com/
+
+El proceso de desarrollo
+------------------------
+
+El proceso de desarrollo del kernel de Linux consiste actualmente en
+diferentes "branches" (ramas) con muchos distintos subsistemas específicos
+a cada una de ellas. Las diferentes ramas son:
+
+  - El código principal de Linus (mainline tree)
+  - Varios árboles estables con múltiples major numbers
+  - Subsistemas específicos
+  - linux-next, para integración y testing
+
+Mainline tree (Árbol principal)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+El mainline tree es mantenido por Linus Torvalds, y puede encontrarse en
+https://kernel.org o en su repo.  El proceso de desarrollo es el siguiente:
+
+  - Tan pronto como se lanza un nuevo kernel, se abre una ventana de dos
+    semanas. Durante este período de tiempo, los maintainers pueden enviar
+    grandes modificaciones a Linus, por lo general los parches que ya se
+    han incluido en el linux-next durante unas semanas. La forma preferida
+    de enviar grandes cambios es usando git (la herramienta de
+    administración de código fuente del kernel, más información al respecto
+    en https://git-scm.com/), pero los parches simples también son válidos.
+  - Después de dos semanas, se lanza un kernel -rc1 y la atención se centra
+    en hacer el kernel nuevo lo más estable ("sólido") posible. La mayoría
+    de los parches en este punto deben arreglar una regresión. Los errores
+    que siempre han existido no son regresiones, por lo tanto, solo envíe
+    este tipo de correcciones si son importantes. Tenga en cuenta que se
+    podría aceptar un controlador (o sistema de archivos) completamente
+    nuevo después de -rc1 porque no hay riesgo de causar regresiones con
+    tal cambio, siempre y cuando el cambio sea autónomo y no afecte áreas
+    fuera del código que se está agregando. git se puede usar para enviar
+    parches a Linus después de que se lance -rc1, pero los parches también
+    deben ser enviados a una lista de correo pública para su revisión.
+  - Se lanza un nuevo -rc cada vez que Linus considera que el árbol git
+    actual está en un estado razonablemente sano y adecuado para la prueba.
+    La meta es lanzar un nuevo kernel -rc cada semana.
+  - El proceso continúa hasta que el kernel se considera "listo", y esto
+    puede durar alrededor de 6 semanas.
+
+Vale la pena mencionar lo que Andrew Morton escribió en las listas de
+correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
+
+	*"Nadie sabe cuándo se publicará un nuevo kernel, pues esto sucede
+	según el estado de los bugs, no de una cronología preconcebida."*
+
+Varios árboles estables con múltiples major numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Los kernels con versiones de 3 partes son kernels estables. Estos contienen
+correcciones relativamente pequeñas y críticas para problemas de seguridad
+o importantes regresiones descubiertas para una publicación de código.
+Cada lanzamiento en una gran serie estable incrementa la tercera parte del
+número de versión, manteniendo las dos primeras partes iguales.
+
+Esta es la rama recomendada para los usuarios que quieren la versión
+estable más reciente del kernel, y no están interesados en ayudar a probar
+versiones en desarrollo/experimentales.
+
+Los árboles estables son mantenidos por el equipo "estable"
+<stable@vger.kernel.org>, y se liberan (publican) según lo dicten las
+necesidades. El período de liberación normal es de aproximadamente dos
+semanas, pero puede ser más largo si no hay problemas apremiantes. Un
+problema relacionado con la seguridad, en cambio, puede causar un
+lanzamiento casi instantáneamente.
+
+El archivo :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+en el árbol del kernel documenta qué tipos de cambios son aceptables para
+el árbol estable y cómo funciona el proceso de lanzamiento.
+
+Subsistemas específicos
+~~~~~~~~~~~~~~~~~~~~~~~~
+Los maintainers de los diversos subsistemas del kernel --- y también muchos
+desarrolladores de subsistemas del kernel --- exponen su estado actual de
+desarrollo en repositorios fuente. De esta manera, otros pueden ver lo que
+está sucediendo en las diferentes áreas del kernel. En áreas donde el
+desarrollo es rápido, se le puede pedir a un desarrollador que base sus
+envíos en tal árbol del subsistema del kernel, para evitar conflictos entre
+este y otros trabajos ya en curso.
+
+La mayoría de estos repositorios son árboles git, pero también hay otros
+SCM en uso, o colas de parches que se publican como series quilt. Las
+direcciones de estos repositorios de subsistemas se enumeran en el archivo
+MAINTAINERS. Muchos de estos se pueden ver en https://git.kernel.org/.
+
+Antes de que un parche propuesto se incluya con dicho árbol de subsistemas,
+está sujeto a revisión, que ocurre principalmente en las listas de correo
+(ver la sección respectiva a continuación). Para varios subsistemas del
+kernel, esta revisión se rastrea con la herramienta patchwork. Patchwork
+ofrece una interfaz web que muestra publicaciones de parches, cualquier
+comentario sobre un parche o revisiones a él, y los maintainers pueden
+marcar los parches como en revisión, aceptado, o rechazado. La mayoría de
+estos sitios de trabajo de parches se enumeran en
+
+https://patchwork.kernel.org/.
+
+linux-next, para integración y testing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Antes de que las actualizaciones de los árboles de subsistemas se combinen
+con el árbol principal, necesitan probar su integración. Para ello, existe
+un repositorio especial de pruebas en el que se encuentran casi todos los
+árboles de subsistema, actualizado casi a diario:
+
+	https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
+
+De esta manera, linux-next ofrece una perspectiva resumida de lo que se
+espera que entre en el kernel principal en el próximo período de "merge"
+(fusión de código). Los testers aventureros son bienvenidos a probar
+linux-next en ejecución.
+
+Reportar bugs
+-------------
+
+El archivo 'Documentation/admin-guide/reporting-issues.rst' en el
+directorio principal del kernel describe cómo informar un posible bug del
+kernel y detalles sobre qué tipo de información necesitan los
+desarrolladores del kernel para ayudar a rastrear la fuente del problema.
+
+Gestión de informes de bugs
+------------------------------
+
+Una de las mejores formas de poner en práctica sus habilidades de hacking
+es arreglando errores reportados por otras personas. No solo ayudará a
+hacer el kernel más estable, también aprenderá a solucionar problemas del
+mundo real y mejorará sus habilidades, y otros desarrolladores se darán
+cuenta de su presencia. La corrección de errores es una de las mejores
+formas de ganar méritos entre desarrolladores, porque no a muchas personas
+les gusta perder el tiempo arreglando los errores de otras personas.
+
+Para trabajar en informes de errores ya reportados, busque un subsistema
+que le interese. Consulte en el archivo MAINTAINERS dónde se informan los
+errores de ese subsistema; con frecuencia será una lista de correo, rara
+vez un rastreador de errores (bugtracker). Busque en los archivos de dicho
+lugar para informes recientes y ayude donde lo crea conveniente. También es
+posible que desee revisar https://bugzilla.kernel.org para informes de
+errores; solo un puñado de subsistemas del kernel lo emplean activamente
+para informar o rastrear; sin embargo, todos los errores para todo el kernel
+se archivan allí.
+
+Listas de correo
+-----------------
+
+Como se explica en algunos de los documentos anteriores, la mayoría de
+desarrolladores del kernel participan en la lista de correo del kernel de
+Linux. Detalles sobre cómo suscribirse y darse de baja de la lista se
+pueden encontrar en:
+
+	http://vger.kernel.org/vger-lists.html#linux-kernel
+
+Existen archivos de la lista de correo en la web en muchos lugares
+distintos. Utilice un motor de búsqueda para encontrar estos archivos. Por
+ejemplo:
+
+	http://dir.gmane.org/gmane.linux.kernel
+
+Es muy recomendable que busque en los archivos sobre el tema que desea
+tratar, antes de publicarlo en la lista. Un montón de cosas ya discutidas
+en detalle solo se registran en los archivos de la lista de correo.
+
+La mayoría de los subsistemas individuales del kernel también tienen sus
+propias listas de correo donde hacen sus esfuerzos de desarrollo. Revise el
+archivo MAINTAINERS para obtener una relación de cuáles son estas listas
+para los diferentes grupos.
+
+Muchas de las listas están alojadas en kernel.org. La información sobre
+estas puede ser encontrada en:
+
+	http://vger.kernel.org/vger-lists.html
+
+Recuerde mantener buenos hábitos de comportamiento al usar las listas.
+Aunque un poco cursi, la siguiente URL tiene algunas pautas simples para
+interactuar con la lista (o cualquier lista):
+
+	http://www.albion.com/netiquette/
+
+Si varias personas responden a su correo, el CC (lista de destinatarios)
+puede hacerse bastante grande. No elimine a nadie de la lista CC: sin una
+buena razón, ni responda solo a la dirección de la lista. Acostúmbrese
+a recibir correos dos veces, una del remitente y otra de la lista, y no
+intente ajustar esto agregando encabezados de correo astutos; a la gente no
+le gustará.
+
+Recuerde mantener intacto el contexto y la atribución de sus respuestas,
+mantenga las líneas "El hacker John Kernel escribió ...:" en la parte
+superior de su respuesta, y agregue sus declaraciones entre las secciones
+individuales citadas en lugar de escribir en la parte superior del
+correo electrónico.
+
+Si incluye parches en su correo, asegúrese de que sean texto legible sin
+formato como se indica en :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
+Los desarrolladores del kernel no quieren lidiar con archivos adjuntos o
+parches comprimidos; y pueden querer comentar líneas individuales de su
+parche, lo cual solo funciona de esa manera. Asegúrese de emplear un programa
+de correo que no altere los espacios ni los tabuladores. Una buena primera
+prueba es enviarse el correo a usted mismo, e intentar aplicar su
+propio parche. Si eso no funciona, arregle su programa de correo o
+reemplácelo hasta que funcione.
+
+Sobre todo, recuerde ser respetuoso con otros suscriptores.
+
+Colaborando con la comunidad
+----------------------------
+
+El objetivo de la comunidad del kernel es proporcionar el mejor kernel
+posible. Cuando envíe un parche para su aceptación, se revisará en sus
+méritos técnicos solamente. Entonces, ¿qué debería esperar?
+
+  - críticas
+  - comentarios
+  - peticiones de cambios
+  - peticiones de justificaciones
+  - silencio
+
+Recuerde, esto es parte de introducir su parche en el kernel. Tiene que ser
+capaz de recibir críticas y comentarios sobre sus parches, evaluarlos
+a nivel técnico y reelaborar sus parches o proporcionar razonamiento claro
+y conciso de por qué no se deben hacer tales cambios. Si no hay respuestas
+a su publicación, espere unos días e inténtelo de nuevo; a veces las cosas se
+pierden dado el gran volumen.
+
+¿Qué no debería hacer?
+
+  - esperar que su parche se acepte sin preguntas
+  - actuar de forma defensiva
+  - ignorar comentarios
+  - enviar el parche de nuevo, sin haber aplicado los cambios pertinentes
+
+En una comunidad que busca la mejor solución técnica posible, siempre habrá
+diferentes opiniones sobre lo beneficioso que es un parche. Tiene que ser
+cooperativo y estar dispuesto a adaptar su idea para que encaje dentro
+del kernel, o al menos esté dispuesto a demostrar que su idea vale la pena.
+Recuerde, estar equivocado es aceptable siempre y cuando esté dispuesto a
+trabajar hacia una solución que sea correcta.
+
+Es normal que las respuestas a su primer parche sean simplemente una lista
+de una docena de cosas que debe corregir. Esto **no** implica que su
+parche no será aceptado, y **no** es personal. Simplemente corrija todos
+los problemas planteados en su parche, y envíelo otra vez.
+
+Diferencias entre la comunidad kernel y las estructuras corporativas
+--------------------------------------------------------------------
+
+La comunidad del kernel funciona de manera diferente a la mayoría de los
+entornos de desarrollo tradicionales en empresas. Aquí hay una lista de
+cosas que puede intentar hacer para evitar problemas:
+
+  Cosas buenas que decir respecto a los cambios propuestos:
+
+    - "Esto arregla múltiples problemas."
+    - "Esto elimina 2000 líneas de código."
+    - "Aquí hay un parche que explica lo que intento describir."
+    - "Lo he testeado en 5 arquitecturas distintas..."
+    - "Aquí hay una serie de parches menores que..."
+    - "Esto mejora el rendimiento en máquinas típicas..."
+
+  Cosas negativas que debe evitar decir:
+
+    - "Lo hicimos así en AIX/ptx/Solaris, de modo que debe ser bueno..."
+    - "Llevo haciendo esto 20 años, de modo que..."
+    - "Esto lo necesita mi empresa para ganar dinero"
+    - "Esto es para la línea de nuestros productos Enterprise"
+    - "Aquí está el documento de 1000 páginas describiendo mi idea"
+    - "Llevo 6 meses trabajando en esto..."
+    - "Aquí está un parche de 5000 líneas que..."
+    - "He reescrito todo el desastre actual, y aquí está..."
+    - "Tengo un deadline, y este parche debe aplicarse ahora."
+
+Otra forma en que la comunidad del kernel es diferente a la mayoría de los
+entornos de trabajo tradicionales en ingeniería de software, es la
+naturaleza sin rostro de interacción. Una de las ventajas de utilizar el
+correo electrónico y el IRC como formas principales de comunicación es la
+no discriminación por motivos de género o raza. El entorno de trabajo del
+kernel de Linux acepta a mujeres y minorías porque todo lo que eres es una
+dirección de correo electrónico. El aspecto internacional también ayuda a
+nivelar el campo de juego porque no puede adivinar el género basado en
+el nombre de una persona. Un hombre puede llamarse Andrea y una mujer puede
+llamarse Pat. La mayoría de las mujeres que han trabajado en el kernel de
+Linux y han expresado una opinión han tenido experiencias positivas.
+
+La barrera del idioma puede causar problemas a algunas personas que no se
+sienten cómodas con el inglés. Un buen dominio del idioma puede ser
+necesario para transmitir ideas correctamente en las listas de correo, por
+lo que le recomendamos que revise sus correos electrónicos para asegurarse
+de que tengan sentido en inglés antes de enviarlos.
+
+Divida sus cambios
+---------------------
+
+La comunidad del kernel de Linux no acepta con gusto grandes fragmentos de
+código, sobre todo a la vez. Los cambios deben introducirse correctamente,
+discutidos y divididos en pequeñas porciones individuales. Esto es casi
+exactamente lo contrario de lo que las empresas están acostumbradas a hacer.
+Su propuesta también debe introducirse muy temprano en el proceso de
+desarrollo, de modo que pueda recibir comentarios sobre lo que está
+haciendo. También deje que la comunidad sienta que está trabajando con
+ellos, y no simplemente usándolos como un vertedero para su función. Sin
+embargo, no envíe 50 correos electrónicos de una vez a una lista de correo;
+su serie de parches debe casi siempre ser más pequeña que eso.
+
+Las razones para dividir las cosas son las siguientes:
+
+1) Los cambios pequeños aumentan la probabilidad de que sus parches sean
+   aplicados, ya que no requieren mucho tiempo o esfuerzo para verificar su
+   exactitud. Un parche de 5 líneas puede ser aplicado por un maintainer
+   con apenas una segunda mirada. Sin embargo, un parche de 500 líneas
+   puede tardar horas en ser revisado en términos de corrección (el tiempo
+   que toma es exponencialmente proporcional al tamaño del parche, o algo
+   así).
+
+   Los parches pequeños también facilitan la depuración cuando algo falla.
+   Es mucho más fácil retirar los parches uno por uno que diseccionar un
+   parche muy grande después de haber sido aplicado (y roto alguna cosa).
+
+2) Es importante no solo enviar pequeños parches, sino también reescribir
+   y simplificar (o simplemente reordenar) los parches antes de enviarlos.
+
+Esta es una analogía del desarrollador del kernel Al Viro (traducida):
+
+	*"Piense en un maestro que califica la tarea de un estudiante de
+	matemáticas. El maestro no quiere ver los intentos y errores del
+	estudiante antes de que se le ocurriera la solución. Quiere ver la
+	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
+	presentaría su trabajo intermedio antes de tener la solución final.*
+
+	*Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
+	revisores no quieren ver el proceso de pensamiento detrás de la solución
+	al problema que se está resolviendo. Quieren ver una solución simple y
+	elegante."*
+
+Puede resultar un reto mantener el equilibrio entre presentar una solución
+elegante y trabajar junto a la comunidad, discutiendo su trabajo inacabado.
+Por lo tanto, es bueno comenzar temprano en el proceso para obtener
+"feedback" y mejorar su trabajo, pero también mantenga sus cambios en
+pequeños trozos que puedan ser aceptados, incluso cuando toda su labor no
+esté lista para inclusión en un momento dado.
+
+También tenga en cuenta que no es aceptable enviar parches para su
+inclusión que están sin terminar y serán "arreglados más tarde".
+
+Justifique sus cambios
+----------------------
+
+Además de dividir sus parches, es muy importante que haga saber a la
+comunidad de Linux por qué debería agregar este cambio. Las nuevas
+características deben justificarse como necesarias y útiles.
+
+Documente sus cambios
+---------------------
+
+Cuando envíe sus parches, preste especial atención a lo que dice en el
+texto de su correo electrónico. Esta información se convertirá en el
+ChangeLog del parche, y se conservará para que todos la vean, todo el
+tiempo. Debe describir el parche por completo y contener:
+
+  - por qué los cambios son necesarios
+  - el diseño general de su propuesta
+  - detalles de implementación
+  - resultados de sus experimentos
+
+Para obtener más detalles sobre cómo debería quedar todo esto, consulte la
+sección ChangeLog del documento:
+
+  "The Perfect Patch"
+      https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+Todas estas cuestiones son a veces muy difíciles de conseguir. Puede
+llevar años perfeccionar estas prácticas (si es que lo hace). Es un proceso
+continuo de mejora que requiere mucha paciencia y determinación. Pero no se
+rinda, es posible. Muchos lo han hecho antes, y cada uno tuvo que comenzar
+exactamente donde está usted ahora.
+
+----------
+
+Gracias a Paolo Ciarrocchi que permitió que la sección "Development Process"
+se basara en el texto que había escrito (https://lwn.net/Articles/94386/),
+y a Randy Dunlap y Gerrit Huizenga por parte de la lista de cosas que
+debe y no debe decir. También gracias a Pat Mochel, Hanna Linder, Randy
+Dunlap, Kay Sievers, Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook,
+Andrew Morton, Andi Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk,
+Keri Harris, Frans Pop, David A. Wheeler, Junio Hamano, Michael Kerrisk y
+Alex Shepard por su revisión, comentarios y contribuciones. Sin su ayuda,
+este documento no hubiera sido posible.
+
+Maintainer: Greg Kroah-Hartman <greg@kroah.com>
diff --git a/Documentation/translations/sp_SP/index.rst b/Documentation/translations/sp_SP/index.rst
index 816d45e081e9..5b3f45d84955 100644
--- a/Documentation/translations/sp_SP/index.rst
+++ b/Documentation/translations/sp_SP/index.rst
@@ -70,3 +70,11 @@ En términos más generales, la documentación, como el kernel mismo, están en
 constante desarrollo. Las mejoras en la documentación siempre son
 bienvenidas; de modo que, si desea ayudar, únase a la lista de correo
 linux-doc en vger.kernel.org.
+
+Traducciones al español
+=======================
+
+.. toctree::
+   :maxdepth: 1
+
+   howto
-- 
2.34.1


^ permalink raw reply related	[relevance 2%]

* Re: [PATCH v3 2/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  2022-10-19 13:30  7% ` [PATCH v3 2/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
@ 2022-10-20  6:57  0%   ` Yanteng Si
  0 siblings, 0 replies; 200+ results
From: Yanteng Si @ 2022-10-20  6:57 UTC (permalink / raw)
  To: Rui Li, Alex Shi; +Cc: Jonathan Corbet, Wu XiangCheng, linux-doc, linux-kernel


On 10/19/22 21:30, Rui Li wrote:
> Translate the following documents into Chinese:
>
> - userspace-api/ebpf/index.rst
> - userspace-api/ebpf/syscall.rst
>
> Signed-off-by: Rui Li <me@lirui.org>

Reviewed-by: Yanteng Si <siyanteng@loongson.cn>


Thanks,

Yanteng

> ---
> Changes since v2:
> - Remove long English reference
> - Remove ebpf from TODO
>
> Changes since v1:
> - Translate bpf subcommand title
> - Align title
> - Add space after doc path
> ---
>   .../zh_CN/userspace-api/ebpf/index.rst        | 22 ++++++++++++++
>   .../zh_CN/userspace-api/ebpf/syscall.rst      | 29 +++++++++++++++++++
>   .../zh_CN/userspace-api/index.rst             |  6 +++-
>   3 files changed, 56 insertions(+), 1 deletion(-)
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
>
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> new file mode 100644
> index 000000000000..d52c7052f101
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> @@ -0,0 +1,22 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/index.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF 用户空间API
> +================
> +
> +eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
> +内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
> +括网络,跟踪和Linux安全模块(LSM)等。
> +
> +关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst 。
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   syscall
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> new file mode 100644
> index 000000000000..47e2a59ae45d
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> @@ -0,0 +1,29 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/syscall.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF Syscall
> +------------
> +
> +:作者:
> +    - Alexei Starovoitov <ast@kernel.org>
> +    - Joe Stringer <joe@wand.net.nz>
> +    - Michael Kerrisk <mtk.manpages@gmail.com>
> +
> +bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
> +
> +bpf() 子命令参考
> +~~~~~~~~~~~~~~~~
> +
> +子命令在以下内核代码中:
> +
> +include/uapi/linux/bpf.h
> +
> +.. Links:
> +.. _man-pages: https://www.kernel.org/doc/man-pages/
> +.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
> diff --git a/Documentation/translations/zh_CN/userspace-api/index.rst b/Documentation/translations/zh_CN/userspace-api/index.rst
> index 3b834fe7e33b..12c63d81c663 100644
> --- a/Documentation/translations/zh_CN/userspace-api/index.rst
> +++ b/Documentation/translations/zh_CN/userspace-api/index.rst
> @@ -21,6 +21,11 @@ Linux 内核用户空间API指南
>   
>   	   目录
>   
> +.. toctree::
> +   :maxdepth: 2
> +
> +   ebpf/index
> +
>   TODOList:
>   
>   * no_new_privs
> @@ -29,7 +34,6 @@ TODOList:
>   * unshare
>   * spec_ctrl
>   * accelerators/ocxl
> -* ebpf/index
>   * ioctl/index
>   * iommu
>   * media/index



* [PATCH v3 2/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  @ 2022-10-19 13:30  7% ` Rui Li
  2022-10-20  6:57  0%   ` Yanteng Si
  0 siblings, 1 reply; 200+ results
From: Rui Li @ 2022-10-19 13:30 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si
  Cc: Jonathan Corbet, Wu XiangCheng, linux-doc, linux-kernel, Rui Li

Translate the following documents into Chinese:

- userspace-api/ebpf/index.rst
- userspace-api/ebpf/syscall.rst

Signed-off-by: Rui Li <me@lirui.org>
---
Changes since v2:
- Remove long English reference
- Remove ebpf from TODO

Changes since v1:
- Translate bpf subcommand title
- Align title
- Add space after doc path
---
 .../zh_CN/userspace-api/ebpf/index.rst        | 22 ++++++++++++++
 .../zh_CN/userspace-api/ebpf/syscall.rst      | 29 +++++++++++++++++++
 .../zh_CN/userspace-api/index.rst             |  6 +++-
 3 files changed, 56 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst

diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
new file mode 100644
index 000000000000..d52c7052f101
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/index.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF 用户空间API
+================
+
+eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
+内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
+括网络,跟踪和Linux安全模块(LSM)等。
+
+关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst 。
+
+.. toctree::
+   :maxdepth: 1
+
+   syscall
diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
new file mode 100644
index 000000000000..47e2a59ae45d
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/syscall.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF Syscall
+------------
+
+:作者:
+    - Alexei Starovoitov <ast@kernel.org>
+    - Joe Stringer <joe@wand.net.nz>
+    - Michael Kerrisk <mtk.manpages@gmail.com>
+
+bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
+
+bpf() 子命令参考
+~~~~~~~~~~~~~~~~
+
+子命令在以下内核代码中:
+
+include/uapi/linux/bpf.h
+
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
diff --git a/Documentation/translations/zh_CN/userspace-api/index.rst b/Documentation/translations/zh_CN/userspace-api/index.rst
index 3b834fe7e33b..12c63d81c663 100644
--- a/Documentation/translations/zh_CN/userspace-api/index.rst
+++ b/Documentation/translations/zh_CN/userspace-api/index.rst
@@ -21,6 +21,11 @@ Linux 内核用户空间API指南
 
 	   目录
 
+.. toctree::
+   :maxdepth: 2
+
+   ebpf/index
+
 TODOList:
 
 * no_new_privs
@@ -29,7 +34,6 @@ TODOList:
 * unshare
 * spec_ctrl
 * accelerators/ocxl
-* ebpf/index
 * ioctl/index
 * iommu
 * media/index
-- 
2.30.2



* Re: [PATCH v2 1/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  2022-10-18 11:54  8% ` [PATCH v2 1/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
@ 2022-10-19 12:08  0%   ` Yanteng Si
  0 siblings, 0 replies; 200+ results
From: Yanteng Si @ 2022-10-19 12:08 UTC (permalink / raw)
  To: Rui Li, linux-doc, linux-kernel; +Cc: Alex Shi, Jonathan Corbet, Wu XiangCheng


On 10/18/22 19:54, Rui Li wrote:
> Translate the following documents into Chinese:
>
> - userspace-api/ebpf/index.rst
> - userspace-api/ebpf/syscall.rst
>
> Signed-off-by: Rui Li <me@lirui.org>
> ---
> Changes since v1:
> - Translate bpf subcommand title
> - Align title
> - Add space after doc path
> ---
>   .../zh_CN/userspace-api/ebpf/index.rst        | 22 +++++++++++++
>   .../zh_CN/userspace-api/ebpf/syscall.rst      | 31 +++++++++++++++++++
>   2 files changed, 53 insertions(+)
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
>
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> new file mode 100644
> index 000000000000..d52c7052f101
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> @@ -0,0 +1,22 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/index.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF 用户空间API
> +================
> +
> +eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
> +内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
> +括网络,跟踪和Linux安全模块(LSM)等。
> +
> +关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst 。
> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   syscall
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> new file mode 100644
> index 000000000000..17515728f544
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> @@ -0,0 +1,31 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/syscall.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF Syscall
> +------------
> +
> +:作者:
> +    - Alexei Starovoitov <ast@kernel.org>
> +    - Joe Stringer <joe@wand.net.nz>
> +    - Michael Kerrisk <mtk.manpages@gmail.com>
> +
> +bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
> +
> +bpf() 子命令参考
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +   :doc: eBPF Syscall Preamble
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +   :doc: eBPF Syscall Commands

This generates a lot of documentation in English.


See Documentation/translations/zh_CN/core-api/kernel-api.rst


Thanks,

Yanteng

> +
> +.. Links:
> +.. _man-pages: https://www.kernel.org/doc/man-pages/
> +.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html



* [PATCH v2 1/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  @ 2022-10-18 11:54  8% ` Rui Li
  2022-10-19 12:08  0%   ` Yanteng Si
  0 siblings, 1 reply; 200+ results
From: Rui Li @ 2022-10-18 11:54 UTC (permalink / raw)
  To: linux-doc, linux-kernel
  Cc: Alex Shi, Yanteng Si, Jonathan Corbet, Wu XiangCheng, Rui Li

Translate the following documents into Chinese:

- userspace-api/ebpf/index.rst
- userspace-api/ebpf/syscall.rst

Signed-off-by: Rui Li <me@lirui.org>
---
Changes since v1:
- Translate bpf subcommand title
- Align title
- Add space after doc path
---
 .../zh_CN/userspace-api/ebpf/index.rst        | 22 +++++++++++++
 .../zh_CN/userspace-api/ebpf/syscall.rst      | 31 +++++++++++++++++++
 2 files changed, 53 insertions(+)
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst

diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
new file mode 100644
index 000000000000..d52c7052f101
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/index.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF 用户空间API
+================
+
+eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
+内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
+括网络,跟踪和Linux安全模块(LSM)等。
+
+关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst 。
+
+.. toctree::
+   :maxdepth: 1
+
+   syscall
diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
new file mode 100644
index 000000000000..17515728f544
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/syscall.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF Syscall
+------------
+
+:作者:
+    - Alexei Starovoitov <ast@kernel.org>
+    - Joe Stringer <joe@wand.net.nz>
+    - Michael Kerrisk <mtk.manpages@gmail.com>
+
+bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
+
+bpf() 子命令参考
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Preamble
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Commands
+
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
-- 
2.30.2



* Re: [PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  2022-10-16 11:58  8%   ` [PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
@ 2022-10-17 13:21  0%     ` Yanteng Si
  0 siblings, 0 replies; 200+ results
From: Yanteng Si @ 2022-10-17 13:21 UTC (permalink / raw)
  To: Rui Li, Alex Shi, Jonathan Corbet, linux-doc, LKML; +Cc: Wu XiangCheng

CC  wu.xiangcheng@linux.dev

On 10/16/22 19:58, Rui Li wrote:
> Translate the following documents into Chinese:
>
> - userspace-api/ebpf/index.rst
> - userspace-api/ebpf/syscall.rst
>
> Signed-off-by: Rui Li <me@lirui.org>
> ---
>   .../zh_CN/userspace-api/ebpf/index.rst        | 22 +++++++++++++
>   .../zh_CN/userspace-api/ebpf/syscall.rst      | 31 +++++++++++++++++++
>   2 files changed, 53 insertions(+)
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
>   create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
>
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> new file mode 100644
> index 000000000000..9f0af275eb69
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
> @@ -0,0 +1,22 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/index.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF 用户空间API
> +==================
Alignment is required here, please remove some "=" .
> +
> +eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
> +内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
> +括网络,跟踪和Linux安全模块(LSM)等。
> +
> +关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst。

.../index.rst。 -> .../index.rst 。

> +
> +.. toctree::
> +   :maxdepth: 1
> +
> +   syscall
> diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> new file mode 100644
> index 000000000000..56bfa9bc7887
> --- /dev/null
> +++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
> @@ -0,0 +1,31 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. include:: ../../disclaimer-zh_CN.rst
> +
> +:Original: Documentation/userspace-api/ebpf/syscall.rst
> +
> +:翻译:
> +
> + 李睿 Rui Li <me@lirui.org>
> +
> +eBPF Syscall
> +------------
> +
> +:作者:
> +    - Alexei Starovoitov <ast@kernel.org>
> +    - Joe Stringer <joe@wand.net.nz>
> +    - Michael Kerrisk <mtk.manpages@gmail.com>
> +
> +bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
> +
> +bpf() subcommand reference

Translate it into Chinese.


Thanks,

Yanteng

> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +   :doc: eBPF Syscall Preamble
> +
> +.. kernel-doc:: include/uapi/linux/bpf.h
> +   :doc: eBPF Syscall Commands
> +
> +.. Links:
> +.. _man-pages: https://www.kernel.org/doc/man-pages/
> +.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html



* [RESEND PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
  @ 2022-10-17 13:27  8% ` Rui Li
  0 siblings, 0 replies; 200+ results
From: Rui Li @ 2022-10-17 13:27 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Jonathan Corbet, linux-doc, linux-kernel; +Cc: Rui Li

Translate the following documents into Chinese:

- userspace-api/ebpf/index.rst
- userspace-api/ebpf/syscall.rst

Signed-off-by: Rui Li <me@lirui.org>
---
 .../zh_CN/userspace-api/ebpf/index.rst        | 22 +++++++++++++
 .../zh_CN/userspace-api/ebpf/syscall.rst      | 31 +++++++++++++++++++
 2 files changed, 53 insertions(+)
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst

diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
new file mode 100644
index 000000000000..9f0af275eb69
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/index.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF 用户空间API
+==================
+
+eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
+内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
+括网络,跟踪和Linux安全模块(LSM)等。
+
+关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst。
+
+.. toctree::
+   :maxdepth: 1
+
+   syscall
diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
new file mode 100644
index 000000000000..56bfa9bc7887
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/syscall.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF Syscall
+------------
+
+:作者:
+    - Alexei Starovoitov <ast@kernel.org>
+    - Joe Stringer <joe@wand.net.nz>
+    - Michael Kerrisk <mtk.manpages@gmail.com>
+
+bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
+
+bpf() subcommand reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Preamble
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Commands
+
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
-- 
2.30.2



* [PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf Chinese translation
       [not found]     ` <cover.1665919802.git.me@lirui.org>
@ 2022-10-16 11:58  8%   ` Rui Li
  2022-10-17 13:21  0%     ` Yanteng Si
  0 siblings, 1 reply; 200+ results
From: Rui Li @ 2022-10-16 11:58 UTC (permalink / raw)
  To: Alex Shi, Yanteng Si, Jonathan Corbet, linux-doc, LKML

Translate the following documents into Chinese:

- userspace-api/ebpf/index.rst
- userspace-api/ebpf/syscall.rst

Signed-off-by: Rui Li <me@lirui.org>
---
 .../zh_CN/userspace-api/ebpf/index.rst        | 22 +++++++++++++
 .../zh_CN/userspace-api/ebpf/syscall.rst      | 31 +++++++++++++++++++
 2 files changed, 53 insertions(+)
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
 create mode 100644 Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst

diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
new file mode 100644
index 000000000000..9f0af275eb69
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/index.rst
@@ -0,0 +1,22 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/index.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF 用户空间API
+==================
+
+eBPF是一种在Linux内核中提供沙箱化运行环境的机制,它可以在不改变内核源码或加载
+内核模块的情况下扩展运行时和编写工具。eBPF程序能够被附加到各种内核子系统中,包
+括网络,跟踪和Linux安全模块(LSM)等。
+
+关于eBPF的内部内核文档,请查看 Documentation/bpf/index.rst。
+
+.. toctree::
+   :maxdepth: 1
+
+   syscall
diff --git a/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
new file mode 100644
index 000000000000..56bfa9bc7887
--- /dev/null
+++ b/Documentation/translations/zh_CN/userspace-api/ebpf/syscall.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: ../../disclaimer-zh_CN.rst
+
+:Original: Documentation/userspace-api/ebpf/syscall.rst
+
+:翻译:
+
+ 李睿 Rui Li <me@lirui.org>
+
+eBPF Syscall
+------------
+
+:作者:
+    - Alexei Starovoitov <ast@kernel.org>
+    - Joe Stringer <joe@wand.net.nz>
+    - Michael Kerrisk <mtk.manpages@gmail.com>
+
+bpf syscall的主要信息可以在 `man-pages`_ 中的 `bpf(2)`_ 找到。
+
+bpf() subcommand reference
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Preamble
+
+.. kernel-doc:: include/uapi/linux/bpf.h
+   :doc: eBPF Syscall Commands
+
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
-- 
2.30.2




* [PATCH v2 2/2] Documentation: Add HOWTO Spanish translation into rst based build system
  @ 2022-10-14 14:24  2%   ` Carlos Bilbao
  0 siblings, 0 replies; 200+ results
From: Carlos Bilbao @ 2022-10-14 14:24 UTC (permalink / raw)
  To: corbet; +Cc: linux-doc, linux-kernel, bilbao, Carlos Bilbao

Add Spanish translation of HOWTO document into rst based documentation
build system.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 Documentation/translations/sp_SP/howto.rst | 617 +++++++++++++++++++++
 Documentation/translations/sp_SP/index.rst |   8 +
 2 files changed, 625 insertions(+)
 create mode 100644 Documentation/translations/sp_SP/howto.rst

diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
new file mode 100644
index 000000000000..f1375651a1a8
--- /dev/null
+++ b/Documentation/translations/sp_SP/howto.rst
@@ -0,0 +1,617 @@
+.. include:: ./disclaimer-sp.rst
+
+:Original: :ref:`Documentation/process/howto.rst <process_howto>`
+:Translator: Carlos Bilbao <carlos.bilbao@amd.com>
+
+.. _sp_process_howto:
+
+Cómo participar en el desarrollo del kernel de Linux
+====================================================
+
+Este documento es el principal punto de partida. Contiene instrucciones
+sobre cómo convertirse en desarrollador del kernel de Linux y explica cómo
+trabajar con él y en su desarrollo. El documento no tratará ningún aspecto
+técnico relacionado con la programación del kernel, pero le ayudará
+guiándole por el camino correcto.
+
+Si algo en este documento quedara obsoleto, envíe parches al maintainer de
+este archivo, que se encuentra en la parte superior del documento.
+
+Introducción
+------------
+¿De modo que quiere descubrir cómo convertirse en un/a desarrollador/a del
+kernel de Linux? Tal vez su jefe le haya dicho, "Escriba un driver de
+Linux para este dispositivo." El objetivo de este documento es enseñarle
+todo cuanto necesita para conseguir esto, describiendo el proceso por el
+que debe pasar, y con indicaciones de cómo trabajar con la comunidad.
+También trata de explicar las razones por las cuales la comunidad trabaja
+de la forma en que lo hace.
+
+El kernel está principalmente escrito en C, con algunas partes que son
+dependientes de la arquitectura en ensamblador. Un buen conocimiento de C
+es necesario para desarrollar en el kernel. El lenguaje ensamblador (en
+cualquier arquitectura) no es necesario a menos que planee realizar
+desarrollo de bajo nivel para dicha arquitectura. Aunque no es un perfecto
+sustituto para una educación sólida en C y/o años de experiencia, los
+siguientes libros sirven, como mínimo, como referencia:
+
+- "The C Programming Language" de Kernighan y Ritchie [Prentice Hall]
+- "Practical C Programming" de Steve Oualline [O'Reilly]
+- "C:  A Reference Manual" de Harbison y Steele [Prentice Hall]
+
+El kernel está escrito usando GNU C y la cadena de herramientas GNU. Si
+bien se adhiere al estándar ISO C89, utiliza una serie de extensiones que
+no aparecen en dicho estándar. El kernel usa un C independiente de entorno,
+sin depender de la biblioteca C estándar, por lo que algunas partes del
+estándar C no son compatibles. Divisiones de long long arbitrarios o
+de coma flotante no son permitidas. En ocasiones, puede ser difícil de
+entender las suposiciones que el kernel hace respecto a la cadena de
+herramientas y las extensiones que usa, y desafortunadamente no hay
+referencia definitiva para estas. Consulte las páginas de información de
+gcc (`info gcc`) para obtener información al respecto.
+
+Recuerde que está tratando de aprender a trabajar con una comunidad de
+desarrollo existente. Es un grupo diverso de personas, con altos estándares
+de código, estilo y procedimiento. Estas normas han sido creadas a lo
+largo del tiempo en función de lo que se ha encontrado que funciona mejor
+para un equipo tan grande y geográficamente disperso. Trate de aprender
+tanto como le sea posible acerca de estos estándares antes de tiempo, ya
+que están bien documentados; no espere que la gente se adapte a usted o a
+la forma de hacer las cosas en su empresa.
+
+Cuestiones legales
+------------------
+El código fuente del kernel de Linux se publica bajo licencia GPL. Por
+favor, revise el archivo COPYING, presente en la carpeta principal del
+código fuente, para detalles de la licencia. Si tiene alguna otra pregunta
+sobre licencias, contacte a un abogado, no pregunte en listas de discusión
+del kernel de Linux. Las personas en estas listas no son abogados, y no debe
+confiar en sus opiniones en materia legal.
+
+Para preguntas y respuestas más frecuentes sobre la licencia GPL, consulte:
+
+	https://www.gnu.org/licenses/gpl-faq.html
+
+Documentación
+--------------
+El código fuente del kernel de Linux tiene una gran variedad de documentos
+que son increíblemente valiosos para aprender a interactuar con la
+comunidad del kernel. Cuando se agregan nuevas funciones al kernel, se
+recomienda que se incluyan nuevos archivos de documentación que expliquen
+cómo usar la función. Cuando un cambio en el kernel hace que la interfaz
+que el kernel expone al espacio de usuario cambie, se recomienda que envíe la
+información o un parche en las páginas del manual que expliquen el cambio
+a mtk.manpages@gmail.com, y CC la lista linux-api@vger.kernel.org.
+
+Esta es la lista de archivos que están en el código fuente del kernel y son
+de obligada lectura:
+
+  :ref:`Documentation/admin-guide/README.rst <readme>`
+    Este archivo ofrece una breve descripción del kernel de Linux y
+    describe lo que es necesario hacer para configurar y compilar el
+    kernel. Quienes sean nuevos en el kernel deben comenzar aquí.
+
+  :ref:`Documentation/process/changes.rst <changes>`
+    Este archivo proporciona una lista de los niveles mínimos de varios
+    paquetes que son necesarios para construir y ejecutar el kernel
+    exitosamente.
+
+  :ref:`Documentation/process/coding-style.rst <codingstyle>`
+    Esto describe el estilo de código del kernel de Linux y algunas de las
+    razones detrás de esto. Se espera que todo el código nuevo siga las
+    directrices de este documento. La mayoría de los maintainers solo
+    aceptarán parches si se siguen estas reglas, y muchas personas solo
+    revisan el código si tiene el estilo adecuado.
+
+  :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
+    Este archivo describe en gran detalle cómo crear con éxito y enviar un
+    parche, que incluye (pero no se limita a):
+
+       - Contenidos del correo electrónico (email)
+       - Formato del email
+       - A quién se debe enviar
+
+    Seguir estas reglas no garantiza el éxito (ya que todos los parches son
+    sujetos a escrutinio de contenido y estilo), pero en caso de no seguir
+    dichas reglas, el fracaso es prácticamente garantizado.
+    Otras excelentes descripciones de cómo crear parches correctamente son:
+
+	"The Perfect Patch"
+		https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+	"Linux kernel patch submission format"
+		https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html
+
+  :ref:`Documentation/process/stable-api-nonsense.rst <stable_api_nonsense>`
+    Este archivo describe la lógica detrás de la decisión consciente de
+    no tener una API estable dentro del kernel, incluidas cosas como:
+
+      - Capas intermedias del subsistema (por compatibilidad?)
+      - Portabilidad de drivers entre sistemas operativos
+      - Mitigar el cambio rápido dentro del árbol de fuentes del kernel (o
+        prevenir cambios rápidos)
+
+    Este documento es crucial para comprender la filosofía del desarrollo
+    de Linux y es muy importante para las personas que se mudan a Linux
+    tras desarrollar otros sistemas operativos.
+
+  :ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>`
+    Si cree que ha encontrado un problema de seguridad en el kernel de
+    Linux, siga los pasos de este documento para ayudar a notificar a los
+    desarrolladores del kernel y ayudar a resolver el problema.
+
+  :ref:`Documentation/process/management-style.rst <managementstyle>`
+    Este documento describe cómo operan los maintainers del kernel de Linux
+    y los valores compartidos detrás de sus metodologías. Esta es una
+    lectura importante para cualquier persona nueva en el desarrollo del
+    kernel (o cualquier persona que simplemente sienta curiosidad por
+    el campo IT), ya que clarifica muchos conceptos erróneos y confusiones
+    comunes sobre el comportamiento único de los maintainers del kernel.
+
+  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+    Este archivo describe las reglas sobre cómo se suceden las versiones
+    del kernel estable, y qué hacer si desea obtener un cambio en una de
+    estas publicaciones.
+
+  :ref:`Documentation/process/kernel-docs.rst <kernel_docs>`
+    Una lista de documentación externa relativa al desarrollo del kernel.
+    Por favor consulte esta lista si no encuentra lo que está buscando
+    dentro de la documentación del kernel.
+
+  :ref:`Documentation/process/applying-patches.rst <applying_patches>`
+    Una buena introducción que describe exactamente qué es un parche y cómo
+    aplicarlo a las diferentes ramas de desarrollo del kernel.
+
+El kernel también tiene una gran cantidad de documentos que pueden ser
+generados automáticamente desde el propio código fuente o desde
+ReStructuredText markups (ReST), como este. Esto incluye una descripción
+completa de la API en el kernel y reglas sobre cómo manejar cerrojos
+(locking) correctamente.
+
+Todos estos documentos se pueden generar como PDF o HTML ejecutando::
+
+	make pdfdocs
+	make htmldocs
+
+respectivamente desde el directorio fuente principal del kernel.
+
+Los documentos que utilizan el markup ReST se generarán en
+Documentation/output. También se pueden generar en formatos LaTeX y ePub
+con::
+
+	make latexdocs
+	make epubdocs
+
+Convertirse en un/a desarrollador/a de kernel
+---------------------------------------------
+
+Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
+el proyecto Linux KernelNewbies:
+
+	https://kernelnewbies.org
+
+Consiste en una útil lista de correo donde puede preguntar casi cualquier
+tipo de pregunta básica de desarrollo del kernel (asegúrese de buscar en
+los archivos primero, antes de preguntar algo que ya ha sido respondido en
+el pasado.) También tiene un canal IRC que puede usar para hacer preguntas
+en tiempo real, y una gran cantidad de documentación útil para ir
+aprendiendo sobre el desarrollo del kernel de Linux.
+
+El sitio web tiene información básica sobre la organización del código,
+subsistemas, y proyectos actuales (tanto dentro como fuera del árbol).
+También describe alguna información logística básica, como cómo compilar
+un kernel y aplicar un parche.
+
+Si no sabe por dónde quiere empezar, pero quiere buscar alguna tarea que
+comenzar a hacer para unirse a la comunidad de desarrollo del kernel,
+acuda al proyecto Linux Kernel Janitor:
+
+	https://kernelnewbies.org/KernelJanitors
+
+Es un gran lugar para comenzar. Describe una lista de problemas
+relativamente simples que deben limpiarse y corregirse dentro del árbol de
+fuentes del kernel de Linux. Trabajando con los
+desarrolladores a cargo de este proyecto, aprenderá los conceptos básicos
+para incluir su parche en el árbol del kernel de Linux, y posiblemente
+descubrirá en qué dirección trabajar a continuación, si no tiene ya
+una idea.
+
+Antes de realizar cualquier modificación real al código del kernel de
+Linux, es imperativo entender cómo funciona el código en cuestión. Para
+este propósito, nada es mejor que leerlo directamente (lo más complicado
+está bien comentado), tal vez incluso con la ayuda de herramientas
+especializadas. Una de esas herramientas que se recomienda especialmente
+es el proyecto Linux Cross-Reference, que es capaz de presentar el código
+fuente en un formato de página web indexada y autorreferencial. Una
+excelente puesta al día del repositorio del código del kernel se puede
+encontrar en:
+
+	https://elixir.bootlin.com/
+
+El proceso de desarrollo
+------------------------
+
+El proceso de desarrollo del kernel de Linux consiste actualmente de
+diferentes "branches" (ramas) con muchos distintos subsistemas específicos
+a cada una de ellas. Las diferentes ramas son:
+
+  - El código principal de Linus (mainline tree)
+  - Varios árboles estables con múltiples major numbers
+  - Subsistemas específicos
+  - linux-next, para integración y testing
+
+Mainline tree (Árbol principal)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+El mainline tree es mantenido por Linus Torvalds, y puede encontrarse en
+https://kernel.org o en su repo.  El proceso de desarrollo es el siguiente:
+
+  - Tan pronto como se lanza un nuevo kernel, se abre una ventana de dos
+    semanas. Durante este período de tiempo, los maintainers pueden enviar
+    grandes modificaciones a Linus, por lo general los parches que ya se
+    han incluido en el linux-next durante unas semanas. La forma preferida
+    de enviar grandes cambios es usando git (la herramienta de
+    administración de código fuente del kernel, más información al respecto
+    en https://git-scm.com/), pero los parches simples también son válidos.
+  - Después de dos semanas, se lanza un kernel -rc1 y la atención se centra
+    en hacer el kernel nuevo lo más estable ("sólido") posible. La mayoría
+    de los parches en este punto deben arreglar una regresión. Los errores
+    que siempre han existido no son regresiones, por lo tanto, solo envíe
+    este tipo de correcciones si son importantes. Tenga en cuenta que se
+    podría aceptar un controlador (o sistema de archivos) completamente
+    nuevo después de -rc1 porque no hay riesgo de causar regresiones con
+    tal cambio, siempre y cuando el cambio sea autónomo y no afecte áreas
+    fuera del código que se está agregando. git se puede usar para enviar
+    parches a Linus después de que se lance -rc1, pero los parches también
+    deben ser enviados a una lista de correo pública para su revisión.
+  - Se lanza un nuevo -rc cada vez que Linus considera que el árbol git
+    actual está en un estado razonablemente sano y adecuado para la prueba.
+    La meta es lanzar un nuevo kernel -rc cada semana.
+  - El proceso continúa hasta que el kernel se considera "listo", y esto
+    puede durar alrededor de 6 semanas.
+
+Vale la pena mencionar lo que Andrew Morton escribió en las listas de
+correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
+
+	*"Nadie sabe cuándo se publicará un nuevo kernel, pues esto sucede
+	según el estado de los bugs, no de una cronología preconcebida."*
+
+Varios árboles estables con múltiples major numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Los kernels con versiones de 3 partes son kernels estables. Estos contienen
+correcciones relativamente pequeñas y críticas para problemas de seguridad
+o regresiones importantes descubiertas en una publicación de código dada.
+Cada lanzamiento en una gran serie estable incrementa la tercera parte del
+número de versión, manteniendo las dos primeras partes iguales.
+
+Esta es la rama recomendada para los usuarios que quieren la versión
+estable más reciente del kernel, y no están interesados en ayudar a probar
+versiones en desarrollo/experimentales.
+
+Los árboles estables son mantenidos por el equipo "estable"
+<stable@vger.kernel.org>, y se liberan (publican) según lo dicten las
+necesidades. El período de liberación normal es de aproximadamente dos
+semanas, pero puede ser más largo si no hay problemas apremiantes. Un
+problema relacionado con la seguridad, en cambio, puede causar un
+lanzamiento casi instantáneamente.
+
+El archivo :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+en el árbol del kernel documenta qué tipos de cambios son aceptables para
+el árbol estable y cómo funciona el proceso de lanzamiento.
+
+Subsistemas específicos
+~~~~~~~~~~~~~~~~~~~~~~~~
+Los maintainers de los diversos subsistemas del kernel --- y también muchos
+desarrolladores de subsistemas del kernel --- exponen su estado actual de
+desarrollo en repositorios fuente. De esta manera, otros pueden ver lo que
+está sucediendo en las diferentes áreas del kernel. En áreas donde el
+desarrollo es rápido, se le puede pedir a un desarrollador que base sus
+envíos en tal árbol del subsistema del kernel, para evitar conflictos entre
+este y otros trabajos ya en curso.
+
+La mayoría de estos repositorios son árboles git, pero también hay otros
+SCM en uso, o colas de parches que se publican como series quilt. Las
+direcciones de estos repositorios de subsistemas se enumeran en el archivo
+MAINTAINERS. Muchos de estos se pueden ver en https://git.kernel.org/.
+
+Antes de que un parche propuesto se incluya en dicho árbol de subsistema,
+está sujeto a revisión, que ocurre principalmente en las listas de correo
+(ver la sección respectiva a continuación). Para varios subsistemas del
+kernel, esta revisión se rastrea con la herramienta patchwork. Patchwork
+ofrece una interfaz web que muestra publicaciones de parches, cualquier
+comentario sobre un parche o revisiones a él, y los maintainers pueden
+marcar los parches como en revisión, aceptado, o rechazado. La mayoría de
+estos sitios de trabajo de parches se enumeran en
+
+https://patchwork.kernel.org/.
+
+linux-next, para integración y testing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Antes de que las actualizaciones de los árboles de subsistemas se combinen
+con el árbol principal, necesitan probar su integración. Para ello, existe
+un repositorio especial de pruebas en el que se encuentran casi todos los
+árboles de subsistema, actualizado casi a diario:
+
+	https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
+
+De esta manera, linux-next ofrece una perspectiva resumida de lo que se
+espera que entre en el kernel principal en el próximo período de "merge"
+(fusión de código). Los testers aventureros son bienvenidos a probar
+linux-next en ejecución.
+
+Reportar bugs
+-------------
+
+El archivo 'Documentation/admin-guide/reporting-issues.rst' en el
+directorio principal del kernel describe cómo informar un posible bug del
+kernel y detalles sobre qué tipo de información necesitan los
+desarrolladores del kernel para ayudar a rastrear la fuente del problema.
+
+Gestión de informes de bugs
+------------------------------
+
+Una de las mejores formas de poner en práctica sus habilidades de hacking
+es arreglando errores reportados por otras personas. No solo ayudará a
+hacer el kernel más estable, también aprenderá a solucionar problemas del
+mundo real y mejorará sus habilidades, y otros desarrolladores se darán
+cuenta de su presencia. La corrección de errores es una de las mejores
+formas de ganar méritos entre desarrolladores, porque no a muchas personas
+les gusta perder el tiempo arreglando los errores de otras personas.
+
+Para trabajar en informes de errores ya reportados, busque un subsistema
+que le interese. Verifique en el archivo MAINTAINERS dónde se informan los
+errores de ese subsistema; con frecuencia será una lista de correo, rara
+vez un rastreador de errores (bugtracker). Busque en los archivos de dicho
+lugar para informes recientes y ayude donde lo crea conveniente. También es
+posible que desee revisar https://bugzilla.kernel.org para informes de
+errores; solo un puñado de subsistemas del kernel lo emplean activamente
+para informar o rastrear; sin embargo, todos los errores para todo el kernel
+se archivan allí.
+
+Listas de correo
+-----------------
+
+Como se explica en algunos de los documentos anteriores, la mayoría de
+desarrolladores del kernel participan en la lista de correo del kernel de
+Linux. Detalles sobre cómo suscribirse y darse de baja de la lista se
+pueden encontrar en:
+
+	http://vger.kernel.org/vger-lists.html#linux-kernel
+
+Existen archivos de la lista de correo en la web en muchos lugares
+distintos. Utilice un motor de búsqueda para encontrar estos archivos. Por
+ejemplo:
+
+	http://dir.gmane.org/gmane.linux.kernel
+
+Es muy recomendable que busque en los archivos sobre el tema que desea
+tratar, antes de publicarlo en la lista. Un montón de cosas ya discutidas
+en detalle solo se registran en los archivos de la lista de correo.
+
+La mayoría de los subsistemas individuales del kernel también tienen sus
+propias listas de correo donde hacen sus esfuerzos de desarrollo. Revise el
+archivo MAINTAINERS para obtener referencias de cuáles son estas listas para
+los diferentes grupos.
+
+Muchas de las listas están alojadas en kernel.org. La información sobre
+estas puede ser encontrada en:
+
+	http://vger.kernel.org/vger-lists.html
+
+Recuerde mantener buenos hábitos de comportamiento al usar las listas.
+Aunque un poco cursi, la siguiente URL tiene algunas pautas simples para
+interactuar con la lista (o cualquier lista):
+
+	http://www.albion.com/netiquette/
+
+Si varias personas responden a su correo, el CC (lista de destinatarios)
+puede hacerse bastante grande. No elimine a nadie de la lista CC: sin una
+buena razón, ni responda solo a la dirección de la lista. Acostúmbrese
+a recibir correos dos veces, una del remitente y otra de la lista, y no
+intente ajustar esto agregando encabezados de correo astutos; a la gente no
+le gustará.
+
+Recuerde mantener intacto el contexto y la atribución de sus respuestas,
+mantenga las líneas "El hacker John Kernel escribió ...:" en la parte
+superior de su respuesta, y agregue sus declaraciones entre las secciones
+individuales citadas en lugar de escribir en la parte superior del
+correo electrónico.
+
+Si incluye parches en su correo, asegúrese de que sean texto legible sin
+formato como se indica en :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
+Los desarrolladores del kernel no quieren lidiar con archivos adjuntos o
+parches comprimidos; y pueden querer comentar líneas individuales de su
+parche, lo cual solo funciona de esa manera. Asegúrese de emplear un programa
+de correo que no altere los espacios ni los tabuladores. Una buena primera
+prueba es enviarse el correo a usted mismo, e intentar aplicar su
+propio parche. Si eso no funciona, arregle su programa de correo o
+reemplácelo hasta que funcione.
+
+Sobre todo, recuerde ser respetuoso con otros suscriptores.
+
+Colaborando con la comunidad
+----------------------------
+
+El objetivo de la comunidad del kernel es proporcionar el mejor kernel
+posible. Cuando envíe un parche para su aceptación, se revisará en sus
+méritos técnicos solamente. Entonces, ¿qué debería esperar?
+
+  - críticas
+  - comentarios
+  - peticiones de cambios
+  - peticiones de justificaciones
+  - silencio
+
+Recuerde, esto es parte de introducir su parche en el kernel. Tiene que ser
+capaz de recibir críticas y comentarios sobre sus parches, evaluarlos
+a nivel técnico y reelaborar sus parches o proporcionar razonamiento claro
+y conciso de por qué no se deben hacer tales cambios. Si no hay respuestas
+a su publicación, espere unos días e inténtelo de nuevo; a veces las cosas se
+pierden dado el gran volumen.
+
+¿Qué no debería hacer?
+
+  - esperar que su parche se acepte sin preguntas
+  - actuar de forma defensiva
+  - ignorar comentarios
+  - enviar el parche de nuevo, sin haber aplicado los cambios pertinentes
+
+En una comunidad que busca la mejor solución técnica posible, siempre habrá
+diferentes opiniones sobre lo beneficioso que es un parche. Tiene que ser
+cooperativo y estar dispuesto a adaptar su idea para que encaje dentro
+del kernel, o al menos esté dispuesto a demostrar que su idea vale la pena.
+Recuerde, estar equivocado es aceptable siempre y cuando esté dispuesto a
+trabajar hacia una solución que sea correcta.
+
+Es normal que las respuestas a su primer parche sean simplemente una lista
+de una docena de cosas que debe corregir. Esto **no** implica que su
+parche no será aceptado, y **no** es personal. Simplemente corrija todos
+los problemas planteados en su parche, y envíelo otra vez.
+
+Diferencias entre la comunidad kernel y las estructuras corporativas
+--------------------------------------------------------------------
+
+La comunidad del kernel funciona de manera diferente a la mayoría de los
+entornos de desarrollo tradicionales en empresas. Aquí hay una lista de
+cosas que puede intentar hacer para evitar problemas:
+
+  Cosas buenas que decir respecto a los cambios propuestos:
+
+    - "Esto arregla múltiples problemas."
+    - "Esto elimina 2000 líneas de código."
+    - "Aquí hay un parche que explica lo que intento describir."
+    - "Lo he testeado en 5 arquitecturas distintas..."
+    - "Aquí hay una serie de parches menores que..."
+    - "Esto mejora el rendimiento en máquinas típicas..."
+
+  Cosas negativas que debe evitar decir:
+
+    - "Lo hicimos así en AIX/ptx/Solaris, de modo que debe ser bueno..."
+    - "Llevo haciendo esto 20 años, de modo que..."
+    - "Esto lo necesita mi empresa para ganar dinero"
+    - "Esto es para la línea de nuestros productos Enterprise"
+    - "Aquí está el documento de 1000 páginas describiendo mi idea"
+    - "Llevo 6 meses trabajando en esto..."
+    - "Aquí está un parche de 5000 líneas que..."
+    - "He reescrito todo el desastre actual, y aquí está..."
+    - "Tengo un deadline, y este parche debe aplicarse ahora."
+
+Otra forma en que la comunidad del kernel es diferente a la mayoría de los
+entornos de trabajo tradicionales en ingeniería de software, es la
+naturaleza sin rostro de interacción. Una de las ventajas de utilizar el
+correo electrónico y el IRC como formas principales de comunicación es la
+no discriminación por motivos de género o raza. El entorno de trabajo del
+kernel de Linux acepta a mujeres y minorías porque todo lo que eres es una
+dirección de correo electrónico. El aspecto internacional también ayuda a
+nivelar el campo de juego porque no puede adivinar el género basado en
+el nombre de una persona. Un hombre puede llamarse Andrea y una mujer puede
+llamarse Pat. La mayoría de las mujeres que han trabajado en el kernel de
+Linux y han expresado una opinión han tenido experiencias positivas.
+
+La barrera del idioma puede causar problemas a algunas personas que no se
+sienten cómodas con el inglés. Un buen dominio del idioma puede ser
+necesario para transmitir ideas correctamente en las listas de correo, por
+lo que le recomendamos que revise sus correos electrónicos para asegurarse
+de que tengan sentido en inglés antes de enviarlos.
+
+Divida sus cambios
+---------------------
+
+La comunidad del kernel de Linux no acepta con gusto grandes fragmentos de
+código, sobre todo de una vez. Los cambios deben introducirse correctamente,
+discutirse y dividirse en pequeñas porciones individuales. Esto es casi
+exactamente lo contrario de lo que las empresas están acostumbradas a hacer.
+Su propuesta también debe introducirse muy temprano en el proceso de
+desarrollo, de modo que pueda recibir comentarios sobre lo que está
+haciendo. También deje que la comunidad sienta que está trabajando con
+ellos, y no simplemente usándolos como un vertedero para su función. Sin
+embargo, no envíe 50 correos electrónicos a la vez a una lista de correo;
+su serie de parches debe casi siempre ser más pequeña que eso.
+
+Las razones para dividir las cosas son las siguientes:
+
+1) Los cambios pequeños aumentan la probabilidad de que sus parches sean
+   aplicados, ya que no requieren mucho tiempo o esfuerzo para verificar su
+   exactitud. Un parche de 5 líneas puede ser aplicado por un maintainer
+   con apenas una segunda mirada. Sin embargo, un parche de 500 líneas
+   puede tardar horas en ser revisado en términos de corrección (el tiempo
+   que toma es exponencialmente proporcional al tamaño del parche, o algo
+   así).
+
+   Los parches pequeños también facilitan la depuración cuando algo falla.
+   Es mucho más fácil retirar los parches uno por uno que diseccionar un
+   parche muy grande después de haber sido aplicado (y roto alguna cosa).
+
+2) Es importante no solo enviar pequeños parches, sino también reescribir
+   y simplificar (o simplemente reordenar) los parches antes de enviarlos.
+
+Esta es una analogía del desarrollador del kernel Al Viro (traducida):
+
+	*"Piense en un maestro que califica la tarea de un estudiante de
+	matemáticas. El maestro no quiere ver los intentos y errores del
+	estudiante antes de que se le ocurriera la solución. Quiere ver la
+	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
+	presentaría su trabajo intermedio antes de tener la solución final.*
+
+	*Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
+	revisores no quieren ver el proceso de pensamiento detrás de la solución
+	al problema que se está resolviendo. Quieren ver una solución simple y
+	elegante."*
+
+Puede resultar un reto mantener el equilibrio entre presentar una solución
+elegante y trabajar junto a la comunidad, discutiendo su trabajo inacabado.
+Por lo tanto, es bueno comenzar temprano en el proceso para obtener
+"feedback" y mejorar su trabajo, pero también mantenga sus cambios en
+pequeños trozos que puedan ser aceptados, incluso cuando toda su labor no
+esté lista para su inclusión en un momento dado.
+
+También tenga en cuenta que no es aceptable enviar parches para su
+inclusión que están sin terminar y serán "arreglados más tarde".
+
+Justifique sus cambios
+----------------------
+
+Además de dividir sus parches, es muy importante que haga saber a la
+comunidad de Linux por qué deberían agregar este cambio. Las nuevas
+características deben justificarse como necesarias y útiles.
+
+Documente sus cambios
+---------------------
+
+Cuando envíe sus parches, preste especial atención a lo que dice en el
+texto de su correo electrónico. Esta información se convertirá en el
+ChangeLog del parche, y se conservará para que todos la vean, todo el
+tiempo. Debe describir el parche por completo y contener:
+
+  - por qué los cambios son necesarios
+  - el diseño general de su propuesta
+  - detalles de implementación
+  - resultados de sus experimentos
+
+Para obtener más detalles sobre cómo debería quedar todo esto, consulte la
+sección ChangeLog del documento:
+
+  "The Perfect Patch"
+      https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+Todas estas cuestiones son a veces muy difíciles de conseguir. Puede
+llevar años perfeccionar estas prácticas (si es que lo hace). Es un proceso
+continuo de mejora que requiere mucha paciencia y determinación. Pero no se
+rinda, es posible. Muchos lo han hecho antes, y cada uno tuvo que comenzar
+exactamente donde está usted ahora.
+
+----------
+
+Gracias a Paolo Ciarrocchi que permitió que la sección "Development Process"
+se basara en el texto que había escrito (https://lwn.net/Articles/94386/),
+y a Randy Dunlap y Gerrit Huizenga por parte de la lista de cosas que
+debe y no debe decir. También gracias a Pat Mochel, Hanna Linder, Randy
+Dunlap, Kay Sievers, Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook,
+Andrew Morton, Andi Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk,
+Keri Harris, Frans Pop, David A. Wheeler, Junio Hamano, Michael Kerrisk y
+Alex Shepard por su revisión, comentarios y contribuciones. Sin su ayuda,
+este documento no hubiera sido posible.
+
+Maintainer: Greg Kroah-Hartman <greg@kroah.com>
diff --git a/Documentation/translations/sp_SP/index.rst b/Documentation/translations/sp_SP/index.rst
index 2800041168f4..1d5d1154d309 100644
--- a/Documentation/translations/sp_SP/index.rst
+++ b/Documentation/translations/sp_SP/index.rst
@@ -70,3 +70,11 @@ En términos más generales, la documentación, como el kernel mismo, están en
 constante desarrollo. Las mejoras en la documentación siempre son
 bienvenidas; de modo que, si desea ayudar, únase a la lista de correo de
 linux-doc en vger.kernel.org.
+
+Traducciones al español
+=======================
+
+.. toctree::
+   :maxdepth: 1
+
+   howto
-- 
2.34.1


^ permalink raw reply related	[relevance 2%]

* Re: [PATCH 2/2] Documentation: Add HOWTO Spanish translation into rst based build system
  2022-10-14  9:21  0%   ` Bagas Sanjaya
@ 2022-10-14 12:58  0%     ` Carlos Bilbao
  0 siblings, 0 replies; 200+ results
From: Carlos Bilbao @ 2022-10-14 12:58 UTC (permalink / raw)
  To: Bagas Sanjaya; +Cc: corbet, linux-doc, linux-kernel, bilbao, ojeda

On 10/14/22 04:21, Bagas Sanjaya wrote:

> ¡Hola Carlos! Gracias for start writing Spanish translations. However,
> the patch can be improved, see below.
Hola Bagas, thanks for your feedback :)
>
> On Thu, Oct 13, 2022 at 01:48:16PM -0500, Carlos Bilbao wrote:
>> This commit adds Spanish translation of HOWTO document into rst based
>> documentation build system.
>>
> Better say "Translate HOWTO document into Spanish".
So, for the commit message here I just replicated what prior folks did,
see:

For Japanese:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/Documentation/translations/ja_JP?h=v6.0&id=f012733894d36ff687862e9cd3b02ee980c61416

For Korean:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/Documentation/translations/ko_KR/howto.rst?h=v6.0&id=ba42c574fc8b803ec206785b7b91325c05810422

I think I will keep that commit message; it is slightly more informative
than "Translate HOWTO document into Spanish".

>
>> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
>> ---
>>   Documentation/translations/sp_SP/howto.rst | 619 +++++++++++++++++++++
>>   Documentation/translations/sp_SP/index.rst |   7 +
>>   2 files changed, 626 insertions(+)
>>   create mode 100644 Documentation/translations/sp_SP/howto.rst
>>
>> diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
>> new file mode 100644
>> index 000000000000..4cf8fa6b9f7c
>> --- /dev/null
>> +++ b/Documentation/translations/sp_SP/howto.rst
>> @@ -0,0 +1,619 @@
>> +.. include:: ./disclaimer-sp.rst
>> +
>> +:Original: :ref:`Documentation/process/howto.rst <process_howto>`
>> +:Translator: Carlos Bilbao <carlos.bilbao@amd.com>
>> +
>> +.. _sp_process_howto:
>> +
>> +Cómo participar en el desarrollo del kernel de Linux
>> +====================================================
>> +
>> +Este documento es el principal punto de partida. Contiene instrucciones
>> +sobre cómo convertirse en desarrollador del kernel de Linux y explica cómo
>> +trabajar con el y en su desarrollo. El documento no tratará ningún aspecto
>> +técnico relacionado con la programación del kernel, pero le ayudará
>> +guiándole por el camino correcto.
>> +
>> +Si algo en este documento quedara obsoleto, envíe parches al maintainer de
>> +este archivo, que se encuentra en la parte superior del documento.
>> +
>> +Introducción
>> +------------
>> +¿De modo que quiere descubrir como convertirse en un/a desarrollador/a del
>> +kernel de Linux? Tal vez su jefe le haya dicho, "Escriba un driver de
>> +Linux para este dispositivo." El objetivo de este documento en enseñarle
>> +todo cuanto necesita para conseguir esto, describiendo el proceso por el
>> +que debe pasar, y con indicaciones de como trabajar con la comunidad.
>> +También trata de explicar las razones por las cuales la comunidad trabaja
>> +de la forma en que lo hace.
>> +
>> +El kernel esta principalmente escrito en C, con algunas partes que son
>> +dependientes de la arquitectura en ensamblador. Un buen conocimiento de C
>> +es necesario para desarrollar en el kernel. Lenguaje ensamblador (en
>> +cualquier arquitectura) no es necesario excepto que planee realizar
>> +desarrollo de bajo nivel para dicha arquitectura. Aunque no es un perfecto
>> +sustituto para una educación sólida en C y/o años de experiencia, los
>> +siguientes libros sirven, como mínimo, como referencia:
>> +
>> +- "The C Programming Language" de Kernighan e Ritchie [Prentice Hall]
>> +- "Practical C Programming" de Steve Oualline [O'Reilly]
>> +- "C:  A Reference Manual" de Harbison and Steele [Prentice Hall]
>> +
>> +El kernel está escrito usando GNU C y la cadena de herramientas GNU. Si
>> +bien se adhiere al estándar ISO C89, utiliza una serie de extensiones que
>> +no aparecen en dicho estándar. El kernel usa un C independiente de entorno,
>> +sin depender de la biblioteca C estándar, por lo que algunas partes del
>> +estándar C no son compatibles. Divisiones de long long arbitrarios o
>> +de coma flotante no son permitidas. En ocasiones, puede ser difícil de
>> +entender las suposiciones que el kernel hace respecto a la cadena de
>> +herramientas y las extensiones que usa, y desafortunadamente no hay
>> +referencia definitiva para estos. Consulte las páginas de información de
>> +gcc (`info gcc`) para obtener información al respecto.
>> +
>> +Recuerde que está tratando de aprender a trabajar con una comunidad de
>> +desarrollo existente. Es un grupo diverso de personas, con altos estándares
>> +de codificación, estilo y procedimiento. Estas normas han sido creadas a lo
>> +largo del tiempo en función de lo que se ha encontrado que funciona mejor
>> +para un equipo tan grande y geográficamente disperso. Trate de aprender
>> +tanto como le sea posible acerca de estos estándares antes de tiempo, ya
>> +que están bien documentados; no espere que la gente se adapte a usted o a
>> +su forma de ser de hacer las cosas.
>> +
>> +Cuestiones legales
>> +------------------
>> +El código fuente del kernel de Linux se publica bajo licencia GPL. Por
>> +favor, revise el archivo COPYING, presente en la carpeta principal del
>> +fuente, para detalles de la licencia. Si tiene alguna otra pregunta
>> +sobre licencias, contacte a un abogado, no pregunte en listas de discusión
>> +del kernel de Linux. Las personas en estas listas no son abogadas, y no
>> +debe confiar en sus opiniones en materia legal.
>> +
>> +Para preguntas y respuestas más frecuentes sobre la licencia GPL, consulte:
>> +
>> +	https://www.gnu.org/licenses/gpl-faq.html
>> +
>> +Documentacion
>> +--------------
>> +El código fuente del kernel de Linux tiene una gran variedad de documentos
>> +que son increíblemente valiosos para aprender a interactuar con la
>> +comunidad del kernel. Cuando se agregan nuevas funciones al kernel, se
>> +recomienda que se incluyan nuevos archivos de documentación que expliquen
>> +cómo usar la función. Cuando un cambio en el kernel hace que la interfaz
>> +que el kernel expone espacio de usuario cambie, se recomienda que envíe la
>> +información o un parche en las páginas del manual que expliquen el cambio
>> +a mtk.manpages@gmail.com, y CC la lista linux-api@vger.kernel.org.
>> +
>> +Esta es la lista de archivos que están en el código fuente del kernel y son
>> +de obligada lectura:
>> +
>> +  :ref:`Documentation/admin-guide/README.rst <readme>`
>> +    Este archivo ofrece una breve descripción del kernel de Linux y
>> +    describe lo que es necesario hacer para configurar y compilar el
>> +    kernel. Quienes sean nuevos en el kernel deben comenzar aquí.
>> +
>> +  :ref:`Documentation/process/changes.rst <changes>`
>> +    Este archivo proporciona una lista de los niveles mínimos de varios
>> +    paquetes que son necesarios para construir y ejecutar el kernel
>> +    exitosamente.
>> +
>> +  :ref:`Documentation/process/coding-style.rst <codingstyle>`
>> +    Esto describe el estilo de código del kernel de Linux y algunas de los
>> +    razones detrás de esto. Se espera que todo el código nuevo siga las
>> +    directrices de este documento. La mayoría de los maintainers solo
>> +    aceptarán parches si se siguen estas reglas, y muchas personas solo
>> +    revisan el código si tiene el estilo adecuado.
>> +
>> +  :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
>> +    Este archivo describe en gran detalle cómo crear con éxito y enviar un
>> +    parche, que incluye (pero no se limita a):
>> +
>> +       - Contenidos del correo electrónico (email)
>> +       - Formato del email
>> +       - A quien se debe enviar
>> +
>> +    Seguir estas reglas no garantiza el éxito (ya que todos los parches son
>> +    sujetos a escrutinio de contenido y estilo), pero en caso de no seguir
>> +    dichas reglas, el fracaso es prácticamente garantizado.
>> +    Otras excelentes descripciones de cómo crear parches correctamente son:
>> +
>> +	"The Perfect Patch"
>> +		https://www.ozlabs.org/~akpm/stuff/tpp.txt
>> +
>> +	"Linux kernel patch submission format"
>> +		https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html
>> +
>> +  :ref:`Documentation/process/stable-api-nonsense.rst <stable_api_nonsense>`
>> +    Este archivo describe la lógica detrás de la decisión consciente de
>> +    no tener una API estable dentro del kernel, incluidas cosas como:
>> +
>> +      - Capas intermedias del subsistema (por compatibilidad?)
>> +      - Portabilidad de drivers entre sistemas operativos
>> +      - Mitigar el cambio rápido dentro del árbol de fuentes del kernel (o
>> +        prevenir cambios rápidos)
>> +
>> +     Este documento es crucial para comprender la filosofía del desarrollo
>> +     de Linux y es muy importante para las personas que se mudan a Linux
>> +     tras desarrollar otros sistemas operativos.
>> +
>> +  :ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>`
>> +    Si cree que ha encontrado un problema de seguridad en el kernel de
>> +    Linux, siga los pasos de este documento para ayudar a notificar a los
>> +    desarrolladores del kernel y ayudar a resolver el problema.
>> +
>> +  :ref:`Documentation/process/management-style.rst <managementstyle>`
>> +    Este documento describe cómo operan los maintainers del kernel de Linux
>> +    y los valores compartidos detrás de sus metodologías. Esta es una
>> +    lectura importante para cualquier persona nueva en el desarrollo del
>> +    kernel (o cualquier persona que simplemente sienta curiosidad por
>> +    el campo IT), ya que clarifica muchos conceptos erróneos y confusiones
>> +    comunes sobre el comportamiento único de los maintainers del kernel.
>> +
>> +  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
>> +    Este archivo describe las reglas sobre cómo se suceden las versiones
>> +    del kernel estable, y qué hacer si desea obtener un cambio en una de
>> +    estas publicaciones.
>> +
>> +  :ref:`Documentation/process/kernel-docs.rst <kernel_docs>`
>> +    Una lista de documentación externa relativa al desarrollo del kernel.
>> +    Por favor consulte esta lista si no encuentra lo que están buscando
>> +    dentro de la documentación del kernel.
>> +
>> +  :ref:`Documentation/process/applying-patches.rst <applying_patches>`
>> +    Una buena introducción que describe exactamente qué es un parche y cómo
>> +    aplicarlo a las diferentes ramas de desarrollo del kernel.
>> +
>> +El kernel también tiene una gran cantidad de documentos que pueden ser
>> +generados automáticamente desde el propio código fuente o desde
>> +ReStructuredText markups (ReST), como este. Esto incluye un descripción
>> +completa de la API en el kernel y reglas sobre cómo manejar cerrojos
>> +(locking) correctamente.
>> +
>> +Todos estos documentos se pueden generar como PDF o HTML ejecutando::
>> +
>> +	make pdfdocs
>> +	make htmldocs
>> +
>> +respectivamente desde el directorio fuente principal del kernel.
>> +
>> +Los documentos que utilizan el markup ReST se generarán en
>> +Documentation/output. También se pueden generar en formatos LaTeX y ePub
>> +con::
>> +
>> +	make latexdocs
>> +	make epubdocs
>> +
>> +Convertirse en un/a desarrollador/a de kernel
>> +-------------------------------------------
>> +
>> +Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
>> +el proyecto Linux KernelNewbies:
>> +
>> +	https://kernelnewbies.org
>> +
>> +Consiste en una útil lista de correo donde puede preguntar casi cualquier
>> +tipo de pregunta básica de desarrollo del kernel (asegúrese de buscar en
>> +los archivos primero, antes de preguntar algo que ya ha sido respondido en
>> +el pasado.) También tiene un canal IRC que puede usar para hacer preguntas
>> +en en tiempo real, y una gran cantidad de documentación útil que es útil
>> +para ir aprendiendo sobre el desarrollo del kernel de Linux.
>> +
>> +El sitio web tiene información básica sobre la organización del código,
>> +subsistemas, y proyectos actuales (tanto dentro como fuera del árbol).
>> +También describe alguna información logística básica, como cómo compilar
>> +un kernel y aplicar un parche.
>> +
>> +Si no sabe por dónde quiere empezar, pero quieres buscar alguna tarea que
>> +comenzar a hacer para unirse a la comunidad de desarrollo del kernel,
>> +acuda al proyecto Linux Kernel Janitor:
>> +
>> +	https://kernelnewbies.org/KernelJanitors
>> +
>> +Es un gran lugar para comenzar. Describe una lista de problemas
>> +relativamente simples que deben limpiarse y corregirse dentro del codigo
>> +fuente del kernel de Linux árbol de fuentes. Trabajando con los
>> +desarrolladores a cargo de este proyecto, aprenderá los conceptos básicos
>> +para incluir su parche en el árbol del kernel de Linux, y posiblemente
>> +descubrir en la dirección en que trabajar a continuación, si no tiene ya
>> +una idea.
>> +
>> +Antes de realizar cualquier modificación real al código del kernel de
>> +Linux, es imperativo entender cómo funciona el código en cuestión. Para
>> +este propósito, nada es mejor que leerlo directamente (lo más complicado
>> +está bien comentado), tal vez incluso con la ayuda de herramientas
>> +especializadas. Una de esas herramientas que se recomienda especialmente
>> +es el proyecto Linux Cross-Reference, que es capaz de presentar el código
>> +fuente en un formato de página web indexada y autorreferencial. Una
>> +excelente puesta al día del repositorio del código del kernel se puede
>> +encontrar en:
>> +
>> +	https://elixir.bootlin.com/
>> +
>> +El proceso de desarrollo
>> +------------------------
>> +
>> +El proceso de desarrollo del kernel de Linux consiste actualmente en
>> +diferentes "branches" (ramas) con muchos distintos subsistemas específicos
>> +a cada una de ellas. Las diferentes ramas son:
>> +
>> +  - El código principal de Linus (mainline tree)
>> +  - Varios árboles estables con múltiples major numbers
>> +  - Subsistemas específicos
>> +  - linux-next, para integración y testing
>> +
>> +Mainline tree (Árbol principal)
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +El mainline tree es mantenido por Linus Torvalds, y puede encontrarse en
>> +https://kernel.org o en su repo.  El proceso de desarrollo es el siguiente:
>> +
>> +  - Tan pronto como se lanza un nuevo kernel, se abre una ventana de dos
>> +    semanas. Durante este período de tiempo, los maintainers pueden enviar
>> +    grandes modificaciones a Linus, por lo general los parches que ya se
>> +    han incluido en el linux-next durante unas semanas. La forma preferida
>> +    de enviar grandes cambios es usando git (la herramienta de
>> +    administración de código fuente del kernel, más información al respecto
>> +    en https://git-scm.com/), pero los parches simples también son válidos.
>> +  - Después de dos semanas, se lanza un kernel -rc1 y la atención se centra
>> +    en hacer que el kernel nuevo sea lo más estable ("sólido") posible. La
>> +    mayoría de los parches en este punto debe arreglar una regresión. Los
>> +    errores que siempre han existido no son regresiones, por lo tanto, solo
>> +    envíe este tipo de correcciones si son importantes. Tenga en cuenta que
>> +    se podría aceptar un controlador (o sistema de archivos) completamente
>> +    nuevo después de -rc1 porque no hay riesgo de causar regresiones con
>> +    tal cambio, siempre y cuando el cambio sea autónomo y no afecte áreas
>> +    fuera del código que se está agregando. git se puede usar para enviar
>> +    parches a Linus después de que se lance -rc1, pero los parches también
>> +    deben ser enviados a una lista de correo pública para su revisión.
>> +  - Se lanza un nuevo -rc cada vez que Linus considera que el árbol git
>> +    actual está en un estado razonablemente sano y adecuado para la prueba.
>> +    La meta es lanzar un nuevo kernel -rc cada semana.
>> +  - El proceso continúa hasta que el kernel se considera "listo", y esto
>> +    puede durar alrededor de 6 semanas.
>> +
>> +Vale la pena mencionar lo que Andrew Morton escribió en las listas de
>> +correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
>> +
>> +	*"Nadie sabe cuándo se publicara un nuevo kernel, porque esto sucede
>> +    de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
>> +    una línea temporal preconcebida."*
>> +
>> +Varios árboles estables con múltiples major numbers
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Los kernels con versiones de 3 partes son kernels estables. Estos contienen
>> +correcciones relativamente pequeñas y críticas para problemas de seguridad
>> +o importantes regresiones descubiertas para una publicación de código.
>> +Cada lanzamiento en una gran serie estable incrementa la tercera parte del
>> +número de versión, manteniendo las dos primeras partes iguales.
>> +
>> +Esta es la rama recomendada para los usuarios que quieren la versión
>> +estable más reciente del kernel, y no están interesados en ayudar a probar
>> +versiones en desarrollo/experimentales.
>> +
>> +Los árboles estables son mantenidos por el equipo "estable"
>> +<stable@vger.kernel.org>, y se liberan (publican) según lo dicten las
>> +necesidades. El período de liberación normal es de aproximadamente dos
>> +semanas, pero puede ser más largo si no hay problemas apremiantes. Un
>> +problema relacionado con la seguridad, en cambio, puede causar un
>> +lanzamiento casi instantáneamente.
>> +
>> +El archivo :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
>> +en el árbol del kernel documenta qué tipos de cambios son aceptables para
>> +el árbol estable y cómo funciona el proceso de lanzamiento.
>> +
>> +Subsistemas específicos
>> +~~~~~~~~~~~~~~~~~~~~~~~~
>> +Los maintainers de los diversos subsistemas del kernel --- y también muchos
>> +desarrolladores de subsistemas del kernel --- exponen su estado actual de
>> +desarrollo en repositorios fuente. De esta manera, otros pueden ver lo que
>> +está sucediendo en las diferentes áreas del kernel. En áreas donde el
>> +desarrollo es rápido, se le puede pedir a un desarrollador que base sus
>> +envíos en tal árbol del subsistema del kernel, para evitar conflictos entre
>> +este y otros trabajos ya en curso.
>> +
>> +La mayoría de estos repositorios son árboles git, pero también hay otros
>> +SCM en uso, o colas de parches que se publican como series quilt. Las
>> +direcciones de estos repositorios de subsistemas se enumeran en el archivo
>> +MAINTAINERS. Muchos de estos se pueden ver en https://git.kernel.org/.
>> +
>> +Antes de que un parche propuesto se incluya en dicho árbol de subsistema,
>> +está sujeto a revisión, que ocurre principalmente en las listas de correo
>> +(ver la sección respectiva a continuación). Para varios subsistemas del
>> +kernel, esta revisión se rastrea con la herramienta patchwork. Patchwork
>> +ofrece una interfaz web que muestra publicaciones de parches, cualquier
>> +comentario sobre un parche o revisiones a él, y los maintainers pueden
>> +marcar los parches como en revisión, aceptado, o rechazado. La mayoría de
>> +estos sitios de trabajo de parches se enumeran en
>> +
>> +https://patchwork.kernel.org/.
>> +
>> +linux-next, para integración y testing
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Antes de que las actualizaciones de los árboles de subsistemas se combinen
>> +con el árbol principal, necesitan probar su integración. Para ello, existe
>> +un repositorio especial de pruebas en el que se encuentran casi todos los
>> +árboles de subsistema, actualizado casi a diario:
>> +
>> +	https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
>> +
>> +De esta manera, linux-next ofrece una perspectiva resumida de lo que se
>> +espera que entre en el kernel principal en el próximo período de "merge"
>> +(fusión de código). Los testers aventureros son bienvenidos a probar
>> +linux-next en ejecución.
>> +
>> +Reportar bugs
>> +-------------
>> +
>> +El archivo 'Documentation/admin-guide/reporting-issues.rst' en el
>> +directorio principal del kernel describe cómo informar un posible bug del
>> +kernel y detalles sobre qué tipo de información necesitan los
>> +desarrolladores del kernel para ayudar a rastrear la fuente del problema.
>> +
>> +Gestión de informes de bugs
>> +------------------------------
>> +
>> +Una de las mejores formas de poner en práctica sus habilidades de hacking
>> +es arreglando errores reportados por otras personas. No solo ayudará a
>> +hacer el kernel más estable, también aprenderá a solucionar problemas del
>> +mundo real y mejorará sus habilidades, y otros desarrolladores se darán
>> +cuenta de su presencia. La corrección de errores es una de las mejores
>> +formas de ganar méritos entre desarrolladores, porque no a muchas personas
>> +les gusta perder el tiempo arreglando los errores de otras personas.
>> +
>> +Para trabajar en informes de errores ya reportados, busque un subsistema
>> +que le interese. Verifique en el archivo MAINTAINERS dónde se informan los
>> +errores de ese subsistema; con frecuencia será una lista de correo, rara
>> +vez un rastreador de errores (bugtracker). Busque en los archivos de dicho
>> +lugar para informes recientes y ayude donde lo crea conveniente. También es
>> +posible que desee revisar https://bugzilla.kernel.org para informes de
>> +errores; solo un puñado de subsistemas del kernel lo emplean activamente
>> +para informar o rastrear; sin embargo, todos los errores para todo el kernel
>> +se archivan allí.
>> +
>> +Listas de correo
>> +-----------------
>> +
>> +Como se explica en algunos de los documentos anteriores, la mayoría de
>> +desarrolladores del kernel participan en la lista de correo del kernel de
>> +Linux. Detalles sobre cómo suscribirse y darse de baja de la lista se
>> +pueden encontrar en:
>> +
>> +	http://vger.kernel.org/vger-lists.html#linux-kernel
>> +
>> +Existen archivos de la lista de correo en la web en muchos lugares
>> +distintos. Utilice un motor de búsqueda para encontrar estos archivos. Por
>> +ejemplo:
>> +
>> +	http://dir.gmane.org/gmane.linux.kernel
>> +
>> +Es muy recomendable que busque en los archivos sobre el tema que desea
>> +tratar, antes de publicarlo en la lista. Un montón de cosas ya discutidas
>> +en detalle solo se registran en los archivos de la lista de correo.
>> +
>> +La mayoría de los subsistemas individuales del kernel también tienen sus
>> +propias listas de correo donde hacen sus esfuerzos de desarrollo. Revise el
>> +archivo MAINTAINERS para obtener referencias de cuáles son estas listas para
>> +los diferentes grupos.
>> +
>> +Muchas de las listas están alojadas en kernel.org. La información sobre
>> +estas puede ser encontrada en:
>> +
>> +	http://vger.kernel.org/vger-lists.html
>> +
>> +Recuerde mantener buenos hábitos de comportamiento al usar las listas.
>> +Aunque un poco cursi, la siguiente URL tiene algunas pautas simples para
>> +interactuar con la lista (o cualquier lista):
>> +
>> +	http://www.albion.com/netiquette/
>> +
>> +Si varias personas responden a su correo, el CC (lista de destinatarios)
>> +puede hacerse bastante grande. No elimine a nadie de la lista CC: sin una
>> +buena razón, o no responda solo a la dirección de la lista. Acostúmbrese
>> +a recibir correos dos veces, una del remitente y otra de la lista, y no
>> +intente ajustar esto agregando encabezados de correo astutos, a la gente no
>> +le gustará.
>> +
>> +Recuerde mantener intacto el contexto y la atribución de sus respuestas,
>> +mantenga las líneas "El hacker John Kernel escribió ...:" en la parte
>> +superior de su respuesta, y agregue sus declaraciones entre las secciones
>> +individuales citadas en lugar de escribir en la parte superior del
>> +correo electrónico.
>> +
>> +Si incluye parches en su correo, asegúrese de que sean texto legible sin
>> +formato como se indica en :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
>> +Los desarrolladores del kernel no quieren lidiar con archivos adjuntos o
>> +parches comprimidos; y pueden querer comentar líneas individuales de su
>> +parche, lo cual solo funciona de esa manera. Asegúrese de emplear un programa
>> +de correo que no altere los espacios ni los tabuladores. Una buena primera
>> +prueba es enviarse el correo a usted mismo, e intentar aplicar su
>> +propio parche. Si eso no funciona, arregle su programa de correo o
>> +reemplácelo hasta que funcione.
>> +
>> +Sobre todo, recuerde ser respetuoso con otros suscriptores.
>> +
>> +Colaborando con la comunidad
>> +----------------------------
>> +
>> +El objetivo de la comunidad del kernel es proporcionar el mejor kernel
>> +posible. Cuando envíe un parche para su aceptación, se revisará en sus
>> +méritos técnicos solamente. Entonces, ¿qué debería esperar?
>> +
>> +  - críticas
>> +  - comentarios
>> +  - peticiones de cambios
>> +  - peticiones de justificaciones
>> +  - silencio
>> +
>> +Recuerde, esto es parte de introducir su parche en el kernel. Tiene que ser
>> +capaz de recibir críticas y comentarios sobre sus parches, evaluarlos
>> +a nivel técnico y re-elaborar sus parches o proporcionar razonamiento claro
>> +y conciso de por qué no se deben hacer tales cambios. Si no hay respuestas
>> +a su publicación, espere unos días e inténtelo de nuevo; a veces las cosas se
>> +pierden dado el gran volumen.
>> +
>> +¿Qué no debería hacer?
>> +
>> +  - esperar que su parche se acepte sin preguntas
>> +  - actuar de forma defensiva
>> +  - ignorar comentarios
>> +  - enviar el parche de nuevo, sin haber aplicado los cambios pertinentes
>> +
>> +En una comunidad que busca la mejor solución técnica posible, siempre habrá
>> +diferentes opiniones sobre lo beneficioso que es un parche. Tiene que ser
>> +cooperativo y estar dispuesto a adaptar su idea para que encaje dentro
>> +del kernel, o al menos esté dispuesto a demostrar que su idea vale la pena.
>> +Recuerde, estar equivocado es aceptable siempre y cuando esté dispuesto a
>> +trabajar hacia una solución que sea correcta.
>> +
>> +Es normal que las respuestas a su primer parche sean simplemente una lista
>> +de una docena de cosas que debe corregir. Esto **no** implica que su
>> +parche no será aceptado, y **no** es personal. Simplemente corrija todos
>> +los problemas planteados en su parche, y envíelo otra vez.
>> +
>> +Diferencias entre la comunidad kernel y las estructuras corporativas
>> +--------------------------------------------------------------------
>> +
>> +La comunidad del kernel funciona de manera diferente a la mayoría de los
>> +entornos de desarrollo tradicionales en empresas. Aquí hay una lista de
>> +cosas que puede intentar hacer para evitar problemas:
>> +
>> +  Cosas buenas que decir respecto a los cambios propuestos:
>> +
>> +    - "Esto arregla múltiples problemas."
>> +    - "Esto elimina 2000 líneas de código."
>> +    - "Aquí hay un parche que explica lo que intento describir."
>> +    - "Lo he testeado en 5 arquitecturas distintas..."
>> +    - "Aquí hay una serie de parches menores que..."
>> +    - "Esto mejora el rendimiento en máquinas típicas..."
>> +
>> +  Cosas negativas que debe evitar decir:
>> +
>> +    - "Lo hicimos así en AIX/ptx/Solaris, de modo que debe ser bueno..."
>> +    - "Llevo haciendo esto 20 años, de modo que..."
>> +    - "Esto lo necesita mi empresa para ganar dinero"
>> +    - "Esto es para la línea de nuestros productos Enterprise"
>> +    - "Aquí está el documento de 1000 páginas describiendo mi idea"
>> +    - "Llevo 6 meses trabajando en esto..."
>> +    - "Aquí está un parche de 5000 líneas que..."
>> +    - "He reescrito todo el desastre actual, y aquí está..."
>> +    - "Tengo un deadline, y este parche debe aplicarse ahora."
>> +
>> +Otra forma en que la comunidad del kernel es diferente a la mayoría de los
>> +entornos de trabajo tradicionales en ingeniería de software, es la
>> +naturaleza sin rostro de interacción. Una de las ventajas de utilizar el
>> +correo electrónico y el IRC como formas principales de comunicación es la
>> +no discriminación por motivos de género o raza. El entorno de trabajo del
>> +kernel de Linux acepta a mujeres y minorías porque todo lo que eres es una
>> +dirección de correo electrónico. El aspecto internacional también ayuda a
>> +nivelar el campo de juego porque no puede adivinar el género basado en
>> +el nombre de una persona. Un hombre puede llamarse Andrea y una mujer puede
>> +llamarse Pat. La mayoría de las mujeres que han trabajado en el kernel de
>> +Linux y han expresado una opinión han tenido experiencias positivas.
>> +
>> +La barrera del idioma puede causar problemas a algunas personas que no se
>> +sienten cómodas con el inglés. Un buen dominio del idioma puede ser
>> +necesario para transmitir ideas correctamente en las listas de correo, por
>> +lo que le recomendamos que revise sus correos electrónicos para asegurarse
>> +de que tengan sentido en inglés antes de enviarlos.
>> +
>> +Divida sus cambios
>> +---------------------
>> +
>> +La comunidad del kernel de Linux no acepta con gusto grandes fragmentos de
>> +código, sobre todo a la vez. Los cambios deben introducirse correctamente,
>> +discutidos y divididos en pequeñas porciones individuales. Esto es casi
>> +exactamente lo contrario de lo que las empresas están acostumbradas a hacer.
>> +Su propuesta también debe introducirse muy temprano en el proceso de
>> +desarrollo, de modo que pueda recibir comentarios sobre lo que está
>> +haciendo. También deje que la comunidad sienta que está trabajando con
>> +ellos, y no simplemente usándolos como un vertedero para su función. Sin
>> +embargo, no envíe 50 correos electrónicos a la vez a una lista de correo;
>> +su serie de parches debe casi siempre ser más pequeña que eso.
>> +
>> +Las razones para dividir las cosas son las siguientes:
>> +
>> +1) Los cambios pequeños aumentan la probabilidad de que sus parches sean
>> +   aplicados, ya que no requieren mucho tiempo o esfuerzo para verificar su
>> +   exactitud. Un parche de 5 líneas puede ser aplicado por un maintainer
>> +   con apenas una segunda mirada. Sin embargo, un parche de 500 líneas
>> +   puede tardar horas en ser revisado en términos de corrección (el tiempo
>> +   que toma es exponencialmente proporcional al tamaño del parche, o algo
>> +   así).
>> +
>> +   Los parches pequeños también facilitan la depuración cuando algo falla.
>> +   Es mucho más fácil retirar los parches uno por uno que diseccionar un
>> +   parche muy grande después de haber sido aplicado (y roto alguna cosa).
>> +
>> +2) Es importante no solo enviar pequeños parches, sino también reescribir
>> +   y simplificar (o simplemente reordenar) los parches antes de enviarlos.
>> +
>> +Esta es una analogía del desarrollador del kernel Al Viro (traducida):
>> +
>> +	*"Piense en un maestro que califica la tarea de un estudiante de
>> +	matemáticas. El maestro no quiere ver los intentos y errores del
>> +	estudiante antes de que se le ocurriera la solución. Quiere ver la
>> +	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
>> +	presentaría su trabajo intermedio antes de tener la solución final.*
>> +
>> +	* Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
>> +	revisores no quieren ver el proceso de pensamiento detrás de la solución
>> +	al problema que se está resolviendo. Quieren ver un solución simple y
>> +	elegante."*
>> +
>> +Puede resultar un reto mantener el equilibrio entre presentar una solución
>> +elegante y trabajar junto a la comunidad, discutiendo su trabajo inacabado.
>> +Por lo tanto, es bueno comenzar temprano en el proceso para obtener
>> +"feedback" y mejorar su trabajo, pero también mantenga sus cambios en
>> +pequeños trozos que pueden ser aceptados, incluso cuando toda su labor no
>> +esté lista para inclusión en un momento dado.
>> +
>> +También tenga en cuenta que no es aceptable enviar parches para su
>> +inclusión que están sin terminar y serán "arreglados más tarde".
>> +
>> +Justifique sus cambios
>> +----------------------
>> +
>> +Además de dividir sus parches, es muy importante que deje a la comunidad de
>> +Linux sabe por qué deberían agregar este cambio. Nuevas características
>> +debe justificarse como necesarias y útiles.
>> +
>> +Documente sus cambios
>> +--------------------
>> +
>> +Cuando envíe sus parches, preste especial atención a lo que dice en el
>> +texto de su correo electrónico. Esta información se convertirá en el
>> +ChangeLog del parche, y se conservará para que todos la vean, todo el
>> +tiempo. Debe describir el parche por completo y contener:
>> +
>> +  - por qué los cambios son necesarios
>> +  - el diseño general de su propuesta
>> +  - detalles de implementación
>> +  - resultados de sus experimentos
>> +
>> +Para obtener más detalles sobre cómo debería quedar todo esto, consulte la
>> +sección ChangeLog del documento:
>> +
>> +  "The Perfect Patch"
>> +      https://www.ozlabs.org/~akpm/stuff/tpp.txt
>> +
>> +Todas estas cuestiones son a veces muy difíciles de conseguir. Puede
>> +llevar años perfeccionar estas prácticas (si es que lo hace). Es un proceso
>> +continuo de mejora que requiere mucha paciencia y determinación. Pero no se
>> +rinda, es posible. Muchos lo han hecho antes, y cada uno tuvo que comenzar
>> +exactamente donde está usted ahora.
>> +
>> +
>> +----------
>> +
>> +Gracias a Paolo Ciarrocchi que permitió que la sección "Development Process"
>> +se basara en el texto que había escrito (https://lwn.net/Articles/94386/),
>> +y a Randy Dunlap y Gerrit Huizenga por parte de la lista de cosas que
>> +debe y no debe decir. También gracias a Pat Mochel, Hanna Linder, Randy
>> +Dunlap, Kay Sievers, Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook,
>> +Andrew Morton, Andi Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk,
>> +Keri Harris, Frans Pop, David A. Wheeler, Junio Hamano, Michael Kerrisk y
>> +Alex Shepard por su revisión, comentarios y contribuciones. Sin su ayuda,
>> +este documento no hubiera sido posible.
>> +
>> +Maintainer: Greg Kroah-Hartman <greg@kroah.com>
> kernel test robot has already reported documentation warnings at [1],
> so I have applied the fixup:
Nice, I'll make sure to include this in v2
>
> ---- >8 ----
>
> diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
> index 4cf8fa6b9f7c2e..0c072b9a69df30 100644
> --- a/Documentation/translations/sp_SP/howto.rst
> +++ b/Documentation/translations/sp_SP/howto.rst
> @@ -183,7 +183,7 @@ con::
>   	make epubdocs
>   
>   Convertirse en un/a desarrollador/a de kernel
> --------------------------------------------
> +---------------------------------------------
>   
>   Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
>   el proyecto Linux KernelNewbies:
> @@ -274,8 +274,8 @@ Vale la pena mencionar lo que Andrew Morton escribió en las listas de
>   correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
>   
>   	*"Nadie sabe cuándo se publicara un nuevo kernel, porque esto sucede
> -    de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
> -    una línea temporal preconcebida."*
> +        de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
> +        una línea temporal preconcebida."*
>   
>   Varios árboles estables con múltiples major numbers
>   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @@ -556,7 +556,7 @@ Esta es una analogía del desarrollador del kernel Al Viro (traducida):
>   	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
>   	presentaría su trabajo intermedio antes de tener la solución final.*
>   
> -	* Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
> +	*Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
>   	revisores no quieren ver el proceso de pensamiento detrás de la solución
>   	al problema que se está resolviendo. Quieren ver un solución simple y
>   	elegante."*
> @@ -579,7 +579,7 @@ Linux sabe por qué deberían agregar este cambio. Nuevas características
>   debe justificarse como necesarias y útiles.
>   
>   Documente sus cambios
> ---------------------
> +---------------------
>   
>   Cuando envíe sus parches, preste especial atención a lo que dice en el
>   texto de su correo electrónico. Esta información se convertirá en el
>
> Muchas gracias (thanks very much).
Cheers!
>
> [1]: https://lore.kernel.org/linux-doc/202210141348.7UGXRUp8-lkp@intel.com/
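
For reference, the two heading hunks in the fixup above each lengthen a
section underline so that it is at least as long as its title, which is
what docutils expects. Below is a minimal sketch of a checker for that one
class of problem. It is not part of the patch or of the kernel's
documentation tooling: the function name short_underlines and the set of
adornment characters are assumptions made for illustration, and it compares
code-point counts, which should be adequate for Latin text such as this
translation.

```python
import re

# Assumed subset of the characters reST allows for section adornment lines.
ADORNMENT = re.compile(r'^([=\-~^"#*+])\1+\s*$')


def short_underlines(text):
    """Return (title_line_number, title) for titles with a too-short underline.

    Line numbers are 1-based. Only the underline-only heading style is
    checked; overlines and indentation problems are out of scope.
    """
    lines = text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title, underline = lines[i - 1], lines[i]
        # Skip blank lines and lines that are themselves adornments.
        if not title.strip() or ADORNMENT.match(title):
            continue
        if ADORNMENT.match(underline) and len(underline.rstrip()) < len(title.rstrip()):
            problems.append((i, title.strip()))
    return problems


# Hypothetical sample mirroring the "Documente sus cambios" heading
# before the fixup (21-character title, 20-character underline).
sample = "Documente sus cambios\n" + "-" * 20 + "\n"
print(short_underlines(sample))
```

Running the same function over the whole howto.rst before sending v2 would
be one way to catch this class of warning locally, although the docs build
itself (make htmldocs) remains the authoritative check.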

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/2] Documentation: Add HOWTO Spanish translation into rst based build system
  2022-10-13 18:48  2% ` [PATCH 2/2] Documentation: Add HOWTO Spanish translation into rst based build system Carlos Bilbao
@ 2022-10-14  9:21  0%   ` Bagas Sanjaya
  2022-10-14 12:58  0%     ` Carlos Bilbao
  0 siblings, 1 reply; 200+ results
From: Bagas Sanjaya @ 2022-10-14  9:21 UTC (permalink / raw)
  To: Carlos Bilbao; +Cc: corbet, linux-doc, linux-kernel, bilbao, ojeda

[-- Attachment #1: Type: text/plain, Size: 36153 bytes --]

¡Hola Carlos! Gracias for starting to write Spanish translations. However,
the patch can be improved; see below.

On Thu, Oct 13, 2022 at 01:48:16PM -0500, Carlos Bilbao wrote:
> This commit adds Spanish translation of HOWTO document into rst based
> documentation build system.
> 

Better say "Translate HOWTO document into Spanish".

> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
> ---
>  Documentation/translations/sp_SP/howto.rst | 619 +++++++++++++++++++++
>  Documentation/translations/sp_SP/index.rst |   7 +
>  2 files changed, 626 insertions(+)
>  create mode 100644 Documentation/translations/sp_SP/howto.rst
> 
> diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
> new file mode 100644
> index 000000000000..4cf8fa6b9f7c
> --- /dev/null
> +++ b/Documentation/translations/sp_SP/howto.rst
> @@ -0,0 +1,619 @@
> +.. include:: ./disclaimer-sp.rst
> +
> +:Original: :ref:`Documentation/process/howto.rst <process_howto>`
> +:Translator: Carlos Bilbao <carlos.bilbao@amd.com>
> +
> +.. _sp_process_howto:
> +
> +Cómo participar en el desarrollo del kernel de Linux
> +====================================================
> +
> +Este documento es el principal punto de partida. Contiene instrucciones
> +sobre cómo convertirse en desarrollador del kernel de Linux y explica cómo
> +trabajar con él y en su desarrollo. El documento no tratará ningún aspecto
> +técnico relacionado con la programación del kernel, pero le ayudará
> +guiándole por el camino correcto.
> +
> +Si algo en este documento quedara obsoleto, envíe parches al maintainer de
> +este archivo, que se encuentra en la parte superior del documento.
> +
> +Introducción
> +------------
> +¿De modo que quiere descubrir cómo convertirse en un/a desarrollador/a del
> +kernel de Linux? Tal vez su jefe le haya dicho, "Escriba un driver de
> +Linux para este dispositivo." El objetivo de este documento es enseñarle
> +todo cuanto necesita para conseguir esto, describiendo el proceso por el
> +que debe pasar, y con indicaciones de cómo trabajar con la comunidad.
> +También trata de explicar las razones por las cuales la comunidad trabaja
> +de la forma en que lo hace.
> +
> +El kernel está principalmente escrito en C, con algunas partes que son
> +dependientes de la arquitectura en ensamblador. Un buen conocimiento de C
> +es necesario para desarrollar en el kernel. Lenguaje ensamblador (en
> +cualquier arquitectura) no es necesario a menos que planee realizar
> +desarrollo de bajo nivel para dicha arquitectura. Aunque no es un perfecto
> +sustituto para una educación sólida en C y/o años de experiencia, los
> +siguientes libros sirven, como mínimo, como referencia:
> +
> +- "The C Programming Language" de Kernighan y Ritchie [Prentice Hall]
> +- "Practical C Programming" de Steve Oualline [O'Reilly]
> +- "C:  A Reference Manual" de Harbison y Steele [Prentice Hall]
> +
> +El kernel está escrito usando GNU C y la cadena de herramientas GNU. Si
> +bien se adhiere al estándar ISO C89, utiliza una serie de extensiones que
> +no aparecen en dicho estándar. El kernel usa un C independiente de entorno,
> +sin depender de la biblioteca C estándar, por lo que algunas partes del
> +estándar C no son compatibles. Divisiones de long long arbitrarios o
> +de coma flotante no son permitidas. En ocasiones, puede ser difícil de
> +entender las suposiciones que el kernel hace respecto a la cadena de
> +herramientas y las extensiones que usa, y desafortunadamente no hay
> +referencia definitiva para estos. Consulte las páginas de información de
> +gcc (`info gcc`) para obtener información al respecto.
> +
> +Recuerde que está tratando de aprender a trabajar con una comunidad de
> +desarrollo existente. Es un grupo diverso de personas, con altos estándares
> +de codificación, estilo y procedimiento. Estas normas han sido creadas a lo
> +largo del tiempo en función de lo que se ha encontrado que funciona mejor
> +para un equipo tan grande y geográficamente disperso. Trate de aprender
> +tanto como le sea posible acerca de estos estándares antes de tiempo, ya
> +que están bien documentados; no espere que la gente se adapte a usted o a
> +su forma de hacer las cosas.
> +
> +Cuestiones legales
> +------------------
> +El código fuente del kernel de Linux se publica bajo licencia GPL. Por
> +favor, revise el archivo COPYING, presente en la carpeta principal del
> +fuente, para detalles de la licencia. Si tiene alguna otra pregunta
> +sobre licencias, contacte a un abogado, no pregunte en listas de discusión
> +del kernel de Linux. Las personas en estas listas no son abogadas, y no
> +debe confiar en sus opiniones en materia legal.
> +
> +Para preguntas y respuestas más frecuentes sobre la licencia GPL, consulte:
> +
> +	https://www.gnu.org/licenses/gpl-faq.html
> +
> +Documentación
> +--------------
> +El código fuente del kernel de Linux tiene una gran variedad de documentos
> +que son increíblemente valiosos para aprender a interactuar con la
> +comunidad del kernel. Cuando se agregan nuevas funciones al kernel, se
> +recomienda que se incluyan nuevos archivos de documentación que expliquen
> +cómo usar la función. Cuando un cambio en el kernel hace que la interfaz
> +que el kernel expone al espacio de usuario cambie, se recomienda que envíe la
> +información o un parche en las páginas del manual que expliquen el cambio
> +a mtk.manpages@gmail.com, y CC la lista linux-api@vger.kernel.org.
> +
> +Esta es la lista de archivos que están en el código fuente del kernel y son
> +de obligada lectura:
> +
> +  :ref:`Documentation/admin-guide/README.rst <readme>`
> +    Este archivo ofrece una breve descripción del kernel de Linux y
> +    describe lo que es necesario hacer para configurar y compilar el
> +    kernel. Quienes sean nuevos en el kernel deben comenzar aquí.
> +
> +  :ref:`Documentation/process/changes.rst <changes>`
> +    Este archivo proporciona una lista de los niveles mínimos de varios
> +    paquetes que son necesarios para construir y ejecutar el kernel
> +    exitosamente.
> +
> +  :ref:`Documentation/process/coding-style.rst <codingstyle>`
> +    Esto describe el estilo de código del kernel de Linux y algunas de las
> +    razones detrás de esto. Se espera que todo el código nuevo siga las
> +    directrices de este documento. La mayoría de los maintainers solo
> +    aceptarán parches si se siguen estas reglas, y muchas personas solo
> +    revisan el código si tiene el estilo adecuado.
> +
> +  :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
> +    Este archivo describe en gran detalle cómo crear con éxito y enviar un
> +    parche, que incluye (pero no se limita a):
> +
> +       - Contenidos del correo electrónico (email)
> +       - Formato del email
> +       - A quién se debe enviar
> +
> +    Seguir estas reglas no garantiza el éxito (ya que todos los parches son
> +    sujetos a escrutinio de contenido y estilo), pero en caso de no seguir
> +    dichas reglas, el fracaso es prácticamente garantizado.
> +    Otras excelentes descripciones de cómo crear parches correctamente son:
> +
> +	"The Perfect Patch"
> +		https://www.ozlabs.org/~akpm/stuff/tpp.txt
> +
> +	"Linux kernel patch submission format"
> +		https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html
> +
> +  :ref:`Documentation/process/stable-api-nonsense.rst <stable_api_nonsense>`
> +    Este archivo describe la lógica detrás de la decisión consciente de
> +    no tener una API estable dentro del kernel, incluidas cosas como:
> +
> +      - Capas intermedias del subsistema (¿por compatibilidad?)
> +      - Portabilidad de drivers entre sistemas operativos
> +      - Mitigar el cambio rápido dentro del árbol de fuentes del kernel (o
> +        prevenir cambios rápidos)
> +
> +     Este documento es crucial para comprender la filosofía del desarrollo
> +     de Linux y es muy importante para las personas que se mudan a Linux
> +     tras desarrollar otros sistemas operativos.
> +
> +  :ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>`
> +    Si cree que ha encontrado un problema de seguridad en el kernel de
> +    Linux, siga los pasos de este documento para ayudar a notificar a los
> +    desarrolladores del kernel y ayudar a resolver el problema.
> +
> +  :ref:`Documentation/process/management-style.rst <managementstyle>`
> +    Este documento describe cómo operan los maintainers del kernel de Linux
> +    y los valores compartidos detrás de sus metodologías. Esta es una
> +    lectura importante para cualquier persona nueva en el desarrollo del
> +    kernel (o cualquier persona que simplemente sienta curiosidad por
> +    el campo IT), ya que clarifica muchos conceptos erróneos y confusiones
> +    comunes sobre el comportamiento único de los maintainers del kernel.
> +
> +  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
> +    Este archivo describe las reglas sobre cómo se suceden las versiones
> +    del kernel estable, y qué hacer si desea obtener un cambio en una de
> +    estas publicaciones.
> +
> +  :ref:`Documentation/process/kernel-docs.rst <kernel_docs>`
> +    Una lista de documentación externa relativa al desarrollo del kernel.
> +    Por favor consulte esta lista si no encuentra lo que están buscando
> +    dentro de la documentación del kernel.
> +
> +  :ref:`Documentation/process/applying-patches.rst <applying_patches>`
> +    Una buena introducción que describe exactamente qué es un parche y cómo
> +    aplicarlo a las diferentes ramas de desarrollo del kernel.
> +
> +El kernel también tiene una gran cantidad de documentos que pueden ser
> +generados automáticamente desde el propio código fuente o desde
> +ReStructuredText markups (ReST), como este. Esto incluye una descripción
> +completa de la API en el kernel y reglas sobre cómo manejar cerrojos
> +(locking) correctamente.
> +
> +Todos estos documentos se pueden generar como PDF o HTML ejecutando::
> +
> +	make pdfdocs
> +	make htmldocs
> +
> +respectivamente desde el directorio fuente principal del kernel.
> +
> +Los documentos que utilizan el markup ReST se generarán en
> +Documentation/output. También se pueden generar en formatos LaTeX y ePub
> +con::
> +
> +	make latexdocs
> +	make epubdocs
> +
> +Convertirse en un/a desarrollador/a de kernel
> +-------------------------------------------
> +
> +Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
> +el proyecto Linux KernelNewbies:
> +
> +	https://kernelnewbies.org
> +
> +Consiste en una útil lista de correo donde puede preguntar casi cualquier
> +tipo de pregunta básica de desarrollo del kernel (asegúrese de buscar en
> +los archivos primero, antes de preguntar algo que ya ha sido respondido en
> +el pasado.) También tiene un canal IRC que puede usar para hacer preguntas
> +en tiempo real, y una gran cantidad de documentación que es útil
> +para ir aprendiendo sobre el desarrollo del kernel de Linux.
> +
> +El sitio web tiene información básica sobre la organización del código,
> +subsistemas, y proyectos actuales (tanto dentro como fuera del árbol).
> +También describe alguna información logística básica, como cómo compilar
> +un kernel y aplicar un parche.
> +
> +Si no sabe por dónde quiere empezar, pero quiere buscar alguna tarea que
> +comenzar a hacer para unirse a la comunidad de desarrollo del kernel,
> +acuda al proyecto Linux Kernel Janitor:
> +
> +	https://kernelnewbies.org/KernelJanitors
> +
> +Es un gran lugar para comenzar. Describe una lista de problemas
> +relativamente simples que deben limpiarse y corregirse dentro del árbol de
> +código fuente del kernel de Linux. Trabajando con los
> +desarrolladores a cargo de este proyecto, aprenderá los conceptos básicos
> +para incluir su parche en el árbol del kernel de Linux, y posiblemente
> +descubrirá la dirección en la que trabajar a continuación, si no tiene ya
> +una idea.
> +
> +Antes de realizar cualquier modificación real al código del kernel de
> +Linux, es imperativo entender cómo funciona el código en cuestión. Para
> +este propósito, nada es mejor que leerlo directamente (lo más complicado
> +está bien comentado), tal vez incluso con la ayuda de herramientas
> +especializadas. Una de esas herramientas que se recomienda especialmente
> +es el proyecto Linux Cross-Reference, que es capaz de presentar el código
> +fuente en un formato de página web indexada y autorreferencial. Una
> +excelente puesta al día del repositorio del código del kernel se puede
> +encontrar en:
> +
> +	https://elixir.bootlin.com/
> +
> +El proceso de desarrollo
> +------------------------
> +
> +El proceso de desarrollo del kernel de Linux consiste actualmente de
> +diferentes "branches" (ramas) con muchos distintos subsistemas específicos
> +a cada una de ellas. Las diferentes ramas son:
> +
> +  - El código principal de Linus (mainline tree)
> +  - Varios árboles estables con múltiples major numbers
> +  - Subsistemas específicos
> +  - linux-next, para integración y testing
> +
> +Mainline tree (Árbol principal)
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +El mainline tree es mantenido por Linus Torvalds, y puede encontrarse en
> +https://kernel.org o en su repo.  El proceso de desarrollo es el siguiente:
> +
> +  - Tan pronto como se lanza un nuevo kernel, se abre una ventana de dos
> +    semanas, durante este período de tiempo, los maintainers pueden enviar
> +    grandes modificaciones a Linus, por lo general los parches que ya se
> +    han incluido en el linux-next durante unas semanas. La forma preferida
> +    de enviar grandes cambios es usando git (la herramienta de
> +    administración de código fuente del kernel, más información al respecto
> +    en https://git-scm.com/), pero los parches simples también son válidos.
> +  - Después de dos semanas, se lanza un kernel -rc1 y la atención se centra
> +    en hacer que el kernel nuevo sea lo más estable ("sólido") posible. La
> +    mayoría de los parches en este punto debe arreglar una regresión. Los
> +    errores que siempre han existido no son regresiones, por lo tanto, solo
> +    envíe este tipo de correcciones si son importantes. Tenga en cuenta que
> +    se podría aceptar un controlador (o sistema de archivos) completamente
> +    nuevo después de -rc1 porque no hay riesgo de causar regresiones con
> +    tal cambio, siempre y cuando el cambio sea autónomo y no afecte áreas
> +    fuera del código que se está agregando. git se puede usar para enviar
> +    parches a Linus después de que se lance -rc1, pero los parches también
> +    deben ser enviados a una lista de correo pública para su revisión.
> +  - Se lanza un nuevo -rc cada vez que Linus considera que el árbol git
> +    actual está en un estado razonablemente sano y adecuado para la prueba.
> +    La meta es lanzar un nuevo kernel -rc cada semana.
> +  - El proceso continúa hasta que el kernel se considera "listo", y esto
> +    puede durar alrededor de 6 semanas.
> +
> +Vale la pena mencionar lo que Andrew Morton escribió en las listas de
> +correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
> +
> +	*"Nadie sabe cuándo se publicara un nuevo kernel, porque esto sucede
> +    de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
> +    una línea temporal preconcebida."*
> +
> +Varios árboles estables con múltiples major numbers
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Los kernels con versiones de 3 partes son kernels estables. Estos contienen
> +correcciones relativamente pequeñas y críticas para problemas de seguridad
> +o importantes regresiones descubiertas para una publicación de código.
> +Cada lanzamiento en una gran serie estable incrementa la tercera parte del
> +número de versión, manteniendo las dos primeras partes iguales.
> +
> +Esta es la rama recomendada para los usuarios que quieren la versión
> +estable más reciente del kernel, y no están interesados en ayudar a probar
> +versiones en desarrollo/experimentales.
> +
> +Los árboles estables son mantenidos por el equipo "estable"
> +<stable@vger.kernel.org>, y se liberan (publican) según lo dicten las
> +necesidades. El período de liberación normal es de aproximadamente dos
> +semanas, pero puede ser más largo si no hay problemas apremiantes. Un
> +problema relacionado con la seguridad, en cambio, puede causar un
> +lanzamiento casi instantáneamente.
> +
> +El archivo :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
> +en el árbol del kernel documenta qué tipos de cambios son aceptables para
> +el árbol estable y cómo funciona el proceso de lanzamiento.
> +
> +Subsistemas específicos
> +~~~~~~~~~~~~~~~~~~~~~~~~
> +Los maintainers de los diversos subsistemas del kernel --- y también muchos
> +desarrolladores de subsistemas del kernel --- exponen su estado actual de
> +desarrollo en repositorios fuente. De esta manera, otros pueden ver lo que
> +está sucediendo en las diferentes áreas del kernel. En áreas donde el
> +desarrollo es rápido, se le puede pedir a un desarrollador que base sus
> +envíos en tal árbol del subsistema del kernel, para evitar conflictos entre
> +este y otros trabajos ya en curso.
> +
> +La mayoría de estos repositorios son árboles git, pero también hay otros
> +SCM en uso, o colas de parches que se publican como series quilt. Las
> +direcciones de estos repositorios de subsistemas se enumeran en el archivo
> +MAINTAINERS. Muchos de estos se pueden ver en https://git.kernel.org/.
> +
> +Antes de que un parche propuesto se incluya con dicho árbol de subsistemas,
> +es sujeto a revisión, que ocurre principalmente en las listas de correo
> +(ver la sección respectiva a continuación). Para varios subsistemas del
> +kernel, esta revisión se rastrea con la herramienta patchwork. Patchwork
> +ofrece una interfaz web que muestra publicaciones de parches, cualquier
> +comentario sobre un parche o revisiones a él, y los maintainers pueden
> +marcar los parches como en revisión, aceptado, o rechazado. La mayoría de
> +estos sitios de trabajo de parches se enumeran en
> +
> +https://patchwork.kernel.org/.
> +
> +linux-next, para integración y testing
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Antes de que las actualizaciones de los árboles de subsistemas se combinen
> +con el árbol principal, necesitan probar su integración. Para ello, existe
> +un repositorio especial de pruebas en el que se encuentran casi todos los
> +árboles de subsistema, actualizado casi a diario:
> +
> +	https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
> +
> +De esta manera, linux-next ofrece una perspectiva resumida de lo que se
> +espera que entre en el kernel principal en el próximo período de "merge"
> +(fusión de codigo). Los testers aventureros son bienvenidos a probar
> +linux-next en ejecución.
> +
> +Reportar bugs
> +-------------
> +
> +El archivo 'Documentation/admin-guide/reporting-issues.rst' en el
> +directorio principal del kernel describe cómo informar un posible bug del
> +kernel y detalles sobre qué tipo de información necesitan los
> +desarrolladores del kernel para ayudar a rastrear la fuente del problema.
> +
> +Gestión de informes de bugs
> +------------------------------
> +
> +Una de las mejores formas de poner en práctica sus habilidades de hacking
> +es arreglando errores reportados por otras personas. No solo ayudará a
> +hacer el kernel más estable, también aprenderá a solucionar problemas del
> +mundo real y mejorará sus habilidades, y otros desarrolladores se darán
> +cuenta de su presencia. La corrección de errores es una de las mejores
> +formas de ganar méritos entre desarrolladores, porque no a muchas personas
> +les gusta perder el tiempo arreglando los errores de otras personas.
> +
> +Para trabajar en informes de errores ya reportados, busque un subsistema
> +que le interese. Verifique el archivo MAINTAINERS donde se informan los
> +errores de ese subsistema; con frecuencia será una lista de correo, rara
> +vez un rastreador de errores (bugtracker). Busque en los archivos de dicho
> +lugar para informes recientes y ayude donde lo crea conveniente. También es
> +posible que desee revisar https://bugzilla.kernel.org para informes de
> +errores; solo un puñado de subsistemas del kernel lo emplean activamente
> +para informar o rastrear; sin embargo, todos los errores para todo el kernel
> +se archivan allí.
> +
> +Listas de correo
> +-----------------
> +
> +Como se explica en algunos de los documentos anteriores, la mayoría de
> +desarrolladores del kernel participan en la lista de correo del kernel de
> +Linux. Detalles sobre cómo suscribirse y darse de baja de la lista se
> +pueden encontrar en:
> +
> +	http://vger.kernel.org/vger-lists.html#linux-kernel
> +
> +Existen archivos de la lista de correo en la web en muchos lugares
> +distintos. Utilice un motor de búsqueda para encontrar estos archivos. Por
> +ejemplo:
> +
> +	http://dir.gmane.org/gmane.linux.kernel
> +
> +Es muy recomendable que busque en los archivos sobre el tema que desea
> +tratar, antes de publicarlo en la lista. Un montón de cosas ya discutidas
> +en detalle solo se registran en los archivos de la lista de correo.
> +
> +La mayoría de los subsistemas individuales del kernel también tienen sus
> +propias listas de correo donde hacen sus esfuerzos de desarrollo. Revise el
> +archivo MAINTAINERS para obtener referencias de cuáles son estas listas para
> +los diferentes grupos.
> +
> +Muchas de las listas están alojadas en kernel.org. La información sobre
> +estas puede ser encontrada en:
> +
> +	http://vger.kernel.org/vger-lists.html
> +
> +Recuerde mantener buenos hábitos de comportamiento al usar las listas.
> +Aunque un poco cursi, la siguiente URL tiene algunas pautas simples para
> +interactuar con la lista (o cualquier lista):
> +
> +	http://www.albion.com/netiquette/
> +
> +Si varias personas responden a su correo, el CC (lista de destinatarios)
> +puede hacerse bastante grande. No elimine a nadie de la lista CC: sin una
> +buena razón, o no responda solo a la dirección de la lista. Acostúmbrese
> +a recibir correos dos veces, una del remitente y otra de la lista, y no
> +intente ajustar esto agregando encabezados de correo astutos, a la gente no
> +le gustará.
> +
> +Recuerde mantener intacto el contexto y la atribución de sus respuestas,
> +mantenga las líneas "El hacker John Kernel escribió ...:" en la parte
> +superior de su respuesta, y agregue sus declaraciones entre las secciones
> +individuales citadas en lugar de escribir en la parte superior del
> +correo electrónico.
> +
> +Si incluye parches en su correo, asegúrese de que sean texto legible sin
> +formato como se indica en :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
> +Los desarrolladores del kernel no quieren lidiar con archivos adjuntos o
> +parches comprimidos; y pueden querer comentar líneas individuales de su
> +parche, que funciona sólo de esa manera. Asegúrese de emplear un programa
> +de correo que no altere los espacios ni los tabuladores. Una buena primera
> +prueba es enviarse el correo a usted mismo, e intentar aplicar su
> +propio parche. Si eso no funciona, arregle su programa de correo o
> +reemplácelo hasta que funcione.
> +
> +Sobre todo, recuerde ser respetuoso con otros suscriptores.
> +
> +Colaborando con la comunidad
> +----------------------------
> +
> +El objetivo de la comunidad del kernel es proporcionar el mejor kernel
> +posible. Cuando envíe un parche para su aceptación, se revisará en sus
> +méritos técnicos solamente. Entonces, ¿qué debería esperar?
> +
> +  - críticas
> +  - comentarios
> +  - peticiones de cambios
> +  - peticiones de justificaciones
> +  - silencio
> +
> +Recuerde, esto es parte de introducir su parche en el kernel. Tiene que ser
> +capaz de recibir críticas y comentarios sobre sus parches, evaluarlos
> +a nivel técnico y re-elaborar sus parches o proporcionar razonamiento claro
> +y conciso de por qué no se deben hacer tales cambios. Si no hay respuestas
> +a su publicación, espere unos días e intente de nuevo, a veces las cosas se
> +pierden dado el gran volumen.
> +
> +¿Qué no debería hacer?
> +
> +  - esperar que su parche se acepte sin preguntas
> +  - actuar de forma defensiva
> +  - ignorar comentarios
> +  - enviar el parche de nuevo, sin haber aplicado los cambios pertinentes
> +
> +En una comunidad que busca la mejor solución técnica posible, siempre habrá
> +diferentes opiniones sobre lo beneficioso que es un parche. Tiene que ser
> +cooperativo y estar dispuesto a adaptar su idea para que encaje dentro
> +del kernel, o al menos esté dispuesto a demostrar que su idea vale la pena.
> +Recuerde, estar equivocado es aceptable siempre y cuando esté dispuesto a
> +trabajar hacia una solución que sea correcta.
> +
> +Es normal que las respuestas a su primer parche sean simplemente una lista
> +de una docena de cosas que debe corregir. Esto **no** implica que su
> +parche no será aceptado, y **no** es personal. Simplemente corrija todos
> +los problemas planteados en su parche, y envíelo otra vez.
> +
> +Diferencias entre la comunidad kernel y las estructuras corporativas
> +--------------------------------------------------------------------
> +
> +La comunidad del kernel funciona de manera diferente a la mayoría de los
> +entornos de desarrollo tradicionales en empresas. Aquí hay una lista de
> +cosas que puede intentar hacer para evitar problemas:
> +
> +  Cosas buenas que decir respecto a los cambios propuestos:
> +
> +    - "Esto arregla múltiples problemas."
> +    - "Esto elimina 2000 líneas de código."
> +    - "Aquí hay un parche que explica lo que intento describir."
> +    - "Lo he testeado en 5 arquitecturas distintas..."
> +    - "Aquí hay una serie de parches menores que..."
> +    - "Esto mejora el rendimiento en máquinas típicas..."
> +
> +  Cosas negativas que debe evitar decir:
> +
> +    - "Lo hicimos así en AIX/ptx/Solaris, de modo que debe ser bueno..."
> +    - "Llevo haciendo esto 20 años, de modo que..."
> +    - "Esto lo necesita mi empresa para ganar dinero"
> +    - "Esto es para la línea de nuestros productos Enterprise"
> +    - "Aquí está el documento de 1000 páginas describiendo mi idea"
> +    - "Llevo 6 meses trabajando en esto..."
> +    - "Aquí está un parche de 5000 líneas que..."
> +    - "He reescrito todo el desastre actual, y aquí está..."
> +    - "Tengo un deadline, y este parche debe aplicarse ahora."
> +
> +Otra forma en que la comunidad del kernel es diferente a la mayoría de los
> +entornos de trabajo tradicionales en ingeniería de software, es la
> +naturaleza sin rostro de interacción. Una de las ventajas de utilizar el
> +correo electrónico y el IRC como formas principales de comunicación es la
> +no discriminación por motivos de género o raza. El entorno de trabajo del
> +kernel de Linux acepta a mujeres y minorías porque todo lo que eres es una
> +dirección de correo electrónico. El aspecto internacional también ayuda a
> +nivelar el campo de juego porque no puede adivinar el género basado en
> +el nombre de una persona. Un hombre puede llamarse Andrea y una mujer puede
> +llamarse Pat. La mayoría de las mujeres que han trabajado en el kernel de
> +Linux y han expresado una opinión han tenido experiencias positivas.
> +
> +La barrera del idioma puede causar problemas a algunas personas que no se
> +sienten cómodas con el inglés. Un buen dominio del idioma puede ser
> +necesario para transmitir ideas correctamente en las listas de correo, por
> +lo que le recomendamos que revise sus correos electrónicos para asegurarse
> +de que tengan sentido en inglés antes de enviarlos.
> +
> +Divida sus cambios
> +---------------------
> +
> +La comunidad del kernel de Linux no acepta con gusto grandes fragmentos de
> +código, sobre todo a la vez. Los cambios deben introducirse correctamente,
> +discutidos y divididos en pequeñas porciones individuales. Esto es casi
> +exactamente lo contrario de lo que las empresas están acostumbradas a hacer.
> +Su propuesta también debe introducirse muy temprano en el proceso de
> +desarrollo, de modo que pueda recibir comentarios sobre lo que está
> +haciendo. También deje que la comunidad sienta que está trabajando con
> +ellos, y no simplemente usándolos como un vertedero para su función. Sin
> +embargo, no envíe 50 correos electrónicos a la vez a una lista de correo,
> +su serie de parches debe casi siempre ser más pequeña que eso.
> +
> +Las razones para dividir las cosas son las siguientes:
> +
> +1) Los cambios pequeños aumentan la probabilidad de que sus parches sean
> +   aplicados, ya que no requieren mucho tiempo o esfuerzo para verificar su
> +   exactitud. Un parche de 5 líneas puede ser aplicado por un maintainer
> +   con apenas una segunda mirada. Sin embargo, un parche de 500 líneas
> +   puede tardar horas en ser revisado en términos de corrección (el tiempo
> +   que toma es exponencialmente proporcional al tamaño del parche, o algo
> +   así).
> +
> +   Los parches pequeños también facilitan la depuración cuando algo falla.
> +   Es mucho más fácil retirar los parches uno por uno que diseccionar un
> +   parche muy grande después de haber sido aplicado (y roto alguna cosa).
> +
> +2) Es importante no solo enviar pequeños parches, sino también reescribir
> +   y simplificar (o simplemente reordenar) los parches antes de enviarlos.
> +
> +Esta es una analogía del desarrollador del kernel Al Viro (traducida):
> +
> +	*"Piense en un maestro que califica la tarea de un estudiante de
> +	matemáticas. El maestro no quiere ver los intentos y errores del
> +	estudiante antes de que se le ocurriera la solución. Quiere ver la
> +	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
> +	presentaría su trabajo intermedio antes de tener la solución final.*
> +
> +	* Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
> +	revisores no quieren ver el proceso de pensamiento detrás de la solución
> +	al problema que se está resolviendo. Quieren ver un solución simple y
> +	elegante."*
> +
> +Puede resultar un reto mantener el equilibrio entre presentar una solución
> +elegante y trabajar junto a la comunidad, discutiendo su trabajo inacabado.
> +Por lo tanto, es bueno comenzar temprano en el proceso para obtener
> +"feedback" y mejorar su trabajo, pero también mantenga sus cambios en
> +pequeños trozos que pueden ser aceptados, incluso cuando toda su labor no
> +esté lista para su inclusión en un momento dado.
> +
> +También tenga en cuenta que no es aceptable enviar parches para su
> +inclusión que están sin terminar y serán "arreglados más tarde".
> +
> +Justifique sus cambios
> +----------------------
> +
> +Además de dividir sus parches, es muy importante que deje a la comunidad de
> +Linux sabe por qué deberían agregar este cambio. Nuevas características
> +debe justificarse como necesarias y útiles.
> +
> +Documente sus cambios
> +--------------------
> +
> +Cuando envíe sus parches, preste especial atención a lo que dice en el
> +texto de su correo electrónico. Esta información se convertirá en el
> +ChangeLog del parche, y se conservará para que todos la vean, todo el
> +tiempo. Debe describir el parche por completo y contener:
> +
> +  - por qué los cambios son necesarios
> +  - el diseño general de su propuesta
> +  - detalles de implementación
> +  - resultados de sus experimentos
> +
> +Para obtener más detalles sobre cómo debería quedar todo esto, consulte la
> +sección ChangeLog del documento:
> +
> +  "The Perfect Patch"
> +      https://www.ozlabs.org/~akpm/stuff/tpp.txt
> +
> +Todas estas cuestiones son a veces muy difíciles de conseguir. Puede
> +llevar años perfeccionar estas prácticas (si es que lo hace). Es un proceso
> +continuo de mejora que requiere mucha paciencia y determinación. Pero no se
> +rinda, es posible. Muchos lo han hecho antes, y cada uno tuvo que comenzar
> +exactamente donde está usted ahora.
> +
> +
> +----------
> +
> +Gracias a Paolo Ciarrocchi que permitió que la sección "Development Process"
> +se basara en el texto que había escrito (https://lwn.net/Articles/94386/),
> +y a Randy Dunlap y Gerrit Huizenga por parte de la lista de cosas que
> +debe y no debe decir. También gracias a Pat Mochel, Hanna Linder, Randy
> +Dunlap, Kay Sievers, Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook,
> +Andrew Morton, Andi Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk,
> +Keri Harris, Frans Pop, David A. Wheeler, Junio Hamano, Michael Kerrisk y
> +Alex Shepard por su revisión, comentarios y contribuciones. Sin su ayuda,
> +este documento no hubiera sido posible.
> +
> +Maintainer: Greg Kroah-Hartman <greg@kroah.com>

The kernel test robot has already reported documentation warnings at [1],
so I have applied the fixup:

---- >8 ----

diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
index 4cf8fa6b9f7c2e..0c072b9a69df30 100644
--- a/Documentation/translations/sp_SP/howto.rst
+++ b/Documentation/translations/sp_SP/howto.rst
@@ -183,7 +183,7 @@ con::
 	make epubdocs
 
 Convertirse en un/a desarrollador/a de kernel
--------------------------------------------
+---------------------------------------------
 
 Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
 el proyecto Linux KernelNewbies:
@@ -274,8 +274,8 @@ Vale la pena mencionar lo que Andrew Morton escribió en las listas de
 correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
 
 	*"Nadie sabe cuándo se publicara un nuevo kernel, porque esto sucede
-    de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
-    una línea temporal preconcebida."*
+        de acuerdo con el estado de bugs (error) percibido, no de acuerdo con
+        una línea temporal preconcebida."*
 
 Varios árboles estables con múltiples major numbers
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -556,7 +556,7 @@ Esta es una analogía del desarrollador del kernel Al Viro (traducida):
 	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
 	presentaría su trabajo intermedio antes de tener la solución final.*
 
-	* Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
+	*Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
 	revisores no quieren ver el proceso de pensamiento detrás de la solución
 	al problema que se está resolviendo. Quieren ver un solución simple y
 	elegante."*
@@ -579,7 +579,7 @@ Linux sabe por qué deberían agregar este cambio. Nuevas características
 debe justificarse como necesarias y útiles.
 
 Documente sus cambios
---------------------
+---------------------
 
 Cuando envíe sus parches, preste especial atención a lo que dice en el
 texto de su correo electrónico. Esta información se convertirá en el

Muchas gracias (thanks very much).

[1]: https://lore.kernel.org/linux-doc/202210141348.7UGXRUp8-lkp@intel.com/
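
The two title-underline warnings fixed in the diff above are easy to catch
before sending. A rough sketch in Python, assuming only that a section title
is a non-blank line directly followed by a run of one repeated punctuation
character; `short_underlines` and `ADORNMENT` are names made up for this
illustration and are not part of the kernel's documentation tooling:

```python
import re

# A run of one repeated adornment character (at least two of them).
# Assumption: only this common subset of reST adornment characters is checked.
ADORNMENT = re.compile(r'^([=\-~^"#*+])\1+$')


def short_underlines(text):
    """Return (underline_line_number, title) for underlines shorter than their title."""
    lines = text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        # Skip blank "titles" and lines that are themselves adornment.
        if not title.strip() or ADORNMENT.match(title):
            continue
        if ADORNMENT.match(underline) and len(underline) < len(title):
            problems.append((i + 1, title))
    return problems


if __name__ == "__main__":
    sample = "Documente sus cambios\n" + "-" * 20 + "\n\nTexto.\n"
    print(short_underlines(sample))
```

The comparison counts characters rather than display columns, which is
enough for the accented Latin text in this translation but may not hold for
wide scripts.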
-- 
An old man doll... just what I always wanted! - Clara

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply related	[relevance 0%]

* [PATCH 2/2] Documentation: Add HOWTO Spanish translation into rst based build system
  @ 2022-10-13 18:48  2% ` Carlos Bilbao
  2022-10-14  9:21  0%   ` Bagas Sanjaya
    1 sibling, 1 reply; 200+ results
From: Carlos Bilbao @ 2022-10-13 18:48 UTC (permalink / raw)
  To: corbet; +Cc: linux-doc, linux-kernel, carlos.bilbao, bilbao, ojeda

This commit adds a Spanish translation of the HOWTO document to the
rst-based documentation build system.

Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
---
 Documentation/translations/sp_SP/howto.rst | 619 +++++++++++++++++++++
 Documentation/translations/sp_SP/index.rst |   7 +
 2 files changed, 626 insertions(+)
 create mode 100644 Documentation/translations/sp_SP/howto.rst

diff --git a/Documentation/translations/sp_SP/howto.rst b/Documentation/translations/sp_SP/howto.rst
new file mode 100644
index 000000000000..4cf8fa6b9f7c
--- /dev/null
+++ b/Documentation/translations/sp_SP/howto.rst
@@ -0,0 +1,619 @@
+.. include:: ./disclaimer-sp.rst
+
+:Original: :ref:`Documentation/process/howto.rst <process_howto>`
+:Translator: Carlos Bilbao <carlos.bilbao@amd.com>
+
+.. _sp_process_howto:
+
+Cómo participar en el desarrollo del kernel de Linux
+====================================================
+
+Este documento es el principal punto de partida. Contiene instrucciones
+sobre cómo convertirse en desarrollador del kernel de Linux y explica cómo
+trabajar con él y en su desarrollo. El documento no tratará ningún aspecto
+técnico relacionado con la programación del kernel, pero le ayudará
+guiándole por el camino correcto.
+
+Si algo en este documento quedara obsoleto, envíe parches al maintainer de
+este archivo, que se encuentra en la parte superior del documento.
+
+Introducción
+------------
+¿De modo que quiere descubrir cómo convertirse en un/a desarrollador/a del
+kernel de Linux? Tal vez su jefe le haya dicho, "Escriba un driver de
+Linux para este dispositivo." El objetivo de este documento es enseñarle
+todo cuanto necesita para conseguir esto, describiendo el proceso por el
+que debe pasar, y con indicaciones de cómo trabajar con la comunidad.
+También trata de explicar las razones por las cuales la comunidad trabaja
+de la forma en que lo hace.
+
+El kernel está principalmente escrito en C, con algunas partes que son
+dependientes de la arquitectura en ensamblador. Un buen conocimiento de C
+es necesario para desarrollar en el kernel. Lenguaje ensamblador (en
+cualquier arquitectura) no es necesario excepto que planee realizar
+desarrollo de bajo nivel para dicha arquitectura. Aunque no es un perfecto
+sustituto para una educación sólida en C y/o años de experiencia, los
+siguientes libros sirven, como mínimo, como referencia:
+
+- "The C Programming Language" de Kernighan y Ritchie [Prentice Hall]
+- "Practical C Programming" de Steve Oualline [O'Reilly]
+- "C:  A Reference Manual" de Harbison y Steele [Prentice Hall]
+
+El kernel está escrito usando GNU C y la cadena de herramientas GNU. Si
+bien se adhiere al estándar ISO C89, utiliza una serie de extensiones que
+no aparecen en dicho estándar. El kernel usa un C independiente de entorno,
+sin depender de la biblioteca C estándar, por lo que algunas partes del
+estándar C no son compatibles. Divisiones de long long arbitrarios o
+de coma flotante no son permitidas. En ocasiones, puede ser difícil de
+entender las suposiciones que el kernel hace respecto a la cadena de
+herramientas y las extensiones que usa, y desafortunadamente no hay
+referencia definitiva para estos. Consulte las páginas de información de
+gcc (`info gcc`) para obtener información al respecto.
+
+Recuerde que está tratando de aprender a trabajar con una comunidad de
+desarrollo existente. Es un grupo diverso de personas, con altos estándares
+de codificación, estilo y procedimiento. Estas normas han sido creadas a lo
+largo del tiempo en función de lo que se ha encontrado que funciona mejor
+para un equipo tan grande y geográficamente disperso. Trate de aprender
+tanto como le sea posible acerca de estos estándares antes de tiempo, ya
+que están bien documentados; no espere que la gente se adapte a usted o a
+su forma de hacer las cosas.
+
+Cuestiones legales
+------------------
+El código fuente del kernel de Linux se publica bajo licencia GPL. Por
+favor, revise el archivo COPYING, presente en la carpeta principal del
+fuente, para detalles de la licencia. Si tiene alguna otra pregunta
+sobre licencias, contacte a un abogado, no pregunte en listas de discusión
+del kernel de Linux. Las personas en estas listas no son abogadas, y no
+debe confiar en sus opiniones en materia legal.
+
+Para preguntas y respuestas más frecuentes sobre la licencia GPL, consulte:
+
+	https://www.gnu.org/licenses/gpl-faq.html
+
+Documentación
+--------------
+El código fuente del kernel de Linux tiene una gran variedad de documentos
+que son increíblemente valiosos para aprender a interactuar con la
+comunidad del kernel. Cuando se agregan nuevas funciones al kernel, se
+recomienda que se incluyan nuevos archivos de documentación que expliquen
+cómo usar la función. Cuando un cambio en el kernel hace que la interfaz
+que el kernel expone al espacio de usuario cambie, se recomienda que envíe la
+información o un parche en las páginas del manual que expliquen el cambio
+a mtk.manpages@gmail.com, y CC la lista linux-api@vger.kernel.org.
+
+Esta es la lista de archivos que están en el código fuente del kernel y son
+de obligada lectura:
+
+  :ref:`Documentation/admin-guide/README.rst <readme>`
+    Este archivo ofrece una breve descripción del kernel de Linux y
+    describe lo que es necesario hacer para configurar y compilar el
+    kernel. Quienes sean nuevos en el kernel deben comenzar aquí.
+
+  :ref:`Documentation/process/changes.rst <changes>`
+    Este archivo proporciona una lista de los niveles mínimos de varios
+    paquetes que son necesarios para construir y ejecutar el kernel
+    exitosamente.
+
+  :ref:`Documentation/process/coding-style.rst <codingstyle>`
+    Esto describe el estilo de código del kernel de Linux y algunas de las
+    razones detrás de esto. Se espera que todo el código nuevo siga las
+    directrices de este documento. La mayoría de los maintainers solo
+    aceptarán parches si se siguen estas reglas, y muchas personas solo
+    revisan el código si tiene el estilo adecuado.
+
+  :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`
+    Este archivo describe en gran detalle cómo crear con éxito y enviar un
+    parche, que incluye (pero no se limita a):
+
+       - Contenidos del correo electrónico (email)
+       - Formato del email
+       - A quién se debe enviar
+
+    Seguir estas reglas no garantiza el éxito (ya que todos los parches
+    están sujetos a escrutinio de contenido y estilo), pero en caso de no
+    seguir dichas reglas, el fracaso está prácticamente garantizado.
+    Otras excelentes descripciones de cómo crear parches correctamente son:
+
+	"The Perfect Patch"
+		https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+	"Linux kernel patch submission format"
+		https://web.archive.org/web/20180829112450/http://linux.yyz.us/patch-format.html
+
+  :ref:`Documentation/process/stable-api-nonsense.rst <stable_api_nonsense>`
+    Este archivo describe la lógica detrás de la decisión consciente de
+    no tener una API estable dentro del kernel, incluidas cosas como:
+
+      - Capas intermedias del subsistema (¿por compatibilidad?)
+      - Portabilidad de drivers entre sistemas operativos
+      - Mitigar el cambio rápido dentro del árbol de fuentes del kernel (o
+        prevenir cambios rápidos)
+
+    Este documento es crucial para comprender la filosofía del desarrollo
+    de Linux y es muy importante para las personas que se mudan a Linux
+    tras desarrollar otros sistemas operativos.
+
+  :ref:`Documentation/admin-guide/security-bugs.rst <securitybugs>`
+    Si cree que ha encontrado un problema de seguridad en el kernel de
+    Linux, siga los pasos de este documento para ayudar a notificar a los
+    desarrolladores del kernel y ayudar a resolver el problema.
+
+  :ref:`Documentation/process/management-style.rst <managementstyle>`
+    Este documento describe cómo operan los maintainers del kernel de Linux
+    y los valores compartidos detrás de sus metodologías. Esta es una
+    lectura importante para cualquier persona nueva en el desarrollo del
+    kernel (o cualquier persona que simplemente sienta curiosidad por
+    el campo IT), ya que clarifica muchos conceptos erróneos y confusiones
+    comunes sobre el comportamiento único de los maintainers del kernel.
+
+  :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+    Este archivo describe las reglas sobre cómo se suceden las versiones
+    del kernel estable, y qué hacer si desea obtener un cambio en una de
+    estas publicaciones.
+
+  :ref:`Documentation/process/kernel-docs.rst <kernel_docs>`
+    Una lista de documentación externa relativa al desarrollo del kernel.
+    Por favor consulte esta lista si no encuentra lo que está buscando
+    dentro de la documentación del kernel.
+
+  :ref:`Documentation/process/applying-patches.rst <applying_patches>`
+    Una buena introducción que describe exactamente qué es un parche y cómo
+    aplicarlo a las diferentes ramas de desarrollo del kernel.
+
+El kernel también tiene una gran cantidad de documentos que pueden ser
+generados automáticamente desde el propio código fuente o desde
+ReStructuredText markups (ReST), como este. Esto incluye una descripción
+completa de la API en el kernel y reglas sobre cómo manejar cerrojos
+(locking) correctamente.
+
+Todos estos documentos se pueden generar como PDF o HTML ejecutando::
+
+	make pdfdocs
+	make htmldocs
+
+respectivamente desde el directorio fuente principal del kernel.
+
+Los documentos que utilizan el markup ReST se generarán en
+Documentation/output. También se pueden generar en formatos LaTeX y ePub
+con::
+
+	make latexdocs
+	make epubdocs
+
+Convertirse en un/a desarrollador/a de kernel
+---------------------------------------------
+
+Si no sabe nada sobre el desarrollo del kernel de Linux, debería consultar
+el proyecto Linux KernelNewbies:
+
+	https://kernelnewbies.org
+
+Consiste en una útil lista de correo donde puede hacer casi cualquier
+tipo de pregunta básica de desarrollo del kernel (asegúrese de buscar en
+los archivos primero, antes de preguntar algo que ya ha sido respondido en
+el pasado). También tiene un canal IRC que puede usar para hacer preguntas
+en tiempo real, y una gran cantidad de documentación que resulta útil
+para ir aprendiendo sobre el desarrollo del kernel de Linux.
+
+El sitio web tiene información básica sobre la organización del código,
+subsistemas, y proyectos actuales (tanto dentro como fuera del árbol).
+También describe alguna información logística básica, como cómo compilar
+un kernel y aplicar un parche.
+
+Si no sabe por dónde quiere empezar, pero quiere buscar alguna tarea que
+comenzar a hacer para unirse a la comunidad de desarrollo del kernel,
+acuda al proyecto Linux Kernel Janitor:
+
+	https://kernelnewbies.org/KernelJanitors
+
+Es un gran lugar para comenzar. Describe una lista de problemas
+relativamente simples que deben limpiarse y corregirse dentro del árbol
+de código fuente del kernel de Linux. Trabajando con los
+desarrolladores a cargo de este proyecto, aprenderá los conceptos básicos
+para incluir su parche en el árbol del kernel de Linux, y posiblemente
+descubrirá en qué dirección trabajar a continuación, si no tiene ya
+una idea.
+
+Antes de realizar cualquier modificación real al código del kernel de
+Linux, es imperativo entender cómo funciona el código en cuestión. Para
+este propósito, nada es mejor que leerlo directamente (lo más complicado
+está bien comentado), tal vez incluso con la ayuda de herramientas
+especializadas. Una de esas herramientas que se recomienda especialmente
+es el proyecto Linux Cross-Reference, que es capaz de presentar el código
+fuente en un formato de página web indexada y autorreferencial. Un
+excelente repositorio actualizado del código del kernel se puede
+encontrar en:
+
+	https://elixir.bootlin.com/
+
+El proceso de desarrollo
+------------------------
+
+El proceso de desarrollo del kernel de Linux consiste actualmente en unas
+pocas "branches" (ramas) principales y muchas ramas específicas de
+subsistemas. Las diferentes ramas son:
+
+  - El código principal de Linus (mainline tree)
+  - Varios árboles estables con múltiples major numbers
+  - Subsistemas específicos
+  - linux-next, para integración y testing
+
+Mainline tree (Árbol principal)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+El mainline tree es mantenido por Linus Torvalds, y puede encontrarse en
+https://kernel.org o en su repo.  El proceso de desarrollo es el siguiente:
+
+  - Tan pronto como se lanza un nuevo kernel, se abre una ventana de dos
+    semanas. Durante este período de tiempo, los maintainers pueden enviar
+    grandes modificaciones a Linus, por lo general los parches que ya se
+    han incluido en el linux-next durante unas semanas. La forma preferida
+    de enviar grandes cambios es usando git (la herramienta de
+    administración de código fuente del kernel, más información al respecto
+    en https://git-scm.com/), pero los parches simples también son válidos.
+  - Después de dos semanas, se lanza un kernel -rc1 y la atención se centra
+    en hacer el kernel nuevo lo más estable ("sólido") posible. La
+    mayoría de los parches en este punto debe arreglar una regresión. Los
+    errores que siempre han existido no son regresiones, por lo tanto, solo
+    envíe este tipo de correcciones si son importantes. Tenga en cuenta que
+    se podría aceptar un controlador (o sistema de archivos) completamente
+    nuevo después de -rc1 porque no hay riesgo de causar regresiones con
+    tal cambio, siempre y cuando el cambio sea autónomo y no afecte áreas
+    fuera del código que se está agregando. git se puede usar para enviar
+    parches a Linus después de que se lance -rc1, pero los parches también
+    deben ser enviados a una lista de correo pública para su revisión.
+  - Se lanza un nuevo -rc cada vez que Linus considera que el árbol git
+    actual está en un estado razonablemente sano y adecuado para la prueba.
+    La meta es lanzar un nuevo kernel -rc cada semana.
+  - El proceso continúa hasta que el kernel se considera "listo", y esto
+    puede durar alrededor de 6 semanas.
+
+Vale la pena mencionar lo que Andrew Morton escribió en las listas de
+correo del kernel de Linux, sobre lanzamientos del kernel (traducido):
+
+	*"Nadie sabe cuándo se publicará un nuevo kernel, porque esto sucede
+	de acuerdo con el estado de bugs (errores) percibido, no de acuerdo con
+	una línea temporal preconcebida."*
+
+Varios árboles estables con múltiples major numbers
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Los kernels con versiones de 3 partes son kernels estables. Estos contienen
+correcciones relativamente pequeñas y críticas para problemas de seguridad
+o importantes regresiones descubiertas para una publicación de código.
+Cada lanzamiento en una gran serie estable incrementa la tercera parte del
+número de versión, manteniendo las dos primeras partes iguales.
+
+Esta es la rama recomendada para los usuarios que quieren la versión
+estable más reciente del kernel, y no están interesados en ayudar a probar
+versiones en desarrollo/experimentales.
+
+Los árboles estables son mantenidos por el equipo "estable"
+<stable@vger.kernel.org>, y se liberan (publican) según lo dicten las
+necesidades. El período de liberación normal es de aproximadamente dos
+semanas, pero puede ser más largo si no hay problemas apremiantes. Un
+problema relacionado con la seguridad, en cambio, puede causar un
+lanzamiento casi instantáneamente.
+
+El archivo :ref:`Documentation/process/stable-kernel-rules.rst <stable_kernel_rules>`
+en el árbol del kernel documenta qué tipos de cambios son aceptables para
+el árbol estable y cómo funciona el proceso de lanzamiento.
+
+Subsistemas específicos
+~~~~~~~~~~~~~~~~~~~~~~~~
+Los maintainers de los diversos subsistemas del kernel --- y también muchos
+desarrolladores de subsistemas del kernel --- exponen su estado actual de
+desarrollo en repositorios fuente. De esta manera, otros pueden ver lo que
+está sucediendo en las diferentes áreas del kernel. En áreas donde el
+desarrollo es rápido, se le puede pedir a un desarrollador que base sus
+envíos en tal árbol del subsistema del kernel, para evitar conflictos entre
+este y otros trabajos ya en curso.
+
+La mayoría de estos repositorios son árboles git, pero también hay otros
+SCM en uso, o colas de parches que se publican como series quilt. Las
+direcciones de estos repositorios de subsistemas se enumeran en el archivo
+MAINTAINERS. Muchos de estos se pueden ver en https://git.kernel.org/.
+
+Antes de que un parche propuesto se incluya en dicho árbol de subsistemas,
+está sujeto a revisión, que ocurre principalmente en las listas de correo
+(ver la sección respectiva a continuación). Para varios subsistemas del
+kernel, esta revisión se rastrea con la herramienta patchwork. Patchwork
+ofrece una interfaz web que muestra publicaciones de parches, cualquier
+comentario sobre un parche o revisiones a él, y los maintainers pueden
+marcar los parches como en revisión, aceptado, o rechazado. La mayoría de
+estos sitios de patchwork se enumeran en
+
+https://patchwork.kernel.org/.
+
+linux-next, para integración y testing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Antes de que las actualizaciones de los árboles de subsistemas se combinen
+con el árbol principal, necesitan probar su integración. Para ello, existe
+un repositorio especial de pruebas en el que se encuentran casi todos los
+árboles de subsistema, actualizado casi a diario:
+
+	https://git.kernel.org/?p=linux/kernel/git/next/linux-next.git
+
+De esta manera, linux-next ofrece una perspectiva resumida de lo que se
+espera que entre en el kernel principal en el próximo período de "merge"
+(fusión de código). Los testers aventureros son bienvenidos a probar
+linux-next en ejecución.
+
+Reportar bugs
+-------------
+
+El archivo 'Documentation/admin-guide/reporting-issues.rst' en el
+directorio principal del kernel describe cómo informar un posible bug del
+kernel y detalles sobre qué tipo de información necesitan los
+desarrolladores del kernel para ayudar a rastrear la fuente del problema.
+
+Gestión de informes de bugs
+------------------------------
+
+Una de las mejores formas de poner en práctica sus habilidades de hacking
+es arreglando errores reportados por otras personas. No solo ayudará a
+hacer el kernel más estable, también aprenderá a solucionar problemas del
+mundo real y mejorará sus habilidades, y otros desarrolladores se darán
+cuenta de su presencia. La corrección de errores es una de las mejores
+formas de ganar méritos entre desarrolladores, porque no a muchas personas
+les gusta perder el tiempo arreglando los errores de otras personas.
+
+Para trabajar en informes de errores ya reportados, busque un subsistema
+que le interese. Verifique en el archivo MAINTAINERS dónde se informan los
+errores de ese subsistema; con frecuencia será una lista de correo, rara
+vez un rastreador de errores (bugtracker). Busque en los archivos de dicho
+lugar para informes recientes y ayude donde lo crea conveniente. También es
+posible que desee revisar https://bugzilla.kernel.org para informes de
+errores; solo un puñado de subsistemas del kernel lo emplean activamente
+para informar o rastrear; sin embargo, todos los errores para todo el kernel
+se archivan allí.
+
+Listas de correo
+-----------------
+
+Como se explica en algunos de los documentos anteriores, la mayoría de
+desarrolladores del kernel participan en la lista de correo del kernel de
+Linux. Detalles sobre cómo suscribirse y darse de baja de la lista se
+pueden encontrar en:
+
+	http://vger.kernel.org/vger-lists.html#linux-kernel
+
+Existen archivos de la lista de correo en la web en muchos lugares
+distintos. Utilice un motor de búsqueda para encontrar estos archivos. Por
+ejemplo:
+
+	http://dir.gmane.org/gmane.linux.kernel
+
+Es muy recomendable que busque en los archivos sobre el tema que desea
+tratar, antes de publicarlo en la lista. Un montón de cosas ya discutidas
+en detalle solo se registran en los archivos de la lista de correo.
+
+La mayoría de los subsistemas individuales del kernel también tienen sus
+propias listas de correo donde hacen sus esfuerzos de desarrollo. Revise el
+archivo MAINTAINERS para obtener referencias de cuáles son estas listas
+para los diferentes grupos.
+
+Muchas de las listas están alojadas en kernel.org. La información sobre
+estas puede ser encontrada en:
+
+	http://vger.kernel.org/vger-lists.html
+
+Recuerde mantener buenos hábitos de comportamiento al usar las listas.
+Aunque un poco cursi, la siguiente URL tiene algunas pautas simples para
+interactuar con la lista (o cualquier lista):
+
+	http://www.albion.com/netiquette/
+
+Si varias personas responden a su correo, el CC (lista de destinatarios)
+puede hacerse bastante grande. No elimine a nadie de la lista CC: sin una
+buena razón, ni responda solo a la dirección de la lista. Acostúmbrese
+a recibir correos dos veces, una del remitente y otra de la lista, y no
+intente ajustar esto agregando encabezados de correo astutos; a la gente no
+le gustará.
+
+Recuerde mantener intacto el contexto y la atribución de sus respuestas,
+mantenga las líneas "El hacker John Kernel escribió ...:" en la parte
+superior de su respuesta, y agregue sus declaraciones entre las secciones
+individuales citadas en lugar de escribir en la parte superior del
+correo electrónico.
+
+Si incluye parches en su correo, asegúrese de que sean texto legible sin
+formato como se indica en :ref:`Documentation/process/submitting-patches.rst <submittingpatches>`.
+Los desarrolladores del kernel no quieren lidiar con archivos adjuntos o
+parches comprimidos; y pueden querer comentar líneas individuales de su
+parche, lo que funciona solo de esa manera. Asegúrese de emplear un programa
+de correo que no altere los espacios ni los tabuladores. Una buena primera
+prueba es enviarse el correo a usted mismo, e intentar aplicar su
+propio parche. Si eso no funciona, arregle su programa de correo o
+reemplácelo hasta que funcione.
+
+Sobre todo, recuerde ser respetuoso con otros suscriptores.
+
+Colaborando con la comunidad
+----------------------------
+
+El objetivo de la comunidad del kernel es proporcionar el mejor kernel
+posible. Cuando envíe un parche para su aceptación, se revisará en sus
+méritos técnicos solamente. Entonces, ¿qué debería esperar?
+
+  - críticas
+  - comentarios
+  - peticiones de cambios
+  - peticiones de justificaciones
+  - silencio
+
+Recuerde, esto es parte de introducir su parche en el kernel. Tiene que ser
+capaz de recibir críticas y comentarios sobre sus parches, evaluarlos
+a nivel técnico y reelaborar sus parches o proporcionar razonamiento claro
+y conciso de por qué no se deben hacer tales cambios. Si no hay respuestas
+a su publicación, espere unos días e intente de nuevo; a veces las cosas se
+pierden dado el gran volumen.
+
+¿Qué no debería hacer?
+
+  - esperar que su parche se acepte sin preguntas
+  - actuar de forma defensiva
+  - ignorar comentarios
+  - enviar el parche de nuevo, sin haber aplicado los cambios pertinentes
+
+En una comunidad que busca la mejor solución técnica posible, siempre habrá
+diferentes opiniones sobre lo beneficioso que es un parche. Tiene que ser
+cooperativo y estar dispuesto a adaptar su idea para que encaje dentro
+del kernel, o al menos estar dispuesto a demostrar que su idea vale la pena.
+Recuerde, estar equivocado es aceptable siempre y cuando esté dispuesto a
+trabajar hacia una solución que sea correcta.
+
+Es normal que las respuestas a su primer parche sean simplemente una lista
+de una docena de cosas que debe corregir. Esto **no** implica que su
+parche no será aceptado, y **no** es personal. Simplemente corrija todos
+los problemas planteados en su parche, y envíelo otra vez.
+
+Diferencias entre la comunidad kernel y las estructuras corporativas
+--------------------------------------------------------------------
+
+La comunidad del kernel funciona de manera diferente a la mayoría de los
+entornos de desarrollo tradicionales en empresas. Aquí hay una lista de
+cosas que puede intentar hacer para evitar problemas:
+
+  Cosas buenas que decir respecto a los cambios propuestos:
+
+    - "Esto arregla múltiples problemas."
+    - "Esto elimina 2000 líneas de código."
+    - "Aquí hay un parche que explica lo que intento describir."
+    - "Lo he testeado en 5 arquitecturas distintas..."
+    - "Aquí hay una serie de parches menores que..."
+    - "Esto mejora el rendimiento en máquinas típicas..."
+
+  Cosas negativas que debe evitar decir:
+
+    - "Lo hicimos así en AIX/ptx/Solaris, de modo que debe ser bueno..."
+    - "Llevo haciendo esto 20 años, de modo que..."
+    - "Esto lo necesita mi empresa para ganar dinero"
+    - "Esto es para la línea de nuestros productos Enterprise"
+    - "Aquí está el documento de 1000 páginas describiendo mi idea"
+    - "Llevo 6 meses trabajando en esto..."
+    - "Aquí está un parche de 5000 líneas que..."
+    - "He reescrito todo el desastre actual, y aquí está..."
+    - "Tengo un deadline, y este parche debe aplicarse ahora."
+
+Otra forma en que la comunidad del kernel es diferente a la mayoría de los
+entornos de trabajo tradicionales en ingeniería de software, es la
+naturaleza sin rostro de interacción. Una de las ventajas de utilizar el
+correo electrónico y el IRC como formas principales de comunicación es la
+no discriminación por motivos de género o raza. El entorno de trabajo del
+kernel de Linux acepta a mujeres y minorías porque todo lo que eres es una
+dirección de correo electrónico. El aspecto internacional también ayuda a
+nivelar el campo de juego porque no puede adivinar el género basado en
+el nombre de una persona. Un hombre puede llamarse Andrea y una mujer puede
+llamarse Pat. La mayoría de las mujeres que han trabajado en el kernel de
+Linux y han expresado una opinión han tenido experiencias positivas.
+
+La barrera del idioma puede causar problemas a algunas personas que no se
+sienten cómodas con el inglés. Un buen dominio del idioma puede ser
+necesario para transmitir ideas correctamente en las listas de correo, por
+lo que le recomendamos que revise sus correos electrónicos para asegurarse
+de que tengan sentido en inglés antes de enviarlos.
+
+Divida sus cambios
+---------------------
+
+La comunidad del kernel de Linux no acepta con gusto grandes fragmentos de
+código, sobre todo a la vez. Los cambios deben introducirse correctamente,
+discutirse y dividirse en pequeñas porciones individuales. Esto es casi
+exactamente lo contrario de lo que las empresas están acostumbradas a hacer.
+Su propuesta también debe introducirse muy temprano en el proceso de
+desarrollo, de modo que pueda recibir comentarios sobre lo que está
+haciendo. También deje que la comunidad sienta que está trabajando con
+ellos, y no simplemente usándolos como un vertedero para su función. Sin
+embargo, no envíe 50 correos electrónicos de una vez a una lista de correo;
+su serie de parches debe casi siempre ser más pequeña que eso.
+
+Las razones para dividir las cosas son las siguientes:
+
+1) Los cambios pequeños aumentan la probabilidad de que sus parches sean
+   aplicados, ya que no requieren mucho tiempo o esfuerzo para verificar su
+   exactitud. Un parche de 5 líneas puede ser aplicado por un maintainer
+   con apenas una segunda mirada. Sin embargo, un parche de 500 líneas
+   puede tardar horas en ser revisado en términos de corrección (el tiempo
+   que toma es exponencialmente proporcional al tamaño del parche, o algo
+   así).
+
+   Los parches pequeños también facilitan la depuración cuando algo falla.
+   Es mucho más fácil retirar los parches uno por uno que diseccionar un
+   parche muy grande después de haber sido aplicado (y roto alguna cosa).
+
+2) Es importante no solo enviar pequeños parches, sino también reescribir
+   y simplificar (o simplemente reordenar) los parches antes de enviarlos.
+
+Esta es una analogía del desarrollador del kernel Al Viro (traducida):
+
+	*"Piense en un maestro que califica la tarea de un estudiante de
+	matemáticas. El maestro no quiere ver los intentos y errores del
+	estudiante antes de que se le ocurriera la solución. Quiere ver la
+	respuesta más limpia y elegante. Un buen estudiante lo sabe, y nunca
+	presentaría su trabajo intermedio antes de tener la solución final.*
+
+	*Lo mismo ocurre con el desarrollo del kernel. Los maintainers y
+	revisores no quieren ver el proceso de pensamiento detrás de la solución
+	al problema que se está resolviendo. Quieren ver una solución simple y
+	elegante."*
+
+Puede resultar un reto mantener el equilibrio entre presentar una solución
+elegante y trabajar junto a la comunidad, discutiendo su trabajo inacabado.
+Por lo tanto, es bueno comenzar temprano en el proceso para obtener
+"feedback" y mejorar su trabajo, pero también mantenga sus cambios en
+pequeños trozos que puedan ser aceptados, incluso cuando toda su labor no
+esté lista para inclusión en un momento dado.
+
+También tenga en cuenta que no es aceptable enviar parches para su
+inclusión que están sin terminar y serán "arreglados más tarde".
+
+Justifique sus cambios
+----------------------
+
+Además de dividir sus parches, es muy importante que haga saber a la
+comunidad de Linux por qué debería agregar este cambio. Las nuevas
+características deben justificarse como necesarias y útiles.
+
+Documente sus cambios
+---------------------
+
+Cuando envíe sus parches, preste especial atención a lo que dice en el
+texto de su correo electrónico. Esta información se convertirá en el
+ChangeLog del parche, y se conservará para que todos la vean, todo el
+tiempo. Debe describir el parche por completo y contener:
+
+  - por qué los cambios son necesarios
+  - el diseño general de su propuesta
+  - detalles de implementación
+  - resultados de sus experimentos
+
+Para obtener más detalles sobre cómo debería quedar todo esto, consulte la
+sección ChangeLog del documento:
+
+  "The Perfect Patch"
+      https://www.ozlabs.org/~akpm/stuff/tpp.txt
+
+Todas estas cuestiones son a veces muy difíciles de conseguir. Puede
+llevar años perfeccionar estas prácticas (si es que lo hace). Es un proceso
+continuo de mejora que requiere mucha paciencia y determinación. Pero no se
+rinda, es posible. Muchos lo han hecho antes, y cada uno tuvo que comenzar
+exactamente donde está usted ahora.
+
+
+----------
+
+Gracias a Paolo Ciarrocchi que permitió que la sección "Development Process"
+se basara en el texto que había escrito (https://lwn.net/Articles/94386/),
+y a Randy Dunlap y Gerrit Huizenga por parte de la lista de cosas que
+debe y no debe decir. También gracias a Pat Mochel, Hanna Linder, Randy
+Dunlap, Kay Sievers, Vojtech Pavlik, Jan Kara, Josh Boyer, Kees Cook,
+Andrew Morton, Andi Kleen, Vadim Lobanov, Jesper Juhl, Adrian Bunk,
+Keri Harris, Frans Pop, David A. Wheeler, Junio Hamano, Michael Kerrisk y
+Alex Shepard por su revisión, comentarios y contribuciones. Sin su ayuda,
+este documento no hubiera sido posible.
+
+Maintainer: Greg Kroah-Hartman <greg@kroah.com>
diff --git a/Documentation/translations/sp_SP/index.rst b/Documentation/translations/sp_SP/index.rst
index bf6a24a2399d..1cc566058f2a 100644
--- a/Documentation/translations/sp_SP/index.rst
+++ b/Documentation/translations/sp_SP/index.rst
@@ -71,3 +71,10 @@ constante desarrollo. Las mejoras en la documentación siempre son
 bienvenidas; de modo que, si desea ayudar, únase a la lista de correo de
 linux-doc en vger.kernel.org.

+Traducciones al español
+=======================
+
+.. toctree::
+   :maxdepth: 1
+
+   howto
--
2.34.1


* man-pages-6.00 released
@ 2022-10-09 18:01  2% Alejandro Colomar
  0 siblings, 0 replies; 200+ results
From: Alejandro Colomar @ 2022-10-09 18:01 UTC (permalink / raw)
  To: Michael Kerrisk, LKML, GNU C Library, linux-man, groff
  Cc: Jonathan Corbet, Dr. Tobias Quathamer


[-- Attachment #1.1: Type: text/plain, Size: 12664 bytes --]

Gidday!

I'm proud to announce:

     man-pages-6.00 - manual pages for GNU/Linux

This release resulted from patches, bug reports, reviews, and comments
from around 145 contributors.  The release includes around 1245
commits, and changed all of the pages.

Tarball download:
     TBD - However, you should be able to generate locally
     a set of tarballs from the git repository with `make dist`,
     which will generate .tar, .tar.gz, and .tar.xz archives.
Git repository:
     https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
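
A rough sketch of the local tarball generation mentioned above, assuming
git and make are installed.  The clone URL and the tag name in it are
assumptions (the usual kernel.org clone path for this repository, and the
"man-pages-X.YY" tag naming); neither is stated in this announcement, so
adjust them if they differ:

```shell
#!/bin/sh
# Sketch only: build man-pages tarballs from a git checkout with `make dist`.
# Assumption: REPO_URL is the usual kernel.org clone path for the repository
# linked above; it is not stated in the announcement.
REPO_URL="https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git"
# Assumption: the release tag follows the "man-pages-X.YY" naming.
TAG="man-pages-6.00"

build_tarballs() {
    # Clone the repository and check out the release tag (detached HEAD).
    git clone --branch "$TAG" "$REPO_URL" man-pages || return 1
    # `make dist` is the target named in the announcement.
    (cd man-pages && make dist)
}

# Nothing is downloaded unless the script is called with --run.
if [ "${1:-}" = "--run" ]; then
    build_tarballs
fi
```

Saved under a hypothetical name such as build-dist.sh, it would be run as
`sh build-dist.sh --run`.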

The most notable of the changes in man-pages-6.00 are the following:

- A new set of man dirs: man2type/, man3const/, man3head/, and man3type/.
   These hold new pages and pages split from system_data_types(7),
   which had become too big in the recent releases.

- An improved build system, which allows running linter programs that
   check the correctness of both the man(7) source and the C programs in
   EXAMPLES.

- A new LIBRARY section (mostly in sections 2 and 3).  There have also
   been other important changes to the title and other sections, such as
   the removal of the COLOPHON.

- We have added several new pages documenting new kernel features, such
   as landlock(7) and memfd_secret(2).

Special mention to наб, with 58 commits to this release.

Thank you all for contributing.  Especially to those in the groff@
mailing list who helped me a lot in this release, and to Michael (mtk).

Cheers,

Alex

==================== Changes in man-pages-6.00 ====================

Released: 2022-10-09, València


Contributors
------------

The following people contributed patches/fixes, reports, notes,
ideas, and discussions that have been incorporated in changes in
this release:


"Darrick J. Wong" <darrick.wong@oracle.com>
"Dr. Jürgen Sauermann" <mail@xn--jrgen-sauermann-zvb.de>
"Dr. Wolfgang Armbruster" <dr.w.e.armbruster@gmail.com>
"G. Branden Robinson" <g.branden.robinson@gmail.com>
"M. Welinder" <mwelinder@gmail.com>
"Theodore Ts'o" <tytso@mit.edu>
"Todd C. Miller" <Todd.Miller@sudo.ws>
"Valentin V. Bartenev" <vbart@nginx.com>
<pellucida@tutanota.com>
Adhemerval Zanella <adhemerval.zanella@linaro.org>
Ahelenia Ziemiańska (наб) <nabijaczleweli@nabijaczleweli.xyz>
Alejandro Colomar <alx@kernel.org>
Aleksander Baranowski <alex@euro-linux.com>
Alexander Viro <viro@zeniv.linux.org.uk>
Alexei Starovoitov <ast@kernel.org>
Amir Goldstein <amir73il@gmail.com>
Andrea Cervesato <andrea.cervesato@suse.com>
Andreas Dilger <adilger@dilger.ca>
Andrew Morton <akpm@linux-foundation.org>
Andrew Morton <akpm@osdl.org>
Andrew Persons <andrewscottpersons@gmail.com>
Andrew Wock <ajwock@gmail.com>
Anna Schumaker <anna.schumaker@netapp.com>
Arnd Bergmann <arnd@arndb.de>
Avinash Sonawane <rootkea@gmail.com>
Axel Rasmussen <axelrasmussen@google.com>
Benjamin Peterson <benjamin@python.org>
Benoit Lecocq <benoit@openbsd.org>
Bjarni Ingi Gislason <bjarniig@vortex.is>
Brett Holman <bholman.devel@gmail.com>
Carlos O'Donell <carlos@redhat.com>
Charan Teja Reddy <quic_charante@quicinc.com>
Christian Aistleitner <christian@quelltextlich.at>
Christian Brauner <brauner@kernel.org>
Christoph Hellwig <hch@infradead.org>
Cyril Hrubis <chrubis@suse.cz>
Daniel Borkmann <daniel@iogearbox.net>
Dave Chinner <dchinner@redhat.com>
Dave Kemper <saint.snit@gmail.com>
David Hildenbrand <david@redhat.com>
David Howells <dhowells@redhat.com>
David Laight <David.Laight@ACULAB.COM>
David Sletten <david.paul.sletten@gmail.com>
David Ward <david.ward@gatech.edu>
Davide Benini <davide.benini@gmail.com>
Donald Buczek <buczek@molgen.mpg.de>
Elliott Hughes <enh@google.com>
Eric Biggers <ebiggers@kernel.org>
Eric Dumazet <edumazet@google.com>
Eugene Syromyatnikov <evgsyr@gmail.com>
Fabian <fabian@ritter-vogt.de>
Florian Weimer <fweimer@redhat.com>
GUO Zihua <guozihua@huawei.com>
Gabriel Krisman Bertazi <krisman@collabora.com>
Greg Banks <gbanks@linkedin.com>
Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Grzegorz Szpetkowski <gszpetkowski@gmail.com>
Günther Noack <guenther@unix-ag.uni-kl.de>
Heinrich Schuchardt <xypron.glpk@gmx.de>
Huang Pei <huangpei@loongson.cn>
Ian Abbott <abbotti@mev.co.uk>
Ian Lance Taylor <iant@google.com>
Ingo Schwarze <schwarze@openbsd.org>
Jakub Sitnicki <jakub@cloudflare.com>
Jakub Wilk <jwilk@jwilk.net>
Jan Kara <jack@suse.cz>
Jann Horn <jannh@google.com>
Jayprakash Ray <r.jay3283@gmail.com>
JeanHeyd Meneide <wg14@soasis.org>
Jeff Layton <jlayton@kernel.org>
Jens Gustedt <jens.gustedt@inria.fr>
Jeremy Kerr <jk@codeconstruct.com.au>
Jesse Rosenstock <jmr@google.com>
Joseph Myers <joseph@codesourcery.com>
Kir Kolyshkin <kolyshkin@gmail.com>
Klemens Nanni <kn@openbsd.org>
Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Linus Torvalds <torvalds@linuxfoundation.org>
Lucien Gentis <lucien.gentis@waika9.com>
Luis Henriques <lhenriques@suse.de>
Luis Lozano <llozano@chromium.org>
Marco Bonelli <marco@mebeim.net>
Masatake YAMATO <yamato@redhat.com>
Matheus Tavares <matheus.bernardino@usp.br>
Mathnerd314 <mathnerd314.gph@gmail.com>
Matthew Bobrowski <repnop@google.com>
Matthew Wilcox <willy@infradead.org>
Melker Narikka <meklu@meklu.org>
Michael Kearney <mikekearney85@hotmail.com>
Michael Kerrisk <mtk.manpages@gmail.com>
Michal Hocko <mhocko@suse.com>
Mickaël Salaün <mic@linux.microsoft.com>
Mike Frysinger <vapier@gentoo.org>
Mike Kravetz <mike.kravetz@oracle.com>
Mike Rapoport <rppt@linux.ibm.com>
Miklos Szeredi <miklos@szeredi.hu>
Nadav Amit <nadav.amit@gmail.com>
NeilBrown <neilb@suse.de>
Nicolas Boichat <drinkcat@chromium.org>
Nikola Forró <nforro@redhat.com>
Olga Kornievskaia <aglo@umich.edu>
Oscar Salvador <osalvador@suse.de>
Pali Rohár <pali@kernel.org>
Pankaj Gupta <pankaj.gupta@ionos.com>
Patrick Reader <_@pxeger.com>
Paul Eggert <eggert@cs.ucla.edu>
Peter Xu <peterx@redhat.com>
Petr Vorel <pvorel@suse.cz>
Pádraig Brady <P@draigBrady.com>
Quentin Monnet <quentin.monnet@netronome.com>
Ralf Baechle <ralf@linux-mips.org>
Ralph Corderoy <ralph@inputplus.co.uk>
Randall <rsbecker@nexbridge.com>
Rich Felker <dalias@libc.org>
Robert Schneider <robert.schneider03@sap.com>
Rumen Telbizov <rumen.telbizov@menlosecurity.com>
Sam James <sam@gentoo.org>
Samanta Navarro <ferivoz@riseup.net>
Sean Young <sean@mess.org>
Simon Branch <simonmbranch@gmail.com>
Stefan Puiu <stefan.puiu@gmail.com>
Stephen Kitt <steve@sk2.org>
Steve French <sfrench@samba.org>
Suren Baghdasaryan <surenb@google.com>
Theo de Raadt <deraadt@openbsd.org>
Theodore Dubois <tbodt@google.com>
Tilman Schmidt <tilman@imap.cc>
Tobias Stoeckmann <tobias@stoeckmann.org>
Topi Miettinen <toiwoton@gmail.com>
Trevor Woerner <twoerner@gmail.com>
Trond Myklebust <trond.myklebust@hammerspace.com>
Vincent Lefevre <vincent@vinc17.net>
Vito Caputo <vcaputo@pengaru.com>
Walter Harms <wharms@bfs.de>
Wei Wang <weiwan@google.com>
Yang Xu <xuyang2018.jy@fujitsu.com>
Yuchung Cheng <ycheng@google.com>
Zack Weinberg <zack@owlfolio.org>
enh <enh@google.com>
glibg10b <pugonfireyt@gmail.com>
nick black <nickblack@linux.com>
zhangkui <zhangkui@oppo.com>
Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Štěpán Němec <stepnem@smrk.net>

Apologies if I missed anyone!


New and rewritten pages
-----------------------

man2/
     landlock_add_rule.2
     landlock_create_ruleset.2
     landlock_restrict_self.2
     memfd_secret.2

man2type/
     open_how.2type

man3/
     _Generic.3

man3const/
     NULL.3const

man3head/
     sysexits.h.3head

man3type/
     aiocb.3type
     blkcnt_t.3type
     blksize_t.3type
     cc_t.3type
     clock_t.3type
     clockid_t.3type
     dev_t.3type
     div_t.3type
     double_t.3type
     epoll_event.3type
     fenv_t.3type
     id_t.3type
     intN_t.3type
     intmax_t.3type
     intptr_t.3type
     iovec.3type
     itimerspec.3type
     lconv.3type
     mode_t.3type
     off_t.3type
     ptrdiff_t.3type
     regex_t.3type
     size_t.3type
     sockaddr.3type
     stat.3type
     time_t.3type
     timer_t.3type
     timespec.3type
     timeval.3type
     tm.3type
     va_list.3type
     void.3type

man7/
     landlock.7


Newly documented interfaces in existing pages
---------------------------------------------

epoll_wait.2
     epoll_pwait2(2)

fanotify_init.2
     FAN_REPORT_PIDFD

fanotify_mark.2
     FAN_FS_ERROR
     FAN_MARK_EVICTABLE
     FAN_RENAME
     FAN_REPORT_TARGET_FID

madvise.2
     MADV_POPULATE_READ
     MADV_POPULATE_WRITE

pipe.2
     O_NOTIFICATION_PIPE

process_madvise.2
     MADV_WILLNEED

send.2
     MSG_FASTOPEN

userfaultfd.2
     UFFD_USER_MODE_ONLY

proc.5
     /proc/[pid]/pagemap    bit 57

fanotify.7
     /proc/sys/fs/fanotify/max_queued_events
     /proc/sys/fs/fanotify/max_user_group
     /proc/sys/fs/fanotify/max_user_marks

tcp.7
     TCP_FASTOPEN
     TCP_FASTOPEN_CONNECT


New and changed links
---------------------

man3/
     strftime_l.3

man3type/
     epoll_data.3type
     epoll_data_t.3type
     fexcept_t.3type
     float_t.3type
     gid_t.3type
     imaxdiv_t.3type
     in6_addr.3type
     in_addr.3type
     in_addr_t.3type
     in_port_t.3type
     int16_t.3type
     int32_t.3type
     int64_t.3type
     int8_t.3type
     ldiv_t.3type
     lldiv_t.3type
     loff_t.3type
     off64_t.3type
     pid_t.3type
     regmatch_t.3type
     regoff_t.3type
     sa_family_t.3type
     sockaddr_in.3type
     sockaddr_in6.3type
     sockaddr_storage.3type
     sockaddr_un.3type
     socklen_t.3type
     speed_t.3type
     ssize_t.3type
     suseconds_t.3type
     tcflag_t.3type
     uid_t.3type
     uint16_t.3type
     uint32_t.3type
     uint64_t.3type
     uint8_t.3type
     uintN_t.3type
     uintmax_t.3type
     uintptr_t.3type
     useconds_t.3type


Global changes
--------------

- Man dirs:

   - Move definitions of types to separate pages in man2type/ and
     man3type/.  Previously, they were spread (and duplicated) in other
     pages, or in system_data_types.7 (with links in man3/).

   - Add man3head/ for pages that document header files.

   - Add man3const/ for pages that document constants.

- Licenses:

   - Use SPDX-License-Identifier for licenses specified by SPDX
     (including the newly-added Linux-man-pages-copyleft).  This reduces
     the overhead text at the top of most manual page source files.
     License texts have been moved to LICENSES/.

- Build system:

   - Add several make(1) targets to lint the manual pages, and also lint
     and build the C programs contained in them.  Use of these targets
     requires unreleased versions of software, such as groff-1.23.0, so
     it's not yet intended to be used by the public.

   - Add targets to build tarballs of the repository.

- man(7) source:

   - Improve consistency of man(7) source.  Also, reduce the number of
     warnings that groff(1) and mandoc(1) emit when parsing the pages
     with the highest warning level.  Most of these fixes were found
     thanks to the new `make lint-man` target.

- Manual pages sections:

   - Title (.TH):

     - Remove 5th argument to TH (middle-header).

     - Specify "Linux man-pages" and the version in the 4th argument
       (left-footer).

   - Add the LIBRARY section.  This section standardizes a way to
     document the library that provides a given interface.

   - Add the CAVEATS section.  BUGS and NOTES were serving that purpose
     before, but CAVEATS is more appropriate.

   - Rename the CONFORMING TO section to STANDARDS for consistency with
     other projects, such as the BSDs.

   - SYNOPSIS:  Add the ISO C2X [[deprecated]] attribute for functions
     that have been deprecated or removed.

   - EXAMPLES:  Improve consistency of C source code.  Also, reduce the
     number of warnings that several linting tools emit.

   - COLOPHON:  Remove section (its purpose is now served by the title).

- Repository:

   - CONTRIBUTING, README, INSTALL:  Document important changes in the
     project organization.


Changes to individual pages
---------------------------

The manual pages (and other files in the repository) have been improved
beyond what this changelog covers.  To learn more about changes applied
to individual pages, use git(1).


-- 
Alejandro Colomar; <http://www.alejandro-colomar.es/>
Linux man-pages maintainer; <http://www.kernel.org/doc/man-pages/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 2%]

* [PATCH v2 07/39] x86/cet: Add user control-protection fault handler
  @ 2022-09-29 22:29  3% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2022-09-29 22:29 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Weijiang Yang, Kirill A . Shutemov,
	joao.moreira, John Allen, kcc, eranian, rppt, jamorris, dethoma
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack.

There already exists a control-protection fault handler for handling kernel
IBT. Refactor this fault handler into separate user and kernel handlers,
like the page fault handler. Add a control-protection handler for usermode.

The control-protection fault handler works in a similar way to the general
protection fault handler. It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>

---

v2:
 - Integrate with kernel IBT fault handler
 - Update printed messages. (Dave)
 - Remove array_index_nospec() usage. (Dave)
 - Remove IBT messages. (Dave)
 - Add enclave error code bit processing in case it can get triggered
   somehow.
 - Add extra "unknown" in control_protection_err.

v1:
 - Update static asserts for NSIGSEGV

Yu-cheng v29:
 - Remove pr_emerg() since it is followed by die().
 - Change boot_cpu_has() to cpu_feature_enabled().

Yu-cheng v25:
 - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
 - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/arm/kernel/signal.c           |  2 +-
 arch/arm64/kernel/signal.c         |  2 +-
 arch/arm64/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal_64.c      |  2 +-
 arch/x86/include/asm/idtentry.h    |  2 +-
 arch/x86/kernel/idt.c              |  2 +-
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 98 ++++++++++++++++++++++++++----
 arch/x86/xen/enlighten_pv.c        |  2 +-
 arch/x86/xen/xen-asm.S             |  2 +-
 include/uapi/asm-generic/siginfo.h |  3 +-
 12 files changed, 97 insertions(+), 24 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index ea128e32e8ca..fa47b8754624 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 9ad911f1647c..81b13a21046e 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -1166,7 +1166,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 4700f8522d27..bbd542704730 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..82da8a2d769d 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -751,7 +751,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 570e43e6fda5..b4e410976e0d 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 72184b0b2219..6768c9d4468c 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -618,7 +618,7 @@ DECLARE_IDTENTRY_RAW_ERRORCODE(X86_TRAP_DF,	xenpv_exc_double_fault);
 #endif
 
 /* #CP */
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP,	exc_control_protection);
 #endif
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index a58c6bc1cd68..90cce3614ead 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -107,7 +107,7 @@ static const __initconst struct idt_data def_idts[] = {
 	ISTG(X86_TRAP_MC,		asm_exc_machine_check, IST_INDEX_MCE),
 #endif
 
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 	INTG(X86_TRAP_CP,		asm_exc_control_protection),
 #endif
 
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 879ef8c72f5c..d441804443d5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d62b2cb85cea..b7dde8730236 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -211,12 +211,6 @@ DEFINE_IDTENTRY(exc_overflow)
 	do_error_trap(regs, 0, "overflow", X86_TRAP_OF, SIGSEGV, 0, NULL);
 }
 
-#ifdef CONFIG_X86_KERNEL_IBT
-
-static __ro_after_init bool ibt_fatal = true;
-
-extern void ibt_selftest_ip(void); /* code label defined in asm below */
-
 enum cp_error_code {
 	CP_EC        = (1 << 15) - 1,
 
@@ -229,16 +223,74 @@ enum cp_error_code {
 	CP_ENCL	     = 1 << 15,
 };
 
-DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+static void do_user_control_protection_fault(struct pt_regs *regs,
+					     unsigned long error_code)
 {
-	if (!cpu_feature_enabled(X86_FEATURE_IBT)) {
-		pr_err("Unexpected #CP\n");
-		BUG();
+	struct task_struct *tsk;
+	unsigned long ssp;
+
+	/* Read SSP before enabling interrupts. */
+	rdmsrl(MSR_IA32_PL3_SSP, ssp);
+
+	cond_local_irq_enable(regs);
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/* Ratelimit to prevent log spamming. */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned int cpec;
+
+		cpec = error_code & CP_EC;
+		if (cpec >= ARRAY_SIZE(control_protection_err))
+			cpec = 0;
+
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)%s",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpec],
+			 error_code & CP_ENCL ? " in enclave" : "");
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
 	}
 
-	if (WARN_ON_ONCE(user_mode(regs) || (error_code & CP_EC) != CP_ENDBR))
-		return;
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#else
+static void do_user_control_protection_fault(struct pt_regs *regs,
+					     unsigned long error_code)
+{
+	WARN_ONCE(1, "User-mode control protection fault with shadow support disabled\n");
+}
+#endif
+
+#ifdef CONFIG_X86_KERNEL_IBT
+
+static __ro_after_init bool ibt_fatal = true;
+
+extern void ibt_selftest_ip(void); /* code label defined in asm below */
 
+static void do_kernel_control_protection_fault(struct pt_regs *regs)
+{
 	if (unlikely(regs->ip == (unsigned long)&ibt_selftest_ip)) {
 		regs->ax = 0;
 		return;
@@ -283,9 +335,29 @@ static int __init ibt_setup(char *str)
 }
 
 __setup("ibt=", ibt_setup);
-
+#else
+static void do_kernel_control_protection_fault(struct pt_regs *regs)
+{
+	WARN_ONCE(1, "Kernel-mode control protection fault with IBT disabled\n");
+}
 #endif /* CONFIG_X86_KERNEL_IBT */
 
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	if (!cpu_feature_enabled(X86_FEATURE_IBT) &&
+	    !cpu_feature_enabled(X86_FEATURE_SHSTK)) {
+		pr_err("Unexpected #CP\n");
+		BUG();
+	}
+
+	if (user_mode(regs))
+		do_user_control_protection_fault(regs, error_code);
+	else
+		do_kernel_control_protection_fault(regs);
+}
+#endif /* defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK) */
+
 #ifdef CONFIG_X86_F00F_BUG
 void handle_invalid_op(struct pt_regs *regs)
 #else
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 0ed2e487a693..57faa287163f 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -628,7 +628,7 @@ static struct trap_array_entry trap_array[] = {
 	TRAP_ENTRY(exc_coprocessor_error,		false ),
 	TRAP_ENTRY(exc_alignment_check,			false ),
 	TRAP_ENTRY(exc_simd_coprocessor_error,		false ),
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 	TRAP_ENTRY(exc_control_protection,		false ),
 #endif
 };
diff --git a/arch/x86/xen/xen-asm.S b/arch/x86/xen/xen-asm.S
index 6b4fdf6b9542..e45ff6300c7d 100644
--- a/arch/x86/xen/xen-asm.S
+++ b/arch/x86/xen/xen-asm.S
@@ -148,7 +148,7 @@ xen_pv_trap asm_exc_page_fault
 xen_pv_trap asm_exc_spurious_interrupt_bug
 xen_pv_trap asm_exc_coprocessor_error
 xen_pv_trap asm_exc_alignment_check
-#ifdef CONFIG_X86_KERNEL_IBT
+#if defined(CONFIG_X86_KERNEL_IBT) || defined(CONFIG_X86_SHADOW_STACK)
 xen_pv_trap asm_exc_control_protection
 #endif
 #ifdef CONFIG_X86_MCE
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index ffbe4cec9f32..0f52d0ac47c5 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -242,7 +242,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1


^ permalink raw reply related	[relevance 3%]
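
For readers on the userspace side, here is a minimal sketch (not part of
the patch) of how a SIGSEGV handler might recognize a control-protection
fault through the si_code that the patch introduces.  The helper names
is_control_protection_fault(), segv_handler() and install_segv_handler()
are hypothetical.  The fallback definition of SEGV_CPERR as 10 is copied
from the siginfo.h hunk above and is only an assumption for systems whose
installed headers predate it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Assumption: older headers lack the constant; 10 is the value in the patch. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* True when the signal is a SIGSEGV raised for a control-protection fault. */
bool is_control_protection_fault(const siginfo_t *info)
{
	return info->si_signo == SIGSEGV && info->si_code == SEGV_CPERR;
}

/*
 * Report the cause and exit.  Only async-signal-safe calls are used.
 * The patch passes a null address to force_sig_fault(), so si_addr
 * carries no useful information for this si_code.
 */
void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char cp_msg[] = "SIGSEGV: control-protection fault\n";
	static const char other_msg[] = "SIGSEGV: other cause\n";
	ssize_t written;

	(void)sig;
	(void)ucontext;

	if (is_control_protection_fault(info))
		written = write(STDERR_FILENO, cp_msg, sizeof(cp_msg) - 1);
	else
		written = write(STDERR_FILENO, other_msg, sizeof(other_msg) - 1);
	(void)written;

	_exit(1);
}

/* Install the handler with SA_SIGINFO so that si_code is delivered. */
int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call install_segv_handler() early in main().  Returning
from the handler would re-execute the faulting instruction, which is why
the sketch exits instead.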

* [PATCH 5.4 19/34] exec: Force single empty string when argv is empty
  @ 2022-06-03 17:43  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-06-03 17:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski,
	Vegard Nossum

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
[vegard: fixed conflicts due to missing
 886d7de631da71e30909980fdbf318f7caade262^- and
 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^-]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

This has been tested in both argc == 0 and argc >= 1 cases, but I would
still appreciate a review given the differences with mainline. If it's
considered too risky I'm also fine with dropping it -- just wanted to
make sure this didn't fall through the cracks, as it does block a real
(albeit old by now) exploit.

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -454,6 +454,9 @@ static int prepare_arg_pages(struct linu
 	unsigned long limit, ptr_size;
 
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
+	if (bprm->argc == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (bprm->argc < 0)
 		return bprm->argc;
 
@@ -482,8 +485,14 @@ static int prepare_arg_pages(struct linu
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1848,6 +1857,20 @@ static int __do_execve_file(int fd, stru
 	if (retval < 0)
 		goto out;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		const char *argv[] = { "", NULL };
+		retval = copy_strings_kernel(1, argv, bprm);
+		if (retval < 0)
+			goto out;
+		bprm->argc = 1;
+	}
+
 	retval = exec_binprm(bprm);
 	if (retval < 0)
 		goto out;



^ permalink raw reply	[relevance 5%]
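
As a userspace illustration of the hazard described in the commit message
(not part of the backport), the sketch below shows defensive handling of
argv for programs that may still run on kernels without this change, or
that may receive the injected empty string.  The names program_name() and
operand_count() are hypothetical helpers, not an existing API.

```c
#include <stddef.h>

/*
 * Return argv[0] when it is usable, otherwise the caller's fallback.
 * Covers argc == 0 (kernels without the fix) and argv[0] == ""
 * (the empty string that the patched kernel injects).
 */
const char *program_name(int argc, char *const argv[], const char *fallback)
{
	if (argc < 1 || argv == NULL || argv[0] == NULL || argv[0][0] == '\0')
		return fallback;
	return argv[0];
}

/*
 * Number of arguments after argv[0].  Never negative, so a loop
 * bounded by this value cannot run past the argv terminator into envp.
 */
int operand_count(int argc)
{
	return argc > 1 ? argc - 1 : 0;
}
```

Looping with "for (i = 1; i <= operand_count(argc); i++)" rather than
walking from argv[1] until a NULL pointer keeps the program inside argv
even when argc is 0.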

* [PATCH 4.19 18/30] exec: Force single empty string when argv is empty
  @ 2022-06-03 17:39  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-06-03 17:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski,
	Vegard Nossum

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
[vegard: fixed conflicts due to missing
 886d7de631da71e30909980fdbf318f7caade262^- and
 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^- and
 655c16a8ce9c15842547f40ce23fd148aeccc074]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

This has been tested in both argc == 0 and argc >= 1 cases, but I would
still appreciate a review given the differences with mainline. If it's
considered too risky I'm also fine with dropping it -- just wanted to
make sure this didn't fall through the cracks, as it does block a real
(albeit old by now) exploit.

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1805,6 +1805,9 @@ static int __do_execve_file(int fd, stru
 		goto out_unmark;
 
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
+	if (bprm->argc == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if ((retval = bprm->argc) < 0)
 		goto out;
 
@@ -1829,6 +1832,20 @@ static int __do_execve_file(int fd, stru
 	if (retval < 0)
 		goto out;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		const char *argv[] = { "", NULL };
+		retval = copy_strings_kernel(1, argv, bprm);
+		if (retval < 0)
+			goto out;
+		bprm->argc = 1;
+	}
+
 	retval = exec_binprm(bprm);
 	if (retval < 0)
 		goto out;



^ permalink raw reply	[relevance 5%]

* [PATCH 4.14 13/23] exec: Force single empty string when argv is empty
  @ 2022-06-03 17:39  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-06-03 17:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski,
	Vegard Nossum

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
[vegard: fixed conflicts due to missing
 886d7de631da71e30909980fdbf318f7caade262^- and
 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^- and
 655c16a8ce9c15842547f40ce23fd148aeccc074]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

This has been tested in both argc == 0 and argc >= 1 cases, but I would
still appreciate a review given the differences with mainline. If it's
considered too risky I'm also fine with dropping it -- just wanted to
make sure this didn't fall through the cracks, as it does block a real
(albeit old by now) exploit.
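
For reviewers who want to reason about the argc == 0 handling outside the
kernel, here is a minimal userspace model of the fixup described above. It is
only a sketch: count_args(), fixup_empty_argv() and empty_argv are hypothetical
names that do not exist in fs/exec.c, and nothing here touches bprm or the
real stack setup.

```c
#include <stddef.h>

/*
 * Userspace model of the argc == 0 fixup; not kernel code.
 * All names below are hypothetical and chosen for illustration.
 */

/* Count the entries of a NULL-terminated list; a NULL list counts as empty. */
static int count_args(const char *const *argv)
{
	int n = 0;

	if (argv == NULL)
		return 0;
	while (argv[n] != NULL)
		n++;
	return n;
}

/* Replacement vector used when the caller supplied no arguments. */
static const char *const empty_argv[] = { "", NULL };

/*
 * Return the vector the new program should see and store its length
 * in *argc: the caller's vector if it has entries, else { "", NULL }.
 */
static const char *const *fixup_empty_argv(const char *const *argv, int *argc)
{
	*argc = count_args(argv);
	if (*argc == 0) {
		*argc = 1;
		return empty_argv;
	}
	return argv;
}
```

Both a NULL argv and an immediately terminated pointer list take the same
path, which matches the two cases the changelog calls out.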

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1788,6 +1788,9 @@ static int do_execveat_common(int fd, st
 		goto out_unmark;
 
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
+	if (bprm->argc == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if ((retval = bprm->argc) < 0)
 		goto out;
 
@@ -1812,6 +1815,20 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		const char *argv[] = { "", NULL };
+		retval = copy_strings_kernel(1, argv, bprm);
+		if (retval < 0)
+			goto out;
+		bprm->argc = 1;
+	}
+
 	retval = exec_binprm(bprm);
 	if (retval < 0)
 		goto out;




* [PATCH 4.9 06/12] exec: Force single empty string when argv is empty
  @ 2022-06-03 17:39  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-06-03 17:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski,
	Vegard Nossum

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
[vegard: fixed conflicts due to missing
 886d7de631da71e30909980fdbf318f7caade262^- and
 3950e975431bc914f7e81b8f2a2dbdf2064acb0f^- and
 655c16a8ce9c15842547f40ce23fd148aeccc074]
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

This has been tested in both argc == 0 and argc >= 1 cases, but I would
still appreciate a review given the differences with mainline. If it's
considered too risky I'm also fine with dropping it -- just wanted to
make sure this didn't fall through the cracks, as it does block a real
(albeit old by now) exploit.

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1758,6 +1758,9 @@ static int do_execveat_common(int fd, st
 		goto out_unmark;
 
 	bprm->argc = count(argv, MAX_ARG_STRINGS);
+	if (bprm->argc == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if ((retval = bprm->argc) < 0)
 		goto out;
 
@@ -1782,6 +1785,20 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		const char *argv[] = { "", NULL };
+		retval = copy_strings_kernel(1, argv, bprm);
+		if (retval < 0)
+			goto out;
+		bprm->argc = 1;
+	}
+
 	retval = exec_binprm(bprm);
 	if (retval < 0)
 		goto out;




* [PATCH 5.10 108/599] exec: Force single empty string when argv is empty
  @ 2022-04-05  7:26  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-04-05  7:26 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)
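
The stack accounting change in bprm_stack_limits() can be sketched in
userspace as follows. This is an illustration under assumptions, not the
kernel implementation: arg_ptr_size() and reserve_arg_pointers() are
hypothetical helpers, and only the max(argc, 1) arithmetic and the -E2BIG
result are taken from the patch.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Bytes needed for the argv and envp pointer slots. At least one argv
 * slot is always reserved so that an empty string can be injected later
 * when argc == 0. Hypothetical helper, userspace model only.
 */
static size_t arg_ptr_size(int argc, int envc)
{
	int slots = argc > 1 ? argc : 1;	/* max(argc, 1) */

	return ((size_t)slots + (size_t)envc) * sizeof(void *);
}

/*
 * Subtract the pointer space from *limit. Returns 0 on success or
 * -E2BIG (leaving *limit untouched) when the pointers do not fit.
 */
static int reserve_arg_pointers(size_t *limit, int argc, int envc)
{
	size_t ptr_size = arg_ptr_size(argc, envc);

	if (*limit <= ptr_size)
		return -E2BIG;
	*limit -= ptr_size;
	return 0;
}
```

With argc == 0 and envc == 3 the model reserves four pointer slots instead of
three, which is the extra space the injected "" needs.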

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -494,8 +494,14 @@ static int bprm_stack_limits(struct linu
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1886,6 +1892,9 @@ static int do_execveat_common(int fd, st
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
@@ -1912,6 +1921,19 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out_free;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		retval = copy_string_kernel("", bprm);
+		if (retval < 0)
+			goto out_free;
+		bprm->argc = 1;
+	}
+
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
@@ -1940,6 +1962,8 @@ int kernel_execve(const char *kernel_fil
 	}
 
 	retval = count_strings_kernel(argv);
+	if (WARN_ON_ONCE(retval == 0))
+		retval = -EINVAL;
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;




* [PATCH 5.15 156/913] exec: Force single empty string when argv is empty
  @ 2022-04-05  7:20  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-04-05  7:20 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)
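
On the userspace side, the "confused programs that start processing from
argv[1]" problem can be avoided by never indexing argv beyond argc. The sketch
below shows one defensive pattern; program_name() and first_operand() are
hypothetical helpers, not part of any library, and the empty-string check
assumes a kernel with this patch may hand over "" as argv[0].

```c
#include <stddef.h>

/*
 * Return a printable program name. Falls back to the supplied default
 * when argv[0] is missing (argc == 0 on unpatched kernels) or is the
 * empty string. Hypothetical helper.
 */
static const char *program_name(int argc, char *const argv[],
				const char *fallback)
{
	if (argc < 1 || argv == NULL || argv[0] == NULL || argv[0][0] == '\0')
		return fallback;
	return argv[0];
}

/*
 * Return argv[1] only when argc says it exists; otherwise NULL.
 * Checking argc first avoids reading past the argv terminator
 * into envp.
 */
static const char *first_operand(int argc, char *const argv[])
{
	if (argc < 2 || argv == NULL)
		return NULL;
	return argv[1];
}
```

A main() would typically call program_name(argc, argv, "prog") once at startup
and use the result in diagnostics.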

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -494,8 +494,14 @@ static int bprm_stack_limits(struct linu
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1895,6 +1901,9 @@ static int do_execveat_common(int fd, st
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
@@ -1921,6 +1930,19 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out_free;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		retval = copy_string_kernel("", bprm);
+		if (retval < 0)
+			goto out_free;
+		bprm->argc = 1;
+	}
+
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
@@ -1949,6 +1971,8 @@ int kernel_execve(const char *kernel_fil
 	}
 
 	retval = count_strings_kernel(argv);
+	if (WARN_ON_ONCE(retval == 0))
+		retval = -EINVAL;
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;




* [PATCH 5.16 0164/1017] exec: Force single empty string when argv is empty
  @ 2022-04-05  7:17  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-04-05  7:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)
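
The kernel-thread path differs from the userspace path: an empty argv is
rejected instead of being patched up. A rough userspace model of that
behaviour, assuming a warn-once flag in place of WARN_ON_ONCE(), might look
like the following; count_strings(), check_kernel_argv() and warned_once are
hypothetical names used only for this sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Set after the first warning, to imitate "once" semantics. */
static int warned_once;

/* Count a NULL-terminated string list; a NULL list counts as empty. */
static int count_strings(const char *const *argv)
{
	int n = 0;

	if (argv == NULL)
		return 0;
	while (argv[n] != NULL)
		n++;
	return n;
}

/*
 * Return the argument count, or -EINVAL when argv is empty.
 * The warning is printed at most once per process.
 */
static int check_kernel_argv(const char *const *argv)
{
	int n = count_strings(argv);

	if (n == 0) {
		if (!warned_once) {
			warned_once = 1;
			fprintf(stderr, "empty argv rejected\n");
		}
		return -EINVAL;
	}
	return n;
}
```

Rejecting is safe on this path because in-kernel callers can be fixed
directly, whereas existing userspace callers cannot.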

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -494,8 +494,14 @@ static int bprm_stack_limits(struct linu
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1893,6 +1899,9 @@ static int do_execveat_common(int fd, st
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
@@ -1919,6 +1928,19 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out_free;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		retval = copy_string_kernel("", bprm);
+		if (retval < 0)
+			goto out_free;
+		bprm->argc = 1;
+	}
+
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
@@ -1947,6 +1969,8 @@ int kernel_execve(const char *kernel_fil
 	}
 
 	retval = count_strings_kernel(argv);
+	if (WARN_ON_ONCE(retval == 0))
+		retval = -EINVAL;
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;




* [PATCH 5.17 0159/1126] exec: Force single empty string when argv is empty
  @ 2022-04-05  7:15  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2022-04-05  7:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Ariadne Conill, Michael Kerrisk,
	Matthew Wilcox, Christian Brauner, Rich Felker, Eric Biederman,
	Alexander Viro, linux-fsdevel, Kees Cook, Andy Lutomirski

From: Kees Cook <keescook@chromium.org>

commit dcd46d897adb70d63e025f175a00a89797d31a43 upstream.

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Ariadne Conill <ariadne@dereferenced.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220201000947.2453721-1-keescook@chromium.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/exec.c |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linu
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1897,6 +1903,9 @@ static int do_execveat_common(int fd, st
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
@@ -1923,6 +1932,19 @@ static int do_execveat_common(int fd, st
 	if (retval < 0)
 		goto out_free;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		retval = copy_string_kernel("", bprm);
+		if (retval < 0)
+			goto out_free;
+		bprm->argc = 1;
+	}
+
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
@@ -1951,6 +1973,8 @@ int kernel_execve(const char *kernel_fil
 	}
 
 	retval = count_strings_kernel(argv);
+	if (WARN_ON_ONCE(retval == 0))
+		retval = -EINVAL;
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;




* [ANNOUNCE] util-linux v2.38
@ 2022-03-28 11:52  1% Karel Zak
  0 siblings, 0 replies; 200+ results
From: Karel Zak @ 2022-03-28 11:52 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, util-linux


The util-linux release v2.38 is available at
  
  http://www.kernel.org/pub/linux/utils/util-linux/v2.38/
 
Feedback and bug reports, as always, are welcomed.

  Karel



Util-linux 2.38 Release Notes
=============================

Release highlights
------------------

This is the first release with translated util-linux man pages. For now, the
translations are not installed by default. Use the configure option
--enable-poman to enable them; the po4a (PO for all) program is required to
generate the translations from the tarball.


mount(8) now supports a new option --mkdir as a shortcut for X-mount.mkdir.

mount(8) (and libmount) now supports a new mount option X-mount.subdir= to
mount a sub-directory of a filesystem instead of the root directory.

lsfd is a NEW COMMAND. lsfd is intended to be a modern replacement for lsof(8)
on Linux systems. Unlike lsof, lsfd is specialized for the Linux kernel; it
supports Linux-specific features like namespaces with simpler code. lsfd is not a
drop-in replacement for lsof; they are different in the command line interface
and output formats. lsfd uses Libsmartcols for output formatting and filtering.
For example: lsfd -Q 'ASSOC == "exe"' prints all running executables.
(Thanks to Masatake YAMATO)

dmesg(1) supports a new option --json to print kernel log in JSON format.

libfdisk has been improved to set correct CHS addresses in MBR.
(Thanks to Pali Rohár)

fstrim(8) now ignores all /etc/fstab entries with the X-fstrim.notrim mount option.

hardlink(1) now supports reflinks (new options --reflinks and --skip-reflinks),
and a new option --method=<memcmp,sha1,crc32,sha256> to specify how files are
compared. File comparison now uses the Linux crypto API in a zero-copy way:
everything is calculated in the kernel, and userspace compares only the hash
checksums (default is sha256).

hwclock(8) supports new command line options --param-get and --param-set to
work with RTC_PARAM_* attributes.

irqtop(1) provides a new option --cpu-stat <enable|disable|auto> to control
per-cpu stats.

libblkid supports zoned disks for btrfs now.

lsblk(8) provides a new option --noempty to ignore all devices with zero size;
the new option --zoned prints information about zones.

mkswap(8) supports a new option --quiet.

nsenter(8) supports a new option --wdns to change the working directory within
the namespace.

rename(1) supports new options --all and --last to replace all occurrences or
the last occurrence of the expression rather than the first one.
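
The three replacement modes can be illustrated on a plain string. This is only
a sketch of the semantics described above, not the rename(1) code; the function
names are invented for the example, and the expression is assumed to be a
non-empty literal substring.

```python
def replace_first(name, old, new):
    """Default behaviour: replace only the first occurrence."""
    return name.replace(old, new, 1)


def replace_all(name, old, new):
    """Semantics of --all: replace every occurrence."""
    return name.replace(old, new)


def replace_last(name, old, new):
    """Semantics of --last: replace only the last occurrence."""
    head, sep, tail = name.rpartition(old)
    return head + new + tail if sep else name


name = "a.b.a.txt"
print(replace_first(name, "a", "X"))  # X.b.a.txt
print(replace_all(name, "a", "X"))    # X.b.X.txt
print(replace_last(name, "a", "X"))   # a.b.X.txt
```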

su(1) now resets the RLIMIT_AS, RLIMIT_{NICE,RTPRIO}, RLIMIT_FSIZE and
RLIMIT_NOFILE resource limits.

unshare(8) supports new options --map-users= and --map-groups= to map blocks of
user and group IDs, and a new option --map-auto to map the first block of user
IDs owned by the effective user from /etc/subuid.

wdctl(8) supports new options --setpregovernor to set the pre-timeout governor
name, and --setpretimeout to set the watchdog pre-timeout in seconds.


Changes between v2.37 and v2.38
-------------------------------

Man pages:
   - Fix end extend formatting  [Mario Blättermann]
agetty:
   - (adoc) double hyphen replaced by dash in man pages  [Karel Zak]
   - do not use atol()  [Karel Zak]
   - resolve tty name even if stdin is specified  [tamz]
   - use CTRL+C to erase username  [Karel Zak]
   - use getttynam() if available  [Ludwig Nussel]
asciidoc:
   - fix quoted message in fsck.minix  [Rafael Fontenelle]
   - unconstrained formatting pair in fdisk  [Rafael Fontenelle]
bash-completion:
   - add --json to dmesg  [Karel Zak]
   - fix irqtop  [Karel Zak]
blkid:
   - check device type and name before probe  [Karel Zak]
   - don't print all devices if only garbage specified  [Karel Zak]
blkzone:
   - Do not print zone capacity if not supported  [Andreas Hindborg]
blockdev:
   - allow for larger values for start sector  [Thomas Abraham]
   - improve arguments parsing (remove atoi)  [Karel Zak]
   - remove accidental non-breaking spaces  [Chris Hofstaedtler]
   - use snprintf() rather than sprintf()  [Karel Zak]
build-sys:
   - (hardlink) check for llistxattr and lgetxattr  [Karel Zak]
   - (meson) fix hardlink  [Karel Zak]
   - (po-man) force .pot file update on 'make dist'  [Karel Zak]
   - Update configure.ac  [Alex Xu]
   - add USE_SYSTEMD  [Karel Zak]
   - add configure option to disable lsfd  [Anatoly Pugachev]
   - add cryptsetup config-gen  template  [Karel Zak]
   - add generated man-pages to distribution tarball  [Karel Zak]
   - add missing files from tools/ directory  [Karel Zak]
   - add missing header  [Karel Zak]
   - add script to compare config.h from meson and autotools  [Karel Zak]
   - be verbose about missing gettext  [Karel Zak]
   - cleanup lsfd related stuff  [Karel Zak]
   - disable IPC tools on Darwin  [Karel Zak]
   - disable libmount when missing mntent.h  [Karel Zak]
   - display cryptsetup status after ./configure  [Luca Boccassi]
   - distribute Meson files  [Karel Zak]
   - fir distcheck for fileeq.h  [Karel Zak]
   - fix test_procfs SOURCES  [Karel Zak]
   - fix {release-version} man pages  [Karel Zak]
   - generate all man pages for distribution tarball  [Karel Zak]
   - improve setns, unshare and prlimit checks  [Karel Zak]
   - include xlocale.h for locale_t on MacOS  [Karel Zak]
   - install hardlink bash-completion  [Karel Zak]
   - install lastb bash-completion  [Karel Zak]
   - link lib_common to test_procfs  [Masatake YAMATO]
   - make autogen.sh output more user friendly  [Karel Zak]
   - make libtool patching more robust  [Karel Zak]
   - make re-use of generated man-pages more robust  [Karel Zak]
   - patch libtool.m4 for darwin  [Karel Zak]
   - release++ (v2.38-rc1)  [Karel Zak]
   - release++ (v2.38-rc2)  [Karel Zak]
   - release++ (v2.38-rc3)  [Karel Zak]
   - release++ (v2.38-rc4)  [Karel Zak]
   - remove bashism  [Karel Zak]
   - remove lib/procutils.c  [Karel Zak]
   - report C++ compiler too  [Karel Zak]
   - use $LIBS rather than LDFLAGS  [Karel Zak]
   - use set +e before patch --try in ./autogen.sh  [Karel Zak]
cfdisk:
   - do not use atoi()  [Karel Zak]
   - don't use NULL in printf [coverity scan]  [Karel Zak]
   - optimize mountpoint detection for PARTUUID  [Karel Zak]
chfn:
   - flush stdout before reading stdin and fix uninitialized variable  [Lorenzo Beretta]
chrt:
   - use lib/procfs.c  [Karel Zak]
chsh:
   - fflush stdout before reading from stdin  [Lorenzo Beretta]
chsh, chfn:
   - remove readline support [CVE-2022-0563]  [Karel Zak]
ci:
   - add a GHAction sending data to Coverity  [Evgeny Vereshchagin]
   - build coverage reports on Coveralls  [Evgeny Vereshchagin]
   - no longer refer to Travis CI  [Evgeny Vereshchagin]
cifuzz:
   - switch to the util-linux organization  [Evgeny Vereshchagin]
colors.adoc:
   - format command name bold  [Mario Blättermann]
column:
   - (man) add note about default width in non-interactive mode  [Karel Zak]
   - segmentation fault on invalid unicode input passed to -s option  [Karel Zak]
   - use new libsmartcols functions  [Karel Zak]
dmesg:
   - Start colouring subsys delimiter only after trailing blank  [Chris Down]
   - add --json output format  [Karel Zak]
   - fix indentation in man page  [Platon Pronko]
   - fix possible memory leak [coverity scan]  [Karel Zak]
   - remove  condition [lgtm scan]  [Karel Zak]
   - translate ctime strings  [Karel Zak]
docs:
   - Uniformize references to section titles  [Rafael Fontenelle]
   - add hint about TP  [Karel Zak]
   - add hint for non-public reports  [Karel Zak]
   - add link to GitHub TODO items  [Karel Zak]
   - add links to adjtime_config manpage  [Karel Zak]
   - add man-common/in-bytes.adoc  [Karel Zak]
   - add note about GitHub PR  [Karel Zak]
   - add uclampset to AUTHORS file  [Karel Zak]
   - document --param-get, --param-set  [Bastian Krause]
   - fix info about LIBSMARTCOLS_DEBUG_PADDING  [Karel Zak]
   - fix typo in v2.37-ReleaseNotes  [Karel Zak]
   - update AUTHORS file  [Karel Zak]
   - update IRC address  [Karel Zak]
   - update TODO  [Karel Zak]
   - update TODO, add "column --output-width unlimited"  [Karel Zak]
   - update copyright years  [Karel Zak]
   - update github URL  [Karel Zak]
   - update v2.38-ReleaseNotes  [Karel Zak]
eject:
   - add __format__ attribute  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - fix typo in docs  [Karel Zak]
eject.1.adoc:
   - Fix markup  [Mario Blättermann]
fallocate:
   - add verbose messages  [Karel Zak]
fdisk:
   - Add support for fixing MBR partitions CHS values  [Pali Rohár]
   - do not print error message when partition reordering is not needed  [Pali Rohár]
   - move reorder diag messages to fdisk_reorder_partitions()  [Pali Rohár]
   - open device in nonblock mode  [changlianzhi]
   - when use fdisk -l, open device in nonblock mode  [lishengyu]
findmnt:
   - (adoc) Added section stating exit code semantics  [Mister Me]
   - (verify) add hint about systemctl daemon-reload  [Karel Zak]
   - (verify) fix cache related memory leaks on --nocanonicalize [coverity scan]  [Karel Zak]
   - (verify) fix memory leak [asan]  [Karel Zak]
   - (verify) ignore passno for btrfs  [Karel Zak]
   - (verify) support fstype patterns  [Karel Zak]
   - add -y,--shell  [Karel Zak]
   - add SOURCES column to print all devices with the same tag  [Karel Zak]
   - add __format__ attribute  [Karel Zak]
   - add reason to "cannot detect on-disk filesystem type" warning  [Karel Zak]
   - add support to print deleted targets  [Karel Zak]
   - add to the man page note about SOURCES  [Karel Zak]
   - allow SOURCES field even without '--fstab'  [Goffredo Baroncelli]
   - commit missing flag  [Karel Zak]
   - filter entries before add to the tree  [Karel Zak]
   - fix compiler warning [-Werror=sign-compare]  [Karel Zak]
   - make sure all entries are in tree output  [Karel Zak]
   - properly exclude poll columns from --output-all  [Thomas Weißschuh]
fixup! lsns:
   - interpolate missing namespaces for converting forests to a tree  [Masatake YAMATO]
flock:
   - (adoc) fix example  [Karel Zak]
   - Decribe limitations of flock  deadlock, NFS, CIFS  [Stanislav Brabec]
fsck:
   - check errno after strto..()  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - use mnt_fs_is_regularfs()  [Karel Zak]
fsck.cramfs:
   - use open+fstat rather than stat+open  [Karel Zak]
fstrim:
   - (man) add missing note  [Karel Zak]
   - Add fstab option X-fstrim.notrim  [Stanislav Brabec]
   - clean return code on --quiet-unsupported  [Karel Zak]
   - don't trigger autofs  [Karel Zak]
   - fix typo  [Karel Zak]
getopt.1.adoc:
   - render synopsis rules on separate lines  [Johannes Altmanninger]
github:
   - add linux-modules-extra package to CI tests  [Karel Zak]
   - add meson build target  [Karel Zak]
hardlink:
   - Calling posix_fadvise without checking return value [coverity scan]  [Karel Zak]
   - add --cache-size  [Karel Zak]
   - add a missing word to an error message  [Benno Schulenberg]
   - add new option  -S/--maximum-size  [Daniele Pizzolli]
   - add reflinks support (add --reflinks and --skip-reflinks)  [Karel Zak]
   - add verbose messages when skip file  [Karel Zak]
   - call size_to_human_string() only when necessary  [Karel Zak]
   - fix compiler warning [-Wformat=]  [Karel Zak]
   - grammaticalize the main description in the man page  [Benno Schulenberg]
   - ignore files specified more than once  [Karel Zak]
   - improve verbose messages  [Karel Zak]
   - make it possible to compare paths  [Karel Zak]
   - make reflink detection more robust [coverity scan]  [Karel Zak]
   - remove pcre2posix.h support  [Karel Zak]
   - rename --buffer-size to --io-size  [Karel Zak]
   - rewrite files content comparison  [Karel Zak]
   - set all locale elements, so that messages will get translated  [Benno Schulenberg]
   - simplify file_link()  [Karel Zak]
   - small regex stuff refactoring  [Karel Zak]
   - use more passive wording in hardlink.1  [Eduard Bloch]
   - use open(O_CREAT) with mode  [Karel Zak]
hexdump:
   - call getline() in more robust way  [Karel Zak]
   - correctly display signed single byte integers  [Samir Benmendil]
   - do not use atoi()  [Karel Zak]
hwclock:
   - add --param-get option  [Bastian Krause]
   - add --param-set option  [Bastian Krause]
   - check errno after strto..()  [Karel Zak]
   - cleanup hwclock_params[] use  [Karel Zak]
   - close adjtime on write error [coverity scan]  [Karel Zak]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
   - fix --param-get  [Karel Zak]
   - fix ul_path_scanf() use  [Karel Zak]
   - get/set param cleanup  [Karel Zak]
   - increase indent in help text  [Bastian Krause]
include:
   - Rename HiFive partition UUIDs  [Alexandre Ghiti]
include/c:
   - Add abs_diff macro  [Sean Anderson]
   - add __format__ attribute  [Karel Zak]
   - add cmp_timespec() and cmp_stat_mtime()  [Karel Zak]
   - add drop_permissions(), consolidate UID/GID reset  [Karel Zak]
include/carefulputc:
   - remove unused function  [Karel Zak]
include/fileeq:
   - add functions to compare files content  [Karel Zak]
include/path:
   - add __format__attribute  [Karel Zak]
include/strutils:
   - cleanup strto..() functions  [Karel Zak]
   - consolidate string to number conversion  [Karel Zak]
   - fix __format__attribute  [Karel Zak]
   - mark some arguments as non-null  [Karel Zak]
include/strv:
   - fix format attributes  [Karel Zak]
ipcmk:
   - fix strtoul use, remove deadcode [coverity scan]  [Karel Zak]
ipcs:
   - check errno after strto..()  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
irqtop:
   - add -c/--cpu-stat option  [zhenwei pi]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
   - fix options parsing  [Karel Zak]
   - small coding style change  [Karel Zak]
isfdisk:
   - improve --backup documentation  [Karel Zak]
kill:
   - check errno after strto..()  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
kill.1.adoc:
   - clarify syntax of -SIG argument in synopsis  [Johannes Altmanninger]
last:
   - add note about empty files/entries to the man page  [Karel Zak]
   - don't assume zero terminate strings  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
ldattach:
   - add __format__ attribute  [Karel Zak]
ldattach.8.adoc:
   - Add missing standard options  [Mario Blättermann]
lib:
   - use snprintf() rather than sprintf()  [Karel Zak]
lib/buffer:
   - add possibility to save position in the buffer  [Karel Zak]
   - add support for "safe" encoding  [Karel Zak]
   - fix buffer reset  [Karel Zak]
   - fix possible SEGV  [Karel Zak]
   - make sure buffer without data is zero terminated [asan]  [Karel Zak]
   - retun size of the buffer and data  [Karel Zak]
lib/caputils:
   - use lib/procfs.c  [Karel Zak]
lib/env:
   - don't ignore failed malloc  [Karel Zak]
lib/fileeq:
   - fix for small memsiz  [Karel Zak]
lib/jsonwrt:
   - check if JSON handler is initialized  [Karel Zak]
lib/loopdev:
   - perform retry on EAGAIN  [Karel Zak]
lib/path:
   - (test) fix ul_new_path() use  [Karel Zak]
   - add ul_path_next_dirent()  [Karel Zak]
   - fix possible leak when use ul_path_read_string() [coverity scan]  [Karel Zak]
   - fstat dir itself  [Karel Zak]
   - improve ul_path_readlink() to be more robust  [Karel Zak]
   - initialize variables for scanf [coverity scan]  [Karel Zak]
   - make path use more robust [coverity scan]  [Karel Zak]
   - make ul_path_read_buffer() more robust [coverity scan]  [Karel Zak]
   - use flags for fstatat()  [Karel Zak]
lib/procfs:
   - add functions to read /proc/#/ stuff  [Karel Zak]
lib/pwdutils:
   - don't use getlogin(3).  [Érico Nogueira]
   - use assert to check correct usage.  [Érico Nogueira]
lib/strutils:
   - add strappend()  [Karel Zak]
   - improve normalize_whitespace()  [Karel Zak]
   - make sure mem2strcpy() buffer is zeroized  [Karel Zak]
   - make test_strutils_normalize() more robust  [Karel Zak]
   - rename strappend() to strconcat()  [Karel Zak]
lib/sys:
   - add sysfs_chrdev_devno_to_devname()  [Karel Zak]
libblkid:
   - (btrfs) add debug messages to zoned support  [Karel Zak]
   - Add hyphens to UUID string representation in Stratis superblock parsing  [John Baublitz]
   - Optimize the blkid_safe_string() function  [Karel Zak, changlianzhi]
   - add magic and probing for zoned btrfs  [Naohiro Aota]
   - check UBI char device name  [Karel Zak]
   - check blkid_get_cache() return value [coverity scan]  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - check for ioctl macro rather than for header file  [Karel Zak]
   - don't mark cache as "probed" if /sys not available  [Karel Zak]
   - fix and cleanup blkid_safe_string()  [Karel Zak]
   - ignore scanf() result when read number of stripes [coverity scan]  [Karel Zak]
   - implement zone-aware probing  [Naohiro Aota]
   - make blkid_free_probe() more robust  [Karel Zak]
   - optimize ioctl calls in blkid_probe_set_device()  [Karel Zak]
   - remove EVMS support  [Karel Zak]
   - remove unnecessary ifdef  [Karel Zak]
   - reopen floppy without O_NONBLOCK  [Karel Zak]
   - reset errno after failed floppy test  [Karel Zak]
   - support zone reset for wipefs  [Naohiro Aota]
   - use snprintf() rather than sprintf()  [Karel Zak]
   - vfat  Fix reading FAT16 boot label and serial id  [Pali Rohár]
   - vfat  Fix reading FAT32 boot label  [Pali Rohár]
libblkid/src/probe:
   - check for ENOMEDIUM from ioctl(CDROM_LAST_WRITTEN)  [Jeremi Piotrowski]
libbuid:
   - use _UL_LIBUUID_UUID_H to cover uuid.h  [Karel Zak]
libfdisk:
   - (MBR) recognize EBBR protective partitions  [Vincent Stehlé]
   - (dos) Add check both begin and end CHS partition parameters  [Pali Rohár]
   - (dos) Add function dos_partition_sync_chs() for updating CHS values  [Pali Rohár]
   - (dos) Add function fdisk_dos_fix_chs() for fixing CHS values for all partitions  [Pali Rohár]
   - (dos) Fix check error message when CHS calculated sector does not match LBA  [Pali Rohár]
   - (dos) Fix determining number of heads and sectors per track from MBR  [Pali Rohár]
   - (dos) Fix printing number of CHS sectors in check error message  [Pali Rohár]
   - (dos) Fix setting CHS values when creating new partition  [Pali Rohár]
   - (dos) Fix upper bound cylinder check in check()  [Pali Rohár]
   - (dos) Fix upper bound cylinder check in check_consistency()  [Pali Rohár]
   - (dos) Put number of CHS check_consistency errors into summart message  [Pali Rohár]
   - (dos) Recalculate number of cylinders after changing number of heads and sectors  [Pali Rohár]
   - (dos) Use helper macros cylinder() and sector() in check_consistency()  [Pali Rohár]
   - (dos) don't ignore MBR+FAT use-case  [Karel Zak]
   - (dos) index partition from zero for DBG()  [Karel Zak]
   - (dos) support partition and MBR overlap  [Karel Zak]
   - (gpt) align size of partition by default  [Karel Zak]
   - (gpt) cleanup verity GUID names  [Karel Zak]
   - (gpt) make fdisk -x output more readable  [Karel Zak]
   - (gpt) provide last LBA where is partitions array  [Karel Zak]
   - (script) rewrite start= and size= parsing  [Karel Zak]
   - add and fix __format__ attributes  [Karel Zak]
   - add new Linux GPT partition types  [WANG Xuerui]
   - add new root and /usr part types  [Georgy Yakovlev]
   - add new verity root and /usr part types  [Georgy Yakovlev]
   - check calloc() return [gcc-analyzer]  [Karel Zak]
   - dereference of possibly-NULL [gcc-analyzer]  [Karel Zak]
   - don't use too small free segments by default  [Karel Zak]
   - enlarge partition by move start down  [Karel Zak]
   - incorrect GUID for NetBSD  [Siu Ching Pong -Asuka Kenji-]
   - make self_partition() use more robust [gcc-analyzer]  [Karel Zak]
libmount:
   - (--all) continue although /proc is not mounted  [Karel Zak]
   - add X-mount.subdir=  [Karel Zak]
   - add __format__ attribute  [Karel Zak]
   - add glusterfs between network filesystems  [Karel Zak]
   - add mnt_fs_is_deleted()  [Karel Zak]
   - add mnt_fs_is_regularfs() to public API  [Karel Zak]
   - allow X-* options more than once  [Karel Zak]
   - assert() is enough [lgtm scan]  [Karel Zak]
   - change propagation of /run for X-mount.subdir  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - disable mtab only on statfs() success only  [Karel Zak]
   - don't use setgroups at all()  [Karel Zak]
   - fix UID check for FUSE umount [CVE-2021-3995]  [Karel Zak]
   - fix mnt_fs_is_* return codes  [Karel Zak]
   - fix possible memory leak in mnt_optstr_fix_secontext() [coverity scan]  [Karel Zak]
   - fix setgroups() use  [Karel Zak]
   - make mnt_table_get_fs_root() more robust [gcc-analyzer]  [Karel Zak]
   - remove support for deleted mount table entries  [Karel Zak]
   - remove support for obsolete /dev/.mount/utab  [Karel Zak]
   - show options string on parse error  [Karel Zak]
   - support quotes in X-mount options  [Karel Zak]
   - use /run/mount/tmptgt rather than /tmp/mount/mount.<pid>  [Karel Zak]
libsmartcols:
   - add multi-line cells to samples  [Karel Zak]
   - add scols_line_get_column_data()  [Karel Zak]
   - add support for optional boolean values  [Thomas Weißschuh]
   - change "export" behavior, add "shellvar" flag  [Karel Zak]
   - fix bare array on JSON output  [Karel Zak]
   - fix lines groups for multi-line cells  [Karel Zak]
   - use lib/buffer, remove local implementation  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
libuuid:
   - check errno after strto..()  [Karel Zak]
   - extend cache in uuid_generate_time_generic()  [Michael Trapp]
   - fix buffer overrun in uuid_parse_range()  [Zane van Iperen]
   - include c.h to cover restrict keyword  [Karel Zak]
logger:
   - add __format__ attribute  [Karel Zak]
   - dealloc login name  [Karel Zak]
   - fix --prio-prefix doesn't use --priority default  [Karel Zak]
   - fix --size use for stdin  [Karel Zak]
   - realloc buffer when header size changed  [Karel Zak]
   - use xgetlogin from pwdutils.  [Érico Nogueira]
login:
   - (adoc) add hint about PAM and env.variables  [Karel Zak]
   - Restore tty size after calling vhangup()  [Daan De Meyer]
   - add callback for close_range()  [Karel Zak]
   - fix close_range() use  [Karel Zak]
   - improve coding style  [Karel Zak]
   - remove obsolete and confusing comment  [Karel Zak]
logindefs:
   - use snprintf() rather than sprintf()  [Karel Zak]
loopdev:
   - Do not treat errors when detecting overlap as fatal  [Jan Kara]
   - Properly translate errors from ul_path_read_*()  [Jan Kara]
   - accept ENOSYS for LOOP_CONFIGURE  [Alex Xu]
   - add retries on EAGAIN  [Karel Zak]
losetup:
   - Add missing pipe to man example for setting up loop device  [Vojtech Trefny]
   - directly set dio instead of afterwards  [Alex Xu (Hello71)]
   - don't skip adding a new device if it already has a device node  [Christoph Hellwig]
   - fix --direct-io  [Karel Zak]
   - fix memory leak [asan]  [Karel Zak]
   - use LOOP_CONFIGURE in a more robust way  [Karel Zak]
lsblk:
   - (adoc) improve --all description  [Karel Zak]
   - add --noempty  [Karel Zak]
   - add -y/--shell  [Karel Zak]
   - add column START for partition start offsets  [Karel Zak]
   - add columns of zoned parameters  [Naohiro Aota]
   - add zoned columns to "lsblk -z"  [Naohiro Aota]
   - factor out function to read sysfs param as bytes  [Naohiro Aota]
   - fix formatting in -e option  [ratijas]
   - normalize space in SERIAL and MODEL  [Karel Zak]
   - sort list of columns  [Karel Zak]
   - sort usage() output  [Karel Zak]
   - update --help output for -y  [Karel Zak]
   - use ID_MODEL_ENC is possible  [Karel Zak]
lscpu:
   - (arm) remove extra whitespace  [Karel Zak]
   - Add Phytium FT-2000+ & S2500 support  [panchenbo]
   - Add Phytium aarch64 cpupart  [panchenbo]
   - add SCALMHZ% and "CPU scaling MHz "  [Karel Zak]
   - add additional arm cpu part numbers  [Ali Saidi]
   - add bios_family  [Huang Shijie]
   - add more sanity checks for dmi_decode_cputype()  [Huang Shijie]
   - check errno after strto..()  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - don't use DMI if executed with --sysroot  [Karel Zak]
   - fix NULL dereference  [Karel Zak]
   - fix build on powerpc  [Georgy Yakovlev]
   - fix compilation against librtas  [Karel Zak]
   - fix cppcheck warning [Uninitialized variable]  [Karel Zak]
   - get the processor information by DMI  [Huang Shijie]
   - read MHZ from /sys/.../cpufreq/scaling_cur_freq  [Karel Zak, Thomas Weißschu]
   - remove extra blank lines  [Karel Zak]
   - remove the old code  [Huang Shijie]
   - remove unintended change  [Karel Zak]
   - use MHZ as number to be locale sensitive  [Karel Zak]
   - use json types  [Thomas Weißschuh]
   - use locale-independent strtod() when read from kernel  [Karel Zak]
   - use optional json values  [Thomas Weißschuh]
lsfd:
   - (adoc) add more exapmles  [Masatake YAMATO]
   - (adoc) add proc(5) to SEE ALSO section  [Masatake YAMATO]
   - (adoc) put missing    at the end of options  [Masatake YAMATO]
   - (adoc) remove a redundant word  [Masatake YAMATO]
   - (adoc) reorder the options  [Masatake YAMATO]
   - (adoc) reorder the sections  [Masatake YAMATO]
   - (adoc) update DESCRIPTION  [Masatake YAMATO]
   - (adoc) write about filter expression  [Masatake YAMATO]
   - (adoc) write more about the -o option  [Masatake YAMATO]
   - (filter) accept % char as a part of column name  [Masatake YAMATO]
   - (filter) fix a memory leak  [Masatake YAMATO]
   - (filter) give a name to a constant  [Masatake YAMATO]
   - (filter) implement !~, an operator for regex unmatching  [Masatake YAMATO]
   - (filter) implement =~, an operator for regex matching  [Masatake YAMATO]
   - (filter) make error messages in check_type methods  [Masatake YAMATO]
   - (filter) make some data structures its source file local  [Masatake YAMATO]
   - (filter) whitespace cleanup  [Masatake YAMATO]
   - (helper) accept an integer argument for a parameter  [Masatake YAMATO]
   - (helper) add "dentries" parameter to directory factory  [Masatake YAMATO]
   - (helper) add "dir" parameter to directory factory  [Masatake YAMATO]
   - (helper) add "file" parameter to ro-regular-file factory  [Masatake YAMATO]
   - (helper) add "nonblock" parameter to pipe-no-fork factory  [Masatake YAMATO]
   - (helper) add "offset" parameter to ro-regular-file factory  [Masatake YAMATO]
   - (helper) allow to pass extra parameters  [Masatake YAMATO]
   - (helper) improve the code converting file descriptor numbers  [Masatake YAMATO]
   - (helper) set proper errno before calling err()  [Masatake YAMATO]
   - (helper) update examples in the help message  [Masatake YAMATO]
   - (helper) use more "const" modifiers  [Masatake YAMATO]
   - (test) add a case for displaying COMMAND column  [Masatake YAMATO]
   - (test) add a case for displaying DEV column  [Masatake YAMATO]
   - (test) add a case for displaying a character device  [Masatake YAMATO]
   - (test) add a case for displaying a directory  [Masatake YAMATO]
   - (test) add a case for displaying socket pairs  [Masatake YAMATO]
   - (test) add a case for displaying symlinks  [Masatake YAMATO]
   - (test) add a case for testing FLAGS (wronly,nonblock) column  [Masatake YAMATO]
   - (test) add a case for testing SIZE column  [Masatake YAMATO]
   - (test) add cases for displaying a regular file and pipe  [Masatake YAMATO]
   - (test) test POS column  [Masatake YAMATO]
   - Add initial man page  [Mario Blättermann]
   - Add new man page to po4a.cfg  [Mario Blättermann]
   - Fix typos in lsfd.c  [Mario Blättermann]
   - add --debug-filter option  [Masatake YAMATO]
   - add --dump-counters option  [Masatake YAMATO]
   - add --notruncate  [Karel Zak]
   - add --sysroot, use lib/path.c  [Karel Zak]
   - add CHRDRV column  [Masatake YAMATO]
   - add DEVTYPE column  [Masatake YAMATO]
   - add FLAGS, MNTID, and POS columns  [Masatake YAMATO]
   - add FUID and OWNER columns  [Masatake YAMATO]
   - add KTHREAD column  [Masatake YAMATO]
   - add MAPLEN column  [Masatake YAMATO]
   - add MISCDEV column  [Masatake YAMATO]
   - add MODE column  [Masatake YAMATO]
   - add NLINK and DELETED columns  [Masatake YAMATO]
   - add PARTITION column  [Masatake YAMATO]
   - add PROTONAME column  [Masatake YAMATO]
   - add a function to get the name of filesystem from a given minor number  [Masatake YAMATO]
   - add a helper function for building filter  [Masatake YAMATO]
   - add a helper function for reading bdevs in /prode/devices  [Masatake YAMATO]
   - add a stub for fifo type  [Masatake YAMATO]
   - add code for reading /proc/$pid/maps  [Masatake YAMATO]
   - add columns for DEV and RDEV  [Masatake YAMATO]
   - add columns for SIZE  [Masatake YAMATO]
   - add cwd, exe, and root associations  [Masatake YAMATO]
   - add filter engine  [Masatake YAMATO]
   - add infrstructure code for reading fdinfo files  [Masatake YAMATO]
   - add mem associations  [Masatake YAMATO]
   - add namespace related associations  [Masatake YAMATO]
   - add new man page to Makemodule.am  [Masatake YAMATO]
   - add reference to proc from file  [Karel Zak]
   - add stubs for sockets and files of unknown type  [Masatake YAMATO]
   - add the way to initialize and finalize classes  [Masatake YAMATO]
   - adjust column width for COMMAND  [Masatake YAMATO]
   - allow passing a proc object to the constructors of the file classes  [Masatake YAMATO]
   - change the license of the filtering engine to LGPL  [Masatake YAMATO]
   - check ul_strtou*() return code [coverity scan]  [Karel Zak]
   - cleanup --summary semantic  [Karel Zak]
   - cleanup collect_outofbox_files stuff  [Karel Zak]
   - cleanup fdinfo handling  [Karel Zak]
   - cleanup new file initialization  [Karel Zak]
   - collect threads level information if TID is specified in a filter  [Masatake YAMATO]
   - convert lines introducing local variable to a block with {...}  [Masatake YAMATO]
   - declare JSON type in colinfo entries  [Masatake YAMATO]
   - declare local variables at the beginning of block  [Masatake YAMATO]
   - delete an unnecessary semicolon  [Masatake YAMATO]
   - don't collect and print redundant information about threads  [Masatake YAMATO]
   - don't define a local variable in the middle of a block  [Masatake YAMATO]
   - don't duplicate ASSOC_EXE processing  [Karel Zak]
   - don't use 'long int' for file data  [Karel Zak]
   - don't use threads  [Masatake YAMATO]
   - fill ASSOC field  [Masatake YAMATO]
   - fill DEVICE field  [Masatake YAMATO]
   - fill INODE field  [Masatake YAMATO]
   - fill POS and MODE columns for SHM and MEM associated files  [Masatake YAMATO]
   - fill PROTONAME field of file for mmap'ed socket  [Masatake YAMATO]
   - fill TYPE field  [Masatake YAMATO]
   - fill UID column with the process's uid  [Masatake YAMATO]
   - fill UID field  [Masatake YAMATO]
   - fill USER field  [Masatake YAMATO]
   - fix ASSOC_EXE use, make file->association use more robust  [Karel Zak]
   - fix a typo in DEVTYPE description  [Masatake YAMATO]
   - fix a typo in comment  [Masatake YAMATO]
   - fix copy & past error [coverity scan]  [Karel Zak]
   - fix the way to print length of mmap area  [Masatake YAMATO]
   - fix the way to print stat.st_nlink  [Masatake YAMATO]
   - fix the way to print stat.st_size  [Masatake YAMATO]
   - fix typo, rename function  [Karel Zak]
   - fix use-after-free and resource leak [coverity scan]  [Karel Zak]
   - function rename  [Karel Zak]
   - give column widths  [Masatake YAMATO]
   - implement --summary and --counter options  [Masatake YAMATO]
   - increase the threads to collect information  [Masatake YAMATO]
   - initial commit  [Masatake YAMATO]
   - introduce --source filter option  [Masatake YAMATO]
   - introduce -Q option for generic filtering  [Masatake YAMATO]
   - introduce -p/--pid option, pids filter working in the early stage  [Masatake YAMATO]
   - introduce DEVNAME column and use it as default  [Masatake YAMATO]
   - introduce a data structure for storing common fdinfo data  [Masatake YAMATO]
   - introduce fopenf helper function  [Masatake YAMATO]
   - introduce name_manager  [Masatake YAMATO]
   - introduce new association SHM representing shared file mapping  [Masatake YAMATO]
   - keep main() at the end of the code  [Karel Zak]
   - make sure we do not use uninitialized struct stat [coverity scan]  [Karel Zak]
   - make username_cache lsfd-file privaite  [Masatake YAMATO]
   - move file_class variants after their constructors  [Masatake YAMATO]
   - move list_free() to list.h  [Karel Zak]
   - move the code for reading /proc/devices to lsfd.c  [Masatake YAMATO]
   - optimize maps use  [Karel Zak]
   - optimize symlinks use  [Karel Zak]
   - print the owner of process as USER  [Masatake YAMATO]
   - purge fd layer  [Masatake YAMATO]
   - read /proc/partitions  [Masatake YAMATO]
   - read character driver names from /proc/devices  [Masatake YAMATO]
   - read misc character device names from /proc/misc  [Masatake YAMATO]
   - refactor  [Masatake YAMATO]
   - refactor code calling collect_outofbox_files  [Masatake YAMATO]
   - remove --source option  [Masatake YAMATO]
   - remove collect_file()  [Karel Zak]
   - remove duplicated an O_ flag entry  [Masatake YAMATO]
   - remove prototype decls for removed functions  [Masatake YAMATO]
   - remove redundant "nodev " prefix from DEVNAME column  [Masatake YAMATO]
   - remove struct fdinfo_data  [Karel Zak]
   - remove unused --sysroot  [Karel Zak]
   - remove unused code  [Karel Zak]
   - rename DEVNAME column to SOURCE  [Masatake YAMATO]
   - rename the column DEVICE to MAJ MIN  [Masatake YAMATO]
   - reorder function  [Karel Zak]
   - replace "socket " in NAME of SOCKET with its protoname  [Masatake YAMATO]
   - replace POS with MNT_ID in default column set  [Masatake YAMATO]
   - revert include/path.h use  [Karel Zak]
   - simplify class hierarchy  [Masatake YAMATO]
   - small cleanup to mountinfo/nodev code  [Karel Zak]
   - sort the enumerators about columns  [Masatake YAMATO]
   - specify column more declarative way  [Masatake YAMATO]
   - split new_file(), remove map_file_data  [Karel Zak]
   - support threads with -l option  [Masatake YAMATO]
   - tiny change to default colummns initialization  [Karel Zak]
   - unify nodev lists into global one  [Masatake YAMATO]
   - use 'new_' prefix when we allocate something  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use new libsmartcols functions  [Karel Zak]
   - use new scols_line_get_column_data()  [Karel Zak]
   - use one function to all symlinks  [Karel Zak]
   - use only "/proc/#/maps" file  [Karel Zak]
   - use path_cxt to read process  [Karel Zak]
   - use the list of block devices in /proc/devices for decoding SOURCE column  [Masatake YAMATO]
   - wrap code for debugging with #ifdef DEBUG/#endif  [Masatake YAMATO]
lsfd.1.adoc:
   - Add missing underscores  [Mario Blättermann]
   - Fix markup  [Mario Blättermann]
   - Fix wording and markup  [Mario Blättermann]
   - Fix yet another entry in the filter examples list  [Mario Blättermann]
   - Improve punctuation and add translator comments  [Mario Blättermann]
   - add caution about the CLI stability  [Masatake YAMATO]
   - fix a typo  [Masatake YAMATO]
   - remove redundant parenthesis from the examples  [Masatake YAMATO]
lsfd.1.doc:
   - use delimited literal block notation in the EXAMPLE section  [Masatake YAMATO]
   - write about --summary and --counter options  [Masatake YAMATO]
lsipc:
   - add -y,--shell  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
lslocks:
   - add INODE and MAJ MIN columns  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - check scanf() return code [coverity scan]  [Karel Zak]
   - fix maj min scanf  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
lslogins:
   - add -y,--shell  [Karel Zak]
   - ask for supplementary groups only once [asan]  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - consolidate and optimize utmp files use  [Karel Zak]
   - fix memory leak [asan]  [Karel Zak]
   - remove unwanted debug message  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use sd_journal_get_data() in proper way  [Karel Zak]
lsmem:
   - check errno after strto..()  [Karel Zak]
lsns:
   - fill UID and USER columns for interpolated namespaces  [Masatake YAMATO]
   - fix compilation on old systems without linux/nsfs.h  [Karel Zak]
   - fix copy & paste in man page  [Karel Zak]
   - fix old error message  [Karel Zak]
   - fix passing wrong process lists when showing ownerns and parentns  [Masatake YAMATO]
   - interpolate missing namespaces for converting forests to a tree  [Masatake YAMATO]
   - make --tree default, update man-page  [Karel Zak]
   - make namespace having no process printable  [Masatake YAMATO]
   - optimize mountinfo use  [Karel Zak]
   - print namespace tree based on the relationship (parent/child or owner/owned)  [Masatake YAMATO]
   - reorganize members specifying other namespaces in lsns_namespace  [Masatake YAMATO]
   - unify the code and option for printing process based tree and namespace based trees  [Masatake YAMATO]
   - use lib/procfs.c  [Karel Zak]
lscpu:
   - Print dummy modelname if none present  [Thomas Weißschuh]
man pages:
   - Fix punctuation, wording and markup  [Mario Blättermann]
   - unify output of --help and --version  [Mario Blättermann]
man-pages:
   - consolidate COLORS section  [Karel Zak]
mcookie:
   - fix infinite-loop when use -f  [Hiroaki Sengoku]
meson:
   - add missing header files check  [Karel Zak]
   - do not generate fstrim.service if we do not have systemd  [Martin Roukala (né Peres)]
   - fix bash_completion.get_variable() use  [Karel Zak]
   - fix building libsmartcols  [Alex Xu (Hello71)]
   - fix building logger  [Alex Xu (Hello71)]
   - fix crypt_activate_by_signed_key detection  [Luca Boccassi]
   - fix dlopen support for cryptsetup  [Luca Boccassi]
   - fix typo  [Karel Zak]
   - headers  Install headers  [Thomas Weißschuh]
   - headers  use util-linux version of version defines  [Thomas Weißschuh]
   - install examples to correct directory  [Thomas Weißschuh]
   - install manpages and bash completions  [Thomas Weißschuh]
   - keep bash-completion symlinks in variable  [Karel Zak]
   - make asciidoc optional  [Alex Xu (Hello71)]
   - make raw(7) optional  [Karel Zak]
   - only install pkgconfig if library is built  [Thomas Weißschuh]
misc:
   - consolidate stat() error message  [Karel Zak]
   - improve string to number conversions  [Karel Zak]
   - non-Linux portability fixes  [Samuel Thibault]
   - use everywhere mkstemp_cloexec() as fallback to mkostemp()  [Karel Zak]
mkfs.cramfs:
   - add comment to explain readlink() use  [Karel Zak]
mkswap:
   - (adoc) suggest looking up page size portably  [Jakub Wilk]
   - add --quiet  [Karel Zak]
   - fix holes detection (infinite loop and/or stack-buffer-underflow)  [Karel Zak]
   - support -U {clear,random,time,uuid}  [Karel Zak]
more:
   - Calling open without checking return value [coverity scan]  [Karel Zak]
   - POSIX compliance patch preventing exit on EOF without -e  [Ian Jones]
   - add __format__ attribute  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - fix -e in non-interactive mode  [Karel Zak]
   - fix null-pointer dereference  [Karel Zak]
   - fix setuid/setgid order  [Karel Zak]
   - improve zero size handling  [Tobias Stoeckmann]
   - use snprintf() rather than sprintf()  [Karel Zak]
mount:
   - (adoc) add hint about /proc and /sys to --all description  [Karel Zak]
   - (adoc) ext_N_ → ext__N__ [manpage-l10n]  [Karel Zak]
   - (adoc) fix comma splice  [Jakub Wilk]
   - (adoc) fix missing period [manpage-l10n]  [Karel Zak]
   - (adoc) mount → mount(2),  of → or [manpage-l10n]  [Karel Zak]
   - (man) fix example  [Karel Zak]
   - Allow bind-mounting with "nosymfollow"  [Jakub Wilk]
   - Fix race in loop device reuse code  [Jan Kara]
   - add -m,--mkdir as shortcut for X-mount.mkdir  [Karel Zak]
   - add hint about dmesg(8) to error messages  [Karel Zak]
   - add hint about systemctl daemon-reload  [Karel Zak]
   - fix roothash signature extension in manpage  [Luca Boccassi]
   - man-page; add all overlayfs options  [Tj]
   - mount.8 don't consider additional mounts as experimental  [Karel Zak]
   - mount.8 fix overlayfs nfs_export= indention  [Karel Zak]
   - use mnt_fs_is_regularfs()  [Karel Zak]
mount.8.adoc:
   - Remove context options exclusion  [Thiébaud Weksteen]
   - document SELinux use of nosuid mount flag  [Topi Miettinen]
   - fix misformatting  [Mario Blättermann]
   - note that mandatory locking is fully deprecated in Linux 5.15  [Jeff Layton]
   - use bold font for literal text in synopsis  [Johannes Altmanninger]
mount_fuzz:
   - reject giant files early  [Evgeny Vereshchagin]
namei:
   - simplify code  [Karel Zak]
newgrp:
   - fix memory leak [coverity scan]  [Karel Zak]
newgrp.1.adoc:
   - use bold font for command name in synopsis  [Johannes Altmanninger]
nsenter:
   - Do not try to enter nonexisting namespaces when --all is used  [Yonatan Goldschmidt]
   - add --wdns to change working directory  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
partx:
   - remove memory leak to make scanners happy  [coverity scan]  [Karel Zak]
pg:
   - do not use atoi()  [Karel Zak]
po:
   - add sk.po (from translationproject.org)  [Jose Riha]
   - merge changes  [Karel Zak]
   - update cs.po (from translationproject.org)  [Petr Písař]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - update fr.po (from translationproject.org)  [Frédéric Marchal]
   - update hr.po (from translationproject.org)  [Božidar Putanec]
   - update ko.po (from translationproject.org)  [Seong-ho Cho]
   - update pl.po (from translationproject.org)  [Jakub Bogusz]
   - update pt_BR.po (from translationproject.org)  [Rafael Fontenelle]
   - update sr.po (from translationproject.org)  [Мирослав Николић]
   - update tr.po (from translationproject.org)  [Mesutcan Kurt]
   - update uk.po (from translationproject.org)  [Yuri Chornoivan]
   - update zh_CN.po (from translationproject.org)  [Boyuan Yang]
po-man:
   - add cs.po (from translationproject.org)  [Petr Písař]
   - add es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - add fr.po (from translationproject.org)  [Frédéric Marchal]
   - add new langs to po4a.cfg  [Karel Zak]
   - add pt_BR.po (from translationproject.org)  [Rafael Fontenelle]
   - add sr.po (from translationproject.org)  [Мирослав Николић]
   - add uk.po (from translationproject.org)  [Yuri Chornoivan]
   - merge changes  [Karel Zak]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update fr.po (from translationproject.org)  [Frédéric Marchal]
   - update uk.po (from translationproject.org)  [Yuri Chornoivan]
   - update variables in Makefile.am  [Karel Zak]
prlimit:
   - fix compiler warning [-Wmaybe-uninitialized]  [Karel Zak]
   - improve --help output  [Karel Zak]
   - make syscall use more robust  [Karel Zak]
readprofile:
   - check errno after strto..()  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
rename:
   - add --all and --last parameters  [Todd Lewis]
   - size_t, mutually exclusive parameters  [Todd Lewis]
   - stop after count changes  [Todd Lewis]
   - use readlink() in more robust way  [Karel Zak]
rfkill:
   - Set scols table name to make the json output valid  [Nicolai Dagestad]
   - quit when read end of stdout is closed  [Mickey Rose]
script:
   - (adoc) improve man page readability  [Karel Zak]
   - add COMMAND= to log header  [Karel Zak, Henrik Bach]
   - add __format__ attribute  [Karel Zak]
   - add separator to header, update tests  [Karel Zak]
   - don't use \n when we log COMMAND  [Karel Zak]
   - fix passing args to execlp()  [Jakub Wilk]
script.1.adoc:
   - correct socond as second  [Vicente Jimenez Aguilar]
scriptlive:
   - fix argv[0] for execlp()  [Karel Zak]
setterm:
   - (man) improve docs about optional arguments  [Karel Zak]
sfdisk:
   - fix typo in --move-data when check partition size  [Karel Zak]
   - update docs, add examples to the man page  [Karel Zak]
   - write empty label also when only ignored partition specified  [Karel Zak]
sfdisk man:
   - Escape ((…)) to avoid AsciiDoc interpreting and stripping from manpage  [Paul Sarena]
su:
   - (bash-completion) offer usernames rather than files  [Karel Zak]
   - Verify default SIGCHLD handling.  [Tobias Stoeckmann]
   - reset RLIMIT_AS too  [Karel Zak]
   - reset RLIMIT_{NICE,RTPRIO} to zero  [Karel Zak]
   - reset also RLIMIT_FSIZE and RLIMIT_NOFILE  [Karel Zak]
   - use LOG_PID for syslog  [Sam James]
sulogin:
   - Display all kinds of errno during password input.  [Shigeki Morishima]
   - add missing ifdefs  [Karel Zak]
   - fix compiler warning [-Werror=implicit-fallthrough=]  [Karel Zak]
   - fix whitespace error  [Karel Zak]
   - ignore non-existing console devices  [Werner Fink]
   - use explicit_bzero() for buffer with password  [Karel Zak]
swapon:
   - do not use atoi()  [Karel Zak]
sys-utils/ipcutils:
   - be careful when call calloc() for uint64 nmembs  [Karel Zak]
sysfs:
   - fallback for partitions not including parent name  [Portisch]
taskset:
   - use lib/procfs.c  [Karel Zak]
test/eject:
   - guard asan LD_PRELOAD with use-system-commands check  [Ross Burton]
test_mount_optstr:
   - use xstrdup()  [Karel Zak]
tests:
   - (cramfs) make GID and mode use more robust  [Karel Zak]
   - (hardlink) add info about number of files to test  [Karel Zak]
   - (libmount) add X-* and x-8 options strings tests  [Karel Zak]
   - (logger) check for socat  [Karel Zak]
   - (lsfd) add a case for listing a fd opening a block device  [Masatake YAMATO]
   - (lsfd) add a factory for opening a block device to the helper command  [Masatake YAMATO]
   - (lsfd) add a missing word to the test output  [Masatake YAMATO]
   - (lsfd) call ts_skip_nonroot earlier  [Masatake YAMATO]
   - (lsfd) delete "largefile" flag in the output before the comparison  [Masatake YAMATO]
   - (lsfd) don't check an unused program  [Masatake YAMATO]
   - (lsfd) don't compare inodes  [Masatake YAMATO]
   - (lsfd) don't use findmnt to verify device numbers  [Masatake YAMATO]
   - (lsfd) fix file descriptor leaks reported by coverity  [Masatake YAMATO]
   - (lsfd) give missing test descriptions  [Masatake YAMATO]
   - (lsfd) improve the help messages of test_mkfds helper command  [Masatake YAMATO]
   - (lsfd) make DGRAM socketpair to mitigate the change of protoname  [Masatake YAMATO]
   - (lsfd) normalize protoname before comparing  [Masatake YAMATO]
   - (lsfd) print more information for debugging  [Masatake YAMATO]
   - (lsfd) refine the pattern for comparing the output of the commands  [Masatake YAMATO]
   - Fix test/misc/swaplabel failure due to change in mkswap behaviour.  [Mark Hindley]
   - Skip lsns/ioctl_ns test if unshare fails  [Chris Hofstaedtler]
   - add rv64 lscpu test  [Karel Zak]
   - add tests for dm-verity support in mount  [Vojtěch Eichler]
   - check correct log file for errors in blkdiscard test  [Ross Burton]
   - check for dm-verity support  [Karel Zak]
   - don't hardcode /bin/kill in the kill tests  [Ross Burton]
   - fdisk  Layout with more details  [Pali Rohár]
   - fdisk  Update CHS values in MBR partitions  [Pali Rohár]
   - fix fdisk/bsd on big endian systems (tested on sparc64 and ppc64)  [Anatoly Pugachev]
   - fix lsns test on kernels without USER namespaces  [Anatoly Pugachev]
   - make ./run.sh more robust  [Karel Zak]
   - make eject umount tests more robust  [Karel Zak]
   - make mount/fstab-all more robust  [Karel Zak]
   - make use of subtests  [Vojtěch Eichler]
   - mark ul/ul as a known failure  [Ross Burton]
   - remove readline from build-sys output  [Karel Zak]
   - skip if scsi_debug model file is not accessible  [Karel Zak]
   - split additional tests into subtests  [Vojtěch Eichler]
   - split cal/color test into subtests  [Vojtěch Eichler]
   - split cal/colorw test into subtests  [Vojtěch Eichler]
   - split several tests into subtests  [Vojtěch Eichler]
   - split test into subtest  [Vojtěch Eichler]
   - update build-sys test  [Karel Zak]
   - update hardlink --maximum-size  [Karel Zak]
   - update hardlink output  [Karel Zak]
   - update lscpu output  [Karel Zak]
   - update lscpu outputs  [Karel Zak]
   - update mountinfo files  [Karel Zak]
   - update sfdisk reorder test  [Karel Zak]
   - use sub-tests for dm-verity  [Karel Zak]
   - use subtests  [Vojtěch Eichler]
tests/eject:
   - check for root perms at beginning  [Karel Zak]
tools:
   - add git-tp-sync-man  [Karel Zak]
   - allow to update specific files on kernel.org  [Karel Zak]
   - report and use LDFLAGS in tools/config-gen  [Karel Zak]
tools/git-version-gen:
   - use NEWS as a fallback  [Karel Zak]
uclampset:
   - Fix left over optind++  [Qais Yousef]
   - use lib/procfs.c  [Karel Zak]
unshare:
   - Add option to automatically create user and group maps  [Sean Anderson]
   - Add options to map blocks of user/group IDs  [Sean Anderson]
   - Add some helpers for forking and synchronizing  [Sean Anderson]
   - Add waitchild helper  [Sean Anderson]
   - Document --map-{groups,users,auto}  [Sean Anderson]
   - Fix PDEATHSIG race for --kill-child  [Earl Chew]
   - Fix doc comments  [Sean Anderson]
   - Propagate inherited signal handling to forked child  [Earl Chew]
   - call getline() in more robust way  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - fix memory leak [coverity scan]  [Karel Zak]
   - fix typo in uint_to_id()  [Karel Zak]
unshare.1.adoc:
   - Improve wording re creation of bind mounts  [Michael Kerrisk]
   - Improve wording re namespace creation  [Michael Kerrisk]
utmpdump:
   - do not use atoi()  [Karel Zak]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
uuidd:
   - Whitelist libuuid clock file  [Stanislav Brabec]
   - fix open/lock state issue  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
uuidgen.1.adoc:
   - mention uuidparse in SEE ALSO  [Bruno Heridet]
verity:
   - add support for corruption action flag  [Luca Boccassi]
   - fix verity.roothashsig only working as last parameter  [Luca Boccassi]
   - remove experimental tag from mount manpage  [Luca Boccassi]
vipw:
   - flush stdout before getting answer.  [Érico Nogueira]
   - improve child error handling  [Tobias Stoeckmann]
   - use snprintf() rather than sprintf()  [Karel Zak]
wall:
   - add __format__ attribute  [Karel Zak]
   - use xgetlogin.  [Érico Nogueira]
wdctl:
   - Workaround reported boot-status bits not being present in wd->ident.options  [Hans de Goede]
   - add --setpregovernor  [Karel Zak]
   - add --setpretimeout  [Karel Zak]
   - print the current and available governors  [Karel Zak]
   - set_watchdog() refactoring  [Karel Zak]
   - sysfs open refactoring  [Karel Zak]
   - update man page  [Karel Zak]
whereis:
   - use commands for Bash completions  [Smitty]
wipefs:
   - check errno after strto..()  [Karel Zak]
   - increase delay after re-read ioctl  [Karel Zak]
   - remove dead code  [Karel Zak]
write:
   - use snprintf() rather than sprintf()  [Karel Zak]
zramctl:
   - add zstd compression algorithm option  [Jan Samek]
   - improve usage() output  [Karel Zak]

-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id
  @ 2022-02-25 18:39  5%         ` Mathieu Desnoyers
  0 siblings, 0 replies; 200+ results
From: Mathieu Desnoyers @ 2022-02-25 18:39 UTC (permalink / raw)
  To: Jonathan Corbet, linux-man
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api,
	Christian Brauner, Florian Weimer, David Laight, carlos,
	Peter Oskolkov

----- On Feb 25, 2022, at 1:15 PM, Jonathan Corbet corbet@lwn.net wrote:

> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> writes:
> 
>> Some effective upper bounds for the number of vcpu ids observable in a process:
>>
>> - sysconf(3) _SC_NPROCESSORS_CONF,
>> - the number of threads which exist concurrently in the process,
>> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>>   except in corner-case situations such as cpu hotplug removing all cpus from
>>   the affinity set,
>> - cgroup cpuset "partition" limits,
>>
>> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
>> additional cores from the rest of the system if they are idle, therefore
>> allowing the number of concurrent threads to go beyond the specified limit.
>>
>> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
>> Those are two mechanisms both affecting the scheduler task placement.
>>
>> I would expect the user-space code to use some sensible upper bound as a
>> hint about how many per-vcpu data structure elements to expect (and how many
>> to pre-allocate), but have a "lazy initialization" fall-back in case the
>> vcpu id goes up to the number of configured processors - 1. And I suspect
>> that even the number of configured processors may change with CRIU.
>>
>> If the above explanation makes sense (please let me know if I am wrong
>> or missed something), I suspect I should add it to the commit message.
> 
> That helps, thanks.  I do think that something like this belongs in the
> changelog - or, even better, in the upcoming restartable-sequences
> section in the userspace-api documentation :)

Just to confirm, when you say "userspace-api documentation", do you mean the
man pages?

I did a few attempts at upstreaming a rseq.2 man page in 2020, but I have been
stuck waiting for feedback from Michael Kerrisk since then.

So for the moment I'm maintaining a rseq.2 man page here:

https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2

I'd gladly accept some help to improve the documentation of rseq.

Thanks,

Mathieu

> 
> Thanks,
> 
> jon

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-02 15:50  0%   ` Kees Cook
@ 2022-02-02 17:12  0%     ` Rich Felker
  0 siblings, 0 replies; 200+ results
From: Rich Felker @ 2022-02-02 17:12 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Wed, Feb 02, 2022 at 07:50:42AM -0800, Kees Cook wrote:
> 
> 
> On February 1, 2022 6:53:25 AM PST, Rich Felker <dalias@libc.org> wrote:
> >On Mon, Jan 31, 2022 at 04:09:47PM -0800, Kees Cook wrote:
> >> Quoting[1] Ariadne Conill:
> >> 
> >> "In several other operating systems, it is a hard requirement that the
> >> second argument to execve(2) be the name of a program, thus prohibiting
> >> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> >> but it is not an explicit requirement[2]:
> >> 
> >>     The argument arg0 should point to a filename string that is
> >>     associated with the process being started by one of the exec
> >>     functions.
> >> ....
> >> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
> >> but there was no consensus to support fixing this issue then.
> >> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
> >> of this bug in a shellcode, we can reconsider.
> >> 
> >> This issue is being tracked in the KSPP issue tracker[5]."
> >> 
> >> While the initial code searches[6][7] turned up what appeared to be
> >> mostly corner case tests, trying to just reject argv == NULL
> >> (or an immediately terminated pointer list) quickly started tripping[8]
> >> existing userspace programs.
> >> 
> >> The next best approach is forcing a single empty string into argv and
> >> adjusting argc to match. The number of programs depending on argc == 0
> >> seems a smaller set than those calling execve with a NULL argv.
> >> 
> >> Account for the additional stack space in bprm_stack_limits(). Inject an
> >> empty string when argc == 0 (and set argc = 1). Warn about the case so
> >> userspace has some notice about the change:
> >> 
> >>     process './argc0' launched './argc0' with NULL argv: empty string added
> >> 
> >> Additionally WARN() and reject NULL argv usage for kernel threads.
> >> 
> >> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
> >> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> >> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> >> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> >> [5] https://github.com/KSPP/linux/issues/176
> >> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> >> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> >> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
> >> 
> >> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> >> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> >> Cc: Matthew Wilcox <willy@infradead.org>
> >> Cc: Christian Brauner <brauner@kernel.org>
> >> Cc: Rich Felker <dalias@libc.org>
> >> Cc: Eric Biederman <ebiederm@xmission.com>
> >> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> >> Cc: linux-fsdevel@vger.kernel.org
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Kees Cook <keescook@chromium.org>
> >> ---
> >>  fs/exec.c | 26 +++++++++++++++++++++++++-
> >>  1 file changed, 25 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index 79f2c9483302..bbf3aadf7ce1 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
> >>  	 * the stack. They aren't stored until much later when we can't
> >>  	 * signal to the parent that the child has run out of stack space.
> >>  	 * Instead, calculate it here so it's possible to fail gracefully.
> >> +	 *
> >> +	 * In the case of argc = 0, make sure there is space for adding a
> >> +	 * empty string (which will bump argc to 1), to ensure confused
> >> +	 * userspace programs don't start processing from argv[1], thinking
> >> +	 * argc can never be 0, to keep them from walking envp by accident.
> >> +	 * See do_execveat_common().
> >>  	 */
> >> -	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
> >> +	ptr_size = (min(bprm->argc, 1) + bprm->envc) * sizeof(void *);
> >
> >From #musl:
> >
> ><mixi> kees: shouldn't the min(bprm->argc, 1) be max(...) in your patch?
> 
> Fix has already been sent, yup.
> 
> >I'm pretty sure without fixing that, you're introducing a giant vuln
> >here.
> 
> I wouldn't say "giant", but yes, it weakened a defense in depth for
> avoiding high stack utilization.

I thought it was deciding the amount of memory to allocate/reserve for
the arg slots, but based on the comment it looks like it's just a way
to fail early rather than making the new process image fault later if
they don't fit.

> > I believe this is the second time a patch attempting to fix this
> >non-vuln has proposed adding a new vuln...
> 
> Mistakes happen, and that's why there is review and testing. Thank
> you for being part of the review process! :)

I know, and I'm sorry for being a bit hostile over it, and for jumping
the gun about the severity. I just get frustrated when I see a rush to
make changes over an incidental part of a popularized vuln, with
disproportionate weight on "doing something" and not enough on being
careful.

Rich


* Re: [PATCH] pidfd: fix test failure due to stack overflow on some arches
  2022-01-28  8:56  6% ` Christian Brauner
@ 2022-02-02 15:52  0%   ` Shuah Khan
  0 siblings, 0 replies; 200+ results
From: Shuah Khan @ 2022-02-02 15:52 UTC (permalink / raw)
  To: Christian Brauner, Axel Rasmussen
  Cc: Christian Brauner, Shuah Khan, Zach O'Keefe, linux-kernel,
	linux-kselftest, Shuah Khan

On 1/28/22 1:56 AM, Christian Brauner wrote:
> On Thu, Jan 27, 2022 at 01:29:51PM -0800, Axel Rasmussen wrote:
>> When running the pidfd_fdinfo_test on arm64, it fails for me. After some
>> digging, the reason is that the child exits due to SIGBUS, because it
>> overflows the 1024 byte stack we've reserved for it.
>>
>> To fix the issue, increase the stack size to 8192 bytes (this number is
>> somewhat arbitrary, and was arrived at through experimentation -- I kept
>> doubling until the failure no longer occurred).
>>
>> Also, let's make the issue easier to debug. wait_for_pid() returns an
>> ambiguous value: it may return -1 in all of these cases:
>>
>> 1. waitpid() itself returned -1
>> 2. waitpid() returned success, but we found !WIFEXITED(status).
>> 3. The child process exited, but it did so with a -1 exit code.
>>
>> There's no way for the caller to tell the difference. So, at least log
>> which occurred, so the test runner can debug things.
>>
>> While debugging this, I found that we had !WIFEXITED(), because the
>> child exited due to a signal. This seems like a reasonably common case,
>> so also print out whether or not we have WIFSIGNALED(), and the
>> associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
>> clearly when it occurs.
>>
>> Finally, I'm suspicious of allocating the child's stack on our stack.
>> man clone(2) suggests that the correct way to do this is with mmap(),
>> and in particular by setting MAP_STACK. So, switch to doing it that way
>> instead.
> 
> Heh, yes. :)
> 
> commit 99c3a000279919cc4875c9dfa9c3ebb41ed8773e
> Author: Michael Kerrisk <mtk.manpages@gmail.com>
> Date:   Thu Nov 14 12:19:21 2019 +0100
> 
>      clone.2: Allocate child's stack using mmap(2) rather than malloc(3)
> 
>      Christian Brauner suggested mmap(MAP_STACK), rather than
>      malloc(), as the canonical way of allocating a stack for the
>      child of clone(), and Jann Horn noted some reasons why:
> 
>          Not on Linux, but on OpenBSD, they do use MAP_STACK now
>          AFAIK; this was announced here:
>          <http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
>          Basically they periodically check whether the userspace
>          stack pointer points into a MAP_STACK region, and if not,
>          they kill the process. So even if it's a no-op on Linux, it
>          might make sense to advise people to use the flag to improve
>          portability? I'm not sure if that's something that belongs
>          in Linux manpages.
> 
>          Another reason against malloc() is that when setting up
>          thread stacks in proper, reliable software, you'll probably
>          want to place a guard page (in other words, a 4K PROT_NONE
>          VMA) at the bottom of the stack to reliably catch stack
>          overflows; and you probably don't want to do that with
>          malloc, in particular with non-page-aligned allocations.
> 
>      And the OpenBSD 6.5 manual page says:
> 
>          MAP_STACK
>              Indicate that the mapping is used as a stack. This
>              flag must be used in combination with MAP_ANON and
>              MAP_PRIVATE.
> 
>      And I then noticed that MAP_STACK seems already to be on
>      FreeBSD for a long time:
> 
>          MAP_STACK
>              Map the area as a stack.  MAP_ANON is implied.
>              Offset should be 0, fd must be -1, and prot should
>              include at least PROT_READ and PROT_WRITE.  This
>              option creates a memory region that grows to at
>              most len bytes in size, starting from the stack
>              top and growing down.  The stack top is the start‐
>              ing address returned by the call, plus len bytes.
>              The bottom of the stack at maximum growth is the
>              starting address returned by the call.
> 
>              The entire area is reserved from the point of view
>              of other mmap() calls, even if not faulted in yet.
> 
>      Reported-by: Jann Horn <jannh@google.com>
>      Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
>      Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
> 
> 
>>
>> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
>> ---
> 
> Yeah, stack handling - especially with legacy clone() - is yucky on the
> best of days. Thank you for the fix.
> 
> Acked-by: Christian Brauner <brauner@kernel.org>
> 

Thank you both. Will apply for 5.17-rc4 or so.

thanks,
-- Shuah


* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-01 14:53  0% ` Rich Felker
@ 2022-02-02 15:50  0%   ` Kees Cook
  2022-02-02 17:12  0%     ` Rich Felker
  0 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-02-02 15:50 UTC (permalink / raw)
  To: Rich Felker
  Cc: Andrew Morton, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening



On February 1, 2022 6:53:25 AM PST, Rich Felker <dalias@libc.org> wrote:
>On Mon, Jan 31, 2022 at 04:09:47PM -0800, Kees Cook wrote:
>> Quoting[1] Ariadne Conill:
>> 
>> "In several other operating systems, it is a hard requirement that the
>> second argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[2]:
>> 
>>     The argument arg0 should point to a filename string that is
>>     associated with the process being started by one of the exec
>>     functions.
>> ....
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
>> of this bug in a shellcode, we can reconsider.
>> 
>> This issue is being tracked in the KSPP issue tracker[5]."
>> 
>> While the initial code searches[6][7] turned up what appeared to be
>> mostly corner case tests, trying to just reject argv == NULL
>> (or an immediately terminated pointer list) quickly started tripping[8]
>> existing userspace programs.
>> 
>> The next best approach is forcing a single empty string into argv and
>> adjusting argc to match. The number of programs depending on argc == 0
>> seems a smaller set than those calling execve with a NULL argv.
>> 
>> Account for the additional stack space in bprm_stack_limits(). Inject an
>> empty string when argc == 0 (and set argc = 1). Warn about the case so
>> userspace has some notice about the change:
>> 
>>     process './argc0' launched './argc0' with NULL argv: empty string added
>> 
>> Additionally WARN() and reject NULL argv usage for kernel threads.
>> 
>> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
>> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
>> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
>> [5] https://github.com/KSPP/linux/issues/176
>> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
>> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
>> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
>> 
>> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
>> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Rich Felker <dalias@libc.org>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>> ---
>>  fs/exec.c | 26 +++++++++++++++++++++++++-
>>  1 file changed, 25 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 79f2c9483302..bbf3aadf7ce1 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
>>  	 * the stack. They aren't stored until much later when we can't
>>  	 * signal to the parent that the child has run out of stack space.
>>  	 * Instead, calculate it here so it's possible to fail gracefully.
>> +	 *
>> +	 * In the case of argc = 0, make sure there is space for adding a
>> +	 * empty string (which will bump argc to 1), to ensure confused
>> +	 * userspace programs don't start processing from argv[1], thinking
>> +	 * argc can never be 0, to keep them from walking envp by accident.
>> +	 * See do_execveat_common().
>>  	 */
>> -	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
>> +	ptr_size = (min(bprm->argc, 1) + bprm->envc) * sizeof(void *);
>
>From #musl:
>
><mixi> kees: shouldn't the min(bprm->argc, 1) be max(...) in your patch?

Fix has already been sent, yup.

>I'm pretty sure without fixing that, you're introducing a giant vuln
>here.

I wouldn't say "giant", but yes, it weakened a defense in depth for avoiding high stack utilization.

> I believe this is the second time a patch attempting to fix this
>non-vuln has proposed adding a new vuln...

Mistakes happen, and that's why there is review and testing. Thank you for being part of the review process! :)

-Kees

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* [PATCH] generic/633: pass non-empty argv with execveat()
@ 2022-02-02  9:52  4% Christian Brauner
  0 siblings, 0 replies; 200+ results
From: Christian Brauner @ 2022-02-02  9:52 UTC (permalink / raw)
  To: Eryu Guan, fstests
  Cc: Ariadne Conill, Kees Cook, Rich Felker, Michael Kerrisk,
	Andrew Morton, Matthew Wilcox, David Laight, linux-fsdevel,
	linux-kernel, Christian Brauner, Eryu Guan

So far the kernel has allowed passing an empty argv. Given that there's
now a push to restrict the kernel in that regard, make sure we pass at
least one argument in argv.

Cc: Ariadne Conill <ariadne@dereferenced.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eryu Guan <guaneryu@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: fstests@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
/* v2 */
- Make sure argv array is NULL terminated. I fired the first patch too
  quickly.
- Take the chance and remove the repeated argv open-coding and move it
  directly into the execveat helper and rename it to reflect the fact
  that it's not just a simple syscall wrapper anymore.
---
 src/idmapped-mounts/idmapped-mounts.c | 65 +++++++++------------------
 1 file changed, 22 insertions(+), 43 deletions(-)

diff --git a/src/idmapped-mounts/idmapped-mounts.c b/src/idmapped-mounts/idmapped-mounts.c
index 4cf6c3bb..5bab19a9 100644
--- a/src/idmapped-mounts/idmapped-mounts.c
+++ b/src/idmapped-mounts/idmapped-mounts.c
@@ -695,11 +695,14 @@ static int fd_to_fd(int from, int to)
 	return 0;
 }
 
-static int sys_execveat(int fd, const char *path, char **argv, char **envp,
-			int flags)
+static int do_execveat(int fd, const char *path, char **envp)
 {
 #ifdef __NR_execveat
-	return syscall(__NR_execveat, fd, path, argv, envp, flags);
+	static char *argv_empty[] = {
+		"",
+		NULL,
+	};
+	return syscall(__NR_execveat, fd, path, argv_empty, envp, 0);
 #else
 	errno = ENOSYS;
 	return -1;
@@ -3597,15 +3600,12 @@ static int setid_binaries(void)
 			"EXPECTED_EGID=5000",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		if (!expected_uid_gid(t_dir1_fd, FILE1, 0, 5000, 5000))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(t_dir1_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(t_dir1_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -3725,15 +3725,12 @@ static int setid_binaries_idmapped_mounts(void)
 			"EXPECTED_EGID=15000",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 15000, 15000))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -3864,9 +3861,6 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			"EXPECTED_EGID=5000",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		if (!switch_userns(attr.userns_fd, 0, 0, false))
 			die("failure: switch_userns");
@@ -3874,8 +3868,8 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 5000, 5000))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -3923,9 +3917,6 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			"EXPECTED_EGID=0",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		if (!caps_supported()) {
 			log_debug("skip: capability library not installed");
@@ -3938,8 +3929,8 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 0, 0))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -3991,9 +3982,6 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			NULL,
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		if (!switch_userns(attr.userns_fd, 0, 0, false))
 			die("failure: switch_userns");
@@ -4007,8 +3995,8 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, t_overflowuid, t_overflowgid))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -4149,9 +4137,6 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			"EXPECTED_EGID=5000",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);
 		if (userns_fd < 0)
@@ -4163,8 +4148,8 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 5000, 5000))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -4213,9 +4198,6 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			"EXPECTED_EGID=0",
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);
 		if (userns_fd < 0)
@@ -4232,8 +4214,8 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 0, 0))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}
@@ -4285,9 +4267,6 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			NULL,
 			NULL,
 		};
-		static char *argv[] = {
-			NULL,
-		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);
 		if (userns_fd < 0)
@@ -4305,8 +4284,8 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, t_overflowuid, t_overflowgid))
 			die("failure: expected_uid_gid");
 
-		sys_execveat(open_tree_fd, FILE1, argv, envp, 0);
-		die("failure: sys_execveat");
+		do_execveat(open_tree_fd, FILE1, envp);
+		die("failure: do_execveat");
 
 		exit(EXIT_FAILURE);
 	}

base-commit: d8dee1222ecdfa1cff1386a61248e587eb3b275d
-- 
2.32.0
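
For readers who want to try the pattern outside of fstests, here is a
minimal standalone sketch of the same idea: always hand execveat(2) an
argv of { "", NULL } so that the new program starts with argc == 1. The
helper name execveat_min_argv() is hypothetical and not part of fstests;
it assumes a Linux system whose headers define SYS_execveat.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of fstests): call execveat(2) with a
 * minimal argv of { "", NULL }. On success this does not return; on
 * failure it returns -1 with errno set.
 */
static int execveat_min_argv(int dirfd, const char *path, char **envp)
{
	assert(path != NULL);
#ifdef SYS_execveat
	static char empty[] = "";
	char *argv_min[] = { empty, NULL };

	return (int)syscall(SYS_execveat, dirfd, path, argv_min, envp, 0);
#else
	/* Headers without execveat: report the syscall as unavailable. */
	(void)dirfd;
	(void)envp;
	errno = ENOSYS;
	return -1;
#endif
}
```

A caller would typically fork() first and treat any return from the
helper as a failure, as the die() calls in the patch above do.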


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-01  0:09  5% [PATCH] exec: Force single empty string when argv is empty Kees Cook
                   ` (2 preceding siblings ...)
  2022-02-01 13:22  0% ` Christian Brauner
@ 2022-02-01 14:53  0% ` Rich Felker
  2022-02-02 15:50  0%   ` Kees Cook
  3 siblings, 1 reply; 200+ results
From: Rich Felker @ 2022-02-01 14:53 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Mon, Jan 31, 2022 at 04:09:47PM -0800, Kees Cook wrote:
> Quoting[1] Ariadne Conill:
> 
> "In several other operating systems, it is a hard requirement that the
> second argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[2]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> ....
> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
> of this bug in a shellcode, we can reconsider.
> 
> This issue is being tracked in the KSPP issue tracker[5]."
> 
> While the initial code searches[6][7] turned up what appeared to be
> mostly corner case tests, trying to just reject argv == NULL
> (or an immediately terminated pointer list) quickly started tripping[8]
> existing userspace programs.
> 
> The next best approach is forcing a single empty string into argv and
> adjusting argc to match. The number of programs depending on argc == 0
> seems a smaller set than those calling execve with a NULL argv.
> 
> Account for the additional stack space in bprm_stack_limits(). Inject an
> empty string when argc == 0 (and set argc = 1). Warn about the case so
> userspace has some notice about the change:
> 
>     process './argc0' launched './argc0' with NULL argv: empty string added
> 
> Additionally WARN() and reject NULL argv usage for kernel threads.
> 
> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [5] https://github.com/KSPP/linux/issues/176
> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
> 
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  fs/exec.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..bbf3aadf7ce1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
>  	 * the stack. They aren't stored until much later when we can't
>  	 * signal to the parent that the child has run out of stack space.
>  	 * Instead, calculate it here so it's possible to fail gracefully.
> +	 *
> +	 * In the case of argc = 0, make sure there is space for adding a
> +	 * empty string (which will bump argc to 1), to ensure confused
> +	 * userspace programs don't start processing from argv[1], thinking
> +	 * argc can never be 0, to keep them from walking envp by accident.
> +	 * See do_execveat_common().
>  	 */
> -	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
> +	ptr_size = (min(bprm->argc, 1) + bprm->envc) * sizeof(void *);

From #musl:

<mixi> kees: shouldn't the min(bprm->argc, 1) be max(...) in your patch?

I'm pretty sure without fixing that, you're introducing a giant vuln
here. I believe this is the second time a patch attempting to fix this
non-vuln has proposed adding a new vuln...

Rich
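
To make the min()/max() point concrete, here is a small userspace model
of the pointer-space estimate from the hunk above. It is only a sketch:
the function names are hypothetical, plain int arguments stand in for
bprm->argc and bprm->envc, and nothing here is kernel code.

```c
#include <assert.h>
#include <stddef.h>

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* As posted: min() caps the argv contribution at one pointer. */
static size_t ptr_size_as_posted(int argc, int envc)
{
	assert(argc >= 0 && envc >= 0);
	return ((size_t)min_int(argc, 1) + (size_t)envc) * sizeof(void *);
}

/* As suggested in review: max() reserves at least one pointer. */
static size_t ptr_size_with_max(int argc, int envc)
{
	assert(argc >= 0 && envc >= 0);
	return ((size_t)max_int(argc, 1) + (size_t)envc) * sizeof(void *);
}
```

With argc == 0 and envc == 0 the posted form reserves nothing, while the
max() form reserves room for the one injected pointer. With argc == 1000
the posted form accounts for a single argv pointer instead of 1000, so
the stack estimate is too small for every exec with more than one
argument.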

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-01  0:09  5% [PATCH] exec: Force single empty string when argv is empty Kees Cook
  2022-02-01  1:00  0% ` Ariadne Conill
  2022-02-01  2:00  0% ` Andy Lutomirski
@ 2022-02-01 13:22  0% ` Christian Brauner
  2022-02-01 14:53  0% ` Rich Felker
  3 siblings, 0 replies; 200+ results
From: Christian Brauner @ 2022-02-01 13:22 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Rich Felker, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Mon, Jan 31, 2022 at 04:09:47PM -0800, Kees Cook wrote:
> Quoting[1] Ariadne Conill:
> 
> "In several other operating systems, it is a hard requirement that the
> second argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[2]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
> of this bug in a shellcode, we can reconsider.
> 
> This issue is being tracked in the KSPP issue tracker[5]."
> 
> While the initial code searches[6][7] turned up what appeared to be
> mostly corner case tests, trying to just reject argv == NULL
> (or an immediately terminated pointer list) quickly started tripping[8]
> existing userspace programs.
> 
> The next best approach is forcing a single empty string into argv and
> adjusting argc to match. The number of programs depending on argc == 0
> seems a smaller set than those calling execve with a NULL argv.
> 
> Account for the additional stack space in bprm_stack_limits(). Inject an
> empty string when argc == 0 (and set argc = 1). Warn about the case so
> userspace has some notice about the change:
> 
>     process './argc0' launched './argc0' with NULL argv: empty string added
> 
> Additionally WARN() and reject NULL argv usage for kernel threads.
> 
> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [5] https://github.com/KSPP/linux/issues/176
> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
> 
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---

Looks good,
Acked-by: Christian Brauner <brauner@kernel.org>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-01  0:09  5% [PATCH] exec: Force single empty string when argv is empty Kees Cook
  2022-02-01  1:00  0% ` Ariadne Conill
@ 2022-02-01  2:00  0% ` Andy Lutomirski
  2022-02-01 13:22  0% ` Christian Brauner
  2022-02-01 14:53  0% ` Rich Felker
  3 siblings, 0 replies; 200+ results
From: Andy Lutomirski @ 2022-02-01  2:00 UTC (permalink / raw)
  To: Kees Cook, Andrew Morton
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening, Linux API

On 1/31/22 16:09, Kees Cook wrote:
> Quoting[1] Ariadne Conill:
> 
> "In several other operating systems, it is a hard requirement that the
> second argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[2]:
> 
>      The argument arg0 should point to a filename string that is
>      associated with the process being started by one of the exec
>      functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
> of this bug in a shellcode, we can reconsider.
> 
> This issue is being tracked in the KSPP issue tracker[5]."
> 
> While the initial code searches[6][7] turned up what appeared to be
> mostly corner case tests, trying to just reject argv == NULL
> (or an immediately terminated pointer list) quickly started tripping[8]
> existing userspace programs.
> 
> The next best approach is forcing a single empty string into argv and
> adjusting argc to match. The number of programs depending on argc == 0
> seems a smaller set than those calling execve with a NULL argv.
> 
> Account for the additional stack space in bprm_stack_limits(). Inject an
> empty string when argc == 0 (and set argc = 1). Warn about the case so
> userspace has some notice about the change:
> 
>      process './argc0' launched './argc0' with NULL argv: empty string added
> 
> Additionally WARN() and reject NULL argv usage for kernel threads.
> 
> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [5] https://github.com/KSPP/linux/issues/176
> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Acked-by: Andy Lutomirski <luto@kernel.org>

and cc-ing linux-api.

I agree that this should be done regardless of any security context change.

> 
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>   fs/exec.c | 26 +++++++++++++++++++++++++-
>   1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..bbf3aadf7ce1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
>   	 * the stack. They aren't stored until much later when we can't
>   	 * signal to the parent that the child has run out of stack space.
>   	 * Instead, calculate it here so it's possible to fail gracefully.
> +	 *
> +	 * In the case of argc = 0, make sure there is space for adding a
> +	 * empty string (which will bump argc to 1), to ensure confused
> +	 * userspace programs don't start processing from argv[1], thinking
> +	 * argc can never be 0, to keep them from walking envp by accident.
> +	 * See do_execveat_common().
>   	 */
> -	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
> +	ptr_size = (min(bprm->argc, 1) + bprm->envc) * sizeof(void *);
>   	if (limit <= ptr_size)
>   		return -E2BIG;
>   	limit -= ptr_size;
> @@ -1897,6 +1903,9 @@ static int do_execveat_common(int fd, struct filename *filename,
>   	}
>   
>   	retval = count(argv, MAX_ARG_STRINGS);
> +	if (retval == 0)
> +		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
> +			     current->comm, bprm->filename);
>   	if (retval < 0)
>   		goto out_free;
>   	bprm->argc = retval;
> @@ -1923,6 +1932,19 @@ static int do_execveat_common(int fd, struct filename *filename,
>   	if (retval < 0)
>   		goto out_free;
>   
> +	/*
> +	 * When argv is empty, add an empty string ("") as argv[0] to
> +	 * ensure confused userspace programs that start processing
> +	 * from argv[1] won't end up walking envp. See also
> +	 * bprm_stack_limits().
> +	 */
> +	if (bprm->argc == 0) {
> +		retval = copy_string_kernel("", bprm);
> +		if (retval < 0)
> +			goto out_free;
> +		bprm->argc = 1;
> +	}
> +
>   	retval = bprm_execve(bprm, fd, filename, flags);
>   out_free:
>   	free_bprm(bprm);
> @@ -1951,6 +1973,8 @@ int kernel_execve(const char *kernel_filename,
>   	}
>   
>   	retval = count_strings_kernel(argv);
> +	if (WARN_ON_ONCE(retval == 0))
> +		retval = -EINVAL;
>   	if (retval < 0)
>   		goto out_free;
>   	bprm->argc = retval;


^ permalink raw reply	[relevance 0%]

* Re: [PATCH] exec: Force single empty string when argv is empty
  2022-02-01  0:09  5% [PATCH] exec: Force single empty string when argv is empty Kees Cook
@ 2022-02-01  1:00  0% ` Ariadne Conill
  2022-02-01  2:00  0% ` Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-02-01  1:00 UTC (permalink / raw)
  To: Kees Cook
  Cc: Andrew Morton, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

Hi,

On Mon, 31 Jan 2022, Kees Cook wrote:

> Quoting[1] Ariadne Conill:
>
> "In several other operating systems, it is a hard requirement that the
> second argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[2]:
>
>    The argument arg0 should point to a filename string that is
>    associated with the process being started by one of the exec
>    functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
> of this bug in a shellcode, we can reconsider.
>
> This issue is being tracked in the KSPP issue tracker[5]."
>
> While the initial code searches[6][7] turned up what appeared to be
> mostly corner case tests, trying to just reject argv == NULL
> (or an immediately terminated pointer list) quickly started tripping[8]
> existing userspace programs.

Yes, it's a shame this is the case, but we do what we have to do, I guess :)

>
> The next best approach is forcing a single empty string into argv and
> adjusting argc to match. The number of programs depending on argc == 0
> seems a smaller set than those calling execve with a NULL argv.
>
> Account for the additional stack space in bprm_stack_limits(). Inject an
> empty string when argc == 0 (and set argc = 1). Warn about the case so
> userspace has some notice about the change:
>
>    process './argc0' launched './argc0' with NULL argv: empty string added
>
> Additionally WARN() and reject NULL argv usage for kernel threads.
>
> [1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
> [2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [5] https://github.com/KSPP/linux/issues/176
> [6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> [7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> [8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/
>
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>

In terms of going with this approach as an alternative to my original
patch,

Acked-by: Ariadne Conill <ariadne@dereferenced.org>

Ariadne

^ permalink raw reply	[relevance 0%]

* [PATCH] exec: Force single empty string when argv is empty
@ 2022-02-01  0:09  5% Kees Cook
  2022-02-01  1:00  0% ` Ariadne Conill
                   ` (3 more replies)
  0 siblings, 4 replies; 200+ results
From: Kees Cook @ 2022-02-01  0:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kees Cook, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

Quoting[1] Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[2]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[3],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[4]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[5]."

While the initial code searches[6][7] turned up what appeared to be
mostly corner case tests, trying to just reject argv == NULL
(or an immediately terminated pointer list) quickly started tripping[8]
existing userspace programs.

The next best approach is forcing a single empty string into argv and
adjusting argc to match. The number of programs depending on argc == 0
seems a smaller set than those calling execve with a NULL argv.

Account for the additional stack space in bprm_stack_limits(). Inject an
empty string when argc == 0 (and set argc = 1). Warn about the case so
userspace has some notice about the change:

    process './argc0' launched './argc0' with NULL argv: empty string added

Additionally WARN() and reject NULL argv usage for kernel threads.

[1] https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org/
[2] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[4] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[5] https://github.com/KSPP/linux/issues/176
[6] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[7] https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
[8] https://lore.kernel.org/lkml/20220131144352.GE16385@xsang-OptiPlex-9020/

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/exec.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..bbf3aadf7ce1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -495,8 +495,14 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc = 0, make sure there is space for adding a
+	 * empty string (which will bump argc to 1), to ensure confused
+	 * userspace programs don't start processing from argv[1], thinking
+	 * argc can never be 0, to keep them from walking envp by accident.
+	 * See do_execveat_common().
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (min(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
@@ -1897,6 +1903,9 @@ static int do_execveat_common(int fd, struct filename *filename,
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		pr_warn_once("process '%s' launched '%s' with NULL argv: empty string added\n",
+			     current->comm, bprm->filename);
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
@@ -1923,6 +1932,19 @@ static int do_execveat_common(int fd, struct filename *filename,
 	if (retval < 0)
 		goto out_free;
 
+	/*
+	 * When argv is empty, add an empty string ("") as argv[0] to
+	 * ensure confused userspace programs that start processing
+	 * from argv[1] won't end up walking envp. See also
+	 * bprm_stack_limits().
+	 */
+	if (bprm->argc == 0) {
+		retval = copy_string_kernel("", bprm);
+		if (retval < 0)
+			goto out_free;
+		bprm->argc = 1;
+	}
+
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:
 	free_bprm(bprm);
@@ -1951,6 +1973,8 @@ int kernel_execve(const char *kernel_filename,
 	}
 
 	retval = count_strings_kernel(argv);
+	if (WARN_ON_ONCE(retval == 0))
+		retval = -EINVAL;
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
-- 
2.30.2



* Re: [PATCH] generic/633: adapt execveat() invocations
  2022-01-31 17:10  4% [PATCH] generic/633: adapt execveat() invocations Christian Brauner
@ 2022-01-31 20:46  0% ` Kees Cook
  0 siblings, 0 replies; 200+ results
From: Kees Cook @ 2022-01-31 20:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Eryu Guan, fstests, Ariadne Conill, Rich Felker, Michael Kerrisk,
	Andrew Morton, Matthew Wilcox, linux-fsdevel, linux-kernel,
	Eryu Guan

On Mon, Jan 31, 2022 at 06:10:23PM +0100, Christian Brauner wrote:
> There's a push by Ariadne to enforce that argv[0] cannot be NULL. So far
> we've allowed this. Fix the execveat() invocations to set argv[0] to an
> empty string instead.

To be clear, these tests are also trying to launch set-id binaries with
argc == 0, so narrowing the kernel check to only set-id binaries
wouldn't help here, yes?

-Kees

> 
> Cc: Ariadne Conill <ariadne@dereferenced.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eryu Guan <guaneryu@gmail.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: fstests@vger.kernel.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Link: https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  src/idmapped-mounts/idmapped-mounts.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/src/idmapped-mounts/idmapped-mounts.c b/src/idmapped-mounts/idmapped-mounts.c
> index 4cf6c3bb..76b559ae 100644
> --- a/src/idmapped-mounts/idmapped-mounts.c
> +++ b/src/idmapped-mounts/idmapped-mounts.c
> @@ -3598,7 +3598,7 @@ static int setid_binaries(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		if (!expected_uid_gid(t_dir1_fd, FILE1, 0, 5000, 5000))
> @@ -3726,7 +3726,7 @@ static int setid_binaries_idmapped_mounts(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 15000, 15000))
> @@ -3865,7 +3865,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		if (!switch_userns(attr.userns_fd, 0, 0, false))
> @@ -3924,7 +3924,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		if (!caps_supported()) {
> @@ -3992,7 +3992,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		if (!switch_userns(attr.userns_fd, 0, 0, false))
> @@ -4150,7 +4150,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		userns_fd = get_userns_fd(0, 10000, 10000);
> @@ -4214,7 +4214,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		userns_fd = get_userns_fd(0, 10000, 10000);
> @@ -4286,7 +4286,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
>  			NULL,
>  		};
>  		static char *argv[] = {
> -			NULL,
> +			"",
>  		};
>  
>  		userns_fd = get_userns_fd(0, 10000, 10000);
> 
> base-commit: d8dee1222ecdfa1cff1386a61248e587eb3b275d
> -- 
> 2.32.0
> 

-- 
Kees Cook


* [PATCH] generic/633: adapt execveat() invocations
@ 2022-01-31 17:10  4% Christian Brauner
  2022-01-31 20:46  0% ` Kees Cook
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2022-01-31 17:10 UTC (permalink / raw)
  To: Eryu Guan, fstests
  Cc: Ariadne Conill, Kees Cook, Rich Felker, Michael Kerrisk,
	Andrew Morton, Matthew Wilcox, linux-fsdevel, linux-kernel,
	Christian Brauner, Eryu Guan

There's a push by Ariadne to enforce that argv[0] cannot be NULL. So far
we've allowed this. Fix the execveat() invocations to set argv[0] to an
empty string instead.

Cc: Ariadne Conill <ariadne@dereferenced.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eryu Guan <guaneryu@gmail.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: fstests@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Link: https://lore.kernel.org/lkml/20220127000724.15106-1-ariadne@dereferenced.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 src/idmapped-mounts/idmapped-mounts.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/idmapped-mounts/idmapped-mounts.c b/src/idmapped-mounts/idmapped-mounts.c
index 4cf6c3bb..76b559ae 100644
--- a/src/idmapped-mounts/idmapped-mounts.c
+++ b/src/idmapped-mounts/idmapped-mounts.c
@@ -3598,7 +3598,7 @@ static int setid_binaries(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		if (!expected_uid_gid(t_dir1_fd, FILE1, 0, 5000, 5000))
@@ -3726,7 +3726,7 @@ static int setid_binaries_idmapped_mounts(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		if (!expected_uid_gid(open_tree_fd, FILE1, 0, 15000, 15000))
@@ -3865,7 +3865,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		if (!switch_userns(attr.userns_fd, 0, 0, false))
@@ -3924,7 +3924,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		if (!caps_supported()) {
@@ -3992,7 +3992,7 @@ static int setid_binaries_idmapped_mounts_in_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		if (!switch_userns(attr.userns_fd, 0, 0, false))
@@ -4150,7 +4150,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);
@@ -4214,7 +4214,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);
@@ -4286,7 +4286,7 @@ static int setid_binaries_idmapped_mounts_in_userns_separate_userns(void)
 			NULL,
 		};
 		static char *argv[] = {
-			NULL,
+			"",
 		};
 
 		userns_fd = get_userns_fd(0, 10000, 10000);

base-commit: d8dee1222ecdfa1cff1386a61248e587eb3b275d
-- 
2.32.0



* [ANNOUNCE] util-linux v2.38-rc1
@ 2022-01-31 15:14  1% Karel Zak
  0 siblings, 0 replies; 200+ results
From: Karel Zak @ 2022-01-31 15:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, util-linux


The util-linux release v2.38-rc1 is available at
     
  http://www.kernel.org/pub/linux/utils/util-linux/v2.38/
     
Feedback and bug reports, as always, are welcome.
     
  Karel


Util-linux 2.38 Release Notes
=============================

Release highlights
------------------

mount(8) now supports a new option --mkdir as a shortcut for X-mount.mkdir.

mount(8) (and libmount) now support a new mount option X-mount.subdir= to
mount a sub-directory of a filesystem instead of its root directory.

lsfd is a NEW COMMAND. lsfd is intended to be a modern replacement for lsof(8)
on Linux systems. Unlike lsof, lsfd is specialized for the Linux kernel; it
supports Linux-specific features like namespaces with simpler code. lsfd is not
a drop-in replacement for lsof; the two differ in command-line interface and
output formats. lsfd uses Libsmartcols for output formatting and filtering.
For example: lsfd -Q 'ASSOC == "exe"' prints all running executables.
(Thanks to Masatake YAMATO)

dmesg(1) supports a new option --json to print the kernel log in JSON format.

libfdisk has been improved to set correct CHS addresses in MBR.
(Thanks to Pali Rohár)

fstrim(8) now ignores all /etc/fstab entries with the X-fstrim.notrim mount
option.

hardlink(1) now supports reflinks (new options --reflinks and --skip-reflinks),
and a new option --method=<memcmp,sha1,crc32,sha256> to specify how files are
compared. File comparison now uses the Linux crypto API in a zero-copy way:
everything is calculated in the kernel and userspace compares only hash
checksums (default is sha256).

hwclock(8) supports new command line options --param-get and --param-set to
work with RTC_PARAM_* attributes.

irqtop(1) provides a new option --cpu-stat <enable|disable|auto> to control
per-cpu stats.

libblkid supports zoned disks for btrfs now.

lsblk(8) provides a new option --noempty to ignore all devices with zero size;
the new option --zoned prints information about zones.

mkswap(8) supports a new option --quiet.

nsenter(1) supports a new option --wdns to change the working directory within
a namespace.

rename(1) supports new options --all and --last to replace all or the last
occurrence of an expression rather than the first one.

su(1) now resets the RLIMIT_AS, RLIMIT_{NICE,RTPRIO}, RLIMIT_FSIZE and
RLIMIT_NOFILE resource limits.

unshare(1) supports new options --map-users= and --map-groups= to map blocks of
user and group IDs, and a new option --map-auto to map the first block of user
IDs owned by the effective user from /etc/subuid.

wdctl supports new options --setpregovernor to set pre-timeout governor name,
and --setpretimeout to set watchdog pre-timeout in seconds.


Changes between v2.37 and v2.38
-------------------------------

Man pages:
   - Fix end extend formatting  [Mario Blättermann]
agetty:
   - (adoc) double hyphen replaced by dash in man pages  [Karel Zak]
   - do not use atol()  [Karel Zak]
   - resolve tty name even if stdin is specified  [tamz]
   - use CTRL+C to erase username  [Karel Zak]
   - use getttynam() if available  [Ludwig Nussel]
asciidoc:
   - fix quoted message in fsck.minix  [Rafael Fontenelle]
   - unconstrained formatting pair in fdisk  [Rafael Fontenelle]
bash-completion:
   - add --json to dmesg  [Karel Zak]
   - fix irqtop  [Karel Zak]
blkid:
   - check device type and name before probe  [Karel Zak]
   - don't print all devices if only garbage specified  [Karel Zak]
blockdev:
   - allow for larger values for start sector  [Thomas Abraham]
   - improve arguments parsing (remove atoi)  [Karel Zak]
   - remove accidental non-breaking spaces  [Chris Hofstaedtler]
   - use snprintf() rather than sprintf()  [Karel Zak]
build-sys:
   - (hardlink) check for llistxattr and lgetxattr  [Karel Zak]
   - (meson) fix hardlink  [Karel Zak]
   - Update configure.ac  [Alex Xu]
   - add USE_SYSTEMD  [Karel Zak]
   - add configure option to disable lsfd  [Anatoly Pugachev]
   - add cryptsetup config-gen  template  [Karel Zak]
   - add generated man-pages to distribution tarball  [Karel Zak]
   - add missing header  [Karel Zak]
   - add script to compare config.h from meson and autotools  [Karel Zak]
   - be verbose about missing gettext  [Karel Zak]
   - cleanup lsfd related stuff  [Karel Zak]
   - disable IPC tools on Darwin  [Karel Zak]
   - disable libmount when missing mntent.h  [Karel Zak]
   - display cryptsetup status after ./configure  [Luca Boccassi]
   - fix distcheck for fileeq.h  [Karel Zak]
   - fix test_procfs SOURCES  [Karel Zak]
   - fix {release-version} man pages  [Karel Zak]
   - generate all man pages for distribution tarball  [Karel Zak]
   - improve setns, unshare and prlimit checks  [Karel Zak]
   - include xlocale.h for locale_t on MacOS  [Karel Zak]
   - install hardlink bash-completion  [Karel Zak]
   - install lastb bash-completion  [Karel Zak]
   - link lib_common to test_procfs  [Masatake YAMATO]
   - make autogen.sh output more user friendly  [Karel Zak]
   - make libtool patching more robust  [Karel Zak]
   - make re-use of generated man-pages more robust  [Karel Zak]
   - patch libtool.m4 for darwin  [Karel Zak]
   - remove bashism  [Karel Zak]
   - remove lib/procutils.c  [Karel Zak]
   - use $LIBS rather than LDFLAGS  [Karel Zak]
   - use set +e before patch --try in ./autogen.sh  [Karel Zak]
cfdisk:
   - do not use atoi()  [Karel Zak]
   - optimize mountpoint detection for PARTUUID  [Karel Zak]
chfn:
   - flush stdout before reading stdin and fix uninitialized variable  [Lorenzo Beretta]
chrt:
   - use lib/procfs.c  [Karel Zak]
chsh:
   - fflush stdout before reading from stdin  [Lorenzo Beretta]
ci:
   - add a GHAction sending data to Coverity  [Evgeny Vereshchagin]
   - build coverage reports on Coveralls  [Evgeny Vereshchagin]
   - no longer refer to Travis CI  [Evgeny Vereshchagin]
cifuzz:
   - switch to the util-linux organization  [Evgeny Vereshchagin]
column:
   - segmentation fault on invalid unicode input passed to -s option  [Karel Zak]
dmesg:
   - add --json output format  [Karel Zak]
   - fix indentation in man page  [Platon Pronko]
   - fix possible memory leak [coverity scan]  [Karel Zak]
   - remove  condition [lgtm scan]  [Karel Zak]
   - translate ctime strings  [Karel Zak]
docs:
   - Uniformize references to section titles  [Rafael Fontenelle]
   - add hint about TP  [Karel Zak]
   - add hint for non-public reports  [Karel Zak]
   - add link to GitHub TODO items  [Karel Zak]
   - add links to adjtime_config manpage  [Karel Zak]
   - add man-common/in-bytes.adoc  [Karel Zak]
   - add note about GitHub PR  [Karel Zak]
   - add uclampset to AUTHORS file  [Karel Zak]
   - document --param-get, --param-set  [Bastian Krause]
   - fix info about LIBSMARTCOLS_DEBUG_PADDING  [Karel Zak]
   - fix typo in v2.37-ReleaseNotes  [Karel Zak]
   - update AUTHORS file  [Karel Zak]
   - update IRC address  [Karel Zak]
   - update TODO  [Karel Zak]
   - update github URL  [Karel Zak]
eject:
   - add __format__ attribute  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - fix typo in docs  [Karel Zak]
fallocate:
   - add verbose messages  [Karel Zak]
fdisk:
   - Add support for fixing MBR partitions CHS values  [Pali Rohár]
   - do not print error message when partition reordering is not needed  [Pali Rohár]
   - move reorder diag messages to fdisk_reorder_partitions()  [Pali Rohár]
   - open device in nonblock mode  [changlianzhi]
   - when use fdisk -l, open device in nonblock mode  [lishengyu]
findmnt:
   - (adoc) Added section stating exit code semantics  [Mister Me]
   - (verify) add hint about systemctl daemon-reload  [Karel Zak]
   - (verify) fix cache related memory leaks on --nocanonicalize [coverity scan]  [Karel Zak]
   - (verify) fix memory leak [asan]  [Karel Zak]
   - (verify) ignore passno for btrfs  [Karel Zak]
   - (verify) support fstype patterns  [Karel Zak]
   - add SOURCES column to print all devices with the same tag  [Karel Zak]
   - add __format__ attribute  [Karel Zak]
   - add reason to "cannot detect on-disk filesystem type" warning  [Karel Zak]
   - add support to print deleted targets  [Karel Zak]
   - add to the man page note about SOURCES  [Karel Zak]
   - allow SOURCES field even without '--fstab'  [Goffredo Baroncelli]
   - filter entries before add to the tree  [Karel Zak]
   - fix compiler warning [-Werror=sign-compare]  [Karel Zak]
   - make sure all entries are in tree output  [Karel Zak]
   - properly exclude poll columns from --output-all  [Thomas Weißschuh]
fixup! lsns:
   - interpolate missing namespaces for converting forests to a tree  [Masatake YAMATO]
flock:
   - (adoc) fix example  [Karel Zak]
fsck:
   - check errno after strto..()  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - use mnt_fs_is_regularfs()  [Karel Zak]
fsck.cramfs:
   - use open+fstat rather than stat+open  [Karel Zak]
fstrim:
   - Add fstab option X-fstrim.notrim  [Stanislav Brabec]
   - clean return code on --quiet-unsupported  [Karel Zak]
   - don't trigger autofs  [Karel Zak]
   - fix typo  [Karel Zak]
github:
   - add linux-modules-extra package to CI tests  [Karel Zak]
   - add meson build target  [Karel Zak]
hardlink:
   - Calling posix_fadvise without checking return value [coverity scan]  [Karel Zak]
   - add --cache-size  [Karel Zak]
   - add new option  -S/--maximum-size  [Daniele Pizzolli]
   - add reflinks support (add --reflinks and --skip-reflinks)  [Karel Zak]
   - add verbose messages when skip file  [Karel Zak]
   - call size_to_human_string() only when necessary  [Karel Zak]
   - fix compiler warning [-Wformat=]  [Karel Zak]
   - improve verbose messages  [Karel Zak]
   - make reflink detection more robust [coverity scan]  [Karel Zak]
   - remove pcre2posix.h support  [Karel Zak]
   - rename --buffer-size to --io-size  [Karel Zak]
   - rewrite files content comparison  [Karel Zak]
   - simplify file_link()  [Karel Zak]
   - small regex stuff refactoring  [Karel Zak]
   - use more passive wording in hardlink.1  [Eduard Bloch]
   - use open(O_CREAT) with mode  [Karel Zak]
hexdump:
   - correctly display signed single byte integers  [Samir Benmendil]
   - do not use atoi()  [Karel Zak]
hwclock:
   - add --param-get option  [Bastian Krause]
   - add --param-set option  [Bastian Krause]
   - check errno after strto..()  [Karel Zak]
   - cleanup hwclock_params[] use  [Karel Zak]
   - close adjtime on write error [coverity scan]  [Karel Zak]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
   - fix ul_path_scanf() use  [Karel Zak]
   - get/set param cleanup  [Karel Zak]
   - increase indent in help text  [Bastian Krause]
include:
   - Rename HiFive partition UUIDs  [Alexandre Ghiti]
include/c:
   - Add abs_diff macro  [Sean Anderson]
   - add __format__ attribute  [Karel Zak]
   - add cmp_timespec() and cmp_stat_mtime()  [Karel Zak]
   - add drop_permissions(), consolidate UID/GID reset  [Karel Zak]
include/fileeq:
   - add functions to compare files content  [Karel Zak]
include/path:
   - add __format__attribute  [Karel Zak]
include/strutils:
   - cleanup strto..() functions  [Karel Zak]
   - consolidate string to number conversion  [Karel Zak]
   - fix __format__attribute  [Karel Zak]
include/strv:
   - fix format attributes  [Karel Zak]
ipcmk:
   - fix strtoul use, remove deadcode [coverity scan]  [Karel Zak]
ipcs:
   - check errno after strto..()  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
irqtop:
   - add -c/--cpu-stat option  [zhenwei pi]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
   - fix options parsing  [Karel Zak]
   - small coding style change  [Karel Zak]
sfdisk:
   - improve --backup documentation  [Karel Zak]
kill:
   - check errno after strto..()  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
last:
   - use snprintf() rather than sprintf()  [Karel Zak]
ldattach:
   - add __format__ attribute  [Karel Zak]
lib:
   - use snprintf() rather than sprintf()  [Karel Zak]
lib/buffer:
   - add possibility to save position in the buffer  [Karel Zak]
   - add support for "safe" encoding  [Karel Zak]
   - fix buffer reset  [Karel Zak]
   - return size of the buffer and data  [Karel Zak]
lib/caputils:
   - use lib/procfs.c  [Karel Zak]
lib/env:
   - don't ignore failed malloc  [Karel Zak]
lib/fileeq:
   - fix for small memsiz  [Karel Zak]
lib/jsonwrt:
   - check if JSON handler is initialized  [Karel Zak]
lib/loopdev:
   - perform retry on EAGAIN  [Karel Zak]
lib/path:
   - (test) fix ul_new_path() use  [Karel Zak]
   - add ul_path_next_dirent()  [Karel Zak]
   - fix possible leak when use ul_path_read_string() [coverity scan]  [Karel Zak]
   - fstat dir itself  [Karel Zak]
   - improve ul_path_readlink() to be more robust  [Karel Zak]
   - make path use more robust [coverity scan]  [Karel Zak]
   - use flags for fstatat()  [Karel Zak]
lib/procfs:
   - add functions to read /proc/#/ stuff  [Karel Zak]
lib/pwdutils:
   - don't use getlogin(3).  [Érico Nogueira]
   - use assert to check correct usage.  [Érico Nogueira]
lib/strutils:
   - add strappend()  [Karel Zak]
   - improve normalize_whitespace()  [Karel Zak]
   - make sure mem2strcpy() buffer is zeroized  [Karel Zak]
   - make test_strutils_normalize() more robust  [Karel Zak]
   - rename strappend() to strconcat()  [Karel Zak]
lib/sys:
   - add sysfs_chrdev_devno_to_devname()  [Karel Zak]
libblkid:
   - (btrfs) add debug messages to zoned support  [Karel Zak]
   - Add hyphens to UUID string representation in Stratis superblock parsing  [John Baublitz]
   - Optimize the blkid_safe_string() function  [Karel Zak, changlianzhi]
   - add magic and probing for zoned btrfs  [Naohiro Aota]
   - check UBI char device name  [Karel Zak]
   - check blkid_get_cache() return value [coverity scan]  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - check for ioctl macro rather than for header file  [Karel Zak]
   - don't mark cache as "probed" if /sys not available  [Karel Zak]
   - fix and cleanup blkid_safe_string()  [Karel Zak]
   - ignore scanf() result when read number of stripes [coverity scan]  [Karel Zak]
   - implement zone-aware probing  [Naohiro Aota]
   - optimize ioctl calls in blkid_probe_set_device()  [Karel Zak]
   - remove EVMS support  [Karel Zak]
   - remove unnecessary ifdef  [Karel Zak]
   - reopen floppy without O_NONBLOCK  [Karel Zak]
   - reset errno after failed floppy test  [Karel Zak]
   - support zone reset for wipefs  [Naohiro Aota]
   - use snprintf() rather than sprintf()  [Karel Zak]
   - vfat  Fix reading FAT16 boot label and serial id  [Pali Rohár]
   - vfat  Fix reading FAT32 boot label  [Pali Rohár]
libblkid/src/probe:
   - check for ENOMEDIUM from ioctl(CDROM_LAST_WRITTEN)  [Jeremi Piotrowski]
libuuid:
   - use _UL_LIBUUID_UUID_H to cover uuid.h  [Karel Zak]
libfdisk:
   - (dos) Add check both begin and end CHS partition parameters  [Pali Rohár]
   - (dos) Add function dos_partition_sync_chs() for updating CHS values  [Pali Rohár]
   - (dos) Add function fdisk_dos_fix_chs() for fixing CHS values for all partitions  [Pali Rohár]
   - (dos) Fix check error message when CHS calculated sector does not match LBA  [Pali Rohár]
   - (dos) Fix determining number of heads and sectors per track from MBR  [Pali Rohár]
   - (dos) Fix printing number of CHS sectors in check error message  [Pali Rohár]
   - (dos) Fix setting CHS values when creating new partition  [Pali Rohár]
   - (dos) Fix upper bound cylinder check in check()  [Pali Rohár]
   - (dos) Fix upper bound cylinder check in check_consistency()  [Pali Rohár]
   - (dos) Put number of CHS check_consistency errors into summary message  [Pali Rohár]
   - (dos) Recalculate number of cylinders after changing number of heads and sectors  [Pali Rohár]
   - (dos) Use helper macros cylinder() and sector() in check_consistency()  [Pali Rohár]
   - (dos) don't ignore MBR+FAT use-case  [Karel Zak]
   - (dos) index partition from zero for DBG()  [Karel Zak]
   - (dos) support partition and MBR overlap  [Karel Zak]
   - (gpt) align size of partition by default  [Karel Zak]
   - (gpt) make fdisk -x output more readable  [Karel Zak]
   - (gpt) provide last LBA where is partitions array  [Karel Zak]
   - (script) rewrite start= and size= parsing  [Karel Zak]
   - add and fix __format__ attributes  [Karel Zak]
   - add new Linux GPT partition types  [WANG Xuerui]
   - check calloc() return [gcc-analyzer]  [Karel Zak]
   - dereference of possibly-NULL [gcc-analyzer]  [Karel Zak]
   - don't use too small free segments by default  [Karel Zak]
   - enlarge partition by move start down  [Karel Zak]
   - incorrect GUID for NetBSD  [Siu Ching Pong -Asuka Kenji-]
   - make self_partition() use more robust [gcc-analyzer]  [Karel Zak]
libmount:
   - (--all) continue although /proc is not mounted  [Karel Zak]
   - add X-mount.subdir=  [Karel Zak]
   - add __format__ attribute  [Karel Zak]
   - add mnt_fs_is_deleted()  [Karel Zak]
   - add mnt_fs_is_regularfs() to public API  [Karel Zak]
   - allow X-* options more than once  [Karel Zak]
   - assert() is enough [lgtm scan]  [Karel Zak]
   - change propagation of /run for X-mount.subdir  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - disable mtab only on statfs() success only  [Karel Zak]
   - don't use setgroups() at all  [Karel Zak]
   - fix UID check for FUSE umount [CVE-2021-3995]  [Karel Zak]
   - fix setgroups() use  [Karel Zak]
   - make mnt_table_get_fs_root() more robust [gcc-analyzer]  [Karel Zak]
   - remove support for deleted mount table entries  [Karel Zak]
   - remove support for obsolete /dev/.mount/utab  [Karel Zak]
   - show options string on parse error  [Karel Zak]
   - support quotes in X-mount options  [Karel Zak]
   - use /run/mount/tmptgt rather than /tmp/mount/mount.<pid>  [Karel Zak]
libsmartcols:
   - add multi-line cells to samples  [Karel Zak]
   - add scols_line_get_column_data()  [Karel Zak]
   - add support for optional boolean values  [Thomas Weißschuh]
   - fix bare array on JSON output  [Karel Zak]
   - fix lines groups for multi-line cells  [Karel Zak]
   - use lib/buffer, remove local implementation  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
libuuid:
   - check errno after strto..()  [Karel Zak]
   - include c.h to cover restrict keyword  [Karel Zak]
logger:
   - add __format__ attribute  [Karel Zak]
   - dealloc login name  [Karel Zak]
   - fix --prio-prefix doesn't use --priority default  [Karel Zak]
   - fix --size use for stdin  [Karel Zak]
   - realloc buffer when header size changed  [Karel Zak]
   - use xgetlogin from pwdutils.  [Érico Nogueira]
login:
   - (adoc) add hint about PAM and env.variables  [Karel Zak]
   - Restore tty size after calling vhangup()  [Daan De Meyer]
   - add callback for close_range()  [Karel Zak]
   - fix close_range() use  [Karel Zak]
   - improve coding style  [Karel Zak]
   - remove obsolete and confusing comment  [Karel Zak]
logindefs:
   - use snprintf() rather than sprintf()  [Karel Zak]
loopdev:
   - Do not treat errors when detecting overlap as fatal  [Jan Kara]
   - Properly translate errors from ul_path_read_*()  [Jan Kara]
   - accept ENOSYS for LOOP_CONFIGURE  [Alex Xu]
losetup:
   - Add missing pipe to man example for setting up loop device  [Vojtech Trefny]
   - directly set dio instead of afterwards  [Alex Xu (Hello71)]
   - use LOOP_CONFIGURE in a more robust way  [Karel Zak]
lsblk:
   - (adoc) improve --all description  [Karel Zak]
   - add --noempty  [Karel Zak]
   - add column START for partition start offsets  [Karel Zak]
   - add columns of zoned parameters  [Naohiro Aota]
   - add zoned columns to "lsblk -z"  [Naohiro Aota]
   - factor out function to read sysfs param as bytes  [Naohiro Aota]
   - fix formatting in -e option  [ratijas]
   - normalize space in SERIAL and MODEL  [Karel Zak]
   - sort list of columns  [Karel Zak]
   - sort usage() output  [Karel Zak]
   - use ID_MODEL_ENC if possible  [Karel Zak]
lscpu:
   - (arm) remove extra whitespace  [Karel Zak]
   - Add Phytium FT-2000+ & S2500 support  [panchenbo]
   - Add Phytium aarch64 cpupart  [panchenbo]
   - add SCALMHZ% and "CPU scaling MHz "  [Karel Zak]
   - add additional arm cpu part numbers  [Ali Saidi]
   - add bios_family  [Huang Shijie]
   - add more sanity checks for dmi_decode_cputype()  [Huang Shijie]
   - check errno after strto..()  [Karel Zak]
   - do not use atoi()  [Karel Zak]
   - don't use DMI if executed with --sysroot  [Karel Zak]
   - fix NULL dereference  [Karel Zak]
   - fix build on powerpc  [Georgy Yakovlev]
   - fix compilation against librtas  [Karel Zak]
   - fix cppcheck warning [Uninitialized variable]  [Karel Zak]
   - get the processor information by DMI  [Huang Shijie]
   - read MHZ from /sys/.../cpufreq/scaling_cur_freq  [Karel Zak, Thomas Weißschuh]
   - remove extra blank lines  [Karel Zak]
   - remove the old code  [Huang Shijie]
   - remove unintended change  [Karel Zak]
   - use MHZ as number to be locale sensitive  [Karel Zak]
   - use json types  [Thomas Weißschuh]
   - use locale-independent strtod() when read from kernel  [Karel Zak]
   - use optional json values  [Thomas Weißschuh]
lsfd:
   - (adoc) add more examples  [Masatake YAMATO]
   - (adoc) add proc(5) to SEE ALSO section  [Masatake YAMATO]
   - (adoc) put missing    at the end of options  [Masatake YAMATO]
   - (adoc) remove a redundant word  [Masatake YAMATO]
   - (adoc) reorder the options  [Masatake YAMATO]
   - (adoc) reorder the sections  [Masatake YAMATO]
   - (adoc) update DESCRIPTION  [Masatake YAMATO]
   - (adoc) write about filter expression  [Masatake YAMATO]
   - (adoc) write more about the -o option  [Masatake YAMATO]
   - (filter) accept % char as a part of column name  [Masatake YAMATO]
   - (filter) fix a memory leak  [Masatake YAMATO]
   - (filter) give a name to a constant  [Masatake YAMATO]
   - (filter) implement !~, an operator for regex unmatching  [Masatake YAMATO]
   - (filter) implement =~, an operator for regex matching  [Masatake YAMATO]
   - (filter) make error messages in check_type methods  [Masatake YAMATO]
   - (filter) make some data structures its source file local  [Masatake YAMATO]
   - (filter) whitespace cleanup  [Masatake YAMATO]
   - (helper) accept an integer argument for a parameter  [Masatake YAMATO]
   - (helper) add "dentries" parameter to directory factory  [Masatake YAMATO]
   - (helper) add "dir" parameter to directory factory  [Masatake YAMATO]
   - (helper) add "file" parameter to ro-regular-file factory  [Masatake YAMATO]
   - (helper) add "nonblock" parameter to pipe-no-fork factory  [Masatake YAMATO]
   - (helper) add "offset" parameter to ro-regular-file factory  [Masatake YAMATO]
   - (helper) allow to pass extra parameters  [Masatake YAMATO]
   - (helper) improve the code converting file descriptor numbers  [Masatake YAMATO]
   - (helper) set proper errno before calling err()  [Masatake YAMATO]
   - (helper) update examples in the help message  [Masatake YAMATO]
   - (helper) use more "const" modifiers  [Masatake YAMATO]
   - (test) add a case for displaying COMMAND column  [Masatake YAMATO]
   - (test) add a case for displaying DEV column  [Masatake YAMATO]
   - (test) add a case for displaying a character device  [Masatake YAMATO]
   - (test) add a case for displaying a directory  [Masatake YAMATO]
   - (test) add a case for displaying socket pairs  [Masatake YAMATO]
   - (test) add a case for displaying symlinks  [Masatake YAMATO]
   - (test) add a case for testing FLAGS (wronly,nonblock) column  [Masatake YAMATO]
   - (test) add a case for testing SIZE column  [Masatake YAMATO]
   - (test) add cases for displaying a regular file and pipe  [Masatake YAMATO]
   - (test) test POS column  [Masatake YAMATO]
   - Add initial man page  [Mario Blättermann]
   - Add new man page to po4a.cfg  [Mario Blättermann]
   - Fix typos in lsfd.c  [Mario Blättermann]
   - add --debug-filter option  [Masatake YAMATO]
   - add --dump-counters option  [Masatake YAMATO]
   - add --notruncate  [Karel Zak]
   - add --sysroot, use lib/path.c  [Karel Zak]
   - add CHRDRV column  [Masatake YAMATO]
   - add DEVTYPE column  [Masatake YAMATO]
   - add FLAGS, MNTID, and POS columns  [Masatake YAMATO]
   - add FUID and OWNER columns  [Masatake YAMATO]
   - add KTHREAD column  [Masatake YAMATO]
   - add MAPLEN column  [Masatake YAMATO]
   - add MISCDEV column  [Masatake YAMATO]
   - add MODE column  [Masatake YAMATO]
   - add NLINK and DELETED columns  [Masatake YAMATO]
   - add PARTITION column  [Masatake YAMATO]
   - add PROTONAME column  [Masatake YAMATO]
   - add a function to get the name of filesystem from a given minor number  [Masatake YAMATO]
   - add a helper function for building filter  [Masatake YAMATO]
   - add a helper function for reading bdevs in /proc/devices  [Masatake YAMATO]
   - add a stub for fifo type  [Masatake YAMATO]
   - add code for reading /proc/$pid/maps  [Masatake YAMATO]
   - add columns for DEV and RDEV  [Masatake YAMATO]
   - add columns for SIZE  [Masatake YAMATO]
   - add cwd, exe, and root associations  [Masatake YAMATO]
   - add filter engine  [Masatake YAMATO]
   - add infrastructure code for reading fdinfo files  [Masatake YAMATO]
   - add mem associations  [Masatake YAMATO]
   - add namespace related associations  [Masatake YAMATO]
   - add new man page to Makemodule.am  [Masatake YAMATO]
   - add reference to proc from file  [Karel Zak]
   - add stubs for sockets and files of unknown type  [Masatake YAMATO]
   - add the way to initialize and finalize classes  [Masatake YAMATO]
   - adjust column width for COMMAND  [Masatake YAMATO]
   - allow passing a proc object to the constructors of the file classes  [Masatake YAMATO]
   - change the license of the filtering engine to LGPL  [Masatake YAMATO]
   - check ul_strtou*() return code [coverity scan]  [Karel Zak]
   - cleanup --summary semantic  [Karel Zak]
   - cleanup collect_outofbox_files stuff  [Karel Zak]
   - cleanup fdinfo handling  [Karel Zak]
   - cleanup new file initialization  [Karel Zak]
   - collect threads level information if TID is specified in a filter  [Masatake YAMATO]
   - convert lines introducing local variable to a block with {...}  [Masatake YAMATO]
   - declare JSON type in colinfo entries  [Masatake YAMATO]
   - declare local variables at the beginning of block  [Masatake YAMATO]
   - delete an unnecessary semicolon  [Masatake YAMATO]
   - don't collect and print redundant information about threads  [Masatake YAMATO]
   - don't define a local variable in the middle of a block  [Masatake YAMATO]
   - don't duplicate ASSOC_EXE processing  [Karel Zak]
   - don't use 'long int' for file data  [Karel Zak]
   - don't use threads  [Masatake YAMATO]
   - fill ASSOC field  [Masatake YAMATO]
   - fill DEVICE field  [Masatake YAMATO]
   - fill INODE field  [Masatake YAMATO]
   - fill POS and MODE columns for SHM and MEM associated files  [Masatake YAMATO]
   - fill PROTONAME field of file for mmap'ed socket  [Masatake YAMATO]
   - fill TYPE field  [Masatake YAMATO]
   - fill UID column with the process's uid  [Masatake YAMATO]
   - fill UID field  [Masatake YAMATO]
   - fill USER field  [Masatake YAMATO]
   - fix ASSOC_EXE use, make file->association use more robust  [Karel Zak]
   - fix a typo in DEVTYPE description  [Masatake YAMATO]
   - fix a typo in comment  [Masatake YAMATO]
   - fix copy & paste error [coverity scan]  [Karel Zak]
   - fix typo, rename function  [Karel Zak]
   - fix use-after-free and resource leak [coverity scan]  [Karel Zak]
   - function rename  [Karel Zak]
   - give column widths  [Masatake YAMATO]
   - implement --summary and --counter options  [Masatake YAMATO]
   - increase the threads to collect information  [Masatake YAMATO]
   - initial commit  [Masatake YAMATO]
   - introduce --source filter option  [Masatake YAMATO]
   - introduce -Q option for generic filtering  [Masatake YAMATO]
   - introduce -p/--pid option, pids filter working in the early stage  [Masatake YAMATO]
   - introduce DEVNAME column and use it as default  [Masatake YAMATO]
   - introduce a data structure for storing common fdinfo data  [Masatake YAMATO]
   - introduce fopenf helper function  [Masatake YAMATO]
   - introduce name_manager  [Masatake YAMATO]
   - introduce new association SHM representing shared file mapping  [Masatake YAMATO]
   - keep main() at the end of the code  [Karel Zak]
   - make sure we do not use uninitialized struct stat [coverity scan]  [Karel Zak]
   - make username_cache lsfd-file private  [Masatake YAMATO]
   - move file_class variants after their constructors  [Masatake YAMATO]
   - move list_free() to list.h  [Karel Zak]
   - move the code for reading /proc/devices to lsfd.c  [Masatake YAMATO]
   - optimize maps use  [Karel Zak]
   - optimize symlinks use  [Karel Zak]
   - print the owner of process as USER  [Masatake YAMATO]
   - purge fd layer  [Masatake YAMATO]
   - read /proc/partitions  [Masatake YAMATO]
   - read character driver names from /proc/devices  [Masatake YAMATO]
   - read misc character device names from /proc/misc  [Masatake YAMATO]
   - refactor  [Masatake YAMATO]
   - refactor code calling collect_outofbox_files  [Masatake YAMATO]
   - remove --source option  [Masatake YAMATO]
   - remove collect_file()  [Karel Zak]
   - remove a duplicated O_ flag entry  [Masatake YAMATO]
   - remove prototype decls for removed functions  [Masatake YAMATO]
   - remove redundant "nodev " prefix from DEVNAME column  [Masatake YAMATO]
   - remove struct fdinfo_data  [Karel Zak]
   - remove unused --sysroot  [Karel Zak]
   - remove unused code  [Karel Zak]
   - rename DEVNAME column to SOURCE  [Masatake YAMATO]
   - rename the column DEVICE to MAJ MIN  [Masatake YAMATO]
   - reorder function  [Karel Zak]
   - replace "socket " in NAME of SOCKET with its protoname  [Masatake YAMATO]
   - replace POS with MNT_ID in default column set  [Masatake YAMATO]
   - revert include/path.h use  [Karel Zak]
   - simplify class hierarchy  [Masatake YAMATO]
   - small cleanup to mountinfo/nodev code  [Karel Zak]
   - sort the enumerators about columns  [Masatake YAMATO]
   - specify columns in a more declarative way  [Masatake YAMATO]
   - split new_file(), remove map_file_data  [Karel Zak]
   - support threads with -l option  [Masatake YAMATO]
   - tiny change to default columns initialization  [Karel Zak]
   - unify nodev lists into global one  [Masatake YAMATO]
   - use 'new_' prefix when we allocate something  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use new scols_line_get_column_data()  [Karel Zak]
   - use one function to all symlinks  [Karel Zak]
   - use only "/proc/#/maps" file  [Karel Zak]
   - use path_cxt to read process  [Karel Zak]
   - use the list of block devices in /proc/devices for decoding SOURCE column  [Masatake YAMATO]
   - wrap code for debugging with #ifdef DEBUG/#endif  [Masatake YAMATO]
lsfd.1.adoc:
   - Add missing underscores  [Mario Blättermann]
   - Fix markup  [Mario Blättermann]
   - Fix wording and markup  [Mario Blättermann]
   - Fix yet another entry in the filter examples list  [Mario Blättermann]
   - Improve punctuation and add translator comments  [Mario Blättermann]
   - add caution about the CLI stability  [Masatake YAMATO]
   - fix a typo  [Masatake YAMATO]
   - remove redundant parenthesis from the examples  [Masatake YAMATO]
lsfd.1.doc:
   - use delimited literal block notation in the EXAMPLE section  [Masatake YAMATO]
   - write about --summary and --counter options  [Masatake YAMATO]
lsipc:
   - use lib/procfs.c  [Karel Zak]
lslocks:
   - add INODE and MAJ MIN columns  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - check scanf() return code [coverity scan]  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
lslogins:
   - ask for supplementary groups only once [asan]  [Karel Zak]
   - check errno after strto..()  [Karel Zak]
   - consolidate and optimize utmp files use  [Karel Zak]
   - fix memory leak [asan]  [Karel Zak]
   - remove unwanted debug message  [Karel Zak]
   - use lib/procfs.c  [Karel Zak]
   - use sd_journal_get_data() in proper way  [Karel Zak]
lsmem:
   - check errno after strto..()  [Karel Zak]
lsns:
   - fill UID and USER columns for interpolated namespaces  [Masatake YAMATO]
   - fix compilation on old systems without linux/nsfs.h  [Karel Zak]
   - fix copy & paste in man page  [Karel Zak]
   - fix old error message  [Karel Zak]
   - fix passing wrong process lists when showing ownerns and parentns  [Masatake YAMATO]
   - interpolate missing namespaces for converting forests to a tree  [Masatake YAMATO]
   - make --tree default, update man-page  [Karel Zak]
   - make namespace having no process printable  [Masatake YAMATO]
   - optimize mountinfo use  [Karel Zak]
   - print namespace tree based on the relationship (parent/child or owner/owned)  [Masatake YAMATO]
   - reorganize members specifying other namespaces in lsns_namespace  [Masatake YAMATO]
   - unify the code and option for printing process based tree and namespace based trees  [Masatake YAMATO]
   - use lib/procfs.c  [Karel Zak]
lspcu:
   - Print dummy modelname if none present  [Thomas Weißschuh]
man pages:
   - Fix punctuation, wording and markup  [Mario Blättermann]
mcookie:
   - fix infinite loop when using -f  [Hiroaki Sengoku]
meson:
   - add missing header files check  [Karel Zak]
   - do not generate fstrim.service if we do not have systemd  [Martin Roukala (né Peres)]
   - fix bash_completion.get_variable() use  [Karel Zak]
   - fix building libsmartcols  [Alex Xu (Hello71)]
   - fix building logger  [Alex Xu (Hello71)]
   - fix crypt_activate_by_signed_key detection  [Luca Boccassi]
   - fix dlopen support for cryptsetup  [Luca Boccassi]
   - fix typo  [Karel Zak]
   - headers  Install headers  [Thomas Weißschuh]
   - headers  use util-linux version of version defines  [Thomas Weißschuh]
   - install examples to correct directory  [Thomas Weißschuh]
   - install manpages and bash completions  [Thomas Weißschuh]
   - keep bash-completion symlinks in variable  [Karel Zak]
   - make asciidoc optional  [Alex Xu (Hello71)]
   - make raw(7) optional  [Karel Zak]
   - only install pkgconfig if library is built  [Thomas Weißschuh]
misc:
   - consolidate stat() error message  [Karel Zak]
   - improve string to number conversions  [Karel Zak]
mkfs.cramfs:
   - add comment to explain readlink() use  [Karel Zak]
mkswap:
   - (adoc) suggest looking up page size portably  [Jakub Wilk]
   - add --quiet  [Karel Zak]
   - fix holes detection (infinite loop and/or stack-buffer-underflow)  [Karel Zak]
   - support -U {clear,random,time,uuid}  [Karel Zak]
more:
   - Calling open without checking return value [coverity scan]  [Karel Zak]
   - POSIX compliance patch preventing exit on EOF without -e  [Ian Jones]
   - add __format__ attribute  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - fix -e in non-interactive mode  [Karel Zak]
   - fix null-pointer dereference  [Karel Zak]
   - fix setuid/setgid order  [Karel Zak]
   - improve zero size handling  [Tobias Stoeckmann]
   - use snprintf() rather than sprintf()  [Karel Zak]
mount:
   - (adoc) add hint about /proc and /sys to --all description  [Karel Zak]
   - (adoc) ext_N_ → ext__N__ [manpage-l10n]  [Karel Zak]
   - (adoc) fix comma splice  [Jakub Wilk]
   - (adoc) fix missing period [manpage-l10n]  [Karel Zak]
   - (adoc) mount → mount(2),  of → or [manpage-l10n]  [Karel Zak]
   - (man) fix example  [Karel Zak]
   - Allow bind-mounting with "nosymfollow"  [Jakub Wilk]
   - Fix race in loop device reuse code  [Jan Kara]
   - add -m,--mkdir as shortcut for X-mount.mkdir  [Karel Zak]
   - add hint about dmesg(8) to error messages  [Karel Zak]
   - add hint about systemctl daemon-reload  [Karel Zak]
   - fix roothash signature extension in manpage  [Luca Boccassi]
   - man-page; add all overlayfs options  [Tj]
   - mount.8 don't consider additional mounts as experimental  [Karel Zak]
   - mount.8 fix overlayfs nfs_export= indention  [Karel Zak]
   - use mnt_fs_is_regularfs()  [Karel Zak]
mount.8.adoc:
   - Remove context options exclusion  [Thiébaud Weksteen]
   - document SELinux use of nosuid mount flag  [Topi Miettinen]
   - fix misformatting  [Mario Blättermann]
   - note that mandatory locking is fully deprecated in Linux 5.15  [Jeff Layton]
mount_fuzz:
   - reject giant files early  [Evgeny Vereshchagin]
namei:
   - simplify code  [Karel Zak]
newgrp:
   - fix memory leak [coverity scan]  [Karel Zak]
nsenter:
   - Do not try to enter nonexisting namespaces when --all is used  [Yonatan Goldschmidt]
   - add --wdns to change working directory  [Karel Zak]
   - clear SIGCHLD inherited setting  [Karel Zak]
pg:
   - do not use atoi()  [Karel Zak]
po:
   - add sk.po (from translationproject.org)  [Jose Riha]
   - merge changes  [Karel Zak]
   - update cs.po (from translationproject.org)  [Petr Písař]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - update pl.po (from translationproject.org)  [Jakub Bogusz]
   - update pt_BR.po (from translationproject.org)  [Rafael Fontenelle]
   - update sr.po (from translationproject.org)  [Мирослав Николић]
   - update zh_CN.po (from translationproject.org)  [Boyuan Yang]
prlimit:
   - fix compiler warning [-Wmaybe-uninitialized]  [Karel Zak]
   - improve --help output  [Karel Zak]
   - make syscall use more robust  [Karel Zak]
readprofile:
   - check errno after strto..()  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
rename:
   - add --all and --last parameters  [Todd Lewis]
   - size_t, mutually exclusive parameters  [Todd Lewis]
   - stop after count changes  [Todd Lewis]
   - use readlink() in more robust way  [Karel Zak]
rfkill:
   - Set scols table name to make the json output valid  [Nicolai Dagestad]
   - quit when read end of stdout is closed  [Mickey Rose]
script:
   - (adoc) improve man page readability  [Karel Zak]
   - add COMMAND= to log header  [Karel Zak, Henrik Bach]
   - add __format__ attribute  [Karel Zak]
   - add separator to header, update tests  [Karel Zak]
   - don't use \n when we log COMMAND  [Karel Zak]
script.1.adoc:
   - correct socond as second  [Vicente Jimenez Aguilar]
setterm:
   - (man) improve docs about optional arguments  [Karel Zak]
sfdisk:
   - fix typo in --move-data when check partition size  [Karel Zak]
   - update docs, add examples to the man page  [Karel Zak]
   - write empty label also when only ignored partition specified  [Karel Zak]
sfdisk man:
   - Escape ((…)) to avoid AsciiDoc interpreting and stripping from manpage  [Paul Sarena]
su:
   - (bash-completion) offer usernames rather than files  [Karel Zak]
   - Verify default SIGCHLD handling.  [Tobias Stoeckmann]
   - reset RLIMIT_AS too  [Karel Zak]
   - reset RLIMIT_{NICE,RTPRIO} to zero  [Karel Zak]
   - reset also RLIMIT_FSIZE and RLIMIT_NOFILE  [Karel Zak]
   - use LOG_PID for syslog  [Sam James]
sulogin:
   - Display all kinds of errno during password input.  [Shigeki Morishima]
   - add missing ifdefs  [Karel Zak]
   - fix compiler warning [-Werror=implicit-fallthrough=]  [Karel Zak]
   - fix whitespace error  [Karel Zak]
   - ignore none-existing console devices  [Werner Fink]
   - use explicit_bzero() for buffer with password  [Karel Zak]
swapon:
   - do not use atoi()  [Karel Zak]
sys-utils/ipcutils:
   - be careful when calling calloc() for uint64 nmembs  [Karel Zak]
sysfs:
   - fallback for partitions not including parent name  [Portisch]
taskset:
   - use lib/procfs.c  [Karel Zak]
test/eject:
   - guard asan LD_PRELOAD with use-system-commands check  [Ross Burton]
test_mount_optstr:
   - use xstrdup()  [Karel Zak]
tests:
   - (hardlink) add info about number of files to test  [Karel Zak]
   - (logger) check for socat  [Karel Zak]
   - (lsfd) add a case for listing a fd opening a block device  [Masatake YAMATO]
   - (lsfd) add a factory for opening a block device to the helper command  [Masatake YAMATO]
   - (lsfd) call ts_skip_nonroot earlier  [Masatake YAMATO]
   - (lsfd) don't compare inodes  [Masatake YAMATO]
   - (lsfd) fix file descriptor leaks reported by coverity  [Masatake YAMATO]
   - (lsfd) give missing test descriptions  [Masatake YAMATO]
   - (lsfd) make DGRAM socketpair to mitigate the change of protoname  [Masatake YAMATO]
   - (lsfd) normalize protoname before comparing  [Masatake YAMATO]
   - Fix test/misc/swaplabel failure due to change in mkswap behaviour.  [Mark Hindley]
   - Skip lsns/ioctl_ns test if unshare fails  [Chris Hofstaedtler]
   - add rv64 lscpu test  [Karel Zak]
   - add tests for dm-verity support in mount  [Vojtěch Eichler]
   - check correct log file for errors in blkdiscard test  [Ross Burton]
   - check for dm-verity support  [Karel Zak]
   - don't hardcode /bin/kill in the kill tests  [Ross Burton]
   - fdisk  Layout with more details  [Pali Rohár]
   - fdisk  Update CHS values in MBR partitions  [Pali Rohár]
   - fix fdisk/bsd on big endian systems (tested on sparc64 and ppc64)  [Anatoly Pugachev]
   - fix lsns test on kernels without USER namespaces  [Anatoly Pugachev]
   - make ./run.sh more robust  [Karel Zak]
   - make eject umount tests more robust  [Karel Zak]
   - make mount/fstab-all more robust  [Karel Zak]
   - make use of subtests  [Vojtěch Eichler]
   - mark ul/ul as a known failure  [Ross Burton]
   - skip if scsi_debug model file is not accessible  [Karel Zak]
   - split additional tests into subtests  [Vojtěch Eichler]
   - split cal/color test into subtests  [Vojtěch Eichler]
   - split cal/colorw test into subtests  [Vojtěch Eichler]
   - split several tests into subtests  [Vojtěch Eichler]
   - split test into subtest  [Vojtěch Eichler]
   - update build-sys test  [Karel Zak]
   - update hardlink --maximum-size  [Karel Zak]
   - update hardlink output  [Karel Zak]
   - update lscpu output  [Karel Zak]
   - update lscpu outputs  [Karel Zak]
   - update mountinfo files  [Karel Zak]
   - update sfdisk reorder test  [Karel Zak]
   - use sub-tests for dm-verity  [Karel Zak]
   - use subtests  [Vojtěch Eichler]
tests/eject:
   - check for root perms at beginning  [Karel Zak]
tools:
   - allow to update specific files on kernel.org  [Karel Zak]
   - report and use LDFLAGS in tools/config-gen  [Karel Zak]
tools/git-version-gen:
   - use NEWS as a fallback  [Karel Zak]
uclampset:
   - Fix left over optind++  [Qais Yousef]
   - use lib/procfs.c  [Karel Zak]
unshare:
   - Add option to automatically create user and group maps  [Sean Anderson]
   - Add options to map blocks of user/group IDs  [Sean Anderson]
   - Add some helpers for forking and synchronizing  [Sean Anderson]
   - Add waitchild helper  [Sean Anderson]
   - Document --map-{groups,users,auto}  [Sean Anderson]
   - Fix PDEATHSIG race for --kill-child  [Earl Chew]
   - Fix doc comments  [Sean Anderson]
   - Propagate inherited signal handling to forked child  [Earl Chew]
   - clear SIGCHLD inherited setting  [Karel Zak]
   - fix memory leak [coverity scan]  [Karel Zak]
   - fix typo in uint_to_id()  [Karel Zak]
unshare.1.adoc:
   - Improve wording re creation of bind mounts  [Michael Kerrisk]
   - Improve wording re namespace creation  [Michael Kerrisk]
utmpdump:
   - do not use atoi()  [Karel Zak]
   - don't ignore sscanf() return code [coverity scan]  [Karel Zak]
uuidd:
   - Whitelist libuuid clock file  [Stanislav Brabec]
   - fix open/lock state issue  [Karel Zak]
   - use snprintf() rather than sprintf()  [Karel Zak]
uuidgen.1.adoc:
   - mention uuidparse in SEE ALSO  [Bruno Heridet]
verity:
   - add support for corruption action flag  [Luca Boccassi]
   - fix verity.roothashsig only working as last parameter  [Luca Boccassi]
   - remove experimental tag from mount manpage  [Luca Boccassi]
vipw:
   - flush stdout before getting answer.  [Érico Nogueira]
   - improve child error handling  [Tobias Stoeckmann]
   - use snprintf() rather than sprintf()  [Karel Zak]
wall:
   - add __format__ attribute  [Karel Zak]
   - use xgetlogin.  [Érico Nogueira]
wdctl:
   - Workaround reported boot-status bits not being present in wd->ident.options  [Hans de Goede]
   - add --setpregovernor  [Karel Zak]
   - add --setpretimeout  [Karel Zak]
   - print the current and available governors  [Karel Zak]
   - set_watchdog() refactoring  [Karel Zak]
   - sysfs open refactoring  [Karel Zak]
   - update man page  [Karel Zak]
whereis:
   - use commands for Bash completions  [Smitty]
wipefs:
   - check errno after strto..()  [Karel Zak]
write:
   - use snprintf() rather than sprintf()  [Karel Zak]
zramctl:
   - add zstd compression algorithm option  [Jan Samek]
   - improve usage() output  [Karel Zak]



* [PATCH 06/35] x86/cet: Add control-protection fault handler
  @ 2022-01-30 21:18  3% ` Rick Edgecombe
  0 siblings, 0 replies; 200+ results
From: Rick Edgecombe @ 2022-01-30 21:18 UTC (permalink / raw)
  To: x86, H . Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H . J . Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V . Shankar, Dave Martin, Weijiang Yang,
	Kirill A . Shutemov, joao.moreira, John Allen, kcc, eranian
  Cc: rick.p.edgecombe, Yu-cheng Yu, Michael Kerrisk

From: Yu-cheng Yu <yu-cheng.yu@intel.com>

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works similarly to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---

v1:
 - Update static asserts for NSIGSEGV

Yu-cheng v29:
 - Remove pr_emerg() since it is followed by die().
 - Change boot_cpu_has() to cpu_feature_enabled().

Yu-cheng v25:
 - Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
 - Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
 
 arch/arm/kernel/signal.c           |  2 +-
 arch/arm64/kernel/signal.c         |  2 +-
 arch/arm64/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal32.c       |  2 +-
 arch/sparc/kernel/signal_64.c      |  2 +-
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 62 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 10 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index c532a6041066..59aaadce9d52 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -681,7 +681,7 @@ asmlinkage void do_rseq_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index d8aaf4b6f432..d2da57c415b8 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -983,7 +983,7 @@ void __init minsigstksz_setup(void)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index d984282b979f..8776a34c6444 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -460,7 +460,7 @@ void compat_setup_restart_syscall(struct pt_regs *regs)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index 6cc124a3bb98..dc50b2a78692 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -752,7 +752,7 @@ asmlinkage int do_sys32_sigstack(u32 u_ssptr, u32 u_ossptr, unsigned long sp)
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/sparc/kernel/signal_64.c b/arch/sparc/kernel/signal_64.c
index 2a78d2af1265..7fe2bd37bd1a 100644
--- a/arch/sparc/kernel/signal_64.c
+++ b/arch/sparc/kernel/signal_64.c
@@ -562,7 +562,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long orig_i0, unsigned long
  */
 static_assert(NSIGILL	== 11);
 static_assert(NSIGFPE	== 15);
-static_assert(NSIGSEGV	== 9);
+static_assert(NSIGSEGV	== 10);
 static_assert(NSIGBUS	== 5);
 static_assert(NSIGTRAP	== 6);
 static_assert(NSIGCHLD	== 6);
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index b52407c56000..ff50cd978ea5 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index c9d566dcf89a..54b7a146fd5e 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -641,6 +642,67 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		die("kernel control protection fault", regs, error_code);
+		panic("Unexpected kernel control protection fault.  Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 3ba180f550d7..081f4b37d22c 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -240,7 +240,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.17.1



* Re: [PATCH] pidfd: fix test failure due to stack overflow on some arches
  @ 2022-01-28  8:56  6% ` Christian Brauner
  2022-02-02 15:52  0%   ` Shuah Khan
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2022-01-28  8:56 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Christian Brauner, Shuah Khan, Zach O'Keefe, linux-kernel,
	linux-kselftest

On Thu, Jan 27, 2022 at 01:29:51PM -0800, Axel Rasmussen wrote:
> When running the pidfd_fdinfo_test on arm64, it fails for me. After some
> digging, the reason is that the child exits due to SIGBUS, because it
> overflows the 1024 byte stack we've reserved for it.
> 
> To fix the issue, increase the stack size to 8192 bytes (this number is
> somewhat arbitrary, and was arrived at through experimentation -- I kept
> doubling until the failure no longer occurred).
> 
> Also, let's make the issue easier to debug. wait_for_pid() returns an
> ambiguous value: it may return -1 in all of these cases:
> 
> 1. waitpid() itself returned -1
> 2. waitpid() returned success, but we found !WIFEXITED(status).
> 3. The child process exited, but it did so with a -1 exit code.
> 
> There's no way for the caller to tell the difference. So, at least log
> which occurred, so the test runner can debug things.
> 
> While debugging this, I found that we had !WIFEXITED(), because the
> child exited due to a signal. This seems like a reasonably common case,
> so also print out whether or not we have WIFSIGNALED(), and the
> associated WTERMSIG() (if any). This lets us see the SIGBUS I'm fixing
> clearly when it occurs.
> 
> Finally, I'm suspicious of allocating the child's stack on our stack.
> man clone(2) suggests that the correct way to do this is with mmap(),
> and in particular by setting MAP_STACK. So, switch to doing it that way
> instead.

Heh, yes. :)

commit 99c3a000279919cc4875c9dfa9c3ebb41ed8773e
Author: Michael Kerrisk <mtk.manpages@gmail.com>
Date:   Thu Nov 14 12:19:21 2019 +0100

    clone.2: Allocate child's stack using mmap(2) rather than malloc(3)

    Christian Brauner suggested mmap(MAP_STACK), rather than
    malloc(), as the canonical way of allocating a stack for the
    child of clone(), and Jann Horn noted some reasons why:

        Not on Linux, but on OpenBSD, they do use MAP_STACK now
        AFAIK; this was announced here:
        <http://openbsd-archive.7691.n7.nabble.com/stack-register-checking-td338238.html>.
        Basically they periodically check whether the userspace
        stack pointer points into a MAP_STACK region, and if not,
        they kill the process. So even if it's a no-op on Linux, it
        might make sense to advise people to use the flag to improve
        portability? I'm not sure if that's something that belongs
        in Linux manpages.

        Another reason against malloc() is that when setting up
        thread stacks in proper, reliable software, you'll probably
        want to place a guard page (in other words, a 4K PROT_NONE
        VMA) at the bottom of the stack to reliably catch stack
        overflows; and you probably don't want to do that with
        malloc, in particular with non-page-aligned allocations.

    And the OpenBSD 6.5 manual page says:

        MAP_STACK
            Indicate that the mapping is used as a stack. This
            flag must be used in combination with MAP_ANON and
            MAP_PRIVATE.

    And I then noticed that MAP_STACK seems to have been available
    on FreeBSD for a long time:

        MAP_STACK
            Map the area as a stack.  MAP_ANON is implied.
            Offset should be 0, fd must be -1, and prot should
            include at least PROT_READ and PROT_WRITE.  This
            option creates a memory region that grows to at
            most len bytes in size, starting from the stack
            top and growing down.  The stack top is the starting
            address returned by the call, plus len bytes.
            The bottom of the stack at maximum growth is the
            starting address returned by the call.

            The entire area is reserved from the point of view
            of other mmap() calls, even if not faulted in yet.

    Reported-by: Jann Horn <jannh@google.com>
    Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
    Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
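
For readers following along, here is a minimal sketch of the approach
the commit message above describes: mapping the child's stack with
mmap(2) and MAP_STACK, with a PROT_NONE guard page at the bottom as
Jann suggests. The helper names alloc_child_stack() and
free_child_stack() are made up for illustration and come from neither
the selftest nor the man page; the single-page guard size is likewise
an assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): map usable_size bytes of
 * stack plus one guard page below it. Returns the lowest usable
 * address, or NULL on failure. usable_size must be a multiple of the
 * page size.
 */
static void *alloc_child_stack(size_t usable_size)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0 || usable_size == 0 || usable_size % (size_t)page != 0)
		return NULL;

	size_t total = usable_size + (size_t)page;
	char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	/* The lowest page becomes the guard: a stack overflow faults here. */
	if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
		munmap(base, total);
		return NULL;
	}
	return base + page;
}

/* Release a stack obtained from alloc_child_stack(), guard page included. */
static void free_child_stack(void *stack, size_t usable_size)
{
	long page = sysconf(_SC_PAGESIZE);

	assert(page > 0);
	munmap((char *)stack - page, usable_size + (size_t)page);
}
```

On architectures where the stack grows down, the pointer to hand to
clone() would be the returned address plus usable_size, as clone(2)
describes.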


> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---

Yeah, stack handling - especially with legacy clone() - is yucky on the
best of days. Thank you for the fix.

Acked-by: Christian Brauner <brauner@kernel.org>
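
To make the three ambiguous -1 cases from the quoted patch text
concrete, here is a rough sketch of a wait helper that logs which one
occurred. wait_and_report() is a hypothetical name, not the selftest's
wait_for_pid(), and the log messages are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: wait for pid and return its exit status, or -1
 * after logging why no exit status is available.
 */
static int wait_and_report(pid_t pid)
{
	int status;
	pid_t ret;

	assert(pid > 0);

	/* Retry if a signal handler interrupted the wait. */
	do {
		ret = waitpid(pid, &status, 0);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0) {
		fprintf(stderr, "waitpid(%d) failed: %s\n",
			(int)pid, strerror(errno));
		return -1;
	}
	if (WIFSIGNALED(status)) {
		fprintf(stderr, "child %d killed by signal %d\n",
			(int)pid, WTERMSIG(status));
		return -1;
	}
	if (!WIFEXITED(status)) {
		fprintf(stderr, "child %d did not exit (status 0x%x)\n",
			(int)pid, (unsigned int)status);
		return -1;
	}
	return WEXITSTATUS(status);
}
```

Note that WEXITSTATUS() only yields the low 8 bits, so a child that
calls _exit(-1) is reported as 255 here rather than -1.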


* Re: [RFC PATCH v2] rseq: Remove broken uapi field layout on 32-bit little endian
  2022-01-27 15:27  4% ` [RFC PATCH v2] " Mathieu Desnoyers
@ 2022-01-28  8:52  0%   ` Christian Brauner
  0 siblings, 0 replies; 200+ results
From: Christian Brauner @ 2022-01-28  8:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Shuah Khan,
	linux-kselftest, Florian Weimer, Andy Lutomirski, Dave Watson,
	Andrew Morton, Russell King, Andi Kleen, Christian Brauner,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

On Thu, Jan 27, 2022 at 10:27:20AM -0500, Mathieu Desnoyers wrote:
> The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
> entirely wrong on 32-bit little endian: a preprocessor logic mistake
> wrongly uses the big endian field layout on 32-bit little endian
> architectures.
> 
> Fortunately, those ptr32 accessors were never used within the kernel,
> and only meant as a convenience for user-space.
> 
> Remove those and replace the whole rseq_cs union by a __u64 type, as
> this is the only thing really needed to express the ABI. Document how
> 32-bit architectures are meant to interact with this field.
> 
> Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-api@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Dave Watson <davejwatson@fb.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: "H . Peter Anvin" <hpa@zytor.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Ben Maurer <bmaurer@fb.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> ---

Looks way cleaner now! Fwiw,
Acked-by: Christian Brauner <brauner@kernel.org>

>  include/uapi/linux/rseq.h | 20 ++++----------------
>  kernel/rseq.c             |  8 ++++----
>  2 files changed, 8 insertions(+), 20 deletions(-)
> 
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index 9a402fdb60e9..77ee207623a9 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -105,23 +105,11 @@ struct rseq {
>  	 * Read and set by the kernel. Set by user-space with single-copy
>  	 * atomicity semantics. This field should only be updated by the
>  	 * thread which registered this data structure. Aligned on 64-bit.
> +	 *
> +	 * 32-bit architectures should update the low order bits of the
> +	 * rseq_cs field, leaving the high order bits initialized to 0.
>  	 */
> -	union {
> -		__u64 ptr64;
> -#ifdef __LP64__
> -		__u64 ptr;
> -#else
> -		struct {
> -#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN)
> -			__u32 padding;		/* Initialized to zero. */
> -			__u32 ptr32;
> -#else /* LITTLE */
> -			__u32 ptr32;
> -			__u32 padding;		/* Initialized to zero. */
> -#endif /* ENDIAN */
> -		} ptr;
> -#endif
> -	} rseq_cs;
> +	__u64 rseq_cs;
>  
>  	/*
>  	 * Restartable sequences flags field.
> diff --git a/kernel/rseq.c b/kernel/rseq.c
> index 6d45ac3dae7f..97ac20b4f738 100644
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -128,10 +128,10 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
>  	int ret;
>  
>  #ifdef CONFIG_64BIT
> -	if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
> +	if (get_user(ptr, &t->rseq->rseq_cs))
>  		return -EFAULT;
>  #else
> -	if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
> +	if (copy_from_user(&ptr, &t->rseq->rseq_cs, sizeof(ptr)))
>  		return -EFAULT;
>  #endif
>  	if (!ptr) {
> @@ -217,9 +217,9 @@ static int clear_rseq_cs(struct task_struct *t)
>  	 * Set rseq_cs to NULL.
>  	 */
>  #ifdef CONFIG_64BIT
> -	return put_user(0UL, &t->rseq->rseq_cs.ptr64);
> +	return put_user(0UL, &t->rseq->rseq_cs);
>  #else
> -	if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
> +	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
>  		return -EFAULT;
>  	return 0;
>  #endif
> -- 
> 2.17.1
> 


* Re: [PATCH v3] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-27  5:29  0% ` Kees Cook
@ 2022-01-27 16:51  0%   ` Eric W. Biederman
  0 siblings, 0 replies; 200+ results
From: Eric W. Biederman @ 2022-01-27 16:51 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, Andrew Morton, linux-kernel, linux-fsdevel,
	Alexander Viro, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, linux-mm, stable

Kees Cook <keescook@chromium.org> writes:

> On Thu, Jan 27, 2022 at 12:07:24AM +0000, Ariadne Conill wrote:
>> In several other operating systems, it is a hard requirement that the
>> second argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[0]:
>> 
>>     The argument arg0 should point to a filename string that is
>>     associated with the process being started by one of the exec
>>     functions.
>> 
>> To ensure that execve(2) with argc < 1 is not a useful tool for
>> shellcode to use, we can validate this in do_execveat_common() and
>> fail for this scenario, effectively blocking successful exploitation
>> of CVE-2021-4034 and similar bugs which depend on execve(2) working
>> with argc < 1.
>> 
>> We use -EINVAL for this case, mirroring recent changes to FreeBSD and
>> OpenBSD.  -EINVAL is also used by QNX for this, while Solaris uses
>> -EFAULT.
>> 
>> In earlier versions of the patch, it was proposed that we create a
>> fake argv for applications to use when argc < 1, but it was concluded
>> that it would be better to just fail the execve(2) in these cases, as
>> launching a process with an empty or NULL argv[0] was likely to just
>> cause more problems.
>
> Let's do it and see what breaks. :)
>
> I do see at least tools/testing/selftests/exec/recursion-depth.c will
> need a fix. And maybe testcases/kernel/syscalls/execveat/execveat.h
> in LTP.
>
> Acked-by: Kees Cook <keescook@chromium.org>

Yes, since it appears that only tests will break.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Especially since you are signing up to help fix the tests.


>> 
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use[2]
>> of this bug in a shellcode, we can reconsider.
>> 
>> This issue is being tracked in the KSPP issue tracker[3].
>> 
>> There are a few[4][5] minor edge cases (primarily in test suites) that
>> are caught by this, but we plan to work with the projects to fix those
>> edge cases.
>> 
>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>> [2]: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
>> [3]: https://github.com/KSPP/linux/issues/176
>> [4]: https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
>> [5]: https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
>> 
>> Changes from v2:
>> - Switch to using -EINVAL as the error code for this.
>> - Use pr_warn_once() to warn when an execve(2) is rejected due to NULL
>>   argv.
>> 
>> Changes from v1:
>> - Rework commit message significantly.
>> - Make the argv[0] check explicit rather than hijacking the error-check
>>   for count().
>> 
>> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
>> To: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Rich Felker <dalias@libc.org>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: linux-mm@kvack.org
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>> ---
>>  fs/exec.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 79f2c9483302..982730cfe3b8 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1897,6 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	}
>>  
>>  	retval = count(argv, MAX_ARG_STRINGS);
>> +	if (retval == 0) {
>> +		pr_warn_once("Attempted to run process '%s' with NULL argv\n", bprm->filename);
>> +		retval = -EINVAL;
>> +	}
>>  	if (retval < 0)
>>  		goto out_free;
>>  	bprm->argc = retval;
>> -- 
>> 2.34.1
>> 


* [RFC PATCH v2] rseq: Remove broken uapi field layout on 32-bit little endian
  @ 2022-01-27 15:27  4% ` Mathieu Desnoyers
  2022-01-28  8:52  0%   ` Christian Brauner
  0 siblings, 1 reply; 200+ results
From: Mathieu Desnoyers @ 2022-01-27 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Thomas Gleixner, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Shuah Khan,
	linux-kselftest, Mathieu Desnoyers, Florian Weimer,
	Andy Lutomirski, Dave Watson, Andrew Morton, Russell King,
	Andi Kleen, Christian Brauner, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
entirely wrong on 32-bit little endian: a preprocessor logic mistake
wrongly uses the big endian field layout on 32-bit little endian
architectures.

Fortunately, those ptr32 accessors were never used within the kernel,
and only meant as a convenience for user-space.

Remove those and replace the whole rseq_cs union by a __u64 type, as
this is the only thing really needed to express the ABI. Document how
32-bit architectures are meant to interact with this field.
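
As a rough user-space illustration of the documented convention (not
part of the patch): a pointer is widened to 64 bits through uintptr_t,
so on a 32-bit architecture the high-order bits end up zero regardless
of endianness. struct rseq_sketch and the two helpers are hypothetical
stand-ins for the real struct rseq. The sketch only shows the value
layout; a real 32-bit implementation would presumably store just the
low-order 32 bits in a single access to keep the single-copy atomicity
the header comment asks for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one field of struct rseq the patch changes. */
struct rseq_sketch {
	uint64_t rseq_cs;
};

/* Store a critical-section descriptor address in the 64-bit field. */
static void rseq_sketch_set_cs(struct rseq_sketch *r, const void *cs)
{
	assert(r != NULL);
	/* uintptr_t is pointer-sized; widening to 64 bits zero-extends. */
	r->rseq_cs = (uint64_t)(uintptr_t)cs;
}

/* Read the address back as a native pointer. */
static void *rseq_sketch_get_cs(const struct rseq_sketch *r)
{
	assert(r != NULL);
	return (void *)(uintptr_t)r->rseq_cs;
}
```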

Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
---
 include/uapi/linux/rseq.h | 20 ++++----------------
 kernel/rseq.c             |  8 ++++----
 2 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..77ee207623a9 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -105,23 +105,11 @@ struct rseq {
 	 * Read and set by the kernel. Set by user-space with single-copy
 	 * atomicity semantics. This field should only be updated by the
 	 * thread which registered this data structure. Aligned on 64-bit.
+	 *
+	 * 32-bit architectures should update the low order bits of the
+	 * rseq_cs field, leaving the high order bits initialized to 0.
 	 */
-	union {
-		__u64 ptr64;
-#ifdef __LP64__
-		__u64 ptr;
-#else
-		struct {
-#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN)
-			__u32 padding;		/* Initialized to zero. */
-			__u32 ptr32;
-#else /* LITTLE */
-			__u32 ptr32;
-			__u32 padding;		/* Initialized to zero. */
-#endif /* ENDIAN */
-		} ptr;
-#endif
-	} rseq_cs;
+	__u64 rseq_cs;
 
 	/*
 	 * Restartable sequences flags field.
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 6d45ac3dae7f..97ac20b4f738 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -128,10 +128,10 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
 	int ret;
 
 #ifdef CONFIG_64BIT
-	if (get_user(ptr, &t->rseq->rseq_cs.ptr64))
+	if (get_user(ptr, &t->rseq->rseq_cs))
 		return -EFAULT;
 #else
-	if (copy_from_user(&ptr, &t->rseq->rseq_cs.ptr64, sizeof(ptr)))
+	if (copy_from_user(&ptr, &t->rseq->rseq_cs, sizeof(ptr)))
 		return -EFAULT;
 #endif
 	if (!ptr) {
@@ -217,9 +217,9 @@ static int clear_rseq_cs(struct task_struct *t)
 	 * Set rseq_cs to NULL.
 	 */
 #ifdef CONFIG_64BIT
-	return put_user(0UL, &t->rseq->rseq_cs.ptr64);
+	return put_user(0UL, &t->rseq->rseq_cs);
 #else
-	if (clear_user(&t->rseq->rseq_cs.ptr64, sizeof(t->rseq->rseq_cs.ptr64)))
+	if (clear_user(&t->rseq->rseq_cs, sizeof(t->rseq->rseq_cs)))
 		return -EFAULT;
 	return 0;
 #endif
-- 
2.17.1



* Re: [PATCH v3] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-27  0:07  5% [PATCH v3] fs/exec: require argv[0] presence in do_execveat_common() Ariadne Conill
@ 2022-01-27  5:29  0% ` Kees Cook
  2022-01-27 16:51  0%   ` Eric W. Biederman
  0 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-01-27  5:29 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, Eric Biederman,
	Alexander Viro, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, linux-mm, stable

On Thu, Jan 27, 2022 at 12:07:24AM +0000, Ariadne Conill wrote:
> In several other operating systems, it is a hard requirement that the
> second argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[0]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> 
> To ensure that execve(2) with argc < 1 is not a useful tool for
> shellcode to use, we can validate this in do_execveat_common() and
> fail for this scenario, effectively blocking successful exploitation
> of CVE-2021-4034 and similar bugs which depend on execve(2) working
> with argc < 1.
> 
> We use -EINVAL for this case, mirroring recent changes to FreeBSD and
> OpenBSD.  -EINVAL is also used by QNX for this, while Solaris uses
> -EFAULT.
> 
> In earlier versions of the patch, it was proposed that we create a
> fake argv for applications to use when argc < 1, but it was concluded
> that it would be better to just fail the execve(2) in these cases, as
> launching a process with an empty or NULL argv[0] was likely to just
> cause more problems.

Let's do it and see what breaks. :)

I do see at least tools/testing/selftests/exec/recursion-depth.c will
need a fix. And maybe testcases/kernel/syscalls/execveat/execveat.h
in LTP.

Acked-by: Kees Cook <keescook@chromium.org>

> 
> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[2]
> of this bug in a shellcode, we can reconsider.
> 
> This issue is being tracked in the KSPP issue tracker[3].
> 
> There are a few[4][5] minor edge cases (primarily in test suites) that
> are caught by this, but we plan to work with the projects to fix those
> edge cases.
> 
> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [2]: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [3]: https://github.com/KSPP/linux/issues/176
> [4]: https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> [5]: https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0
> 
> Changes from v2:
> - Switch to using -EINVAL as the error code for this.
> - Use pr_warn_once() to warn when an execve(2) is rejected due to NULL
>   argv.
> 
> Changes from v1:
> - Rework commit message significantly.
> - Make the argv[0] check explicit rather than hijacking the error-check
>   for count().
> 
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-mm@kvack.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
> ---
>  fs/exec.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..982730cfe3b8 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1897,6 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	}
>  
>  	retval = count(argv, MAX_ARG_STRINGS);
> +	if (retval == 0) {
> +		pr_warn_once("Attempted to run process '%s' with NULL argv\n", bprm->filename);
> +		retval = -EINVAL;
> +	}
>  	if (retval < 0)
>  		goto out_free;
>  	bprm->argc = retval;
> -- 
> 2.34.1
> 

-- 
Kees Cook


* [PATCH v3] fs/exec: require argv[0] presence in do_execveat_common()
@ 2022-01-27  0:07  5% Ariadne Conill
  2022-01-27  5:29  0% ` Kees Cook
  0 siblings, 1 reply; 200+ results
From: Ariadne Conill @ 2022-01-27  0:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Kees Cook,
	Alexander Viro, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, linux-mm, stable

In several other operating systems, it is a hard requirement that the
second argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[0]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.

To ensure that execve(2) with argc < 1 is not a useful tool for
shellcode to use, we can validate this in do_execveat_common() and
fail for this scenario, effectively blocking successful exploitation
of CVE-2021-4034 and similar bugs which depend on execve(2) working
with argc < 1.

We use -EINVAL for this case, mirroring recent changes to FreeBSD and
OpenBSD.  -EINVAL is also used by QNX for this, while Solaris uses
-EFAULT.

In earlier versions of the patch, it was proposed that we create a
fake argv for applications to use when argc < 1, but it was concluded
that it would be better to just fail the execve(2) in these cases, as
launching a process with an empty or NULL argv[0] was likely to just
cause more problems.

Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[2]
of this bug in a shellcode, we can reconsider.

This issue is being tracked in the KSPP issue tracker[3].

There are a few[4][5] minor edge cases (primarily in test suites) that
are caught by this, but we plan to work with the projects to fix those
edge cases.

[0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
[2]: https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[3]: https://github.com/KSPP/linux/issues/176
[4]: https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
[5]: https://codesearch.debian.net/search?q=execlp%3F%5Cs*%5C%28%5B%5E%2C%5D%2B%2C%5Cs*NULL&literal=0

Changes from v2:
- Switch to using -EINVAL as the error code for this.
- Use pr_warn_once() to warn when an execve(2) is rejected due to NULL
  argv.

Changes from v1:
- Rework commit message significantly.
- Make the argv[0] check explicit rather than hijacking the error-check
  for count().

Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
---
 fs/exec.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..982730cfe3b8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1897,6 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0) {
+		pr_warn_once("Attempted to run process '%s' with NULL argv\n", bprm->filename);
+		retval = -EINVAL;
+	}
 	if (retval < 0)
 		goto out_free;
 	bprm->argc = retval;
-- 
2.34.1



* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 20:56  0%     ` Kees Cook
@ 2022-01-26 21:13  0%       ` Ariadne Conill
  0 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 21:13 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Alexander Viro

Hi,

On Wed, 26 Jan 2022, Kees Cook wrote:

> On Wed, Jan 26, 2022 at 02:23:59PM -0600, Ariadne Conill wrote:
>> Hi,
>>
>> On Wed, 26 Jan 2022, Kees Cook wrote:
>>
>>> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>>>> In several other operating systems, it is a hard requirement that the
>>>> first argument to execve(2) be the name of a program, thus prohibiting
>>>> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
>>>> but it is not an explicit requirement[0]:
>>>>
>>>>     The argument arg0 should point to a filename string that is
>>>>     associated with the process being started by one of the exec
>>>>     functions.
>>>>
>>>> To ensure that execve(2) with argc < 1 is not a useful gadget for
>>>> shellcode to use, we can validate this in do_execveat_common() and
>>>> fail for this scenario, effectively blocking successful exploitation
>>>> of CVE-2021-4034 and similar bugs which depend on this gadget.
>>>>
>>>> The use of -EFAULT for this case is similar to other systems, such
>>>> as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
>>>>
>>>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>>>> but there was no consensus to support fixing this issue then.
>>>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>>>> of this bug in a shellcode, we can reconsider.
>>>>
>>>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>>>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>>>>
>>>> Changes from v1:
>>>> - Rework commit message significantly.
>>>> - Make the argv[0] check explicit rather than hijacking the error-check
>>>>   for count().
>>>>
>>>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>>>> ---
>>>>  fs/exec.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/fs/exec.c b/fs/exec.c
>>>> index 79f2c9483302..e52c41991aab 100644
>>>> --- a/fs/exec.c
>>>> +++ b/fs/exec.c
>>>> @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>>>  	retval = count(argv, MAX_ARG_STRINGS);
>>>>  	if (retval < 0)
>>>>  		goto out_free;
>>>> +	if (retval == 0) {
>>>> +		retval = -EFAULT;
>>>> +		goto out_free;
>>>> +	}
>>>>  	bprm->argc = retval;
>>>>
>>>>  	retval = count(envp, MAX_ARG_STRINGS);
>>>> --
>>>> 2.34.1
>>>
>>> Okay, so, the dangerous condition is userspace iterating through envp
>>> when it thinks it's iterating argv.
>>>
>>> Assuming it is not okay to break valgrind's test suite:
>>> https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22
>>> we cannot reject a NULL argv (test will fail), and we cannot mutate
>>> argc=0 into argc=1 (test will enter infinite loop).
>>>
>>> Perhaps we need to reject argv=NULL when envp!=NULL, and add a
>>> pr_warn_once() about using a NULL argv?
>>
>> Sure, I can rework the patch to do it for only the envp != NULL case.
>>
>> I think we should combine it with the {NULL, NULL} padding patch in this
>> case though, since it appears to work, that way the execve(..., NULL, NULL)
>> case gets some protection.
>
> I don't think the padding will actually work correctly, for the reason
> Jann pointed out. My testing shows that suddenly my envp becomes NULL,
> but libc is just counting argc to find envp to pass into main.
>
>>> I note that glibc already warns about NULL argv:
>>> argc0.c:7:3: warning: null argument where non-null required (argument 2)
>>> [-Wnonnull]
>>>    7 |   execve(argv[0], NULL, envp);
>>>      |   ^~~~~~
>>>
>>> in the future we could expand this to only looking at argv=NULL?
>>
>> I don't think musl's headers generate a diagnostic for this, but main(0,
>> {NULL}) is not a supported use-case at least as far as Alpine is concerned.
>> I am sure it is the same with the other musl distributions.
>>
>> Will send a v3 patch with this logic change and move to EINVAL shortly.
>
> I took a spin too. Refuses execve(..., NULL, !NULL), injects "" argv[0]
> for execve(..., NULL, NULL):
>
>
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..0565089d5f9e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1917,9 +1917,40 @@ static int do_execveat_common(int fd, struct filename *filename,
> 	if (retval < 0)
> 		goto out_free;
>
> -	retval = copy_strings(bprm->argc, argv, bprm);
> -	if (retval < 0)
> -		goto out_free;
> +	if (likely(bprm->argc > 0)) {
> +		retval = copy_strings(bprm->argc, argv, bprm);
> +		if (retval < 0)
> +			goto out_free;
> +	} else {
> +		const char * const argv0 = "";
> +
> +		/*
> +		 * Start making some noise about the argc == 0 case that
> +		 * POSIX doesn't like and other Unix-like systems refuse.
> +		 */
> +		pr_warn_once("process '%s' used a NULL argv\n", bprm->filename);
> +
> +		/*
> +		 * Refuse to execute when argc == 0 and envc > 0, since this
> +		 * can lead to userspace iterating envp if it fails to check
> +		 * for argc == 0.
> +		 *
> +		 * i.e. continue to allow: execve(path, NULL, NULL);
> +		 */
> +		if (bprm->envc > 0) {
> +			retval = -EINVAL;
> +			goto out_free;
> +		}
> +
> +		/*
> +		 * Force an argv of {"", NULL} if argc == 0 so that broken
> +		 * userspace that assumes argc != 0 will not be surprised.
> +		 */
> +		bprm->argc = 1;
> +		retval = copy_strings_kernel(bprm->argc, &argv0, bprm);
> +		if (retval < 0)
> +			goto out_free;
> +	}
>
> 	retval = bprm_execve(bprm, fd, filename, flags);
> out_free:

Looks good to me, but I wonder if we shouldn't set an argv of 
{bprm->filename, NULL} instead of {"", NULL}.  Discussion in IRC led to 
the realization that multicall programs will try to use argv[0] and might 
crash in this scenario.  If we're going to fake an argv, I guess we should 
try to do it right.
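
To illustrate the multicall concern, here is a sketch of how such a
program might pick its applet name defensively. applet_name() and its
fallback parameter are hypothetical; no real multicall binary is being
quoted here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the name a multicall binary would
 * dispatch on, or fallback when argv[0] is missing or empty.
 */
static const char *applet_name(int argc, char *const argv[],
			       const char *fallback)
{
	assert(fallback != NULL);

	if (argc < 1 || argv == NULL || argv[0] == NULL || argv[0][0] == '\0')
		return fallback;

	/* Dispatch on the last path component of argv[0]. */
	const char *slash = strrchr(argv[0], '/');
	return slash ? slash + 1 : argv[0];
}
```

A program that skips the first check and calls strrchr(argv[0], '/')
directly would dereference NULL when started with an empty argv, and
would dispatch on an empty name if handed a faked "".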

Ariadne


* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 20:23  0%   ` Ariadne Conill
@ 2022-01-26 20:56  0%     ` Kees Cook
  2022-01-26 21:13  0%       ` Ariadne Conill
  0 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-01-26 20:56 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Alexander Viro

On Wed, Jan 26, 2022 at 02:23:59PM -0600, Ariadne Conill wrote:
> Hi,
> 
> On Wed, 26 Jan 2022, Kees Cook wrote:
> 
> > On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> > > In several other operating systems, it is a hard requirement that the
> > > first argument to execve(2) be the name of a program, thus prohibiting
> > > a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
> > > but it is not an explicit requirement[0]:
> > > 
> > >     The argument arg0 should point to a filename string that is
> > >     associated with the process being started by one of the exec
> > >     functions.
> > > 
> > > To ensure that execve(2) with argc < 1 is not a useful gadget for
> > > shellcode to use, we can validate this in do_execveat_common() and
> > > fail for this scenario, effectively blocking successful exploitation
> > > of CVE-2021-4034 and similar bugs which depend on this gadget.
> > > 
> > > The use of -EFAULT for this case is similar to other systems, such
> > > as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
> > > 
> > > Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> > > but there was no consensus to support fixing this issue then.
> > > Hopefully now that CVE-2021-4034 shows practical exploitative use
> > > of this bug in a shellcode, we can reconsider.
> > > 
> > > [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> > > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> > > 
> > > Changes from v1:
> > > - Rework commit message significantly.
> > > - Make the argv[0] check explicit rather than hijacking the error-check
> > >   for count().
> > > 
> > > Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
> > > ---
> > >  fs/exec.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/fs/exec.c b/fs/exec.c
> > > index 79f2c9483302..e52c41991aab 100644
> > > --- a/fs/exec.c
> > > +++ b/fs/exec.c
> > > @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
> > >  	retval = count(argv, MAX_ARG_STRINGS);
> > >  	if (retval < 0)
> > >  		goto out_free;
> > > +	if (retval == 0) {
> > > +		retval = -EFAULT;
> > > +		goto out_free;
> > > +	}
> > >  	bprm->argc = retval;
> > > 
> > >  	retval = count(envp, MAX_ARG_STRINGS);
> > > --
> > > 2.34.1
> > 
> > Okay, so, the dangerous condition is userspace iterating through envp
> > when it thinks it's iterating argv.
> > 
> > Assuming it is not okay to break valgrind's test suite:
> > https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22
> > we cannot reject a NULL argv (test will fail), and we cannot mutate
> > argc=0 into argc=1 (test will enter infinite loop).
> > 
> > Perhaps we need to reject argv=NULL when envp!=NULL, and add a
> > pr_warn_once() about using a NULL argv?
> 
> Sure, I can rework the patch to do it for only the envp != NULL case.
> 
> I think we should combine it with the {NULL, NULL} padding patch in this
> case though, since it appears to work, that way the execve(..., NULL, NULL)
> case gets some protection.

I don't think the padding will actually work correctly, for the reason
Jann pointed out. My testing shows that my envp suddenly becomes NULL,
since libc just steps past argc entries to find the envp it passes into
main.

> > I note that glibc already warns about NULL argv:
> > argc0.c:7:3: warning: null argument where non-null required (argument 2)
> > [-Wnonnull]
> >    7 |   execve(argv[0], NULL, envp);
> >      |   ^~~~~~
> > 
> > In the future we could expand this to look only at argv=NULL?
> 
> I don't think musl's headers generate a diagnostic for this, but main(0,
> {NULL}) is not a supported use-case at least as far as Alpine is concerned.
> I am sure it is the same with the other musl distributions.
> 
> Will send a v3 patch with this logic change and move to EINVAL shortly.

I took a spin at this too. The patch below refuses
execve(..., NULL, !NULL) and injects an empty-string argv[0] for
execve(..., NULL, NULL):


diff --git a/fs/exec.c b/fs/exec.c
index a098c133d8d7..0565089d5f9e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1917,9 +1917,40 @@ static int do_execveat_common(int fd, struct filename *filename,
 	if (retval < 0)
 		goto out_free;
 
-	retval = copy_strings(bprm->argc, argv, bprm);
-	if (retval < 0)
-		goto out_free;
+	if (likely(bprm->argc > 0)) {
+		retval = copy_strings(bprm->argc, argv, bprm);
+		if (retval < 0)
+			goto out_free;
+	} else {
+		const char * const argv0 = "";
+
+		/*
+		 * Start making some noise about the argc == 0 case that
+		 * POSIX doesn't like and other Unix-like systems refuse.
+		 */
+		pr_warn_once("process '%s' used a NULL argv\n", bprm->filename);
+
+		/*
+		 * Refuse to execute when argc == 0 and envc > 0, since this
+		 * can lead to userspace iterating envp if it fails to check
+		 * for argc == 0.
+		 *
+		 * i.e. continue to allow: execve(path, NULL, NULL);
+		 */
+		if (bprm->envc > 0) {
+			retval = -EINVAL;
+			goto out_free;
+		}
+
+		/*
+		 * Force an argv of {"", NULL} if argc == 0 so that broken
+		 * userspace that assumes argc != 0 will not be surprised.
+		 */
+		bprm->argc = 1;
+		retval = copy_strings_kernel(bprm->argc, &argv0, bprm);
+		if (retval < 0)
+			goto out_free;
+	}
 
 	retval = bprm_execve(bprm, fd, filename, flags);
 out_free:


$ cat argc0.c
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[], char *envp[])
{
        if (argv[0][0] != '\0') {
                printf("execve(argv[0], NULL, envp);\n");
                execve(argv[0], NULL, envp);
                perror("execve");
                printf("execve(argv[0], NULL, NULL);\n");
                execve(argv[0], NULL, NULL);
                return 0;
        }
        printf("argc=%d\n", argc);
        printf("argv[0]%p=%s\n", (void *)&argv[0], argv[0]);
        printf("argv[1]%p=%s\n", (void *)&argv[1], argv[1]);
        printf("envp[0]%p=%s\n", (void *)&envp[0], envp[0]);
        return 0;
}

$ gcc -Wall argc0.c -o argc0
argc0.c: In function 'main':
argc0.c:8:3: warning: null argument where non-null required (argument 2) [-Wnonnull]
    8 |   execve(argv[0], NULL, envp);
      |   ^~~~~~
argc0.c:11:3: warning: null argument where non-null required (argument 2) [-Wnonnull]
   11 |   execve(argv[0], NULL, NULL);
      |   ^~~~~~

$ ./argc0
execve(argv[0], NULL, envp);
execve: Invalid argument
execve(argv[0], NULL, NULL);
argc=1
argv[0]0x7fff1f577bd8=
argv[1]0x7fff1f577be0=(null)
envp[0]0x7fff1f577be8=(null)

$ dmesg | tail -n1
[   20.748467] process './argc0' used a NULL argv


-- 
Kees Cook

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 17:57  5% [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0 Kees Cook
  2022-01-26 18:07  0% ` Jann Horn
  2022-01-26 20:10  0% ` Ariadne Conill
@ 2022-01-26 20:52  0% ` Rich Felker
  2 siblings, 0 replies; 200+ results
From: Rich Felker @ 2022-01-26 20:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 09:57:47AM -0800, Kees Cook wrote:
> Quoting Ariadne Conill:
> 
> "In several other operating systems, it is a hard requirement that the
> first argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[1]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> of this bug in a shellcode, we can reconsider."
> 
> An examination of existing[4] users of execve(..., NULL, NULL) shows
> mostly test code, or example rootkit code. While rejecting a NULL argv
> would be preferred, it looks like the main cause of userspace confusion
> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> when iterating. To protect against userspace bugs of this nature, insert
> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
> 
> Note that this is only done in the argc == 0 case because some userspace
> programs expect to find envp at exactly argv[argc]. The overlap of these
> two misguided assumptions is believed to be zero.
> 
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [3] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [4] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
> 
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  fs/binfmt_elf.c | 10 +++++++++-
>  fs/exec.c       |  7 ++++++-
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 605017eb9349..e456c48658ad 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -297,7 +297,8 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
>  	ei_index = elf_info - (elf_addr_t *)mm->saved_auxv;
>  	sp = STACK_ADD(p, ei_index);
>  
> -	items = (argc + 1) + (envc + 1) + 1;
> +	/* Make room for extra pointer when argc == 0. See below. */
> +	items = (max(argc, 1) + 1) + (envc + 1) + 1;
>  	bprm->p = STACK_ROUND(sp, items);
>  
>  	/* Point sp at the lowest address on the stack */
> @@ -326,6 +327,13 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
>  
>  	/* Populate list of argv pointers back to argv strings. */
>  	p = mm->arg_end = mm->arg_start;
> +	/*
> +	 * Include an extra NULL pointer in argv when argc == 0 so
> +	 * that argv[1] != envp[0] to keep userspace programs from
> +	 * mishandling argc == 0. See fs/exec.c bprm_stack_limits().
> +	 */
> +	if (argc == 0 && put_user(0, sp++))
> +		return -EFAULT;
>  	while (argc-- > 0) {
>  		size_t len;
>  		if (put_user((elf_addr_t)p, sp++))
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..0b36384e55b1 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -495,8 +495,13 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
>  	 * the stack. They aren't stored until much later when we can't
>  	 * signal to the parent that the child has run out of stack space.
>  	 * Instead, calculate it here so it's possible to fail gracefully.
> +	 *
> +	 * In the case of argc < 1, make sure there is a NULL pointer gap
> +	 * between argv and envp so that confused userspace programs which
> +	 * start processing from argv[1], thinking argc can never be 0,
> +	 * do not walk envp by accident. See fs/binfmt_elf.c.
>  	 */
> -	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
> +	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
>  	if (limit <= ptr_size)
>  		return -E2BIG;
>  	limit -= ptr_size;
> -- 
> 2.30.2
> 

This patch is not just wrong, but extremely dangerously wrong, to the
point that it may make all suid-root binaries exploitable (at least
dynamically linked ones).

The ELF entry point contract is that argv+argc+1==envp, and in fact
this is the "preferred" way of computing envp so as to avoid linear
search over argv. In musl's dynamic linker we do exactly that; I'm not
sure about glibc's. See:

https://git.musl-libc.org/cgit/musl/tree/ldso/dynlink.c?id=v1.2.2#n1740

If argv[argc+1] wrongly contains a null pointer, semantically, that
means the environment is empty and auxv starts at the next stack slot.
It's an exercise for the reader to populate the environment in a way
that this memory wrongly gets interpreted as a meaningful auxv. I'm
not sure this is possible, but I wouldn't automatically rule it out.
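
As a rough illustration of the contract described above (a sketch only,
not musl's actual code): the initial stack is modelled here as an array
of char * whose first slot holds argc, and the helper names
envp_from_initial_stack() and count_env() are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Model of the entry-point layout:
 *   sp[0]            argc (stored in a pointer-sized slot)
 *   sp[1..argc]      argv[0..argc-1]
 *   sp[argc+1]       NULL terminating argv
 *   sp[argc+2..]     envp entries, then NULL, then auxv
 * envp is found by arithmetic, without scanning argv.
 */
static char **envp_from_initial_stack(char **sp)
{
	assert(sp != NULL);

	uintptr_t argc = (uintptr_t)sp[0];
	char **argv = sp + 1;

	return argv + argc + 1;
}

/* Count environment entries up to the terminating NULL. */
static size_t count_env(char **envp)
{
	size_t n = 0;

	while (envp[n] != NULL)
		n++;
	return n;
}
```

With a conforming stack { 0, NULL, "PATH=/bin", NULL } this finds one
environment entry. With the padded stack { 0, NULL, NULL, "PATH=/bin",
NULL } the same arithmetic lands on the extra NULL, so the environment
looks empty and whatever follows would be taken for auxv.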

In short: YOU CANNOT CHANGE/BREAK CONTRACTS TO MITIGATE A VULN. Doing
so just makes new vulns in the programs that were correct before.

Silently replacing argc==0 with argc==1 and argv[0]=="" would be a
safe variant of this. I'm really in favor of just erroring out, though,
and *only doing so when the exec is a privilege boundary* (suid/etc.)
to minimize the chance of breaking software dependent on allowing
argc==0.

Rich

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 20:10  0% ` Ariadne Conill
@ 2022-01-26 20:46  0%   ` Ariadne Conill
  0 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 20:46 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: Kees Cook, Michael Kerrisk, Matthew Wilcox, Christian Brauner,
	Rich Felker, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

Hi,

On Wed, 26 Jan 2022, Ariadne Conill wrote:

> Hi,
>
> On Wed, 26 Jan 2022, Kees Cook wrote:
>
>> Quoting Ariadne Conill:
>> 
>> "In several other operating systems, it is a hard requirement that the
>> first argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[1]:
>>
>>    The argument arg0 should point to a filename string that is
>>    associated with the process being started by one of the exec
>>    functions.
>> ...
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
>> of this bug in a shellcode, we can reconsider."
>> 
>> An examination of existing[4] users of execve(..., NULL, NULL) shows
>> mostly test code, or example rootkit code. While rejecting a NULL argv
>> would be preferred, it looks like the main cause of userspace confusion
>> is an assumption that argc >= 1, and buggy programs may skip argv[0]
>> when iterating. To protect against userspace bugs of this nature, insert
>> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
>> 
>> Note that this is only done in the argc == 0 case because some userspace
>> programs expect to find envp at exactly argv[argc]. The overlap of these
>> two misguided assumptions is believed to be zero.
>> 
>> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [2] https://bugzilla.kernel.org/show_bug.cgi?id=8408
>> [3] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
>> [4] 
>> https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
>> 
>> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
>> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: Rich Felker <dalias@libc.org>
>> Cc: Eric Biederman <ebiederm@xmission.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: linux-fsdevel@vger.kernel.org
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Kees Cook <keescook@chromium.org>
>
> Tested-by: Ariadne Conill <ariadne@dereferenced.org>
>
> It seems to work, but I still think bailing early with -EINVAL is a more 
> reasonable position to take.  For example, the following code, when used with 
> BusyBox applets results in a segfault, as the multicall stub does not support 
> scenarios where argc < 1:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/syscall.h>
>
> int main(int argc, const char **argv) {
>        if (syscall(SYS_execve, "/bin/date", NULL, NULL) < 0)
>                perror("execve");
>        return 0;
> }
>

Further testing indicates that while things *mostly* work, it results in 
memory corruption in various tasks, for example, trying to build a new 
kernel hung, and the gcc process's name was a bunch of uninitialized 
memory.  So, I don't think { NULL, NULL } is a good way to go.

Ariadne

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 20:09  0% ` Kees Cook
@ 2022-01-26 20:23  0%   ` Ariadne Conill
  2022-01-26 20:56  0%     ` Kees Cook
  0 siblings, 1 reply; 200+ results
From: Ariadne Conill @ 2022-01-26 20:23 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Alexander Viro

Hi,

On Wed, 26 Jan 2022, Kees Cook wrote:

> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>> In several other operating systems, it is a hard requirement that the
>> first argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[0]:
>>
>>     The argument arg0 should point to a filename string that is
>>     associated with the process being started by one of the exec
>>     functions.
>>
>> To ensure that execve(2) with argc < 1 is not a useful gadget for
>> shellcode to use, we can validate this in do_execveat_common() and
>> fail for this scenario, effectively blocking successful exploitation
>> of CVE-2021-4034 and similar bugs which depend on this gadget.
>>
>> The use of -EFAULT for this case is similar to other systems, such
>> as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
>>
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>> of this bug in a shellcode, we can reconsider.
>>
>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>>
>> Changes from v1:
>> - Rework commit message significantly.
>> - Make the argv[0] check explicit rather than hijacking the error-check
>>   for count().
>>
>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>> ---
>>  fs/exec.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 79f2c9483302..e52c41991aab 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	retval = count(argv, MAX_ARG_STRINGS);
>>  	if (retval < 0)
>>  		goto out_free;
>> +	if (retval == 0) {
>> +		retval = -EFAULT;
>> +		goto out_free;
>> +	}
>>  	bprm->argc = retval;
>>
>>  	retval = count(envp, MAX_ARG_STRINGS);
>> --
>> 2.34.1
>
> Okay, so, the dangerous condition is userspace iterating through envp
> when it thinks it's iterating argv.
>
> Assuming it is not okay to break valgrind's test suite:
> https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22
> we cannot reject a NULL argv (test will fail), and we cannot mutate
> argc=0 into argc=1 (test will enter infinite loop).
>
> Perhaps we need to reject argv=NULL when envp!=NULL, and add a
> pr_warn_once() about using a NULL argv?

Sure, I can rework the patch to do it for only the envp != NULL case.

I think we should combine it with the {NULL, NULL} padding patch in this 
case though, since it appears to work, that way the execve(..., NULL, 
NULL) case gets some protection.

> I note that glibc already warns about NULL argv:
> argc0.c:7:3: warning: null argument where non-null required (argument 2)
> [-Wnonnull]
>    7 |   execve(argv[0], NULL, envp);
>      |   ^~~~~~
>
> In the future we could expand this to look only at argv=NULL?

I don't think musl's headers generate a diagnostic for this, but main(0, 
{NULL}) is not a supported use-case at least as far as Alpine is 
concerned.  I am sure it is the same with the other musl distributions.

Will send a v3 patch with this logic change and move to EINVAL shortly.

Ariadne

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 17:57  5% [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0 Kees Cook
  2022-01-26 18:07  0% ` Jann Horn
@ 2022-01-26 20:10  0% ` Ariadne Conill
  2022-01-26 20:46  0%   ` Ariadne Conill
  2022-01-26 20:52  0% ` Rich Felker
  2 siblings, 1 reply; 200+ results
From: Ariadne Conill @ 2022-01-26 20:10 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

Hi,

On Wed, 26 Jan 2022, Kees Cook wrote:

> Quoting Ariadne Conill:
>
> "In several other operating systems, it is a hard requirement that the
> first argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[1]:
>
>    The argument arg0 should point to a filename string that is
>    associated with the process being started by one of the exec
>    functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> of this bug in a shellcode, we can reconsider."
>
> An examination of existing[4] users of execve(..., NULL, NULL) shows
> mostly test code, or example rootkit code. While rejecting a NULL argv
> would be preferred, it looks like the main cause of userspace confusion
> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> when iterating. To protect against userspace bugs of this nature, insert
> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
>
> Note that this is only done in the argc == 0 case because some userspace
> programs expect to find envp at exactly argv[argc]. The overlap of these
> two misguided assumptions is believed to be zero.
>
> [1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=8408
> [3] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
> [4] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
>
> Reported-by: Ariadne Conill <ariadne@dereferenced.org>
> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Rich Felker <dalias@libc.org>
> Cc: Eric Biederman <ebiederm@xmission.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>

Tested-by: Ariadne Conill <ariadne@dereferenced.org>

It seems to work, but I still think bailing early with -EINVAL is a more 
reasonable position to take.  For example, the following code, when used 
with BusyBox applets results in a segfault, as the multicall stub does not 
support scenarios where argc < 1:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, const char **argv) {
         if (syscall(SYS_execve, "/bin/date", NULL, NULL) < 0)
                 perror("execve");
         return 0;
}
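
To illustrate why a multicall binary is sensitive to this, here is a
minimal sketch of an applet lookup that tolerates a missing argv[0].
It is not BusyBox's code; applet_name_from_argv() is a hypothetical
helper, and the only assumption is that the applet is selected from the
basename of argv[0].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: derive the applet name from argv[0].
 * Returns NULL when argv or argv[0] is missing (the argc < 1 case),
 * instead of dereferencing a null pointer.
 */
static const char *applet_name_from_argv(char *const argv[])
{
	const char *slash;

	if (argv == NULL || argv[0] == NULL)
		return NULL;

	/* Use the part after the last '/', or the whole string. */
	slash = strrchr(argv[0], '/');
	return slash != NULL ? slash + 1 : argv[0];
}
```

A stub without the NULL check would pass argv[0] straight to strrchr()
and crash when started by the test program above.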

Ariadne

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 11:44  4% [PATCH v2] " Ariadne Conill
  2022-01-26 14:40  0% ` Matthew Wilcox
  2022-01-26 14:59  0% ` Matthew Wilcox
@ 2022-01-26 20:09  0% ` Kees Cook
  2022-01-26 20:23  0%   ` Ariadne Conill
  2 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-01-26 20:09 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Alexander Viro

On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> In several other operating systems, it is a hard requirement that the
> first argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[0]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> 
> To ensure that execve(2) with argc < 1 is not a useful gadget for
> shellcode to use, we can validate this in do_execveat_common() and
> fail for this scenario, effectively blocking successful exploitation
> of CVE-2021-4034 and similar bugs which depend on this gadget.
> 
> The use of -EFAULT for this case is similar to other systems, such
> as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
> 
> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use
> of this bug in a shellcode, we can reconsider.
> 
> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> 
> Changes from v1:
> - Rework commit message significantly.
> - Make the argv[0] check explicit rather than hijacking the error-check
>   for count().
> 
> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
> ---
>  fs/exec.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..e52c41991aab 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	retval = count(argv, MAX_ARG_STRINGS);
>  	if (retval < 0)
>  		goto out_free;
> +	if (retval == 0) {
> +		retval = -EFAULT;
> +		goto out_free;
> +	}
>  	bprm->argc = retval;
>  
>  	retval = count(envp, MAX_ARG_STRINGS);
> -- 
> 2.34.1

Okay, so, the dangerous condition is userspace iterating through envp
when it thinks it's iterating argv.
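
A simplified sketch of that failure mode, for illustration only: the
function names are hypothetical, the option loop is a stand-in for real
argument parsing, and argv/envp are modelled as one contiguous vector
as they are on the initial stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Naive pattern: start at index 1 and assume argc >= 1.  With
 * argc == 0 the loop body never runs, n stays 1, and argv[1] is the
 * slot after argv's terminating NULL, i.e. envp[0].
 */
static const char *naive_first_operand(int argc, char **argv)
{
	int n;

	for (n = 1; n < argc; n++) {
		if (argv[n][0] != '-')
			break;	/* first non-option argument */
	}
	return argv[n];
}

/* Same logic, but refusing to run when argv[0] is absent. */
static const char *checked_first_operand(int argc, char **argv)
{
	assert(argv != NULL);

	if (argc < 1)
		return NULL;
	return naive_first_operand(argc, argv);
}
```

Given the vector { NULL, "GCONV_PATH=.", NULL } and argc == 0, the
naive version returns the environment string as if it were an operand,
while the checked version returns NULL.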

Assuming it is not okay to break valgrind's test suite:
https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22
we cannot reject a NULL argv (test will fail), and we cannot mutate
argc=0 into argc=1 (test will enter infinite loop).

Perhaps we need to reject argv=NULL when envp!=NULL, and add a
pr_warn_once() about using a NULL argv?

I note that glibc already warns about NULL argv:
argc0.c:7:3: warning: null argument where non-null required (argument 2)
[-Wnonnull]
    7 |   execve(argv[0], NULL, envp);
      |   ^~~~~~

In the future we could expand this to look only at argv=NULL?

-- 
Kees Cook

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 19:58  0%       ` Kees Cook
@ 2022-01-26 20:08  0%         ` Matthew Wilcox
  0 siblings, 0 replies; 200+ results
From: Matthew Wilcox @ 2022-01-26 20:08 UTC (permalink / raw)
  To: Kees Cook
  Cc: Jann Horn, Ariadne Conill, Michael Kerrisk, Christian Brauner,
	Rich Felker, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 11:58:39AM -0800, Kees Cook wrote:
> On Wed, Jan 26, 2022 at 08:50:39PM +0100, Jann Horn wrote:
> > On Wed, Jan 26, 2022 at 7:42 PM Ariadne Conill <ariadne@dereferenced.org> wrote:
> > > On Wed, 26 Jan 2022, Jann Horn wrote:
> > > > On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
> > > >> Quoting Ariadne Conill:
> > > >>
> > > >> "In several other operating systems, it is a hard requirement that the
> > > >> first argument to execve(2) be the name of a program, thus prohibiting
> > > >> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> > > >> but it is not an explicit requirement[1]:
> > > >>
> > > >>     The argument arg0 should point to a filename string that is
> > > >>     associated with the process being started by one of the exec
> > > >>     functions.
> > > >> ...
> > > >> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> > > >> but there was no consensus to support fixing this issue then.
> > > >> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> > > >> of this bug in a shellcode, we can reconsider."
> > > >>
> > > >> An examination of existing[4] users of execve(..., NULL, NULL) shows
> > > >> mostly test code, or example rootkit code. While rejecting a NULL argv
> > > >> would be preferred, it looks like the main cause of userspace confusion
> > > >> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> > > >> when iterating. To protect against userspace bugs of this nature, insert
> > > >> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
> > > >>
> > > >> Note that this is only done in the argc == 0 case because some userspace
> > > >> programs expect to find envp at exactly argv[argc]. The overlap of these
> > > >> two misguided assumptions is believed to be zero.
> > > >
> > > > Will this result in the executed program being told that argc==0 but
> > > > having an extra NULL pointer on the stack?
> > > > If so, I believe this breaks the x86-64 ABI documented at
> > > > https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
> > > > figure 3.9 describes the layout of the initial process stack.
> > >
> > > I'm presently compiling a kernel with the patch to see if it works or not.
> > >
> > > > Actually, does this even work? Can a program still properly access its
> > > > environment variables when invoked with argc==0 with this patch
> > > > > applied? AFAIU the way userspace locates envp on x86-64 is by
> > > > calculating 8*(argc+1)?
> > >
> > > In the other thread, it was suggested that perhaps we should set up an
> > > argv of {"", NULL}.  In that case, it seems like it would be safe to claim
> > > argc == 1.
> > >
> > > What do you think?
> > 
> > Sounds good to me, since that's something that could also happen
> > normally if userspace calls execve(..., {"", NULL}, ...).
> > 
> > (I'd like it even better if we could just bail out with an error code,
> > but I guess the risk of breakage might be too high with that
> > approach?)
> 
> We can't mutate argc; it'll turn at least some userspace into an
> infinite loop:
> https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22

How does that become an infinite loop?  We obviously wouldn't mutate
argc in the caller, just the callee.

Also, there's a version of this where we only mutate argc if we're
executing a setuid program, which would remove the privilege
escalation part of things.

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 19:50  0%     ` Jann Horn
@ 2022-01-26 19:58  0%       ` Kees Cook
  2022-01-26 20:08  0%         ` Matthew Wilcox
  0 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-01-26 19:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 08:50:39PM +0100, Jann Horn wrote:
> On Wed, Jan 26, 2022 at 7:42 PM Ariadne Conill <ariadne@dereferenced.org> wrote:
> > On Wed, 26 Jan 2022, Jann Horn wrote:
> > > On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
> > >> Quoting Ariadne Conill:
> > >>
> > >> "In several other operating systems, it is a hard requirement that the
> > >> first argument to execve(2) be the name of a program, thus prohibiting
> > >> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> > >> but it is not an explicit requirement[1]:
> > >>
> > >>     The argument arg0 should point to a filename string that is
> > >>     associated with the process being started by one of the exec
> > >>     functions.
> > >> ...
> > >> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> > >> but there was no consensus to support fixing this issue then.
> > >> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> > >> of this bug in a shellcode, we can reconsider."
> > >>
> > >> An examination of existing[4] users of execve(..., NULL, NULL) shows
> > >> mostly test code, or example rootkit code. While rejecting a NULL argv
> > >> would be preferred, it looks like the main cause of userspace confusion
> > >> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> > >> when iterating. To protect against userspace bugs of this nature, insert
> > >> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
> > >>
> > >> Note that this is only done in the argc == 0 case because some userspace
> > >> programs expect to find envp at exactly argv[argc]. The overlap of these
> > >> two misguided assumptions is believed to be zero.
> > >
> > > Will this result in the executed program being told that argc==0 but
> > > having an extra NULL pointer on the stack?
> > > If so, I believe this breaks the x86-64 ABI documented at
> > > https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
> > > figure 3.9 describes the layout of the initial process stack.
> >
> > I'm presently compiling a kernel with the patch to see if it works or not.
> >
> > > Actually, does this even work? Can a program still properly access its
> > > environment variables when invoked with argc==0 with this patch
> > > > applied? AFAIU the way userspace locates envp on x86-64 is by
> > > calculating 8*(argc+1)?
> >
> > In the other thread, it was suggested that perhaps we should set up an
> > argv of {"", NULL}.  In that case, it seems like it would be safe to claim
> > argc == 1.
> >
> > What do you think?
> 
> Sounds good to me, since that's something that could also happen
> normally if userspace calls execve(..., {"", NULL}, ...).
> 
> (I'd like it even better if we could just bail out with an error code,
> but I guess the risk of breakage might be too high with that
> approach?)

We can't mutate argc; it'll turn at least some userspace into an
infinite loop:
https://sources.debian.org/src/valgrind/1:3.18.1-1/none/tests/execve.c/?hl=22#L22

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 18:07  0% ` Jann Horn
  2022-01-26 18:42  0%   ` Ariadne Conill
@ 2022-01-26 19:56  0%   ` Kees Cook
  1 sibling, 0 replies; 200+ results
From: Kees Cook @ 2022-01-26 19:56 UTC (permalink / raw)
  To: Jann Horn
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 07:07:20PM +0100, Jann Horn wrote:
> On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
> > Quoting Ariadne Conill:
> >
> > "In several other operating systems, it is a hard requirement that the
> > first argument to execve(2) be the name of a program, thus prohibiting
> > a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> > but it is not an explicit requirement[1]:
> >
> >     The argument arg0 should point to a filename string that is
> >     associated with the process being started by one of the exec
> >     functions.
> > ...
> > Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> > but there was no consensus to support fixing this issue then.
> > Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> > of this bug in a shellcode, we can reconsider."
> >
> > An examination of existing[4] users of execve(..., NULL, NULL) shows
> > mostly test code, or example rootkit code. While rejecting a NULL argv
> > would be preferred, it looks like the main cause of userspace confusion
> > is an assumption that argc >= 1, and buggy programs may skip argv[0]
> > when iterating. To protect against userspace bugs of this nature, insert
> > an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
> >
> > Note that this is only done in the argc == 0 case because some userspace
> > programs expect to find envp at exactly argv[argc]. The overlap of these
> > two misguided assumptions is believed to be zero.
> 
> Will this result in the executed program being told that argc==0 but
> having an extra NULL pointer on the stack?
> If so, I believe this breaks the x86-64 ABI documented at
> https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
> figure 3.9 describes the layout of the initial process stack.
> 
> Actually, does this even work? Can a program still properly access its
> environment variables when invoked with argc==0 with this patch
> applied? AFAIU the way userspace locates envv on x86-64 is by
> calculating 8*(argc+1)?

Hrm, yeah, I guess it's libc providing the envp pointer; it's not passed
separately. Hrm.

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 18:42  0%   ` Ariadne Conill
@ 2022-01-26 19:50  0%     ` Jann Horn
  2022-01-26 19:58  0%       ` Kees Cook
  0 siblings, 1 reply; 200+ results
From: Jann Horn @ 2022-01-26 19:50 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: Kees Cook, Michael Kerrisk, Matthew Wilcox, Christian Brauner,
	Rich Felker, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 7:42 PM Ariadne Conill <ariadne@dereferenced.org> wrote:
> On Wed, 26 Jan 2022, Jann Horn wrote:
> > On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
> >> Quoting Ariadne Conill:
> >>
> >> "In several other operating systems, it is a hard requirement that the
> >> first argument to execve(2) be the name of a program, thus prohibiting
> >> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> >> but it is not an explicit requirement[1]:
> >>
> >>     The argument arg0 should point to a filename string that is
> >>     associated with the process being started by one of the exec
> >>     functions.
> >> ...
> >> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> >> but there was no consensus to support fixing this issue then.
> >> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> >> of this bug in a shellcode, we can reconsider."
> >>
> >> An examination of existing[4] users of execve(..., NULL, NULL) shows
> >> mostly test code, or example rootkit code. While rejecting a NULL argv
> >> would be preferred, it looks like the main cause of userspace confusion
> >> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> >> when iterating. To protect against userspace bugs of this nature, insert
> >> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
> >>
> >> Note that this is only done in the argc == 0 case because some userspace
> >> programs expect to find envp at exactly argv[argc]. The overlap of these
> >> two misguided assumptions is believed to be zero.
> >
> > Will this result in the executed program being told that argc==0 but
> > having an extra NULL pointer on the stack?
> > If so, I believe this breaks the x86-64 ABI documented at
> > https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
> > figure 3.9 describes the layout of the initial process stack.
>
> I'm presently compiling a kernel with the patch to see if it works or not.
>
> > Actually, does this even work? Can a program still properly access its
> > environment variables when invoked with argc==0 with this patch
> > applied? AFAIU the way userspace locates envv on x86-64 is by
> > calculating 8*(argc+1)?
>
> In the other thread, it was suggested that perhaps we should set up an
> argv of {"", NULL}.  In that case, it seems like it would be safe to claim
> argc == 1.
>
> What do you think?

Sounds good to me, since that's something that could also happen
normally if userspace calls execve(..., {"", NULL}, ...).

(I'd like it even better if we could just bail out with an error code,
but I guess the risk of breakage might be too high with that
approach?)

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 18:07  0% ` Jann Horn
@ 2022-01-26 18:42  0%   ` Ariadne Conill
  2022-01-26 19:50  0%     ` Jann Horn
  2022-01-26 19:56  0%   ` Kees Cook
  1 sibling, 1 reply; 200+ results
From: Ariadne Conill @ 2022-01-26 18:42 UTC (permalink / raw)
  To: Jann Horn
  Cc: Kees Cook, Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

Hi,

On Wed, 26 Jan 2022, Jann Horn wrote:

> On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
>> Quoting Ariadne Conill:
>>
>> "In several other operating systems, it is a hard requirement that the
>> first argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[1]:
>>
>>     The argument arg0 should point to a filename string that is
>>     associated with the process being started by one of the exec
>>     functions.
>> ...
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
>> of this bug in a shellcode, we can reconsider."
>>
>> An examination of existing[4] users of execve(..., NULL, NULL) shows
>> mostly test code, or example rootkit code. While rejecting a NULL argv
>> would be preferred, it looks like the main cause of userspace confusion
>> is an assumption that argc >= 1, and buggy programs may skip argv[0]
>> when iterating. To protect against userspace bugs of this nature, insert
>> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
>>
>> Note that this is only done in the argc == 0 case because some userspace
>> programs expect to find envp at exactly argv[argc]. The overlap of these
>> two misguided assumptions is believed to be zero.
>
> Will this result in the executed program being told that argc==0 but
> having an extra NULL pointer on the stack?
> If so, I believe this breaks the x86-64 ABI documented at
> https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
> figure 3.9 describes the layout of the initial process stack.

I'm presently compiling a kernel with the patch to see if it works or not.

> Actually, does this even work? Can a program still properly access its
> environment variables when invoked with argc==0 with this patch
> applied? AFAIU the way userspace locates envv on x86-64 is by
> calculating 8*(argc+1)?

In the other thread, it was suggested that perhaps we should set up an 
argv of {"", NULL}.  In that case, it seems like it would be safe to claim 
argc == 1.

What do you think?

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 18:03  0%     ` Matthew Wilcox
@ 2022-01-26 18:38  0%       ` Ariadne Conill
  0 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 18:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Eric W. Biederman, Ariadne Conill, linux-kernel, linux-fsdevel,
	Kees Cook, Alexander Viro

Hi,

On Wed, 26 Jan 2022, Matthew Wilcox wrote:

> On Wed, Jan 26, 2022 at 10:57:29AM -0600, Eric W. Biederman wrote:
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>>> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>>>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>>>> but there was no consensus to support fixing this issue then.
>>>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>>>> of this bug in a shellcode, we can reconsider.
>>>>
>>>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>>>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>>>
>>> Having now read 8408 ... if ABI change is a concern (and I really doubt
>>> it is), we could treat calling execve() with a NULL argv as if the
>>> caller had passed an array of length 1 with the first element set to
>>> NULL.  Just like we reopen fds 0,1,2 for suid execs if they were
>>> closed.
>>
>> Where do we reopen fds 0,1,2 for suid execs?  I feel silly but I looked
>> through the code fs/exec.c quickly and I could not see it.
>
> I'm wondering if I misremembered and it's being done in ld.so
> rather than in the kernel?  That might be the right place to put
> this fix too.
>
>> I am attracted to the notion of converting an empty argv array passed
>> to the kernel into something we can safely pass to userspace.
>>
>> I think it would need to be having the first entry point to "" instead
>> of the first entry being NULL.  That would maintain the invariant that you
>> can always dereference a pointer in the argv array.
>
> Yes, I like that better than NULL.

If we are doing {"", NULL}, then I think it makes sense that we could just 
say argc == 1 at that point, which probably sidesteps the concern Jann 
raised with the {NULL, NULL} patch, no?

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
  2022-01-26 17:57  5% [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0 Kees Cook
@ 2022-01-26 18:07  0% ` Jann Horn
  2022-01-26 18:42  0%   ` Ariadne Conill
  2022-01-26 19:56  0%   ` Kees Cook
  2022-01-26 20:10  0% ` Ariadne Conill
  2022-01-26 20:52  0% ` Rich Felker
  2 siblings, 2 replies; 200+ results
From: Jann Horn @ 2022-01-26 18:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, Michael Kerrisk, Matthew Wilcox,
	Christian Brauner, Rich Felker, Eric Biederman, Alexander Viro,
	linux-fsdevel, stable, linux-kernel, linux-hardening

On Wed, Jan 26, 2022 at 6:58 PM Kees Cook <keescook@chromium.org> wrote:
> Quoting Ariadne Conill:
>
> "In several other operating systems, it is a hard requirement that the
> first argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[1]:
>
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> ...
> Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
> of this bug in a shellcode, we can reconsider."
>
> An examination of existing[4] users of execve(..., NULL, NULL) shows
> mostly test code, or example rootkit code. While rejecting a NULL argv
> would be preferred, it looks like the main cause of userspace confusion
> is an assumption that argc >= 1, and buggy programs may skip argv[0]
> when iterating. To protect against userspace bugs of this nature, insert
> an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].
>
> Note that this is only done in the argc == 0 case because some userspace
> programs expect to find envp at exactly argv[argc]. The overlap of these
> two misguided assumptions is believed to be zero.

Will this result in the executed program being told that argc==0 but
having an extra NULL pointer on the stack?
If so, I believe this breaks the x86-64 ABI documented at
https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf - page 29,
figure 3.9 describes the layout of the initial process stack.

Actually, does this even work? Can a program still properly access its
environment variables when invoked with argc==0 with this patch
applied? AFAIU the way userspace locates envv on x86-64 is by
calculating 8*(argc+1)?
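
A minimal sketch of the lookup in question, under stated assumptions: the
initial stack is modeled as a plain array of words laid out as
[argc][argv...][NULL][envp...][NULL], and the helper name envp_slot() is
hypothetical rather than taken from any libc. It illustrates why startup
code that skips argc + 1 words past argv would land on a padding NULL
instead of envp[0] if argc is still reported as 0:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical model of how startup code finds envp: the word at sp[0]
 * is argc, argv starts at sp[1], and envp is assumed to start right
 * after argv's NULL terminator, i.e. (argc + 1) words past argv.
 */
static const uintptr_t *envp_slot(const uintptr_t *sp)
{
	assert(sp != NULL);

	uintptr_t argc = sp[0];
	const uintptr_t *argv = sp + 1;

	/* Skip argc pointers plus the terminating NULL. */
	return argv + argc + 1;
}
```

With a layout of { 0, NULL, "bar=baz", NULL } the returned slot holds the
environment string; with one extra NULL inserted after argv and argc left
at 0, the same arithmetic returns a slot holding NULL, so the environment
would appear empty.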

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 16:57  0%   ` Eric W. Biederman
  2022-01-26 17:32  0%     ` Ariadne Conill
@ 2022-01-26 18:03  0%     ` Matthew Wilcox
  2022-01-26 18:38  0%       ` Ariadne Conill
  1 sibling, 1 reply; 200+ results
From: Matthew Wilcox @ 2022-01-26 18:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Kees Cook, Alexander Viro

On Wed, Jan 26, 2022 at 10:57:29AM -0600, Eric W. Biederman wrote:
> Matthew Wilcox <willy@infradead.org> writes:
> 
> > On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> >> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> >> but there was no consensus to support fixing this issue then.
> >> Hopefully now that CVE-2021-4034 shows practical exploitative use
> >> of this bug in a shellcode, we can reconsider.
> >> 
> >> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> >> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> >
> > Having now read 8408 ... if ABI change is a concern (and I really doubt
> > it is), we could treat calling execve() with a NULL argv as if the
> > caller had passed an array of length 1 with the first element set to
> > NULL.  Just like we reopen fds 0,1,2 for suid execs if they were
> > closed.
> 
> Where do we reopen fds 0,1,2 for suid execs?  I feel silly but I looked
> through the code fs/exec.c quickly and I could not see it.

I'm wondering if I misremembered and it's being done in ld.so
rather than in the kernel?  That might be the right place to put
this fix too.

> I am attracted to the notion of converting an empty argv array passed
> to the kernel into something we can safely pass to userspace.
> 
> I think it would need to be having the first entry point to "" instead
> of the first entry being NULL.  That would maintain the invariant that you
> can always dereference a pointer in the argv array.

Yes, I like that better than NULL.

^ permalink raw reply	[relevance 0%]

* [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0
@ 2022-01-26 17:57  5% Kees Cook
  2022-01-26 18:07  0% ` Jann Horn
                   ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Kees Cook @ 2022-01-26 17:57 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: Kees Cook, Michael Kerrisk, Matthew Wilcox, Christian Brauner,
	Rich Felker, Eric Biederman, Alexander Viro, linux-fsdevel,
	stable, linux-kernel, linux-hardening

Quoting Ariadne Conill:

"In several other operating systems, it is a hard requirement that the
first argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1. POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[1]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.
...
Interestingly, Michael Kerrisk opened an issue about this in 2008[2],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use[3]
of this bug in a shellcode, we can reconsider."

An examination of existing[4] users of execve(..., NULL, NULL) shows
mostly test code, or example rootkit code. While rejecting a NULL argv
would be preferred, it looks like the main cause of userspace confusion
is an assumption that argc >= 1, and buggy programs may skip argv[0]
when iterating. To protect against userspace bugs of this nature, insert
an extra NULL pointer in argv when argc == 0, so that argv[1] != envp[0].

Note that this is only done in the argc == 0 case because some userspace
programs expect to find envp at exactly argv[argc]. The overlap of these
two misguided assumptions is believed to be zero.
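
A small sketch of the slot arithmetic described above, assuming the intent
is to reserve at least one argv pointer slot even when argc == 0. The
function name initial_stack_items() is made up for illustration and is not
a kernel symbol:

```c
#include <stddef.h>

/*
 * Count the pointer-sized words needed on the initial stack:
 * argv pointers plus terminator, envp pointers plus terminator,
 * and one word for argc itself. When argc == 0, one extra argv
 * slot is reserved so that a padding NULL fits before envp.
 */
static size_t initial_stack_items(size_t argc, size_t envc)
{
	size_t argv_slots = (argc > 0 ? argc : 1) + 1;
	size_t envp_slots = envc + 1;

	return argv_slots + envp_slots + 1;
}
```

For argc == 0 and envc == 1 this gives 5 words: argc, the padding NULL,
the argv terminator, one envp pointer, and the envp terminator.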

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[2] https://bugzilla.kernel.org/show_bug.cgi?id=8408
[3] https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
[4] https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0

Reported-by: Ariadne Conill <ariadne@dereferenced.org>
Reported-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/binfmt_elf.c | 10 +++++++++-
 fs/exec.c       |  7 ++++++-
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 605017eb9349..e456c48658ad 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -297,7 +297,8 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 	ei_index = elf_info - (elf_addr_t *)mm->saved_auxv;
 	sp = STACK_ADD(p, ei_index);
 
-	items = (argc + 1) + (envc + 1) + 1;
+	/* Make room for extra pointer when argc == 0. See below. */
+	items = (max(argc, 1) + 1) + (envc + 1) + 1;
 	bprm->p = STACK_ROUND(sp, items);
 
 	/* Point sp at the lowest address on the stack */
@@ -326,6 +327,13 @@ create_elf_tables(struct linux_binprm *bprm, const struct elfhdr *exec,
 
 	/* Populate list of argv pointers back to argv strings. */
 	p = mm->arg_end = mm->arg_start;
+	/*
+	 * Include an extra NULL pointer in argv when argc == 0 so
+	 * that argv[1] != envp[0] to keep userspace programs from
+	 * mishandling argc == 0. See fs/exec.c bprm_stack_limits().
+	 */
+	if (argc == 0 && put_user(0, sp++))
+		return -EFAULT;
 	while (argc-- > 0) {
 		size_t len;
 		if (put_user((elf_addr_t)p, sp++))
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..0b36384e55b1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -495,8 +495,13 @@ static int bprm_stack_limits(struct linux_binprm *bprm)
 	 * the stack. They aren't stored until much later when we can't
 	 * signal to the parent that the child has run out of stack space.
 	 * Instead, calculate it here so it's possible to fail gracefully.
+	 *
+	 * In the case of argc < 1, make sure there is a NULL pointer gap
+	 * between argv and envp to ensure confused userspace programs don't
+	 * start processing from argv[1], thinking argc can never be 0,
+	 * to block them from walking envp by accident. See fs/binfmt_elf.c.
 	 */
-	ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
+	ptr_size = (max(bprm->argc, 1) + bprm->envc) * sizeof(void *);
 	if (limit <= ptr_size)
 		return -E2BIG;
 	limit -= ptr_size;
-- 
2.30.2


^ permalink raw reply related	[relevance 5%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 14:40  0% ` Matthew Wilcox
@ 2022-01-26 17:41  0%   ` Ariadne Conill
  0 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 17:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Kees Cook, Alexander Viro

Hi,

On Wed, 26 Jan 2022, Matthew Wilcox wrote:

> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>> In several other operating systems, it is a hard requirement that the
>> first argument to execve(2) be the name of a program, thus prohibiting
>> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
>> but it is not an explicit requirement[0]:
>>
>>     The argument arg0 should point to a filename string that is
>>     associated with the process being started by one of the exec
>>     functions.
>>
>> To ensure that execve(2) with argc < 1 is not a useful gadget for
>> shellcode to use, we can validate this in do_execveat_common() and
>> fail for this scenario, effectively blocking successful exploitation
>> of CVE-2021-4034 and similar bugs which depend on this gadget.
>>
>> The use of -EFAULT for this case is similar to other systems, such
>> as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
>>
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>> of this bug in a shellcode, we can reconsider.
>>
>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>>
>> Changes from v1:
>> - Rework commit message significantly.
>> - Make the argv[0] check explicit rather than hijacking the error-check
>>   for count().
>>
>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>> ---
>>  fs/exec.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 79f2c9483302..e52c41991aab 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	retval = count(argv, MAX_ARG_STRINGS);
>>  	if (retval < 0)
>>  		goto out_free;
>> +	if (retval == 0) {
>> +		retval = -EFAULT;
>> +		goto out_free;
>> +	}
>
> I don't object to the concept, but it's a more common pattern in Linux
> to do this:
>
> 	retval = count(argv, MAX_ARG_STRINGS);
> +	if (retval == 0)
> +		retval = -EFAULT;
> 	if (retval < 0)
> 		goto out_free;
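
The pattern suggested above can be tried outside the kernel with a
stand-alone sketch. The function name check_arg_count() is hypothetical,
and the early return stands in for the kernel's "goto out_free":

```c
#include <errno.h>

/*
 * Fold the "no arguments" case into the existing error path:
 * a count of zero is rewritten to -EFAULT, so a single
 * "retval < 0" test covers both failures.
 */
static int check_arg_count(int counted)
{
	int retval = counted;

	if (retval == 0)
		retval = -EFAULT;
	if (retval < 0)
		return retval;	/* the kernel code would "goto out_free" */

	return retval;
}
```

A positive count passes through unchanged, while zero and any negative
error code both leave through the same branch.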

Yeah, that seems fine.  I will revise the patch to do it that way if we
decide to stick with denial rather than constructing a "safe" argv
instead.

> (aka I like my bikesheds painted in Toasty Eggshell)

Toasty Eggshell is a nice color for a bikeshed :)

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 13:27  0% ` Rich Felker
  2022-01-26 14:46  0%   ` Christian Brauner
@ 2022-01-26 17:37  0%   ` Ariadne Conill
  1 sibling, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 17:37 UTC (permalink / raw)
  To: Rich Felker
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Kees Cook, Alexander Viro

Hi,

On Wed, 26 Jan 2022, Rich Felker wrote:

> On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
>> The first argument to argv when used with execv family of calls is
>> required to be the name of the program being executed, per POSIX.
>
> That's not quite the story. The relevant text is a "should", meaning
> that to be "strictly conforming" an application has to follow the
> convention, but still can't assume its invoker did. (Note that most
> programs do not aim to be "strictly conforming"; it's not just the
> word strictly applied as an adjective to conforming, but a definition
> of its own imposing very stringent portability conditions beyond what
> the standard already imposes.) Moreover, POSIX (following ISO C, after
> this was changed from early C drafts) rejected making it a
> requirement. This is documented in the RATIONALE for execve:
>
>    Early proposals required that the value of argc passed to main()
>    be "one or greater". This was driven by the same requirement in
>    drafts of the ISO C standard. In fact, historical implementations
>    have passed a value of zero when no arguments are supplied to the
>    caller of the exec functions. This requirement was removed from
>    the ISO C standard and subsequently removed from this volume of
>    POSIX.1-2017 as well. The wording, in particular the use of the
>    word should, requires a Strictly Conforming POSIX Application to
>    pass at least one argument to the exec function, thus guaranteeing
>    that argc be one or greater when invoked by such an application.
>    In fact, this is good practice, since many existing applications
>    reference argv[0] without first checking the value of argc.
>
> Source: https://pubs.opengroup.org/onlinepubs/9699919799/functions/execve.html
>
> Note that despite citing itself as POSIX.1-2017 above, this is not a
> change in the 2017 edition; it's just the way they self-cite. As far
> as I can tell, the change goes back to prior to the first publication
> of the standard.

This was clarified in the v2 commit text.

>> By validating this in do_execveat_common(), we can prevent execution
>> of shellcode which invokes execv(2) family syscalls with argc < 1,
>> a scenario which is disallowed by POSIX, thus providing a mitigation
>> against CVE-2021-4034 and similar bugs in the future.
>>
>> The use of -EFAULT for this case is similar to other systems, such
>> as FreeBSD and OpenBSD.
>
> I don't like this choice of error, since in principle EFAULT should
> never happen when you haven't invoked memory-safety-violating UB.
> Something like EINVAL would be more appropriate. But if the existing
> practice for systems that do this is to use EFAULT, it's probably best
> to do the same thing.

It turns out that OpenBSD uses -EINVAL for this, see 
https://github.com/openbsd/src/commit/74212563870067f5b1e271876e1ec5a2fdf2f2e0

>
>> Interestingly, Michael Kerrisk opened an issue about this in 2008,
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>> of this bug in a shellcode, we can reconsider.
>
> I'm not really opposed to attempting to change this with consensus
> (like, actually proposing it on the Austin Group tracker), but a less
> invasive change would be just enforcing it for the case where exec is
> a privilege boundary (suid/sgid/caps). There's really no motivation
> for changing longstanding standard behavior in a
> non-privilege-boundary case.

It would be nice for the Austin Group to clarify this, but I think this is
a "common sense" issue.  I don't think execve(2) with argc < 1 is
"standard behavior" either, as many other systems outside Linux fail
execve(2) when argc < 1.

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 16:57  0%   ` Eric W. Biederman
@ 2022-01-26 17:32  0%     ` Ariadne Conill
  2022-01-26 18:03  0%     ` Matthew Wilcox
  1 sibling, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 17:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matthew Wilcox, Ariadne Conill, linux-kernel, linux-fsdevel,
	Kees Cook, Alexander Viro

Hi,

On Wed, 26 Jan 2022, Eric W. Biederman wrote:

> Matthew Wilcox <willy@infradead.org> writes:
>
>> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>>> but there was no consensus to support fixing this issue then.
>>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>>> of this bug in a shellcode, we can reconsider.
>>>
>>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>>
>> Having now read 8408 ... if ABI change is a concern (and I really doubt
>> it is), we could treat calling execve() with a NULL argv as if the
>> caller had passed an array of length 1 with the first element set to
>> NULL.  Just like we reopen fds 0,1,2 for suid execs if they were
>> closed.
>
> Where do we reopen fds 0,1,2 for suid execs?  I feel silly but I looked
> through the code fs/exec.c quickly and I could not see it.
>
>
> I am attracted to the notion of converting an empty argv array passed
> to the kernel into something we can safely pass to userspace.
>
> I think it would need to be having the first entry point to "" instead
> of the first entry being NULL.  That would maintain the invariant that you
> can always dereference a pointer in the argv array.

Yes, I think this is correct, because there are a lot of programs out
there which will try to blindly read from argv[0], assuming it is present.
Ensuring we wind up with {"", NULL} would be the way I would want to
approach this if we go that route.

This approach would solve the problem with pkexec, but I still think there 
is some wisdom in denying with -EFAULT outright like other systems do.

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 14:59  0% ` Matthew Wilcox
  2022-01-26 16:40  0%   ` Kees Cook
@ 2022-01-26 16:57  0%   ` Eric W. Biederman
  2022-01-26 17:32  0%     ` Ariadne Conill
  2022-01-26 18:03  0%     ` Matthew Wilcox
  1 sibling, 2 replies; 200+ results
From: Eric W. Biederman @ 2022-01-26 16:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Kees Cook, Alexander Viro

Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
>> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>> of this bug in a shellcode, we can reconsider.
>> 
>> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
>> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
>
> Having now read 8408 ... if ABI change is a concern (and I really doubt
> it is), we could treat calling execve() with a NULL argv as if the
> caller had passed an array of length 1 with the first element set to
> NULL.  Just like we reopen fds 0,1,2 for suid execs if they were
> closed.

Where do we reopen fds 0,1,2 for suid execs?  I feel silly but I looked
through the code fs/exec.c quickly and I could not see it.


I am attracted to the notion of converting an empty argv array passed
to the kernel into something we can safely pass to userspace.

I think it would need to be having the first entry point to "" instead
of the first entry being NULL.  That would maintain the invariant that you
can always dereference a pointer in the argv array.
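
A userspace sketch of that idea, not the kernel change itself: the struct
arg_vector type and the normalize_argv() helper are invented for
illustration, and treating the result as argc == 1 is an assumption about
how such a substitution would be reported.

```c
#include <stddef.h>

/* Hypothetical pair describing the argument vector handed to a program. */
struct arg_vector {
	int argc;
	const char *const *argv;
};

/* Replacement vector: one empty string, then the NULL terminator. */
static const char *const empty_argv[] = { "", NULL };

/*
 * If the caller supplied no arguments, substitute {"", NULL} so that
 * argv[0] can always be dereferenced; otherwise pass the vector through.
 */
static struct arg_vector normalize_argv(int argc, const char *const *argv)
{
	struct arg_vector out = { argc, argv };

	if (argc == 0 || argv == NULL) {
		out.argc = 1;
		out.argv = empty_argv;
	}
	return out;
}
```

After normalization, reading argv[0][0] yields '\0' rather than faulting
on a NULL pointer.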

Eric





^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 14:59  0% ` Matthew Wilcox
@ 2022-01-26 16:40  0%   ` Kees Cook
  2022-01-26 16:57  0%   ` Eric W. Biederman
  1 sibling, 0 replies; 200+ results
From: Kees Cook @ 2022-01-26 16:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Alexander Viro

On Wed, Jan 26, 2022 at 02:59:52PM +0000, Matthew Wilcox wrote:
> On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> > Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> > but there was no consensus to support fixing this issue then.
> > Hopefully now that CVE-2021-4034 shows practical exploitative use
> > of this bug in a shellcode, we can reconsider.
> > 
> > [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> > [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> 
> Having now read 8408 ... if ABI change is a concern (and I really doubt
> it is), we could treat calling execve() with a NULL argv as if the
> caller had passed an array of length 1 with the first element set to
> NULL.  Just like we reopen fds 0,1,2 for suid execs if they were closed.

I was having similar thoughts this morning. We can't actually change the
argc, though, because of the various tests (see the debian code search
links) that explicitly test for argc == 0 in the child. But, the flaw
is not the count, but rather that argv == argp in the argc == 0 case.
(Or that argv NULL-checking iteration begins at argv[1].)

But that we could fix easily by just adding an extra NULL, e.g.:

Currently:

argc = 1
argv = "foo", NULL
envp = "bar=baz", ..., NULL

argc = 0
argv = NULL
envp = "bar=baz", ..., NULL

We could just make the argc = 0 case be:

argc = 0
argv = NULL, NULL
envp = "bar=baz", ..., NULL

We need to be careful with the stack utilization counts, though, so I'm
thinking we could actually make this completely unconditional and just
pad envp by 1 NULL on the user stack:

argv = "what", "ever", NULL
       NULL
envp = "bar=baz", ..., NULL

My only concern there is that there may be some code out there that
depends on envp immediately following the trailing argv NULL, so I think
my preference would be to pad only in the argc == 0 case and correctly
manage the stack utilization.
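
The layouts above can be modelled in userspace as a flat array of
pointers. This is only a sketch under the assumption that the argv and
envp pointers sit back to back; build_layout() and its parameters are
hypothetical names, and nothing here is taken from fs/exec.c:

```c
#include <stddef.h>

/*
 * Fill "slots" with argv pointers, a NULL, optionally one extra NULL
 * when argc == 0, then envp pointers and a final NULL.
 * The caller must provide room for argc + envc + 3 entries.
 * Returns the index at which envp starts.
 */
size_t build_layout(const char **slots, const char **argv, size_t argc,
		    const char **envp, size_t envc, int pad_empty_argv)
{
	size_t i, n = 0, env_start;

	for (i = 0; i < argc; i++)
		slots[n++] = argv[i];
	slots[n++] = NULL;		/* argv terminator */
	if (argc == 0 && pad_empty_argv)
		slots[n++] = NULL;	/* the proposed extra NULL */
	env_start = n;
	for (i = 0; i < envc; i++)
		slots[n++] = envp[i];
	slots[n++] = NULL;		/* envp terminator */
	return env_start;
}
```

Without the padding, slot 1 (what a program would read as argv[1]) is
the first environment string when argc == 0. With the padding, slot 1
is NULL and the environment starts at slot 2.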

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 11:44  4% [PATCH v2] " Ariadne Conill
  2022-01-26 14:40  0% ` Matthew Wilcox
@ 2022-01-26 14:59  0% ` Matthew Wilcox
  2022-01-26 16:40  0%   ` Kees Cook
  2022-01-26 16:57  0%   ` Eric W. Biederman
  2022-01-26 20:09  0% ` Kees Cook
  2 siblings, 2 replies; 200+ results
From: Matthew Wilcox @ 2022-01-26 14:59 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Kees Cook, Alexander Viro

On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use
> of this bug in a shellcode, we can reconsider.
> 
> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408

Having now read 8408 ... if ABI change is a concern (and I really doubt
it is), we could treat calling execve() with a NULL argv as if the
caller had passed an array of length 1 with the first element set to
NULL.  Just like we reopen fds 0,1,2 for suid execs if they were closed.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 13:27  0% ` Rich Felker
@ 2022-01-26 14:46  0%   ` Christian Brauner
  2022-01-26 17:37  0%   ` Ariadne Conill
  1 sibling, 0 replies; 200+ results
From: Christian Brauner @ 2022-01-26 14:46 UTC (permalink / raw)
  To: Rich Felker
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Kees Cook, Alexander Viro

On Wed, Jan 26, 2022 at 08:27:30AM -0500, Rich Felker wrote:
> On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
> > The first argument to argv when used with execv family of calls is
> > required to be the name of the program being executed, per POSIX.
> 
> That's not quite the story. The relevant text is a "should", meaning
> that to be "strictly conforming" an application has to follow the
> convention, but still can't assume its invoker did. (Note that most
> programs do not aim to be "strictly conforming"; it's not just the
> word strictly applied as an adjective to conforming, but a definition
> of its own imposing very stringent portability conditions beyond what
> the standard already imposes.) Moreover, POSIX (following ISO C, after
> this was changed from early C drafts) rejected making it a
> requirement. This is documented in the RATIONALE for execve:
> 
>     Early proposals required that the value of argc passed to main()
>     be "one or greater". This was driven by the same requirement in
>     drafts of the ISO C standard. In fact, historical implementations
>     have passed a value of zero when no arguments are supplied to the
>     caller of the exec functions. This requirement was removed from
>     the ISO C standard and subsequently removed from this volume of
>     POSIX.1-2017 as well. The wording, in particular the use of the
>     word should, requires a Strictly Conforming POSIX Application to
>     pass at least one argument to the exec function, thus guaranteeing
>     that argc be one or greater when invoked by such an application.
>     In fact, this is good practice, since many existing applications
>     reference argv[0] without first checking the value of argc.
> 
> Source: https://pubs.opengroup.org/onlinepubs/9699919799/functions/execve.html
> 
> Note that despite citing itself as POSIX.1-2017 above, this is not a
> change in the 2017 edition; it's just the way they self-cite. As far
> as I can tell, the change goes back to prior to the first publication
> of the standard.
> 
> > By validating this in do_execveat_common(), we can prevent execution
> > of shellcode which invokes execv(2) family syscalls with argc < 1,
> > a scenario which is disallowed by POSIX, thus providing a mitigation
> > against CVE-2021-4034 and similar bugs in the future.
> > 
> > The use of -EFAULT for this case is similar to other systems, such
> > as FreeBSD and OpenBSD.
> 
> I don't like this choice of error, since in principle EFAULT should
> never happen when you haven't invoked memory-safety-violating UB.
> Something like EINVAL would be more appropriate. But if the existing
> practice for systems that do this is to use EFAULT, it's probably best
> to do the same thing.
> 
> > Interestingly, Michael Kerrisk opened an issue about this in 2008,
> > but there was no consensus to support fixing this issue then.
> > Hopefully now that CVE-2021-4034 shows practical exploitative use
> > of this bug in a shellcode, we can reconsider.
> 
> I'm not really opposed to attempting to change this with consensus
> (like, actually proposing it on the Austin Group tracker), but a less
> invasive change would be just enforcing it for the case where exec is
> a privilege boundary (suid/sgid/caps). There's really no motivation
> for changing longstanding standard behavior in a
> non-privilege-boundary case.

Agreed. If we do this at all then this has way less regression potential.

^ permalink raw reply	[relevance 0%]

* Re: [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26 11:44  4% [PATCH v2] " Ariadne Conill
@ 2022-01-26 14:40  0% ` Matthew Wilcox
  2022-01-26 17:41  0%   ` Ariadne Conill
  2022-01-26 14:59  0% ` Matthew Wilcox
  2022-01-26 20:09  0% ` Kees Cook
  2 siblings, 1 reply; 200+ results
From: Matthew Wilcox @ 2022-01-26 14:40 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Kees Cook, Alexander Viro

On Wed, Jan 26, 2022 at 11:44:47AM +0000, Ariadne Conill wrote:
> In several other operating systems, it is a hard requirement that the
> first argument to execve(2) be the name of a program, thus prohibiting
> a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
> but it is not an explicit requirement[0]:
> 
>     The argument arg0 should point to a filename string that is
>     associated with the process being started by one of the exec
>     functions.
> 
> To ensure that execve(2) with argc < 1 is not a useful gadget for
> shellcode to use, we can validate this in do_execveat_common() and
> fail for this scenario, effectively blocking successful exploitation
> of CVE-2021-4034 and similar bugs which depend on this gadget.
> 
> The use of -EFAULT for this case is similar to other systems, such
> as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.
> 
> Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use
> of this bug in a shellcode, we can reconsider.
> 
> [0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
> [1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408
> 
> Changes from v1:
> - Rework commit message significantly.
> - Make the argv[0] check explicit rather than hijacking the error-check
>   for count().
> 
> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
> ---
>  fs/exec.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..e52c41991aab 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	retval = count(argv, MAX_ARG_STRINGS);
>  	if (retval < 0)
>  		goto out_free;
> +	if (retval == 0) {
> +		retval = -EFAULT;
> +		goto out_free;
> +	}

I don't object to the concept, but it's a more common pattern in Linux
to do this:

	retval = count(argv, MAX_ARG_STRINGS);
+	if (retval == 0)
+		retval = -EFAULT;
	if (retval < 0)
		goto out_free;

(aka I like my bikesheds painted in Toasty Eggshell)
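
For comparison, a standalone sketch of the two shapes. The function
names and the argc out-parameter are hypothetical; each function takes
the value count() would have returned and models only the control flow,
not the rest of do_execveat_common():

```c
#include <errno.h>

/* Shape of the v2 patch: one branch for errors, one for zero. */
int exec_check_explicit(int counted, int *argc)
{
	int retval = counted;

	if (retval < 0)
		goto out;
	if (retval == 0) {
		retval = -EFAULT;
		goto out;
	}
	*argc = retval;
	retval = 0;
out:
	return retval;
}

/* Folded shape: turn zero into an error, then share the one exit. */
int exec_check_folded(int counted, int *argc)
{
	int retval = counted;

	if (retval == 0)
		retval = -EFAULT;
	if (retval < 0)
		goto out;
	*argc = retval;
	retval = 0;
out:
	return retval;
}
```

Both return the same result for every input; the folded form just
reuses the existing error branch instead of adding a second one.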

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26  4:39  5% [PATCH] fs/exec: require argv[0] presence in do_execveat_common() Ariadne Conill
  2022-01-26  6:42  0% ` Kees Cook
@ 2022-01-26 13:27  0% ` Rich Felker
  2022-01-26 14:46  0%   ` Christian Brauner
  2022-01-26 17:37  0%   ` Ariadne Conill
  1 sibling, 2 replies; 200+ results
From: Rich Felker @ 2022-01-26 13:27 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Kees Cook, Alexander Viro

On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
> The first argument to argv when used with execv family of calls is
> required to be the name of the program being executed, per POSIX.

That's not quite the story. The relevant text is a "should", meaning
that to be "strictly conforming" an application has to follow the
convention, but still can't assume its invoker did. (Note that most
programs do not aim to be "strictly conforming"; it's not just the
word strictly applied as an adjective to conforming, but a definition
of its own imposing very stringent portability conditions beyond what
the standard already imposes.) Moreover, POSIX (following ISO C, after
this was changed from early C drafts) rejected making it a
requirement. This is documented in the RATIONALE for execve:

    Early proposals required that the value of argc passed to main()
    be "one or greater". This was driven by the same requirement in
    drafts of the ISO C standard. In fact, historical implementations
    have passed a value of zero when no arguments are supplied to the
    caller of the exec functions. This requirement was removed from
    the ISO C standard and subsequently removed from this volume of
    POSIX.1-2017 as well. The wording, in particular the use of the
    word should, requires a Strictly Conforming POSIX Application to
    pass at least one argument to the exec function, thus guaranteeing
    that argc be one or greater when invoked by such an application.
    In fact, this is good practice, since many existing applications
    reference argv[0] without first checking the value of argc.

Source: https://pubs.opengroup.org/onlinepubs/9699919799/functions/execve.html

Note that despite citing itself as POSIX.1-2017 above, this is not a
change in the 2017 edition; it's just the way they self-cite. As far
as I can tell, the change goes back to prior to the first publication
of the standard.

> By validating this in do_execveat_common(), we can prevent execution
> of shellcode which invokes execv(2) family syscalls with argc < 1,
> a scenario which is disallowed by POSIX, thus providing a mitigation
> against CVE-2021-4034 and similar bugs in the future.
> 
> The use of -EFAULT for this case is similar to other systems, such
> as FreeBSD and OpenBSD.

I don't like this choice of error, since in principle EFAULT should
never happen when you haven't invoked memory-safety-violating UB.
Something like EINVAL would be more appropriate. But if the existing
practice for systems that do this is to use EFAULT, it's probably best
to do the same thing.

> Interestingly, Michael Kerrisk opened an issue about this in 2008,
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use
> of this bug in a shellcode, we can reconsider.

I'm not really opposed to attempting to change this with consensus
(like, actually proposing it on the Austin Group tracker), but a less
invasive change would be just enforcing it for the case where exec is
a privilege boundary (suid/sgid/caps). There's really no motivation
for changing longstanding standard behavior in a
non-privilege-boundary case.
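
As a sketch of that narrower policy, assuming the caller already knows
whether the exec crosses a privilege boundary. validate_argc() and its
"privileged" flag are hypothetical names for illustration; how the
kernel would actually determine that flag is not shown here:

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Reject an empty argv only when the exec is a privilege boundary
 * (suid/sgid/caps); leave the longstanding behavior alone otherwise.
 * Returns 0 if the exec may proceed, or a negative errno value.
 */
int validate_argc(int argc, bool privileged)
{
	if (argc == 0 && privileged)
		return -EFAULT;
	return 0;
}
```

The error value follows the patch under discussion; -EINVAL would slot
in the same way.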

Rich

^ permalink raw reply	[relevance 0%]

* [PATCH v2] fs/exec: require argv[0] presence in do_execveat_common()
@ 2022-01-26 11:44  4% Ariadne Conill
  2022-01-26 14:40  0% ` Matthew Wilcox
                   ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 11:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, Eric Biederman, Kees Cook, Alexander Viro, Ariadne Conill

In several other operating systems, it is a hard requirement that the
first argument to execve(2) be the name of a program, thus prohibiting
a scenario where argc < 1.  POSIX 2017 also recommends this behaviour,
but it is not an explicit requirement[0]:

    The argument arg0 should point to a filename string that is
    associated with the process being started by one of the exec
    functions.

To ensure that execve(2) with argc < 1 is not a useful gadget for
shellcode to use, we can validate this in do_execveat_common() and
fail for this scenario, effectively blocking successful exploitation
of CVE-2021-4034 and similar bugs which depend on this gadget.

The use of -EFAULT for this case is similar to other systems, such
as FreeBSD, OpenBSD and Solaris.  QNX uses -EINVAL for this case.

Interestingly, Michael Kerrisk opened an issue about this in 2008[1],
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use
of this bug in a shellcode, we can reconsider.

[0]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/exec.html
[1]: https://bugzilla.kernel.org/show_bug.cgi?id=8408

Changes from v1:
- Rework commit message significantly.
- Make the argv[0] check explicit rather than hijacking the error-check
  for count().

Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
---
 fs/exec.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..e52c41991aab 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1899,6 +1899,10 @@ static int do_execveat_common(int fd, struct filename *filename,
 	retval = count(argv, MAX_ARG_STRINGS);
 	if (retval < 0)
 		goto out_free;
+	if (retval == 0) {
+		retval = -EFAULT;
+		goto out_free;
+	}
 	bprm->argc = retval;
 
 	retval = count(envp, MAX_ARG_STRINGS);
-- 
2.34.1


^ permalink raw reply related	[relevance 4%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26  7:28  0%   ` Kees Cook
@ 2022-01-26 11:18  0%     ` Ariadne Conill
  0 siblings, 0 replies; 200+ results
From: Ariadne Conill @ 2022-01-26 11:18 UTC (permalink / raw)
  To: Kees Cook
  Cc: Ariadne Conill, linux-kernel, linux-fsdevel, Eric Biederman,
	Alexander Viro

Hi,

On Tue, 25 Jan 2022, Kees Cook wrote:

>
>
> On January 25, 2022 10:42:41 PM PST, Kees Cook <keescook@chromium.org> wrote:
>> On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
>>> The first argument to argv when used with execv family of calls is
>>> required to be the name of the program being executed, per POSIX.
>>>
>>> By validating this in do_execveat_common(), we can prevent execution
>>> of shellcode which invokes execv(2) family syscalls with argc < 1,
>>> a scenario which is disallowed by POSIX, thus providing a mitigation
>>> against CVE-2021-4034 and similar bugs in the future.
>>>
>>> The use of -EFAULT for this case is similar to other systems, such
>>> as FreeBSD and OpenBSD.
>>>
>>> Interestingly, Michael Kerrisk opened an issue about this in 2008,
>
> For v2 please include a URL for this. I assume you mean this one?
> https://bugzilla.kernel.org/show_bug.cgi?id=8408

Yes, that's the one.  I honestly need to rewrite that commit message 
anyway.

>>> but there was no consensus to support fixing this issue then.
>>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>>> of this bug in a shellcode, we can reconsider.
>>>
>>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>>
>> Yup. Agreed. For context:
>> https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
>>
>>> ---
>>>  fs/exec.c | 4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/exec.c b/fs/exec.c
>>> index 79f2c9483302..de0b832473ed 100644
>>> --- a/fs/exec.c
>>> +++ b/fs/exec.c
>>> @@ -1897,8 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>>  	}
>>>
>>>  	retval = count(argv, MAX_ARG_STRINGS);
>>> -	if (retval < 0)
>>> +	if (retval < 1) {
>>> +		retval = -EFAULT;
>>>  		goto out_free;
>>> +	}
>
> Actually, no, this needs to be more carefully special-cased to avoid masking error returns from count(). (e.g. -E2BIG would vanish with this patch.)
>
> Perhaps just add:
>
> if (retval == 0) {
>        retval = -EFAULT;
>        goto out_free;
> }

Alright.  I will do that in v2.

>>
>> There shouldn't be anything legitimate actually doing this in userspace.
>
> I spoke too soon.
>
> Unfortunately, this is not the case:
> https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0
>
> Lots of stuff likes to do:
> execve(path, NULL, NULL);
>
> Do these things depend on argc==0 would be my next question...

I looked at these, and these seem to basically be lazily-written test 
cases which should be fixed.  I didn't see any example of real-world 
applications doing this.  As noted in some of the test cases, there are 
comments like "Solaris doesn't support this," etc.

So I think having this as a config option at the very least makes a lot of 
sense.  If users really need to run legacy code where execv() works with 
argc < 1, then they could just run a kernel that allows that nonsense, 
just like how Linux doesn't necessarily support the old a.out binary 
format today, unless it is enabled.

Ariadne

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26  6:42  0% ` Kees Cook
@ 2022-01-26  7:28  0%   ` Kees Cook
  2022-01-26 11:18  0%     ` Ariadne Conill
  0 siblings, 1 reply; 200+ results
From: Kees Cook @ 2022-01-26  7:28 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Alexander Viro



On January 25, 2022 10:42:41 PM PST, Kees Cook <keescook@chromium.org> wrote:
>On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
>> The first argument to argv when used with execv family of calls is
>> required to be the name of the program being executed, per POSIX.
>> 
>> By validating this in do_execveat_common(), we can prevent execution
>> of shellcode which invokes execv(2) family syscalls with argc < 1,
>> a scenario which is disallowed by POSIX, thus providing a mitigation
>> against CVE-2021-4034 and similar bugs in the future.
>> 
>> The use of -EFAULT for this case is similar to other systems, such
>> as FreeBSD and OpenBSD.
>> 
>> Interestingly, Michael Kerrisk opened an issue about this in 2008,

For v2 please include a URL for this. I assume you mean this one?
https://bugzilla.kernel.org/show_bug.cgi?id=8408

>> but there was no consensus to support fixing this issue then.
>> Hopefully now that CVE-2021-4034 shows practical exploitative use
>> of this bug in a shellcode, we can reconsider.
>> 
>> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
>
>Yup. Agreed. For context:
>https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt
>
>> ---
>>  fs/exec.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 79f2c9483302..de0b832473ed 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1897,8 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	}
>>  
>>  	retval = count(argv, MAX_ARG_STRINGS);
>> -	if (retval < 0)
>> +	if (retval < 1) {
>> +		retval = -EFAULT;
>>  		goto out_free;
>> +	}

Actually, no, this needs to be more carefully special-cased to avoid masking error returns from count(). (e.g. -E2BIG would vanish with this patch.)

Perhaps just add:

if (retval == 0) {
        retval = -EFAULT;
        goto out_free;
}
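
To make the masking concrete, here is a standalone sketch of both
checks. check_v1() and check_suggested() are hypothetical names; each
takes the value count() would have returned and gives back what the
caller would end up with:

```c
#include <errno.h>

/* v1 shape: anything below 1 becomes -EFAULT, hiding the real error. */
int check_v1(int retval)
{
	if (retval < 1)
		return -EFAULT;
	return retval;
}

/* Suggested shape: only the empty-argv case becomes -EFAULT. */
int check_suggested(int retval)
{
	if (retval == 0)
		return -EFAULT;
	if (retval < 0)
		return retval;	/* e.g. -E2BIG passes through untouched */
	return retval;
}
```

With an input of -E2BIG, check_v1() reports -EFAULT while
check_suggested() still reports -E2BIG.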

>
>There shouldn't be anything legitimate actually doing this in userspace.

I spoke too soon.

Unfortunately, this is not the case:
https://codesearch.debian.net/search?q=execve%5C+*%5C%28%5B%5E%2C%5D%2B%2C+*NULL&literal=0

Lots of stuff likes to do:
execve(path, NULL, NULL);

Do these things depend on argc==0 would be my next question...

>
>-Kees
>
>>  	bprm->argc = retval;
>>  
>>  	retval = count(envp, MAX_ARG_STRINGS);
>> -- 
>> 2.34.1
>> 
>

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* Re: [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
  2022-01-26  4:39  5% [PATCH] fs/exec: require argv[0] presence in do_execveat_common() Ariadne Conill
@ 2022-01-26  6:42  0% ` Kees Cook
  2022-01-26  7:28  0%   ` Kees Cook
  2022-01-26 13:27  0% ` Rich Felker
  1 sibling, 1 reply; 200+ results
From: Kees Cook @ 2022-01-26  6:42 UTC (permalink / raw)
  To: Ariadne Conill
  Cc: linux-kernel, linux-fsdevel, Eric Biederman, Alexander Viro

On Wed, Jan 26, 2022 at 04:39:47AM +0000, Ariadne Conill wrote:
> The first argument to argv when used with execv family of calls is
> required to be the name of the program being executed, per POSIX.
> 
> By validating this in do_execveat_common(), we can prevent execution
> of shellcode which invokes execv(2) family syscalls with argc < 1,
> a scenario which is disallowed by POSIX, thus providing a mitigation
> against CVE-2021-4034 and similar bugs in the future.
> 
> The use of -EFAULT for this case is similar to other systems, such
> as FreeBSD and OpenBSD.
> 
> Interestingly, Michael Kerrisk opened an issue about this in 2008,
> but there was no consensus to support fixing this issue then.
> Hopefully now that CVE-2021-4034 shows practical exploitative use
> of this bug in a shellcode, we can reconsider.
> 
> Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>

Yup. Agreed. For context:
https://www.qualys.com/2022/01/25/cve-2021-4034/pwnkit.txt

> ---
>  fs/exec.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 79f2c9483302..de0b832473ed 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1897,8 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	}
>  
>  	retval = count(argv, MAX_ARG_STRINGS);
> -	if (retval < 0)
> +	if (retval < 1) {
> +		retval = -EFAULT;
>  		goto out_free;
> +	}

There shouldn't be anything legitimate actually doing this in userspace.

-Kees

>  	bprm->argc = retval;
>  
>  	retval = count(envp, MAX_ARG_STRINGS);
> -- 
> 2.34.1
> 

-- 
Kees Cook

^ permalink raw reply	[relevance 0%]

* [PATCH] fs/exec: require argv[0] presence in do_execveat_common()
@ 2022-01-26  4:39  5% Ariadne Conill
  2022-01-26  6:42  0% ` Kees Cook
  2022-01-26 13:27  0% ` Rich Felker
  0 siblings, 2 replies; 200+ results
From: Ariadne Conill @ 2022-01-26  4:39 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, Eric Biederman, Kees Cook, Alexander Viro, Ariadne Conill

The first argument to argv when used with execv family of calls is
required to be the name of the program being executed, per POSIX.

By validating this in do_execveat_common(), we can prevent execution
of shellcode which invokes execv(2) family syscalls with argc < 1,
a scenario which is disallowed by POSIX, thus providing a mitigation
against CVE-2021-4034 and similar bugs in the future.

The use of -EFAULT for this case is similar to other systems, such
as FreeBSD and OpenBSD.

Interestingly, Michael Kerrisk opened an issue about this in 2008,
but there was no consensus to support fixing this issue then.
Hopefully now that CVE-2021-4034 shows practical exploitative use
of this bug in a shellcode, we can reconsider.

Signed-off-by: Ariadne Conill <ariadne@dereferenced.org>
---
 fs/exec.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..de0b832473ed 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1897,8 +1897,10 @@ static int do_execveat_common(int fd, struct filename *filename,
 	}
 
 	retval = count(argv, MAX_ARG_STRINGS);
-	if (retval < 0)
+	if (retval < 1) {
+		retval = -EFAULT;
 		goto out_free;
+	}
 	bprm->argc = retval;
 
 	retval = count(envp, MAX_ARG_STRINGS);
-- 
2.34.1


^ permalink raw reply related	[relevance 5%]

* Re: [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian
  2022-01-25 12:21  0%   ` Christian Brauner
@ 2022-01-25 14:41  0%     ` Mathieu Desnoyers
  0 siblings, 0 replies; 200+ results
From: Mathieu Desnoyers @ 2022-01-25 14:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, paulmck,
	Boqun Feng, H. Peter Anvin, Paul Turner, linux-api, shuah,
	linux-kselftest, Florian Weimer, Andy Lutomirski, Dave Watson,
	Andrew Morton, Russell King, Andi Kleen, Christian Brauner,
	Ben Maurer, rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

----- On Jan 25, 2022, at 7:21 AM, Christian Brauner brauner@kernel.org wrote:

> On Mon, Jan 24, 2022 at 12:12:40PM -0500, Mathieu Desnoyers wrote:
>> The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
>> entirely wrong on 32-bit little endian: a preprocessor logic mistake
>> wrongly uses the big endian field layout on 32-bit little endian
>> architectures.
>> 
>> Fortunately, those ptr32 accessors were never used within the kernel,
>> and only meant as a convenience for user-space.
>> 
>> Remove those and only leave the "ptr64" union field, as this is the only
>> thing really needed to express the ABI. Document how 32-bit
>> architectures are meant to interact with this "ptr64" union field.
>> 
>> Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update
>> includes")
>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> Cc: Florian Weimer <fw@deneb.enyo.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-api@vger.kernel.org
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Boqun Feng <boqun.feng@gmail.com>
>> Cc: Andy Lutomirski <luto@amacapital.net>
>> Cc: Dave Watson <davejwatson@fb.com>
>> Cc: Paul Turner <pjt@google.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Russell King <linux@arm.linux.org.uk>
>> Cc: "H . Peter Anvin" <hpa@zytor.com>
>> Cc: Andi Kleen <andi@firstfloor.org>
>> Cc: Christian Brauner <christian.brauner@ubuntu.com>
>> Cc: Ben Maurer <bmaurer@fb.com>
>> Cc: Steven Rostedt <rostedt@goodmis.org>
>> Cc: Josh Triplett <josh@joshtriplett.org>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Joel Fernandes <joelaf@google.com>
>> Cc: Paul E. McKenney <paulmck@kernel.org>
>> ---
>>  include/uapi/linux/rseq.h | 17 ++++-------------
>>  1 file changed, 4 insertions(+), 13 deletions(-)
>> 
>> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
>> index 9a402fdb60e9..31290f2424a7 100644
>> --- a/include/uapi/linux/rseq.h
>> +++ b/include/uapi/linux/rseq.h
>> @@ -105,22 +105,13 @@ struct rseq {
>>  	 * Read and set by the kernel. Set by user-space with single-copy
>>  	 * atomicity semantics. This field should only be updated by the
>>  	 * thread which registered this data structure. Aligned on 64-bit.
>> +	 *
>> +	 * 32-bit architectures should update the low order bits of the
>> +	 * rseq_cs.ptr64 field, leaving the high order bits initialized
>> +	 * to 0.
>>  	 */
>>  	union {
> 
> A bit unfortunate we seem to have to keep the union around even though
> it's just one field now.

Well, as far as the user-space projects that I know of which use rseq
are concerned (glibc, librseq, tcmalloc), those end up with their own
copy of the uapi header anyway to deal with the big/little endian field
on 32-bit. So I'm very much open to remove the union if we accept that
this uapi header is really just meant to express the ABI and is not
expected to be used as an API by user-space.

That would mean we also bring a uapi header copy into the kernel
rseq selftests as well to minimize the gap between librseq and
the kernel selftests (the kernel selftests pretty much include a
copy of librseq for convenience. librseq is maintained out of tree).
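
For reference, a minimal sketch of how user-space on a 32-bit
architecture could fill the 64-bit field. The union below is a local
stand-in, not the real uapi header, and set_rseq_cs()/get_rseq_cs() are
hypothetical helper names. The sketch also ignores the single-copy
atomicity requirement mentioned in the header comment:

```c
#include <stdint.h>

/* Local stand-in for the single-field union; not the uapi definition. */
union rseq_cs_field {
	uint64_t ptr64;
};

/* Store a critical-section descriptor address in the 64-bit field. */
void set_rseq_cs(union rseq_cs_field *field, const void *descriptor)
{
	/* Go through uintptr_t so a 32-bit pointer is zero-extended. */
	field->ptr64 = (uint64_t)(uintptr_t)descriptor;
}

/* Recover the pointer; the high-order bits are 0 on 32-bit targets. */
const void *get_rseq_cs(const union rseq_cs_field *field)
{
	return (const void *)(uintptr_t)field->ptr64;
}
```

The same code works unchanged on 64-bit, where the cast through
uintptr_t is a no-op.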

Thoughts ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[relevance 0%]

* Re: [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian
  2022-01-24 17:12  4% ` [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian Mathieu Desnoyers
@ 2022-01-25 12:21  0%   ` Christian Brauner
  2022-01-25 14:41  0%     ` Mathieu Desnoyers
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2022-01-25 12:21 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, Shuah Khan,
	linux-kselftest, Florian Weimer, Andy Lutomirski, Dave Watson,
	Andrew Morton, Russell King, Andi Kleen, Christian Brauner,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

On Mon, Jan 24, 2022 at 12:12:40PM -0500, Mathieu Desnoyers wrote:
> The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
> entirely wrong on 32-bit little endian: a preprocessor logic mistake
> wrongly uses the big endian field layout on 32-bit little endian
> architectures.
> 
> Fortunately, those ptr32 accessors were never used within the kernel,
> and only meant as a convenience for user-space.
> 
> Remove those and only leave the "ptr64" union field, as this is the only
> thing really needed to express the ABI. Document how 32-bit
> architectures are meant to interact with this "ptr64" union field.
> 
> Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-api@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Dave Watson <davejwatson@fb.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: "H . Peter Anvin" <hpa@zytor.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Ben Maurer <bmaurer@fb.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/uapi/linux/rseq.h | 17 ++++-------------
>  1 file changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
> index 9a402fdb60e9..31290f2424a7 100644
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -105,22 +105,13 @@ struct rseq {
>  	 * Read and set by the kernel. Set by user-space with single-copy
>  	 * atomicity semantics. This field should only be updated by the
>  	 * thread which registered this data structure. Aligned on 64-bit.
> +	 *
> +	 * 32-bit architectures should update the low order bits of the
> +	 * rseq_cs.ptr64 field, leaving the high order bits initialized
> +	 * to 0.
>  	 */
>  	union {

A bit unfortunate we seem to have to keep the union around even though
it's just one field now.


* [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian
  @ 2022-01-24 17:12  4% ` Mathieu Desnoyers
  2022-01-25 12:21  0%   ` Christian Brauner
  0 siblings, 1 reply; 200+ results
From: Mathieu Desnoyers @ 2022-01-24 17:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, Shuah Khan,
	linux-kselftest, Mathieu Desnoyers, Florian Weimer,
	Andy Lutomirski, Dave Watson, Andrew Morton, Russell King,
	Andi Kleen, Christian Brauner, Ben Maurer, Steven Rostedt,
	Josh Triplett, Linus Torvalds, Catalin Marinas, Will Deacon,
	Michael Kerrisk, Joel Fernandes

The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
entirely wrong on 32-bit little endian: a preprocessor logic mistake
wrongly uses the big endian field layout on 32-bit little endian
architectures.

Fortunately, those ptr32 accessors were never used within the kernel,
and only meant as a convenience for user-space.

Remove those and only leave the "ptr64" union field, as this is the only
thing really needed to express the ABI. Document how 32-bit
architectures are meant to interact with this "ptr64" union field.

Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
---
 include/uapi/linux/rseq.h | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..31290f2424a7 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -105,22 +105,13 @@ struct rseq {
 	 * Read and set by the kernel. Set by user-space with single-copy
 	 * atomicity semantics. This field should only be updated by the
 	 * thread which registered this data structure. Aligned on 64-bit.
+	 *
+	 * 32-bit architectures should update the low order bits of the
+	 * rseq_cs.ptr64 field, leaving the high order bits initialized
+	 * to 0.
 	 */
 	union {
 		__u64 ptr64;
-#ifdef __LP64__
-		__u64 ptr;
-#else
-		struct {
-#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN)
-			__u32 padding;		/* Initialized to zero. */
-			__u32 ptr32;
-#else /* LITTLE */
-			__u32 ptr32;
-			__u32 padding;		/* Initialized to zero. */
-#endif /* ENDIAN */
-		} ptr;
-#endif
 	} rseq_cs;
 
 	/*
-- 
2.17.1
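
For illustration only (not part of the patch): a minimal user-space
sketch of the convention documented in the new comment, where a 32-bit
target updates only the low-order bits of rseq_cs.ptr64 and leaves the
high-order bits at 0. The union rseq_cs_view, its word[] member and the
rseq_cs_* helpers are hypothetical names; only ptr64 mirrors the uapi
field. Taking the byte order from the GCC/Clang predefined
__BYTE_ORDER__ macro is likewise an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical user-space view of the field. Only ptr64 mirrors the uapi. */
union rseq_cs_view {
	uint64_t ptr64;		/* the only member left in the uapi union */
	uint32_t word[2];	/* local helper for 32-bit stores, not uapi */
};

/* Assumption: the compiler predefines __BYTE_ORDER__ (GCC and Clang do). */
#if !defined(__BYTE_ORDER__) || !defined(__ORDER_BIG_ENDIAN__)
#error "this sketch needs the compiler-provided __BYTE_ORDER__ macros"
#endif

/* Index of the 32-bit word that holds the low-order bits of ptr64. */
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define RSEQ_CS_LOW_WORD 1
#else
#define RSEQ_CS_LOW_WORD 0
#endif

/* Zero the whole field once, so the high-order bits start out as 0. */
static inline void rseq_cs_init(union rseq_cs_view *cs)
{
	cs->ptr64 = 0;
}

/* Update only the low-order 32 bits; the high-order word is not touched. */
static inline void rseq_cs_set_low32(union rseq_cs_view *cs, uint32_t addr)
{
	cs->word[RSEQ_CS_LOW_WORD] = addr;
}

/* Read the field back as the 64-bit value described by the ABI. */
static inline uint64_t rseq_cs_read(const union rseq_cs_view *cs)
{
	return cs->ptr64;
}
```

After rseq_cs_init(), a call such as rseq_cs_set_low32(&cs, addr)
leaves rseq_cs_read(&cs) equal to addr on either byte order, because
the helper picks the word index at compile time. Real code would more
likely do this store from the rseq assembly sequences; the C helper
only shows which half of the field is meant to change.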



* Re: [RFC PATCH] rseq: Fix broken uapi field layout on 32-bit little endian
  2022-01-23 19:31  4% [RFC PATCH] rseq: Fix broken uapi field layout on 32-bit little endian Mathieu Desnoyers
@ 2022-01-24  6:19  0% ` Greg KH
  0 siblings, 0 replies; 200+ results
From: Greg KH @ 2022-01-24  6:19 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, linux-kernel, Peter Zijlstra, Paul E . McKenney,
	Boqun Feng, H . Peter Anvin, Paul Turner, linux-api, stable,
	Florian Weimer, Andy Lutomirski, Dave Watson, Andrew Morton,
	Russell King, Andi Kleen, Christian Brauner, Ben Maurer,
	Steven Rostedt, Josh Triplett, Linus Torvalds, Catalin Marinas,
	Will Deacon, Michael Kerrisk, Joel Fernandes

On Sun, Jan 23, 2022 at 02:31:54PM -0500, Mathieu Desnoyers wrote:
> The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
> entirely wrong on 32-bit little endian: a preprocessor logic mistake
> wrongly uses the big endian field layout on 32-bit little endian
> architectures.
> 
> Fortunately, those ptr32 accessors were never used within the kernel,
> and only meant as a convenience for user-space.
> 
> While working on fixing the ppc32 support in librseq [1], I made sure
> all 32-bit little endian architectures stopped depending on little
> endian byte ordering by using the ptr32 field. It led me to discover
> this wrong ptr32 field ordering on little endian.
> 
> Because it is already exposed as a UAPI, all we can do for the existing
> fields is document the wrong behavior and encourage users to use
> alternative mechanisms.
> 
> Introduce a new rseq_cs.arch field with correct field ordering. Use this
> opportunity to improve the layout so accesses to architecture fields on
> both 32-bit and 64-bit architectures are done through the same field
> hierarchy, which is much nicer than the previous scheme.
> 
> The intended use is now:
> 
> * rseq_thread_area->rseq_cs.ptr64: Access the 64-bit value of the rseq_cs
> 				   pointer. Available on all
>                                    architectures (unchanged).
> 
> * rseq_thread_area->rseq_cs.arch.ptr: Access the architecture specific
> 				      layout of the rseq_cs pointer. This
> 				      is a 32-bit field on 32-bit
> 				      architectures, and a 64-bit field on
>                                       64-bit architectures.
> 
> Link: https://git.kernel.org/pub/scm/libs/librseq/librseq.git/ [1]
> Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Florian Weimer <fw@deneb.enyo.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-api@vger.kernel.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Dave Watson <davejwatson@fb.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: "H . Peter Anvin" <hpa@zytor.com>
> Cc: Andi Kleen <andi@firstfloor.org>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Ben Maurer <bmaurer@fb.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Joel Fernandes <joelaf@google.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> ---
>  include/uapi/linux/rseq.h | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>


* [RFC PATCH] rseq: Fix broken uapi field layout on 32-bit little endian
@ 2022-01-23 19:31  4% Mathieu Desnoyers
  2022-01-24  6:19  0% ` Greg KH
  0 siblings, 1 reply; 200+ results
From: Mathieu Desnoyers @ 2022-01-23 19:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Peter Zijlstra, Paul E . McKenney, Boqun Feng,
	H . Peter Anvin, Paul Turner, linux-api, stable,
	Mathieu Desnoyers, Florian Weimer, Andy Lutomirski, Dave Watson,
	Andrew Morton, Russell King, Andi Kleen, Christian Brauner,
	Ben Maurer, Steven Rostedt, Josh Triplett, Linus Torvalds,
	Catalin Marinas, Will Deacon, Michael Kerrisk, Joel Fernandes

The rseq rseq_cs.ptr.{ptr32,padding} uapi endianness handling is
entirely wrong on 32-bit little endian: a preprocessor logic mistake
wrongly uses the big endian field layout on 32-bit little endian
architectures.

Fortunately, those ptr32 accessors were never used within the kernel,
and only meant as a convenience for user-space.

While working on fixing the ppc32 support in librseq [1], I made sure
all 32-bit little endian architectures stopped depending on little
endian byte ordering by using the ptr32 field. It led me to discover
this wrong ptr32 field ordering on little endian.

Because it is already exposed as a UAPI, all we can do for the existing
fields is document the wrong behavior and encourage users to use
alternative mechanisms.

Introduce a new rseq_cs.arch field with correct field ordering. Use this
opportunity to improve the layout so accesses to architecture fields on
both 32-bit and 64-bit architectures are done through the same field
hierarchy, which is much nicer than the previous scheme.

The intended use is now:

* rseq_thread_area->rseq_cs.ptr64: Access the 64-bit value of the rseq_cs
				   pointer. Available on all
                                   architectures (unchanged).

* rseq_thread_area->rseq_cs.arch.ptr: Access the architecture specific
				      layout of the rseq_cs pointer. This
				      is a 32-bit field on 32-bit
				      architectures, and a 64-bit field on
                                      64-bit architectures.

Link: https://git.kernel.org/pub/scm/libs/librseq/librseq.git/ [1]
Fixes: ec9c82e03a74 ("rseq: uapi: Declare rseq_cs field as union, update includes")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Florian Weimer <fw@deneb.enyo.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-api@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Watson <davejwatson@fb.com>
Cc: Paul Turner <pjt@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Ben Maurer <bmaurer@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
---
 include/uapi/linux/rseq.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 9a402fdb60e9..68f61cdb45db 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -108,6 +108,12 @@ struct rseq {
 	 */
 	union {
 		__u64 ptr64;
+
+		/*
+		 * The "ptr" field layout is broken on little-endian
+		 * 32-bit architectures due to wrong preprocessor logic.
+		 * DO NOT USE.
+		 */
 #ifdef __LP64__
 		__u64 ptr;
 #else
@@ -121,6 +127,23 @@ struct rseq {
 #endif /* ENDIAN */
 		} ptr;
 #endif
+
+		/*
+		 * The "arch" field provides architecture accessor for
+		 * the ptr field based on architecture pointer size and
+		 * endianness.
+		 */
+		struct {
+#ifdef __LP64__
+			__u64 ptr;
+#elif defined(__BYTE_ORDER) ? (__BYTE_ORDER == __BIG_ENDIAN) : defined(__BIG_ENDIAN)
+			__u32 padding;		/* Initialized to zero. */
+			__u32 ptr;
+#else
+			__u32 ptr;
+			__u32 padding;		/* Initialized to zero. */
+#endif
+		} arch;
 	} rseq_cs;
 
 	/*
-- 
2.17.1
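
For illustration only (not part of the patch): a small user-space
sketch comparing the condition guarding the existing "ptr" field with
the one proposed for "arch". It assumes libc's <endian.h> has been
included, which (on glibc and musl) defines both __LITTLE_ENDIAN and
__BIG_ENDIAN as constants; that is one plausible way for the old
"|| defined(__BIG_ENDIAN)" test to misfire, not necessarily the exact
failure analysed above. union mock_rseq_cs and the helper functions
are hypothetical names, and the mock always uses the 32-bit flavour of
the layout so that it can be tried on any host.

```c
#include <endian.h>	/* assumption: defines __BYTE_ORDER and __BIG_ENDIAN */
#include <stdint.h>

/* Condition guarding the existing "ptr" field layout. */
#if (defined(__BYTE_ORDER) && (__BYTE_ORDER == __BIG_ENDIAN)) || defined(__BIG_ENDIAN)
#define OLD_COND_PICKS_BIG_ENDIAN 1
#else
#define OLD_COND_PICKS_BIG_ENDIAN 0
#endif

/* Condition guarding the proposed "arch" field layout. */
#if defined(__BYTE_ORDER) ? (__BYTE_ORDER == __BIG_ENDIAN) : defined(__BIG_ENDIAN)
#define NEW_COND_PICKS_BIG_ENDIAN 1
#else
#define NEW_COND_PICKS_BIG_ENDIAN 0
#endif

/* Hypothetical mock of the 32-bit flavour of rseq_cs. */
union mock_rseq_cs {
	uint64_t ptr64;
	struct {
#if OLD_COND_PICKS_BIG_ENDIAN
		uint32_t padding;
		uint32_t ptr32;
#else
		uint32_t ptr32;
		uint32_t padding;
#endif
	} ptr;
	struct {
#if NEW_COND_PICKS_BIG_ENDIAN
		uint32_t padding;
		uint32_t ptr;
#else
		uint32_t ptr;
		uint32_t padding;
#endif
	} arch;
};

/* Runtime probe: is the first byte of the value 1 its high-order byte? */
static inline int host_is_big_endian(void)
{
	const uint32_t probe = 1;

	return *(const unsigned char *)&probe == 0;
}

/* Store through the old accessor and report what ptr64 then contains. */
static inline uint64_t store_via_old_ptr32(uint32_t value)
{
	union mock_rseq_cs cs = { .ptr64 = 0 };

	cs.ptr.ptr32 = value;
	return cs.ptr64;
}

/* Store through the proposed accessor and report what ptr64 contains. */
static inline uint64_t store_via_arch_ptr(uint32_t value)
{
	union mock_rseq_cs cs = { .ptr64 = 0 };

	cs.arch.ptr = value;
	return cs.ptr64;
}
```

Under the stated assumption, the old condition selects the big endian
layout regardless of the host, so on a little endian host a store via
ptr.ptr32 lands in the high-order half of ptr64. The new condition
follows __BYTE_ORDER when it is defined, so store_via_arch_ptr()
returns the stored value unchanged.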



* [PATCH] MAINTAINERS: Sort entries using parse-maintainers.pl
@ 2021-12-04 17:52  1% Jonathan Neuschäfer
  0 siblings, 0 replies; 200+ results
From: Jonathan Neuschäfer @ 2021-12-04 17:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: Linus Torvalds, Jonathan Neuschäfer, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Nathan Chancellor, Nick Desaulniers,
	linux-riscv, llvm

The MAINTAINERS file got slightly out of order again, making it
difficult to put new entries at the right (alphabetical) position.

Run parse-maintainers.pl to restore the alphabetical order.

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
---

This patch applies cleanly to v5.16-rc3 and can then be cherry-picked
cleanly onto next-20211203, but applying it directly to next-20211203
(with "git am") fails:

  error: patch failed: MAINTAINERS:967
  error: MAINTAINERS: patch does not apply

Checkpatch warns about a few unordered "F:" lines within sections, but I
left those alone because I wanted this patch to be as automated as possible.
---
 MAINTAINERS | 710 ++++++++++++++++++++++++++--------------------------
 1 file changed, 355 insertions(+), 355 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 360e9aa0205d6..a8ae86e24cac0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -967,11 +967,6 @@ F:	drivers/gpu/drm/amd/include/v9_structs.h
 F:	drivers/gpu/drm/amd/include/vi_structs.h
 F:	include/uapi/linux/kfd_ioctl.h

-AMD SPI DRIVER
-M:	Sanjay R Mehta <sanju.mehta@amd.com>
-S:	Maintained
-F:	drivers/spi/spi-amd.c
-
 AMD MP2 I2C DRIVER
 M:	Elie Morisse <syniurge@gmail.com>
 M:	Nehal Shah <nehal-bakulchandra.shah@amd.com>
@@ -1006,13 +1001,6 @@ M:	Tom Lendacky <thomas.lendacky@amd.com>
 S:	Supported
 F:	arch/arm64/boot/dts/amd/

-AMD XGBE DRIVER
-M:	Tom Lendacky <thomas.lendacky@amd.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-F:	arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi
-F:	drivers/net/ethernet/amd/xgbe/
-
 AMD SENSOR FUSION HUB DRIVER
 M:	Nehal Shah <nehal-bakulchandra.shah@amd.com>
 M:	Basavaraj Natikar <basavaraj.natikar@amd.com>
@@ -1021,6 +1009,18 @@ S:	Maintained
 F:	Documentation/hid/amd-sfh*
 F:	drivers/hid/amd-sfh-hid/

+AMD SPI DRIVER
+M:	Sanjay R Mehta <sanju.mehta@amd.com>
+S:	Maintained
+F:	drivers/spi/spi-amd.c
+
+AMD XGBE DRIVER
+M:	Tom Lendacky <thomas.lendacky@amd.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi
+F:	drivers/net/ethernet/amd/xgbe/
+
 AMS AS73211 DRIVER
 M:	Christian Eggers <ceggers@arri.de>
 L:	linux-iio@vger.kernel.org
@@ -1409,6 +1409,16 @@ S:	Maintained
 F:	drivers/net/arcnet/
 F:	include/uapi/linux/if_arcnet.h

+ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
+M:	Arnd Bergmann <arnd@arndb.de>
+M:	Olof Johansson <olof@lixom.net>
+M:	soc@kernel.org
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
+F:	arch/arm/boot/dts/Makefile
+F:	arch/arm64/boot/dts/Makefile
+
 ARM ARCHITECTED TIMER DRIVER
 M:	Mark Rutland <mark.rutland@arm.com>
 M:	Marc Zyngier <maz@kernel.org>
@@ -1525,22 +1535,6 @@ S:	Odd Fixes
 F:	drivers/amba/
 F:	include/linux/amba/bus.h

-ARM PRIMECELL PL35X NAND CONTROLLER DRIVER
-M:	Miquel Raynal <miquel.raynal@bootlin.com>
-M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
-L:	linux-mtd@lists.infradead.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/mtd/arm,pl353-nand-r2p1.yaml
-F:	drivers/mtd/nand/raw/pl35x-nand-controller.c
-
-ARM PRIMECELL PL35X SMC DRIVER
-M:	Miquel Raynal <miquel.raynal@bootlin.com>
-M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-F:	Documentation/devicetree/bindings/memory-controllers/arm,pl353-smc.yaml
-F:	drivers/memory/pl353-smc.c
-
 ARM PRIMECELL CLCD PL110 DRIVER
 M:	Russell King <linux@armlinux.org.uk>
 S:	Odd Fixes
@@ -1558,6 +1552,22 @@ S:	Odd Fixes
 F:	drivers/mmc/host/mmci.*
 F:	include/linux/amba/mmci.h

+ARM PRIMECELL PL35X NAND CONTROLLER DRIVER
+M:	Miquel Raynal <miquel.raynal@bootlin.com>
+M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
+L:	linux-mtd@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/mtd/arm,pl353-nand-r2p1.yaml
+F:	drivers/mtd/nand/raw/pl35x-nand-controller.c
+
+ARM PRIMECELL PL35X SMC DRIVER
+M:	Miquel Raynal <miquel.raynal@bootlin.com>
+M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+F:	Documentation/devicetree/bindings/memory-controllers/arm,pl353-smc.yaml
+F:	drivers/memory/pl353-smc.c
+
 ARM PRIMECELL SSP PL022 SPI DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
@@ -1594,16 +1604,6 @@ F:	Documentation/devicetree/bindings/iommu/arm,smmu*
 F:	drivers/iommu/arm/
 F:	drivers/iommu/io-pgtable-arm*

-ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
-M:	Arnd Bergmann <arnd@arndb.de>
-M:	Olof Johansson <olof@lixom.net>
-M:	soc@kernel.org
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
-F:	arch/arm/boot/dts/Makefile
-F:	arch/arm64/boot/dts/Makefile
-
 ARM SUB-ARCHITECTURES
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
@@ -2256,13 +2256,6 @@ F:	arch/arm64/boot/dts/microchip/
 F:	drivers/pinctrl/pinctrl-microchip-sgpio.c
 N:	sparx5

-Microchip Timer Counter Block (TCB) Capture Driver
-M:	Kamel Bouhara <kamel.bouhara@bootlin.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-L:	linux-iio@vger.kernel.org
-S:	Maintained
-F:	drivers/counter/microchip-tcb-capture.c
-
 ARM/MILBEAUT ARCHITECTURE
 M:	Taichi Sugaya <sugaya.taichi@socionext.com>
 M:	Takao Orito <orito.takao@socionext.com>
@@ -4106,29 +4099,6 @@ W:	https://github.com/Cascoda/ca8210-linux.git
 F:	Documentation/devicetree/bindings/net/ieee802154/ca8210.txt
 F:	drivers/net/ieee802154/ca8210.c

-CANAAN/KENDRYTE K210 SOC FPIOA DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-riscv@lists.infradead.org
-L:	linux-gpio@vger.kernel.org (pinctrl driver)
-F:	Documentation/devicetree/bindings/pinctrl/canaan,k210-fpioa.yaml
-F:	drivers/pinctrl/pinctrl-k210.c
-
-CANAAN/KENDRYTE K210 SOC RESET CONTROLLER DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-kernel@vger.kernel.org
-L:	linux-riscv@lists.infradead.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/reset/canaan,k210-rst.yaml
-F:	drivers/reset/reset-k210.c
-
-CANAAN/KENDRYTE K210 SOC SYSTEM CONTROLLER DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-riscv@lists.infradead.org
-S:	Maintained
-F:      Documentation/devicetree/bindings/mfd/canaan,k210-sysctl.yaml
-F:	drivers/soc/canaan/
-F:	include/soc/canaan/
-
 CACHEFILES: FS-CACHE BACKEND FOR CACHING ON MOUNTED FILESYSTEMS
 M:	David Howells <dhowells@redhat.com>
 L:	linux-cachefs@redhat.com (moderated for non-subscribers)
@@ -4251,6 +4221,29 @@ F:	Documentation/networking/j1939.rst
 F:	include/uapi/linux/can/j1939.h
 F:	net/can/j1939/

+CANAAN/KENDRYTE K210 SOC FPIOA DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-riscv@lists.infradead.org
+L:	linux-gpio@vger.kernel.org (pinctrl driver)
+F:	Documentation/devicetree/bindings/pinctrl/canaan,k210-fpioa.yaml
+F:	drivers/pinctrl/pinctrl-k210.c
+
+CANAAN/KENDRYTE K210 SOC RESET CONTROLLER DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-kernel@vger.kernel.org
+L:	linux-riscv@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/reset/canaan,k210-rst.yaml
+F:	drivers/reset/reset-k210.c
+
+CANAAN/KENDRYTE K210 SOC SYSTEM CONTROLLER DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-riscv@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/mfd/canaan,k210-sysctl.yaml
+F:	drivers/soc/canaan/
+F:	include/soc/canaan/
+
 CAPABILITIES
 M:	Serge Hallyn <serge@hallyn.com>
 L:	linux-security-module@vger.kernel.org
@@ -4500,17 +4493,17 @@ F:	drivers/power/supply/cros_usbpd-charger.c
 N:	cros_ec
 N:	cros-ec

-CHROMEOS EC USB TYPE-C DRIVER
-M:	Prashant Malani <pmalani@chromium.org>
-S:	Maintained
-F:	drivers/platform/chrome/cros_ec_typec.c
-
 CHROMEOS EC USB PD NOTIFY DRIVER
 M:	Prashant Malani <pmalani@chromium.org>
 S:	Maintained
 F:	drivers/platform/chrome/cros_usbpd_notify.c
 F:	include/linux/platform_data/cros_usbpd_notify.h

+CHROMEOS EC USB TYPE-C DRIVER
+M:	Prashant Malani <pmalani@chromium.org>
+S:	Maintained
+F:	drivers/platform/chrome/cros_ec_typec.c
+
 CHRONTEL CH7322 CEC DRIVER
 M:	Joe Tessler <jrt@google.com>
 L:	linux-media@vger.kernel.org
@@ -4615,6 +4608,18 @@ M:	Nelson Escobar <neescoba@cisco.com>
 S:	Supported
 F:	drivers/infiniband/hw/usnic/

+CLANG CONTROL FLOW INTEGRITY SUPPORT
+M:	Sami Tolvanen <samitolvanen@google.com>
+M:	Kees Cook <keescook@chromium.org>
+R:	Nathan Chancellor <nathan@kernel.org>
+R:	Nick Desaulniers <ndesaulniers@google.com>
+L:	llvm@lists.linux.dev
+S:	Supported
+B:	https://github.com/ClangBuiltLinux/linux/issues
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features
+F:	include/linux/cfi.h
+F:	kernel/cfi.c
+
 CLANG-FORMAT FILE
 M:	Miguel Ojeda <ojeda@kernel.org>
 S:	Maintained
@@ -4634,18 +4639,6 @@ F:	scripts/Makefile.clang
 F:	scripts/clang-tools/
 K:	\b(?i:clang|llvm)\b

-CLANG CONTROL FLOW INTEGRITY SUPPORT
-M:	Sami Tolvanen <samitolvanen@google.com>
-M:	Kees Cook <keescook@chromium.org>
-R:	Nathan Chancellor <nathan@kernel.org>
-R:	Nick Desaulniers <ndesaulniers@google.com>
-L:	llvm@lists.linux.dev
-S:	Supported
-B:	https://github.com/ClangBuiltLinux/linux/issues
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features
-F:	include/linux/cfi.h
-F:	kernel/cfi.c
-
 CLEANCACHE API
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 L:	linux-kernel@vger.kernel.org
@@ -5128,6 +5121,13 @@ S:	Supported
 W:	http://www.chelsio.com
 F:	drivers/crypto/chelsio

+CXGB4 ETHERNET DRIVER (CXGB4)
+M:	Raju Rangoju <rajur@chelsio.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+W:	http://www.chelsio.com
+F:	drivers/net/ethernet/chelsio/cxgb4/
+
 CXGB4 INLINE CRYPTO DRIVER
 M:	Ayush Sawal <ayush.sawal@chelsio.com>
 M:	Vinay Kumar Yadav <vinay.yadav@chelsio.com>
@@ -5137,13 +5137,6 @@ S:	Supported
 W:	http://www.chelsio.com
 F:	drivers/net/ethernet/chelsio/inline_crypto/

-CXGB4 ETHERNET DRIVER (CXGB4)
-M:	Raju Rangoju <rajur@chelsio.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-W:	http://www.chelsio.com
-F:	drivers/net/ethernet/chelsio/cxgb4/
-
 CXGB4 ISCSI DRIVER (CXGB4I)
 M:	Karen Xie <kxie@chelsio.com>
 L:	linux-scsi@vger.kernel.org
@@ -5199,16 +5192,6 @@ CYCLADES PC300 DRIVER
 S:	Orphan
 F:	drivers/net/wan/pc300*

-CYPRESS_FIRMWARE MEDIA DRIVER
-M:	Antti Palosaari <crope@iki.fi>
-L:	linux-media@vger.kernel.org
-S:	Maintained
-W:	https://linuxtv.org
-W:	http://palosaari.fi/linux/
-Q:	http://patchwork.linuxtv.org/project/linux-media/list/
-T:	git git://linuxtv.org/anttip/media_tree.git
-F:	drivers/media/common/cypress_firmware*
-
 CYPRESS CY8CTMA140 TOUCHSCREEN DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-input@vger.kernel.org
@@ -5222,6 +5205,16 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/input/cypress-sf.yaml
 F:	drivers/input/keyboard/cypress-sf.c

+CYPRESS_FIRMWARE MEDIA DRIVER
+M:	Antti Palosaari <crope@iki.fi>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+W:	https://linuxtv.org
+W:	http://palosaari.fi/linux/
+Q:	http://patchwork.linuxtv.org/project/linux-media/list/
+T:	git git://linuxtv.org/anttip/media_tree.git
+F:	drivers/media/common/cypress_firmware*
+
 CYTTSP TOUCHSCREEN DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-input@vger.kernel.org
@@ -5392,14 +5385,12 @@ L:	Dell.Client.Kernel@dell.com
 S:	Maintained
 F:	drivers/platform/x86/dell/dell-wmi-descriptor.c

-DELL WMI SYSMAN DRIVER
-M:	Divya Bharathi <divya.bharathi@dell.com>
-M:	Prasanth Ksr <prasanth.ksr@dell.com>
+DELL WMI HARDWARE PRIVACY SUPPORT
+M:	Perry Yuan <Perry.Yuan@dell.com>
 L:	Dell.Client.Kernel@dell.com
 L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
-F:	Documentation/ABI/testing/sysfs-class-firmware-attributes
-F:	drivers/platform/x86/dell/dell-wmi-sysman/
+F:	drivers/platform/x86/dell/dell-wmi-privacy.c

 DELL WMI NOTIFICATIONS DRIVER
 M:	Matthew Garrett <mjg59@srcf.ucam.org>
@@ -5407,12 +5398,21 @@ M:	Pali Rohár <pali@kernel.org>
 S:	Maintained
 F:	drivers/platform/x86/dell/dell-wmi-base.c

-DELL WMI HARDWARE PRIVACY SUPPORT
-M:	Perry Yuan <Perry.Yuan@dell.com>
+DELL WMI SYSMAN DRIVER
+M:	Divya Bharathi <divya.bharathi@dell.com>
+M:	Prasanth Ksr <prasanth.ksr@dell.com>
 L:	Dell.Client.Kernel@dell.com
 L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
-F:	drivers/platform/x86/dell/dell-wmi-privacy.c
+F:	Documentation/ABI/testing/sysfs-class-firmware-attributes
+F:	drivers/platform/x86/dell/dell-wmi-sysman/
+
+DELTA DPS920AB PSU DRIVER
+M:	Robert Marko <robert.marko@sartura.hr>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/hwmon/dps920ab.rst
+F:	drivers/hwmon/pmbus/dps920ab.c

 DELTA ST MEDIA DRIVER
 M:	Hugues Fruchet <hugues.fruchet@foss.st.com>
@@ -5422,13 +5422,6 @@ W:	https://linuxtv.org
 T:	git git://linuxtv.org/media_tree.git
 F:	drivers/media/platform/sti/delta

-DELTA DPS920AB PSU DRIVER
-M:	Robert Marko <robert.marko@sartura.hr>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/hwmon/dps920ab.rst
-F:	drivers/hwmon/pmbus/dps920ab.c
-
 DENALI NAND DRIVER
 L:	linux-mtd@lists.infradead.org
 S:	Orphan
@@ -5441,13 +5434,6 @@ S:	Maintained
 F:	drivers/dma/dw-edma/
 F:	include/linux/dma/edma.h

-DESIGNWARE XDATA IP DRIVER
-M:	Gustavo Pimentel <gustavo.pimentel@synopsys.com>
-L:	linux-pci@vger.kernel.org
-S:	Maintained
-F:	Documentation/misc-devices/dw-xdata-pcie.rst
-F:	drivers/misc/dw-xdata-pcie.c
-
 DESIGNWARE USB2 DRD IP DRIVER
 M:	Minas Harutyunyan <hminas@synopsys.com>
 L:	linux-usb@vger.kernel.org
@@ -5462,6 +5448,13 @@ S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git
 F:	drivers/usb/dwc3/

+DESIGNWARE XDATA IP DRIVER
+M:	Gustavo Pimentel <gustavo.pimentel@synopsys.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+F:	Documentation/misc-devices/dw-xdata-pcie.rst
+F:	drivers/misc/dw-xdata-pcie.c
+
 DEVANTECH SRF ULTRASONIC RANGER IIO DRIVER
 M:	Andreas Klinger <ak@it-klinger.de>
 L:	linux-iio@vger.kernel.org
@@ -5691,6 +5684,12 @@ F:	include/linux/dma/
 F:	include/linux/dmaengine.h
 F:	include/linux/of_dma.h

+DMA MAPPING BENCHMARK
+M:	Barry Song <song.bao.hua@hisilicon.com>
+L:	iommu@lists.linux-foundation.org
+F:	kernel/dma/map_benchmark.c
+F:	tools/testing/selftests/dma/
+
 DMA MAPPING HELPERS
 M:	Christoph Hellwig <hch@lst.de>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
@@ -5705,12 +5704,6 @@ F:	include/linux/dma-mapping.h
 F:	include/linux/dma-map-ops.h
 F:	kernel/dma/

-DMA MAPPING BENCHMARK
-M:	Barry Song <song.bao.hua@hisilicon.com>
-L:	iommu@lists.linux-foundation.org
-F:	kernel/dma/map_benchmark.c
-F:	tools/testing/selftests/dma/
-
 DMA-BUF HEAPS FRAMEWORK
 M:	Sumit Semwal <sumit.semwal@linaro.org>
 R:	Benjamin Gaignard <benjamin.gaignard@linaro.org>
@@ -5992,6 +5985,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/himax,hx8357d.txt
 F:	drivers/gpu/drm/tiny/hx8357d.c

+DRM DRIVER FOR HYPERV SYNTHETIC VIDEO DEVICE
+M:	Deepak Rawat <drawat.floss@gmail.com>
+L:	linux-hyperv@vger.kernel.org
+L:	dri-devel@lists.freedesktop.org
+S:	Maintained
+T:	git git://anongit.freedesktop.org/drm/drm-misc
+F:	drivers/gpu/drm/hyperv
+
 DRM DRIVER FOR ILITEK ILI9225 PANELS
 M:	David Lechner <david@lechnology.com>
 S:	Maintained
@@ -6137,14 +6138,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml
 F:	drivers/gpu/drm/panel/panel-samsung-s6d27a1.c

-DRM DRIVER FOR SITRONIX ST7703 PANELS
-M:	Guido Günther <agx@sigxcpu.org>
-R:	Purism Kernel Team <kernel@puri.sm>
-R:	Ondrej Jirman <megous@megous.com>
-S:	Maintained
-F:	Documentation/devicetree/bindings/display/panel/rocktech,jh057n00900.yaml
-F:	drivers/gpu/drm/panel/panel-sitronix-st7703.c
-
 DRM DRIVER FOR SAVAGE VIDEO CARDS
 S:	Orphan / Obsolete
 F:	drivers/gpu/drm/savage/
@@ -6175,6 +6168,14 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml
 F:	drivers/gpu/drm/panel/panel-sitronix-st7701.c

+DRM DRIVER FOR SITRONIX ST7703 PANELS
+M:	Guido Günther <agx@sigxcpu.org>
+R:	Purism Kernel Team <kernel@puri.sm>
+R:	Ondrej Jirman <megous@megous.com>
+S:	Maintained
+F:	Documentation/devicetree/bindings/display/panel/rocktech,jh057n00900.yaml
+F:	drivers/gpu/drm/panel/panel-sitronix-st7703.c
+
 DRM DRIVER FOR SITRONIX ST7735R PANELS
 M:	David Lechner <david@lechnology.com>
 S:	Maintained
@@ -6369,14 +6370,6 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/hisilicon/
 F:	drivers/gpu/drm/hisilicon/

-DRM DRIVER FOR HYPERV SYNTHETIC VIDEO DEVICE
-M:	Deepak Rawat <drawat.floss@gmail.com>
-L:	linux-hyperv@vger.kernel.org
-L:	dri-devel@lists.freedesktop.org
-S:	Maintained
-T:	git git://anongit.freedesktop.org/drm/drm-misc
-F:	drivers/gpu/drm/hyperv
-
 DRM DRIVERS FOR LIMA
 M:	Qiang Yu <yuq825@gmail.com>
 L:	dri-devel@lists.freedesktop.org
@@ -6524,6 +6517,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/xlnx/
 F:	drivers/gpu/drm/xlnx/

+DRM GPU SCHEDULER
+M:	Andrey Grodzovsky <andrey.grodzovsky@amd.com>
+L:	dri-devel@lists.freedesktop.org
+S:	Maintained
+T:	git git://anongit.freedesktop.org/drm/drm-misc
+F:	drivers/gpu/drm/scheduler/
+F:	include/drm/gpu_scheduler.h
+
 DRM PANEL DRIVERS
 M:	Thierry Reding <thierry.reding@gmail.com>
 R:	Sam Ravnborg <sam@ravnborg.org>
@@ -6544,14 +6545,6 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	drivers/gpu/drm/ttm/
 F:	include/drm/ttm/

-DRM GPU SCHEDULER
-M:	Andrey Grodzovsky <andrey.grodzovsky@amd.com>
-L:	dri-devel@lists.freedesktop.org
-S:	Maintained
-T:	git git://anongit.freedesktop.org/drm/drm-misc
-F:	drivers/gpu/drm/scheduler/
-F:	include/drm/gpu_scheduler.h
-
 DSBR100 USB FM RADIO DRIVER
 M:	Alexey Klimov <klimov.linux@gmail.com>
 L:	linux-media@vger.kernel.org
@@ -6690,6 +6683,15 @@ F:	Documentation/networking/net_dim.rst
 F:	include/linux/dim.h
 F:	lib/dim/

+DYNAMIC THERMAL POWER MANAGEMENT (DTPM)
+M:	Daniel Lezcano <daniel.lezcano@kernel.org>
+L:	linux-pm@vger.kernel.org
+S:	Supported
+B:	https://bugzilla.kernel.org
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
+F:	drivers/powercap/dtpm*
+F:	include/linux/dtpm.h
+
 DZ DECSTATION DZ11 SERIAL DRIVER
 M:	"Maciej W. Rozycki" <macro@orcam.me.uk>
 S:	Maintained
@@ -7037,22 +7039,22 @@ W:	http://www.broadcom.com
 F:	drivers/infiniband/hw/ocrdma/
 F:	include/uapi/rdma/ocrdma-abi.h

-EMULEX/BROADCOM LPFC FC/FCOE SCSI DRIVER
+EMULEX/BROADCOM EFCT FC/FCOE SCSI TARGET DRIVER
 M:	James Smart <james.smart@broadcom.com>
-M:	Dick Kennedy <dick.kennedy@broadcom.com>
+M:	Ram Vegesna <ram.vegesna@broadcom.com>
 L:	linux-scsi@vger.kernel.org
+L:	target-devel@vger.kernel.org
 S:	Supported
 W:	http://www.broadcom.com
-F:	drivers/scsi/lpfc/
+F:	drivers/scsi/elx/

-EMULEX/BROADCOM EFCT FC/FCOE SCSI TARGET DRIVER
+EMULEX/BROADCOM LPFC FC/FCOE SCSI DRIVER
 M:	James Smart <james.smart@broadcom.com>
-M:	Ram Vegesna <ram.vegesna@broadcom.com>
+M:	Dick Kennedy <dick.kennedy@broadcom.com>
 L:	linux-scsi@vger.kernel.org
-L:	target-devel@vger.kernel.org
 S:	Supported
 W:	http://www.broadcom.com
-F:	drivers/scsi/elx/
+F:	drivers/scsi/lpfc/

 ENE CB710 FLASH CARD READER DRIVER
 M:	Michał Mirosław <mirq-linux@rere.qmqm.pl>
@@ -8529,6 +8531,12 @@ W:	http://www.highpoint-tech.com
 F:	Documentation/scsi/hptiop.rst
 F:	drivers/scsi/hptiop.c

+HIKEY960 ONBOARD USB GPIO HUB DRIVER
+M:	John Stultz <john.stultz@linaro.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	drivers/misc/hisi_hikey_usb.c
+
 HIPPI
 M:	Jes Sorensen <jes@trained-monkey.org>
 L:	linux-hippi@sunsite.dk
@@ -8599,12 +8607,6 @@ W:	http://www.hisilicon.com
 F:	Documentation/devicetree/bindings/net/hisilicon*.txt
 F:	drivers/net/ethernet/hisilicon/

-HIKEY960 ONBOARD USB GPIO HUB DRIVER
-M:	John Stultz <john.stultz@linaro.org>
-L:	linux-kernel@vger.kernel.org
-S:	Maintained
-F:	drivers/misc/hisi_hikey_usb.c
-
 HISILICON PMU DRIVER
 M:	Shaokun Zhang <zhangshaokun@hisilicon.com>
 S:	Supported
@@ -9630,18 +9632,18 @@ F:	Documentation/admin-guide/media/ipu3_rcb.svg
 F:	Documentation/userspace-api/media/v4l/pixfmt-meta-intel-ipu3.rst
 F:	drivers/staging/media/ipu3/

-INTEL IXP4XX CRYPTO SUPPORT
-M:	Corentin Labbe <clabbe@baylibre.com>
-L:	linux-crypto@vger.kernel.org
-S:	Maintained
-F:	drivers/crypto/ixp4xx_crypto.c
-
 INTEL ISHTP ECLITE DRIVER
 M:	Sumesh K Naduvalath <sumesh.k.naduvalath@intel.com>
 L:	platform-driver-x86@vger.kernel.org
 S:	Supported
 F:	drivers/platform/x86/intel/ishtp_eclite.c

+INTEL IXP4XX CRYPTO SUPPORT
+M:	Corentin Labbe <clabbe@baylibre.com>
+L:	linux-crypto@vger.kernel.org
+S:	Maintained
+F:	drivers/crypto/ixp4xx_crypto.c
+
 INTEL IXP4XX QMGR, NPE, ETHERNET and HSS SUPPORT
 M:	Krzysztof Halasa <khalasa@piap.pl>
 S:	Maintained
@@ -9784,6 +9786,21 @@ S:	Maintained
 F:	arch/x86/include/asm/intel_scu_ipc.h
 F:	drivers/platform/x86/intel_scu_*

+INTEL SGX
+M:	Jarkko Sakkinen <jarkko@kernel.org>
+R:	Dave Hansen <dave.hansen@linux.intel.com>
+L:	linux-sgx@vger.kernel.org
+S:	Supported
+Q:	https://patchwork.kernel.org/project/intel-sgx/list/
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/sgx
+F:	Documentation/x86/sgx.rst
+F:	arch/x86/entry/vdso/vsgx.S
+F:	arch/x86/include/asm/sgx.h
+F:	arch/x86/include/uapi/asm/sgx.h
+F:	arch/x86/kernel/cpu/sgx/*
+F:	tools/testing/selftests/sgx/*
+K:	\bSGX_
+
 INTEL SKYLAKE INT3472 ACPI DEVICE DRIVER
 M:	Daniel Scally <djrscally@gmail.com>
 S:	Maintained
@@ -9878,21 +9895,6 @@ F:	Documentation/x86/intel_txt.rst
 F:	arch/x86/kernel/tboot.c
 F:	include/linux/tboot.h

-INTEL SGX
-M:	Jarkko Sakkinen <jarkko@kernel.org>
-R:	Dave Hansen <dave.hansen@linux.intel.com>
-L:	linux-sgx@vger.kernel.org
-S:	Supported
-Q:	https://patchwork.kernel.org/project/intel-sgx/list/
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/sgx
-F:	Documentation/x86/sgx.rst
-F:	arch/x86/entry/vdso/vsgx.S
-F:	arch/x86/include/asm/sgx.h
-F:	arch/x86/include/uapi/asm/sgx.h
-F:	arch/x86/kernel/cpu/sgx/*
-F:	tools/testing/selftests/sgx/*
-K:	\bSGX_
-
 INTERCONNECT API
 M:	Georgi Djakov <djakov@kernel.org>
 L:	linux-pm@vger.kernel.org
@@ -11298,6 +11300,12 @@ F:	drivers/mailbox/arm_mhuv2.c
 F:	include/linux/mailbox/arm_mhuv2_message.h
 F:	Documentation/devicetree/bindings/mailbox/arm,mhuv2.yaml

+MAN-PAGES: MANUAL PAGES FOR LINUX -- Sections 2, 3, 4, 5, and 7
+M:	Michael Kerrisk <mtk.manpages@gmail.com>
+L:	linux-man@vger.kernel.org
+S:	Maintained
+W:	http://www.kernel.org/doc/man-pages
+
 MANAGEMENT COMPONENT TRANSPORT PROTOCOL (MCTP)
 M:	Jeremy Kerr <jk@codeconstruct.com.au>
 M:	Matt Johnston <matt@codeconstruct.com.au>
@@ -11310,12 +11318,6 @@ F:	include/net/mctpdevice.h
 F:	include/net/netns/mctp.h
 F:	net/mctp/

-MAN-PAGES: MANUAL PAGES FOR LINUX -- Sections 2, 3, 4, 5, and 7
-M:	Michael Kerrisk <mtk.manpages@gmail.com>
-L:	linux-man@vger.kernel.org
-S:	Maintained
-W:	http://www.kernel.org/doc/man-pages
-
 MARDUK (CREATOR CI40) DEVICE TREE SUPPORT
 M:	Rahul Bedarkar <rahulbedarkar89@gmail.com>
 L:	linux-mips@vger.kernel.org
@@ -11634,12 +11636,6 @@ L:	netdev@vger.kernel.org
 S:	Supported
 F:	drivers/net/phy/mxl-gpy.c

-MCBA MICROCHIP CAN BUS ANALYZER TOOL DRIVER
-R:	Yasushi SHOJI <yashi@spacecubics.com>
-L:	linux-can@vger.kernel.org
-S:	Maintained
-F:	drivers/net/can/usb/mcba_usb.c
-
 MCAN MMIO DEVICE DRIVER
 M:	Chandrasekar Ramakrishnan <rcsekar@samsung.com>
 L:	linux-can@vger.kernel.org
@@ -11649,6 +11645,12 @@ F:	drivers/net/can/m_can/m_can.c
 F:	drivers/net/can/m_can/m_can.h
 F:	drivers/net/can/m_can/m_can_platform.c

+MCBA MICROCHIP CAN BUS ANALYZER TOOL DRIVER
+R:	Yasushi SHOJI <yashi@spacecubics.com>
+L:	linux-can@vger.kernel.org
+S:	Maintained
+F:	drivers/net/can/usb/mcba_usb.c
+
 MCP2221A MICROCHIP USB-HID TO I2C BRIDGE DRIVER
 M:	Rishi Gupta <gupt21@gmail.com>
 L:	linux-i2c@vger.kernel.org
@@ -12041,13 +12043,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml
 F:	drivers/clk/ralink/clk-mt7621.c

-MEDIATEK MT7621/28/88 I2C DRIVER
-M:	Stefan Roese <sr@denx.de>
-L:	linux-i2c@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/i2c/i2c-mt7621.txt
-F:	drivers/i2c/busses/i2c-mt7621.c
-
 MEDIATEK MT7621 PCIE CONTROLLER DRIVER
 M:	Sergio Paracuellos <sergio.paracuellos@gmail.com>
 S:	Maintained
@@ -12060,6 +12055,13 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/phy/mediatek,mt7621-pci-phy.yaml
 F:	drivers/phy/ralink/phy-mt7621-pci.c

+MEDIATEK MT7621/28/88 I2C DRIVER
+M:	Stefan Roese <sr@denx.de>
+L:	linux-i2c@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/i2c/i2c-mt7621.txt
+F:	drivers/i2c/busses/i2c-mt7621.c
+
 MEDIATEK NAND CONTROLLER DRIVER
 L:	linux-mtd@lists.infradead.org
 S:	Orphan
@@ -12591,6 +12593,13 @@ S:	Supported
 F:	drivers/misc/atmel-ssc.c
 F:	include/linux/atmel-ssc.h

+Microchip Timer Counter Block (TCB) Capture Driver
+M:	Kamel Bouhara <kamel.bouhara@bootlin.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+L:	linux-iio@vger.kernel.org
+S:	Maintained
+F:	drivers/counter/microchip-tcb-capture.c
+
 MICROCHIP USB251XB DRIVER
 M:	Richard Leitner <richard.leitner@skidata.com>
 L:	linux-usb@vger.kernel.org
@@ -13691,13 +13700,6 @@ F:	drivers/iio/gyro/fxas21002c_core.c
 F:	drivers/iio/gyro/fxas21002c_i2c.c
 F:	drivers/iio/gyro/fxas21002c_spi.c

-NXP i.MX CLOCK DRIVERS
-M:	Abel Vesa <abel.vesa@nxp.com>
-L:	linux-clk@vger.kernel.org
-L:	linux-imx@nxp.com
-S:	Maintained
-F:	drivers/clk/imx/
-
 NXP i.MX 8MQ DCSS DRIVER
 M:	Laurentiu Palcu <laurentiu.palcu@oss.nxp.com>
 R:	Lucas Stach <l.stach@pengutronix.de>
@@ -13713,6 +13715,21 @@ S:	Supported
 F:	Documentation/devicetree/bindings/iio/adc/nxp,imx8qxp-adc.yaml
 F:	drivers/iio/adc/imx8qxp-adc.c

+NXP i.MX 8QXP/8QM JPEG V4L2 DRIVER
+M:	Mirela Rabulea <mirela.rabulea@nxp.com>
+R:	NXP Linux Team <linux-imx@nxp.com>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/media/nxp,imx8-jpeg.yaml
+F:	drivers/media/platform/imx-jpeg
+
+NXP i.MX CLOCK DRIVERS
+M:	Abel Vesa <abel.vesa@nxp.com>
+L:	linux-clk@vger.kernel.org
+L:	linux-imx@nxp.com
+S:	Maintained
+F:	drivers/clk/imx/
+
 NXP PF8100/PF8121A/PF8200 PMIC REGULATOR DEVICE DRIVER
 M:	Jagan Teki <jagan@amarulasolutions.com>
 S:	Maintained
@@ -13750,19 +13767,12 @@ F:	include/drm/i2c/tda998x.h
 F:	include/dt-bindings/display/tda998x.h
 K:	"nxp,tda998x"

-NXP TFA9879 DRIVER
-M:	Peter Rosin <peda@axentia.se>
-L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
-S:	Maintained
-F:	Documentation/devicetree/bindings/sound/tfa9879.txt
-F:	sound/soc/codecs/tfa9879*
-
-NXP/Goodix TFA989X (TFA1) DRIVER
-M:	Stephan Gerhold <stephan@gerhold.net>
+NXP TFA9879 DRIVER
+M:	Peter Rosin <peda@axentia.se>
 L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
-F:	Documentation/devicetree/bindings/sound/nxp,tfa989x.yaml
-F:	sound/soc/codecs/tfa989x.c
+F:	Documentation/devicetree/bindings/sound/tfa9879.txt
+F:	sound/soc/codecs/tfa9879*

 NXP-NCI NFC DRIVER
 R:	Charles Gorand <charles.gorand@effinnov.com>
@@ -13771,13 +13781,12 @@ S:	Supported
 F:	Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml
 F:	drivers/nfc/nxp-nci

-NXP i.MX 8QXP/8QM JPEG V4L2 DRIVER
-M:	Mirela Rabulea <mirela.rabulea@nxp.com>
-R:	NXP Linux Team <linux-imx@nxp.com>
-L:	linux-media@vger.kernel.org
+NXP/Goodix TFA989X (TFA1) DRIVER
+M:	Stephan Gerhold <stephan@gerhold.net>
+L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
-F:	Documentation/devicetree/bindings/media/nxp,imx8-jpeg.yaml
-F:	drivers/media/platform/imx-jpeg
+F:	Documentation/devicetree/bindings/sound/nxp,tfa989x.yaml
+F:	sound/soc/codecs/tfa989x.c

 NZXT-KRAKEN2 HARDWARE MONITORING DRIVER
 M:	Jonas Malaco <jonas@protocubo.io>
@@ -14567,6 +14576,14 @@ L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/pci/controller/dwc/*layerscape*

+PCI DRIVER FOR FU740
+M:	Paul Walmsley <paul.walmsley@sifive.com>
+M:	Greentime Hu <greentime.hu@sifive.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/pci/sifive,fu740-pcie.yaml
+F:	drivers/pci/controller/dwc/pcie-fu740.c
+
 PCI DRIVER FOR GENERIC OF HOSTS
 M:	Will Deacon <will@kernel.org>
 L:	linux-pci@vger.kernel.org
@@ -14585,14 +14602,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml
 F:	drivers/pci/controller/dwc/*imx6*

-PCI DRIVER FOR FU740
-M:	Paul Walmsley <paul.walmsley@sifive.com>
-M:	Greentime Hu <greentime.hu@sifive.com>
-L:	linux-pci@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/pci/sifive,fu740-pcie.yaml
-F:	drivers/pci/controller/dwc/pcie-fu740.c
-
 PCI DRIVER FOR INTEL IXP4XX
 M:	Linus Walleij <linus.walleij@linaro.org>
 S:	Maintained
@@ -14865,14 +14874,6 @@ L:	linux-arm-msm@vger.kernel.org
 S:	Maintained
 F:	drivers/pci/controller/dwc/pcie-qcom.c

-PCIE ENDPOINT DRIVER FOR QUALCOMM
-M:	Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
-L:	linux-pci@vger.kernel.org
-L:	linux-arm-msm@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
-F:	drivers/pci/controller/dwc/pcie-qcom-ep.c
-
 PCIE DRIVER FOR ROCKCHIP
 M:	Shawn Lin <shawn.lin@rock-chips.com>
 L:	linux-pci@vger.kernel.org
@@ -14894,6 +14895,14 @@ L:	linux-pci@vger.kernel.org
 S:	Maintained
 F:	drivers/pci/controller/dwc/*spear*

+PCIE ENDPOINT DRIVER FOR QUALCOMM
+M:	Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
+L:	linux-pci@vger.kernel.org
+L:	linux-arm-msm@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
+F:	drivers/pci/controller/dwc/pcie-qcom-ep.c
+
 PCMCIA SUBSYSTEM
 M:	Dominik Brodowski <linux@dominikbrodowski.net>
 S:	Odd Fixes
@@ -15153,13 +15162,6 @@ M:	Logan Gunthorpe <logang@deltatee.com>
 S:	Maintained
 F:	drivers/dma/plx_dma.c

-PM6764TR DRIVER
-M:	Charles Hsu	<hsu.yungteng@gmail.com>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/hwmon/pm6764tr.rst
-F:	drivers/hwmon/pmbus/pm6764tr.c
-
 PM-GRAPH UTILITY
 M:	"Todd E Brandt" <todd.e.brandt@linux.intel.com>
 L:	linux-pm@vger.kernel.org
@@ -15169,6 +15171,13 @@ B:	https://bugzilla.kernel.org/buglist.cgi?component=pm-graph&product=Tools
 T:	git git://github.com/intel/pm-graph
 F:	tools/power/pm-graph

+PM6764TR DRIVER
+M:	Charles Hsu	<hsu.yungteng@gmail.com>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/hwmon/pm6764tr.rst
+F:	drivers/hwmon/pmbus/pm6764tr.c
+
 PMBUS HARDWARE MONITORING DRIVERS
 M:	Guenter Roeck <linux@roeck-us.net>
 L:	linux-hwmon@vger.kernel.org
@@ -15249,15 +15258,6 @@ F:	include/linux/pm_*
 F:	include/linux/powercap.h
 F:	kernel/configs/nopm.config

-DYNAMIC THERMAL POWER MANAGEMENT (DTPM)
-M:	Daniel Lezcano <daniel.lezcano@kernel.org>
-L:	linux-pm@vger.kernel.org
-S:	Supported
-B:	https://bugzilla.kernel.org
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
-F:	drivers/powercap/dtpm*
-F:	include/linux/dtpm.h
-
 POWER STATE COORDINATION INTERFACE (PSCI)
 M:	Mark Rutland <mark.rutland@arm.com>
 M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
@@ -16272,12 +16272,6 @@ S:	Supported
 F:	Documentation/devicetree/bindings/i2c/renesas,riic.yaml
 F:	drivers/i2c/busses/i2c-riic.c

-RENESAS USB PHY DRIVER
-M:	Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
-L:	linux-renesas-soc@vger.kernel.org
-S:	Maintained
-F:	drivers/phy/renesas/phy-rcar-gen3-usb*.c
-
 RENESAS RZ/G2L A/D DRIVER
 M:	Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
 L:	linux-iio@vger.kernel.org
@@ -16286,6 +16280,12 @@ S:	Supported
 F:	Documentation/devicetree/bindings/iio/adc/renesas,rzg2l-adc.yaml
 F:	drivers/iio/adc/rzg2l_adc.c

+RENESAS USB PHY DRIVER
+M:	Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
+L:	linux-renesas-soc@vger.kernel.org
+S:	Maintained
+F:	drivers/phy/renesas/phy-rcar-gen3-usb*.c
+
 RESET CONTROLLER FRAMEWORK
 M:	Philipp Zabel <p.zabel@pengutronix.de>
 S:	Maintained
@@ -17116,6 +17116,15 @@ F:	block/sed*
 F:	include/linux/sed*
 F:	include/uapi/linux/sed*

+SECURE MONITOR CALL(SMC) CALLING CONVENTION (SMCCC)
+M:	Mark Rutland <mark.rutland@arm.com>
+M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+M:	Sudeep Holla <sudeep.holla@arm.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+F:	drivers/firmware/smccc/
+F:	include/linux/arm-smccc.h
+
 SECURITY CONTACT
 M:	Security Officers <security@kernel.org>
 S:	Supported
@@ -17523,15 +17532,6 @@ M:	Nicolas Pitre <nico@fluxnic.net>
 S:	Odd Fixes
 F:	drivers/net/ethernet/smsc/smc91x.*

-SECURE MONITOR CALL(SMC) CALLING CONVENTION (SMCCC)
-M:	Mark Rutland <mark.rutland@arm.com>
-M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
-M:	Sudeep Holla <sudeep.holla@arm.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-F:	drivers/firmware/smccc/
-F:	include/linux/arm-smccc.h
-
 SMM665 HARDWARE MONITOR DRIVER
 M:	Guenter Roeck <linux@roeck-us.net>
 L:	linux-hwmon@vger.kernel.org
@@ -18713,6 +18713,14 @@ M:	Thierry Reding <thierry.reding@gmail.com>
 S:	Supported
 F:	drivers/pwm/pwm-tegra.c

+TEGRA QUAD SPI DRIVER
+M:	Thierry Reding <thierry.reding@gmail.com>
+M:	Jonathan Hunter <jonathanh@nvidia.com>
+M:	Sowjanya Komatineni <skomatineni@nvidia.com>
+L:	linux-tegra@vger.kernel.org
+S:	Maintained
+F:	drivers/spi/spi-tegra210-quad.c
+
 TEGRA SERIAL DRIVER
 M:	Laxman Dewangan <ldewangan@nvidia.com>
 S:	Supported
@@ -18723,14 +18731,6 @@ M:	Laxman Dewangan <ldewangan@nvidia.com>
 S:	Supported
 F:	drivers/spi/spi-tegra*

-TEGRA QUAD SPI DRIVER
-M:	Thierry Reding <thierry.reding@gmail.com>
-M:	Jonathan Hunter <jonathanh@nvidia.com>
-M:	Sowjanya Komatineni <skomatineni@nvidia.com>
-L:	linux-tegra@vger.kernel.org
-S:	Maintained
-F:	drivers/spi/spi-tegra210-quad.c
-
 TEGRA VIDEO DRIVER
 M:	Thierry Reding <thierry.reding@gmail.com>
 M:	Jonathan Hunter <jonathanh@nvidia.com>
@@ -18779,13 +18779,6 @@ L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
 F:	sound/soc/ti/

-TEXAS INSTRUMENTS' DAC7612 DAC DRIVER
-M:	Ricardo Ribalda <ribalda@kernel.org>
-L:	linux-iio@vger.kernel.org
-S:	Supported
-F:	Documentation/devicetree/bindings/iio/dac/ti,dac7612.yaml
-F:	drivers/iio/dac/ti-dac7612.c
-
 TEXAS INSTRUMENTS DMA DRIVERS
 M:	Peter Ujfalusi <peter.ujfalusi@gmail.com>
 L:	dmaengine@vger.kernel.org
@@ -18799,6 +18792,22 @@ F:	include/linux/dma/k3-udma-glue.h
 F:	include/linux/dma/ti-cppi5.h
 F:	include/linux/dma/k3-psil.h

+TEXAS INSTRUMENTS TPS23861 PoE PSE DRIVER
+M:	Robert Marko <robert.marko@sartura.hr>
+M:	Luka Perkov <luka.perkov@sartura.hr>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
+F:	Documentation/hwmon/tps23861.rst
+F:	drivers/hwmon/tps23861.c
+
+TEXAS INSTRUMENTS' DAC7612 DAC DRIVER
+M:	Ricardo Ribalda <ribalda@kernel.org>
+L:	linux-iio@vger.kernel.org
+S:	Supported
+F:	Documentation/devicetree/bindings/iio/dac/ti,dac7612.yaml
+F:	drivers/iio/dac/ti-dac7612.c
+
 TEXAS INSTRUMENTS' SYSTEM CONTROL INTERFACE (TISCI) PROTOCOL DRIVER
 M:	Nishanth Menon <nm@ti.com>
 M:	Tero Kristo <kristo@kernel.org>
@@ -18823,15 +18832,6 @@ F:	include/dt-bindings/soc/ti,sci_pm_domain.h
 F:	include/linux/soc/ti/ti_sci_inta_msi.h
 F:	include/linux/soc/ti/ti_sci_protocol.h

-TEXAS INSTRUMENTS TPS23861 PoE PSE DRIVER
-M:	Robert Marko <robert.marko@sartura.hr>
-M:	Luka Perkov <luka.perkov@sartura.hr>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
-F:	Documentation/hwmon/tps23861.rst
-F:	drivers/hwmon/tps23861.c
-
 TEXAS INSTRUMENTS' TMP117 TEMPERATURE SENSOR DRIVER
 M:	Puranjay Mohan <puranjay12@gmail.com>
 L:	linux-iio@vger.kernel.org
@@ -19732,6 +19732,13 @@ L:	linux-usb@vger.kernel.org
 S:	Supported
 F:	drivers/usb/class/usblp.c

+USB QMI WWAN NETWORK DRIVER
+M:	Bjørn Mork <bjorn@mork.no>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-class-net-qmi
+F:	drivers/net/usb/qmi_wwan.c
+
 USB RAW GADGET DRIVER
 R:	Andrey Konovalov <andreyknvl@gmail.com>
 L:	linux-usb@vger.kernel.org
@@ -19740,13 +19747,6 @@ F:	Documentation/usb/raw-gadget.rst
 F:	drivers/usb/gadget/legacy/raw_gadget.c
 F:	include/uapi/linux/usb/raw_gadget.h

-USB QMI WWAN NETWORK DRIVER
-M:	Bjørn Mork <bjorn@mork.no>
-L:	netdev@vger.kernel.org
-S:	Maintained
-F:	Documentation/ABI/testing/sysfs-class-net-qmi
-F:	drivers/net/usb/qmi_wwan.c
-
 USB RTL8150 DRIVER
 M:	Petko Manolov <petkan@nucleusys.com>
 L:	linux-usb@vger.kernel.org
@@ -20062,6 +20062,14 @@ S:	Maintained
 F:	drivers/media/common/videobuf2/*
 F:	include/media/videobuf2-*

+VIDTV VIRTUAL DIGITAL TV DRIVER
+M:	Daniel W. S. Almeida <dwlsalmeida@gmail.com>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+W:	https://linuxtv.org
+T:	git git://linuxtv.org/media_tree.git
+F:	drivers/media/test-drivers/vidtv/*
+
 VIMC VIRTUAL MEDIA CONTROLLER DRIVER
 M:	Helen Koike <helen.koike@collabora.com>
 R:	Shuah Khan <skhan@linuxfoundation.org>
@@ -20091,6 +20099,16 @@ F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport.c
 F:	net/vmw_vsock/virtio_transport_common.c

+VIRTIO BALLOON
+M:	"Michael S. Tsirkin" <mst@redhat.com>
+M:	David Hildenbrand <david@redhat.com>
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/virtio/virtio_balloon.c
+F:	include/uapi/linux/virtio_balloon.h
+F:	include/linux/balloon_compaction.h
+F:	mm/balloon_compaction.c
+
 VIRTIO BLOCK AND SCSI DRIVERS
 M:	"Michael S. Tsirkin" <mst@redhat.com>
 M:	Jason Wang <jasowang@redhat.com>
@@ -20128,16 +20146,6 @@ F:	include/linux/virtio*.h
 F:	include/uapi/linux/virtio_*.h
 F:	tools/virtio/

-VIRTIO BALLOON
-M:	"Michael S. Tsirkin" <mst@redhat.com>
-M:	David Hildenbrand <david@redhat.com>
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/virtio/virtio_balloon.c
-F:	include/uapi/linux/virtio_balloon.h
-F:	include/linux/balloon_compaction.h
-F:	mm/balloon_compaction.c
-
 VIRTIO CRYPTO DRIVER
 M:	Gonglei <arei.gonglei@huawei.com>
 L:	virtualization@lists.linux-foundation.org
@@ -20199,6 +20207,15 @@ F:	drivers/vhost/
 F:	include/linux/vhost_iotlb.h
 F:	include/uapi/linux/vhost.h

+VIRTIO I2C DRIVER
+M:	Conghui Chen <conghui.chen@intel.com>
+M:	Viresh Kumar <viresh.kumar@linaro.org>
+L:	linux-i2c@vger.kernel.org
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/i2c/busses/i2c-virtio.c
+F:	include/uapi/linux/virtio_i2c.h
+
 VIRTIO INPUT DRIVER
 M:	Gerd Hoffmann <kraxel@redhat.com>
 S:	Maintained
@@ -20220,6 +20237,13 @@ W:	https://virtio-mem.gitlab.io/
 F:	drivers/virtio/virtio_mem.c
 F:	include/uapi/linux/virtio_mem.h

+VIRTIO PMEM DRIVER
+M:	Pankaj Gupta <pankaj.gupta.linux@gmail.com>
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/nvdimm/virtio_pmem.c
+F:	drivers/nvdimm/nd_virtio.c
+
 VIRTIO SOUND DRIVER
 M:	Anton Yakovlev <anton.yakovlev@opensynergy.com>
 M:	"Michael S. Tsirkin" <mst@redhat.com>
@@ -20229,22 +20253,6 @@ S:	Maintained
 F:	include/uapi/linux/virtio_snd.h
 F:	sound/virtio/*

-VIRTIO I2C DRIVER
-M:	Conghui Chen <conghui.chen@intel.com>
-M:	Viresh Kumar <viresh.kumar@linaro.org>
-L:	linux-i2c@vger.kernel.org
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/i2c/busses/i2c-virtio.c
-F:	include/uapi/linux/virtio_i2c.h
-
-VIRTIO PMEM DRIVER
-M:	Pankaj Gupta <pankaj.gupta.linux@gmail.com>
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/nvdimm/virtio_pmem.c
-F:	drivers/nvdimm/nd_virtio.c
-
 VIRTUAL BOX GUEST DEVICE DRIVER
 M:	Hans de Goede <hdegoede@redhat.com>
 M:	Arnd Bergmann <arnd@arndb.de>
@@ -20274,14 +20282,6 @@ W:	https://linuxtv.org
 T:	git git://linuxtv.org/media_tree.git
 F:	drivers/media/test-drivers/vivid/*

-VIDTV VIRTUAL DIGITAL TV DRIVER
-M:	Daniel W. S. Almeida <dwlsalmeida@gmail.com>
-L:	linux-media@vger.kernel.org
-S:	Maintained
-W:	https://linuxtv.org
-T:	git git://linuxtv.org/media_tree.git
-F:	drivers/media/test-drivers/vidtv/*
-
 VLYNQ BUS
 M:	Florian Fainelli <f.fainelli@gmail.com>
 L:	openwrt-devel@lists.openwrt.org (subscribers-only)
@@ -20289,18 +20289,6 @@ S:	Maintained
 F:	drivers/vlynq/vlynq.c
 F:	include/linux/vlynq.h

-VME SUBSYSTEM
-M:	Martyn Welch <martyn@welchs.me.uk>
-M:	Manohar Vanga <manohar.vanga@gmail.com>
-M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-L:	linux-kernel@vger.kernel.org
-S:	Maintained
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
-F:	Documentation/driver-api/vme.rst
-F:	drivers/staging/vme/
-F:	drivers/vme/
-F:	include/linux/vme*
-
 VM SOCKETS (AF_VSOCK)
 M:	Stefano Garzarella <sgarzare@redhat.com>
 L:	virtualization@lists.linux-foundation.org
@@ -20314,6 +20302,18 @@ F:	include/uapi/linux/vsockmon.h
 F:	net/vmw_vsock/
 F:	tools/testing/vsock/

+VME SUBSYSTEM
+M:	Martyn Welch <martyn@welchs.me.uk>
+M:	Manohar Vanga <manohar.vanga@gmail.com>
+M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
+F:	Documentation/driver-api/vme.rst
+F:	drivers/staging/vme/
+F:	drivers/vme/
+F:	include/linux/vme*
+
 VMWARE BALLOON DRIVER
 M:	Nadav Amit <namit@vmware.com>
 M:	"VMware, Inc." <pv-drivers@vmware.com>
--
2.30.2



* [PATCH v1 1/1] MAINTAINERS: Sort sections with parse-maintainers.pl help
@ 2021-11-17 19:05  3% Andy Shevchenko
From: Andy Shevchenko @ 2021-11-17 19:05 UTC
  To: linux-kernel, linux-riscv, llvm
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Nathan Chancellor,
	Nick Desaulniers, Joe Perches, Linus Torvalds, Andy Shevchenko

Sort sections with the help of parse-maintainers.pl, since quite a few
have become unsorted since the previous run.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 MAINTAINERS | 710 ++++++++++++++++++++++++++--------------------------
 1 file changed, 355 insertions(+), 355 deletions(-)
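
For review purposes, here is a minimal sketch of the ordering the hunks
below appear to restore. It assumes section titles are compared
case-insensitively, character by character, which matches the moves in
this diff (for example "Microchip Timer Counter Block" landing among the
upper-case MICROCHIP entries, and "NXP TFA9879" < "NXP-NCI" <
"NXP/Goodix"). The helpers sort_key() and unsorted_titles() are
hypothetical names, not part of parse-maintainers.pl, and the title list
is abbreviated from the first hunks.

```python
def sort_key(title: str) -> str:
    """Assumed ordering: upper-case the title, then compare as plain text."""
    return title.upper()


def unsorted_titles(titles):
    """Return every title that sorts before the title directly above it."""
    return [cur for prev, cur in zip(titles, titles[1:])
            if sort_key(cur) < sort_key(prev)]


# Abbreviated pre-patch order of a few AMD sections (illustrative only).
before = [
    "AMD SPI DRIVER",
    "AMD MP2 I2C DRIVER",
    "AMD XGBE DRIVER",
    "AMD SENSOR FUSION HUB DRIVER",
]
after = sorted(before, key=sort_key)

# Punctuation sorts by character code: space, then '-', then '/'.
nxp = sorted(
    ["NXP-NCI NFC DRIVER",
     "NXP/Goodix TFA989X (TFA1) DRIVER",
     "NXP TFA9879 DRIVER"],
    key=sort_key,
)

print(unsorted_titles(before))
print(after)
print(nxp)
```

Running unsorted_titles() over the titles of a tree gives a quick hint of
whether another sorting pass is due; it only looks at adjacent pairs and
does not check the order of fields within a section.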

diff --git a/MAINTAINERS b/MAINTAINERS
index 7a2345ce8521..1154c83ee3c5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -967,11 +967,6 @@ F:	drivers/gpu/drm/amd/include/v9_structs.h
 F:	drivers/gpu/drm/amd/include/vi_structs.h
 F:	include/uapi/linux/kfd_ioctl.h
 
-AMD SPI DRIVER
-M:	Sanjay R Mehta <sanju.mehta@amd.com>
-S:	Maintained
-F:	drivers/spi/spi-amd.c
-
 AMD MP2 I2C DRIVER
 M:	Elie Morisse <syniurge@gmail.com>
 M:	Nehal Shah <nehal-bakulchandra.shah@amd.com>
@@ -1006,13 +1001,6 @@ M:	Tom Lendacky <thomas.lendacky@amd.com>
 S:	Supported
 F:	arch/arm64/boot/dts/amd/
 
-AMD XGBE DRIVER
-M:	Tom Lendacky <thomas.lendacky@amd.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-F:	arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi
-F:	drivers/net/ethernet/amd/xgbe/
-
 AMD SENSOR FUSION HUB DRIVER
 M:	Nehal Shah <nehal-bakulchandra.shah@amd.com>
 M:	Basavaraj Natikar <basavaraj.natikar@amd.com>
@@ -1021,6 +1009,18 @@ S:	Maintained
 F:	Documentation/hid/amd-sfh*
 F:	drivers/hid/amd-sfh-hid/
 
+AMD SPI DRIVER
+M:	Sanjay R Mehta <sanju.mehta@amd.com>
+S:	Maintained
+F:	drivers/spi/spi-amd.c
+
+AMD XGBE DRIVER
+M:	Tom Lendacky <thomas.lendacky@amd.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi
+F:	drivers/net/ethernet/amd/xgbe/
+
 AMS AS73211 DRIVER
 M:	Christian Eggers <ceggers@arri.de>
 L:	linux-iio@vger.kernel.org
@@ -1409,6 +1409,16 @@ S:	Maintained
 F:	drivers/net/arcnet/
 F:	include/uapi/linux/if_arcnet.h
 
+ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
+M:	Arnd Bergmann <arnd@arndb.de>
+M:	Olof Johansson <olof@lixom.net>
+M:	soc@kernel.org
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
+F:	arch/arm/boot/dts/Makefile
+F:	arch/arm64/boot/dts/Makefile
+
 ARM ARCHITECTED TIMER DRIVER
 M:	Mark Rutland <mark.rutland@arm.com>
 M:	Marc Zyngier <maz@kernel.org>
@@ -1525,22 +1535,6 @@ S:	Odd Fixes
 F:	drivers/amba/
 F:	include/linux/amba/bus.h
 
-ARM PRIMECELL PL35X NAND CONTROLLER DRIVER
-M:	Miquel Raynal <miquel.raynal@bootlin.com>
-M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
-L:	linux-mtd@lists.infradead.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/mtd/arm,pl353-nand-r2p1.yaml
-F:	drivers/mtd/nand/raw/pl35x-nand-controller.c
-
-ARM PRIMECELL PL35X SMC DRIVER
-M:	Miquel Raynal <miquel.raynal@bootlin.com>
-M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-F:	Documentation/devicetree/bindings/memory-controllers/arm,pl353-smc.yaml
-F:	drivers/memory/pl353-smc.c
-
 ARM PRIMECELL CLCD PL110 DRIVER
 M:	Russell King <linux@armlinux.org.uk>
 S:	Odd Fixes
@@ -1558,6 +1552,22 @@ S:	Odd Fixes
 F:	drivers/mmc/host/mmci.*
 F:	include/linux/amba/mmci.h
 
+ARM PRIMECELL PL35X NAND CONTROLLER DRIVER
+M:	Miquel Raynal <miquel.raynal@bootlin.com>
+M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
+L:	linux-mtd@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/mtd/arm,pl353-nand-r2p1.yaml
+F:	drivers/mtd/nand/raw/pl35x-nand-controller.c
+
+ARM PRIMECELL PL35X SMC DRIVER
+M:	Miquel Raynal <miquel.raynal@bootlin.com>
+M:	Naga Sureshkumar Relli <nagasure@xilinx.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+F:	Documentation/devicetree/bindings/memory-controllers/arm,pl353-smc.yaml
+F:	drivers/memory/pl353-smc.c
+
 ARM PRIMECELL SSP PL022 SPI DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
@@ -1594,16 +1604,6 @@ F:	Documentation/devicetree/bindings/iommu/arm,smmu*
 F:	drivers/iommu/arm/
 F:	drivers/iommu/io-pgtable-arm*
 
-ARM AND ARM64 SoC SUB-ARCHITECTURES (COMMON PARTS)
-M:	Arnd Bergmann <arnd@arndb.de>
-M:	Olof Johansson <olof@lixom.net>
-M:	soc@kernel.org
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git
-F:	arch/arm/boot/dts/Makefile
-F:	arch/arm64/boot/dts/Makefile
-
 ARM SUB-ARCHITECTURES
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
@@ -2256,13 +2256,6 @@ F:	arch/arm64/boot/dts/microchip/
 F:	drivers/pinctrl/pinctrl-microchip-sgpio.c
 N:	sparx5
 
-Microchip Timer Counter Block (TCB) Capture Driver
-M:	Kamel Bouhara <kamel.bouhara@bootlin.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-L:	linux-iio@vger.kernel.org
-S:	Maintained
-F:	drivers/counter/microchip-tcb-capture.c
-
 ARM/MIOA701 MACHINE SUPPORT
 M:	Robert Jarzmik <robert.jarzmik@free.fr>
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
@@ -4095,29 +4088,6 @@ W:	https://github.com/Cascoda/ca8210-linux.git
 F:	Documentation/devicetree/bindings/net/ieee802154/ca8210.txt
 F:	drivers/net/ieee802154/ca8210.c
 
-CANAAN/KENDRYTE K210 SOC FPIOA DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-riscv@lists.infradead.org
-L:	linux-gpio@vger.kernel.org (pinctrl driver)
-F:	Documentation/devicetree/bindings/pinctrl/canaan,k210-fpioa.yaml
-F:	drivers/pinctrl/pinctrl-k210.c
-
-CANAAN/KENDRYTE K210 SOC RESET CONTROLLER DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-kernel@vger.kernel.org
-L:	linux-riscv@lists.infradead.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/reset/canaan,k210-rst.yaml
-F:	drivers/reset/reset-k210.c
-
-CANAAN/KENDRYTE K210 SOC SYSTEM CONTROLLER DRIVER
-M:	Damien Le Moal <damien.lemoal@wdc.com>
-L:	linux-riscv@lists.infradead.org
-S:	Maintained
-F:      Documentation/devicetree/bindings/mfd/canaan,k210-sysctl.yaml
-F:	drivers/soc/canaan/
-F:	include/soc/canaan/
-
 CACHEFILES: FS-CACHE BACKEND FOR CACHING ON MOUNTED FILESYSTEMS
 M:	David Howells <dhowells@redhat.com>
 L:	linux-cachefs@redhat.com (moderated for non-subscribers)
@@ -4240,6 +4210,29 @@ F:	Documentation/networking/j1939.rst
 F:	include/uapi/linux/can/j1939.h
 F:	net/can/j1939/
 
+CANAAN/KENDRYTE K210 SOC FPIOA DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-riscv@lists.infradead.org
+L:	linux-gpio@vger.kernel.org (pinctrl driver)
+F:	Documentation/devicetree/bindings/pinctrl/canaan,k210-fpioa.yaml
+F:	drivers/pinctrl/pinctrl-k210.c
+
+CANAAN/KENDRYTE K210 SOC RESET CONTROLLER DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-kernel@vger.kernel.org
+L:	linux-riscv@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/reset/canaan,k210-rst.yaml
+F:	drivers/reset/reset-k210.c
+
+CANAAN/KENDRYTE K210 SOC SYSTEM CONTROLLER DRIVER
+M:	Damien Le Moal <damien.lemoal@wdc.com>
+L:	linux-riscv@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/mfd/canaan,k210-sysctl.yaml
+F:	drivers/soc/canaan/
+F:	include/soc/canaan/
+
 CAPABILITIES
 M:	Serge Hallyn <serge@hallyn.com>
 L:	linux-security-module@vger.kernel.org
@@ -4489,17 +4482,17 @@ F:	drivers/power/supply/cros_usbpd-charger.c
 N:	cros_ec
 N:	cros-ec
 
-CHROMEOS EC USB TYPE-C DRIVER
-M:	Prashant Malani <pmalani@chromium.org>
-S:	Maintained
-F:	drivers/platform/chrome/cros_ec_typec.c
-
 CHROMEOS EC USB PD NOTIFY DRIVER
 M:	Prashant Malani <pmalani@chromium.org>
 S:	Maintained
 F:	drivers/platform/chrome/cros_usbpd_notify.c
 F:	include/linux/platform_data/cros_usbpd_notify.h
 
+CHROMEOS EC USB TYPE-C DRIVER
+M:	Prashant Malani <pmalani@chromium.org>
+S:	Maintained
+F:	drivers/platform/chrome/cros_ec_typec.c
+
 CHRONTEL CH7322 CEC DRIVER
 M:	Joe Tessler <jrt@google.com>
 L:	linux-media@vger.kernel.org
@@ -4604,6 +4597,18 @@ M:	Nelson Escobar <neescoba@cisco.com>
 S:	Supported
 F:	drivers/infiniband/hw/usnic/
 
+CLANG CONTROL FLOW INTEGRITY SUPPORT
+M:	Sami Tolvanen <samitolvanen@google.com>
+M:	Kees Cook <keescook@chromium.org>
+R:	Nathan Chancellor <nathan@kernel.org>
+R:	Nick Desaulniers <ndesaulniers@google.com>
+L:	llvm@lists.linux.dev
+S:	Supported
+B:	https://github.com/ClangBuiltLinux/linux/issues
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features
+F:	include/linux/cfi.h
+F:	kernel/cfi.c
+
 CLANG-FORMAT FILE
 M:	Miguel Ojeda <ojeda@kernel.org>
 S:	Maintained
@@ -4623,18 +4628,6 @@ F:	scripts/Makefile.clang
 F:	scripts/clang-tools/
 K:	\b(?i:clang|llvm)\b
 
-CLANG CONTROL FLOW INTEGRITY SUPPORT
-M:	Sami Tolvanen <samitolvanen@google.com>
-M:	Kees Cook <keescook@chromium.org>
-R:	Nathan Chancellor <nathan@kernel.org>
-R:	Nick Desaulniers <ndesaulniers@google.com>
-L:	llvm@lists.linux.dev
-S:	Supported
-B:	https://github.com/ClangBuiltLinux/linux/issues
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features
-F:	include/linux/cfi.h
-F:	kernel/cfi.c
-
 CLEANCACHE API
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 L:	linux-kernel@vger.kernel.org
@@ -5117,6 +5110,13 @@ S:	Supported
 W:	http://www.chelsio.com
 F:	drivers/crypto/chelsio
 
+CXGB4 ETHERNET DRIVER (CXGB4)
+M:	Raju Rangoju <rajur@chelsio.com>
+L:	netdev@vger.kernel.org
+S:	Supported
+W:	http://www.chelsio.com
+F:	drivers/net/ethernet/chelsio/cxgb4/
+
 CXGB4 INLINE CRYPTO DRIVER
 M:	Ayush Sawal <ayush.sawal@chelsio.com>
 M:	Vinay Kumar Yadav <vinay.yadav@chelsio.com>
@@ -5126,13 +5126,6 @@ S:	Supported
 W:	http://www.chelsio.com
 F:	drivers/net/ethernet/chelsio/inline_crypto/
 
-CXGB4 ETHERNET DRIVER (CXGB4)
-M:	Raju Rangoju <rajur@chelsio.com>
-L:	netdev@vger.kernel.org
-S:	Supported
-W:	http://www.chelsio.com
-F:	drivers/net/ethernet/chelsio/cxgb4/
-
 CXGB4 ISCSI DRIVER (CXGB4I)
 M:	Karen Xie <kxie@chelsio.com>
 L:	linux-scsi@vger.kernel.org
@@ -5188,16 +5181,6 @@ CYCLADES PC300 DRIVER
 S:	Orphan
 F:	drivers/net/wan/pc300*
 
-CYPRESS_FIRMWARE MEDIA DRIVER
-M:	Antti Palosaari <crope@iki.fi>
-L:	linux-media@vger.kernel.org
-S:	Maintained
-W:	https://linuxtv.org
-W:	http://palosaari.fi/linux/
-Q:	http://patchwork.linuxtv.org/project/linux-media/list/
-T:	git git://linuxtv.org/anttip/media_tree.git
-F:	drivers/media/common/cypress_firmware*
-
 CYPRESS CY8CTMA140 TOUCHSCREEN DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-input@vger.kernel.org
@@ -5211,6 +5194,16 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/input/cypress-sf.yaml
 F:	drivers/input/keyboard/cypress-sf.c
 
+CYPRESS_FIRMWARE MEDIA DRIVER
+M:	Antti Palosaari <crope@iki.fi>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+W:	https://linuxtv.org
+W:	http://palosaari.fi/linux/
+Q:	http://patchwork.linuxtv.org/project/linux-media/list/
+T:	git git://linuxtv.org/anttip/media_tree.git
+F:	drivers/media/common/cypress_firmware*
+
 CYTTSP TOUCHSCREEN DRIVER
 M:	Linus Walleij <linus.walleij@linaro.org>
 L:	linux-input@vger.kernel.org
@@ -5381,14 +5374,12 @@ L:	Dell.Client.Kernel@dell.com
 S:	Maintained
 F:	drivers/platform/x86/dell/dell-wmi-descriptor.c
 
-DELL WMI SYSMAN DRIVER
-M:	Divya Bharathi <divya.bharathi@dell.com>
-M:	Prasanth Ksr <prasanth.ksr@dell.com>
+DELL WMI HARDWARE PRIVACY SUPPORT
+M:	Perry Yuan <Perry.Yuan@dell.com>
 L:	Dell.Client.Kernel@dell.com
 L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
-F:	Documentation/ABI/testing/sysfs-class-firmware-attributes
-F:	drivers/platform/x86/dell/dell-wmi-sysman/
+F:	drivers/platform/x86/dell/dell-wmi-privacy.c
 
 DELL WMI NOTIFICATIONS DRIVER
 M:	Matthew Garrett <mjg59@srcf.ucam.org>
@@ -5396,12 +5387,21 @@ M:	Pali Rohár <pali@kernel.org>
 S:	Maintained
 F:	drivers/platform/x86/dell/dell-wmi-base.c
 
-DELL WMI HARDWARE PRIVACY SUPPORT
-M:	Perry Yuan <Perry.Yuan@dell.com>
+DELL WMI SYSMAN DRIVER
+M:	Divya Bharathi <divya.bharathi@dell.com>
+M:	Prasanth Ksr <prasanth.ksr@dell.com>
 L:	Dell.Client.Kernel@dell.com
 L:	platform-driver-x86@vger.kernel.org
 S:	Maintained
-F:	drivers/platform/x86/dell/dell-wmi-privacy.c
+F:	Documentation/ABI/testing/sysfs-class-firmware-attributes
+F:	drivers/platform/x86/dell/dell-wmi-sysman/
+
+DELTA DPS920AB PSU DRIVER
+M:	Robert Marko <robert.marko@sartura.hr>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/hwmon/dps920ab.rst
+F:	drivers/hwmon/pmbus/dps920ab.c
 
 DELTA ST MEDIA DRIVER
 M:	Hugues Fruchet <hugues.fruchet@foss.st.com>
@@ -5411,13 +5411,6 @@ W:	https://linuxtv.org
 T:	git git://linuxtv.org/media_tree.git
 F:	drivers/media/platform/sti/delta
 
-DELTA DPS920AB PSU DRIVER
-M:	Robert Marko <robert.marko@sartura.hr>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/hwmon/dps920ab.rst
-F:	drivers/hwmon/pmbus/dps920ab.c
-
 DENALI NAND DRIVER
 L:	linux-mtd@lists.infradead.org
 S:	Orphan
@@ -5430,13 +5423,6 @@ S:	Maintained
 F:	drivers/dma/dw-edma/
 F:	include/linux/dma/edma.h
 
-DESIGNWARE XDATA IP DRIVER
-M:	Gustavo Pimentel <gustavo.pimentel@synopsys.com>
-L:	linux-pci@vger.kernel.org
-S:	Maintained
-F:	Documentation/misc-devices/dw-xdata-pcie.rst
-F:	drivers/misc/dw-xdata-pcie.c
-
 DESIGNWARE USB2 DRD IP DRIVER
 M:	Minas Harutyunyan <hminas@synopsys.com>
 L:	linux-usb@vger.kernel.org
@@ -5451,6 +5437,13 @@ S:	Maintained
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb.git
 F:	drivers/usb/dwc3/
 
+DESIGNWARE XDATA IP DRIVER
+M:	Gustavo Pimentel <gustavo.pimentel@synopsys.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+F:	Documentation/misc-devices/dw-xdata-pcie.rst
+F:	drivers/misc/dw-xdata-pcie.c
+
 DEVANTECH SRF ULTRASONIC RANGER IIO DRIVER
 M:	Andreas Klinger <ak@it-klinger.de>
 L:	linux-iio@vger.kernel.org
@@ -5680,6 +5673,12 @@ F:	include/linux/dma/
 F:	include/linux/dmaengine.h
 F:	include/linux/of_dma.h
 
+DMA MAPPING BENCHMARK
+M:	Barry Song <song.bao.hua@hisilicon.com>
+L:	iommu@lists.linux-foundation.org
+F:	kernel/dma/map_benchmark.c
+F:	tools/testing/selftests/dma/
+
 DMA MAPPING HELPERS
 M:	Christoph Hellwig <hch@lst.de>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
@@ -5694,12 +5693,6 @@ F:	include/linux/dma-mapping.h
 F:	include/linux/dma-map-ops.h
 F:	kernel/dma/
 
-DMA MAPPING BENCHMARK
-M:	Barry Song <song.bao.hua@hisilicon.com>
-L:	iommu@lists.linux-foundation.org
-F:	kernel/dma/map_benchmark.c
-F:	tools/testing/selftests/dma/
-
 DMA-BUF HEAPS FRAMEWORK
 M:	Sumit Semwal <sumit.semwal@linaro.org>
 R:	Benjamin Gaignard <benjamin.gaignard@linaro.org>
@@ -5981,6 +5974,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/himax,hx8357d.txt
 F:	drivers/gpu/drm/tiny/hx8357d.c
 
+DRM DRIVER FOR HYPERV SYNTHETIC VIDEO DEVICE
+M:	Deepak Rawat <drawat.floss@gmail.com>
+L:	linux-hyperv@vger.kernel.org
+L:	dri-devel@lists.freedesktop.org
+S:	Maintained
+T:	git git://anongit.freedesktop.org/drm/drm-misc
+F:	drivers/gpu/drm/hyperv
+
 DRM DRIVER FOR ILITEK ILI9225 PANELS
 M:	David Lechner <david@lechnology.com>
 S:	Maintained
@@ -6126,14 +6127,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/display/panel/samsung,s6d27a1.yaml
 F:	drivers/gpu/drm/panel/panel-samsung-s6d27a1.c
 
-DRM DRIVER FOR SITRONIX ST7703 PANELS
-M:	Guido Günther <agx@sigxcpu.org>
-R:	Purism Kernel Team <kernel@puri.sm>
-R:	Ondrej Jirman <megous@megous.com>
-S:	Maintained
-F:	Documentation/devicetree/bindings/display/panel/rocktech,jh057n00900.yaml
-F:	drivers/gpu/drm/panel/panel-sitronix-st7703.c
-
 DRM DRIVER FOR SAVAGE VIDEO CARDS
 S:	Orphan / Obsolete
 F:	drivers/gpu/drm/savage/
@@ -6164,6 +6157,14 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/display/panel/sitronix,st7701.yaml
 F:	drivers/gpu/drm/panel/panel-sitronix-st7701.c
 
+DRM DRIVER FOR SITRONIX ST7703 PANELS
+M:	Guido Günther <agx@sigxcpu.org>
+R:	Purism Kernel Team <kernel@puri.sm>
+R:	Ondrej Jirman <megous@megous.com>
+S:	Maintained
+F:	Documentation/devicetree/bindings/display/panel/rocktech,jh057n00900.yaml
+F:	drivers/gpu/drm/panel/panel-sitronix-st7703.c
+
 DRM DRIVER FOR SITRONIX ST7735R PANELS
 M:	David Lechner <david@lechnology.com>
 S:	Maintained
@@ -6358,14 +6359,6 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/hisilicon/
 F:	drivers/gpu/drm/hisilicon/
 
-DRM DRIVER FOR HYPERV SYNTHETIC VIDEO DEVICE
-M:	Deepak Rawat <drawat.floss@gmail.com>
-L:	linux-hyperv@vger.kernel.org
-L:	dri-devel@lists.freedesktop.org
-S:	Maintained
-T:	git git://anongit.freedesktop.org/drm/drm-misc
-F:	drivers/gpu/drm/hyperv
-
 DRM DRIVERS FOR LIMA
 M:	Qiang Yu <yuq825@gmail.com>
 L:	dri-devel@lists.freedesktop.org
@@ -6513,6 +6506,14 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	Documentation/devicetree/bindings/display/xlnx/
 F:	drivers/gpu/drm/xlnx/
 
+DRM GPU SCHEDULER
+M:	Andrey Grodzovsky <andrey.grodzovsky@amd.com>
+L:	dri-devel@lists.freedesktop.org
+S:	Maintained
+T:	git git://anongit.freedesktop.org/drm/drm-misc
+F:	drivers/gpu/drm/scheduler/
+F:	include/drm/gpu_scheduler.h
+
 DRM PANEL DRIVERS
 M:	Thierry Reding <thierry.reding@gmail.com>
 R:	Sam Ravnborg <sam@ravnborg.org>
@@ -6533,14 +6534,6 @@ T:	git git://anongit.freedesktop.org/drm/drm-misc
 F:	drivers/gpu/drm/ttm/
 F:	include/drm/ttm/
 
-DRM GPU SCHEDULER
-M:	Andrey Grodzovsky <andrey.grodzovsky@amd.com>
-L:	dri-devel@lists.freedesktop.org
-S:	Maintained
-T:	git git://anongit.freedesktop.org/drm/drm-misc
-F:	drivers/gpu/drm/scheduler/
-F:	include/drm/gpu_scheduler.h
-
 DSBR100 USB FM RADIO DRIVER
 M:	Alexey Klimov <klimov.linux@gmail.com>
 L:	linux-media@vger.kernel.org
@@ -6679,6 +6672,15 @@ F:	Documentation/networking/net_dim.rst
 F:	include/linux/dim.h
 F:	lib/dim/
 
+DYNAMIC THERMAL POWER MANAGEMENT (DTPM)
+M:	Daniel Lezcano <daniel.lezcano@kernel.org>
+L:	linux-pm@vger.kernel.org
+S:	Supported
+B:	https://bugzilla.kernel.org
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
+F:	drivers/powercap/dtpm*
+F:	include/linux/dtpm.h
+
 DZ DECSTATION DZ11 SERIAL DRIVER
 M:	"Maciej W. Rozycki" <macro@orcam.me.uk>
 S:	Maintained
@@ -7026,22 +7028,22 @@ W:	http://www.broadcom.com
 F:	drivers/infiniband/hw/ocrdma/
 F:	include/uapi/rdma/ocrdma-abi.h
 
-EMULEX/BROADCOM LPFC FC/FCOE SCSI DRIVER
+EMULEX/BROADCOM EFCT FC/FCOE SCSI TARGET DRIVER
 M:	James Smart <james.smart@broadcom.com>
-M:	Dick Kennedy <dick.kennedy@broadcom.com>
+M:	Ram Vegesna <ram.vegesna@broadcom.com>
 L:	linux-scsi@vger.kernel.org
+L:	target-devel@vger.kernel.org
 S:	Supported
 W:	http://www.broadcom.com
-F:	drivers/scsi/lpfc/
+F:	drivers/scsi/elx/
 
-EMULEX/BROADCOM EFCT FC/FCOE SCSI TARGET DRIVER
+EMULEX/BROADCOM LPFC FC/FCOE SCSI DRIVER
 M:	James Smart <james.smart@broadcom.com>
-M:	Ram Vegesna <ram.vegesna@broadcom.com>
+M:	Dick Kennedy <dick.kennedy@broadcom.com>
 L:	linux-scsi@vger.kernel.org
-L:	target-devel@vger.kernel.org
 S:	Supported
 W:	http://www.broadcom.com
-F:	drivers/scsi/elx/
+F:	drivers/scsi/lpfc/
 
 ENE CB710 FLASH CARD READER DRIVER
 M:	Michał Mirosław <mirq-linux@rere.qmqm.pl>
@@ -8518,6 +8520,12 @@ W:	http://www.highpoint-tech.com
 F:	Documentation/scsi/hptiop.rst
 F:	drivers/scsi/hptiop.c
 
+HIKEY960 ONBOARD USB GPIO HUB DRIVER
+M:	John Stultz <john.stultz@linaro.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	drivers/misc/hisi_hikey_usb.c
+
 HIPPI
 M:	Jes Sorensen <jes@trained-monkey.org>
 L:	linux-hippi@sunsite.dk
@@ -8588,12 +8596,6 @@ W:	http://www.hisilicon.com
 F:	Documentation/devicetree/bindings/net/hisilicon*.txt
 F:	drivers/net/ethernet/hisilicon/
 
-HIKEY960 ONBOARD USB GPIO HUB DRIVER
-M:	John Stultz <john.stultz@linaro.org>
-L:	linux-kernel@vger.kernel.org
-S:	Maintained
-F:	drivers/misc/hisi_hikey_usb.c
-
 HISILICON PMU DRIVER
 M:	Shaokun Zhang <zhangshaokun@hisilicon.com>
 S:	Supported
@@ -9619,18 +9621,18 @@ F:	Documentation/admin-guide/media/ipu3_rcb.svg
 F:	Documentation/userspace-api/media/v4l/pixfmt-meta-intel-ipu3.rst
 F:	drivers/staging/media/ipu3/
 
-INTEL IXP4XX CRYPTO SUPPORT
-M:	Corentin Labbe <clabbe@baylibre.com>
-L:	linux-crypto@vger.kernel.org
-S:	Maintained
-F:	drivers/crypto/ixp4xx_crypto.c
-
 INTEL ISHTP ECLITE DRIVER
 M:	Sumesh K Naduvalath <sumesh.k.naduvalath@intel.com>
 L:	platform-driver-x86@vger.kernel.org
 S:	Supported
 F:	drivers/platform/x86/intel/ishtp_eclite.c
 
+INTEL IXP4XX CRYPTO SUPPORT
+M:	Corentin Labbe <clabbe@baylibre.com>
+L:	linux-crypto@vger.kernel.org
+S:	Maintained
+F:	drivers/crypto/ixp4xx_crypto.c
+
 INTEL IXP4XX QMGR, NPE, ETHERNET and HSS SUPPORT
 M:	Krzysztof Halasa <khalasa@piap.pl>
 S:	Maintained
@@ -9773,6 +9775,21 @@ S:	Maintained
 F:	arch/x86/include/asm/intel_scu_ipc.h
 F:	drivers/platform/x86/intel_scu_*
 
+INTEL SGX
+M:	Jarkko Sakkinen <jarkko@kernel.org>
+R:	Dave Hansen <dave.hansen@linux.intel.com>
+L:	linux-sgx@vger.kernel.org
+S:	Supported
+Q:	https://patchwork.kernel.org/project/intel-sgx/list/
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/sgx
+F:	Documentation/x86/sgx.rst
+F:	arch/x86/entry/vdso/vsgx.S
+F:	arch/x86/include/asm/sgx.h
+F:	arch/x86/include/uapi/asm/sgx.h
+F:	arch/x86/kernel/cpu/sgx/*
+F:	tools/testing/selftests/sgx/*
+K:	\bSGX_
+
 INTEL SKYLAKE INT3472 ACPI DEVICE DRIVER
 M:	Daniel Scally <djrscally@gmail.com>
 S:	Maintained
@@ -9867,21 +9884,6 @@ F:	Documentation/x86/intel_txt.rst
 F:	arch/x86/kernel/tboot.c
 F:	include/linux/tboot.h
 
-INTEL SGX
-M:	Jarkko Sakkinen <jarkko@kernel.org>
-R:	Dave Hansen <dave.hansen@linux.intel.com>
-L:	linux-sgx@vger.kernel.org
-S:	Supported
-Q:	https://patchwork.kernel.org/project/intel-sgx/list/
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/sgx
-F:	Documentation/x86/sgx.rst
-F:	arch/x86/entry/vdso/vsgx.S
-F:	arch/x86/include/asm/sgx.h
-F:	arch/x86/include/uapi/asm/sgx.h
-F:	arch/x86/kernel/cpu/sgx/*
-F:	tools/testing/selftests/sgx/*
-K:	\bSGX_
-
 INTERCONNECT API
 M:	Georgi Djakov <djakov@kernel.org>
 L:	linux-pm@vger.kernel.org
@@ -11287,6 +11289,12 @@ F:	drivers/mailbox/arm_mhuv2.c
 F:	include/linux/mailbox/arm_mhuv2_message.h
 F:	Documentation/devicetree/bindings/mailbox/arm,mhuv2.yaml
 
+MAN-PAGES: MANUAL PAGES FOR LINUX -- Sections 2, 3, 4, 5, and 7
+M:	Michael Kerrisk <mtk.manpages@gmail.com>
+L:	linux-man@vger.kernel.org
+S:	Maintained
+W:	http://www.kernel.org/doc/man-pages
+
 MANAGEMENT COMPONENT TRANSPORT PROTOCOL (MCTP)
 M:	Jeremy Kerr <jk@codeconstruct.com.au>
 M:	Matt Johnston <matt@codeconstruct.com.au>
@@ -11299,12 +11307,6 @@ F:	include/net/mctpdevice.h
 F:	include/net/netns/mctp.h
 F:	net/mctp/
 
-MAN-PAGES: MANUAL PAGES FOR LINUX -- Sections 2, 3, 4, 5, and 7
-M:	Michael Kerrisk <mtk.manpages@gmail.com>
-L:	linux-man@vger.kernel.org
-S:	Maintained
-W:	http://www.kernel.org/doc/man-pages
-
 MARDUK (CREATOR CI40) DEVICE TREE SUPPORT
 M:	Rahul Bedarkar <rahulbedarkar89@gmail.com>
 L:	linux-mips@vger.kernel.org
@@ -11623,12 +11625,6 @@ L:	netdev@vger.kernel.org
 S:	Supported
 F:	drivers/net/phy/mxl-gpy.c
 
-MCBA MICROCHIP CAN BUS ANALYZER TOOL DRIVER
-R:	Yasushi SHOJI <yashi@spacecubics.com>
-L:	linux-can@vger.kernel.org
-S:	Maintained
-F:	drivers/net/can/usb/mcba_usb.c
-
 MCAN MMIO DEVICE DRIVER
 M:	Chandrasekar Ramakrishnan <rcsekar@samsung.com>
 L:	linux-can@vger.kernel.org
@@ -11638,6 +11634,12 @@ F:	drivers/net/can/m_can/m_can.c
 F:	drivers/net/can/m_can/m_can.h
 F:	drivers/net/can/m_can/m_can_platform.c
 
+MCBA MICROCHIP CAN BUS ANALYZER TOOL DRIVER
+R:	Yasushi SHOJI <yashi@spacecubics.com>
+L:	linux-can@vger.kernel.org
+S:	Maintained
+F:	drivers/net/can/usb/mcba_usb.c
+
 MCP2221A MICROCHIP USB-HID TO I2C BRIDGE DRIVER
 M:	Rishi Gupta <gupt21@gmail.com>
 L:	linux-i2c@vger.kernel.org
@@ -12030,13 +12032,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/clock/mediatek,mt7621-sysc.yaml
 F:	drivers/clk/ralink/clk-mt7621.c
 
-MEDIATEK MT7621/28/88 I2C DRIVER
-M:	Stefan Roese <sr@denx.de>
-L:	linux-i2c@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/i2c/i2c-mt7621.txt
-F:	drivers/i2c/busses/i2c-mt7621.c
-
 MEDIATEK MT7621 PCIE CONTROLLER DRIVER
 M:	Sergio Paracuellos <sergio.paracuellos@gmail.com>
 S:	Maintained
@@ -12049,6 +12044,13 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/phy/mediatek,mt7621-pci-phy.yaml
 F:	drivers/phy/ralink/phy-mt7621-pci.c
 
+MEDIATEK MT7621/28/88 I2C DRIVER
+M:	Stefan Roese <sr@denx.de>
+L:	linux-i2c@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/i2c/i2c-mt7621.txt
+F:	drivers/i2c/busses/i2c-mt7621.c
+
 MEDIATEK NAND CONTROLLER DRIVER
 L:	linux-mtd@lists.infradead.org
 S:	Orphan
@@ -12580,6 +12582,13 @@ S:	Supported
 F:	drivers/misc/atmel-ssc.c
 F:	include/linux/atmel-ssc.h
 
+Microchip Timer Counter Block (TCB) Capture Driver
+M:	Kamel Bouhara <kamel.bouhara@bootlin.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+L:	linux-iio@vger.kernel.org
+S:	Maintained
+F:	drivers/counter/microchip-tcb-capture.c
+
 MICROCHIP USB251XB DRIVER
 M:	Richard Leitner <richard.leitner@skidata.com>
 L:	linux-usb@vger.kernel.org
@@ -13680,13 +13689,6 @@ F:	drivers/iio/gyro/fxas21002c_core.c
 F:	drivers/iio/gyro/fxas21002c_i2c.c
 F:	drivers/iio/gyro/fxas21002c_spi.c
 
-NXP i.MX CLOCK DRIVERS
-M:	Abel Vesa <abel.vesa@nxp.com>
-L:	linux-clk@vger.kernel.org
-L:	linux-imx@nxp.com
-S:	Maintained
-F:	drivers/clk/imx/
-
 NXP i.MX 8MQ DCSS DRIVER
 M:	Laurentiu Palcu <laurentiu.palcu@oss.nxp.com>
 R:	Lucas Stach <l.stach@pengutronix.de>
@@ -13702,6 +13704,21 @@ S:	Supported
 F:	Documentation/devicetree/bindings/iio/adc/nxp,imx8qxp-adc.yaml
 F:	drivers/iio/adc/imx8qxp-adc.c
 
+NXP i.MX 8QXP/8QM JPEG V4L2 DRIVER
+M:	Mirela Rabulea <mirela.rabulea@nxp.com>
+R:	NXP Linux Team <linux-imx@nxp.com>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/media/nxp,imx8-jpeg.yaml
+F:	drivers/media/platform/imx-jpeg
+
+NXP i.MX CLOCK DRIVERS
+M:	Abel Vesa <abel.vesa@nxp.com>
+L:	linux-clk@vger.kernel.org
+L:	linux-imx@nxp.com
+S:	Maintained
+F:	drivers/clk/imx/
+
 NXP PF8100/PF8121A/PF8200 PMIC REGULATOR DEVICE DRIVER
 M:	Jagan Teki <jagan@amarulasolutions.com>
 S:	Maintained
@@ -13739,19 +13756,12 @@ F:	include/drm/i2c/tda998x.h
 F:	include/dt-bindings/display/tda998x.h
 K:	"nxp,tda998x"
 
-NXP TFA9879 DRIVER
-M:	Peter Rosin <peda@axentia.se>
-L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
-S:	Maintained
-F:	Documentation/devicetree/bindings/sound/tfa9879.txt
-F:	sound/soc/codecs/tfa9879*
-
-NXP/Goodix TFA989X (TFA1) DRIVER
-M:	Stephan Gerhold <stephan@gerhold.net>
+NXP TFA9879 DRIVER
+M:	Peter Rosin <peda@axentia.se>
 L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
-F:	Documentation/devicetree/bindings/sound/nxp,tfa989x.yaml
-F:	sound/soc/codecs/tfa989x.c
+F:	Documentation/devicetree/bindings/sound/tfa9879.txt
+F:	sound/soc/codecs/tfa9879*
 
 NXP-NCI NFC DRIVER
 R:	Charles Gorand <charles.gorand@effinnov.com>
@@ -13760,13 +13770,12 @@ S:	Supported
 F:	Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml
 F:	drivers/nfc/nxp-nci
 
-NXP i.MX 8QXP/8QM JPEG V4L2 DRIVER
-M:	Mirela Rabulea <mirela.rabulea@nxp.com>
-R:	NXP Linux Team <linux-imx@nxp.com>
-L:	linux-media@vger.kernel.org
+NXP/Goodix TFA989X (TFA1) DRIVER
+M:	Stephan Gerhold <stephan@gerhold.net>
+L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
-F:	Documentation/devicetree/bindings/media/nxp,imx8-jpeg.yaml
-F:	drivers/media/platform/imx-jpeg
+F:	Documentation/devicetree/bindings/sound/nxp,tfa989x.yaml
+F:	sound/soc/codecs/tfa989x.c
 
 NZXT-KRAKEN2 HARDWARE MONITORING DRIVER
 M:	Jonas Malaco <jonas@protocubo.io>
@@ -14556,6 +14565,14 @@ L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/pci/controller/dwc/*layerscape*
 
+PCI DRIVER FOR FU740
+M:	Paul Walmsley <paul.walmsley@sifive.com>
+M:	Greentime Hu <greentime.hu@sifive.com>
+L:	linux-pci@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/pci/sifive,fu740-pcie.yaml
+F:	drivers/pci/controller/dwc/pcie-fu740.c
+
 PCI DRIVER FOR GENERIC OF HOSTS
 M:	Will Deacon <will@kernel.org>
 L:	linux-pci@vger.kernel.org
@@ -14574,14 +14591,6 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/pci/fsl,imx6q-pcie.yaml
 F:	drivers/pci/controller/dwc/*imx6*
 
-PCI DRIVER FOR FU740
-M:	Paul Walmsley <paul.walmsley@sifive.com>
-M:	Greentime Hu <greentime.hu@sifive.com>
-L:	linux-pci@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/pci/sifive,fu740-pcie.yaml
-F:	drivers/pci/controller/dwc/pcie-fu740.c
-
 PCI DRIVER FOR INTEL IXP4XX
 M:	Linus Walleij <linus.walleij@linaro.org>
 S:	Maintained
@@ -14854,14 +14863,6 @@ L:	linux-arm-msm@vger.kernel.org
 S:	Maintained
 F:	drivers/pci/controller/dwc/pcie-qcom.c
 
-PCIE ENDPOINT DRIVER FOR QUALCOMM
-M:	Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
-L:	linux-pci@vger.kernel.org
-L:	linux-arm-msm@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
-F:	drivers/pci/controller/dwc/pcie-qcom-ep.c
-
 PCIE DRIVER FOR ROCKCHIP
 M:	Shawn Lin <shawn.lin@rock-chips.com>
 L:	linux-pci@vger.kernel.org
@@ -14883,6 +14884,14 @@ L:	linux-pci@vger.kernel.org
 S:	Maintained
 F:	drivers/pci/controller/dwc/*spear*
 
+PCIE ENDPOINT DRIVER FOR QUALCOMM
+M:	Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
+L:	linux-pci@vger.kernel.org
+L:	linux-arm-msm@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml
+F:	drivers/pci/controller/dwc/pcie-qcom-ep.c
+
 PCMCIA SUBSYSTEM
 M:	Dominik Brodowski <linux@dominikbrodowski.net>
 S:	Odd Fixes
@@ -15142,13 +15151,6 @@ M:	Logan Gunthorpe <logang@deltatee.com>
 S:	Maintained
 F:	drivers/dma/plx_dma.c
 
-PM6764TR DRIVER
-M:	Charles Hsu	<hsu.yungteng@gmail.com>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/hwmon/pm6764tr.rst
-F:	drivers/hwmon/pmbus/pm6764tr.c
-
 PM-GRAPH UTILITY
 M:	"Todd E Brandt" <todd.e.brandt@linux.intel.com>
 L:	linux-pm@vger.kernel.org
@@ -15158,6 +15160,13 @@ B:	https://bugzilla.kernel.org/buglist.cgi?component=pm-graph&product=Tools
 T:	git git://github.com/intel/pm-graph
 F:	tools/power/pm-graph
 
+PM6764TR DRIVER
+M:	Charles Hsu	<hsu.yungteng@gmail.com>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/hwmon/pm6764tr.rst
+F:	drivers/hwmon/pmbus/pm6764tr.c
+
 PMBUS HARDWARE MONITORING DRIVERS
 M:	Guenter Roeck <linux@roeck-us.net>
 L:	linux-hwmon@vger.kernel.org
@@ -15238,15 +15247,6 @@ F:	include/linux/pm_*
 F:	include/linux/powercap.h
 F:	kernel/configs/nopm.config
 
-DYNAMIC THERMAL POWER MANAGEMENT (DTPM)
-M:	Daniel Lezcano <daniel.lezcano@kernel.org>
-L:	linux-pm@vger.kernel.org
-S:	Supported
-B:	https://bugzilla.kernel.org
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
-F:	drivers/powercap/dtpm*
-F:	include/linux/dtpm.h
-
 POWER STATE COORDINATION INTERFACE (PSCI)
 M:	Mark Rutland <mark.rutland@arm.com>
 M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
@@ -16261,12 +16261,6 @@ S:	Supported
 F:	Documentation/devicetree/bindings/i2c/renesas,riic.yaml
 F:	drivers/i2c/busses/i2c-riic.c
 
-RENESAS USB PHY DRIVER
-M:	Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
-L:	linux-renesas-soc@vger.kernel.org
-S:	Maintained
-F:	drivers/phy/renesas/phy-rcar-gen3-usb*.c
-
 RENESAS RZ/G2L A/D DRIVER
 M:	Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
 L:	linux-iio@vger.kernel.org
@@ -16275,6 +16269,12 @@ S:	Supported
 F:	Documentation/devicetree/bindings/iio/adc/renesas,rzg2l-adc.yaml
 F:	drivers/iio/adc/rzg2l_adc.c
 
+RENESAS USB PHY DRIVER
+M:	Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
+L:	linux-renesas-soc@vger.kernel.org
+S:	Maintained
+F:	drivers/phy/renesas/phy-rcar-gen3-usb*.c
+
 RESET CONTROLLER FRAMEWORK
 M:	Philipp Zabel <p.zabel@pengutronix.de>
 S:	Maintained
@@ -17105,6 +17105,15 @@ F:	block/sed*
 F:	include/linux/sed*
 F:	include/uapi/linux/sed*
 
+SECURE MONITOR CALL(SMC) CALLING CONVENTION (SMCCC)
+M:	Mark Rutland <mark.rutland@arm.com>
+M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
+M:	Sudeep Holla <sudeep.holla@arm.com>
+L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
+S:	Maintained
+F:	drivers/firmware/smccc/
+F:	include/linux/arm-smccc.h
+
 SECURITY CONTACT
 M:	Security Officers <security@kernel.org>
 S:	Supported
@@ -17512,15 +17521,6 @@ M:	Nicolas Pitre <nico@fluxnic.net>
 S:	Odd Fixes
 F:	drivers/net/ethernet/smsc/smc91x.*
 
-SECURE MONITOR CALL(SMC) CALLING CONVENTION (SMCCC)
-M:	Mark Rutland <mark.rutland@arm.com>
-M:	Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
-M:	Sudeep Holla <sudeep.holla@arm.com>
-L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
-S:	Maintained
-F:	drivers/firmware/smccc/
-F:	include/linux/arm-smccc.h
-
 SMM665 HARDWARE MONITOR DRIVER
 M:	Guenter Roeck <linux@roeck-us.net>
 L:	linux-hwmon@vger.kernel.org
@@ -18701,6 +18701,14 @@ M:	Thierry Reding <thierry.reding@gmail.com>
 S:	Supported
 F:	drivers/pwm/pwm-tegra.c
 
+TEGRA QUAD SPI DRIVER
+M:	Thierry Reding <thierry.reding@gmail.com>
+M:	Jonathan Hunter <jonathanh@nvidia.com>
+M:	Sowjanya Komatineni <skomatineni@nvidia.com>
+L:	linux-tegra@vger.kernel.org
+S:	Maintained
+F:	drivers/spi/spi-tegra210-quad.c
+
 TEGRA SERIAL DRIVER
 M:	Laxman Dewangan <ldewangan@nvidia.com>
 S:	Supported
@@ -18711,14 +18719,6 @@ M:	Laxman Dewangan <ldewangan@nvidia.com>
 S:	Supported
 F:	drivers/spi/spi-tegra*
 
-TEGRA QUAD SPI DRIVER
-M:	Thierry Reding <thierry.reding@gmail.com>
-M:	Jonathan Hunter <jonathanh@nvidia.com>
-M:	Sowjanya Komatineni <skomatineni@nvidia.com>
-L:	linux-tegra@vger.kernel.org
-S:	Maintained
-F:	drivers/spi/spi-tegra210-quad.c
-
 TEGRA VIDEO DRIVER
 M:	Thierry Reding <thierry.reding@gmail.com>
 M:	Jonathan Hunter <jonathanh@nvidia.com>
@@ -18767,13 +18767,6 @@ L:	alsa-devel@alsa-project.org (moderated for non-subscribers)
 S:	Maintained
 F:	sound/soc/ti/
 
-TEXAS INSTRUMENTS' DAC7612 DAC DRIVER
-M:	Ricardo Ribalda <ribalda@kernel.org>
-L:	linux-iio@vger.kernel.org
-S:	Supported
-F:	Documentation/devicetree/bindings/iio/dac/ti,dac7612.yaml
-F:	drivers/iio/dac/ti-dac7612.c
-
 TEXAS INSTRUMENTS DMA DRIVERS
 M:	Peter Ujfalusi <peter.ujfalusi@gmail.com>
 L:	dmaengine@vger.kernel.org
@@ -18787,6 +18780,22 @@ F:	include/linux/dma/k3-udma-glue.h
 F:	include/linux/dma/ti-cppi5.h
 F:	include/linux/dma/k3-psil.h
 
+TEXAS INSTRUMENTS TPS23861 PoE PSE DRIVER
+M:	Robert Marko <robert.marko@sartura.hr>
+M:	Luka Perkov <luka.perkov@sartura.hr>
+L:	linux-hwmon@vger.kernel.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
+F:	Documentation/hwmon/tps23861.rst
+F:	drivers/hwmon/tps23861.c
+
+TEXAS INSTRUMENTS' DAC7612 DAC DRIVER
+M:	Ricardo Ribalda <ribalda@kernel.org>
+L:	linux-iio@vger.kernel.org
+S:	Supported
+F:	Documentation/devicetree/bindings/iio/dac/ti,dac7612.yaml
+F:	drivers/iio/dac/ti-dac7612.c
+
 TEXAS INSTRUMENTS' SYSTEM CONTROL INTERFACE (TISCI) PROTOCOL DRIVER
 M:	Nishanth Menon <nm@ti.com>
 M:	Tero Kristo <kristo@kernel.org>
@@ -18811,15 +18820,6 @@ F:	include/dt-bindings/soc/ti,sci_pm_domain.h
 F:	include/linux/soc/ti/ti_sci_inta_msi.h
 F:	include/linux/soc/ti/ti_sci_protocol.h
 
-TEXAS INSTRUMENTS TPS23861 PoE PSE DRIVER
-M:	Robert Marko <robert.marko@sartura.hr>
-M:	Luka Perkov <luka.perkov@sartura.hr>
-L:	linux-hwmon@vger.kernel.org
-S:	Maintained
-F:	Documentation/devicetree/bindings/hwmon/ti,tps23861.yaml
-F:	Documentation/hwmon/tps23861.rst
-F:	drivers/hwmon/tps23861.c
-
 TEXAS INSTRUMENTS' TMP117 TEMPERATURE SENSOR DRIVER
 M:	Puranjay Mohan <puranjay12@gmail.com>
 L:	linux-iio@vger.kernel.org
@@ -19719,6 +19719,13 @@ L:	linux-usb@vger.kernel.org
 S:	Supported
 F:	drivers/usb/class/usblp.c
 
+USB QMI WWAN NETWORK DRIVER
+M:	Bjørn Mork <bjorn@mork.no>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-class-net-qmi
+F:	drivers/net/usb/qmi_wwan.c
+
 USB RAW GADGET DRIVER
 R:	Andrey Konovalov <andreyknvl@gmail.com>
 L:	linux-usb@vger.kernel.org
@@ -19727,13 +19734,6 @@ F:	Documentation/usb/raw-gadget.rst
 F:	drivers/usb/gadget/legacy/raw_gadget.c
 F:	include/uapi/linux/usb/raw_gadget.h
 
-USB QMI WWAN NETWORK DRIVER
-M:	Bjørn Mork <bjorn@mork.no>
-L:	netdev@vger.kernel.org
-S:	Maintained
-F:	Documentation/ABI/testing/sysfs-class-net-qmi
-F:	drivers/net/usb/qmi_wwan.c
-
 USB RTL8150 DRIVER
 M:	Petko Manolov <petkan@nucleusys.com>
 L:	linux-usb@vger.kernel.org
@@ -20049,6 +20049,14 @@ S:	Maintained
 F:	drivers/media/common/videobuf2/*
 F:	include/media/videobuf2-*
 
+VIDTV VIRTUAL DIGITAL TV DRIVER
+M:	Daniel W. S. Almeida <dwlsalmeida@gmail.com>
+L:	linux-media@vger.kernel.org
+S:	Maintained
+W:	https://linuxtv.org
+T:	git git://linuxtv.org/media_tree.git
+F:	drivers/media/test-drivers/vidtv/*
+
 VIMC VIRTUAL MEDIA CONTROLLER DRIVER
 M:	Helen Koike <helen.koike@collabora.com>
 R:	Shuah Khan <skhan@linuxfoundation.org>
@@ -20078,6 +20086,16 @@ F:	include/uapi/linux/virtio_vsock.h
 F:	net/vmw_vsock/virtio_transport.c
 F:	net/vmw_vsock/virtio_transport_common.c
 
+VIRTIO BALLOON
+M:	"Michael S. Tsirkin" <mst@redhat.com>
+M:	David Hildenbrand <david@redhat.com>
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/virtio/virtio_balloon.c
+F:	include/uapi/linux/virtio_balloon.h
+F:	include/linux/balloon_compaction.h
+F:	mm/balloon_compaction.c
+
 VIRTIO BLOCK AND SCSI DRIVERS
 M:	"Michael S. Tsirkin" <mst@redhat.com>
 M:	Jason Wang <jasowang@redhat.com>
@@ -20115,16 +20133,6 @@ F:	include/linux/virtio*.h
 F:	include/uapi/linux/virtio_*.h
 F:	tools/virtio/
 
-VIRTIO BALLOON
-M:	"Michael S. Tsirkin" <mst@redhat.com>
-M:	David Hildenbrand <david@redhat.com>
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/virtio/virtio_balloon.c
-F:	include/uapi/linux/virtio_balloon.h
-F:	include/linux/balloon_compaction.h
-F:	mm/balloon_compaction.c
-
 VIRTIO CRYPTO DRIVER
 M:	Gonglei <arei.gonglei@huawei.com>
 L:	virtualization@lists.linux-foundation.org
@@ -20186,6 +20194,15 @@ F:	drivers/vhost/
 F:	include/linux/vhost_iotlb.h
 F:	include/uapi/linux/vhost.h
 
+VIRTIO I2C DRIVER
+M:	Conghui Chen <conghui.chen@intel.com>
+M:	Viresh Kumar <viresh.kumar@linaro.org>
+L:	linux-i2c@vger.kernel.org
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/i2c/busses/i2c-virtio.c
+F:	include/uapi/linux/virtio_i2c.h
+
 VIRTIO INPUT DRIVER
 M:	Gerd Hoffmann <kraxel@redhat.com>
 S:	Maintained
@@ -20207,6 +20224,13 @@ W:	https://virtio-mem.gitlab.io/
 F:	drivers/virtio/virtio_mem.c
 F:	include/uapi/linux/virtio_mem.h
 
+VIRTIO PMEM DRIVER
+M:	Pankaj Gupta <pankaj.gupta.linux@gmail.com>
+L:	virtualization@lists.linux-foundation.org
+S:	Maintained
+F:	drivers/nvdimm/virtio_pmem.c
+F:	drivers/nvdimm/nd_virtio.c
+
 VIRTIO SOUND DRIVER
 M:	Anton Yakovlev <anton.yakovlev@opensynergy.com>
 M:	"Michael S. Tsirkin" <mst@redhat.com>
@@ -20216,22 +20240,6 @@ S:	Maintained
 F:	include/uapi/linux/virtio_snd.h
 F:	sound/virtio/*
 
-VIRTIO I2C DRIVER
-M:	Conghui Chen <conghui.chen@intel.com>
-M:	Viresh Kumar <viresh.kumar@linaro.org>
-L:	linux-i2c@vger.kernel.org
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/i2c/busses/i2c-virtio.c
-F:	include/uapi/linux/virtio_i2c.h
-
-VIRTIO PMEM DRIVER
-M:	Pankaj Gupta <pankaj.gupta.linux@gmail.com>
-L:	virtualization@lists.linux-foundation.org
-S:	Maintained
-F:	drivers/nvdimm/virtio_pmem.c
-F:	drivers/nvdimm/nd_virtio.c
-
 VIRTUAL BOX GUEST DEVICE DRIVER
 M:	Hans de Goede <hdegoede@redhat.com>
 M:	Arnd Bergmann <arnd@arndb.de>
@@ -20261,14 +20269,6 @@ W:	https://linuxtv.org
 T:	git git://linuxtv.org/media_tree.git
 F:	drivers/media/test-drivers/vivid/*
 
-VIDTV VIRTUAL DIGITAL TV DRIVER
-M:	Daniel W. S. Almeida <dwlsalmeida@gmail.com>
-L:	linux-media@vger.kernel.org
-S:	Maintained
-W:	https://linuxtv.org
-T:	git git://linuxtv.org/media_tree.git
-F:	drivers/media/test-drivers/vidtv/*
-
 VLYNQ BUS
 M:	Florian Fainelli <f.fainelli@gmail.com>
 L:	openwrt-devel@lists.openwrt.org (subscribers-only)
@@ -20276,18 +20276,6 @@ S:	Maintained
 F:	drivers/vlynq/vlynq.c
 F:	include/linux/vlynq.h
 
-VME SUBSYSTEM
-M:	Martyn Welch <martyn@welchs.me.uk>
-M:	Manohar Vanga <manohar.vanga@gmail.com>
-M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-L:	linux-kernel@vger.kernel.org
-S:	Maintained
-T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
-F:	Documentation/driver-api/vme.rst
-F:	drivers/staging/vme/
-F:	drivers/vme/
-F:	include/linux/vme*
-
 VM SOCKETS (AF_VSOCK)
 M:	Stefano Garzarella <sgarzare@redhat.com>
 L:	virtualization@lists.linux-foundation.org
@@ -20301,6 +20289,18 @@ F:	include/uapi/linux/vsockmon.h
 F:	net/vmw_vsock/
 F:	tools/testing/vsock/
 
+VME SUBSYSTEM
+M:	Martyn Welch <martyn@welchs.me.uk>
+M:	Manohar Vanga <manohar.vanga@gmail.com>
+M:	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc.git
+F:	Documentation/driver-api/vme.rst
+F:	drivers/staging/vme/
+F:	drivers/vme/
+F:	include/linux/vme*
+
 VMWARE BALLOON DRIVER
 M:	Nadav Amit <namit@vmware.com>
 M:	"VMware, Inc." <pv-drivers@vmware.com>
-- 
2.33.0



* [PATCH v30 06/32] x86/cet: Add control-protection fault handler
  @ 2021-08-30 18:15  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-08-30 18:15 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works much like the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v29:
- Remove pr_emerg() since it is followed by die().
- Change boot_cpu_has() to cpu_feature_enabled().

v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
---
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 62 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 06743ec054d2..049ea3dcc6cb 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..b64192314a6d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -607,6 +608,67 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		die("kernel control protection fault", regs, error_code);
+		panic("Unexpected kernel control protection fault.  Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a3c221f4c9d..a1a153ea3cc3 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -235,7 +235,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0
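
For readers coming from user space, the si_code added by this patch can be
checked in a SA_SIGINFO handler.  What follows is only a minimal sketch, not
part of the patch: the helper names is_control_protection_fault(),
segv_handler() and install_segv_handler() are made up for illustration, and
the fallback value 10 for SEGV_CPERR is copied from the siginfo.h hunk above
for the case where the installed libc headers do not define it yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Assumption: value taken from the siginfo.h hunk in this patch. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Hypothetical helper: true if the signal reports a control-protection fault. */
bool is_control_protection_fault(const siginfo_t *info)
{
	return info->si_signo == SIGSEGV && info->si_code == SEGV_CPERR;
}

/* Hypothetical handler: report the cause using only async-signal-safe calls. */
void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char cp_msg[] = "SIGSEGV: control protection fault\n";
	static const char other_msg[] = "SIGSEGV: other cause\n";
	ssize_t n;

	(void)sig;
	(void)ucontext;
	if (is_control_protection_fault(info))
		n = write(STDERR_FILENO, cp_msg, sizeof(cp_msg) - 1);
	else
		n = write(STDERR_FILENO, other_msg, sizeof(other_msg) - 1);
	(void)n;
	_exit(1);
}

/* Install the handler with SA_SIGINFO so that si_code is delivered. */
int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call install_segv_handler() once at startup.  Note that the
handler above passes a zero address in si_addr, so only si_code distinguishes
this fault from other SIGSEGV causes.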



* man-pages-5.13 is released
@ 2021-08-27 21:45 13% Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-27 21:45 UTC (permalink / raw)
  To: lkml; +Cc: mtk.manpages, Alejandro Colomar

Gidday,

Alex Colomar and I are proud to announce:

    man-pages-5.13 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from 40 contributors. The release includes
around 200 commits that changed approximately 120 pages.

Tarball download:
    http://www.kernel.org/doc/man-pages/download.html
Git repository:
    https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
    http://man7.org/linux/man-pages/changelog.html#release_5.13

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2021/08/man-pages-513-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers of LKML is shown below.

Cheers,

Michael

==================== Changes in man-pages-5.13 ====================

Released: 2021-08-27, Christchurch


New and rewritten pages
-----------------------

mount_setattr.2
    Christian Brauner  [Alejandro Colomar, Michael Kerrisk]
        New manual page documenting the mount_setattr() system call


Newly documented interfaces in existing pages
---------------------------------------------

futex.2
    Kurt Kanzenbach  [Alejandro Colomar, Thomas Gleixner, Michael Kerrisk]
        Document FUTEX_LOCK_PI2

ioctl_tty.2
    Pali Rohár  [Alejandro Colomar, Michael Kerrisk]
        Document ioctls: TCGETS2, TCSETS2, TCSETSW2, TCSETSF2

pidfd_open.2
    Michael Kerrisk
        Document PIDFD_NONBLOCK

seccomp_unotify.2
    Rodrigo Campos  [Alejandro Colomar]
        Document SECCOMP_ADDFD_FLAG_SEND

sigaction.2
    Peter Collingbourne  [Alejandro Colomar, Michael Kerrisk]
        Document SA_EXPOSE_TAGBITS and the flag support detection protocol

statx.2
    NeilBrown
        Document STATX_MNT_ID

capabilities.7
user_namespaces.7
    Michael Kerrisk, Kir Kolyshkin  [Alejandro Colomar]
        Describe CAP_SETFCAP for mapping UID 0

mount_namespaces.7
    Michael Kerrisk  [Christian Brauner, Eric W. Biederman]
        More clearly explain the notion of locked mounts
            For a long time, this manual page has had a brief discussion of
            "locked" mounts, without clearly saying what this concept is, or
            why it exists. Expand the discussion with an explanation of what
            locked mounts are, why mounts are locked, and some examples of the
            effect of locking.

user_namespaces.7
    Michael Kerrisk
        Document /proc/PID/projid_map

ld.so.8
    Michael Kerrisk
        Document --list-tunables option added in glibc 2.33


Global changes
--------------

Various pages
    Michael Kerrisk
        Fix EBADF error description
            Make the description of the EBADF error for invalid 'dirfd' more
            uniform. In particular, note that the error only occurs when the
            pathname is relative, and that it occurs when the 'dirfd' is
            neither valid *nor* has the value AT_FDCWD.

Various pages
    Michael Kerrisk
        Terminology clean-up: "mount point" ==> "mount"
            Many times, these pages use the terminology "mount point", where
            "mount" would be better. A "mount point" is the location at which
            a mount is attached. A "mount" is an association between a
            filesystem and a mount point.
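
The EBADF rule described above (the error is raised only for a relative
pathname, and only when 'dirfd' is neither valid nor AT_FDCWD) can be tried
out with a few lines of C.  This is a rough sketch rather than anything taken
from the pages themselves; the function name path_is_absolute_by_openat() is
invented for the example, and -1 is used as the deliberately invalid 'dirfd'.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper.  Returns 1 if the open succeeded (the invalid dirfd
 * was ignored, so the path is absolute), 0 if the kernel reported EBADF
 * (the path is relative), and -1 if some other error prevented a verdict,
 * e.g. ENOENT for a missing file or an empty pathname.
 */
int path_is_absolute_by_openat(const char *path)
{
	/* -1 is neither an open file descriptor nor AT_FDCWD. */
	int fd = openat(-1, path, O_RDONLY | O_NONBLOCK | O_CLOEXEC);

	if (fd >= 0) {
		close(fd);
		return 1;
	}
	if (errno == EBADF)
		return 0;
	return -1;
}
```

For instance, path_is_absolute_by_openat("/") opens the root directory and
reports an absolute path, whereas a relative name such as "." fails with
EBADF before any lookup takes place.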


Changes to individual pages
---------------------------

mount.2
    Michael Kerrisk
        ERRORS: add EPERM error for case where a mount is locked
            Refer the reader to mount_namespaces(7) for details.

open.2
    Michael Kerrisk
        Explicitly describe the EBADF error that can occur with openat()
            In particular, specifying an invalid file descriptor number
            in 'dirfd' can be used as a check that 'pathname' is absolute.
    Michael Kerrisk
        Clarify that openat()'s dirfd must be opened with O_RDONLY or O_PATH

seccomp.2
    Eric W. Biederman  [Kees Cook]
        Clarify that bad system calls kill the thread (not the process)

syscalls.2
    Michael Kerrisk
        Add quotactl_fd(); remove quotactl_path()
            quotactl_path() was never wired up in Linux 5.13.
            It was replaced instead by quotactl_fd().
    Michael Kerrisk
        Add system calls that are new in 5.13

wait.2
    Michael Kerrisk
        ERRORS: document EAGAIN for waitid() on a PID file descriptor

termios.3
    Pali Rohár  [Alejandro Colomar]
        SPARC architecture has 4 different Bnnn constants
    Pali Rohár  [Alejandro Colomar]
        Add information how to set baud rate to any other value
    Pali Rohár  [Alejandro Colomar]
        Use bold style for Bnn and EXTn macro constants
    Pali Rohár  [Alejandro Colomar]
        Document missing baud-rate constants

vdso.7
    Michael Kerrisk  [Christophe Leroy]
        Update CLOCK_REALTIME_COARSE + CLOCK_MONOTONIC_COARSE info for powerpc
    Alejandro Colomar  [Christophe Leroy]
        Add y2038 compliant gettime for ppc/32

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* [PATCH v29 06/32] x86/cet: Add control-protection fault handler
  @ 2021-08-20 18:11  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-08-20 18:11 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Dave Martin, Weijiang Yang, Pengfei Xu,
	Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works much like the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v29:
- Remove pr_emerg() since it is followed by die().
- Change boot_cpu_has() to cpu_feature_enabled().

v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
---
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 62 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 06743ec054d2..049ea3dcc6cb 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..b64192314a6d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -607,6 +608,67 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		die("kernel control protection fault", regs, error_code);
+		panic("Unexpected kernel control protection fault.  Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!cpu_feature_enabled(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a3c221f4c9d..a1a153ea3cc3 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -235,7 +235,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0
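
The lookup of the error string in exc_control_protection() relies on
array_index_nospec() yielding index 0 for an out-of-range value, which is why
the table starts with "unknown".  The user-space sketch below mirrors only
that architectural result; it does not reproduce the speculation hardening
the kernel macro provides.  The names cp_err_names, clamp_index() and
cp_err_name() are placeholders chosen for this illustration.

```c
#include <stddef.h>

/* Same strings, in the same order, as control_protection_err[] in the patch. */
static const char *const cp_err_names[] = {
	"unknown",
	"near-ret",
	"far-ret/iret",
	"endbranch",
	"rstorssp",
	"setssbsy",
	"unknown",
};

/*
 * Plain stand-in for array_index_nospec(): keep an in-range index,
 * map anything out of range to 0.  No speculation barrier is implied.
 */
size_t clamp_index(unsigned long index, size_t size)
{
	return index < size ? (size_t)index : 0;
}

/* Translate a control-protection error code into its short name. */
const char *cp_err_name(unsigned long error_code)
{
	size_t n = sizeof(cp_err_names) / sizeof(cp_err_names[0]);

	return cp_err_names[clamp_index(error_code, n)];
}
```

With this table, error code 1 maps to "near-ret" and 3 to "endbranch", while
any code of 7 or above falls back to the first entry, "unknown".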



* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-17 14:06  4%     ` Christian Brauner
@ 2021-08-19  0:24 10%       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-19  0:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Eric W. Biederman, linux-man, linux-fsdevel,
	containers, Alejandro Colomar, linux-kernel, Christoph Hellwig

Hi Christian,

On 8/17/21 4:06 PM, Christian Brauner wrote:
> On Tue, Aug 17, 2021 at 05:12:20AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Eric,
>>
>> Thanks for your feedback!
>>
>> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>>> Michael Kerrisk <mtk.manpages@gmail.com> writes:
>>>
>>>> For a long time, this manual page has had a brief discussion of
>>>> "locked" mounts, without clearly saying what this concept is, or
>>>> why it exists. Expand the discussion with an explanation of what
>>>> locked mounts are, why mounts are locked, and some examples of the
>>>> effect of locking.
>>>>
>>>> Thanks to Christian Brauner for a lot of help in understanding
>>>> these details.
>>>>
>>>> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
>>>> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> ---
>>>>
>>>> Hello Eric and others,
>>>>
>>>> After some quite helpful info from Christian Brauner, I've expanded
>>>> the discussion of locked mounts (a concept I didn't really have a
>>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>>> grateful to receive review comments, acks, etc., on the patch below.
>>>> Could you take a look please?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>>> index e3468bdb7..97427c9ea 100644
>>>> --- a/man7/mount_namespaces.7
>>>> +++ b/man7/mount_namespaces.7
>>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>>  mount namespace as a single unit,
>>>>  and recursive mounts that propagate between
>>>>  mount namespaces propagate as a single unit.)
>>>> +.IP
>>>> +In this context, "may not be separated" means that the mounts
>>>> +are locked so that they may not be individually unmounted.
>>>> +Consider the following example:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo mkdir /mnt/dir\fP
>>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>>> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
>>>
>>> Do we want a more motivating example such as a /proc/sys?
> 
> Could even be better to use an example involving /etc/shadow, e.g.:
> 
> sudo mount --bind /etc /mnt
> sudo mount --bind /dev/null /mnt/shadow

Nice! I've rewritten the example to use /etc/shadow
instead of a bind-mounted directory at /mnt/dir.
Thanks!

> the procfs example might be a bit awkward (see below).

Okay.

>>> It has been common to mount over /proc files and directories that can be
>>> written to by the global root so that users in a mount namespace may not
>>> touch them.
>>
>> Seems reasonable. But I want to check one thing. Can you please
>> define "global root". I'm pretty sure I know what you mean, but
>> I'd like to know your definition.
> 
> (global root == root in the initial user namespace.)

(As noted in the mail to Eric, I've added this definition to
user_namespaces(7).)

> Some application container runtimes have a concept of "masked paths"
> where they overmount certain directories they want to hide with an empty
> tmpfs and some files they want to hide with /dev/null (see [1]).
> 
> But I don't think this is a great example because this overmounting is
> mostly needed and done when you're running privileged containers (see [2]).
> 
> There's usually no point in overmounting parts of procfs that are
> writable by global root. If you're running in an unprivileged container
> userns root can't write to any of the files that only global root can.
> Otherwise this would be a rather severe security issue.
> 
> There might be a use-case for overmounting files that contain global
> information that are readable inside user namespaces but then one either
> has to question why they are readable in the first place or why this
> information needs to be hidden. Examples include /proc/kallsyms and
> /proc/keys.
> 
> But overall the overmounting of procfs is most sensible when running
> privileged containers or when sharing pid namespaces and procfs is
> somehow bind-mounted from somewhere. But that means there's no user
> namespace in play which means that the mounts aren't locked.
> 
> So if the container runtime has e.g. overmounted /proc/kcore with
> /dev/null then the privileged container can unmount it. To protect
> against this such privileged containers usually drop CAP_SYS_ADMIN.
> So the protection here comes from dropping capabilities not from locking
> mounts together. All of this makes this a bit of a confusing example.
> 
> An example where locked mount protection is relied on heavily which I'm
> involved in is systemd(-nspawn). All custom mounts a container gets such
> as data shared from the host with the container are mounted in a separate
> (privileged) mount namespace before the container workload is cloned.
> The cloned container then gets a new mount + userns pair and hence, all
> the mounts it inherited are now locked.
> 
> This way, you can e.g. share /etc with your container and just overmount
> /etc/shadow with /dev/null or a custom /etc/shadow (Reason for my
> example above.) without dropping capabilities that would prevent the
> container from mounting.
> 
> So I'd suggest using a simple example. This is not about illustrating
> what container runtimes do but what the behavior of a mount namespace
> is. There's really no need to overcomplicate this.

Thanks for the detailed explanation. As noted above, I've rewritten
the example to use /etc/shadow.

> [1]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L90
> [2]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L110
> 
>>
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The above steps, performed in a more privileged user namespace,
>>>> +have created a (read-only) bind mount that
>>>> +obscures the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +For security reasons, it should not be possible to unmount
>>>> +that mount in a less privileged user namespace,
>>>> +since that would reveal the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +.IP
>>>> +Suppose we now create a new mount namespace
>>>> +owned by a (new) subordinate user namespace.
>>>> +The new mount namespace will inherit copies of all of the mounts
>>>> +from the previous mount namespace.
>>>> +However, those mounts will be locked because the new mount namespace
>>>> +is owned by a less privileged user namespace.
>>>> +Consequently, an attempt to unmount the mount fails:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>>> +               \fBstrace \-o /tmp/log \e\fP
>>>> +               \fBumount /mnt/dir\fP
>>>> +umount: /mnt/dir: not mounted.
>>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>>> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The error message from
>>>> +.BR umount (8)
>>>> +is a little confusing, but the
>>>> +.BR strace (1)
>>>> +output reveals that the underlying
>>>> +.BR umount2 (2)
>>>> +system call failed with the error
>>>> +.BR EINVAL ,
>>>> +which is the error that the kernel returns to indicate that
>>>> +the mount is locked.
>>>
>>> Do you want to mention that you can unmount the entire subtree?  Either
>>> with pivot_root if it is locked to "/" or with
>>> "umount -l /path/to/propagated/directory".
>>
>> Yes, I wondered about that, but hadn't got round to devising 
>> the scenario. How about this:
>>
>> [[
>>        *  Following on from the previous point, note that it is possible
>>           to unmount an entire tree of mounts that propagated as a unit
>>           into a mount namespace that is owned by a less privileged user
>>           namespace, as illustrated in the following example.
>>
>>           First, we create new user and mount namespaces using
>>           unshare(1).  In the new mount namespace, the propagation type
>>           of all mounts is set to private.  We then create a shared bind
>>           mount at /mnt, and a small hierarchy of mount points underneath
>>           that mount point.
>>
>>               $ PS1='ns1# ' sudo unshare --user --map-root-user \
>>                                      --mount --propagation private bash
>>               ns1# echo $$        # We need the PID of this shell later
>>               778501
>>               ns1# mount --make-shared --bind /mnt /mnt
>>               ns1# mkdir /mnt/x
>>               ns1# mount --make-private -t tmpfs none /mnt/x
>>               ns1# mkdir /mnt/x/y
>>               ns1# mount --make-private -t tmpfs none /mnt/x/y
>>               ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>>               989 986 0:56 / /mnt/x rw,relatime
>>               990 989 0:57 / /mnt/x/y rw,relatime
>>
>>           Continuing in the same shell session, we then create a second
>>           shell in a new mount namespace and a new subordinate (and thus
>>           less privileged) user namespace and check the state of the
>>           propagated mount points rooted at /mnt.
>>
>>               ns1# PS1='ns2# ' unshare --user --map-root-user \
>>                                      --mount --propagation unchanged bash
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>>
>>           Of note in the above output is that the propagation type of the
>>           mount point /mnt has been reduced to slave, as explained near
>>           the start of this subsection.  This means that submount events
>>           will propagate from the master /mnt in "ns1", but propagation
>>           will not occur in the opposite direction.
>>
>>           From a separate terminal window, we then use nsenter(1) to
>>           enter the mount and user namespaces corresponding to "ns1".  In
>>           that terminal window, we then recursively bind mount /mnt/x at
>>           the location /mnt/ppp.
>>
>>               $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>>               ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>>               ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>>               989 986 0:56 / /mnt/x rw,relatime
>>               990 989 0:57 / /mnt/x/y rw,relatime
>>               1242 986 0:56 / /mnt/ppp rw,relatime
>>               1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>>
>>           Because the propagation type of the parent mount, /mnt, was
>>           shared, the recursive bind mount propagated a small tree of
>>           mounts under the slave mount /mnt into "ns2", as can be
>>           verified by executing the following command in that shell
>>           session:
>>
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>>               1244 1239 0:56 / /mnt/ppp rw,relatime
>>               1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>>
>>           While it is not possible to unmount a part of that propagated
>>           subtree (/mnt/ppp/y), it is possible to unmount the entire
>>           tree, as shown by the following commands:
>>
>>               ns2# umount /mnt/ppp/y
>>               umount: /mnt/ppp/y: not mounted.
>>               ns2# umount -l /mnt/ppp      # Succeeds...
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>> ]]
>>
>> ?
> 
> I'd just add a note about mounts that propagated locked together as a unit
> as being unmountable as a unit (which is intuitive but may need to be
> spelled out). But I'd leave out this lengthy example as it makes the
> manpage pretty convoluted.

Christian, I do sympathize with this point of view, and I hesitated
about adding this much text to the page. But, on the other hand:

* Many of the pages in section 7 are intended to provide "the big
  picture" of how things work.
* Mount namespaces are complex and (I think) generally poorly
  understood. So let's help people as much as we can.
* I had already relocated this whole subsection to the end of
  the page, so it is less obtrusive.

In summary, I'm inclined to keep the text, but thank you for
voicing your (mild) objection.
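
As an aside for anyone checking the mountinfo listings quoted above: the
propagation state is carried in the optional fields that follow the mount
options ("shared:N", "master:N"), up to the "-" separator.  A small C sketch
for pulling those tags out of one line might look like the following.  The
struct and function names are invented for this note, and a value of 0 is
assumed to mean "tag not present".

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/* Hypothetical result type: peer group IDs, 0 when the tag is absent. */
struct propagation {
	int shared;	/* from "shared:N" */
	int master;	/* from "master:N" */
};

/*
 * Parse one /proc/PID/mountinfo line.  The first six fields (mount ID,
 * parent ID, major:minor, root, mount point, mount options) are skipped;
 * the optional fields after them are scanned until the "-" separator.
 * Returns 0 on success, -1 if the line is too long or too short.
 */
int parse_propagation(const char *line, struct propagation *out)
{
	char buf[512];
	char *save = NULL;
	char *tok;
	int field = 0;

	out->shared = 0;
	out->master = 0;
	if (strlen(line) >= sizeof(buf))
		return -1;
	strcpy(buf, line);

	for (tok = strtok_r(buf, " \n", &save); tok != NULL;
	     tok = strtok_r(NULL, " \n", &save)) {
		field++;
		if (field <= 6)
			continue;
		if (strcmp(tok, "-") == 0)
			break;
		/* Each sscanf() only stores a value if its prefix matches. */
		sscanf(tok, "shared:%d", &out->shared);
		sscanf(tok, "master:%d", &out->master);
	}
	return field >= 6 ? 0 : -1;
}
```

Fed the line for /mnt in "ns2", this reports master group 344 and no shared
group, which matches the reduction to a slave mount described in the example.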

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-17 15:51  5%     ` Eric W. Biederman
@ 2021-08-19  0:22 11%       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-19  0:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, linux-man, linux-fsdevel, containers,
	Alejandro Colomar, Christian Brauner, linux-kernel,
	Christoph Hellwig

Hello Eric,

Thank you for your response.

On 8/17/21 5:51 PM, Eric W. Biederman wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:
> 
>> Hi Eric,
>>
>> Thanks for your feedback!
>>
>> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>>> Michael Kerrisk <mtk.manpages@gmail.com> writes:
>>>
>>>> For a long time, this manual page has had a brief discussion of
>>>> "locked" mounts, without clearly saying what this concept is, or
>>>> why it exists. Expand the discussion with an explanation of what
>>>> locked mounts are, why mounts are locked, and some examples of the
>>>> effect of locking.
>>>>
>>>> Thanks to Christian Brauner for a lot of help in understanding
>>>> these details.
>>>>
>>>> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
>>>> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
>>>> ---
>>>>
>>>> Hello Eric and others,
>>>>
>>>> After some quite helpful info from Christian Brauner, I've expanded
>>>> the discussion of locked mounts (a concept I didn't really have a
>>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>>> grateful to receive review comments, acks, etc., on the patch below.
>>>> Could you take a look please?
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>>
>>>>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 73 insertions(+)
>>>>
>>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>>> index e3468bdb7..97427c9ea 100644
>>>> --- a/man7/mount_namespaces.7
>>>> +++ b/man7/mount_namespaces.7
>>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>>  mount namespace as a single unit,
>>>>  and recursive mounts that propagate between
>>>>  mount namespaces propagate as a single unit.)
>>>> +.IP
>>>> +In this context, "may not be separated" means that the mounts
>>>> +are locked so that they may not be individually unmounted.
>>>> +Consider the following example:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo mkdir /mnt/dir\fP
>>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>>> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
>>>
>>> Do we want a more motivating example such as a /proc/sys?
>>>
>>> It has been common to mount over /proc files and directories that can be
>>> written to by the global root so that users in a mount namespace may not
>>> touch them.
>>
>> Seems reasonable. But I want to check one thing. Can you please
>> define "global root". I'm pretty sure I know what you mean, but
>> I'd like to know your definition.
> 
> I mean uid 0 in the initial user namespace.

(Good. That's what I thought you meant. So far, that term is not 
described in the manual pages. I just now added a definition of the 
term to user_namespaces(7).)

> This uid owns most of the files in /proc.
> 
> Container systems that don't want to use user namespaces frequently
> mount over files in proc to prevent using some of the root privileges
> that come simply by having uid 0.
> 
> Another use is mounting over files on virtual filesystems like proc
> to reduce the attack surface.

Thanks for the background. I think for the moment I will go with 
Christian's alternative suggestion (an example using /etc/shadow).

> For reducing what the root user in a container can do, I think using user
> namespaces and a uid other than 0 in the initial user namespace is the
> better approach.
> 
> 
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The above steps, performed in a more privileged user namespace,
>>>> +have created a (read-only) bind mount that
>>>> +obscures the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +For security reasons, it should not be possible to unmount
>>>> +that mount in a less privileged user namespace,
>>>> +since that would reveal the contents of the directory
>>>> +.IR /mnt/dir .
>>>> +.IP
>>>> +Suppose we now create a new mount namespace
>>>> +owned by a (new) subordinate user namespace.
>>>> +The new mount namespace will inherit copies of all of the mounts
>>>> +from the previous mount namespace.
>>>> +However, those mounts will be locked because the new mount namespace
>>>> +is owned by a less privileged user namespace.
>>>> +Consequently, an attempt to unmount the mount fails:
>>>> +.IP
>>>> +.RS
>>>> +.in +4n
>>>> +.EX
>>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>>> +               \fBstrace \-o /tmp/log \e\fP
>>>> +               \fBumount /mnt/dir\fP
>>>> +umount: /mnt/dir: not mounted.
>>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>>> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
>>>> +.EE
>>>> +.in
>>>> +.RE
>>>> +.IP
>>>> +The error message from
>>>> +.BR umount (8)
>>>> +is a little confusing, but the
>>>> +.BR strace (1)
>>>> +output reveals that the underlying
>>>> +.BR umount2 (2)
>>>> +system call failed with the error
>>>> +.BR EINVAL ,
>>>> +which is the error that the kernel returns to indicate that
>>>> +the mount is locked.
>>>
>>> Do you want to mention that you can unmount the entire subtree?  Either
>>> with pivot_root if it is locked to "/" or with
>>> "umount -l /path/to/propagated/directory".
>>
>> Yes, I wondered about that, but hadn't got round to devising 
>> the scenario. How about this:
>>
>> [[
>>        *  Following on from the previous point, note that it is possible
>>           to unmount an entire tree of mounts that propagated as a unit
>                                  ^^^^^ subtree?

Yes, probably better, to prevent misunderstandings. Changed (and in a few
other places also).

>>           into a mount namespace that is owned by a less privileged user
>>           namespace, as illustrated in the following example.
> 
>>
>>           First, we create new user and mount namespaces using
>>           unshare(1).  In the new mount namespace, the propagation type
>>           of all mounts is set to private.  We then create a shared bind
>>           mount at /mnt, and a small hierarchy of mount points underneath
>>           that mount point.
>>
>>               $ PS1='ns1# ' sudo unshare --user --map-root-user \
>>                                      --mount --propagation private bash
>>               ns1# echo $$        # We need the PID of this shell later
>>               778501
>>               ns1# mount --make-shared --bind /mnt /mnt
>>               ns1# mkdir /mnt/x
>>               ns1# mount --make-private -t tmpfs none /mnt/x
>>               ns1# mkdir /mnt/x/y
>>               ns1# mount --make-private -t tmpfs none /mnt/x/y
>>               ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>>               989 986 0:56 / /mnt/x rw,relatime
>>               990 989 0:57 / /mnt/x/y rw,relatime
>>
>>           Continuing in the same shell session, we then create a second
>>           shell in a new mount namespace and a new subordinate (and thus
>>           less privileged) user namespace and check the state of the
>>           propagated mount points rooted at /mnt.
>>
>>               ns1# PS1='ns2# ' unshare --user --map-root-user \
>>                                      --mount --propagation unchanged bash
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>>
>>           Of note in the above output is that the propagation type of the
>>           mount point /mnt has been reduced to slave, as explained near
>>           the start of this subsection.  This means that submount events
>>           will propagate from the master /mnt in "ns1", but propagation
>>           will not occur in the opposite direction.
>>
>>           From a separate terminal window, we then use nsenter(1) to
>>           enter the mount and user namespaces corresponding to "ns1".  In
>>           that terminal window, we then recursively bind mount /mnt/x at
>>           the location /mnt/ppp.
>>
>>               $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>>               ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>>               ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>>               989 986 0:56 / /mnt/x rw,relatime
>>               990 989 0:57 / /mnt/x/y rw,relatime
>>               1242 986 0:56 / /mnt/ppp rw,relatime
>>               1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>>
>>           Because the propagation type of the parent mount, /mnt, was
>>           shared, the recursive bind mount propagated a small tree of
>>           mounts under the slave mount /mnt into "ns2", as can be
>>           verified by executing the following command in that shell
>>           session:
>>
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>>               1244 1239 0:56 / /mnt/ppp rw,relatime
>>               1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>>
>>           While it is not possible to unmount a part of that propagated
>>           subtree (/mnt/ppp/y), it is possible to unmount the entire
>>           tree, as shown by the following commands:
>>
>>               ns2# umount /mnt/ppp/y
>>               umount: /mnt/ppp/y: not mounted.
>>               ns2# umount -l /mnt/ppp      # Succeeds...
>>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>>               1240 1239 0:56 / /mnt/x rw,relatime
>>               1241 1240 0:57 / /mnt/x/y rw,relatime
>> ]]
>>
>> ?
> 
> Yes.
> 
> It is worth noting that in ns2 it is also possible to mount on top of
> /mnt/ppp/y and umount from /mnt/ppp/y.

Yes, good point. I've added some text, and an example for that case.
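
As a rough sketch of that case (this is not the text that went into the
page, only an illustration): a mount created inside "ns2" is not locked,
so it can be stacked on top of the propagated /mnt/ppp/y and removed
again, while the propagated mount underneath stays in place. The helper
name mounts_at is made up here, and the session in the comments assumes
the namespaces from the example above.

```shell
# List the mountinfo entries whose mount point (field 5) equals $1,
# with everything from " - " onward stripped, as in the examples above.
# Reads mountinfo-format text on stdin.
mounts_at() {
    awk -v mp="$1" '$5 == mp' | sed 's/ - .*//'
}

# Hypothetical session in "ns2" (needs the namespaces set up earlier):
#   ns2# mount -t tmpfs none /mnt/ppp/y      # Stack a new mount on top
#   ns2# mounts_at /mnt/ppp/y < /proc/self/mountinfo   # Two entries now
#   ns2# umount /mnt/ppp/y                   # Removes only the new mount
#   ns2# mounts_at /mnt/ppp/y < /proc/self/mountinfo   # Back to one entry
```

Filtering on the mount point field rather than grepping for the path
avoids also matching entries for the parent or for the bind source.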

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-17  3:12  8%   ` Michael Kerrisk (man-pages)
  2021-08-17 14:06  4%     ` Christian Brauner
@ 2021-08-17 15:51  5%     ` Eric W. Biederman
  2021-08-19  0:22 11%       ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 200+ results
From: Eric W. Biederman @ 2021-08-17 15:51 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, linux-fsdevel, containers, Alejandro Colomar,
	Christian Brauner, linux-kernel, Christoph Hellwig

"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> writes:

> Hi Eric,
>
> Thanks for your feedback!
>
> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
>> Michael Kerrisk <mtk.manpages@gmail.com> writes:
>> 
>>> For a long time, this manual page has had a brief discussion of
>>> "locked" mounts, without clearly saying what this concept is, or
>>> why it exists. Expand the discussion with an explanation of what
>>> locked mounts are, why mounts are locked, and some examples of the
>>> effect of locking.
>>>
>>> Thanks to Christian Brauner for a lot of help in understanding
>>> these details.
>>>
>>> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
>>> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
>>> ---
>>>
>>> Hello Eric and others,
>>>
>>> After some quite helpful info from Christian Brauner, I've expanded
>>> the discussion of locked mounts (a concept I didn't really have a
>>> good grasp on) in the mount_namespaces(7) manual page. I would be
>>> grateful to receive review comments, acks, etc., on the patch below.
>>> Could you take a look please?
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 73 insertions(+)
>>>
>>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>>> index e3468bdb7..97427c9ea 100644
>>> --- a/man7/mount_namespaces.7
>>> +++ b/man7/mount_namespaces.7
>>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>>  mount namespace as a single unit,
>>>  and recursive mounts that propagate between
>>>  mount namespaces propagate as a single unit.)
>>> +.IP
>>> +In this context, "may not be separated" means that the mounts
>>> +are locked so that they may not be individually unmounted.
>>> +Consider the following example:
>>> +.IP
>>> +.RS
>>> +.in +4n
>>> +.EX
>>> +$ \fBsudo mkdir /mnt/dir\fP
>>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>>> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
>> 
>> Do we want a more motivating example such as a /proc/sys?
>> 
>> It has been common to mount over /proc files and directories that can be
>> written to by the global root so that users in a mount namespace may not
>> touch them.
>
> Seems reasonable. But I want to check one thing. Can you please
> define "global root". I'm pretty sure I know what you mean, but
> I'd like to know your definition.

I mean uid 0 in the initial user namespace.
This uid owns most of the files in /proc.

Container systems that don't want to use user namespaces frequently
mount over files in proc to prevent using some of the root privileges
that come simply by having uid 0.

Another use is mounting over files on virtual filesystems like proc
to reduce the attack surface.

For reducing what the root user in a container can do, I think using user
namespaces and a uid other than 0 in the initial user namespace is the
better approach.
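
To make the "global root" definition concrete, here is a small sketch
of a check based on /proc/self/uid_map. The function name is invented
for illustration, and the test is only a heuristic: the initial user
namespace shows the identity mapping over the full 32-bit range, but a
privileged process can give a child user namespace the same mapping.

```shell
# Succeed if the uid_map text in $1 is the identity mapping over the
# full 32-bit range ("0 0 4294967295"), which is what the initial user
# namespace shows. Heuristic only, as noted above.
looks_like_initial_userns() {
    set -- $1                  # Split on whitespace into $1 $2 $3
    [ "$#" -eq 3 ] && [ "$1" = 0 ] && [ "$2" = 0 ] && [ "$3" = 4294967295 ]
}

# Hypothetical usage:
#   [ "$(id -u)" -eq 0 ] &&
#       looks_like_initial_userns "$(cat /proc/self/uid_map)" &&
#       echo "probably global root"
```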


>>> +.EE
>>> +.in
>>> +.RE
>>> +.IP
>>> +The above steps, performed in a more privileged user namespace,
>>> +have created a (read-only) bind mount that
>>> +obscures the contents of the directory
>>> +.IR /mnt/dir .
>>> +For security reasons, it should not be possible to unmount
>>> +that mount in a less privileged user namespace,
>>> +since that would reveal the contents of the directory
>>> +.IR /mnt/dir .
>>> +.IP
>>> +Suppose we now create a new mount namespace
>>> +owned by a (new) subordinate user namespace.
>>> +The new mount namespace will inherit copies of all of the mounts
>>> +from the previous mount namespace.
>>> +However, those mounts will be locked because the new mount namespace
>>> +is owned by a less privileged user namespace.
>>> +Consequently, an attempt to unmount the mount fails:
>>> +.IP
>>> +.RS
>>> +.in +4n
>>> +.EX
>>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>>> +               \fBstrace \-o /tmp/log \e\fP
>>> +               \fBumount /mnt/dir\fP
>>> +umount: /mnt/dir: not mounted.
>>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>>> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
>>> +.EE
>>> +.in
>>> +.RE
>>> +.IP
>>> +The error message from
>>> +.BR umount (8)
>>> +is a little confusing, but the
>>> +.BR strace (1)
>>> +output reveals that the underlying
>>> +.BR umount2 (2)
>>> +system call failed with the error
>>> +.BR EINVAL ,
>>> +which is the error that the kernel returns to indicate that
>>> +the mount is locked.
>> 
>> Do you want to mention that you can unmount the entire subtree?  Either
>> with pivot_root if it is locked to "/" or with
>> "umount -l /path/to/propagated/directory".
>
> Yes, I wondered about that, but hadn't got round to devising 
> the scenario. How about this:
>
> [[
>        *  Following on from the previous point, note that it is possible
>           to unmount an entire tree of mounts that propagated as a unit
                                 ^^^^^ subtree?
>           into a mount namespace that is owned by a less privileged user
>           namespace, as illustrated in the following example.

>
>           First, we create new user and mount namespaces using
>           unshare(1).  In the new mount namespace, the propagation type
>           of all mounts is set to private.  We then create a shared bind
>           mount at /mnt, and a small hierarchy of mount points underneath
>           that mount point.
>
>               $ PS1='ns1# ' sudo unshare --user --map-root-user \
>                                      --mount --propagation private bash
>               ns1# echo $$        # We need the PID of this shell later
>               778501
>               ns1# mount --make-shared --bind /mnt /mnt
>               ns1# mkdir /mnt/x
>               ns1# mount --make-private -t tmpfs none /mnt/x
>               ns1# mkdir /mnt/x/y
>               ns1# mount --make-private -t tmpfs none /mnt/x/y
>               ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>               989 986 0:56 / /mnt/x rw,relatime
>               990 989 0:57 / /mnt/x/y rw,relatime
>
>           Continuing in the same shell session, we then create a second
>           shell in a new mount namespace and a new subordinate (and thus
>           less privileged) user namespace and check the state of the
>           propagated mount points rooted at /mnt.
>
>               ns1# PS1='ns2# ' unshare --user --map-root-user \
>                                      --mount --propagation unchanged bash
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
>
>           Of note in the above output is that the propagation type of the
>           mount point /mnt has been reduced to slave, as explained near
>           the start of this subsection.  This means that submount events
>           will propagate from the master /mnt in "ns1", but propagation
>           will not occur in the opposite direction.
>
>           From a separate terminal window, we then use nsenter(1) to
>           enter the mount and user namespaces corresponding to "ns1".  In
>           that terminal window, we then recursively bind mount /mnt/x at
>           the location /mnt/ppp.
>
>               $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>               ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>               ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>               989 986 0:56 / /mnt/x rw,relatime
>               990 989 0:57 / /mnt/x/y rw,relatime
>               1242 986 0:56 / /mnt/ppp rw,relatime
>               1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
>
>           Because the propagation type of the parent mount, /mnt, was
>           shared, the recursive bind mount propagated a small tree of
>           mounts under the slave mount /mnt into "ns2", as can be
>           verified by executing the following command in that shell
>           session:
>
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
>               1244 1239 0:56 / /mnt/ppp rw,relatime
>               1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
>
>           While it is not possible to unmount a part of that propagated
>           subtree (/mnt/ppp/y), it is possible to unmount the entire
>           tree, as shown by the following commands:
>
>               ns2# umount /mnt/ppp/y
>               umount: /mnt/ppp/y: not mounted.
>               ns2# umount -l /mnt/ppp      # Succeeds...
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
> ]]
>
> ?

Yes.

It is worth noting that in ns2 it is also possible to mount on top of
/mnt/ppp/y and umount from /mnt/ppp/y.


Eric

^ permalink raw reply	[relevance 5%]

* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-17  3:12  8%   ` Michael Kerrisk (man-pages)
@ 2021-08-17 14:06  4%     ` Christian Brauner
  2021-08-19  0:24 10%       ` Michael Kerrisk (man-pages)
  2021-08-17 15:51  5%     ` Eric W. Biederman
  1 sibling, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-17 14:06 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Eric W. Biederman, linux-man, linux-fsdevel, containers,
	Alejandro Colomar, linux-kernel, Christoph Hellwig

On Tue, Aug 17, 2021 at 05:12:20AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Eric,
> 
> Thanks for your feedback!
> 
> On 8/16/21 6:03 PM, Eric W. Biederman wrote:
> > Michael Kerrisk <mtk.manpages@gmail.com> writes:
> > 
> >> For a long time, this manual page has had a brief discussion of
> >> "locked" mounts, without clearly saying what this concept is, or
> >> why it exists. Expand the discussion with an explanation of what
> >> locked mounts are, why mounts are locked, and some examples of the
> >> effect of locking.
> >>
> >> Thanks to Christian Brauner for a lot of help in understanding
> >> these details.
> >>
> >> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
> >> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
> >> ---
> >>
> >> Hello Eric and others,
> >>
> >> After some quite helpful info from Christian Brauner, I've expanded
> >> the discussion of locked mounts (a concept I didn't really have a
> >> good grasp on) in the mount_namespaces(7) manual page. I would be
> >> grateful to receive review comments, acks, etc., on the patch below.
> >> Could you take a look please?
> >>
> >> Cheers,
> >>
> >> Michael
> >>
> >>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 73 insertions(+)
> >>
> >> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
> >> index e3468bdb7..97427c9ea 100644
> >> --- a/man7/mount_namespaces.7
> >> +++ b/man7/mount_namespaces.7
> >> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
> >>  mount namespace as a single unit,
> >>  and recursive mounts that propagate between
> >>  mount namespaces propagate as a single unit.)
> >> +.IP
> >> +In this context, "may not be separated" means that the mounts
> >> +are locked so that they may not be individually unmounted.
> >> +Consider the following example:
> >> +.IP
> >> +.RS
> >> +.in +4n
> >> +.EX
> >> +$ \fBsudo mkdir /mnt/dir\fP
> >> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
> >> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
> >> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
> > 
> > Do we want a more motivating example such as a /proc/sys?

It could be even better to use an example involving /etc/shadow, e.g.:

sudo mount --bind /etc /mnt
sudo mount --bind /dev/null /mnt/shadow

The procfs example might be a bit awkward (see below).

> > 
> > It has been common to mount over /proc files and directories that can be
> > written to by the global root so that users in a mount namespace may not
> > touch them.
> 
> Seems reasonable. But I want to check one thing. Can you please
> define "global root". I'm pretty sure I know what you mean, but
> I'd like to know your definition.

(global root == root in the initial user namespace.)

Some application container runtimes have a concept of "masked paths"
where they overmount certain directories they want to hide with an empty
tmpfs and some files they want to hide with /dev/null (see [1]).

But I don't think this is a great example because this overmounting is
mostly needed and done when you're running privileged containers (see [2]).

There's usually no point in overmounting parts of procfs that are
writable by global root. If you're running in an unprivileged container
userns root can't write to any of the files that only global root can.
Otherwise this would be a rather severe security issue.

There might be a use-case for overmounting files that contain global
information that are readable inside user namespaces but then one either
has to question why they are readable in the first place or why this
information needs to be hidden. Examples include /proc/kallsyms and
/proc/keys.

But overall the overmounting of procfs is most sensible when running
privileged containers or when sharing pid namespaces and procfs is
somehow bind-mounted from somewhere. But that means there's no user
namespace in play which means that the mounts aren't locked.

So if the container runtime has e.g. overmounted /proc/kcore with
/dev/null then the privileged container can unmount it. To protect
against this such privileged containers usually drop CAP_SYS_ADMIN.
So the protection here comes from dropping capabilities not from locking
mounts together. All of this makes this a bit of a confusing example.
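
Whether a process still holds CAP_SYS_ADMIN can be read from the CapEff
line of /proc/PID/status. A rough sketch follows; the function name is
made up for illustration, and it relies on CAP_SYS_ADMIN being
capability number 21, as defined in <linux/capability.h>.

```shell
# Succeed if the hexadecimal capability mask $1 (the CapEff value from
# /proc/PID/status, without a "0x" prefix) has CAP_SYS_ADMIN (bit 21)
# set.
has_cap_sys_admin() {
    [ $(( (0x$1 >> 21) & 1 )) -eq 1 ]
}

# Hypothetical usage for the current shell:
#   capeff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status")
#   has_cap_sys_admin "$capeff" && echo "can still call umount2(2)"
```

A container that passes this check and has no user namespace in play
can unmount the overmounts, which is why the mask matters there and
mount locking does not.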

An example where locked mount protection is relied on heavily which I'm
involved in is systemd(-nspawn). All custom mounts a container gets such
as data shared from the host with the container are mounted in a separate
(privileged) mount namespace before the container workload is cloned.
The cloned container then gets a new mount + userns pair and hence, all
the mounts it inherited are now locked.

This way, you can e.g. share /etc with your container and just overmount
/etc/shadow with /dev/null or a custom /etc/shadow (Reason for my
example above.) without dropping capabilities that would prevent the
container from mounting.

So I'd suggest using a simple example. This is not about illustrating
what container runtimes do but what the behavior of a mount namespace
is. There's really no need to overcomplicate this.
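
Put together, the simple example could look something like the sketch
below. It reuses the two bind mounts suggested above and the unshare(1)
invocation from the patch. Two things are my own assumptions: the
function name, and the outer "unshare --mount", which keeps the bind
mounts out of the host's mount namespace (where a shared / could
otherwise propagate the /dev/null mount onto /etc/shadow itself). The
commands need root, so nothing runs unless the function is called.

```shell
# Sketch: hide /mnt/shadow in a privileged mount namespace, then try to
# reveal it from a new mount namespace owned by a new user namespace.
demo_locked_shadow() {
    sudo unshare --mount sh -c '
        mount --bind /etc /mnt &&
        mount --bind /dev/null /mnt/shadow &&
        if unshare --user --map-root-user --mount umount /mnt/shadow; then
            echo "unexpected: /mnt/shadow was unmounted"
        else
            echo "umount failed as expected: the mount is locked"
        fi
    '
}
```

The mounts disappear when the outer namespace exits, so no cleanup is
needed afterwards.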

[1]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L90
[2]: https://github.com/moby/moby/blob/51b06c6795160d8a1ba05d05d6491df7588b2957/oci/defaults.go#L110

> 
> >> +.EE
> >> +.in
> >> +.RE
> >> +.IP
> >> +The above steps, performed in a more privileged user namespace,
> >> +have created a (read-only) bind mount that
> >> +obscures the contents of the directory
> >> +.IR /mnt/dir .
> >> +For security reasons, it should not be possible to unmount
> >> +that mount in a less privileged user namespace,
> >> +since that would reveal the contents of the directory
> >> +.IR /mnt/dir .
> >> +.IP
> >> +Suppose we now create a new mount namespace
> >> +owned by a (new) subordinate user namespace.
> >> +The new mount namespace will inherit copies of all of the mounts
> >> +from the previous mount namespace.
> >> +However, those mounts will be locked because the new mount namespace
> >> +is owned by a less privileged user namespace.
> >> +Consequently, an attempt to unmount the mount fails:
> >> +.IP
> >> +.RS
> >> +.in +4n
> >> +.EX
> >> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> >> +               \fBstrace \-o /tmp/log \e\fP
> >> +               \fBumount /mnt/dir\fP
> >> +umount: /mnt/dir: not mounted.
> >> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
> >> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
> >> +.EE
> >> +.in
> >> +.RE
> >> +.IP
> >> +The error message from
> >> +.BR umount (8)
> >> +is a little confusing, but the
> >> +.BR strace (1)
> >> +output reveals that the underlying
> >> +.BR umount2 (2)
> >> +system call failed with the error
> >> +.BR EINVAL ,
> >> +which is the error that the kernel returns to indicate that
> >> +the mount is locked.
> > 
> > Do you want to mention that you can unmount the entire subtree?  Either
> > with pivot_root if it is locked to "/" or with
> > "umount -l /path/to/propagated/directory".
> 
> Yes, I wondered about that, but hadn't got round to devising 
> the scenario. How about this:
> 
> [[
>        *  Following on from the previous point, note that it is possible
>           to unmount an entire tree of mounts that propagated as a unit
>           into a mount namespace that is owned by a less privileged user
>           namespace, as illustrated in the following example.
> 
>           First, we create new user and mount namespaces using
>           unshare(1).  In the new mount namespace, the propagation type
>           of all mounts is set to private.  We then create a shared bind
>           mount at /mnt, and a small hierarchy of mount points underneath
>           that mount point.
> 
>               $ PS1='ns1# ' sudo unshare --user --map-root-user \
>                                      --mount --propagation private bash
>               ns1# echo $$        # We need the PID of this shell later
>               778501
>               ns1# mount --make-shared --bind /mnt /mnt
>               ns1# mkdir /mnt/x
>               ns1# mount --make-private -t tmpfs none /mnt/x
>               ns1# mkdir /mnt/x/y
>               ns1# mount --make-private -t tmpfs none /mnt/x/y
>               ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>               989 986 0:56 / /mnt/x rw,relatime
>               990 989 0:57 / /mnt/x/y rw,relatime
> 
>           Continuing in the same shell session, we then create a second
>           shell in a new mount namespace and a new subordinate (and thus
>           less privileged) user namespace and check the state of the
>           propagated mount points rooted at /mnt.
> 
>               ns1# PS1='ns2# ' unshare --user --map-root-user \
>                                      --mount --propagation unchanged bash
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
> 
>           Of note in the above output is that the propagation type of the
>           mount point /mnt has been reduced to slave, as explained near
>           the start of this subsection.  This means that submount events
>           will propagate from the master /mnt in "ns1", but propagation
>           will not occur in the opposite direction.
> 
>           From a separate terminal window, we then use nsenter(1) to
>           enter the mount and user namespaces corresponding to "ns1".  In
>           that terminal window, we then recursively bind mount /mnt/x at
>           the location /mnt/ppp.
> 
>               $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
>               ns3# mount --rbind --make-private /mnt/x /mnt/ppp
>               ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               986 83 8:5 /mnt /mnt rw,relatime shared:344
>               989 986 0:56 / /mnt/x rw,relatime
>               990 989 0:57 / /mnt/x/y rw,relatime
>               1242 986 0:56 / /mnt/ppp rw,relatime
>               1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
> 
>           Because the propagation type of the parent mount, /mnt, was
>           shared, the recursive bind mount propagated a small tree of
>           mounts under the slave mount /mnt into "ns2", as can be
>           verified by executing the following command in that shell
>           session:
> 
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
>               1244 1239 0:56 / /mnt/ppp rw,relatime
>               1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
> 
>           While it is not possible to unmount a part of that propagated
>           subtree (/mnt/ppp/y), it is possible to unmount the entire
>           tree, as shown by the following commands:
> 
>               ns2# umount /mnt/ppp/y
>               umount: /mnt/ppp/y: not mounted.
>               ns2# umount -l /mnt/ppp      # Succeeds...
>               ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
>               1239 1204 8:5 /mnt /mnt rw,relatime master:344
>               1240 1239 0:56 / /mnt/x rw,relatime
>               1241 1240 0:57 / /mnt/x/y rw,relatime
> ]]
> 
> ?

I'd just add a note that mounts which propagated locked together as a
unit can be unmounted only as a unit (which is intuitive but may need to
be spelled out). But I'd leave out this lengthy example, as it makes the
manpage pretty convoluted.

Christian

^ permalink raw reply	[relevance 4%]

* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-16 16:03  5% ` Eric W. Biederman
@ 2021-08-17  3:12  8%   ` Michael Kerrisk (man-pages)
  2021-08-17 14:06  4%     ` Christian Brauner
  2021-08-17 15:51  5%     ` Eric W. Biederman
  0 siblings, 2 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-17  3:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, linux-man, linux-fsdevel, containers,
	Alejandro Colomar, Christian Brauner, linux-kernel,
	Christoph Hellwig

Hi Eric,

Thanks for your feedback!

On 8/16/21 6:03 PM, Eric W. Biederman wrote:
> Michael Kerrisk <mtk.manpages@gmail.com> writes:
> 
>> For a long time, this manual page has had a brief discussion of
>> "locked" mounts, without clearly saying what this concept is, or
>> why it exists. Expand the discussion with an explanation of what
>> locked mounts are, why mounts are locked, and some examples of the
>> effect of locking.
>>
>> Thanks to Christian Brauner for a lot of help in understanding
>> these details.
>>
>> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
>> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
>> ---
>>
>> Hello Eric and others,
>>
>> After some quite helpful info from Christian Brauner, I've expanded
>> the discussion of locked mounts (a concept I didn't really have a
>> good grasp on) in the mount_namespaces(7) manual page. I would be
>> grateful to receive review comments, acks, etc., on the patch below.
>> Could you take a look please?
>>
>> Cheers,
>>
>> Michael
>>
>>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 73 insertions(+)
>>
>> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
>> index e3468bdb7..97427c9ea 100644
>> --- a/man7/mount_namespaces.7
>> +++ b/man7/mount_namespaces.7
>> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>>  mount namespace as a single unit,
>>  and recursive mounts that propagate between
>>  mount namespaces propagate as a single unit.)
>> +.IP
>> +In this context, "may not be separated" means that the mounts
>> +are locked so that they may not be individually unmounted.
>> +Consider the following example:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo mkdir /mnt/dir\fP
>> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
>> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
>> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
> 
> Do we want a more motivating example such as a /proc/sys?
> 
> It has been common to mount over /proc files and directories that can be
> written to by the global root so that users in a mount namespace may not
> touch them.

Seems reasonable. But I want to check one thing. Can you please
define "global root"? I'm pretty sure I know what you mean, but
I'd like to know your definition.

>> +.EE
>> +.in
>> +.RE
>> +.IP
>> +The above steps, performed in a more privileged user namespace,
>> +have created a (read-only) bind mount that
>> +obscures the contents of the directory
>> +.IR /mnt/dir .
>> +For security reasons, it should not be possible to unmount
>> +that mount in a less privileged user namespace,
>> +since that would reveal the contents of the directory
>> +.IR /mnt/dir .
>> +.IP
>> +Suppose we now create a new mount namespace
>> +owned by a (new) subordinate user namespace.
>> +The new mount namespace will inherit copies of all of the mounts
>> +from the previous mount namespace.
>> +However, those mounts will be locked because the new mount namespace
>> +is owned by a less privileged user namespace.
>> +Consequently, an attempt to unmount the mount fails:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>> +               \fBstrace \-o /tmp/log \e\fP
>> +               \fBumount /mnt/dir\fP
>> +umount: /mnt/dir: not mounted.
>> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
>> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
>> +.EE
>> +.in
>> +.RE
>> +.IP
>> +The error message from
>> +.BR umount (8)
>> +is a little confusing, but the
>> +.BR strace (1)
>> +output reveals that the underlying
>> +.BR umount2 (2)
>> +system call failed with the error
>> +.BR EINVAL ,
>> +which is the error that the kernel returns to indicate that
>> +the mount is locked.
> 
> Do you want to mention that you can unmount the entire subtree?  Either
> with pivot_root if it is locked to "/" or with
> "umount -l /path/to/propagated/directory".

Yes, I wondered about that, but hadn't got round to devising 
the scenario. How about this:

[[
       *  Following on from the previous point, note that it is possible
          to unmount an entire tree of mounts that propagated as a unit
          into a mount namespace that is owned by a less privileged user
          namespace, as illustrated in the following example.

          First, we create new user and mount namespaces using
          unshare(1).  In the new mount namespace, the propagation type
          of all mounts is set to private.  We then create a shared bind
          mount at /mnt, and a small hierarchy of mount points underneath
          that mount point.

              $ PS1='ns1# ' sudo unshare --user --map-root-user \
                                     --mount --propagation private bash
              ns1# echo $$        # We need the PID of this shell later
              778501
              ns1# mount --make-shared --bind /mnt /mnt
              ns1# mkdir /mnt/x
              ns1# mount --make-private -t tmpfs none /mnt/x
              ns1# mkdir /mnt/x/y
              ns1# mount --make-private -t tmpfs none /mnt/x/y
              ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              986 83 8:5 /mnt /mnt rw,relatime shared:344
              989 986 0:56 / /mnt/x rw,relatime
              990 989 0:57 / /mnt/x/y rw,relatime

          Continuing in the same shell session, we then create a second
          shell in a new mount namespace and a new subordinate (and thus
          less privileged) user namespace and check the state of the
          propagated mount points rooted at /mnt.

              ns1# PS1='ns2# ' unshare --user --map-root-user \
                                     --mount --propagation unchanged bash
              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime

          Of note in the above output is that the propagation type of the
          mount point /mnt has been reduced to slave, as explained near
          the start of this subsection.  This means that submount events
          will propagate from the master /mnt in "ns1", but propagation
          will not occur in the opposite direction.

          From a separate terminal window, we then use nsenter(1) to
          enter the mount and user namespaces corresponding to "ns1".  In
          that terminal window, we then recursively bind mount /mnt/x at
          the location /mnt/ppp.

              $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
              ns3# mount --rbind --make-private /mnt/x /mnt/ppp
              ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              986 83 8:5 /mnt /mnt rw,relatime shared:344
              989 986 0:56 / /mnt/x rw,relatime
              990 989 0:57 / /mnt/x/y rw,relatime
              1242 986 0:56 / /mnt/ppp rw,relatime
              1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518

          Because the propagation type of the parent mount, /mnt, was
          shared, the recursive bind mount propagated a small tree of
          mounts under the slave mount /mnt into "ns2", as can be
          verified by executing the following command in that shell
          session:

              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime
              1244 1239 0:56 / /mnt/ppp rw,relatime
              1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518

          While it is not possible to unmount a part of that propagated
          subtree (/mnt/ppp/y), it is possible to unmount the entire
          tree, as shown by the following commands:

              ns2# umount /mnt/ppp/y
              umount: /mnt/ppp/y: not mounted.
              ns2# umount -l /mnt/ppp      # Succeeds...
              ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
              1239 1204 8:5 /mnt /mnt rw,relatime master:344
              1240 1239 0:56 / /mnt/x rw,relatime
              1241 1240 0:57 / /mnt/x/y rw,relatime
]]

?
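
For anyone reproducing the example: the propagation state shown in the
listings lives in the optional fields of each mountinfo line (shared:N,
master:N), which is what remains after the mount options once the sed
filter has cut the line at " - ". A minimal sketch for pulling those
tags out, assuming a POSIX shell and awk; the function name
mnt_propagation is made up for illustration and is not part of any
existing tool:

```shell
# mnt_propagation (hypothetical helper): read mountinfo-format lines on
# stdin and print "<mount point> <propagation tags>" for each mount.
# Field 5 is the mount point; fields 7 onward, up to the "-" separator
# (or end of line if it was stripped), are the optional fields.
# A mount with no optional fields is reported as private.
mnt_propagation() {
    awk '{
        tags = ""
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags == "" ? "" : ",") $i
        print $5, (tags == "" ? "private" : tags)
    }'
}
```

It can be fed either the filtered lines shown in the example or the
unfiltered file, e.g. "mnt_propagation < /proc/self/mountinfo".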

Thanks,

Michael

> 
>>  .IP *
>>  The
>>  .BR mount (2)
>> @@ -128,6 +184,23 @@ settings become locked
>>  when propagated from a more privileged to
>>  a less privileged mount namespace,
>>  and may not be changed in the less privileged mount namespace.
>> +.IP
>> +This point can be illustrated by a continuation of the previous example.
>> +In that example, the bind mount was marked as read-only.
>> +For security reasons,
>> +it should not be possible to make the mount writable in
>> +a less privileged namespace, and indeed the kernel prevents this,
>> +as illustrated by the following:
>> +.IP
>> +.RS
>> +.in +4n
>> +.EX
>> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
>> +               \fBmount \-o remount,rw /mnt/dir\fP
>> +mount: /mnt/dir: permission denied.
>> +.EE
>> +.in
>> +.RE
>>  .IP *
>>  .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
>>  A file or directory that is a mount point in one namespace that is not
> 
> Eric
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 8%]

* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-13 22:01  8% [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts" Michael Kerrisk
  2021-08-14  8:09  5% ` Christian Brauner
@ 2021-08-16 16:03  5% ` Eric W. Biederman
  2021-08-17  3:12  8%   ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 200+ results
From: Eric W. Biederman @ 2021-08-16 16:03 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: linux-man, linux-fsdevel, containers, Alejandro Colomar,
	Christian Brauner, linux-kernel, Christoph Hellwig

Michael Kerrisk <mtk.manpages@gmail.com> writes:

> For a long time, this manual page has had a brief discussion of
> "locked" mounts, without clearly saying what this concept is, or
> why it exists. Expand the discussion with an explanation of what
> locked mounts are, why mounts are locked, and some examples of the
> effect of locking.
>
> Thanks to Christian Brauner for a lot of help in understanding
> these details.
>
> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
> ---
>
> Hello Eric and others,
>
> After some quite helpful info from Christian Brauner, I've expanded
> the discussion of locked mounts (a concept I didn't really have a
> good grasp on) in the mount_namespaces(7) manual page. I would be
> grateful to receive review comments, acks, etc., on the patch below.
> Could you take a look please?
>
> Cheers,
>
> Michael
>
>  man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>
> diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
> index e3468bdb7..97427c9ea 100644
> --- a/man7/mount_namespaces.7
> +++ b/man7/mount_namespaces.7
> @@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
>  mount namespace as a single unit,
>  and recursive mounts that propagate between
>  mount namespaces propagate as a single unit.)
> +.IP
> +In this context, "may not be separated" means that the mounts
> +are locked so that they may not be individually unmounted.
> +Consider the following example:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo mkdir /mnt/dir\fP
> +$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
> +$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
> +$ \fBls /mnt/dir\fP   # Former contents of directory are invisible

Do we want a more motivating example such as a /proc/sys?

It has been common to mount over /proc files and directories that can be
written to by the global root so that users in a mount namespace may not
touch them.


> +.EE
> +.in
> +.RE
> +.IP
> +The above steps, performed in a more privileged user namespace,
> +have created a (read-only) bind mount that
> +obscures the contents of the directory
> +.IR /mnt/dir .
> +For security reasons, it should not be possible to unmount
> +that mount in a less privileged user namespace,
> +since that would reveal the contents of the directory
> +.IR /mnt/dir .
> +.IP
> +Suppose we now create a new mount namespace
> +owned by a (new) subordinate user namespace.
> +The new mount namespace will inherit copies of all of the mounts
> +from the previous mount namespace.
> +However, those mounts will be locked because the new mount namespace
> +is owned by a less privileged user namespace.
> +Consequently, an attempt to unmount the mount fails:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> +               \fBstrace \-o /tmp/log \e\fP
> +               \fBumount /mnt/dir\fP
> +umount: /mnt/dir: not mounted.
> +$ \fBgrep \(aq^umount\(aq /tmp/log\fP
> +umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
> +.EE
> +.in
> +.RE
> +.IP
> +The error message from
> +.BR umount (8)
> +is a little confusing, but the
> +.BR strace (1)
> +output reveals that the underlying
> +.BR umount2 (2)
> +system call failed with the error
> +.BR EINVAL ,
> +which is the error that the kernel returns to indicate that
> +the mount is locked.

Do you want to mention that you can unmount the entire subtree?  Either
with pivot_root if it is locked to "/" or with
"umount -l /path/to/propagated/directory".

>  .IP *
>  The
>  .BR mount (2)
> @@ -128,6 +184,23 @@ settings become locked
>  when propagated from a more privileged to
>  a less privileged mount namespace,
>  and may not be changed in the less privileged mount namespace.
> +.IP
> +This point can be illustrated by a continuation of the previous example.
> +In that example, the bind mount was marked as read-only.
> +For security reasons,
> +it should not be possible to make the mount writable in
> +a less privileged namespace, and indeed the kernel prevents this,
> +as illustrated by the following:
> +.IP
> +.RS
> +.in +4n
> +.EX
> +$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
> +               \fBmount \-o remount,rw /mnt/dir\fP
> +mount: /mnt/dir: permission denied.
> +.EE
> +.in
> +.RE
>  .IP *
>  .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
>  A file or directory that is a mount point in one namespace that is not

Eric

^ permalink raw reply	[relevance 5%]

* Re: [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
  2021-08-13 22:01  8% [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts" Michael Kerrisk
@ 2021-08-14  8:09  5% ` Christian Brauner
  2021-08-16 16:03  5% ` Eric W. Biederman
  1 sibling, 0 replies; 200+ results
From: Christian Brauner @ 2021-08-14  8:09 UTC (permalink / raw)
  To: Michael Kerrisk
  Cc: ebiederm, linux-man, linux-fsdevel, containers,
	Alejandro Colomar, linux-kernel, Christoph Hellwig

On Sat, Aug 14, 2021 at 12:01:20AM +0200, Michael Kerrisk wrote:
> For a long time, this manual page has had a brief discussion of
> "locked" mounts, without clearly saying what this concept is, or
> why it exists. Expand the discussion with an explanation of what
> locked mounts are, why mounts are locked, and some examples of the
> effect of locking.
> 
> Thanks to Christian Brauner for a lot of help in understanding
> these details.
> 
> Link: https://lore.kernel.org/r/20210813220120.502058-1-mtk.manpages@gmail.com
> Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
> Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
> ---

Looks good. Thank you!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

^ permalink raw reply	[relevance 5%]

* [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts"
@ 2021-08-13 22:01  8% Michael Kerrisk
  2021-08-14  8:09  5% ` Christian Brauner
  2021-08-16 16:03  5% ` Eric W. Biederman
  0 siblings, 2 replies; 200+ results
From: Michael Kerrisk @ 2021-08-13 22:01 UTC (permalink / raw)
  To: ebiederm
  Cc: Michael Kerrisk, linux-man, linux-fsdevel, containers,
	Alejandro Colomar, Christian Brauner, linux-kernel,
	Christoph Hellwig

For a long time, this manual page has had a brief discussion of
"locked" mounts, without clearly saying what this concept is, or
why it exists. Expand the discussion with an explanation of what
locked mounts are, why mounts are locked, and some examples of the
effect of locking.

Thanks to Christian Brauner for a lot of help in understanding
these details.

Reported-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>
---

Hello Eric and others,

After some quite helpful info from Christian Brauner, I've expanded
the discussion of locked mounts (a concept I didn't really have a
good grasp on) in the mount_namespaces(7) manual page. I would be
grateful to receive review comments, acks, etc., on the patch below.
Could you take a look please?

Cheers,

Michael

 man7/mount_namespaces.7 | 73 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/man7/mount_namespaces.7 b/man7/mount_namespaces.7
index e3468bdb7..97427c9ea 100644
--- a/man7/mount_namespaces.7
+++ b/man7/mount_namespaces.7
@@ -107,6 +107,62 @@ operation brings across all of the mounts from the original
 mount namespace as a single unit,
 and recursive mounts that propagate between
 mount namespaces propagate as a single unit.)
+.IP
+In this context, "may not be separated" means that the mounts
+are locked so that they may not be individually unmounted.
+Consider the following example:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo mkdir /mnt/dir\fP
+$ \fBsudo sh \-c \(aqecho "aaaaaa" > /mnt/dir/a\(aq\fP
+$ \fBsudo mount \-\-bind -o ro /some/path /mnt/dir\fP
+$ \fBls /mnt/dir\fP   # Former contents of directory are invisible
+.EE
+.in
+.RE
+.IP
+The above steps, performed in a more privileged user namespace,
+have created a (read-only) bind mount that
+obscures the contents of the directory
+.IR /mnt/dir .
+For security reasons, it should not be possible to unmount
+that mount in a less privileged user namespace,
+since that would reveal the contents of the directory
+.IR /mnt/dir .
+.IP
+Suppose we now create a new mount namespace
+owned by a (new) subordinate user namespace.
+The new mount namespace will inherit copies of all of the mounts
+from the previous mount namespace.
+However, those mounts will be locked because the new mount namespace
+is owned by a less privileged user namespace.
+Consequently, an attempt to unmount the mount fails:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+               \fBstrace \-o /tmp/log \e\fP
+               \fBumount /mnt/dir\fP
+umount: /mnt/dir: not mounted.
+$ \fBgrep \(aq^umount\(aq /tmp/log\fP
+umount2("/mnt/dir", 0)     = \-1 EINVAL (Invalid argument)
+.EE
+.in
+.RE
+.IP
+The error message from
+.BR umount (8)
+is a little confusing, but the
+.BR strace (1)
+output reveals that the underlying
+.BR umount2 (2)
+system call failed with the error
+.BR EINVAL ,
+which is the error that the kernel returns to indicate that
+the mount is locked.
 .IP *
 The
 .BR mount (2)
@@ -128,6 +184,23 @@ settings become locked
 when propagated from a more privileged to
 a less privileged mount namespace,
 and may not be changed in the less privileged mount namespace.
+.IP
+This point can be illustrated by a continuation of the previous example.
+In that example, the bind mount was marked as read-only.
+For security reasons,
+it should not be possible to make the mount writable in
+a less privileged namespace, and indeed the kernel prevents this,
+as illustrated by the following:
+.IP
+.RS
+.in +4n
+.EX
+$ \fBsudo unshare \-\-user \-\-map\-root\-user \-\-mount \e\fP
+               \fBmount \-o remount,rw /mnt/dir\fP
+mount: /mnt/dir: permission denied.
+.EE
+.in
+.RE
 .IP *
 .\" (As of 3.18-rc1 (in Al Viro's 2014-08-30 vfs.git#for-next tree))
 A file or directory that is a mount point in one namespace that is not
-- 
2.31.1


^ permalink raw reply related	[relevance 8%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  8:38  4%       ` Christian Brauner
@ 2021-08-13  1:25 10%         ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  1:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

Hello Christian,

On 8/12/21 10:38 AM, Christian Brauner wrote:
> On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
>> [CC += Eric, in case he has a comment on the last piece]

[...]

>>> That's really splitting hairs.
>>
>> To be clear, I'm not trying to split hairs :-). It's just that
>> I'm struggling a little to understand. (In particular, the notion
>> of locked mounts is one where my understanding is weak.) 
>>
>> And think of it like this: I am the first line of defense for the
>> user-space reader. If I am having trouble understanding the text,
>> I won't be alone. And often, the problem is not so much that the
>> text is "wrong", it's that there's a difference in background
>> knowledge between what you know and what the reader (in this case
>> me) knows. Part of my task is to fill that gap, by adding info
>> that I think is necessary to the page (with the happy side
>> effect that I learn along the way.)
> 
> All very good points.
> I didn't mean to complain btw. Sorry that it seemed that way. :)

No problem. I need to think more carefully about my words 
sometimes in mails too :-)

>>> Of course this means that we're
>>> propagating into a mount namespace that is owned by a different user
>>> namespace though "crossing user namespaces" might have been the better
>>> choice.
>>
>> This is a perfect example of the point I make above. You say "of course",
>> but I don't have the background knowledge that you do :-). From my
>> perspective, I want to make sure that I understand your meaning, so
>> that that meaning can (IMHO) be made easier for the average reader
>> of the manual page.
>>
>>>>                  the aforementioned  flags  to  protect  these  sensitive
>>>>                  properties from being altered.
>>>>
>>>>               •  A  new  mount  and user namespace pair is created.  This
>>>>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>>>>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>>>>                  aforementioned flags become locked to protect user name‐
>>>>                  spaces from altering sensitive mount properties.
>>>>
>>>> Again, this seems imprecise. Should it say something like:
>>>> "... to prevent changes to sensitive mount properties in the new 
>>>> mount namespace" ? Or perhaps you have a better wording.
>>>
>>> That's not imprecise. 
>>
>> Okay -- poor choice of wording on my part:
>>
>> s/this seems imprecise/I'm having trouble understanding this/
>>
>>> What you want to protect against is altering
>>> sensitive mount properties from within a user namespace irrespective of
>>> whether or not the user namespace actually owns the mount namespace,
>>> i.e. even if you own the mount namespace you shouldn't be able to alter
>>> those properties. I concede though that "protect" should've been
>>> "prevent".
>>
>> Can I check my education here please. The point is this:
>>
>> * The mount point was created in a mount NS that was owned by
>>   a more privileged user NS (e.g., the initial user NS).
>> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
>>   mount NS.
>> * In the new mount NS, the mounts become locked.
>>
>> And, help me here: is it correct that the reason the properties
>> need to be locked is because they are shared between the mounts?
> 
> Yes, basically.

Yes, but that last sentence of mine was wrong, wasn't it? The 
properties are not actually shared between the mounts, right?
(Earlier, I had done an experiment which misled me into thinking
there was sharing, but now it looks to me like there is not.)

> The new mount namespace contains a copy of all the mounts in the
> previous mount namespace. So they are separate mounts which you can best
> see when you do unshare --mount --propagation=private. An unmount in the
> new mount namespace won't affect the mount in the previous mount
> namespace. Which can only nicely work if they are separate mounts.
> Propagation relies (among other things) on the fact that mount
> namespaces have copies of the mounts.
> 
> The copied mounts in the new mount namespace will have inherited all
> properties they had at the time when copy_namespaces() and specifically
> copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
> into the appropriately named clone_mnt(). This is the low-level routine
> that is responsible for cloning the mounts including their mount
> properties.
> 
> Some mount properties such as read-only, nodev, noexec, nosuid, atime -
> while arguably not per se security mechanisms - are used for protection
> or as security measures in userspace applications. The most obvious one
> might be the read-only property. One wouldn't want to expose a set of
> files as read-only only for someone else to trivially gain write access
> to them. An example of where that could happen is when creating a new
> mount namespace and user namespace pair where the new mount namespace
> is owned by the new user namespace in which the caller is privileged and
> thus the caller would also be able to alter the new mount namespace. So
> without locking flags all it would take to turn a read-only into a
> read-write mount is:
> unshare -U --map-root --mount --propagation=private -- mount -o remount,rw /some/mnt
> Locking such flags prevents that from happening.

Thanks for the detailed explanation; it's very helpful.
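
As a rough aid when experimenting with this: mountinfo does not say
whether a flag is locked, but it does show which of the flags discussed
above are set on each mount, and those are the candidates for locking
once the mount is copied or propagated into a less privileged namespace.
A minimal sketch, assuming a POSIX shell and awk; lockable_flags is a
hypothetical helper name, and the flag list is simply the one from this
thread (ro, nosuid, nodev, noexec, noatime, nodiratime):

```shell
# lockable_flags (hypothetical helper): read mountinfo-format lines on
# stdin and, for each mount that has any of the per-mount flags named in
# this thread, print "<mount point> <flags that are set>".
# Field 5 is the mount point, field 6 the per-mount options.
# This shows only which flags are set, not whether they are locked.
lockable_flags() {
    awk '{
        n = split($6, opt, ",")
        set = ""
        for (i = 1; i <= n; i++)
            if (opt[i] ~ /^(ro|nosuid|nodev|noexec|noatime|nodiratime)$/)
                set = set (set == "" ? "" : ",") opt[i]
        if (set != "") print $5, set
    }'
}
```

For example, "lockable_flags < /proc/self/mountinfo" lists the mounts
for which a remount in a subordinate namespace could run into EPERM.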

>>> You could probably say:
>>>
>>> 	A  new  mount  and user namespace pair is created.  This
>>> 	happens for  example  when  specifying  CLONE_NEWUSER  |
>>> 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
>>> 	The aforementioned flags become locked in the new mount
>>> 	namespace to prevent sensitive mount properties from being
>>> 	altered.
>>> 	Since the newly created mount namespace will be owned by the
>>> 	newly created user namespace a caller privileged in the newly
>>> 	created user namespace would be able to alter sensitive
>>> 	mount properties. For example, without locking the read-only
>>> 	property for the mounts in the new mount namespace such a caller
>>> 	would be able to remount them read-write.
>>
>> So, I've now made the text:
>>
>>        EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>>               locked.  Mount attributes become locked on a mount if:
>>
>>               •  A new mount or mount tree is created causing mount
>>                  propagation across user namespaces (i.e., propagation to
>>                  a mount namespace owned by a different user namespace).
>>                  The kernel will lock the aforementioned flags to prevent
>>                  these sensitive properties from being altered.
>>
>>               •  A new mount and user namespace pair is created.  This
>>                  happens for example when specifying CLONE_NEWUSER |
>>                  CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
>>                  aforementioned flags become locked in the new mount
>>                  namespace to prevent sensitive mount properties from
>>                  being altered.  Since the newly created mount namespace
>>                  will be owned by the newly created user namespace, a
>>                  calling process that is privileged in the new user
>>                  namespace would—in the absence of such locking—be able
>>                  to alter sensitive mount properties (e.g., to remount a
>>                  mount that was marked read-only as read-write in the new
>>                  mount namespace).
>>
>> Okay?
> 
> Sounds good.

Okay.

>>> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>>>  A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
>>>  kick in. For all mounts not marked as expired MNT_LOCKED will be set
>>>  which means that a umount() on any such mount copied from the previous
>>>  mount namespace will yield EINVAL implying from userspace's perspective
>>>  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>>>  - whereas a remount to alter a locked flag will yield EPERM.)
>>
>> Thanks for educating me! So, is that what we are seeing below?

(Was your silence to the above question an implicit "yes"?)

>> $ sudo umount /mnt/m1
>> $ sudo mount -t tmpfs none /mnt/m1
>> $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
>> umount: /mnt/m1: not mounted.
>> $ grep ^umount /tmp/log
>> umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)
>>
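
The two failure modes described above (EINVAL from umount2() on a locked
mount, EPERM from a remount that touches a locked flag) can be told
apart in an strace log with a small filter. This is only a sketch,
assuming a POSIX shell and a sed that supports -E; the function name
explain_mount_errno and the wording of the annotations are mine, and
both errnos have other causes as well:

```shell
# explain_mount_errno (hypothetical helper): read strace output on stdin
# and annotate failed umount2()/mount() calls. The annotations encode
# only what is described in this thread; successful calls are skipped.
explain_mount_errno() {
    sed -n -E 's/^(umount2|mount)\(.*\) += -1 (E[A-Z]+) .*/\1 \2/p' |
    while read -r call err; do
        case "$call:$err" in
            umount2:EINVAL)
                echo "$call: $err (not a mount point, or mount is locked)" ;;
            mount:EPERM)
                echo "$call: $err (no privilege, or flag is locked)" ;;
            *)
                echo "$call: $err" ;;
        esac
    done
}
```

With the log from the session above, "explain_mount_errno < /tmp/log"
would report the umount2() failure together with the locked-mount hint.
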
>> The mount_namespaces(7) page has for a long time had this text:
>>
>>        *  Mounts that come as a single unit from a more privileged mount
>>           namespace are locked together and may not be separated in a
>>           less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
>>           operation brings across all of the mounts from the original
>>           mount namespace as a single unit, and recursive mounts that
>>           propagate between mount namespaces propagate as a single unit.)
>>
>> I have had trouble understanding that. But maybe you just helped.
>> Is that text relevant to what you just wrote above? In particular,
>> I have trouble understanding what "separated" means. But, perhaps
> 
> The text gives the "how" not the "why".

Yes, that's a big problem :-}.

> Consider a more elaborate mount tree where e.g., you have bind-mounted a
> mount over a subdirectory of another mount:
> 
> sudo mount -t tmpfs none /mnt
> sudo mkdir /mnt/my-dir/
> sudo touch /mnt/my-dir/my-file
> sudo mount --bind /opt /mnt/my-dir
> 
> The files underneath /mnt/my-dir are now hidden. Consider what would
> happen if those mounts could be addressed separately. A user
> could then do:
> 
> unshare -U --map-root --mount
> umount /mnt/my-dir
> cat /mnt/my-dir/my-file
> 
> giving them access to what's in my-dir.
> 
> Treating such mount trees as a unit in less privileged mount namespaces
> (cf. [1]) prevents that, i.e., prevents revealing files and directories
> that were overmounted.

Got it!
 
> Treating such mounts as a unit is also relevant when e.g. bind-mounting
> a mount tree containing locked mounts. Sticking with the example above:
> 
> unshare -U --map-root --mount
> 
> # non-recursive bind-mount will fail
> mount --bind /mnt /tmp
> 
> # recursive bind-mount will succeed
> mount --rbind /mnt /tmp
> 
> The reason is again that the mount tree at /mnt is treated as a mount
> unit because it is locked. If one were allowed to non-recursively
> bind-mount /mnt somewhere, it would mean revealing what's underneath
> the mount at my-dir (This is in some sense the inverse of preventing a
> filesystem from being mounted that isn't fully visible, i.e. contains
> hidden or over-mounted mounts.).

Got it!

> These semantics, in addition to being security relevant, also allow a
> more privileged mount namespace to create a restricted view of the
> filesystem hierarchy that can't be circumvented in a less privileged
> mount namespace (Otherwise pivot_root would have to be used which can
> also be used to guarantee a restricted view on the filesystem hierarchy
> especially when combined with a separate rootfs.).

Okay.

Christian, thanks for so generously taking the time to write this up.
It really helped me a lot! I will do some work on the mount namespaces
manual page, to cover at least part of what you said.

Thanks,

Michael

> Christian
> 
> [1]: I'll avoid jumping through the hoops of speaking about ownership
>      all the time now for the sake of brevity. Otherwise I'll still sit
>      here at lunchtime.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 10%]

* Re: [PATCH 5/5] Add manpage for fsconfig(2)
  @ 2021-08-13  0:23 12%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  0:23 UTC (permalink / raw)
  To: David Howells, Alexander Viro; +Cc: linux-fsdevel, linux-man, Linux API, lkml

Hello David,

As noted in another mail, I will ping on all of the mails, just to
raise all the patches to the top of the inbox.

Thanks,

Michael


On Thu, 27 Aug 2020 at 13:07, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello David,
>
> On 8/24/20 2:25 PM, David Howells wrote:
> > Add a manual page to document the fsconfig() system call.
> >
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> >
> >  man2/fsconfig.2 |  277 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 277 insertions(+)
> >  create mode 100644 man2/fsconfig.2
> >
> > diff --git a/man2/fsconfig.2 b/man2/fsconfig.2
> > new file mode 100644
> > index 000000000..da53d2fcb
> > --- /dev/null
> > +++ b/man2/fsconfig.2
> > @@ -0,0 +1,277 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells <dhowells@redhat.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH FSCONFIG 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +fsconfig \- Filesystem parameterisation
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/mount.h>
> > +.B #include <unistd.h>
> > +.B #include <sys/mount.h>
> > +.PP
> > +.BI "int fsconfig(int " fd ", unsigned int " cmd ", const char *" key ,
> > +.br
> > +.BI "             const void __user *" value ", int " aux ");"
> > +.br
>
> Please remove two instances of .br above
>
> > +.BI
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call.
> > +.SH DESCRIPTION
> > +.PP
> > +.BR fsconfig ()
> > +is used to supply parameters to and issue commands against a filesystem
> > +configuration context as set up by
> > +.BR fsopen (2)
> > +or
> > +.BR fspick (2).
> > +The context is supplied attached to the file descriptor specified by
>
> s/by/by the/
>
> > +.I fd
> > +argument.
> > +.PP
> > +The
> > +.I cmd
> > +argument indicates the command to be issued, where some of the commands simply
> > +supply parameters to the context.  The meaning of
> > +.IR key ", " value " and " aux
> > +are command-dependent; unless required for the command, these should be set to
>
> "should" or "must"? If not "must", why not? (It feels like an API design
> error not to require these to be NULL/0 in cases where they are not used.)
>
> > +NULL or 0.
> > +.PP
> > +The available commands are:
> > +.TP
> > +.B FSCONFIG_SET_FLAG
> > +Set the parameter named by
> > +.IR key
> > +to true.  This may fail with error
>
> s/with error/with the error/
> (and multiple times below)
>
> > +.B EINVAL
> > +if the parameter requires an argument.
> > +.TP
> > +.B FSCONFIG_SET_STRING
> > +Set the parameter named by
> > +.I key
> > +to a string.  This may fail with error
> > +.B EINVAL
> > +if the parser doesn't want a parameter here, wants a non-string or the string
> > +cannot be interpreted appropriately.
> > +.I value
> > +points to a NUL-terminated string.
> > +.TP
> > +.B FSCONFIG_SET_BINARY
> > +Set the parameter named by
> > +.I key
> > +to be a binary blob argument.  This may cause
> > +.B EINVAL
> > +to be returned if the filesystem parser isn't expecting a binary blob and it
> > +can't be converted to something usable.
> > +.I value
> > +points to the data and
> > +.I aux
> > +indicates the size of the data.
> > +.TP
> > +.B FSCONFIG_SET_PATH
> > +Set the parameter named by
> > +.I key
> > +to the object at the provided path.
> > +.I value
> > +should point to a NUL-terminated pathname string and aux may indicate
> > +.B AT_FDCWD
> > +or a file descriptor indicating a directory from which to begin a relative
> > +path resolution.  This may fail with error
> > +.B EINVAL
> > +if the parameter isn't expecting a path; it may also fail if the path cannot
> > +be resolved with the typcal errors for that
>
> s/typcal/typical/
>
> > +.RB "(" ENOENT ", " ENOTDIR ", " EPERM ", " EACCES ", etc.)."
> > +.IP
> > +Note that FSCONFIG_SET_STRING can be used instead, implying AT_FDCWD.
>
> I don't understand the preceding sentence. Can you rewrite to supply more
> detail? (E.g., "instead *of what*")
>
> > +.TP
> > +.B FSCONFIG_SET_PATH_EMPTY
> > +As FSCONFIG_SET_PATH, but with
> > +.B AT_EMPTY_PATH
> > +applied to the pathwalk.
>
> Can you please supply a bit more detail here, rather than just referring to
> FSCONFIG_SET_PATH.
>
> > +.TP
> > +.B FSCONFIG_SET_FD
> > +Set the parameter named by
> > +.I key
> > +to the file descriptor specified by
> > +.IR aux .
> > +This will fail with
> > +.B EINVAL
> > +if the parameter doesn't expect a file descriptor or
> > +.B EBADF
> > +if the file descriptor is invalid.
>
> Can you mention some use cases for FSCONFIG_SET_FD here please?
>
> > +.IP
> > +Note that FSCONFIG_SET_STRING can be used instead with the file descriptor
> > +passed as a decimal string.
> > +.TP
> > +.B FSCONFIG_CMD_CREATE
> > +This command triggers the filesystem to take the parameters set in the context
> > +and to try to create filesystem representation in the kernel.  If an existing
> > +representation can be shared, the filesystem may do that instead if the
> > +parameters permit.  This is intended for use with
> > +.BR fsopen (2).
> > +.TP
> > +.B FSCONFIG_CMD_RECONFIGURE
> > +This command causes the driver to alter the parameters of an already live
>
> "the driver" seems like the wrong terminology here. (The page never
> mentioned "driver" before this point.) Is there something better?
>
> > +filesystem instance according to the parameters stored in the context.  This
> > +is intended for use with
> > +.BR fspick (2),
> > +but may also be used against the context created by
> > +.BR fsopen ()
> > +after
> > +.BR fsmount (2)
> > +has been called on it.
>
> s/it/that context/
>
> > +
> > +.\"________________________________________________________
>
> Please remove above two lines.
>
> > +.SH EXAMPLES
>
> Please move the EXAMPLES section to just above SEE ALSO.
>
> Are the following independent examples or all one big example?
> Can you please add some explanatory text to make it clear?
>
> > +.PP
> > +.in +4n
> > +.nf
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > +
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "user_xattr", "false", 0);
> > +
> > +fsconfig(sfd, FSCONFIG_SET_BINARY, "ms_pac", pac_buffer, pac_size);
> > +
> > +fsconfig(sfd, FSCONFIG_SET_PATH, "journal", "/dev/sdd4", AT_FDCWD);
> > +
> > +dirfd = open("/dev/", O_PATH);
> > +fsconfig(sfd, FSCONFIG_SET_PATH, "journal", "sdd4", dirfd);
> > +
> > +fd = open("/overlays/mine/", O_PATH);
> > +fsconfig(sfd, FSCONFIG_SET_PATH_EMPTY, "lower_dir", "", fd);
> > +
> > +pipe(pipefds);
> > +fsconfig(sfd, FSCONFIG_SET_FD, "fd", NULL, pipefds[1]);
> > +.fi
> > +.in
> > +.PP
> > +.SH RETURN VALUE
> > +On success, the function returns 0.  On error, \-1 is returned, and
> > +.I errno
> > +is set appropriately.
> > +.SH ERRORS
> > +The error values given below result from filesystem type independent
> > +errors.
> > +Each filesystem type may have its own special errors and its
>
> s/may/may additionally/
>
> > +own special behavior.
> > +See the Linux kernel source code for details.
> > +.TP
> > +.B EACCES
> > +A component of a path was not searchable.
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EACCES
> > +Mounting a read-only filesystem was attempted without specifying the
> > +.RB ' ro '
> > +parameter.
> > +.TP
> > +.B EACCES
> > +A specified block device is located on a filesystem mounted with the
> > +.B MS_NODEV
> > +option.
> > +.\" mtk: Probably: write permission is required for MS_BIND, with
> > +.\" the error EPERM if not present; CAP_DAC_OVERRIDE is required.
> > +.TP
> > +.B EBADF
> > +The file descriptor given by
> > +.I fd
> > +or possibly by
> > +.I aux
> > +(depending on the command) is invalid.
> > +.TP
> > +.B EBUSY
> > +The context attached to
> > +.I fd
> > +is in the wrong state for the given command.
> > +.TP
> > +.B EBUSY
> > +The filesystem representation cannot be reconfigured read-only because it still
> > +holds files open for writing.
> > +.TP
> > +.B EFAULT
> > +One of the pointer arguments points outside the accessible address space.
> > +.TP
> > +.B EINVAL
> > +.I fd
> > +does not refer to a filesystem configuration context.
> > +.TP
> > +.B EINVAL
> > +One of the source parameters referred to an invalid superblock.
> > +.TP
> > +.B ELOOP
> > +Too many links encountered during pathname resolution.
> > +.TP
> > +.B ENAMETOOLONG
> > +A path name was longer than
> > +.BR MAXPATHLEN .
> > +.TP
> > +.B ENOENT
> > +A pathname was empty or had a nonexistent component.
> > +.TP
> > +.B ENOMEM
> > +The kernel could not allocate sufficient memory to complete the call.
> > +.TP
> > +.B ENOTBLK
> > +Once of the parameters does not refer to a block device (and a device was
>
> s/Once/One/
>
> > +required).
> > +.TP
> > +.B ENOTDIR
> > +.IR pathname ,
>
> But there is no argument "pathname" mentioned in this page!?
>
> > +or a prefix of
> > +.IR source ,
> > +is not a directory.
>
> But there is no argument "source" mentioned in this page!?
>
> (Can you please review all of the errors listed in this section to
> check that they apply to fsconfig().)
>
> > +.TP
> > +.B EOPNOTSUPP
> > +The command given by
> > +.I cmd
> > +was not valid.
> > +.TP
> > +.B ENXIO
> > +The major number of a block device parameter is out of range.
> > +.TP
> > +.B EPERM
> > +The caller does not have the required privileges.
>
> Please name the capability. Also, there was no mention of privileges in
> the text above, so could you please add some text about why/when
> privilege is needed.
>
> > +.SH CONFORMING TO
> > +These functions are Linux-specific and should not be used in programs intended
> > +to be portable.
> > +.SH VERSIONS
> > +.BR fsconfig ()
> > +was added to Linux in kernel 5.2.
> > +.SH NOTES
> > +Glibc does not (yet) provide a wrapper for the
> > +.BR fsconfig ()
> > +system call; call it using
> > +.BR syscall (2).
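
Since the NOTES text points readers at syscall(2), here is a minimal
sketch of what such a call could look like. The names xfsconfig() and
XFSCONFIG_SET_FLAG are hypothetical and not part of the patch; the
fallback number 431 assumes the unified syscall table used by most
architectures, and the value 0 mirrors FSCONFIG_SET_FLAG in the kernel's
enum fsconfig_command. Neither <sys/mount.h> nor <linux/mount.h> is
included, to sidestep header conflicts on some glibc versions.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: fsconfig() is syscall 431 on architectures that use the
   unified numbering; only used if the headers lack SYS_fsconfig. */
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif

/* Hypothetical local name mirroring FSCONFIG_SET_FLAG (value 0 in the
   kernel's enum fsconfig_command). */
#define XFSCONFIG_SET_FLAG 0

/* Thin wrapper around the raw system call.
   Returns 0 on success, or -1 with errno set, like the syscall itself. */
static int
xfsconfig(int fd, unsigned int cmd, const char *key,
          const void *value, int aux)
{
    return (int) syscall(SYS_fsconfig, fd, cmd, key, value, aux);
}
```

With a context file descriptor sfd obtained from fsopen(2), a call such
as xfsconfig(sfd, XFSCONFIG_SET_FLAG, "ro", NULL, 0) would correspond to
the first line of the EXAMPLES section.
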
> > +.SH SEE ALSO
> > +.BR mountpoint (1),
> > +.BR fsmount (2),
> > +.BR fsopen (2),
> > +.BR fspick (2),
> > +.BR mount_namespaces (7),
> > +.BR path_resolution (7)
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH 4/5] Add manpage for fsopen(2) and fsmount(2)
  @ 2021-08-13  0:22 12%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  0:22 UTC (permalink / raw)
  To: David Howells, Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-man, Linux API, lkml

Hello David,

As noted in another mail, I will ping on all of the mails, just to
raise all the patches to the top of the inbox.

Thanks,

Michael

On Thu, 27 Aug 2020 at 13:07, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello David,
>
> On 8/24/20 2:25 PM, David Howells wrote:
> > Add a manual page to document the fsopen() and fsmount() system calls.
> >
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> >
> >  man2/fsmount.2 |    1
> >  man2/fsopen.2  |  245 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 246 insertions(+)
> >  create mode 100644 man2/fsmount.2
> >  create mode 100644 man2/fsopen.2
> >
> > diff --git a/man2/fsmount.2 b/man2/fsmount.2
> > new file mode 100644
> > index 000000000..2bf59fc3e
> > --- /dev/null
> > +++ b/man2/fsmount.2
> > @@ -0,0 +1 @@
> > +.so man2/fsopen.2
> > diff --git a/man2/fsopen.2 b/man2/fsopen.2
> > new file mode 100644
> > index 000000000..1d1bba238
> > --- /dev/null
> > +++ b/man2/fsopen.2
> > @@ -0,0 +1,245 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells <dhowells@redhat.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH FSOPEN 2 2020-08-07 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +fsopen, fsmount \- Filesystem parameterisation and mount creation
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/mount.h>
> > +.B #include <unistd.h>
> > +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> > +.PP
> > +.BI "int fsopen(const char *" fsname ", unsigned int " flags );
> > +.PP
> > +.BI "int fsmount(int " fd ", unsigned int " flags ", unsigned int " mount_attrs );
> > +.fi
> > +.PP
> > +.IR Note :
> > +There are no glibc wrappers for these system calls.
> > +.SH DESCRIPTION
> > +.PP
> > +.BR fsopen ()
> > +creates a blank filesystem configuration context within the kernel for the
> > +filesystem named in the
> > +.I fsname
> > +parameter, puts it into creation mode and attaches it to a file descriptor,
> > +which it then returns.
>
> In the preceding sentence, "it" is used three times, with two *different*
> referents. That's quite hard on the reader.
>
> How about:
>
> [[
> .BR fsopen ()
> creates a blank filesystem configuration context within the kernel for the
> filesystem named in the
> .I fsname
> parameter, puts the context into creation mode and
> attaches it to a file descriptor;
> .BR fsopen ()
> returns the file descriptor as the function result.
> ]]
>
> > The file descriptor can be marked close-on-exec by
> > +setting
> > +.B FSOPEN_CLOEXEC
> > +in
> > +.IR flags .
> > +.PP
> > +After calling fsopen(), the file descriptor should be passed to the
> > +.BR fsconfig (2)
> > +system call, using that to specify the desired filesystem and security
> > +parameters.
> > +.PP
> > +When the parameters are all set, the
> > +.BR fsconfig ()
> > +system call should then be called again with
> > +.B FSCONFIG_CMD_CREATE
> > +as the command argument to effect the creation.
> > +.RS
> > +.PP
> > +.BR "[!]\ NOTE" :
> > +Depending on the filesystem type and parameters, this may rather share an
>
> Please replace "this" with a noun (phrase), since it is a little
> unclear what "this" refers to.
>
> > +existing in-kernel filesystem representation instead of creating a new one.
> > +In such a case, the parameters specified may be discarded or may overwrite the
> > +parameters set by a previous mount - at the filesystem's discretion.
> > +.RE
> > +.PP
> > +The file descriptor also serves as a channel by which more comprehensive error,
> > +warning and information messages may be retrieved from the kernel using
> > +.BR read (2).
> > +.PP
> > +Once the creation command has been successfully run on a context, the context
> > +will not accept further configuration.  At
> > +this point,
> > +.BR fsmount ()
> > +should be called to create a mount object.
> > +.PP
> > +.BR fsmount ()
> > +takes the file descriptor returned by
> > +.BR fsopen ()
> > +and creates a mount object for the filesystem root specified there.  The
> > +attributes of the mount object are set from the
> > +.I mount_attrs
> > +parameter.  The attributes specify the propagation and mount restrictions to
> > +be applied to accesses through this mount.
>
> Can we please have a list of the available attributes here, with a
> description of each attribute.
>
> > +.PP
> > +The mount object is then attached to a new file descriptor that looks like one
> > +created by
> > +.BR open "(2) with " O_PATH " or " open_tree (2).
> > +This can be passed to
> > +.BR move_mount (2)
> > +to attach the mount object to a mountpoint, thereby completing the process.
>
> s/mountpoint/mount point/
>
> In the preceding paragraph, the description is a bit unclear. (Again,
> overuse of pronouns ("this") does not help.) I think it
> would be better to say something like:
>
> [[
> .BR fsmount()
> attaches the mount object to a new file descriptor that looks like one
> created by
> .BR open "(2) with " O_PATH " or " open_tree (2).
> This file descriptor can be passed to
> .BR move_mount (2)
> to attach the mount object to a mount point, thereby completing the process.
> ]]
>
> But, please also replace "the process" with a more meaningful phrase.
>
> > +.PP
> > +The file descriptor returned by fsmount() is marked close-on-exec if
> > +FSMOUNT_CLOEXEC is specified in
> > +.IR flags .
> > +.PP
> > +After fsmount() has completed, the context created by fsopen() is reset and
> > +moved to reconfiguration state, allowing the new superblock to be
> > +reconfigured.  See
> > +.BR fspick (2)
> > +for details.
> > +.PP
> > +To use either of these calls, the caller requires the appropriate privilege
> > +(Linux: the
>
> s/Linux: //
> (this is after all a Linux-specific system call)
>
> > +.B CAP_SYS_ADMIN
> > +capability).
> > +.PP
> > +.SS Message Retrieval Interface
> > +The context file descriptor may be queried for message strings at any time by
>
> s/The context file descriptor/
>   The context file descriptor returned by fsopen()/
>
> > +calling
> > +.BR read (2)
> > +on the file descriptor.  This will return formatted messages that are prefixed
> > +to indicate their class:
> > +.TP
> > +\fB"e <message>"\fP
> > +An error message string was logged.
> > +.TP
> > +\fB"i <message>"\fP
> > +An informational message string was logged.
> > +.TP
> > +\fB"w <message>"\fP
> > +A warning message string was logged.
> > +.PP
> > +Messages are removed from the queue as they're read.
>
> What if there are no pending error messages to retrieve? What does
> read() do in that case? Please add an explanation here.
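
To make the prefix convention concrete, here is a rough sketch of how a
caller might classify one message after read(2) has returned it and the
caller has NUL-terminated the buffer. The enum and the function
fs_msg_classify() are hypothetical helpers, not part of the proposed
API, and the sketch assumes exactly the "e ", "w " and "i " prefixes
listed in the page. It does not address what read() returns when the
queue is empty.

```c
#include <stddef.h>

/* Hypothetical message classes matching the documented prefixes. */
enum fs_msg_class {
    FS_MSG_ERROR,
    FS_MSG_WARNING,
    FS_MSG_INFO,
    FS_MSG_UNKNOWN
};

/* Classify a NUL-terminated message of the form "<c> <text>".
   If text is non-NULL, *text is set to the message body after the
   prefix, or to the whole string when the prefix is not recognised. */
static enum fs_msg_class
fs_msg_classify(const char *msg, const char **text)
{
    enum fs_msg_class cls;

    if (text != NULL)
        *text = msg;
    if (msg == NULL || msg[0] == '\0' || msg[1] != ' ')
        return FS_MSG_UNKNOWN;

    switch (msg[0]) {
    case 'e':
        cls = FS_MSG_ERROR;
        break;
    case 'w':
        cls = FS_MSG_WARNING;
        break;
    case 'i':
        cls = FS_MSG_INFO;
        break;
    default:
        return FS_MSG_UNKNOWN;
    }

    if (text != NULL)
        *text = msg + 2;    /* skip the class letter and the space */
    return cls;
}
```

Treating unrecognised prefixes as a separate class, rather than as an
error, keeps such a helper usable if further message classes are added.
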
>
> > +.SH RETURN VALUE
> > +On success, both functions return a file descriptor.  On error, \-1 is
> > +returned, and
> > +.I errno
> > +is set appropriately.
> > +.SH ERRORS
> > +The error values given below result from filesystem type independent
> > +errors.
> > +Each filesystem type may have its own special errors and its
> > +own special behavior.
> > +See the Linux kernel source code for details.
> > +.TP
> > +.B EBUSY
> > +The context referred to by
> > +.I fd
> > +is not in the right state to be used by
> > +.BR fsmount ().
> > +.TP
> > +.B EFAULT
> > +One of the pointer arguments points outside the user address space.
> > +.TP
> > +.B EINVAL
> > +.I flags
> > +had an invalid flag set.
> > +.TP
> > +.B EINVAL
> > +.I mount_attrs,
> > +includes invalid
> > +.BR MOUNT_ATTR_*
> > +flags.
> > +.TP
> > +.B EMFILE
> > +The process has too many open files to create more.
> > +.TP
> > +.B ENFILE
> > +The system has too many open files to create more.
> > +.TP
> > +.B ENODEV
> > +The filesystem
> > +.I fsname
> > +is not available in the kernel.
> > +.TP
> > +.B ENOMEM
> > +The kernel could not allocate sufficient memory to complete the call.
> > +.TP
> > +.B EPERM
> > +The caller does not have the required privileges.
>
> Please name the required capability.
>
> > +.SH CONFORMING TO
> > +These functions are Linux-specific and should not be used in programs intended
> > +to be portable.
> > +.SH VERSIONS
> > +.BR fsopen "(), and " fsmount ()
> > +were added to Linux in kernel 5.2.
> > +.SH NOTES
> > +Glibc does not (yet) provide a wrapper for the
> > +.BR fsopen "() or " fsmount "()"
> > +system calls; call them using
> > +.BR syscall (2).
> > +.SH EXAMPLES
> > +To illustrate the process, here's an example whereby this can be used to mount
>
> Please replace "this" by a noun (phrase).
>
> > +an ext4 filesystem on /dev/sdb1 onto /mnt.
> > +.PP
> > +.in +4n
> > +.nf
> > +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "source", "/dev/sdb1", 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "iversion", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> > +move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.fi
> > +.in
> > +.PP
> > +Here, an ext4 context is created first and attached to sfd.  The context is
> > +then told where its source will be, given a bunch of options and a superblock
> > +record object is then created.  Then fsmount() is called to create a mount
> > +object and
> > +.BR move_mount (2)
> > +is called to attach it to its intended mountpoint.
>
> s/mountpoint/mount point/
>
> > +.PP
> > +And here's an example of mounting from an NFS server and setting a Smack
> > +security module label on it too:
>
> Please replace "it" with a noun (phrase).
>
> > +.PP
> > +.in +4n
> > +.nf
> > +sfd = fsopen("nfs", 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "source", "example.com:/pub", 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "nfsvers", "3", 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "rsize", "65536", 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "wsize", "65536", 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "smackfsdef", "foolabel", 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "rdma", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > +mfd = fsmount(sfd, 0, MOUNT_ATTR_NODEV);
> > +move_mount(mfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.fi
> > +.in
> > +.PP
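
The examples above omit error handling. As a rough sketch only, the same
sequence with each step checked might look like the following. The
function mount_simple() and the X_* names are hypothetical; the numeric
values are assumed to match the kernel UAPI definitions of
FSOPEN_CLOEXEC, FSMOUNT_CLOEXEC, FSCONFIG_SET_STRING,
FSCONFIG_CMD_CREATE and MOVE_MOUNT_F_EMPTY_PATH, and the fallback
syscall numbers assume the unified syscall table.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed fallback syscall numbers (unified table). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif

/* Hypothetical local names; values assumed to mirror the kernel UAPI. */
#define X_FSOPEN_CLOEXEC          0x1
#define X_FSMOUNT_CLOEXEC         0x1
#define X_FSCONFIG_SET_STRING     1
#define X_FSCONFIG_CMD_CREATE     6
#define X_MOVE_MOUNT_F_EMPTY_PATH 0x4

/* Create a filesystem of type fsname from source and attach it at
   target.  Returns 0 on success, or -1 with errno set by the step
   that failed. */
static int
mount_simple(const char *fsname, const char *source, const char *target)
{
    int sfd, mfd, saved, ret = -1;

    sfd = (int) syscall(SYS_fsopen, fsname, X_FSOPEN_CLOEXEC);
    if (sfd == -1)
        return -1;

    if (syscall(SYS_fsconfig, sfd, X_FSCONFIG_SET_STRING,
                "source", source, 0) == -1)
        goto out_sfd;
    if (syscall(SYS_fsconfig, sfd, X_FSCONFIG_CMD_CREATE,
                NULL, NULL, 0) == -1)
        goto out_sfd;

    /* mount_attrs of 0 requests no additional mount attributes. */
    mfd = (int) syscall(SYS_fsmount, sfd, X_FSMOUNT_CLOEXEC, 0);
    if (mfd == -1)
        goto out_sfd;

    if (syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                X_MOVE_MOUNT_F_EMPTY_PATH) == 0)
        ret = 0;

    saved = errno;
    close(mfd);
    errno = saved;

out_sfd:
    saved = errno;          /* keep the errno of the failing step */
    close(sfd);
    errno = saved;
    return ret;
}
```

A caller with the required privilege might use
mount_simple("tmpfs", "none", "/mnt"); on failure, reading the context
file descriptor before closing it would be the place to collect the
more detailed messages described under "Message Retrieval Interface".
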
> > +.SH SEE ALSO
> > +.BR mountpoint (1),
> > +.BR fsconfig (2),
> > +.BR fspick (2),
> > +.BR move_mount (2),
> > +.BR open_tree (2),
> > +.BR umount (2),
> > +.BR mount_namespaces (7),
> > +.BR path_resolution (7),
> > +.BR mount (8),
> > +.BR umount (8)
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH 3/5] Add manpage for fspick(2)
  @ 2021-08-13  0:22 12%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  0:22 UTC (permalink / raw)
  To: David Howells, Alexander Viro
  Cc: linux-fsdevel, linux-man, Linux API, lkml, Christian Brauner

Hello David,

As noted in another mail, I will ping on all of the mails, just to
raise all the patches to the top of the inbox.

Thanks,

Michael

On Thu, 27 Aug 2020 at 13:05, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello David,
>
> On 8/24/20 2:24 PM, David Howells wrote:
> > Add a manual page to document the fspick() system call.
> >
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> >
> >  man2/fspick.2 |  180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 180 insertions(+)
> >  create mode 100644 man2/fspick.2
> >
> > diff --git a/man2/fspick.2 b/man2/fspick.2
> > new file mode 100644
> > index 000000000..72bf645dd
> > --- /dev/null
> > +++ b/man2/fspick.2
> > @@ -0,0 +1,180 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells <dhowells@redhat.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH FSPICK 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +fspick \- Select filesystem for reconfiguration
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/mount.h>
> > +.B #include <unistd.h>
> > +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> > +.PP
> > +.BI "int fspick(int " dirfd ", const char *" pathname ", unsigned int " flags );
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call.
> > +.SH DESCRIPTION
> > +.PP
> > +.BR fspick ()
> > +creates a new filesystem configuration context within the kernel and attaches a
> > +pre-existing superblock to it so that it can be reconfigured (similar to
> > +.BR mount (8)
> > +with the "-o remount" option).  The configuration context is marked as being in
> > +reconfiguration mode and attached to a file descriptor, which is returned to
> > +the caller.  The file descriptor can be marked close-on-exec by setting
> > +.B FSPICK_CLOEXEC
> > +in
> > +.IR flags .
> > +.PP
> > +The target is whichever superblock backs the object determined by
> > +.IR dirfd ", " pathname " and " flags .
> > +The following can be set in
> > +.I flags
> > +to control the pathwalk to that object:
> > +.TP
> > +.B FSPICK_SYMLINK_NOFOLLOW
> > +Don't follow symbolic links in the final component of the path.
> > +.TP
> > +.B FSPICK_NO_AUTOMOUNT
> > +Don't follow automounts in the final component of the path.
> > +.TP
> > +.B FSPICK_EMPTY_PATH
> > +Allow an empty string to be specified as the pathname.  This allows
> > +.I dirfd
> > +to specify the target mount exactly.
> > +.PP
> > +After calling fspick(), the file descriptor should be passed to the
> > +.BR fsconfig (2)
> > +system call, using that to specify the desired changes to filesystem and
>
> Better: s/using that/in order/
>
> > +security parameters.
> > +.PP
> > +When the parameters are all set, the
> > +.BR fsconfig ()
> > +system call should then be called again with
> > +.B FSCONFIG_CMD_RECONFIGURE
> > +as the command argument to effect the reconfiguration.
> > +.PP
> > +After the reconfiguration has taken place, the context is wiped clean (apart
> > +from the superblock attachment, which remains) and can be reused to make
> > +another reconfiguration.
> > +.PP
> > +The file descriptor also serves as a channel by which more comprehensive error,
> > +warning and information messages may be retrieved from the kernel using
> > +.BR read (2).
> > +.SS Message Retrieval Interface
> > +The context file descriptor may be queried for message strings at any time by
>
> s/descriptor/descriptor returned by fspick()/
>
> > +calling
> > +.BR read (2)
> > +on the file descriptor.  This will return formatted messages that are prefixed
> > +to indicate their class:
> > +.TP
> > +\fB"e <message>"\fP
> > +An error message string was logged.
> > +.TP
> > +\fB"i <message>"\fP
> > +An informational message string was logged.
> > +.TP
> > +\fB"w <message>"\fP
> > +A warning message string was logged.
> > +.PP
> > +Messages are removed from the queue as they're read and the queue has a limited
> > +depth of 8 messages, so it's possible for some to get lost.
>
> What if there are no pending error messages to retrieve? What does
> read() do in that case? Please add an explanation here.
>
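The class prefixes described in the quoted text can be checked with a small helper. The following is only a sketch: the enum and function names are made up for illustration, and it assumes that the caller has already NUL-terminated whatever read(2) returned.

```c
#include <stddef.h>

/* Hypothetical names; only the "e ", "w " and "i " prefixes come from
   the quoted page text. */
enum fs_msg_class {
    FS_MSG_ERROR,
    FS_MSG_WARNING,
    FS_MSG_INFO,
    FS_MSG_UNKNOWN
};

/* Classify one NUL-terminated message by its class prefix.  If 'text'
   is non-NULL, it is set to the message body after the prefix, or to
   'msg' itself when the prefix is not recognized. */
static enum fs_msg_class
classify_fs_message(const char *msg, const char **text)
{
    enum fs_msg_class cls;

    if (text != NULL)
        *text = msg;
    if (msg == NULL || msg[0] == '\0' || msg[1] != ' ')
        return FS_MSG_UNKNOWN;

    switch (msg[0]) {
    case 'e': cls = FS_MSG_ERROR;   break;
    case 'w': cls = FS_MSG_WARNING; break;
    case 'i': cls = FS_MSG_INFO;    break;
    default:  return FS_MSG_UNKNOWN;
    }

    if (text != NULL)
        *text = msg + 2;    /* skip the class letter and the space */
    return cls;
}
```

A caller would typically read(2) from the context file descriptor into a buffer, terminate it, and pass it to this helper. What read(2) does when no message is pending is the open question raised above, so the sketch does not cover that case.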
> > +.SH RETURN VALUE
> > +On success, the function returns a file descriptor.  On error, \-1 is returned,
> > +and
> > +.I errno
> > +is set appropriately.
> > +.SH ERRORS
> > +The error values given below result from filesystem type independent errors.
> > +Additionally, each filesystem type may have its own special errors and its own
> > +special behavior.  See the Linux kernel source code for details.
> > +.TP
> > +.B EACCES
> > +A component of a path was not searchable.
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EFAULT
> > +.I pathname
> > +points outside the user address space.
> > +.TP
> > +.B EINVAL
> > +.I flags
> > +includes an undefined value.
> > +.TP
> > +.B ELOOP
> > +Too many links encountered during pathname resolution.
> > +.TP
> > +.B EMFILE
> > +The process has too many open files to create more.
> > +.TP
> > +.B ENFILE
> > +The system has too many open files to create more.
> > +.TP
> > +.B ENAMETOOLONG
> > +A pathname was longer than
> > +.BR MAXPATHLEN .
>
> MAXPATHLEN is not, I think, a constant known in user space. What is this?
> Should it be PATH_MAX?
>
> > +.TP
> > +.B ENOENT
> > +A pathname was empty or had a nonexistent component.
> > +.TP
> > +.B ENOMEM
> > +The kernel could not allocate sufficient memory to complete the call.
> > +.TP
> > +.B EPERM
> > +The caller does not have the required privileges.
>
> Please note the necessary capability here. Also, there was no mention of
> capabilities/privileges in DESCRIPTION. Should there have been?
>
> > +.SH CONFORMING TO
> > +These functions are Linux-specific and should not be used in programs intended
> > +to be portable.
> > +.SH VERSIONS
> > +.BR fsopen "(), " fsmount "() and " fspick ()
> > +were added to Linux in kernel 5.2.
> > +.SH EXAMPLES
> > +To illustrate the process, here's an example whereby this can be used to
> > +reconfigure a filesystem:
> > +.PP
> > +.in +4n
> > +.nf
> > +sfd = fspick(AT_FDCWD, "/mnt", FSPICK_NO_AUTOMOUNT | FSPICK_CLOEXEC);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "user_xattr", "false", 0);
> > +fsconfig(sfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> > +.fi
> > +.in
> > +.PP
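The quoted example omits error handling and assumes a wrapper that the page itself says glibc lacks. A minimal sketch using syscall(2) might look like the following. The function name is hypothetical, the fallback constants are assumptions based on the generic system call table and linux/mount.h, and only the "ro" flag from the example is set.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>              /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; alpha numbers differ). */
#ifndef SYS_fspick
#define SYS_fspick 433
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef FSPICK_CLOEXEC
#define FSPICK_CLOEXEC 0x00000001
#endif
#ifndef FSPICK_NO_AUTOMOUNT
#define FSPICK_NO_AUTOMOUNT 0x00000004
#endif
#ifndef FSCONFIG_SET_FLAG
#define FSCONFIG_SET_FLAG 0
#endif
#ifndef FSCONFIG_CMD_RECONFIGURE
#define FSCONFIG_CMD_RECONFIGURE 7
#endif

/* Hypothetical helper: reconfigure the filesystem mounted at 'path' to
   be read-only.  Returns 0 on success, or -1 with errno set by the
   call that failed. */
static int
reconfigure_readonly(const char *path)
{
    int sfd, ret, saved;

    sfd = (int) syscall(SYS_fspick, AT_FDCWD, path,
                        FSPICK_NO_AUTOMOUNT | FSPICK_CLOEXEC);
    if (sfd == -1)
        return -1;

    /* Stage the parameter, then ask the kernel to apply it. */
    ret = (int) syscall(SYS_fsconfig, sfd, FSCONFIG_SET_FLAG,
                        "ro", NULL, 0);
    if (ret == 0)
        ret = (int) syscall(SYS_fsconfig, sfd, FSCONFIG_CMD_RECONFIGURE,
                            NULL, NULL, 0);

    saved = errno;              /* close() must not clobber errno */
    close(sfd);
    errno = saved;
    return ret;
}
```

Checking each step matters here because an unprivileged caller, or a kernel older than 5.2, fails at the first call rather than at the final reconfiguration.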
> > +.SH NOTES
> > +Glibc does not (yet) provide a wrapper for the
> > +.BR fspick "()"
> > +system call; call it using
> > +.BR syscall (2).
> > +.SH SEE ALSO
> > +.BR mountpoint (1),
> > +.BR fsconfig (2),
> > +.BR fsopen (2),
> > +.BR path_resolution (7),
> > +.BR mount (8)
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH 2/5] Add manpages for move_mount(2)
  @ 2021-08-13  0:21 12%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  0:21 UTC (permalink / raw)
  To: David Howells, Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-man, Linux API, lkml

Hello David,

As noted in another mail, I will ping on all of the mails, just to
raise all the patches to the top of the inbox.

Thanks,

Michael

On Thu, 27 Aug 2020 at 13:04, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello David,
>
> On 8/24/20 2:24 PM, David Howells wrote:
> > Add manual pages to document the move_mount() system call.
> >
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> >
> >  man2/move_mount.2 |  267 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 267 insertions(+)
> >  create mode 100644 man2/move_mount.2
> >
> > diff --git a/man2/move_mount.2 b/man2/move_mount.2
> > new file mode 100644
> > index 000000000..2ceb775d9
> > --- /dev/null
> > +++ b/man2/move_mount.2
> > @@ -0,0 +1,267 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells <dhowells@redhat.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH MOVE_MOUNT 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +move_mount \- Move mount objects around the filesystem topology
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/mount.h>
> > +.B #include <unistd.h>
> > +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> > +.PP
> > +.BI "int move_mount(int " from_dirfd ", const char *" from_pathname ","
> > +.BI "               int " to_dirfd ", const char *" to_pathname ","
> > +.BI "               unsigned int " flags );
> > +.fi
> > +.PP
> > +.IR Note :
> > +There is no glibc wrapper for this system call.
> > +.SH DESCRIPTION
> > +The
> > +.BR move_mount ()
> > +call moves a mount from one place to another; it can also be used to attach an
> > +unattached mount that was created by
> > +.BR fsmount "() or " open_tree "() with " OPEN_TREE_CLONE .
> > +.PP
> > +If
> > +.BR move_mount ()
> > +is called repeatedly with a file descriptor that refers to a mount object,
> > +then the object will be attached/moved the first time and then moved
> > +repeatedly, detaching it from the previous mountpoint each time.
>
> s/mountpoint/mount point/
> (and all other instances below)
>
> > +.PP
> > +To access the source mount object or the destination mountpoint, no
> > +permissions are required on the object itself, but if either pathname is
> > +supplied, execute (search) permission is required on all of the directories
> > +specified in
> > +.IR from_pathname " or " to_pathname .
> > +.PP
> > +The caller does, however, require the appropriate privilege (Linux: the
>
> s/Linux: //
>
> > +.B CAP_SYS_ADMIN
> > +capability) to move or attach mounts.
> > +.PP
> > +.BR move_mount ()
> > +uses
> > +.IR from_pathname ", " from_dirfd " and part of " flags
> > +to locate the mount object to be moved and
> > +.IR to_pathname ", " to_dirfd " and another part of " flags
> > +to locate the destination mountpoint.  Each lookup can be done in one of a
> > +variety of ways:
> > +.TP
> > +[*] By absolute path.
> > +The pathname points to an absolute path and the dirfd is ignored.  The file is
> > +looked up by name, starting from the root of the filesystem as seen by the
> > +calling process.
> > +.TP
> > +[*] By cwd-relative path.
> > +The pathname points to a relative path and the dirfd is
> > +.IR AT_FDCWD .
> > +The file is looked up by name, starting from the current working directory.
> > +.TP
> > +[*] By dir-relative path.
> > +The pathname points to a relative path and the dirfd indicates a file descriptor
> > +pointing to a directory.  The file is looked up by name, starting from the
> > +directory specified by
> > +.IR dirfd .
> > +.TP
> > +[*] By file descriptor.  The pathname is an empty string (""), the dirfd
>
> Formatting problem here... Add a newline before "The"
>
> > +points directly to the mount object to move or the destination mount point and
> > +the appropriate
> > +.B *_EMPTY_PATH
> > +flag is set.
> > +.PP
> > +.I flags
> > +can be used to influence a path-based lookup.  The value for
> > +.I flags
> > +is constructed by OR'ing together zero or more of the following constants:
> > +.TP
> > +.BR MOVE_MOUNT_F_EMPTY_PATH
> > +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> > +If
> > +.I from_pathname
> > +is an empty string, operate on the file referred to by
> > +.IR from_dirfd
> > +(which may have been obtained using the
> > +.BR open (2)
> > +.B O_PATH
> > +flag or
> > +.BR open_tree ())
> > +If
> > +.I from_dirfd
> > +is
> > +.BR AT_FDCWD ,
> > +the call operates on the current working directory.
> > +In this case,
> > +.I from_dirfd
> > +can refer to any type of file, not just a directory.
> > +This flag is Linux-specific; define
> > +.B _GNU_SOURCE
> > +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> > +to obtain its definition.
> > +.TP
> > +.B MOVE_MOUNT_T_EMPTY_PATH
> > +As above, but operating on
>
> s/As above/As for MOVE_MOUNT_F_EMPTY_PATH/
>
> > +.IR to_pathname " and " to_dirfd .
> > +.TP
> > +.B MOVE_MOUNT_F_AUTOMOUNTS
> > +Don't automount the terminal ("basename") component of
> > +.I from_pathname
> > +if it is a directory that is an automount point.  This allows a mount object
> > +that has an automount point at its root to be moved and prevents unintended
> > +triggering of an automount point.
> > +The
> > +.B MOVE_MOUNT_F_AUTOMOUNTS
> > +flag has no effect if the automount point has already been mounted over.  This
> > +flag is Linux-specific; define
> > +.B _GNU_SOURCE
> > +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> > +to obtain its definition.
> > +.TP
> > +.B MOVE_MOUNT_T_AUTOMOUNTS
> > +As above, but operating on
> > +.IR to_pathname " and " to_dirfd .
> > +This allows an automount point to be manually mounted over.
> > +.TP
> > +.B MOVE_MOUNT_F_SYMLINKS
> > +If
> > +.I from_pathname
> > +is a symbolic link, then dereference it.  The default for
> > +.BR move_mount ()
> > +is to not follow symlinks.
> > +.TP
> > +.B MOVE_MOUNT_T_SYMLINKS
> > +As above, but operating on
> > +.IR to_pathname " and " to_dirfd .
> > +.SH RETURN VALUE
> > +On success, 0 is returned.  On error, \-1 is returned, and
> > +.I errno
> > +is set appropriately.
> > +.SH ERRORS
>
> Should EPERM be in the following list?
>
> > +.TP
> > +.B EACCES
> > +Search permission is denied for one of the directories
> > +in the path prefix of
> > +.IR pathname .
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EBADF
> > +.IR from_dirfd " or " to_dirfd
> > +is not a valid open file descriptor.
> > +.TP
> > +.B EFAULT
> > +.IR from_pathname " or " to_pathname
> > +is NULL or either one points to a location outside the process's accessible
> > +address space.
> > +.TP
> > +.B EINVAL
> > +Reserved flag specified in
>
> Should this rather be, "Invalid flag specified in..." ?
>
> > +.IR flags .
> > +.TP
> > +.B ELOOP
> > +Too many symbolic links encountered while traversing the pathname.
> > +.TP
> > +.B ENAMETOOLONG
> > +.IR from_pathname " or " to_pathname
> > +is too long.
> > +.TP
> > +.B ENOENT
> > +A component of
> > +.IR from_pathname " or " to_pathname
> > +does not exist, or one is an empty string and the appropriate
> > +.B *_EMPTY_PATH
> > +was not specified in
> > +.IR flags .
> > +.TP
> > +.B ENOMEM
> > +Out of memory (i.e., kernel memory).
> > +.TP
> > +.B ENOTDIR
> > +A component of the path prefix of
> > +.IR from_pathname " or " to_pathname
> > +is not a directory or one or the other is relative and the appropriate
> > +.I *_dirfd
> > +is a file descriptor referring to a file other than a directory.
> > +.SH VERSIONS
> > +.BR move_mount ()
> > +was added to Linux in kernel 5.2.
> > +.SH CONFORMING TO
> > +.BR move_mount ()
> > +is Linux-specific.
> > +.SH NOTES
> > +Glibc does not (yet) provide a wrapper for the
> > +.BR move_mount ()
> > +system call; call it using
> > +.BR syscall (2).
> > +.SH EXAMPLES
> > +The
> > +.BR move_mount ()
> > +function can be used like the following:
> > +.PP
> > +.RS
> > +.nf
> > +move_mount(AT_FDCWD, "/a", AT_FDCWD, "/b", 0);
> > +.fi
> > +.RE
> > +.PP
> > +This would move the object mounted on "/a" to "/b".  It can also be used in
>
> s/It/move_mount()/
>
> > +conjunction with
> > +.BR open_tree "(2) or " open "(2) with " O_PATH :
> > +.PP
> > +.RS
> > +.nf
> > +fd = open_tree(AT_FDCWD, "/mnt", 0);
> > +move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> > +move_mount(fd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);
> > +move_mount(fd, "", AT_FDCWD, "/mnt4", MOVE_MOUNT_F_EMPTY_PATH);
> > +.fi
> > +.RE
> > +.PP
> > +This would attach the path point for "/mnt" to fd, then it would move the
> > +mount to "/mnt2", then move it to "/mnt3" and finally to "/mnt4".
> > +.PP
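The repeated-move example above can be sketched with syscall(2) for C libraries that provide no wrapper. The helper names are hypothetical, and the fallback values for SYS_move_mount and MOVE_MOUNT_F_EMPTY_PATH are assumptions taken from the generic system call table and linux/mount.h.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <fcntl.h>              /* AT_FDCWD */
#include <stddef.h>             /* size_t */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; alpha numbers differ). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif

/* Thin wrapper; named sys_move_mount() to avoid clashing with a libc
   that does provide move_mount(). */
static int
sys_move_mount(int from_dirfd, const char *from_pathname,
               int to_dirfd, const char *to_pathname, unsigned int flags)
{
    return (int) syscall(SYS_move_mount, from_dirfd, from_pathname,
                         to_dirfd, to_pathname, flags);
}

/* Move the mount referred to by 'fd' to each target in turn, stopping
   at the first failure.  Returns the number of successful moves; on a
   short count, errno describes the failure. */
static size_t
move_through(int fd, const char *const targets[], size_t n)
{
    size_t moved = 0;

    for (size_t i = 0; i < n; i++) {
        if (sys_move_mount(fd, "", AT_FDCWD, targets[i],
                           MOVE_MOUNT_F_EMPTY_PATH) == -1)
            break;
        moved++;
    }
    return moved;
}
```

With a descriptor from open_tree() for "/mnt" and the targets "/mnt2", "/mnt3" and "/mnt4", a return value of 3 would correspond to the sequence in the quoted example.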
> > +It can also be used to attach new mounts:
>
> s/It/move_mount()/
>
> > +.PP
> > +.RS
> > +.nf
> > +sfd = fsopen("ext4", FSOPEN_CLOEXEC);
> > +fsconfig(sfd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
> > +fsconfig(sfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> > +fsconfig(sfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > +mfd = fsmount(sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NODEV);
> > +move_mount(mfd, "", AT_FDCWD, "/home", MOVE_MOUNT_F_EMPTY_PATH);
> > +.fi
> > +.RE
> > +.PP
> > +Which would open the Ext4 filesystem mounted on "/dev/sda1", turn on user
> > +extended attribute support and create a mount object for it.  Finally, the new
>
> Please replace "it" with a noun (phrase).
>
> > +mount object would be attached with
> > +.BR move_mount ()
> > +to "/home".
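The creation sequence in the quoted example can be written with error checks as sketched below. The function name is hypothetical, the fallback constants are assumed values from the generic system call table and linux/mount.h, and the filesystem-specific "user_xattr" flag from the example is left out to keep the sketch generic.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>              /* AT_FDCWD */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; alpha numbers differ). */
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef SYS_fsopen
#define SYS_fsopen 430
#endif
#ifndef SYS_fsconfig
#define SYS_fsconfig 431
#endif
#ifndef SYS_fsmount
#define SYS_fsmount 432
#endif
#ifndef FSOPEN_CLOEXEC
#define FSOPEN_CLOEXEC 0x00000001
#endif
#ifndef FSMOUNT_CLOEXEC
#define FSMOUNT_CLOEXEC 0x00000001
#endif
#ifndef FSCONFIG_SET_STRING
#define FSCONFIG_SET_STRING 1
#endif
#ifndef FSCONFIG_CMD_CREATE
#define FSCONFIG_CMD_CREATE 6
#endif
#ifndef MOUNT_ATTR_NODEV
#define MOUNT_ATTR_NODEV 0x00000004
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif

/* Hypothetical helper: create a filesystem context for 'fstype', point
   it at 'source', create a detached nodev mount and attach it at
   'target'.  Returns 0 on success, or -1 with errno set. */
static int
mount_new_fs(const char *fstype, const char *source, const char *target)
{
    int sfd, mfd = -1, ret = -1, saved;

    sfd = (int) syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
    if (sfd == -1)
        goto out;
    if (syscall(SYS_fsconfig, sfd, FSCONFIG_SET_STRING,
                "source", source, 0) == -1)
        goto out;
    if (syscall(SYS_fsconfig, sfd, FSCONFIG_CMD_CREATE,
                NULL, NULL, 0) == -1)
        goto out;

    mfd = (int) syscall(SYS_fsmount, sfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NODEV);
    if (mfd == -1)
        goto out;

    ret = (int) syscall(SYS_move_mount, mfd, "", AT_FDCWD, target,
                        MOVE_MOUNT_F_EMPTY_PATH);
out:
    saved = errno;              /* close() must not clobber errno */
    if (mfd != -1)
        close(mfd);
    if (sfd != -1)
        close(sfd);
    errno = saved;
    return ret;
}
```

Closing both descriptors after a successful move_mount() is safe because the mount is by then attached to the target and no longer depends on them.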
> > +.SH SEE ALSO
> > +.BR fsmount (2),
> > +.BR fsopen (2),
> > +.BR open_tree (2)
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH 1/5] Add manpage for open_tree(2)
  @ 2021-08-13  0:20 12%   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-13  0:20 UTC (permalink / raw)
  To: David Howells, Alexander Viro, Christian Brauner
  Cc: linux-fsdevel, linux-man, Linux API, lkml

Hello David,

I've pinged on these manual pages for the new mount API already a few
times in the past.

I would really like to get them out the door, but some work is
required, and I can't do it on my own; I need your help. In
particular, there are a number of open questions whose answers I do not
feel confident guessing.

How can I get your help please with completing these pages?

I will ping on all of the other mails, just to raise all the patches
to the top of the inbox.

Thanks,

Michael


On Thu, 27 Aug 2020 at 13:01, Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hello David,
>
> Can I ask that you please reply to each of my mails, rather than
> just sending out a new patch series (which of course I would also
> like  you to do). Some things that I mentioned in the last mails
> got lost, and I end up having to repeat them.
>
> So, even where I say "please change this", a reply of "done", or a
> reason why you declined the suggested change, is useful.
> But in any case, a few words in reply to explain the other changes
> that you make would be helpful.
>
> Also, some of my questions now will get a little more complex, and as
> well as you updating the pages, I think a little discussion may be
> required in some cases.
>
> On 8/24/20 2:24 PM, David Howells wrote:
> > Add a manual page to document the open_tree() system call.
> >
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > ---
> >
> >  man2/open_tree.2 |  249 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 249 insertions(+)
> >  create mode 100644 man2/open_tree.2
> >
> > diff --git a/man2/open_tree.2 b/man2/open_tree.2
> > new file mode 100644
> > index 000000000..d480bd82f
> > --- /dev/null
> > +++ b/man2/open_tree.2
> > @@ -0,0 +1,249 @@
> > +'\" t
> > +.\" Copyright (c) 2020 David Howells <dhowells@redhat.com>
> > +.\"
> > +.\" %%%LICENSE_START(VERBATIM)
> > +.\" Permission is granted to make and distribute verbatim copies of this
> > +.\" manual provided the copyright notice and this permission notice are
> > +.\" preserved on all copies.
> > +.\"
> > +.\" Permission is granted to copy and distribute modified versions of this
> > +.\" manual under the conditions for verbatim copying, provided that the
> > +.\" entire resulting derived work is distributed under the terms of a
> > +.\" permission notice identical to this one.
> > +.\"
> > +.\" Since the Linux kernel and libraries are constantly changing, this
> > +.\" manual page may be incorrect or out-of-date.  The author(s) assume no
> > +.\" responsibility for errors or omissions, or for damages resulting from
> > +.\" the use of the information contained herein.  The author(s) may not
> > +.\" have taken the same level of care in the production of this manual,
> > +.\" which is licensed free of charge, as they might when working
> > +.\" professionally.
> > +.\"
> > +.\" Formatted or processed versions of this manual, if unaccompanied by
> > +.\" the source, must acknowledge the copyright and authors of this work.
> > +.\" %%%LICENSE_END
> > +.\"
> > +.TH OPEN_TREE 2 2020-08-24 "Linux" "Linux Programmer's Manual"
> > +.SH NAME
> > +open_tree \- Pick or clone mount object and attach to fd
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/types.h>
> > +.B #include <sys/mount.h>
> > +.B #include <unistd.h>
> > +.BR "#include <fcntl.h>           " "/* Definition of AT_* constants */"
> > +.PP
> > +.BI "int open_tree(int " dirfd ", const char *" pathname ", unsigned int " flags );
> > +.fi
> > +.PP
> > +.IR Note :
> > +There are no glibc wrappers for these system calls.
> > +.SH DESCRIPTION
> > +.BR open_tree ()
> > +picks the mount object specified by the pathname and attaches it to a new file
>
> The terminology "pick" is unusual, and you never really explain what
> it means.  Is there better terminology? In any case, can you add a few
> words to explain what the term ("pick" or whatever alternative you
> come up with) means.
>
> > +descriptor or clones it and attaches the clone to the file descriptor.  The
>
> Please replace "it" by a noun (phrase) -- maybe: "the mount object"?
>
> > +resultant file descriptor is indistinguishable from one produced by
> > +.BR open "(2) with " O_PATH .
>
> What is the significance of that last piece? Can you add some words
> about why the fact that the resulting FD is indistinguishable from one
> produced by open() O_PATH matters or is useful?
>
> > +.PP
> > +In the case that the mount object is cloned, the clone will be "unmounted" and
>
> You place "unmounted" in quotes. Why? Is this to signify that the
> unmount is somehow different from other unmounts? If so, please
> explain how it is different.  If not, then I think we can lose the double
> quotes.
>
> > +destroyed when the file descriptor is closed if it is not otherwise mounted
> > +somewhere by calling
> > +.BR move_mount (2).
> > +.PP
> > +To select a mount object, no permissions are required on the object referred
>
> Here you use the word "select". Is this the same as "pick"? If yes, please
> use the same term.
>
> > +to by the path, but execute (search) permission is required on all of the
>
> s/the path/.I pathname/ ?
>
> (Where pathname == "the pathname argument")
>
> > +directories in
> > +.I pathname
> > +that lead to the object.
> > +.PP
> > +Appropriate privilege (Linux: the
>
> s/Linux: //
> (This is a Linux specific system call...)
>
> > +.B CAP_SYS_ADMIN
> > +capability) is required to clone mount objects.
> > +.PP
> > +.BR open_tree ()
> > +uses
> > +.IR pathname ", " dirfd " and " flags
> > +to locate the target object in one of a variety of ways:
> > +.TP
> > +[*] By absolute path.
> > +.I pathname
> > +points to an absolute path and
> > +.I dirfd
> > +is ignored.  The object is looked up by name, starting from the root of the
> > +filesystem as seen by the calling process.
> > +.TP
> > +[*] By cwd-relative path.
> > +.I pathname
> > +points to a relative path and
> > +.IR dirfd " is " AT_FDCWD .
> > +The object is looked up by name, starting from the current working directory.
> > +.TP
> > +[*] By dir-relative path.
> > +.I pathname
> > +points to a relative path and
> > +.I dirfd
> > +indicates a file descriptor pointing to a directory.  The object is looked up
> > +by name, starting from the directory specified by
> > +.IR dirfd .
> > +.TP
> > +[*] By file descriptor.
> > +.I pathname
> > +is "",
> > +.I dirfd
> > +indicates a file descriptor and
> > +.B AT_EMPTY_PATH
> > +is set in
> > +.IR flags .
> > +The mount attached to the file descriptor is queried directly.  The file
> > +descriptor may point to any type of file, not just a directory.
>
> I want to check here. Is it really *any* type of file? Can it be a UNIX
> domain socket or a char/block device or a FIFO?
>
> > +.PP
> > +.I flags
> > +can be used to control the operation of the function and to influence a
> > +path-based lookup.  A value for
> > +.I flags
> > +is constructed by OR'ing together zero or more of the following constants:
> > +.TP
> > +.BR AT_EMPTY_PATH
> > +.\" commit 65cfc6722361570bfe255698d9cd4dccaf47570d
> > +If
> > +.I pathname
> > +is an empty string, operate on the file referred to by
> > +.IR dirfd
> > +(which may have been obtained from
> > +.BR open "(2) with"
> > +.BR O_PATH ", from " fsmount (2)
> > +or from another
>
> s/another/a previous call to/
>
> > +.BR open_tree ()).
> > +If
> > +.I dirfd
> > +is
> > +.BR AT_FDCWD ,
> > +the call operates on the current working directory.
> > +In this case,
> > +.I dirfd
> > +can refer to any type of file, not just a directory.
> > +This flag is Linux-specific; define
> > +.B _GNU_SOURCE
> > +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> > +to obtain its definition.
> > +.TP
> > +.BR AT_NO_AUTOMOUNT
> > +Don't automount the final ("basename") component of
> > +.I pathname
> > +if it is a directory that is an automount point.  This flag allows the
> > +automount point itself to be picked up or a mount cloned that is rooted on the
> > +automount point.  The
> > +.B AT_NO_AUTOMOUNT
> > +flag has no effect if the mount point has already been mounted over.
> > +This flag is Linux-specific; define
> > +.B _GNU_SOURCE
> > +.\" Before glibc 2.16, defining _ATFILE_SOURCE sufficed
> > +to obtain its definition.
> > +.TP
> > +.B AT_SYMLINK_NOFOLLOW
> > +If
> > +.I pathname
> > +is a symbolic link, do not dereference it: instead pick up or clone a mount
> > +rooted on the link itself.
> > +.TP
> > +.B OPEN_TREE_CLOEXEC
> > +Set the close-on-exec flag for the new file descriptor.  This will cause the
> > +file descriptor to be closed automatically when a process exec's.
> > +.TP
> > +.B OPEN_TREE_CLONE
> > +Rather than directly attaching the selected object to the file descriptor,
> > +clone the object, set the root of the new mount object to that point and
>
> Could you expand on "that point" a little. It's not quite clear to me what
> you mean there.
>
> > +attach the clone to the file descriptor.
> > +.TP
> > +.B AT_RECURSIVE
> > +This is only permitted in conjunction with OPEN_TREE_CLONE.  It causes the
> > +entire mount subtree rooted at the selected spot to be cloned rather than just
>
> Is there a better word than "spot"?
>
> > +that one mount object.
> > +.SH RETURN VALUE
> > +On success, the new file descriptor is returned.  On error, \-1 is returned,
> > +and
> > +.I errno
> > +is set appropriately.
> > +.SH ERRORS
> > +.TP
> > +.B EACCES
> > +Search permission is denied for one of the directories
> > +in the path prefix of
> > +.IR pathname .
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EBADF
> > +.I dirfd
> > +is not a valid open file descriptor.
> > +.TP
> > +.B EFAULT
> > +.I pathname
> > +is NULL or
> > +.IR pathname
> > +points to a location outside the process's accessible address space.
> > +.TP
> > +.B EINVAL
> > +Reserved flag specified in
> > +.IR flags .
> > +.TP
> > +.B ELOOP
> > +Too many symbolic links encountered while traversing the pathname.
> > +.TP
> > +.B ENAMETOOLONG
> > +.I pathname
> > +is too long.
> > +.TP
> > +.B ENOENT
> > +A component of
> > +.I pathname
> > +does not exist, or
> > +.I pathname
> > +is an empty string and
> > +.B AT_EMPTY_PATH
> > +was not specified in
> > +.IR flags .
> > +.TP
> > +.B ENOMEM
> > +Out of memory (i.e., kernel memory).
> > +.TP
> > +.B ENOTDIR
> > +A component of the path prefix of
> > +.I pathname
> > +is not a directory or
> > +.I pathname
> > +is relative and
> > +.I dirfd
> > +is a file descriptor referring to a file other than a directory.
> > +.SH VERSIONS
> > +.BR open_tree ()
> > +was added to Linux in kernel 5.2.
> > +.SH CONFORMING TO
> > +.BR open_tree ()
> > +is Linux-specific.
> > +.SH NOTES
> > +Glibc does not (yet) provide a wrapper for the
> > +.BR open_tree ()
> > +system call; call it using
> > +.BR syscall (2).
>
> What's the current status with respect to glibc support? Is it coming/is
> someone working on this?
>
> > +.SH EXAMPLE
>
> s/EXAMPLE/EXAMPLES/
> (That's the standard section header name these days.)
>
> > +The
> > +.BR open_tree ()
> > +function can be used like the following:
>
> The following example does a recursive bind mount, right?
> Can you please add some words to say that explicitly.
>
> > +.PP
> > +.RS
> > +.nf
> > +fd1 = open_tree(AT_FDCWD, "/mnt", 0);
> > +fd2 = open_tree(fd1, "",
> > +                AT_EMPTY_PATH | OPEN_TREE_CLONE | AT_RECURSIVE);
> > +move_mount(fd2, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> > +.fi
> > +.RE
> > +.PP
> > +This would attach the path point for "/mnt" to fd1, then it would copy the
>
> What is a "path point"? This is not standard terminology. Can you
> replace this with something better?
>
> > +entire subtree at the point referred to by fd1 and attach that to fd2; lastly,
> > +it would attach the clone to "/mnt2".
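A recursive bind mount of this kind might be sketched as follows. The helper name is hypothetical and the fallback constants are assumed values. Unlike the quoted example, the sketch collapses the two open_tree() steps into a single cloning call, which is assumed to have the same effect.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>              /* AT_FDCWD, O_CLOEXEC */
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed values; alpha numbers differ). */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif

/* Hypothetical helper: clone the mount subtree at 'src' and attach the
   clone at 'dst'.  Returns 0 on success, or -1 with errno set. */
static int
bind_mount_recursive(const char *src, const char *dst)
{
    int fd, ret, saved;

    /* Step 1: create a detached copy of the whole subtree. */
    fd = (int) syscall(SYS_open_tree, AT_FDCWD, src,
                       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC | AT_RECURSIVE);
    if (fd == -1)
        return -1;

    /* Step 2: attach the detached copy at the destination. */
    ret = (int) syscall(SYS_move_mount, fd, "", AT_FDCWD, dst,
                        MOVE_MOUNT_F_EMPTY_PATH);

    saved = errno;              /* close() must not clobber errno */
    close(fd);
    errno = saved;
    return ret;
}
```

If the second step fails, closing the descriptor destroys the detached clone, so a failed attempt leaves nothing mounted.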
> > +.SH SEE ALSO
> > +.BR fsmount (2),
> > +.BR move_mount (2),
> > +.BR open (2)
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  9:08  5%         ` Christian Brauner
@ 2021-08-12 22:32 11%           ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-12 22:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

[...]

Thanks for checking the various wordings.

[...]

>>>>>>>           int fd_tree = open_tree(-EBADF, source,
>>>>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>>>>>
>>>>>> ???
>>>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>>>> is not meaningful to open_tree()?
>>>>>
>>>>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
>>>>
>>>> ????
>>>> But here, both -EBADF and -1 seem to be wrong. This argument 
>>>> is a dirfd, and so should either be a file descriptor or the
>>>> value AT_FDCWD, right?
>>>
>>> [1]: In this code "source" is expected to be absolute. If it's not
>>>      absolute we should fail. This can be achieved by passing -1/-EBADF,
>>>      afaict.
>>
>> D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
>> (And now I've done some adjustments to openat(2), which contains a
>> rationale for the *at() functions.)
>>
>> So, now I understand your purpose, but still the code is obscure,
>> since
>>
>> * You use a magic value (-EBADF) rather than (say) -1.
>> * There's no explanation (comment about) of the fact that you want
>>   to prevent relative pathnames.
>>
>> So, I've changed the code to use -1, not -EBADF, and I've added some
>> comments to explain that the intent is to prevent relative pathnames.
>> Okay?
> 
> Sounds good.
> 
>>
>> But, there is still the meta question: what's the problem with using
>> a relative pathname?
> 
> Nothing per se. Ok, you asked so it's your fault:
> When writing programs I like to never use relative paths with AT_FDCWD,
> because making assumptions about the current working directory
> of the calling process is just too easy to get wrong; especially when
> pivot_root() or chroot() are in play.
> My absolut preference (joke intended) is to open a well-known starting
> point with an absolute path to get a dirfd and then scope all future
> operations beneath that dirfd. This already works with old-style
> openat() and _very_ cautious programming but openat2() and its
> resolve-flag space have made this **chef's kiss**.
> If I can't operate based on a well-known dirfd I use absolute paths with
> a -EBADF dirfd passed to *at() functions.

Thanks for the clarification. I've noted your rationale in a 
comment in the manual page source so that future maintainers 
will not be puzzled!

Cheers,

Michael



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  5:36  9%       ` Michael Kerrisk (man-pages)
@ 2021-08-12  9:08  5%         ` Christian Brauner
  2021-08-12 22:32 11%           ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-12  9:08 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Thu, Aug 12, 2021 at 07:36:24AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> Thanks for the answers.
> 
> A couple of small queries still below.
> 
> On 8/11/21 12:07 PM, Christian Brauner wrote:
> > On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:
> 
> [...]
> 
> >>>>>       EINVAL The mount that is to be ID mapped is not a
> >>>>>              detached/anonymous mount; that is, the mount is
> >>>>
> >>>> ???
> >>>> What is the distinction between "detached" and "anonymous"?
> >>>> Or do you mean them to be synonymous? If so, then let's use
> >>>> just one term, and I think "detached" is preferable.
> >>>
> >>> Yes, they are synonymous here. I list both because detached can
> >>> potentially be confusing. A detached mount is a mount that has not been
> >>> visible in the filesystem. But if you attach it and then unmount it
> >>> right after and keep the fd for the mountpoint open it's a detached
> >>> mount purely on a natural language level, I'd argue. But it's not a
> >>> detached mount from the kernel's view anymore because it has been
> >>> exposed in the filesystem and is thus not detached anymore.
> >>> But I do prefer "detached" to "anonymous" and that confusion is very
> >>> unlikely to occur.
> >>
> >> Thanks. I made it "detached". Elsewhere, the page already explains
> >> that a detached mount is one that:
> >>
> >>           must have been created by calling open_tree(2) with the
> >>           OPEN_TREE_CLONE flag and it must not already have been
> >>           visible in the filesystem.
> >>
> >> Which seems a fine explanation. 
> >>
> >> ????
> >> But, just a thought... "visible in the filesystem" seems not quite accurate. 
> >> What you really mean I guess is that it must not already have been
> >> /visible in the filesystem hierarchy/previously mounted/something else/,
> >> right?
> 
> I suppose that I should have clarified that my main problem was
> that you were using the word "filesystem" in a way that I find
> unconventional/ambiguous. I mean, I normally take the term
> "filesystem" to be "a storage system for holding files".
> Here, you are using "filesystem" to mean something else, what
> I might call "the single directory hierarchy" or "the
> filesystem hierarchy" or "the list of mount points".
> 
> > A detached mount is created via the OPEN_TREE_CLONE flag. It is a
> > separate new mount so "previously mounted" is not applicable.
> > A detached mount is _related_ to what the MS_BIND flag gives you with
> > mount(2). However, they differ conceptually and technically. A MS_BIND
> > mount(2) is always visible in the filesystem when mount(2) returns, i.e.
> > it is discoverable by regular path-lookup starting within the
> > filesystem.
> > 
> > However, a detached mount can be seen as a split of MS_BIND into two
> > distinct steps:
> > 1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
> > 2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem
> > 
> > 1. and 2. together give you the equivalent of MS_BIND.
> > In between 1. and 2. however the mount is detached. For the kernel
> > "detached" means that an anonymous mount namespace is attached to it
> > which doesn't appear in proc and has a 0 sequence number (Technically,
> > there's a bit of semantical argument to be made that "attached" and
> > "detached" are ambiguous as they could also be taken to mean "does or
> > does not have a parent mount". This ambiguity e.g. appears in
> > do_move_mount(). That's why the kernel itself calls it an "anonymous
> > mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
> > have a parent mount so it works.).
> > 
> > For userspace it's better to think of detached and attached in terms of
> > visibility in the filesystem or in a mount namespace. That's more
> > straightforward, more relevant, and hits the target in 90% of the cases.
> > 
> > However, the better and clearer picture is to say that an
> > OPEN_TREE_CLONE-detached mount is a mount that has never been
> > move_mount()ed. Which in turn can be defined as the detached mount has
> > never been made visible in a mount namespace. Once that has happened the
> > mount is irreversibly an attached mount.
> > 
> > I keep thinking that maybe we should just say "anonymous mount"
> > everywhere. So changing the wording to:
> 
> I'm not against the word "detached". To user space, I think it is a
> little more meaningful than "anonymous". For the moment, I'll stay with
> "detached", but if you insist on "anonymous", I'll probably change it.

No, sounds good.

> 
> > [...]
> > EINVAL The mount that is to be ID mapped is not an anonymous mount;
> > that is, the mount has already been visible in a mount namespace.
> 
> I like that text *a lot* better! Thanks very much for suggesting
> wordings. It makes my life much easier. 
> 
> I've made the text:
> 
>        EINVAL The mount that is to be ID mapped is not a detached
>               mount; that is, the mount has already been
>               visible in a mount namespace.

Sounds good.

> 
> > [...]
> > The mount must be an anonymous mount; that is, it must have been
> > created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it
> > must not already have been visible in a mount namespace, i.e. it must
> > not have been attached to the filesystem hierarchy with syscalls such
> > as move_mount().
> 
> And that too! I've made the text:
> 
>        •  The mount must be a detached mount; that is, it must have
>           been created by calling open_tree(2) with the
>           OPEN_TREE_CLONE flag and it must not already have been
>           visible in a mount namespace.  (To put things another way:
>           the mount must not have been attached to the filesystem
>           hierarchy with a system call such as move_mount(2).)

Sounds good.

> 
> > [...]
> > 
> > (I'm using the formulation "with syscalls such as move_mount()" to
> > future proof this. :)).
> 
> Fair enough.
> 
> >>>>>   EXAMPLES
> >>>>
> >>>> ???
> >>>> Do you have a (preferably simple) example piece of code
> >>>> somewhere for setting up an ID mapped mount?
> >>
> >> ????
> >> I guess the best example is this:
> >> https://github.com/brauner/mount-idmapped/
> >> right?
> > 
> > Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
> > another medium but I repeat it here.
> > There are two places. The link you have here is a private repo. But I've
> > also merged a program alongside the fstests testsuite I merged:
> > https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
> > which should be nicer and has seen reviews by Amir and Christoph.
> 
> Thanks.
> 
> [...]
> 
> >>>>>           int fd_tree = open_tree(-EBADF, source,
> >>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> >>>>
> >>>> ???
> >>>> What is the significance of -EBADF here? As far as I can tell, it
> >>>> is not meaningful to open_tree()?
> >>>
> >>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
> >>
> >> ????
> >> But here, both -EBADF and -1 seem to be wrong. This argument 
> >> is a dirfd, and so should either be a file descriptor or the
> >> value AT_FDCWD, right?
> > 
> > [1]: In this code "source" is expected to be absolute. If it's not
> >      absolute we should fail. This can be achieved by passing -1/-EBADF,
> >      afaict.
> 
> D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
> (And now I've done some adjustments to openat(2), which contains a
> rationale for the *at() functions.)
> 
> So, now I understand your purpose, but still the code is obscure,
> since
> 
> * You use a magic value (-EBADF) rather than (say) -1.
> * There's no explanation (comment about) of the fact that you want
>   to prevent relative pathnames.
> 
> So, I've changed the code to use -1, not -EBADF, and I've added some
> comments to explain that the intent is to prevent relative pathnames.
> Okay?

Sounds good.

> 
> But, there is still the meta question: what's the problem with using
> a relative pathname?

Nothing per se. Ok, you asked so it's your fault:
When writing programs I like to never use relative paths with AT_FDCWD
because making assumptions about the current working directory
of the calling process is just too easy to get wrong; especially when
pivot_root() or chroot() are in play.
My absolut preference (joke intended) is to open a well-known starting
point with an absolute path to get a dirfd and then scope all future
operations beneath that dirfd. This already works with old-style
openat() and _very_ cautious programming but openat2() and its
resolve-flag space have made this **chef's kiss**.
If I can't operate based on a well-known dirfd I use absolute paths with
a -EBADF dirfd passed to *at() functions.
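
A minimal sketch of that last approach, using plain openat(2) rather than
open_tree(2) so that it needs no privileges. The helper names
open_absolute_only() and open_anchor_dir() are hypothetical and exist only
for this illustration. The sketch relies on the documented *at() behaviour
that an absolute pathname ignores dirfd, while a relative pathname combined
with an invalid dirfd fails with EBADF.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open `path` with an intentionally invalid dirfd (-EBADF).
 * Absolute pathnames ignore dirfd and resolve normally; relative
 * pathnames cannot be resolved against an invalid dirfd and fail
 * with EBADF. Returns the new fd, or -1 with errno set.
 */
int open_absolute_only(const char *path, int flags)
{
    return openat(-EBADF, path, flags | O_CLOEXEC);
}

/*
 * Open a well-known starting directory once. Later lookups can be
 * scoped beneath the returned dirfd with openat(dirfd, "rel/path", ...).
 */
int open_anchor_dir(const char *abs_dir)
{
    /* The whole point is to start from an absolute pathname. */
    assert(abs_dir != NULL && abs_dir[0] == '/');
    return open_absolute_only(abs_dir, O_RDONLY | O_DIRECTORY);
}
```

A caller would typically obtain one anchor fd at startup and pass it to
every later openat() call, so that the current working directory never
enters the picture.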

Christian

^ permalink raw reply	[relevance 5%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-12  5:36  7%     ` Michael Kerrisk (man-pages)
@ 2021-08-12  8:38  4%       ` Christian Brauner
  2021-08-13  1:25 10%         ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-12  8:38 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

On Thu, Aug 12, 2021 at 07:36:54AM +0200, Michael Kerrisk (man-pages) wrote:
> [CC += Eric, in case he has a comment on the last piece]
> 
> Hi Christian,
> 
> (A few questions below.)
> 
> On 8/11/21 12:40 PM, Christian Brauner wrote:
> > On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hi Christian,
> >>
> >> Some further questions...
> >>
> >> In ERRORS there is:
> >>
> >>        EINVAL The underlying filesystem is mounted in a user namespace.
> >>
> >> I don't understand this. What does it mean?
> > 
> > The underlying filesystem has been mounted in a mount namespace that is
> > owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).
> 
> Thanks!
> 
> >> Also, there is this:
> >>
> >>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
> >>               group ID needs to be allocated for  all  mounts  without  a
> >>               peer  group  ID  set.  Allocation of this peer group ID has
> >>               failed.
> >>
> >>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
> >>               group  ID  needs  to  be allocated for all mounts without a
> >>               peer group ID set.  Allocation of this peer  group  ID  can
> >>               fail.  Note that technically further error codes are possi‐
> >>               ble that are specific to the ID  allocation  implementation
> >>               used.
> >>
> >> What is the difference between these two error cases? (That is, in what 
> >> circumstances will one get ENOMEM vs ENOSPC and vice versa?)
> > 
> > I did really wonder whether to even include those errors and I regret
> > having included them because they aren't worth a detailed discussion as
> > I'd consider them kernel internal relevant errors rather than userspace
> > relevant errors. In essence, peer group ids are allocated using the id
> > infrastructure of the kernel. It can fail for two main reasons:
> > 
> > 1. ENOMEM there's not enough memory to allocate the relevant internal
> >    structures needed for the bitmap.
> > 2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
> >    allocate so many peer groups and managed to keep the kernel running
> >    (???) that the ida has run out of ids.
> > 
> > Feel free to just drop those errors.
> 
> Because they can at least theoretically be visible to user space, I
> prefer to keep them. But I've reworked a bit:
> 
>        ENOMEM When changing mount propagation to MS_SHARED, a new
>               peer group ID needs to be allocated for all mounts
>               without a peer group ID set.  This allocation failed
>               because there was not enough memory to allocate the
>               relevant internal structures.
> 
>        ENOSPC When changing mount propagation to MS_SHARED, a new
>               peer group ID needs to be allocated for all mounts
>               without a peer group ID set.  This allocation failed
>               because the kernel has run out of IDs.
> 
> >> And then:
> >>
> >>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
> >>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
> >>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
> >>               locked.  Mount attributes become locked on a mount if:
> >>
> >>               •  A new mount or mount tree is created causing mount prop‐
> >>                  agation  across  user  namespaces.  The kernel will lock
> >>
> >> Propagation is done across mount points, not user namespaces.
> >> Should "across user namespaces" be "to a mount namespace owned
> >> by a different user namespace"? Or something else?
> > 
> > That's really splitting hairs.
> 
> To be clear, I'm not trying to split hairs :-). It's just that
> I'm struggling a little to understand. (In particular, the notion
> of locked mounts is one where my understanding is weak.) 
> 
> And think of it like this: I am the first line of defense for the
> user-space reader. If I am having trouble understanding the text,
> I won't be alone. And often, the problem is not so much that the
> text is "wrong", it's that there's a difference in background
> knowledge between what you know and what the reader (in this case
> me) knows. Part of my task is to fill that gap, by adding info
> that I think is necessary to the page (with the happy side
> effect that I learn along the way.)

All very good points.
I didn't mean to complain btw. Sorry that it seemed that way. :)

> 
> > Of course this means that we're
> > propagating into a mount namespace that is owned by a different user
> > namespace though "crossing user namespaces" might have been the better
> > choice.
> 
> This is a perfect example of the point I make above. You say "of course",
> but I don't have the background knowledge that you do :-). From my
> perspective, I want to make sure that I understand your meaning, so
> that that meaning can (IMHO) be made easier for the average reader
> of the manual page.
> 
> >>                  the aforementioned  flags  to  protect  these  sensitive
> >>                  properties from being altered.
> >>
> >>               •  A  new  mount  and user namespace pair is created.  This
> >>                  happens for  example  when  specifying  CLONE_NEWUSER  |
> >>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
> >>                  aforementioned flags become locked to protect user name‐
> >>                  spaces from altering sensitive mount properties.
> >>
> >> Again, this seems imprecise. Should it say something like:
> >> "... to prevent changes to sensitive mount properties in the new 
> >> mount namespace" ? Or perhaps you have a better wording.
> > 
> > That's not imprecise. 
> 
> Okay -- poor choice of wording on my part:
> 
> s/this seems imprecise/I'm having trouble understanding this/
> 
> > What you want to protect against is altering
> > sensitive mount properties from within a user namespace irrespective of
> > whether or not the user namespace actually owns the mount namespace,
> > i.e. even if you own the mount namespace you shouldn't be able to alter
> > those properties. I concede though that "protect" should've been
> > "prevent".
> 
> Can I check my education here please. The point is this:
> 
> * The mount point was created in a mount NS that was owned by
>   a more privileged user NS (e.g., the initial user NS).
> * A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
>   mount NS.
> * In the new mount NS, the mounts become locked.
> 
> And, help me here: is it correct that the reason the properties
> need to be locked is because they are shared between the mounts?

Yes, basically.
The new mount namespace contains a copy of all the mounts in the
previous mount namespace. So they are separate mounts which you can best
see when you do unshare --mount --propagation=private. An unmount in the
new mount namespace won't affect the mount in the previous mount
namespace. Which can only nicely work if they are separate mounts.
Propagation relies (among other things) on the fact that mount
namespaces have copies of the mounts.

The copied mounts in the new mount namespace will have inherited all
properties they had at the time when copy_namespaces() and specifically
copy_mnt_ns() was called. Which calls into copy_tree() and ultimately
into the appropriately named clone_mnt(). This is the low-level routine
that is responsible for cloning the mounts including their mount
properties.

Some mount properties such as read-only, nodev, noexec, nosuid, atime -
while arguably not per se security mechanisms - are used for protection
or as security measures in userspace applications. The most obvious one
might be the read-only property. One wouldn't want to expose a set of
files as read-only only for someone else to trivially gain write access
to them. An example of where that could happen is when creating a new
mount namespace and user namespace pair where the new mount namespace
is owned by the new user namespace in which the caller is privileged and
thus the caller would also be able to alter the new mount namespace. So
without locking flags all it would take to turn a read-only into a
read-write mount is:
unshare -U --map-root --mount --propagation=private -- mount -o remount,rw /some/mnt
Locking such flags prevents that from happening.
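
The remount attempt above can also be sketched in C. This is an
illustration under stated assumptions, not a reference implementation:
the helper name try_remount_rw_in_new_ns() is hypothetical, the sketch
assumes that unprivileged user namespaces are enabled, and it assumes
that `target` is the root of a mount that was read-only in the parent
mount namespace. Under those assumptions the remount would be expected
to fail with EPERM because the read-only flag is locked.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * In a child process, create a new user + mount namespace pair and try
 * to drop the read-only flag from the mount at `target` with a bind
 * remount. Returns 0 if the remount succeeded, the positive errno value
 * reported by the child if unshare(2) or mount(2) failed, or -1 if the
 * child could not be created or did not exit normally.
 */
int try_remount_rw_in_new_ns(const char *target)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Same namespace pair as "unshare -U --mount". */
        if (unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0)
            _exit(errno);
        /* MS_REMOUNT | MS_BIND without MS_RDONLY asks for read-write. */
        if (mount(NULL, target, NULL, MS_REMOUNT | MS_BIND, NULL) != 0)
            _exit(errno);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Running the attempt in a child keeps the caller's own namespaces
untouched; any change that does succeed is confined to the child's
short-lived mount namespace.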

> 
> > You could probably say:
> > 
> > 	A  new  mount  and user namespace pair is created.  This
> > 	happens for  example  when  specifying  CLONE_NEWUSER  |
> > 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
> > 	The aforementioned flags become locked in the new mount
> > 	namespace to prevent sensitive mount properties from being
> > 	altered.
> > 	Since the newly created mount namespace will be owned by the
> > 	newly created user namespace a caller privileged in the newly
> > 	created user namespace would be able to alter sensitive
> > 	mount properties. For example, without locking the read-only
> > 	property for the mounts in the new mount namespace such a caller
> > 	would be able to remount them read-write.
> 
> So, I've now made the text:
> 
>        EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>               locked.  Mount attributes become locked on a mount if:
> 
>               •  A new mount or mount tree is created causing mount
>                  propagation across user namespaces (i.e., propagation to
>                  a mount namespace owned by a different user namespace).
>                  The kernel will lock the aforementioned flags to prevent
>                  these sensitive properties from being altered.
> 
>               •  A new mount and user namespace pair is created.  This
>                  happens for example when specifying CLONE_NEWUSER |
>                  CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
>                  aforementioned flags become locked in the new mount
>                  namespace to prevent sensitive mount properties from
>                  being altered.  Since the newly created mount namespace
>                  will be owned by the newly created user namespace, a
>                  calling process that is privileged in the new user
>                  namespace would—in the absence of such locking—be able
>                  to alter sensitive mount properties (e.g., to remount a
>                  mount that was marked read-only as read-write in the new
>                  mount namespace).
> 
> Okay?

Sounds good.

> 
> > (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
> >  A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
> >  kick in. For all mounts not marked as expired MNT_LOCKED will be set
> >  which means that a umount() on any such mount copied from the previous
> >  mount namespace will yield EINVAL implying from userspace's perspective
> >  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
> >  - whereas a remount to alter a locked flag will yield EPERM.)
> 
> Thanks for educating me! So, is that what we are seeing below?
> 
> $ sudo umount /mnt/m1
> $ sudo mount -t tmpfs none /mnt/m1
> $ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
> umount: /mnt/m1: not mounted.
> $ grep ^umount /tmp/log
> umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)
> 
> The mount_namespaces(7) page has for a long time had this text:
> 
>        *  Mounts that come as a single unit from a more privileged mount
>           namespace are locked together and may not be separated in a
>           less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
>           operation brings across all of the mounts from the original
>           mount namespace as a single unit, and recursive mounts that
>           propagate between mount namespaces propagate as a single unit.)
> 
> I have had trouble understanding that. But maybe you just helped.
> Is that text relevant to what you just wrote above? In particular,
> I have trouble understanding what "separated" means. But, perhaps

The text gives the "how" not the "why".
Consider a more elaborate mount tree where e.g., you have bind-mounted a
mount over a subdirectory of another mount:

sudo mount -t tmpfs tmpfs /mnt
sudo mkdir /mnt/my-dir/
sudo touch /mnt/my-dir/my-file
sudo mount --bind /opt /mnt/my-dir

The files underneath /mnt/my-dir are now hidden. Consider what would
happen if one were allowed to address those mounts separately. A user
could then do:

unshare -U --map-root --mount
umount /mnt/my-dir
cat /mnt/my-dir/my-file

giving them access to what's in my-dir.

Treating such mount trees as a unit in less privileged mount namespaces
(cf. [1]) prevents that, i.e., prevents revealing files and directories
that were overmounted.

Treating such mounts as a unit is also relevant when e.g. bind-mounting
a mount tree containing locked mounts. Sticking with the example above:

unshare -U --map-root --mount

# non-recursive bind-mount will fail
mount --bind /mnt /tmp

# recursive bind-mount will succeed
mount --rbind /mnt /tmp

The reason is again that the mount tree at /mnt is treated as a mount
unit because it is locked. If one were allowed to non-recursively
bind-mount /mnt somewhere it would mean revealing what's underneath
the mount at my-dir (This is in some sense the inverse of preventing a
filesystem from being mounted that isn't fully visible, i.e. contains
hidden or over-mounted mounts.).

These semantics, in addition to being security relevant, also allow a
more privileged mount namespace to create a restricted view of the
filesystem hierarchy that can't be circumvented in a less privileged
mount namespace (Otherwise pivot_root would have to be used which can
also be used to guarantee a restricted view on the filesystem hierarchy
especially when combined with a separate rootfs.).

Christian

[1]: I'll avoid jumping through the hoops of speaking about ownership
     all the time now for the sake of brevity. Otherwise I'll still sit
     here at lunchtime.

^ permalink raw reply	[relevance 4%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-11 10:40  4%   ` Christian Brauner
@ 2021-08-12  5:36  7%     ` Michael Kerrisk (man-pages)
  2021-08-12  8:38  4%       ` Christian Brauner
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-12  5:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig, Eric W. Biederman

[CC += Eric, in case he has a comment on the last piece]

Hi Christian,

(A few questions below.)

On 8/11/21 12:40 PM, Christian Brauner wrote:
> On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> Some further questions...
>>
>> In ERRORS there is:
>>
>>        EINVAL The underlying filesystem is mounted in a user namespace.
>>
>> I don't understand this. What does it mean?
> 
> The underlying filesystem has been mounted in a mount namespace that is
> owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).

Thanks!

>> Also, there is this:
>>
>>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
>>               group ID needs to be allocated for  all  mounts  without  a
>>               peer  group  ID  set.  Allocation of this peer group ID has
>>               failed.
>>
>>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
>>               group  ID  needs  to  be allocated for all mounts without a
>>               peer group ID set.  Allocation of this peer  group  ID  can
>>               fail.  Note that technically further error codes are possi‐
>>               ble that are specific to the ID  allocation  implementation
>>               used.
>>
>> What is the difference between these two error cases? (That is, in what 
>> circumstances will one get ENOMEM vs ENOSPC and vice versa?)
> 
> I did really wonder whether to even include those errors and I regret
> having included them because they aren't worth a detailed discussion as
> I'd consider them kernel internal relevant errors rather than userspace
> relevant errors. In essence, peer group ids are allocated using the id
> infrastructure of the kernel. It can fail for two main reasons:
> 
> 1. ENOMEM there's not enough memory to allocate the relevant internal
>    structures needed for the bitmap.
> 2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
>    allocate so many peer groups and managed to keep the kernel running
>    (???) that the ida has run out of ids.
> 
> Feel free to just drop those errors.

Because they can at least theoretically be visible to user space, I
prefer to keep them. But I've reworked a bit:

       ENOMEM When changing mount propagation to MS_SHARED, a new
              peer group ID needs to be allocated for all mounts
              without a peer group ID set.  This allocation failed
              because there was not enough memory to allocate the
              relevant internal structures.

       ENOSPC When changing mount propagation to MS_SHARED, a new
              peer group ID needs to be allocated for all mounts
              without a peer group ID set.  This allocation failed
              because the kernel has run out of IDs.
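
From user space the two cases differ only in the errno value. A minimal
sketch of how a caller might tell them apart after a failed attempt to
change propagation to MS_SHARED; the helper name
explain_peer_group_error() is hypothetical, and the messages simply
restate the two descriptions above.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: explain an errno value seen after a failed
 * attempt to change mount propagation to MS_SHARED. Returns NULL for
 * errno values that this sketch does not cover.
 */
const char *explain_peer_group_error(int err)
{
    switch (err) {
    case ENOMEM:
        return "not enough memory to allocate the internal ID structures";
    case ENOSPC:
        return "the kernel has run out of peer group IDs";
    default:
        return NULL;
    }
}
```

Other errno values are left to the caller, since the same system call
can fail for reasons unrelated to peer group ID allocation.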

>> And then:
>>
>>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
>>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>>               locked.  Mount attributes become locked on a mount if:
>>
>>               •  A new mount or mount tree is created causing mount prop‐
>>                  agation  across  user  namespaces.  The kernel will lock
>>
>> Propagation is done across mount points, not user namespaces.
>> Should "across user namespaces" be "to a mount namespace owned
>> by a different user namespace"? Or something else?
> 
> That's really splitting hairs.

To be clear, I'm not trying to split hairs :-). It's just that
I'm struggling a little to understand. (In particular, the notion
of locked mounts is one where my understanding is weak.) 

And think of it like this: I am the first line of defense for the
user-space reader. If I am having trouble understanding the text,
I won't be alone. And often, the problem is not so much that the
text is "wrong", it's that there's a difference in background
knowledge between what you know and what the reader (in this case
me) knows. Part of my task is to fill that gap, by adding info
that I think is necessary to the page (with the happy side
effect that I learn along the way.)

> Of course this means that we're
> propagating into a mount namespace that is owned by a different user
> namespace though "crossing user namespaces" might have been the better
> choice.

This is a perfect example of the point I make above. You say "of course",
but I don't have the background knowledge that you do :-). From my
perspective, I want to make sure that I understand your meaning, so
that that meaning can (IMHO) be made easier for the average reader
of the manual page.

>>                  the aforementioned  flags  to  protect  these  sensitive
>>                  properties from being altered.
>>
>>               •  A  new  mount  and user namespace pair is created.  This
>>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>>                  aforementioned flags become locked to protect user name‐
>>                  spaces from altering sensitive mount properties.
>>
>> Again, this seems imprecise. Should it say something like:
>> "... to prevent changes to sensitive mount properties in the new 
>> mount namespace" ? Or perhaps you have a better wording.
> 
> That's not imprecise. 

Okay -- poor choice of wording on my part:

s/this seems imprecise/I'm having trouble understanding this/

> What you want to protect against is altering
> sensitive mount properties from within a user namespace irrespective of
> whether or not the user namespace actually owns the mount namespace,
> i.e. even if you own the mount namespace you shouldn't be able to alter
> those properties. I concede though that "protect" should've been
> "prevent".

Can I check my education here please. The point is this:

* The mount point was created in a mount NS that was owned by
  a more privileged user NS (e.g., the initial user NS).
* A CLONE_NEWUSER|CLONE_NEWNS step occurs to create a new (user and) 
  mount NS.
* In the new mount NS, the mounts become locked.

And, help me here: is it correct that the reason the properties
need to be locked is because they are shared between the mounts?

> You could probably say:
> 
> 	A  new  mount  and user namespace pair is created.  This
> 	happens for  example  when  specifying  CLONE_NEWUSER  |
> 	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
> 	The aforementioned flags become locked in the new mount
> 	namespace to prevent sensitive mount properties from being
> 	altered.
> 	Since the newly created mount namespace will be owned by the
> 	newly created user namespace a caller privileged in the newly
> 	created user namespace would be able to alter sensitive
> 	mount properties. For example, without locking the read-only
> 	property for the mounts in the new mount namespace such a caller
> 	would be able to remount them read-write.

So, I've now made the text:

       EPERM  One of the mounts had at least one of MOUNT_ATTR_NOATIME,
              MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
              locked.  Mount attributes become locked on a mount if:

              •  A new mount or mount tree is created causing mount
                 propagation across user namespaces (i.e., propagation to
                 a mount namespace owned by a different user namespace).
                 The kernel will lock the aforementioned flags to prevent
                 these sensitive properties from being altered.

              •  A new mount and user namespace pair is created.  This
                 happens for example when specifying CLONE_NEWUSER |
                 CLONE_NEWNS in unshare(2), clone(2), or clone3(2).  The
                 aforementioned flags become locked in the new mount
                 namespace to prevent sensitive mount properties from
                 being altered.  Since the newly created mount namespace
                 will be owned by the newly created user namespace, a
                 calling process that is privileged in the new user
                 namespace would—in the absence of such locking—be able
                 to alter sensitive mount properties (e.g., to remount a
                 mount that was marked read-only as read-write in the new
                 mount namespace).

Okay?

> (Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
>  A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
>  kick in. For all mounts not marked as expired MNT_LOCKED will be set
>  which means that a umount() on any such mount copied from the previous
>  mount namespace will yield EINVAL implying from userspace's perspective
>  it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
>  - whereas a remount to alter a locked flag will yield EPERM.)

Thanks for educating me! So, is that what we are seeing below?

$ sudo umount /mnt/m1
$ sudo mount -t tmpfs none /mnt/m1
$ sudo unshare -pf -Ur -m --mount-proc strace -o /tmp/log umount /mnt/m1
umount: /mnt/m1: not mounted.
$ grep ^umount /tmp/log
umount2("/mnt/m1", 0)                   = -1 EINVAL (Invalid argument)

The mount_namespaces(7) page has for a long time had this text:

       *  Mounts that come as a single unit from a more privileged mount
          namespace are locked together and may not be separated in a
          less privileged mount namespace.  (The unshare(2) CLONE_NEWNS
          operation brings across all of the mounts from the original
          mount namespace as a single unit, and recursive mounts that
          propagate between mount namespaces propagate as a single unit.)

I have had trouble understanding that. But maybe you just helped.
Is that text relevant to what you just wrote above? In particular,
I have trouble understanding what "separated" means. But, perhaps
it means "separately unmounted"? (I added Eric in CC,
in case he has something to say.)

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 7%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-11 10:07  4%     ` Christian Brauner
@ 2021-08-12  5:36  9%       ` Michael Kerrisk (man-pages)
  2021-08-12  9:08  5%         ` Christian Brauner
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-12  5:36 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hello Christian,

Thanks for the answers.

A couple of small queries still below.

On 8/11/21 12:07 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:

[...]

>>>>>       EINVAL The mount that is to be ID mapped is not a
>>>>>              detached/anonymous mount; that is, the mount is
>>>>
>>>> ???
>>>> What is the distinction between "detached" and "anonymous"?
>>>> Or do you mean them to be synonymous? If so, then let's use
>>>> just one term, and I think "detached" is preferable.
>>>
>>> Yes, they are synonymous here. I list both because detached can
>>> potentially be confusing. A detached mount is a mount that has not been
>>> visible in the filesystem. But if you attach it and then unmount it
>>> right after and keep the fd for the mountpoint open it's a detached
>>> mount purely on a natural language level, I'd argue. But it's not a
>>> detached mount from the kernel's view anymore because it has been
>>> exposed in the filesystem and is thus not detached anymore.
>>> But I do prefer "detached" to "anonymous" and that confusion is very
>>> unlikely to occur.
>>
>> Thanks. I made it "detached". Elsewhere, the page already explains
>> that a detached mount is one that:
>>
>>           must have been created by calling open_tree(2) with the
>>           OPEN_TREE_CLONE flag and it must not already have been
>>           visible in the filesystem.
>>
>> Which seems a fine explanation. 
>>
>> ????
>> But, just a thought... "visible in the filesystem" seems not quite accurate. 
>> What you really mean I guess is that it must not already have been
>> /visible in the filesystem hierarchy/previously mounted/something else/,
>> right?

I suppose that I should have clarified that my main problem was
that you were using the word "filesystem" in a way that I find
unconventional/ambiguous. I mean, I normally take the term
"filesystem" to be "a storage system for holding files".
Here, you are using "filesystem" to mean something else, what
I might call "the single directory hierarchy" or "the
filesystem hierarchy" or "the list of mount points".

> A detached mount is created via the OPEN_TREE_CLONE flag. It is a
> separate new mount so "previously mounted" is not applicable.
> A detached mount is _related_ to what the MS_BIND flag gives you with
> mount(2). However, they differ conceptually and technically. A MS_BIND
> mount(2) is always visible in the filesystem when mount(2) returns, i.e.
> it is discoverable by regular path-lookup starting within the
> filesystem.
> 
> However, a detached mount can be seen as a split of MS_BIND into two
> distinct steps:
> 1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
> 2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem
> 
> 1. and 2. together give you the equivalent of MS_BIND.
> In between 1. and 2. however the mount is detached. For the kernel
> "detached" means that an anonymous mount namespace is attached to it
> which doesn't appear in proc and has a 0 sequence number (Technically,
> there's a bit of semantical argument to be made that "attached" and
> "detached" are ambiguous as they could also be taken to mean "does or
> does not have a parent mount". This ambiguity e.g. appears in
> do_move_mount(). That's why the kernel itself calls it an "anonymous
> mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
> have a parent mount so it works.).
> 
> For userspace it's better to think of detached and attached in terms of
> visibility in the filesystem or in a mount namespace. That's more
> straightforward, more relevant, and hits the target in 90% of the cases.
> 
> However, the better and clearer picture is to say that a
> OPEN_TREE_CLONE-detached mount is a mount that has never been
> move_mount()ed. Which in turn can be defined as the detached mount has
> never been made visible in a mount namespace. Once that has happened the
> mount is irreversibly an attached mount.
> 
> I keep thinking that maybe we should just say "anonymous mount"
> everywhere. So changing the wording to:

I'm not against the word "detached". To user space, I think it is a
little more meaningful than "anonymous". For the moment, I'll stay with
"detached", but if you insist on "anonymous", I'll probably change it.

> [...]
> EINVAL The mount that is to be ID mapped is not an anonymous mount;
> that is, the mount has already been visible in a mount namespace.

I like that text *a lot* better! Thanks very much for suggesting
wordings. It makes my life much easier. 

I've made the text:

       EINVAL The mount that is to be ID mapped is not a detached
              mount; that is, the mount has already been visible
              in a mount namespace.

> [...]
> The mount must be an anonymous mount; that is, it must have been
> created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it
> must not already have been visible in a mount namespace, i.e. it must
> not have been attached to the filesystem hierarchy with syscalls such
> as move_mount().

And that too! I've made the text:

       •  The mount must be a detached mount; that is, it must have
          been created by calling open_tree(2) with the
          OPEN_TREE_CLONE flag and it must not already have been
          visible in a mount namespace.  (To put things another way:
          the mount must not have been attached to the filesystem
          hierarchy with a system call such as move_mount(2).)

> [...]
> 
> (I'm using the formulation "with syscalls such as move_mount()" to
> future proof this. :)).

Fair enough.

>>>>>   EXAMPLES
>>>>
>>>> ???
>>>> Do you have a (preferably simple) example piece of code
>>>> somewhere for setting up an ID mapped mount?
>>
>> ????
>> I guess the best example is this:
>> https://github.com/brauner/mount-idmapped/
>> right?
> 
> Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
> another medium but I repeat it here.
> There are two places. The link you have here is a private repo. But I've
> also merged a program alongside the fstests testsuite I merged:
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
> which should be nicer and has seen reviews by Amir and Christoph.

Thanks.

[...]

>>>>>           int fd_tree = open_tree(-EBADF, source,
>>>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>>>
>>>> ???
>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>> is not meaningful to open_tree()?
>>>
>>> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
>>
>> ????
>> But here, both -EBADF and -1 seem to be wrong. This argument 
>> is a dirfd, and so should either be a file descriptor or the
>> value AT_FDCWD, right?
> 
> [1]: In this code "source" is expected to be absolute. If it's not
>      absolute we should fail. This can be achieved by passing -1/-EBADF,
>      afaict.

D'oh! Okay. I hadn't considered that use case for an invalid dirfd.
(And now I've made some adjustments to openat(2), which contains a
rationale for the *at() functions.)

So, now I understand your purpose, but still the code is obscure,
since

* You use a magic value (-EBADF) rather than (say) -1.
* There's no explanation (comment about) of the fact that you want
  to prevent relative pathnames.

So, I've changed the code to use -1, not -EBADF, and I've added some
comments to explain that the intent is to prevent relative pathnames.
Okay?

But, there is still the meta question: what's the problem with using
a relative pathname?
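
To make the effect of an invalid dirfd concrete, here is a minimal
sketch using openat(2) rather than open_tree(2), since openat(2) needs
no privileges. The helper name open_dir_absolute_only() is made up for
this illustration; the point is only that a relative pathname combined
with a dirfd of -1 fails with EBADF, while an absolute pathname ignores
dirfd altogether.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a directory, accepting absolute pathnames only.
 *
 * Passing -1 as dirfd gives the kernel nothing to resolve a relative
 * pathname against, so such a call fails with EBADF.  An absolute
 * pathname never consults dirfd, so it is resolved as usual.
 *
 * Returns a file descriptor on success, or -1 with errno set.
 */
static int open_dir_absolute_only(const char *path)
{
    return openat(-1, path, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}
```

With this helper, open_dir_absolute_only("/") should return a valid
descriptor, whereas open_dir_absolute_only(".") should return -1 with
errno set to EBADF.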

[...]

>>>>>           ret = move_mount(fd_tree, "", -EBADF, target,
>>>>>                            MOVE_MOUNT_F_EMPTY_PATH);
>>>>
>>>> ???
>>>> What is the significance of -EBADF here? As far as I can tell, it
>>>> is not meaningful to move_mount()?
>>>
>>> See [2].
>>
>> ????
>> As above, both -EBADF and -1 seem to be wrong. This argument 
>> is a dirfd, and so should either be a file descriptor or the
>> value AT_FDCWD, right?
> 
> See [1].

I made the same change as above.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 9%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 22:47  5% ` Michael Kerrisk (man-pages)
@ 2021-08-11 10:40  4%   ` Christian Brauner
  2021-08-12  5:36  7%     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-11 10:40 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Wed, Aug 11, 2021 at 12:47:14AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> Some further questions...
> 
> In ERRORS there is:
> 
>        EINVAL The underlying filesystem is mounted in a user namespace.
> 
> I don't understand this. What does it mean?

The underlying filesystem has been mounted in a mount namespace that is
owned by a non-initial user namespace (Think of sysfs, overlayfs etc.).

> 
> Also, there is this:
> 
>        ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
>               group ID needs to be allocated for  all  mounts  without  a
>               peer  group  ID  set.  Allocation of this peer group ID has
>               failed.
> 
>        ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
>               group  ID  needs  to  be allocated for all mounts without a
>               peer group ID set.  Allocation of this peer  group  ID  can
>               fail.  Note that technically further error codes are possi‐
>               ble that are specific to the ID  allocation  implementation
>               used.
> 
> What is the difference between these two error cases? (That is, in what 
> circumstances will one get ENOMEM vs ENOSPC and vice versa?)

I did really wonder whether to even include those errors and I regret
having included them because they aren't worth a detailed discussion as
I'd consider them kernel internal relevant errors rather than userspace
relevant errors. In essence, peer group ids are allocated using the id
infrastructure of the kernel. It can fail for two main reasons:

1. ENOMEM there's not enough memory to allocate the relevant internal
   structures needed for the bitmap.
2. ENOSPC we ran out of ids, i.e. someone has somehow managed to
   allocate so many peer groups and managed to keep the kernel running
   (???) that the ida has run out of ids.

Feel free to just drop those errors.

> 
> And then:
> 
>        EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
>               MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>               MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
>               locked.  Mount attributes become locked on a mount if:
> 
>               •  A new mount or mount tree is created causing mount prop‐
>                  agation  across  user  namespaces.  The kernel will lock
> 
> Propagation is done across mount points, not user namespaces.
> Should "across user namespaces" be "to a mount namespace owned 
> by a different user namespace"? Or something else?

That's really splitting hairs. Of course this means that we're
propagating into a mount namespace that is owned by a different user
namespace though "crossing user namespaces" might have been the better
choice.

> 
>                  the aforementioned  flags  to  protect  these  sensitive
>                  properties from being altered.
> 
>               •  A  new  mount  and user namespace pair is created.  This
>                  happens for  example  when  specifying  CLONE_NEWUSER  |
>                  CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
>                  aforementioned flags become locked to protect user name‐
>                  spaces from altering sensitive mount properties.
> 
> Again, this seems imprecise. Should it say something like:
> "... to prevent changes to sensitive mount properties in the new 
> mount namespace" ? Or perhaps you have a better wording.

That's not imprecise. What you want to protect against is altering
sensitive mount properties from within a user namespace irrespective of
whether or not the user namespace actually owns the mount namespace,
i.e. even if you own the mount namespace you shouldn't be able to alter
those properties. I concede though that "protect" should've been
"prevent".

You could probably say:

	A  new  mount  and user namespace pair is created.  This
	happens for  example  when  specifying  CLONE_NEWUSER  |
	CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).
	The aforementioned flags become locked in the new mount
	namespace to prevent sensitive mount properties from being
	altered.
	Since the newly created mount namespace will be owned by the
	newly created user namespace a caller privileged in the newly
	created user namespace would be able to alter sensitive
	mount properties. For example, without locking the read-only
	property for the mounts in the new mount namespace such a caller
	would be able to remount them read-write.

(Fwiw, in this scenario there's a bit of (moderately sane) strangeness.
 A CLONE_NEWUSER | CLONE_NEWNS will cause even stronger protection to
 kick in. For all mounts not marked as expired MNT_LOCKED will be set
 which means that a umount() on any such mount copied from the previous
 mount namespace will yield EINVAL implying from userspace's perspective
 it's not mounted - granted EINVAL is the ioctl() of multiplexing errnos
 - whereas a remount to alter a locked flag will yield EPERM.)

Christian

^ permalink raw reply	[relevance 4%]

* Re: Documenting the requirement of CAP_SETFCAP to map UID 0
  2021-08-10 23:58  5% ` Serge E. Hallyn
@ 2021-08-11 10:10 11%   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-11 10:10 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: mtk.manpages, linux-security-module, lkml, Alejandro Colomar,
	Kir Kolyshkin, linux-man

Hi Serge

On 8/11/21 1:58 AM, Serge E. Hallyn wrote:
> On Sun, Aug 08, 2021 at 11:09:30AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hello Serge,
>>

>> Your commit:
>>
>> [[
>> commit db2e718a47984b9d71ed890eb2ea36ecf150de18
>> Author: Serge E. Hallyn <serge@hallyn.com>
>> Date:   Tue Apr 20 08:43:34 2021 -0500
>>
>>     capabilities: require CAP_SETFCAP to map uid 0
>> ]]
>>
>> added a new requirement when updating the UID map of a user namespace
>> with a value of '0 0 *'.
>>
>> Kir sent a patch to briefly document this change, but I think much more
>> should be written. I've attempted to do so. Could you tell me whether the
>> following text (to be added in user_namespaces(7)) is accurate please:
> 
> Sorry for the delay - this did not go into my main mailbox.
> 
> The text looks good.  Thanks!

Thanks for checking it!

Cheers,

Michael

>> [[
>>       In  order  for  a  process  to  write  to  the /proc/[pid]/uid_map
>>        (/proc/[pid]/gid_map) file, all of the following requirements must
>>        be met:
>>
>>        [...]
>>
>>        4. If  updating  /proc/[pid]/uid_map to create a mapping that maps
>>           UID 0 in the parent namespace, then one of the  following  must
>>           be true:
>>
>>           *  if  the  writing  process is in the parent user namespace, then it
>>              must have the CAP_SETFCAP capability in that user namespace;
>>              or
>>
>>           *  if  the writing process is in the child user namespace, then
>>              the process that created the user namespace  must  have  had
>>              the CAP_SETFCAP capability when the namespace was created.
>>
>>           This rule has been in place since Linux 5.12.  It eliminates an
>>           earlier security bug whereby a UID 0  process  that  lacks  the
>>           CAP_SETFCAP capability, which is needed to create a binary with
>>           namespaced file capabilities (as described in capabilities(7)),
>>           could  nevertheless  create  such  a  binary,  by the following
>>           steps:
>>
>>           *  Create a new user namespace with the identity mapping (i.e.,
>>              UID  0 in the new user namespace maps to UID 0 in the parent
>>              namespace), so that UID 0 in both namespaces  is  equivalent
>>              to the same root user ID.
>>
>>           *  Since  the  child process has the CAP_SETFCAP capability, it
>>              could create a binary with namespaced file capabilities that
>>              would  then  be  effective in the parent user namespace (be‐
>>              cause the root user IDs are the same in the two namespaces).
>>
>>        [...]
>> ]]
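
Rule 4 in the text quoted above can be read as a small predicate. The
following is only a sketch of that reading, not kernel code: the
function name and its boolean parameters are invented for illustration,
and the other requirements for writing a UID map (elided as "[...]" in
the quoted text) are not modelled.

```c
#include <stdbool.h>

/*
 * Hypothetical model of rule 4: may a process write a uid_map entry
 * that maps UID 0 of the parent user namespace?
 *
 * writer_in_parent_ns:  the writing process is in the parent user
 *                       namespace (otherwise it is in the child)
 * writer_has_setfcap:   the writer holds CAP_SETFCAP in the parent
 *                       user namespace
 * creator_had_setfcap:  the process that created the child user
 *                       namespace held CAP_SETFCAP at creation time
 */
static bool may_map_parent_uid0(bool writer_in_parent_ns,
                                bool writer_has_setfcap,
                                bool creator_had_setfcap)
{
    if (writer_in_parent_ns)
        return writer_has_setfcap;

    /* Writer is in the child namespace: only the creator's state counts. */
    return creator_had_setfcap;
}
```

For instance, a writer in the child namespace is judged only by whether
the namespace's creator had CAP_SETFCAP, regardless of what the writer
itself holds.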
>>
>> Thanks,
>>
>> Michael
>>
>> -- 
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 21:06  9%   ` Michael Kerrisk (man-pages)
@ 2021-08-11 10:07  4%     ` Christian Brauner
  2021-08-12  5:36  9%       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-11 10:07 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 11:06:52PM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Christian,
> 
> On 8/10/21 4:32 PM, Christian Brauner wrote:
> > On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hi Christian,
> >>
> >> Thanks for the very nice manual page that you wrote. I have
> > 
> > Thank you!
> > 
> >> made a large number of (mostly trivial) edits. If you could
> >> read the page closely, to check that I introduced no errors,
> >> I would appreciate it.
> > 
> > Happy to!
> 
> Thanks for the feedback. I've made some changes, and pushed to Git.
> 
> There's still a few open questions. Please see "????" below.

Sure.

> 
> >> I have various questions below, marked ???. Could you please take
> >> a look at these, and I will then make further edits based on your
> >> answers.
> > 
> > I've answered all questions, I think. Feel free to just reformulate
> > where my suggestions weren't adequate. Since most things you ask about
> > are minor adaptions there's no need from my end for you to resend with
> > those reformulations. You can just make them directly. :) I'll peruse
> > the man-pages git repo anyway after you apply them and will send changes
> > if I spot issues.
> > 
> > Thank you for the review!
> > Christian
> > 
> >>
> >> The current version of the page is already pushed to the man-pages
> >> Git repo.
> >>
> >>>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
> >>>
> >>>   NAME
> >>>       mount_setattr - change mount properties of a mount or mount
> >>
> >> ???
> >> s/mount properties/properties ?
> >>
> >> (Just because more concise.)
> > 
> > Sounds good.
> 
> Done.
> 
> >>
> >>>       tree
> >>>
> >>>   SYNOPSIS
> >>>       #include <linux/fcntl.h> /* Definition of AT_* constants */
> >>>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
> >>>       #include <sys/syscall.h> /* Definition of SYS_* constants */
> >>>       #include <unistd.h>
> >>>
> >>>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
> >>>               unsigned int flags, struct mount_attr *attr, size_t size);
> >>>
> >>>       Note: glibc provides no wrapper for mount_setattr(),
> >>>       necessitating the use of syscall(2).
> >>>
> >>>   DESCRIPTION
> 
> [...]
> 
> >>>       The size argument should usually be specified as
> >>>       sizeof(struct mount_attr).  However, if the caller does not
> >>>       intend to make use of features that got introduced after the
> >>>       initial version of struct mount_attr, it is possible to pass
> >>>       the size of the initial struct together with the larger
> >>>       struct.  This allows the kernel to not copy later parts of
> >>>       the struct that aren't used anyway.  With each extension that
> >>>       changes the size of struct mount_attr, the kernel will expose
> >>>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
> >>>       example, the macro for the size of the initial version of
> >>>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
> >>
> >> ???
> >> I think I understand the above paragraph, but I wonder if it could
> >> be improved a little. The general principle is that one can always
> >> pass the size of an earlier, smaller structure to the kernel, right?
> > 
> > Yes.
> > 
> >> My point is that it need not be the size of the initial structure,
> >> right? So, I wonder whether a little rewording might be needed above.
> > 
> > Yes, the initial structure size is just an example because that will be
> > very common.
> > 
> >> What do you think?
> > 
> > Sure, I'm proposing something here but please, feel free to reformulate
> > or come up with something completely new:
> > 
> > 	[...]
> > 	However, if the caller is using a kernel that supports an
> > 	extended struct mount_attr but the caller does not intend to
> > 	make use of these features they can pass the size of an earlier
> > 	version of the struct together with the extended structure.
> > 	[...]
> 
> Perfect! I took that text pretty much exactly as you gave it.
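
As a rough illustration of the size rule discussed above, the sketch
below mirrors the initial layout of struct mount_attr locally (four
64-bit fields, matching MOUNT_ATTR_SIZE_VER0) and pairs it with a
purely hypothetical extended layout. The names mount_attr_v0,
mount_attr_v1_hypothetical, future_field, and size_to_pass() are
assumptions made for this example; no such extension exists in the
text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the initial version of struct mount_attr. */
struct mount_attr_v0 {
    uint64_t attr_set;
    uint64_t attr_clr;
    uint64_t propagation;
    uint64_t userns_fd;
};

/* HYPOTHETICAL later version: 'future_field' is invented for illustration. */
struct mount_attr_v1_hypothetical {
    uint64_t attr_set;
    uint64_t attr_clr;
    uint64_t propagation;
    uint64_t userns_fd;
    uint64_t future_field;
};

/*
 * Choose the 'size' argument: a caller that does not use the newer
 * field may pass the size of the earlier version, even though the
 * object it hands over has the larger, extended layout.
 */
static size_t size_to_pass(int uses_future_field)
{
    return uses_future_field ? sizeof(struct mount_attr_v1_hypothetical)
                             : sizeof(struct mount_attr_v0);
}
```

The design point is that the kernel then only needs to copy the part of
the structure that the caller actually relies on.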
> 
> [...]
> 
> >>>       The attr_set and attr_clr members are used to specify the
> >>>       mount properties that are supposed to be set or cleared for a
> >>>       mount or mount tree.  Flags set in attr_set enable a property
> >>>       on a mount or mount tree, and flags set in attr_clr remove a
> >>>       property from a mount or mount tree.
> >>>
> >>>       When changing mount properties, the kernel will first clear
> >>>       the flags specified in the attr_clr field, and then set the
> >>>       flags specified in the attr_set field:
> >>
> >> ???
> >> I find the following example a bit confusing. See below.
> >>
> >>>
> >>>           struct mount_attr attr = {
> >>>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
> >>>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
> >>>           };
> >>
> >> ???
> >> I *think* that what you are trying to show is that the above initializer
> >> results in the equivalent of the following code. Is that correct? If so, 
> >> I think the text needs some work to make this clearer. Let me know.
> > 
> > Yes, exactly. Feel free to remove that code and just explain it in text
> > if that's better.
> 
> I've done some rewording to say that the code snippet shows
> the effect of the initializer.
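
The clear-then-set ordering described above can be modelled with plain
bit operations. This is a sketch of the described semantics only, not
the kernel's implementation; apply_attr() is a made-up name, and the
flag values are repeated from <linux/mount.h> so that the fragment
builds without recent kernel headers.

```c
#include <stdint.h>

/* Values as in <linux/mount.h>, repeated for self-containment. */
#ifndef MOUNT_ATTR_RDONLY
#define MOUNT_ATTR_RDONLY 0x00000001
#endif
#ifndef MOUNT_ATTR_NOSUID
#define MOUNT_ATTR_NOSUID 0x00000002
#endif
#ifndef MOUNT_ATTR_NODEV
#define MOUNT_ATTR_NODEV  0x00000004
#endif
#ifndef MOUNT_ATTR_NOEXEC
#define MOUNT_ATTR_NOEXEC 0x00000008
#endif

/*
 * Hypothetical model: compute the resulting property flags by first
 * clearing everything named in attr_clr and then setting everything
 * named in attr_set.
 */
static uint64_t apply_attr(uint64_t current, uint64_t attr_clr,
                           uint64_t attr_set)
{
    current &= ~attr_clr;   /* step 1: clear */
    current |= attr_set;    /* step 2: set */
    return current;
}
```

Applied to a mount that currently has MOUNT_ATTR_NOEXEC and
MOUNT_ATTR_NODEV, the initializer from the page would leave it with
MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID only.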
> 
> [...]
> 
> >>>   RETURN VALUE
> >>>       On success, mount_setattr() returns zero.  On error, -1 is
> >>>       returned and errno is set to indicate the cause of the error.
> >>>
> >>>   ERRORS
> 
> [...]
> 
> >>>       EINVAL A valid file descriptor value was specified in
> >>>              userns_fd, but the file descriptor wasn't a namespace
> >>>              file descriptor or did not refer to a user namespace.
> >>
> >> ???
> >> Could the above not be simplified to
> >>
> >>       EINVAL A valid file descriptor value was specified in
> >>              userns_fd, but the file descriptor did not refer
> >>              to a user namespace.
> > 
> > Sounds good.
> > 
> >> ?
> 
> Done.
> 
> >>>
> >>>       EINVAL The underlying filesystem does not support ID-mapped
> >>>              mounts.
> >>>
> >>>       EINVAL The mount that is to be ID mapped is not a
> >>>              detached/anonymous mount; that is, the mount is
> >>
> >> ???
> >> What is the distinction between "detached" and "anonymous"?
> >> Or do you mean them to be synonymous? If so, then let's use
> >> just one term, and I think "detached" is preferable.
> > 
> > Yes, they are synonymous here. I list both because detached can
> > potentially be confusing. A detached mount is a mount that has not been
> > visible in the filesystem. But if you attach it and then unmount it
> > right after and keep the fd for the mountpoint open it's a detached
> > mount purely on a natural language level, I'd argue. But it's not a
> > detached mount from the kernel's view anymore because it has been
> > exposed in the filesystem and is thus not detached anymore.
> > But I do prefer "detached" to "anonymous" and that confusion is very
> > unlikely to occur.
> 
> Thanks. I made it "detached". Elsewhere, the page already explains
> that a detached mount is one that:
> 
>           must have been created by calling open_tree(2) with the
>           OPEN_TREE_CLONE flag and it must not already have been
>           visible in the filesystem.
> 
> Which seems a fine explanation. 
> 
> ????
> But, just a thought... "visible in the filesystem" seems not quite accurate. 
> What you really mean I guess is that it must not already have been
> /visible in the filesystem hierarchy/previously mounted/something else/,
> right?

A detached mount is created via the OPEN_TREE_CLONE flag. It is a
separate new mount so "previously mounted" is not applicable.
A detached mount is _related_ to what the MS_BIND flag gives you with
mount(2). However, they differ conceptually and technically. A MS_BIND
mount(2) is always visible in the filesystem when mount(2) returns, i.e.
it is discoverable by regular path-lookup starting within the
filesystem.

However, a detached mount can be seen as a split of MS_BIND into two
distinct steps:
1. fd_tree = open_tree(OPEN_TREE_CLONE): create a new mount
2. move_mount(fd_tree, <somewhere>):     attach the mount to the filesystem

1. and 2. together give you the equivalent of MS_BIND.
In between 1. and 2. however the mount is detached. For the kernel
"detached" means that an anonymous mount namespace is attached to it
which doesn't appear in proc and has a 0 sequence number (Technically,
there's a bit of semantical argument to be made that "attached" and
"detached" are ambiguous as they could also be taken to mean "does or
does not have a parent mount". This ambiguity e.g. appears in
do_move_mount(). That's why the kernel itself calls it an "anonymous
mount". However, an OPEN_TREE_CLONE-detached mount of course doesn't
have a parent mount so it works.).
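
The two steps above could look roughly like this in C. It is a sketch
under several assumptions: glibc wrappers may be missing, so raw
syscall(2) is used; the fallback constants and syscall numbers are the
generic ones and are only there for older headers; AT_FDCWD is used for
both dirfd arguments; and bind_in_two_steps() is an invented name. The
caller needs the privileges that mounting normally requires.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed generic values). */
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef MOVE_MOUNT_F_EMPTY_PATH
#define MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#endif
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_move_mount
#define SYS_move_mount 429
#endif

/* Returns 0 on success, -1 with errno set on failure. */
static int bind_in_two_steps(const char *source, const char *target)
{
    /* Step 1: create a new mount; it is detached at this point. */
    int fd_tree = (int) syscall(SYS_open_tree, AT_FDCWD, source,
                                OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
    if (fd_tree < 0)
        return -1;

    /* Step 2: attach the mount, making it visible at 'target'. */
    long ret = syscall(SYS_move_mount, fd_tree, "", AT_FDCWD, target,
                       MOVE_MOUNT_F_EMPTY_PATH);

    int saved_errno = errno;    /* close() must not clobber errno */
    close(fd_tree);
    errno = saved_errno;

    return ret < 0 ? -1 : 0;
}
```

Between the two syscalls, fd_tree refers to a mount that has not been
made visible in any mount namespace, which is the window in which
properties such as an ID mapping could be applied.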

For userspace it's better to think of detached and attached in terms of
visibility in the filesystem or in a mount namespace. That's more
straightforward, more relevant, and hits the target in 90% of the cases.

However, the better and clearer picture is to say that a
OPEN_TREE_CLONE-detached mount is a mount that has never been
move_mount()ed. Which in turn can be defined as the detached mount has
never been made visible in a mount namespace. Once that has happened the
mount is irreversibly an attached mount.

I keep thinking that maybe we should just say "anonymous mount"
everywhere. So changing the wording to:

[...]
EINVAL The mount that is to be ID mapped is not an anonymous mount;
that is, the mount has already been visible in a mount namespace.
[...]

[...]
The mount must be an anonymous mount; that is, it must have been
created by calling open_tree(2) with the OPEN_TREE_CLONE flag and it
must not already have been visible in a mount namespace, i.e. it must
not have been attached to the filesystem hierarchy with syscalls such
as move_mount().
[...]

(I'm using the formulation "with syscalls such as move_mount()" to
future proof this. :)).

> 
> >>>              already visible in the filesystem.
> >>>
> 
> [...]
> 
> >>>       EPERM  An already ID-mapped mount was supposed to be ID
> >>>              mapped.
> >>
> >> ???
> >> Better:
> >>     An attempt was made to add an ID mapping to a mount that is already
> >>     ID mapped.
> > 
> > Sounds good.
> > 
> >> ?
> 
> Done.
> 
> [...]
> 
> >>>   NOTES
> >>>   ID-mapped mounts
> >>>       Creating an ID-mapped mount makes it possible to change the
> >>>       ownership of all files located under a mount.  Thus, ID-
> >>>       mapped mounts make it possible to change ownership in a
> >>>       temporary and localized way.  It is a localized change
> >>>       because ownership changes are restricted to a specific mount.
> >>
> >> ???
> >> Would it be clearer to say something like:
> >>
> >>     It is a localized change because ownership changes are
> >>     visible only via a specific mount.
> >> ?
> > 
> > Sounds good.
> 
> Done.
> 
> [...]
> 
> >>>       The following conditions must be met in order to create an
> >>>       ID-mapped mount:
> >>>
> >>>       •  The caller must have the CAP_SYS_ADMIN capability in the
> >>>          initial user namespace.
> >>>
> >>>       •  The filesystem must be mounted in the initial user
> >>>          namespace.
> >>
> >> ???
> >> Should this rather be written as:
> >>  
> >>      The filesystem must be mounted in a mount namespace 
> >>      that is owned by the initial user namespace.
> > 
> > Sounds good.
> 
> Done.
> 
> >>>       •  The underlying filesystem must support ID-mapped mounts.
> >>>          Currently, the xfs(5), ext4(5), and FAT filesystems
> >>>          support ID-mapped mounts with more filesystems being
> >>>          actively worked on.
> >>>
> >>>       •  The mount must not already be ID-mapped.  This also
> >>>          implies that the ID mapping of a mount cannot be altered.
> >>>
> >>>       •  The mount must be a detached/anonymous mount; that is, it
> >>
> >> ???
> >> See the above question on "detached" vs "anonymous"
> > 
> > Yes, please use "detached" only.
> 
> Done.
> 
> >>>          must have been created by calling open_tree(2) with the
> >>>          OPEN_TREE_CLONE flag and it must not already have been
> >>>          visible in the filesystem.
> >>>
> >>>       ID mappings can be created for user IDs, group IDs, and
> >>>       project IDs.  An ID mapping is essentially a mapping of a
> >>>       range of user or group IDs into another or the same range of
> >>>       user or group IDs.  ID mappings are usually written as three
> >>>       numbers either separated by white space or a full stop.  The
> >>>       first two numbers specify the starting user or group ID in
> >>>       each of the two user namespaces.  The third number specifies
> >>>       the range of the ID mapping.  For example, a mapping for user
> >>>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
> >>>       the caller's user namespace is mapped to user ID 1001 in its
> >>>       ancestor user namespace.  Since the map range is 1, only user
> >>>       ID 1000 is mapped.
> >>
> >> ???
> >> The details above seem wrong. When writing to map files, the
> >> fields must be white-space separated, AFAIK. But above you mention
> >> "full stops" and also show an example using colons (:). Those
> >> both seem wrong and confusing. Am I missing something?
> > 
> > This is more about notational conventions that exist and not about how
> > they are actually written. That's something I'm not touching on here as
> > it doesn't belong on this manpage. But feel free to only mention spaces.
> 
> Thanks for the explanation. In this context though, this could mislead
> the reader, so I've removed mention of "full stop" and ":".
> 
> >>>       It is possible to specify up to 340 ID mappings for each ID
> >>>       mapping type.  If any user IDs or group IDs are not mapped,
> >>>       all files owned by that unmapped user or group ID will appear
> >>>       as being owned by the overflow user ID or overflow group ID
> >>>       respectively.
> >>>
> >>>       Further details and instructions for setting up ID mappings
> >>>       can be found in the user_namespaces(7) man page.
> >>>
> >>>       In the common case, the user namespace passed in userns_fd
> >>>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
> >>>       mapped mount will be the user namespace of a container.  In
> >>>       other scenarios it will be a dedicated user namespace
> >>>       associated with a user's login session as is the case for
> >>>       portable home directories in systemd-homed.service(8).  It
> >>>       is also perfectly fine to create a dedicated user namespace
> >>>       for the sake of ID mapping a mount.
> 
> I forgot to mention it earlier, but the following text on the
> rationale for ID-mapped mounts is what turns this from a good 
> manual page into a great manual page. Thank you for including it.

Thank you for saying that. Appreciate it.

> 
> >>>       ID-mapped mounts can be useful in the following and a variety
> >>>       of other scenarios:
> >>>
> >>>       •  Sharing files between multiple users or multiple machines,
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > [1]: But work. But feel free to use "sharing filesystems".
> 
> s/But/Both/
> 
> I made it "Sharing files or filesystems"
> 
> >>
> >>>          especially in complex scenarios.  For example, ID-mapped
> >>>          mounts are used to implement portable home directories in
> >>>          systemd-homed.service(8), where they allow users to move
> >>>          their home directory to an external storage device and use
> >>>          it on multiple computers where they are assigned different
> >>>          user IDs and group IDs.  This effectively makes it
> >>>          possible to assign random user IDs and group IDs at login
> >>>          time.
> >>>
> >>>       •  Sharing files from the host with unprivileged containers.
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > See [1].
> 
> Same.
> 
> >>>          This allows a user to avoid having to change ownership
> >>>          permanently through chown(2).
> >>>
> >>>       •  ID mapping a container's root filesystem.  Users don't
> >>>          need to change ownership permanently through chown(2).
> >>>          Especially for large root filesystems, using chown(2) can
> >>>          be prohibitively expensive.
> >>>
> >>>       •  Sharing files between containers with non-overlapping ID
> >>
> >> ???
> >> s/Sharing files/Sharing filesystems/ ?
> > 
> > See [1].
> 
> Same.
> 
> [...]
> 
> >>>       •  Locally and temporarily restricted ownership changes.  ID-
> >>>          mapped mounts make it possible to change ownership
> >>>          locally, restricting it to specific mounts, and
> >>
> >> ???
> >> The referent of "it" in the preceding line is not clear.
> >> Should it be "the ownership changes"? Or something else?
> > 
> > It should refer to ownership changes. I'd appreciate it if you could
> > reformulate.
> 
> Done.
> 
> >>>          temporarily as the ownership changes only apply as long as
> >>>          the mount exists.  By contrast, changing ownership via the
> >>>          chown(2) system call changes the ownership globally and
> >>>          permanently.
> >>>
> >>>   Extensibility
> 
> [...]
> 
> >>>   EXAMPLES
> >>
> >> ???
> >> Do you have a (preferably simple) example piece of code
> >> somewhere for setting up an ID mapped mount?
> 
> ????
> I guess the best example is this:
> https://github.com/brauner/mount-idmapped/
> right?

Ah yes, sorry. I forgot to answer that yesterday. I sent you links via
another medium but I repeat it here.
There are two places. The link you have here is a private repo. But I've
also merged a program alongside the fstests testsuite I merged:
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/tree/src/idmapped-mounts/mount-idmapped.c
which should be nicer and has seen reviews by Amir and Christoph.

> 
> [...]
> 
> >>>       int
> >>>       main(int argc, char *argv[])
> >>>       {
> >>>           struct mount_attr *attr = &(struct mount_attr){};
> >>>           int fd_userns = -EBADF;
> >>
> >> ???
> >> Why this magic initializer here? Why not just "-1"?
> >> Using -EBADF makes it look like this value specifically is
> >> meaningful, although I don't think that's true.
> > 
> > [2]: I always use -EBADF to initialize fds in all my code. It makes it
> > pretty easy to grep for fd initialization etc. So it's pure visual
> > convenience. Feel free to just use -1.
> 
> Changed.
> 
> [...]
> 
> >>>           int fd_tree = open_tree(-EBADF, source,
> >>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> >>
> >> ???
> >> What is the significance of -EBADF here? As far as I can tell, it
> >> is not meaningful to open_tree()?
> > 
> > I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.
> 
> ????
> But here, both -EBADF and -1 seem to be wrong. This argument 
> is a dirfd, and so should either be a file descriptor or the
> value AT_FDCWD, right?

[1]: In this code "source" is expected to be absolute. If it's not
     absolute we should fail. This can be achieved by passing -1/-EBADF,
     afaict.

> 
> >>>           if (fd_tree == -1)
> >>>               exit_log("%m - Failed to open %s\n", source);
> >>>
> >>>           if (fd_userns >= 0) {
> >>>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
> >>>               attr->userns_fd = fd_userns;
> >>>           }
> >>>
> >>>           ret = mount_setattr(fd_tree, "",
> >>>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
> >>>                       attr, sizeof(struct mount_attr));
> >>>           if (ret == -1)
> >>>               exit_log("%m - Failed to change mount attributes\n");
> >>>
> >>>           close(fd_userns);
> >>>
> >>>           ret = move_mount(fd_tree, "", -EBADF, target,
> >>>                            MOVE_MOUNT_F_EMPTY_PATH);
> >>
> >> ???
> >> What is the significance of -EBADF here? As far as I can tell, it
> >> is not meaningful to move_mount()?
> > 
> > See [2].
> 
> ????
> As above, both -EBADF and -1 seem to be wrong. This argument 
> is a dirfd, and so should either be a file descriptor or the
> value AT_FDCWD, right?

See [1].

Thanks!
Christian

^ permalink raw reply	[relevance 4%]

* Re: Documenting the requirement of CAP_SETFCAP to map UID 0
  2021-08-08  9:09  9% Documenting the requirement of CAP_SETFCAP to map UID 0 Michael Kerrisk (man-pages)
@ 2021-08-10 23:58  5% ` Serge E. Hallyn
  2021-08-11 10:10 11%   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Serge E. Hallyn @ 2021-08-10 23:58 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Serge E. Hallyn, linux-security-module, lkml, Alejandro Colomar,
	Kir Kolyshkin, linux-man

On Sun, Aug 08, 2021 at 11:09:30AM +0200, Michael Kerrisk (man-pages) wrote:
> Hello Serge,
> 
> Your commit:
> 
> [[
> commit db2e718a47984b9d71ed890eb2ea36ecf150de18
> Author: Serge E. Hallyn <serge@hallyn.com>
> Date:   Tue Apr 20 08:43:34 2021 -0500
> 
>     capabilities: require CAP_SETFCAP to map uid 0
> ]]
> 
> added a new requirement when updating the UID map of a user namespace
> with a value of '0 0 *'.
> 
> Kir sent a patch to briefly document this change, but I think much more
> should be written. I've attempted to do so. Could you tell me whether the
> following text (to be added in user_namespaces(7)) is accurate please:

Sorry for the delay - this did not go into my main mailbox.

The text looks good.  Thanks!
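
As a rough illustration of which uid_map lines fall under the rule quoted
below, here is a small sketch. It only looks at the three numbers of one
map line (ID inside the namespace, ID outside the namespace, length); the
function name maps_parent_root() is made up, and the capability checks
themselves are not modelled. It is not how the kernel implements the
check.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: does one uid_map line, written as
 * "ID-inside-ns ID-outside-ns length", map UID 0 of the parent namespace?
 */
bool
maps_parent_root(const char *line)
{
    unsigned int inside, outside, length;

    if (sscanf(line, "%u %u %u", &inside, &outside, &length) != 3)
        return false;

    /*
     * IDs are unsigned, so the range [outside, outside + length) can
     * contain 0 only if it starts at 0 and is not empty.
     */
    return outside == 0 && length > 0;
}
```

So "0 0 1" and "1000 0 1" both map parent UID 0 and would need
CAP_SETFCAP as described, whereas "0 1000 1" would not.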

> [[
>       In  order  for  a  process  to  write  to  the /proc/[pid]/uid_map
>        (/proc/[pid]/gid_map) file, all of the following requirements must
>        be met:
> 
>        [...]
> 
>        4. If  updating  /proc/[pid]/uid_map to create a mapping that maps
>           UID 0 in the parent namespace, then one of the  following  must
>           be true:
> 
>           *  if the writing process is in the parent user namespace, then it
>              must have the CAP_SETFCAP capability in that user namespace;
>              or
> 
>           *  if  the writing process is in the child user namespace, then
>              the process that created the user namespace  must  have  had
>              the CAP_SETFCAP capability when the namespace was created.
> 
>           This rule has been in place since Linux 5.12.  It eliminates an
>           earlier security bug whereby a UID 0  process  that  lacks  the
>           CAP_SETFCAP capability, which is needed to create a binary with
>           namespaced file capabilities (as described in capabilities(7)),
>           could  nevertheless  create  such  a  binary,  by the following
>           steps:
> 
>           *  Create a new user namespace with the identity mapping (i.e.,
>              UID  0 in the new user namespace maps to UID 0 in the parent
>              namespace), so that UID 0 in both namespaces  is  equivalent
>              to the same root user ID.
> 
>           *  Since  the  child process has the CAP_SETFCAP capability, it
>              could create a binary with namespaced file capabilities that
>              would  then  be  effective in the parent user namespace (be‐
>              cause the root user IDs are the same in the two namespaces).
> 
>        [...]
> ]]
> 
> Thanks,
> 
> Michael
> 
> -- 
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 5%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38  4% Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
  2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
  2021-08-10 14:32  4% ` Christian Brauner
@ 2021-08-10 22:47  5% ` Michael Kerrisk (man-pages)
  2021-08-11 10:40  4%   ` Christian Brauner
  2 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10 22:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

Some further questions...

In ERRORS there is:

       EINVAL The underlying filesystem is mounted in a user namespace.

I don't understand this. What does it mean?

Also, there is this:

       ENOMEM When  changing  mount  propagation to MS_SHARED, a new peer
              group ID needs to be allocated for  all  mounts  without  a
              peer  group  ID  set.  Allocation of this peer group ID has
              failed.

       ENOSPC When changing mount propagation to MS_SHARED,  a  new  peer
              group  ID  needs  to  be allocated for all mounts without a
              peer group ID set.  Allocation of this peer  group  ID  can
              fail.  Note that technically further error codes are possi‐
              ble that are specific to the ID  allocation  implementation
              used.

What is the difference between these two error cases? (That is, in what 
circumstances will one get ENOMEM vs ENOSPC and vice versa?)

And then:

       EPERM  One  of  the mounts had at least one of MOUNT_ATTR_NOATIME,
              MOUNT_ATTR_NODEV, MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the flag is
              locked.  Mount attributes become locked on a mount if:

              •  A new mount or mount tree is created causing mount prop‐
                 agation  across  user  namespaces.  The kernel will lock

Propagation is done across mount points, not user namespaces.
Should "across user namespaces" be "to a mount namespace owned 
by a different user namespace"? Or something else?

                 the aforementioned  flags  to  protect  these  sensitive
                 properties from being altered.

              •  A  new  mount  and user namespace pair is created.  This
                 happens for  example  when  specifying  CLONE_NEWUSER  |
                 CLONE_NEWNS  in unshare(2), clone(2), or clone3(2).  The
                 aforementioned flags become locked to protect user name‐
                 spaces from altering sensitive mount properties.

Again, this seems imprecise. Should it say something like:
"... to prevent changes to sensitive mount properties in the new 
mount namespace" ? Or perhaps you have a better wording.

Thanks,

Michael

^ permalink raw reply	[relevance 5%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 14:32  4% ` Christian Brauner
@ 2021-08-10 21:06  9%   ` Michael Kerrisk (man-pages)
  2021-08-11 10:07  4%     ` Christian Brauner
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10 21:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hello Christian,

On 8/10/21 4:32 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> Thanks for the very nice manual page that you wrote. I have
> 
> Thank you!
> 
>> made a large number of (mostly trivial) edits. If you could
>> read the page closely, to check that I introduced no errors,
>> I would appreciate it.
> 
> Happy to!

Thanks for the feedback. I've made some changes, and pushed to Git.

There are still a few open questions. Please see "????" below.

>> I have various questions below, marked ???. Could you please take
>> a look at these, and I will then make further edits based on your
>> answers.
> 
> I've answered all questions, I think. Feel free to just reformulate
> where my suggestions weren't adequate. Since most things you ask about
> are minor adaptions there's no need from my end for you to resend with
> those reformulations. You can just make them directly. :) I'll peruse
> the man-pages git repo anyway after you apply them and will send changes
> if I spot issues.
> 
> Thank you for the review!
> Christian
> 
>>
>> The current version of the page is already pushed to the man-pages
>> Git repo.
>>
>>>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
>>>
>>>   NAME
>>>       mount_setattr - change mount properties of a mount or mount
>>
>> ???
>> s/mount properties/properties ?
>>
>> (Just because it's more concise.)
> 
> Sounds good.

Done.

>>
>>>       tree
>>>
>>>   SYNOPSIS
>>>       #include <linux/fcntl.h> /* Definition of AT_* constants */
>>>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
>>>       #include <sys/syscall.h> /* Definition of SYS_* constants */
>>>       #include <unistd.h>
>>>
>>>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
>>>               unsigned int flags, struct mount_attr *attr, size_t size);
>>>
>>>       Note: glibc provides no wrapper for mount_setattr(),
>>>       necessitating the use of syscall(2).
>>>
>>>   DESCRIPTION

[...]

>>>       The size argument should usually be specified as
>>>       sizeof(struct mount_attr).  However, if the caller does not
>>>       intend to make use of features that got introduced after the
>>>       initial version of struct mount_attr, it is possible to pass
>>>       the size of the initial struct together with the larger
>>>       struct.  This allows the kernel to not copy later parts of
>>>       the struct that aren't used anyway.  With each extension that
>>>       changes the size of struct mount_attr, the kernel will expose
>>>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
>>>       example, the macro for the size of the initial version of
>>>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
>>
>> ???
>> I think I understand the above paragraph, but I wonder if it could
>> be improved a little. The general principle is that one can always
>> pass the size of an earlier, smaller structure to the kernel, right?
> 
> Yes.
> 
>> My point is that it need not be the size of the initial structure,
>> right? So, I wonder whether a little rewording might be needed above.
> 
> Yes, the initial structure size is just an example because that will be
> very common.
> 
>> What do you think?
> 
> Sure, I'm proposing something here but please, feel free to reformulate
> or come up with something completely new:
> 
> 	[...]
> 	However, if the caller is using a kernel that supports an
> 	extended struct mount_attr but the caller does not intend to
> 	make use of these features they can pass the size of an earlier
> 	version of the struct together with the extended structure.
> 	[...]

Perfect! I took that text pretty much exactly as you gave it.
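
To make the size convention concrete, here is a userspace sketch of how a
kernel can accept a struct whose size differs from the one it was built
with. The helper name copy_versioned_struct() is hypothetical, and
whether mount_setattr() follows exactly these rules is an assumption on
my part; the sketch only models the general idea of passing the size of
an earlier version.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: the "kernel" knows a struct of ksize bytes, the
 * caller passes one of usize bytes.
 */
int
copy_versioned_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
    size_t n = usize < ksize ? usize : ksize;
    const unsigned char *p = src;

    /*
     * The caller's struct is larger: fields this kernel does not know
     * about must all be zero, otherwise the request is refused.
     */
    for (size_t i = ksize; i < usize; i++)
        if (p[i] != 0)
            return -E2BIG;

    /* Fields the caller did not supply are treated as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, n);
    return 0;
}
```

In this model, passing MOUNT_ATTR_SIZE_VER0 as usize simply means that
every later field reads as zero, so no newer feature is requested.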

[...]

>>>       The attr_set and attr_clr members are used to specify the
>>>       mount properties that are supposed to be set or cleared for a
>>>       mount or mount tree.  Flags set in attr_set enable a property
>>>       on a mount or mount tree, and flags set in attr_clr remove a
>>>       property from a mount or mount tree.
>>>
>>>       When changing mount properties, the kernel will first clear
>>>       the flags specified in the attr_clr field, and then set the
>>>       flags specified in the attr_set field:
>>
>> ???
>> I find the following example a bit confusing. See below.
>>
>>>
>>>           struct mount_attr attr = {
>>>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
>>>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
>>>           };
>>
>> ???
>> I *think* that what you are trying to show is that the above initializer
>> results in the equivalent of the following code. Is that correct? If so, 
>> I think the text needs some work to make this clearer. Let me know.
> 
> Yes, exactly. Feel free to remove that code and just explain it in text
> if that's better.

I've done some rewording to say that the code snippet shows
the effect of the initializer.
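
For reference, the clear-then-set order can be sketched in plain C as
follows. The EX_ATTR_* constants and the function name apply_mount_attr()
are illustrative only; real code would use the MOUNT_ATTR_* constants
from <linux/mount.h>, and the kernel's own bookkeeping is not modelled.

```c
#include <stdint.h>

/* Illustrative flag bits standing in for the MOUNT_ATTR_* constants. */
#define EX_ATTR_RDONLY  UINT64_C(0x1)
#define EX_ATTR_NOSUID  UINT64_C(0x2)
#define EX_ATTR_NODEV   UINT64_C(0x4)
#define EX_ATTR_NOEXEC  UINT64_C(0x8)

/* Return the flags a mount ends up with after attr_clr and attr_set. */
uint64_t
apply_mount_attr(uint64_t flags, uint64_t attr_clr, uint64_t attr_set)
{
    flags &= ~attr_clr;   /* first clear what attr_clr names */
    flags |= attr_set;    /* then set what attr_set names */
    return flags;
}
```

One consequence of this order is that a flag named in both attr_clr and
attr_set ends up set.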

[...]

>>>   RETURN VALUE
>>>       On success, mount_setattr() returns zero.  On error, -1 is
>>>       returned and errno is set to indicate the cause of the error.
>>>
>>>   ERRORS

[...]

>>>       EINVAL A valid file descriptor value was specified in
>>>              userns_fd, but the file descriptor wasn't a namespace
>>>              file descriptor or did not refer to a user namespace.
>>
>> ???
>> Could the above not be simplified to
>>
>>       EINVAL A valid file descriptor value was specified in
>>              userns_fd, but the file descriptor did not refer
>>              to a user namespace.
> 
> Sounds good.
> 
>> ?

Done.

>>>
>>>       EINVAL The underlying filesystem does not support ID-mapped
>>>              mounts.
>>>
>>>       EINVAL The mount that is to be ID mapped is not a
>>>              detached/anonymous mount; that is, the mount is
>>
>> ???
>> What is the distinction between "detached" and "anonymous"?
>> Or do you mean them to be synonymous? If so, then let's use
>> just one term, and I think "detached" is preferable.
> 
> Yes, they are synonymous here. I list both because detached can
> potentially be confusing. A detached mount is a mount that has not been
> visible in the filesystem. But if you attach it and then unmount it
> right after and keep the fd for the mountpoint open it's a detached
> mount purely on a natural language level, I'd argue. But it's not a
> detached mount from the kernel's view anymore because it has been
> exposed in the filesystem and is thus not detached anymore.
> But I do prefer "detached" to "anonymous" and that confusion is very
> unlikely to occur.

Thanks. I made it "detached". Elsewhere, the page already explains
that a detached mount is one that:

          must have been created by calling open_tree(2) with the
          OPEN_TREE_CLONE flag and it must not already have been
          visible in the filesystem.

Which seems a fine explanation. 

????
But, just a thought... "visible in the filesystem" seems not quite accurate. 
What you really mean I guess is that it must not already have been
/visible in the filesystem hierarchy/previously mounted/something else/,
right?

>>>              already visible in the filesystem.
>>>

[...]

>>>       EPERM  An already ID-mapped mount was supposed to be ID
>>>              mapped.
>>
>> ???
>> Better:
>>     An attempt was made to add an ID mapping to a mount that is already
>>     ID mapped.
> 
> Sounds good.
> 
>> ?

Done.

[...]

>>>   NOTES
>>>   ID-mapped mounts
>>>       Creating an ID-mapped mount makes it possible to change the
>>>       ownership of all files located under a mount.  Thus, ID-
>>>       mapped mounts make it possible to change ownership in a
>>>       temporary and localized way.  It is a localized change
>>>       because ownership changes are restricted to a specific mount.
>>
>> ???
>> Would it be clearer to say something like:
>>
>>     It is a localized change because ownership changes are
>>     visible only via a specific mount.
>> ?
> 
> Sounds good.

Done.

[...]

>>>       The following conditions must be met in order to create an
>>>       ID-mapped mount:
>>>
>>>       •  The caller must have the CAP_SYS_ADMIN capability in the
>>>          initial user namespace.
>>>
>>>       •  The filesystem must be mounted in the initial user
>>>          namespace.
>>
>> ???
>> Should this rather be written as:
>>  
>>      The filesystem must be mounted in a mount namespace 
>>      that is owned by the initial user namespace.
> 
> Sounds good.

Done.

>>>       •  The underlying filesystem must support ID-mapped mounts.
>>>          Currently, the xfs(5), ext4(5), and FAT filesystems
>>>          support ID-mapped mounts with more filesystems being
>>>          actively worked on.
>>>
>>>       •  The mount must not already be ID-mapped.  This also
>>>          implies that the ID mapping of a mount cannot be altered.
>>>
>>>       •  The mount must be a detached/anonymous mount; that is, it
>>
>> ???
>> See the above question on "detached" vs "anonymous"
> 
> Yes, please use "detached" only.

Done.

>>>          must have been created by calling open_tree(2) with the
>>>          OPEN_TREE_CLONE flag and it must not already have been
>>>          visible in the filesystem.
>>>
>>>       ID mappings can be created for user IDs, group IDs, and
>>>       project IDs.  An ID mapping is essentially a mapping of a
>>>       range of user or group IDs into another or the same range of
>>>       user or group IDs.  ID mappings are usually written as three
>>>       numbers either separated by white space or a full stop.  The
>>>       first two numbers specify the starting user or group ID in
>>>       each of the two user namespaces.  The third number specifies
>>>       the range of the ID mapping.  For example, a mapping for user
>>>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
>>>       the caller's user namespace is mapped to user ID 1001 in its
>>>       ancestor user namespace.  Since the map range is 1, only user
>>>       ID 1000 is mapped.
>>
>> ???
>> The details above seem wrong. When writing to map files, the
>> fields must be white-space separated, AFAIK. But above you mention
>> "full stops" and also show an example using colons (:). Those
>> both seem wrong and confusing. Am I missing something?
> 
> This is more about notational conventions that exist and not about how
> they are actually written. That's something I'm not touching on here as
> it doesn't belong on this manpage. But feel free to only mention spaces.

Thanks for the explanation. In this context though, this could mislead
the reader, so I've removed mention of "full stop" and ":".
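
For what it's worth, the three-number form can be read as follows. This
is only a sketch of the arithmetic for a single mapping; the struct and
function names are made up for illustration and are not taken from the
manual page or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* One ID mapping, written as three numbers: "first lower_first count". */
struct id_extent {
    uint32_t first;        /* first ID in the caller's user namespace */
    uint32_t lower_first;  /* first ID in the ancestor user namespace */
    uint32_t count;        /* number of consecutive IDs that are mapped */
};

/* Translate id through the extent; return false if id is not covered. */
bool
map_id(const struct id_extent *ext, uint32_t id, uint32_t *mapped)
{
    if (id < ext->first || id - ext->first >= ext->count)
        return false;

    *mapped = ext->lower_first + (id - ext->first);
    return true;
}
```

With the mapping "1000 1001 1" from the text, ID 1000 translates to 1001
and every other ID, including 1001 itself, is unmapped.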

>>>       It is possible to specify up to 340 ID mappings for each ID
>>>       mapping type.  If any user IDs or group IDs are not mapped,
>>>       all files owned by that unmapped user or group ID will appear
>>>       as being owned by the overflow user ID or overflow group ID
>>>       respectively.
>>>
>>>       Further details and instructions for setting up ID mappings
>>>       can be found in the user_namespaces(7) man page.
>>>
>>>       In the common case, the user namespace passed in userns_fd
>>>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
>>>       mapped mount will be the user namespace of a container.  In
>>>       other scenarios it will be a dedicated user namespace
>>>       associated with a user's login session as is the case for
>>>       portable home directories in systemd-homed.service(8).  It
>>>       is also perfectly fine to create a dedicated user namespace
>>>       for the sake of ID mapping a mount.

I forgot to mention it earlier, but the following text on the
rationale for ID-mapped mounts is what turns this from a good 
manual page into a great manual page. Thank you for including it.

>>>       ID-mapped mounts can be useful in the following and a variety
>>>       of other scenarios:
>>>
>>>       •  Sharing files between multiple users or multiple machines,
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> [1]: But work. But feel free to use "sharing filesystems".

s/But/Both/

I made it "Sharing files or filesystems"

>>
>>>          especially in complex scenarios.  For example, ID-mapped
>>>          mounts are used to implement portable home directories in
>>>          systemd-homed.service(8), where they allow users to move
>>>          their home directory to an external storage device and use
>>>          it on multiple computers where they are assigned different
>>>          user IDs and group IDs.  This effectively makes it
>>>          possible to assign random user IDs and group IDs at login
>>>          time.
>>>
>>>       •  Sharing files from the host with unprivileged containers.
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> See [1].

Same.

>>>          This allows a user to avoid having to change ownership
>>>          permanently through chown(2).
>>>
>>>       •  ID mapping a container's root filesystem.  Users don't
>>>          need to change ownership permanently through chown(2).
>>>          Especially for large root filesystems, using chown(2) can
>>>          be prohibitively expensive.
>>>
>>>       •  Sharing files between containers with non-overlapping ID
>>
>> ???
>> s/Sharing files/Sharing filesystems/ ?
> 
> See [1].

Same.

[...]

>>>       •  Locally and temporarily restricted ownership changes.  ID-
>>>          mapped mounts make it possible to change ownership
>>>          locally, restricting it to specific mounts, and
>>
>> ???
>> The referent of "it" in the preceding line is not clear.
>> Should it be "the ownership changes"? Or something else?
> 
> It should refer to ownership changes. I'd appreciate it if you could
> reformulate.

Done.

>>>          temporarily as the ownership changes only apply as long as
>>>          the mount exists.  By contrast, changing ownership via the
>>>          chown(2) system call changes the ownership globally and
>>>          permanently.
>>>
>>>   Extensibility

[...]

>>>   EXAMPLES
>>
>> ???
>> Do you have a (preferably simple) example piece of code
>> somewhere for setting up an ID mapped mount?

????
I guess the best example is this:
https://github.com/brauner/mount-idmapped/
right?

[...]

>>>       int
>>>       main(int argc, char *argv[])
>>>       {
>>>           struct mount_attr *attr = &(struct mount_attr){};
>>>           int fd_userns = -EBADF;
>>
>> ???
>> Why this magic initializer here? Why not just "-1"?
>> Using -EBADF makes it look like this value specifically is
>> meaningful, although I don't think that's true.
> 
> [2]: I always use -EBADF to initialize fds in all my code. It makes it
> pretty easy to grep for fd initialization etc. So it's pure visual
> convenience. Feel free to just use -1.

Changed.

[...]

>>>           int fd_tree = open_tree(-EBADF, source,
>>>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>>>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
>>
>> ???
>> What is the significance of -EBADF here? As far as I can tell, it
>> is not meaningful to open_tree()?
> 
> I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.

????
But here, both -EBADF and -1 seem to be wrong. This argument 
is a dirfd, and so should either be a file descriptor or the
value AT_FDCWD, right?

>>>           if (fd_tree == -1)
>>>               exit_log("%m - Failed to open %s\n", source);
>>>
>>>           if (fd_userns >= 0) {
>>>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
>>>               attr->userns_fd = fd_userns;
>>>           }
>>>
>>>           ret = mount_setattr(fd_tree, "",
>>>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
>>>                       attr, sizeof(struct mount_attr));
>>>           if (ret == -1)
>>>               exit_log("%m - Failed to change mount attributes\n");
>>>
>>>           close(fd_userns);
>>>
>>>           ret = move_mount(fd_tree, "", -EBADF, target,
>>>                            MOVE_MOUNT_F_EMPTY_PATH);
>>
>> ???
>> What is the significance of -EBADF here? As far as I can tell, it
>> is not meaningful to move_mount()?
> 
> See [2].

????
As above, both -EBADF and -1 seem to be wrong. This argument 
is a dirfd, and so should either be a file descriptor or the
value AT_FDCWD, right?

[...]

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 9%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10 14:11  5%   ` Christian Brauner
@ 2021-08-10 19:30 11%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10 19:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

On 8/10/21 4:11 PM, Christian Brauner wrote:
> On Tue, Aug 10, 2021 at 09:12:14AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Christian,
>>
>> One more question...
>>
>>>>       The propagation field is used to specify the propagation type
>>>>       of the mount or mount tree.  Mount propagation options are
>>>>       mutually exclusive; that is, the propagation values behave
>>>>       like an enum.  The supported mount propagation types are:
>>
>> The manual page text doesn't actually say it, but if the 'propagation'
>> field is 0, then this means leave the propagation type unchanged, 
>> right? This of course should be mentioned in the manual page.
> 
> Yes, if none of the documented values is set the propagation is unchanged.

Thanks for the confirmation.
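
Put as code, my understanding of the propagation field is roughly the
following. The function name propagation_valid() is made up, and the
sketch only captures "zero or exactly one propagation type"; it is not
the kernel's validation code.

```c
#include <stdbool.h>
#include <sys/mount.h>

/*
 * Hypothetical check: the propagation field behaves like an enum, so it
 * is either 0 or exactly one of the propagation types.
 */
bool
propagation_valid(unsigned long propagation)
{
    switch (propagation) {
    case 0:               /* leave the propagation type unchanged */
    case MS_PRIVATE:
    case MS_SHARED:
    case MS_SLAVE:
    case MS_UNBINDABLE:
        return true;
    default:              /* combinations and unrelated bits */
        return false;
    }
}
```

In particular, MS_SHARED | MS_PRIVATE is rejected because the values are
mutually exclusive.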

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38  4% Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
  2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
@ 2021-08-10 14:32  4% ` Christian Brauner
  2021-08-10 21:06  9%   ` Michael Kerrisk (man-pages)
  2021-08-10 22:47  5% ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-10 14:32 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 03:38:00AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> Thanks for the very nice manual page that you wrote. I have

Thank you!

> made a large number of (mostly trivial) edits. If you could
> read the page closely, to check that I introduced no errors,
> I would appreciate it.

Happy to!

> 
> I have various questions below, marked ???. Could you please take
> a look at these, and I will then make further edits based on your
> answers.

I've answered all questions, I think. Feel free to just reformulate
where my suggestions weren't adequate. Since most things you ask about
are minor adaptions there's no need from my end for you to resend with
those reformulations. You can just make them directly. :) I'll peruse
the man-pages git repo anyway after you apply them and will send changes
if I spot issues.

Thank you for the review!
Christian

> 
> The current version of the page is already pushed to the man-pages
> Git repo.
> 
> >   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
> >
> >   NAME
> >       mount_setattr - change mount properties of a mount or mount
> 
> ???
> s/mount properties/properties ?
> 
> (Just because it is more concise.)

Sounds good.

> 
> >       tree
> >
> >   SYNOPSIS
> >       #include <linux/fcntl.h> /* Definition of AT_* constants */
> >       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
> >       #include <sys/syscall.h> /* Definition of SYS_* constants */
> >       #include <unistd.h>
> >
> >       int syscall(SYS_mount_setattr, int dirfd, const char *path,
> >               unsigned int flags, struct mount_attr *attr, size_t size);
> >
> >       Note: glibc provides no wrapper for mount_setattr(),
> >       necessitating the use of syscall(2).
> >
> >   DESCRIPTION
> >       The mount_setattr() system call changes the mount properties
> >       of a mount or an entire mount tree.  If path is a relative
> >       pathname, then it is interpreted relative to the directory
> >       referred to by the file descriptor dirfd.  If dirfd is the
> >       special value AT_FDCWD, then path is interpreted relative to
> >       the current working directory of the calling process.  If
> >       path is the empty string and AT_EMPTY_PATH is specified in
> >       flags, then the mount properties of the mount identified by
> >       dirfd are changed.
> >
> >       The mount_setattr() system call uses an extensible structure
> >       (struct mount_attr) to allow for future extensions.  Any non-
> >       flag extensions to mount_setattr() will be implemented as new
> >       fields appended to this structure, with a zero value in a
> >       new field resulting in the kernel behaving as though that
> >       extension field was not present.  Therefore, the caller must
> >       zero-fill this structure on initialization.  See the
> >       "Extensibility" subsection under NOTES for more details.
> >
> >       The size argument should usually be specified as
> >       sizeof(struct mount_attr).  However, if the caller does not
> >       intend to make use of features that got introduced after the
> >       initial version of struct mount_attr, it is possible to pass
> >       the size of the initial struct together with the larger
> >       struct.  This allows the kernel to not copy later parts of
> >       the struct that aren't used anyway.  With each extension that
> >       changes the size of struct mount_attr, the kernel will expose
> >       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
> >       example, the macro for the size of the initial version of
> >       struct mount_attr is MOUNT_ATTR_SIZE_VER0.
> 
> ???
> I think I understand the above paragraph, but I wonder if it could
> be improved a little. The general principle is that one can always
> pass the size of an earlier, smaller structure to the kernel, right?

Yes.

> My point is that it need not be the size of the initial structure,
> right? So, I wonder whether a little rewording might be needed above.

Yes, the initial structure size is just an example because that will be
very common.

> What do you think?

Sure, I'm proposing something here but please, feel free to reformulate
or come up with something completely new:

	[...]
	However, if the caller is using a kernel that supports an
	extended struct mount_attr but the caller does not intend to
	make use of these features they can pass the size of an earlier
	version of the struct together with the extended structure.
	[...]
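
To make the size rule concrete, here is a rough user-space simulation
of how a kernel that knows ksize bytes of the structure might treat a
caller passing usize bytes. It is a sketch under assumptions, not the
kernel implementation: demo_check_size() and DEMO_SIZE_VER0 are
hypothetical names, and 32 is simply the size of four 64-bit fields.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical: size of the initial structure (four 64-bit fields). */
#define DEMO_SIZE_VER0 32

/*
 * uattr: the bytes the caller passed
 * usize: the size the caller claims
 * ksize: the size this (simulated) kernel understands
 *
 * Bytes beyond ksize are acceptable only if they are all zero, because
 * a zero value in an extension field means "feature not used".
 */
static int
demo_check_size(const unsigned char *uattr, size_t usize, size_t ksize)
{
    for (size_t i = ksize; i < usize; i++) {
        if (uattr[i] != 0)
            return -E2BIG;      /* Unknown extension field in use */
    }
    return 0;                   /* Equal or smaller sizes always fit */
}
```

A caller that only needs the original fields can thus pass the size
of any earlier version, even if its headers define a larger struct.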

> 
> >
> >       The flags argument can be used to alter the path resolution
> >       behavior.  The supported values are:
> >
> >       AT_EMPTY_PATH
> >              If path is the empty string, change the mount
> >              properties on dirfd itself.
> >
> >       AT_RECURSIVE
> >              Change the mount properties of the entire mount tree.
> >
> >       AT_SYMLINK_NOFOLLOW
> >              Don't follow trailing symbolic links.
> >
> >       AT_NO_AUTOMOUNT
> >              Don't trigger automounts.
> >
> >       The attr argument of mount_setattr() is a structure of the
> >       following form:
> >
> >           struct mount_attr {
> >               __u64 attr_set;     /* Mount properties to set */
> >               __u64 attr_clr;     /* Mount properties to clear */
> >               __u64 propagation;  /* Mount propagation type */
> >               __u64 userns_fd;    /* User namespace file descriptor */
> >           };
> >
> >       The attr_set and attr_clr members are used to specify the
> >       mount properties that are supposed to be set or cleared for a
> >       mount or mount tree.  Flags set in attr_set enable a property
> >       on a mount or mount tree, and flags set in attr_clr remove a
> >       property from a mount or mount tree.
> >
> >       When changing mount properties, the kernel will first clear
> >       the flags specified in the attr_clr field, and then set the
> >       flags specified in the attr_set field:
> 
> ???
> I find the following example a bit confusing. See below.
> 
> >
> >           struct mount_attr attr = {
> >               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
> >               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
> >           };
> 
> ???
> I *think* that what you are trying to show is that the above initializer
> results in the equivalent of the following code. Is that correct? If so, 
> I think the text needs some work to make this clearer. Let me know.

Yes, exactly. Feel free to remove that code and just explain it in text
if that's better.
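
If code is kept, a pure function may be easier to follow than the
kernel-style fragment. The sketch below is only illustrative: the
DEMO_ names and demo_apply() are hypothetical, and the bit values are
assumed to match the MOUNT_ATTR_* definitions in <linux/mount.h>.

```c
#include <stdint.h>

/* Hypothetical stand-ins for four MOUNT_ATTR_* flags. */
#define DEMO_ATTR_RDONLY 0x1ULL
#define DEMO_ATTR_NOSUID 0x2ULL
#define DEMO_ATTR_NODEV  0x4ULL
#define DEMO_ATTR_NOEXEC 0x8ULL

/*
 * Apply attr_clr first, then attr_set, and return the new flags.
 * A flag named in both masks therefore ends up set.
 */
static uint64_t
demo_apply(uint64_t flags, uint64_t attr_set, uint64_t attr_clr)
{
    flags &= ~attr_clr;     /* Step 1: clear */
    flags |= attr_set;      /* Step 2: set */
    return flags;
}
```

Applying the same attr_set/attr_clr pair a second time yields the
same flags, which is the idempotence the page promises.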

> 
> >           unsigned int current_mnt_flags = mnt->mnt_flags;
> >
> >           /*
> >            * Clear all flags set in .attr_clr,
> >            * clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV.
> >            */
> >           current_mnt_flags &= ~attr->attr_clr;
> >
> >           /*
> >            * Now set all flags set in .attr_set,
> >            * applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID.
> >            */
> >           current_mnt_flags |= attr->attr_set;
> >
> >           mnt->mnt_flags = current_mnt_flags;
> >
> >       As a rsult of this change, the mount or mount tree (a) is

Typo: s/rsult/result/g

> >       read-only; (b) blocks the execution of set-user-ID and set-
> >       group-ID programs; (c) allows execution of programs; and (d)
> >       allows access to devices.
> >
> >       Multiple changes with the same set of flags requested in
> >       attr_clr and attr_set are guaranteed to be idempotent after
> >       the changes have been applied.
> >
> >       The following mount attributes can be specified in the
> >       attr_set or attr_clr fields:
> >
> >       MOUNT_ATTR_RDONLY
> >              If set in attr_set, makes the mount read-only.  If set
> >              in attr_clr, removes the read-only setting if set on
> >              the mount.
> >
> >       MOUNT_ATTR_NOSUID
> >              If set in attr_set, causes the mount not to honor the
> >              set-user-ID and set-group-ID mode bits and file
> >              capabilities when executing programs.  If set in
> >              attr_clr, clears the set-user-ID, set-group-ID, and
> >              file capability restriction if set on this mount.
> >
> >       MOUNT_ATTR_NODEV
> >              If set in attr_set, prevents access to devices on this
> >              mount.  If set in attr_clr, removes the restriction
> >              that prevented accessing devices on this mount.
> >
> >       MOUNT_ATTR_NOEXEC
> >              If set in attr_set, prevents executing programs on
> >              this mount.  If set in attr_clr, removes the
> >              restriction that prevented executing programs on this
> >              mount.
> >
> >       MOUNT_ATTR_NOSYMFOLLOW
> >              If set in attr_set, prevents following symbolic links
> >              on this mount.  If set in attr_clr, removes the
> >              restriction that prevented following symbolic links on
> >              this mount.
> >
> >       MOUNT_ATTR_NODIRATIME
> >              If set in attr_set, prevents updating access time for
> >              directories on this mount.  If set in attr_clr,
> >              removes the restriction that prevented updating access
> >              time for directories.  Note that MOUNT_ATTR_NODIRATIME
> >              can be combined with other access-time settings and is
> >              implied by the noatime setting.  All other access-time
> >              settings are mutually exclusive.
> >
> >       MOUNT_ATTR__ATIME - changing access-time settings
> >              In the new mount API, the access-time values are an
> >              enum starting from 0.  Even though they are an enum
> >              (in contrast to the other mount flags such as
> >              MOUNT_ATTR_NOEXEC), they are nonetheless passed in
> >              attr_set and attr_clr for consistency with fsmount(2),
> >              which introduced this behavior.
> >
> >              Note that, since access times are an enum not a bit
> >              map, users wanting to transition to a different
> >              access-time setting cannot simply specify the access-
> >              time setting in attr_set but must also set
> >              MOUNT_ATTR__ATIME in the attr_clr field.  The kernel
> >              will verify that MOUNT_ATTR__ATIME isn't partially set
> >              in attr_clr, and that attr_set doesn't have any
> >              access-time bits set if MOUNT_ATTR__ATIME isn't set in
> >              attr_clr.
> >
> >              MOUNT_ATTR_RELATIME
> >                     When a file is accessed via this mount, update
> >                     the file's last access time (atime) only if the
> >                     current value of atime is less than or equal to
> >                     the file's last modification time (mtime) or
> >                     last status change time (ctime).
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_RELATIME must be set
> >                     in attr_set and MOUNT_ATTR__ATIME must be set
> >                     in the attr_clr field.
> >
> >              MOUNT_ATTR_NOATIME
> >                     Do not update access times for (all types of)
> >                     files on this mount.
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_NOATIME must be set
> >                     in attr_set and MOUNT_ATTR__ATIME must be set
> >                     in the attr_clr field.
> >
> >              MOUNT_ATTR_STRICTATIME
> >                     Always update the last access time (atime) when
> >                     files are accessed on this mount.
> >
> >                     To enable this access-time setting on a mount
> >                     or mount tree, MOUNT_ATTR_STRICTATIME must be
> >                     set in attr_set and MOUNT_ATTR__ATIME must be
> >                     set in the attr_clr field.
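
The access-time rules above can be condensed into a short check. This
is a sketch of the documented behavior only, not kernel code; the
DEMO_ names and demo_check_atime() are hypothetical, and the values
are assumed to mirror the MOUNT_ATTR_* access-time definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the access-time field and its values. */
#define DEMO_ATTR__ATIME      0x70ULL   /* Mask covering the whole field */
#define DEMO_ATTR_RELATIME    0x00ULL
#define DEMO_ATTR_NOATIME     0x10ULL
#define DEMO_ATTR_STRICTATIME 0x20ULL

static int
demo_check_atime(uint64_t attr_set, uint64_t attr_clr)
{
    uint64_t clr_atime = attr_clr & DEMO_ATTR__ATIME;

    /* attr_clr must name the whole access-time field or none of it. */
    if (clr_atime != 0 && clr_atime != DEMO_ATTR__ATIME)
        return -EINVAL;

    /* Access-time bits in attr_set require the field to be cleared. */
    if ((attr_set & DEMO_ATTR__ATIME) != 0 && clr_atime == 0)
        return -EINVAL;

    return 0;
}
```

Because the relatime value is zero in this sketch, requesting it
amounts to clearing the whole field and setting no access-time bits.
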
> >
> >       MOUNT_ATTR_IDMAP
> >              If set in attr_set, creates an ID-mapped mount.  The
> >              ID mapping is taken from the user namespace specified
> 
> In various places, you wrote "idmapping". "idmapped", etc. I've
> changed these to the more natural English "ID mapping" etc.

Sure.

> 
> >              in userns_fd and attached to the mount.
> >
> >              Since it is not supported to change the ID mapping of
> >              a mount after it has been ID mapped, it is invalid to
> >              specify MOUNT_ATTR_IDMAP in attr_clr.
> >
> >              For further details, see the subsection "ID-mapped
> >              mounts" under NOTES.
> >
> >       The propagation field is used to specify the propagation type
> >       of the mount or mount tree.  Mount propagation options are
> >       mutually exclusive; that is, the propagation values behave
> >       like an enum.  The supported mount propagation types are:
> >
> >       MS_PRIVATE
> >              Turn all mounts into private mounts.  Mount and
> >              unmount events do not propagate into or out of this
> >              mount point.
> >
> >       MS_SHARED
> >              Turn all mounts into shared mounts.  Mount points
> >              share events with members of a peer group.  Mount and
> >              unmount events immediately under this mount point will
> >              propagate to the other mount points that are members
> >              of the peer group.  Propagation here means that the
> >              same mount or unmount will automatically occur under
> >              all of the other mount points in the peer group.
> >              Conversely, mount and unmount events that take place
> >              under peer mount points will propagate to this mount
> >              point.
> >
> >       MS_SLAVE
> >              Turn all mounts into dependent mounts.  Mount and
> >              unmount events propagate into this mount point from a
> >              shared peer group.  Mount and unmount events under
> >              this mount point do not propagate to any peer.
> >
> >       MS_UNBINDABLE
> >              This is like a private mount, and in addition this
> >              mount can't be bind mounted.  Attempts to bind mount
> >              this mount will fail.  When a recursive bind mount is
> >              performed on a directory subtree, any bind mounts
> >              within the subtree are automatically pruned (i.e., not
> >              replicated) when replicating that subtree to produce
> >              the target subtree.
> >
> >       For further details on propagation types, see
> >       mount_namespaces(7).
> >
> >   RETURN VALUE
> >       On success, mount_setattr() returns zero.  On error, -1 is
> >       returned and errno is set to indicate the cause of the error.
> >
> >   ERRORS
> >       EBADF  dirfd is not a valid file descriptor.
> >
> >       EBADF  userns_fd is not a valid file descriptor.
> >
> >       EBUSY  The caller tried to change the mount to
> >              MOUNT_ATTR_RDONLY, but the mount still holds files
> >              open for writing.
> >
> >       EINVAL The path specified via the dirfd and path arguments to
> >              mount_setattr() isn't a mount point.
> >
> >       EINVAL An unsupported value was set in flags.
> >
> >       EINVAL An unsupported value was specified in the attr_set
> >              field of mount_attr.
> >
> >       EINVAL An unsupported value was specified in the attr_clr
> >              field of mount_attr.
> >
> >       EINVAL An unsupported value was specified in the propagation
> >              field of mount_attr.
> >
> >       EINVAL More than one of MS_SHARED, MS_SLAVE, MS_PRIVATE, or
> >              MS_UNBINDABLE was set in the propagation field of
> >              mount_attr.
> >
> >       EINVAL An access-time setting was specified in the attr_set
> >              field without MOUNT_ATTR__ATIME being set in the
> >              attr_clr field.
> >
> >       EINVAL MOUNT_ATTR_IDMAP was specified in attr_clr.
> >
> >       EINVAL A file descriptor value was specified in userns_fd
> >              which exceeds INT_MAX.
> >
> >       EINVAL A valid file descriptor value was specified in
> >              userns_fd, but the file descriptor wasn't a namespace
> >              file descriptor or did not refer to a user namespace.
> 
> ???
> Could the above not be simplified to
> 
>       EINVAL A valid file descriptor value was specified in
>              userns_fd, but the file descriptor did not refer
>              to a user namespace.

Sounds good.

> ?
> 
> >
> >       EINVAL The underlying filesystem does not support ID-mapped
> >              mounts.
> >
> >       EINVAL The mount that is to be ID mapped is not a
> >              detached/anonymous mount; that is, the mount is
> 
> ???
> What is the distinction between "detached" and "anonymous"?
> Or do you mean them to be synonymous? If so, then let's use
> just one term, and I think "detached" is preferable.

Yes, they are synonymous here. I list both because detached can
potentially be confusing. A detached mount is a mount that has not been
visible in the filesystem. But if you attach it and then unmount it
right after and keep the fd for the mountpoint open it's a detached
mount purely on a natural language level, I'd argue. But it's not a
detached mount from the kernel's view anymore because it has been
exposed in the filesystem and is thus not detached anymore.
But I do prefer "detached" to "anonymous" and that confusion is very
unlikely to occur.

> 
> >              already visible in the filesystem.
> >
> >       EINVAL A partial access-time setting was specified in
> >              attr_clr instead of MOUNT_ATTR__ATIME being set.
> >
> >       EINVAL The mount is located outside the caller's mount
> >              namespace.
> >
> >       EINVAL The underlying filesystem is mounted in a user
> >              namespace.
> >
> >       ENOENT A pathname was empty or had a nonexistent component.
> >
> >       ENOMEM When changing mount propagation to MS_SHARED, a new
> >              peer group ID needs to be allocated for all mounts
> >              without a peer group ID set.  Allocation of this peer
> >              group ID has failed.
> >
> >       ENOSPC When changing mount propagation to MS_SHARED, a new
> >              peer group ID needs to be allocated for all mounts
> >              without a peer group ID set.  Allocation of this peer
> >              group ID can fail.  Note that technically further
> >              error codes are possible that are specific to the ID
> >              allocation implementation used.
> >
> >       EPERM  One of the mounts had at least one of
> >              MOUNT_ATTR_NOATIME, MOUNT_ATTR_NODEV,
> >              MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
> >              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the
> >              flag is locked.  Mount attributes become locked on a
> >              mount if:
> >
> >              •  A new mount or mount tree is created causing mount
> >                 propagation across user namespaces.  The kernel
> >                 will lock the aforementioned flags to protect these
> >                 sensitive properties from being altered.
> >
> >              •  A new mount and user namespace pair is created.
> >                 This happens for example when specifying
> >                 CLONE_NEWUSER | CLONE_NEWNS in unshare(2),
> >                 clone(2), or clone3(2).  The aforementioned flags
> >                 become locked to protect user namespaces from
> >                 altering sensitive mount properties.
> >
> >       EPERM  A valid file descriptor value was specified in
> >              userns_fd, but the file descriptor refers to the
> >              initial user namespace.
> >
> >       EPERM  An already ID-mapped mount was supposed to be ID
> >              mapped.
> 
> ???
> Better:
>     An attempt was made to add an ID mapping to a mount that is already
>     ID mapped.

Sounds good.

> ?
> 
> >
> >       EPERM  The caller does not have CAP_SYS_ADMIN in the initial
> >              user namespace.
> >
> >   VERSIONS
> >       mount_setattr() first appeared in Linux 5.12.
> >
> >   CONFORMING TO
> >       mount_setattr() is Linux-specific.
> >
> >   NOTES
> >   ID-mapped mounts
> >       Creating an ID-mapped mount makes it possible to change the
> >       ownership of all files located under a mount.  Thus, ID-
> >       mapped mounts make it possible to change ownership in a
> >       temporary and localized way.  It is a localized change
> >       because ownership changes are restricted to a specific mount.
> 
> ???
> Would it be clearer to say something like:
> 
>     It is a localized change because ownership changes are
>     visible only via a specific mount.
> ?

Sounds good.

> 
> 
> >       All other users and locations where the filesystem is exposed
> >       are unaffected.  And it is a temporary change because
> >       ownership changes are tied to the lifetime of the mount.
> >
> >       Whenever callers interact with the filesystem through an ID-
> >       mapped mount, the ID mapping of the mount will be applied to
> >       user and group IDs associated with filesystem objects.  This
> >       encompasses the user and group IDs associated with inodes and
> >       also the following xattr(7) keys:
> >
> >       •  security.capability, whenever filesystem capabilities are
> >          stored or returned in the VFS_CAP_REVISION_3 format, which
> >          stores a root user ID alongside the capabilities (see
> >          capabilities(7)).
> >
> >       •  system.posix_acl_access and system.posix_acl_default,
> >          whenever user IDs or group IDs are stored in ACL_USER or
> >          ACL_GROUP entries.
> >
> >       The following conditions must be met in order to create an
> >       ID-mapped mount:
> >
> >       •  The caller must have the CAP_SYS_ADMIN capability in the
> >          initial user namespace.
> >
> >       •  The filesystem must be mounted in the initial user
> >          namespace.
> 
> ???
> Should this rather be written as:
>  
>      The filesystem must be mounted in a mount namespace 
>      that is owned by the initial user namespace.

Sounds good.

> 
> >       •  The underlying filesystem must support ID-mapped mounts.
> >          Currently, the xfs(5), ext4(5), and FAT filesystems
> >          support ID-mapped mounts with more filesystems being
> >          actively worked on.
> >
> >       •  The mount must not already be ID-mapped.  This also
> >          implies that the ID mapping of a mount cannot be altered.
> >
> >       •  The mount must be a detached/anonymous mount; that is, it
> 
> ???
> See the above question on "detached" vs "anonymous"

Yes, please use "detached" only.

> 
> >          must have been created by calling open_tree(2) with the
> >          OPEN_TREE_CLONE flag and it must not already have been
> >          visible in the filesystem.
> >
> >       ID mappings can be created for user IDs, group IDs, and
> >       project IDs.  An ID mapping is essentially a mapping of a
> >       range of user or group IDs into another or the same range of
> >       user or group IDs.  ID mappings are usually written as three
> >       numbers either separated by white space or a full stop.  The
> >       first two numbers specify the starting user or group ID in
> >       each of the two user namespaces.  The third number specifies
> >       the range of the ID mapping.  For example, a mapping for user
> >       IDs such as 1000:1001:1 would indicate that user ID 1000 in
> >       the caller's user namespace is mapped to user ID 1001 in its
> >       ancestor user namespace.  Since the map range is 1, only user
> >       ID 1000 is mapped.
> 
> ???
> The details above seem wrong. When writing to map files, the
> fields must be white-space separated, AFAIK. But above you mention
> "full stops" and also show an example using colons (:). Those
> both seem wrong and confusing. Am I missing something?

This is more about notational conventions that exist and not about how
they are actually written. That's something I'm not touching on here as
it doesn't belong on this manpage. But feel free to only mention spaces.
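
Whatever notation is used, the meaning of the three numbers can be
sketched as follows. This is an illustration only: struct demo_idmap
and demo_map_id() are hypothetical, and 65534 is assumed here as the
overflow ID (the usual default; it is configurable on a real system).

```c
#include <stdint.h>

/* Assumed overflow ID for anything that has no mapping. */
#define DEMO_OVERFLOW_ID 65534u

/* One ID mapping written as "first lower_first count". */
struct demo_idmap {
    uint32_t first;         /* Start of range in the caller's userns */
    uint32_t lower_first;   /* Start of range in the ancestor userns */
    uint32_t count;         /* Length of the range */
};

/* Translate id through the mapping, or return the overflow ID. */
static uint32_t
demo_map_id(const struct demo_idmap *m, uint32_t id)
{
    if (id >= m->first && id - m->first < m->count)
        return m->lower_first + (id - m->first);
    return DEMO_OVERFLOW_ID;
}
```

With the mapping "1000 1001 1" from the page, only ID 1000 is
translated (to 1001); every other ID falls through to the overflow ID.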

> 
> >       It is possible to specify up to 340 ID mappings for each ID
> >       mapping type.  If any user IDs or group IDs are not mapped,
> >       all files owned by that unmapped user or group ID will appear
> >       as being owned by the overflow user ID or overflow group ID
> >       respectively.
> >
> >       Further details and instructions for setting up ID mappings
> >       can be found in the user_namespaces(7) man page.
> >
> >       In the common case, the user namespace passed in userns_fd
> >       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
> >       mapped mount will be the user namespace of a container.  In
> >       other scenarios it will be a dedicated user namespace
> >       associated with a user's login session as is the case for
> >       portable home directories in systemd-homed.service(8).  It
> >       is also perfectly fine to create a dedicated user namespace
> >       for the sake of ID mapping a mount.
> >
> >       ID-mapped mounts can be useful in the following and a variety
> >       of other scenarios:
> >
> >       •  Sharing files between multiple users or multiple machines,
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

[1]: Both work. But feel free to use "sharing filesystems".

> 
> >          especially in complex scenarios.  For example, ID-mapped
> >          mounts are used to implement portable home directories in
> >          systemd-homed.service(8), where they allow users to move
> >          their home directory to an external storage device and use
> >          it on multiple computers where they are assigned different
> >          user IDs and group IDs.  This effectively makes it
> >          possible to assign random user IDs and group IDs at login
> >          time.
> >
> >       •  Sharing files from the host with unprivileged containers.
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

See [1].

> 
> >          This allows a user to avoid having to change ownership
> >          permanently through chown(2).
> >
> >       •  ID mapping a container's root filesystem.  Users don't
> >          need to change ownership permanently through chown(2).
> >          Especially for large root filesystems, using chown(2) can
> >          be prohibitively expensive.
> >
> >       •  Sharing files between containers with non-overlapping ID
> 
> ???
> s/Sharing files/Sharing filesystems/ ?

See [1].

> 
> >          mappings.
> >
> >       •  Implementing discretionary access control (DAC) permission
> >          checking for filesystems lacking a concept of ownership.
> >
> >       •  Efficiently changing ownership on a per-mount basis.  In
> >          contrast to chown(2), changing ownership of large sets of
> >          files is instantaneous with ID-mapped mounts.  This is
> >          especially useful when ownership of an entire root
> >          filesystem of a virtual machine or container is to be
> >          changed as mentioned above.  With ID-mapped mounts, a
> >          single mount_setattr() system call will be sufficient to
> >          change the ownership of all files.
> >
> >       •  Taking the current ownership into account.  ID mappings
> >          specify precisely what a user or group ID is supposed to
> >          be mapped to.  This contrasts with the chown(2) system
> >          call which cannot by itself take the current ownership of
> >          the files it changes into account.  It simply changes the
> >          ownership to the specified user ID and group ID.
> >
> >       •  Locally and temporarily restricted ownership changes.  ID-
> >          mapped mounts make it possible to change ownership
> >          locally, restricting it to specific mounts, and
> 
> ???
> The referent of "it" in the preceding line is not clear.
> Should it be "the ownership changes"? Or something else?

It should refer to ownership changes. I'd appreciate it if you could
reformulate.

> 
> >          temporarily as the ownership changes only apply as long as
> >          the mount exists.  By contrast, changing ownership via the
> >          chown(2) system call changes the ownership globally and
> >          permanently.
> >
> >   Extensibility
> >       In order to allow for future extensibility, mount_setattr()
> >       requires the user-space application to specify the size of
> >       the mount_attr structure that it is passing.  By providing
> >       this information, it is possible for mount_setattr() to
> >       provide both forwards- and backwards-compatibility, with size
> >       acting as an implicit version number.  (Because new extension
> >       fields will always be appended, the structure size will
> >       always increase.)  This extensibility design is very similar
> >       to other system calls such as sched_setattr(2),
> >       perf_event_open(2), clone3(2) and openat2(2).
> >
> >       Let usize be the size of the structure as specified by the
> >       user-space application, and let ksize be the size of the
> >       structure which the kernel supports, then there are three
> >       cases to consider:
> >
> >       •  If ksize equals usize, then there is no version mismatch
> >          and attr can be used verbatim.
> >
> >       •  If ksize is larger than usize, then there are some
> >          extension fields that the kernel supports which the user-
> >          space application is unaware of.  Because a zero value in
> >          any added extension field signifies a no-op, the kernel
> >          treats all of the extension fields not provided by the
> >          user-space application as having zero values.  This
> >          provides backwards-compatibility.
> >
> >       •  If ksize is smaller than usize, then there are some
> >          extension fields which the user-space application is aware
> >          of but which the kernel does not support.  Because any
> >          extension field must have its zero values signify a no-op,
> >          the kernel can safely ignore the unsupported extension
> >          fields if they are all zero.  If any unsupported extension
> >          fields are non-zero, then -1 is returned and errno is set
> >          to E2BIG.  This provides forwards-compatibility.
> >
> >       Because the definition of struct mount_attr may change in the
> >       future (with new fields being added when system headers are
> >       updated), user-space applications should zero-fill struct
> >       mount_attr to ensure that recompiling the program with new
> >       headers will not result in spurious errors at runtime.  The
> >       simplest way is to use a designated initializer:
> >
> >           struct mount_attr attr = {
> >               .attr_set = MOUNT_ATTR_RDONLY,
> >               .attr_clr = MOUNT_ATTR_NODEV
> >           };
> >
> >       Alternatively, the structure can be zero-filled using
> >       memset(3) or similar functions:
> >
> >           struct mount_attr attr;
> >           memset(&attr, 0, sizeof(attr));
> >           attr.attr_set = MOUNT_ATTR_RDONLY;
> >           attr.attr_clr = MOUNT_ATTR_NODEV;
> >
> >       A user-space application that wishes to determine which
> >       extensions the running kernel supports can do so by
> >       conducting a binary search on size with a structure which has
> >       every byte nonzero (to find the largest value which doesn't
> >       produce an error of E2BIG).
> >
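The binary search mentioned above might look roughly like this. It is a sketch under assumptions: the probe callback, largest_supported_size(), and fake_probe() are hypothetical names, and a real probe would have to issue the system call with an all-nonzero buffer and report whether it failed with E2BIG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical probe callback: returns true if a call made with a
 * "size"-byte structure whose bytes are all nonzero fails with E2BIG.
 */
typedef bool (*e2big_probe)(size_t size, void *ctx);

/*
 * Find the largest size in [lo, hi] that is not rejected with E2BIG.
 * Assumes that "lo" itself is accepted and that rejection is
 * monotonic: once a size is too big, every larger size is too.
 */
size_t
largest_supported_size(size_t lo, size_t hi, e2big_probe probe, void *ctx)
{
    assert(probe != NULL && lo <= hi);

    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;    /* round up */

        if (probe(mid, ctx))
            hi = mid - 1;       /* mid is too big */
        else
            lo = mid;           /* mid is accepted */
    }
    return lo;
}

/* Stand-in for a real kernel: pretends that ksize is *ctx. */
bool
fake_probe(size_t size, void *ctx)
{
    return size > *(const size_t *)ctx;
}
```

Rounding mid up guarantees that the interval shrinks on every iteration, so the loop terminates even when lo and hi are adjacent.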
> >   EXAMPLES
> 
> ???
> Do you have a (preferably simple) example piece of code
> somewhere for setting up an ID mapped mount?


> 
> >       /*
> >        * This program allows the caller to create a new detached mount
> >        * and set various properties on it.
> >        */
> >       #define _GNU_SOURCE
> >       #include <errno.h>
> >       #include <fcntl.h>
> >       #include <getopt.h>
> >       #include <linux/mount.h>
> >       #include <linux/types.h>
> >       #include <stdbool.h>
> >       #include <stdio.h>
> >       #include <stdlib.h>
> >       #include <string.h>
> >       #include <sys/syscall.h>
> >       #include <unistd.h>
> >
> >       static inline int
> >       mount_setattr(int dirfd, const char *path, unsigned int flags,
> >                     struct mount_attr *attr, size_t size)
> >       {
> >           return syscall(SYS_mount_setattr, dirfd, path, flags,
> >                          attr, size);
> >       }
> >
> >       static inline int
> >       open_tree(int dirfd, const char *filename, unsigned int flags)
> >       {
> >           return syscall(SYS_open_tree, dirfd, filename, flags);
> >       }
> >
> >       static inline int
> >       move_mount(int from_dirfd, const char *from_pathname,
> >                  int to_dirfd, const char *to_pathname,
> >                  unsigned int flags)
> >       {
> >           return syscall(SYS_move_mount, from_dirfd, from_pathname,
> >                          to_dirfd, to_pathname, flags);
> >       }
> >
> >       static const struct option longopts[] = {
> >           {"map-mount",       required_argument,  NULL,  'a'},
> >           {"recursive",       no_argument,        NULL,  'b'},
> >           {"read-only",       no_argument,        NULL,  'c'},
> >           {"block-setid",     no_argument,        NULL,  'd'},
> >           {"block-devices",   no_argument,        NULL,  'e'},
> >           {"block-exec",      no_argument,        NULL,  'f'},
> >           {"no-access-time",  no_argument,        NULL,  'g'},
> >           { NULL,             0,                  NULL,   0 },
> >       };
> >
> >       #define exit_log(format, ...)  do           \
> >       {                                           \
> >           fprintf(stderr, format, ##__VA_ARGS__); \
> >           exit(EXIT_FAILURE);                     \
> >       } while (0)
> >
> >       int
> >       main(int argc, char *argv[])
> >       {
> >           struct mount_attr *attr = &(struct mount_attr){};
> >           int fd_userns = -EBADF;
> 
> ???
> Why this magic initializer here? Why not just "-1"?
> Using -EBADF makes it look as though this value specifically is
> meaningful, although I don't think that's true.

[2]: I always use -EBADF to initialize fds in all my code. It makes it
pretty easy to grep for fd initialization etc. So it's pure visual
convenience. Feel free to just use -1.

> 
> >           bool recursive = false;
> >           int index = 0;
> >           int ret;
> >
> >           while ((ret = getopt_long_only(argc, argv, "",
> >                                          longopts, &index)) != -1) {
> >               switch (ret) {
> >               case 'a':
> >                   fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
> >                   if (fd_userns == -1)
> >                       exit_log("%m - Failed to open %s\n", optarg);
> >                   break;
> >               case 'b':
> >                   recursive = true;
> >                   break;
> >               case 'c':
> >                   attr->attr_set |= MOUNT_ATTR_RDONLY;
> >                   break;
> >               case 'd':
> >                   attr->attr_set |= MOUNT_ATTR_NOSUID;
> >                   break;
> >               case 'e':
> >                   attr->attr_set |= MOUNT_ATTR_NODEV;
> >                   break;
> >               case 'f':
> >                   attr->attr_set |= MOUNT_ATTR_NOEXEC;
> >                   break;
> >               case 'g':
> >                   attr->attr_set |= MOUNT_ATTR_NOATIME;
> >                   attr->attr_clr |= MOUNT_ATTR__ATIME;
> >                   break;
> >               default:
> >                   exit_log("Invalid argument specified\n");
> >               }
> >           }
> >
> >           if ((argc - optind) < 2)
> >               exit_log("Missing source or target mount point\n");
> >
> >           const char *source = argv[optind];
> >           const char *target = argv[optind + 1];
> >
> >           int fd_tree = open_tree(-EBADF, source,
> >                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
> >                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));
> 
> ???
> What is the significance of -EBADF here? As far as I can tell, it
> is not meaningful to open_tree()?

I always pass -EBADF for similar reasons to [2]. Feel free to just use -1.

> 
> 
> >           if (fd_tree == -1)
> >               exit_log("%m - Failed to open %s\n", source);
> >
> >           if (fd_userns >= 0) {
> >               attr->attr_set  |= MOUNT_ATTR_IDMAP;
> >               attr->userns_fd = fd_userns;
> >           }
> >
> >           ret = mount_setattr(fd_tree, "",
> >                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
> >                       attr, sizeof(struct mount_attr));
> >           if (ret == -1)
> >               exit_log("%m - Failed to change mount attributes\n");
> >
> >           if (fd_userns >= 0)
> >               close(fd_userns);
> >
> >           ret = move_mount(fd_tree, "", -EBADF, target,
> >                            MOVE_MOUNT_F_EMPTY_PATH);
> 
> ???
> What is the significance of -EBADF here? As far as I can tell, it
> is not meaningful to move_mount()?

See [2].

> 
> >           if (ret == -1)
> >               exit_log("%m - Failed to attach mount to %s\n", target);
> >
> >           close(fd_tree);
> >
> >           exit(EXIT_SUCCESS);
> >       }
> >
> >   SEE ALSO
> >       newuidmap(1), newgidmap(1), clone(2), mount(2), unshare(2),
> >       proc(5), mount_namespaces(7), capabilities(7),
> >       user_namespaces(7), xattr(7)
> 
> Thanks,
> 
> Michael

^ permalink raw reply	[relevance 4%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
@ 2021-08-10 14:11  5%   ` Christian Brauner
  2021-08-10 19:30 11%     ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Christian Brauner @ 2021-08-10 14:11 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-fsdevel, lkml, linux-man, Christoph Hellwig

On Tue, Aug 10, 2021 at 09:12:14AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Christian,
> 
> One more question...
> 
> >>       The propagation field is used to specify the propagation type
> >>       of the mount or mount tree.  Mount propagation options are
> >>       mutually exclusive; that is, the propagation values behave
> >>       like an enum.  The supported mount propagation types are:
> 
> The manual page text doesn't actually say it, but if the 'propagation'
> field is 0, then this means leave the propagation type unchanged, 
> right? This of course should be mentioned in the manual page.

Yes, if none of the documented values is set the propagation is unchanged.

Christian

^ permalink raw reply	[relevance 5%]

* Re: Questions re the new mount_setattr(2) manual page
  2021-08-10  1:38  4% Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
@ 2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
  2021-08-10 14:11  5%   ` Christian Brauner
  2021-08-10 14:32  4% ` Christian Brauner
  2021-08-10 22:47  5% ` Michael Kerrisk (man-pages)
  2 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10  7:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

One more question...

>>       The propagation field is used to specify the propagation type
>>       of the mount or mount tree.  Mount propagation options are
>>       mutually exclusive; that is, the propagation values behave
>>       like an enum.  The supported mount propagation types are:

The manual page text doesn't actually say it, but if the 'propagation'
field is 0, then this means leave the propagation type unchanged, 
right? This of course should be mentioned in the manual page.

Cheers,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: [PATCH] seccomp.2: Clarify that bad system calls kill the thread
  @ 2021-08-10  2:07 11%     ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10  2:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: mtk.manpages, linux-api, Andy Lutomirski, Will Drewry,
	Linus Torvalds, Al Viro, Kees Cook, linux-man, linux-kernel

Hi Eric,

On 6/30/21 10:11 PM, Eric W. Biederman wrote:
> 
> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Thanks. Patch applied, with Kees' Ack.

Cheers,

Michael


> ---
>  man2/seccomp.2 | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/man2/seccomp.2 b/man2/seccomp.2
> index a3421871f0f4..bde54c3e3e99 100644
> --- a/man2/seccomp.2
> +++ b/man2/seccomp.2
> @@ -69,9 +69,10 @@ The only system calls that the calling thread is permitted to make are
>  .BR exit_group (2)),
>  and
>  .BR sigreturn (2).
> -Other system calls result in the delivery of a
> +Other system calls result in the termination of the calling thread,
> +or termination of the entire process with the
>  .BR SIGKILL
> -signal.
> +signal when there is only one thread.
>  Strict secure computing mode is useful for number-crunching
>  applications that may need to execute untrusted byte code, perhaps
>  obtained by reading from a pipe or socket.
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Questions re the new mount_setattr(2) manual page
@ 2021-08-10  1:38  4% Michael Kerrisk (man-pages)
  2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
                   ` (2 more replies)
  0 siblings, 3 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-10  1:38 UTC (permalink / raw)
  To: Christian Brauner
  Cc: mtk.manpages, Alejandro Colomar, linux-fsdevel, lkml, linux-man,
	Christoph Hellwig

Hi Christian,

Thanks for the very nice manual page that you wrote. I have
made a large number of (mostly trivial) edits. If you could
read the page closely, to check that I introduced no errors,
I would appreciate it.

I have various questions below, marked ???. Could you please take
a look at these, and I will then make further edits based on your
answers.

The current version of the page is already pushed to the man-pages
Git repo.

>   MOUNT_SETATTR(2)      Linux Programmer's Manual     MOUNT_SETATTR(2)
>
>   NAME
>       mount_setattr - change mount properties of a mount or mount

???
s/mount properties/properties ?

(Just because it is more concise.)

>       tree
>
>   SYNOPSIS
>       #include <linux/fcntl.h> /* Definition of AT_* constants */
>       #include <linux/mount.h> /* Definition of MOUNT_ATTR_* constants */
>       #include <sys/syscall.h> /* Definition of SYS_* constants */
>       #include <unistd.h>
>
>       int syscall(SYS_mount_setattr, int dirfd, const char *path,
>               unsigned int flags, struct mount_attr *attr, size_t size);
>
>       Note: glibc provides no wrapper for mount_setattr(),
>       necessitating the use of syscall(2).
>
>   DESCRIPTION
>       The mount_setattr() system call changes the mount properties
>       of a mount or an entire mount tree.  If path is a relative
>       pathname, then it is interpreted relative to the directory
>       referred to by the file descriptor dirfd.  If dirfd is the
>       special value AT_FDCWD, then path is interpreted relative to
>       the current working directory of the calling process.  If
>       path is the empty string and AT_EMPTY_PATH is specified in
>       flags, then the mount properties of the mount identified by
>       dirfd are changed.
>
>       The mount_setattr() system call uses an extensible structure
>       (struct mount_attr) to allow for future extensions.  Any non-
>       flag extensions to mount_setattr() will be implemented as new
>       fields appended to this structure, with a zero value in a
>       new field resulting in the kernel behaving as though that
>       extension field was not present.  Therefore, the caller must
>       zero-fill this structure on initialization.  See the
>       "Extensibility" subsection under NOTES for more details.
>
>       The size argument should usually be specified as
>       sizeof(struct mount_attr).  However, if the caller does not
>       intend to make use of features that got introduced after the
>       initial version of struct mount_attr, it is possible to pass
>       the size of the initial struct together with the larger
>       struct.  This allows the kernel to not copy later parts of
>       the struct that aren't used anyway.  With each extension that
>       changes the size of struct mount_attr, the kernel will expose
>       a definition of the form MOUNT_ATTR_SIZE_VERnumber.  For
>       example, the macro for the size of the initial version of
>       struct mount_attr is MOUNT_ATTR_SIZE_VER0.

???
I think I understand the above paragraph, but I wonder if it could
be improved a little. The general principle is that one can always
pass the size of an earlier, smaller structure to the kernel, right?
My point is that it need not be the size of the initial structure,
right? So, I wonder whether a little rewording might be needed above.
What do you think?

>
>       The flags argument can be used to alter the path resolution
>       behavior.  The supported values are:
>
>       AT_EMPTY_PATH
>              If path is the empty string, change the mount
>              properties on dirfd itself.
>
>       AT_RECURSIVE
>              Change the mount properties of the entire mount tree.
>
>       AT_SYMLINK_NOFOLLOW
>              Don't follow trailing symbolic links.
>
>       AT_NO_AUTOMOUNT
>              Don't trigger automounts.
>
>       The attr argument of mount_setattr() is a structure of the
>       following form:
>
>           struct mount_attr {
>               __u64 attr_set;     /* Mount properties to set */
>               __u64 attr_clr;     /* Mount properties to clear */
>               __u64 propagation;  /* Mount propagation type */
>               __u64 userns_fd;    /* User namespace file descriptor */
>           };
>
>       The attr_set and attr_clr members are used to specify the
>       mount properties that are supposed to be set or cleared for a
>       mount or mount tree.  Flags set in attr_set enable a property
>       on a mount or mount tree, and flags set in attr_clr remove a
>       property from a mount or mount tree.
>
>       When changing mount properties, the kernel will first clear
>       the flags specified in the attr_clr field, and then set the
>       flags specified in the attr_set field:

???
I find the following example a bit confusing. See below.

>
>           struct mount_attr attr = {
>               .attr_clr = MOUNT_ATTR_NOEXEC | MOUNT_ATTR_NODEV,
>               .attr_set = MOUNT_ATTR_RDONLY | MOUNT_ATTR_NOSUID,
>           };

???
I *think* that what you are trying to show is that the above initializer
results in the equivalent of the following code. Is that correct? If so,
I think the text needs some work to make this clearer. Let me know.

>           unsigned int current_mnt_flags = mnt->mnt_flags;
>
>           /*
>            * Clear all flags set in .attr_clr,
>            * clearing MOUNT_ATTR_NOEXEC and MOUNT_ATTR_NODEV.
>            */
>           current_mnt_flags &= ~attr->attr_clr;
>
>           /*
>            * Now set all flags set in .attr_set,
>            * applying MOUNT_ATTR_RDONLY and MOUNT_ATTR_NOSUID.
>            */
>           current_mnt_flags |= attr->attr_set;
>
>           mnt->mnt_flags = current_mnt_flags;
>
>       As a result of this change, the mount or mount tree (a) is
>       read-only; (b) blocks the execution of set-user-ID and set-
>       group-ID programs; (c) allows execution of programs; and (d)
>       allows access to devices.
>
>       Multiple changes with the same set of flags requested in
>       attr_clr and attr_set are guaranteed to be idempotent after
>       the changes have been applied.
>
>       The following mount attributes can be specified in the
>       attr_set or attr_clr fields:
>
>       MOUNT_ATTR_RDONLY
>              If set in attr_set, makes the mount read-only.  If set
>              in attr_clr, removes the read-only setting if set on
>              the mount.
>
>       MOUNT_ATTR_NOSUID
>              If set in attr_set, causes the mount not to honor the
>              set-user-ID and set-group-ID mode bits and file
>              capabilities when executing programs.  If set in
>              attr_clr, clears the set-user-ID, set-group-ID, and
>              file capability restriction if set on this mount.
>
>       MOUNT_ATTR_NODEV
>              If set in attr_set, prevents access to devices on this
>              mount.  If set in attr_clr, removes the restriction
>              that prevented accessing devices on this mount.
>
>       MOUNT_ATTR_NOEXEC
>              If set in attr_set, prevents executing programs on
>              this mount.  If set in attr_clr, removes the
>              restriction that prevented executing programs on this
>              mount.
>
>       MOUNT_ATTR_NOSYMFOLLOW
>              If set in attr_set, prevents following symbolic links
>              on this mount.  If set in attr_clr, removes the
>              restriction that prevented following symbolic links on
>              this mount.
>
>       MOUNT_ATTR_NODIRATIME
>              If set in attr_set, prevents updating access time for
>              directories on this mount.  If set in attr_clr,
>              removes the restriction that prevented updating access
>              time for directories.  Note that MOUNT_ATTR_NODIRATIME
>              can be combined with other access-time settings and is
>              implied by the noatime setting.  All other access-time
>              settings are mutually exclusive.
>
>       MOUNT_ATTR__ATIME - changing access-time settings
>              In the new mount API, the access-time values are an
>              enum starting from 0.  Even though they are an enum
>              (in contrast to the other mount flags such as
>              MOUNT_ATTR_NOEXEC), they are nonetheless passed in
>              attr_set and attr_clr for consistency with fsmount(2),
>              which introduced this behavior.
>
>              Note that, since access times are an enum not a bit
>              map, users wanting to transition to a different
>              access-time setting cannot simply specify the access-
>              time setting in attr_set but must also set
>              MOUNT_ATTR__ATIME in the attr_clr field.  The kernel
>              will verify that MOUNT_ATTR__ATIME isn't partially set
>              in attr_clr, and that attr_set doesn't have any
>              access-time bits set if MOUNT_ATTR__ATIME isn't set in
>              attr_clr.
>
>              MOUNT_ATTR_RELATIME
>                     When a file is accessed via this mount, update
>                     the file's last access time (atime) only if the
>                     current value of atime is less than or equal to
>                     the file's last modification time (mtime) or
>                     last status change time (ctime).
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_RELATIME must be set
>                     in attr_set and MOUNT_ATTR__ATIME must be set
>                     in the attr_clr field.
>
>              MOUNT_ATTR_NOATIME
>                     Do not update access times for (all types of)
>                     files on this mount.
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_NOATIME must be set
>                     in attr_set and MOUNT_ATTR__ATIME must be set
>                     in the attr_clr field.
>
>              MOUNT_ATTR_STRICTATIME
>                     Always update the last access time (atime) when
>                     files are accessed on this mount.
>
>                     To enable this access-time setting on a mount
>                     or mount tree, MOUNT_ATTR_STRICTATIME must be
>                     set in attr_set and MOUNT_ATTR__ATIME must be
>                     set in the attr_clr field.
>
>       MOUNT_ATTR_IDMAP
>              If set in attr_set, creates an ID-mapped mount.  The
>              ID mapping is taken from the user namespace specified

In various places, you wrote "idmapping", "idmapped", etc. I've
changed these to the more natural English "ID mapping" etc.

>              in userns_fd and attached to the mount.
>
>              Since it is not supported to change the ID mapping of
>              a mount after it has been ID mapped, it is invalid to
>              specify MOUNT_ATTR_IDMAP in attr_clr.
>
>              For further details, see the subsection "ID-mapped
>              mounts" under NOTES.
>
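The access-time rule described under MOUNT_ATTR__ATIME above can be expressed as a small check. This is a sketch of the rule as the page words it, not the kernel's implementation; check_atime_request() is a hypothetical name, and the ATTR_* constants are local copies of the MOUNT_ATTR_* values from <linux/mount.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copies of the MOUNT_ATTR_* values from <linux/mount.h>. */
#define ATTR__ATIME      UINT64_C(0x70)  /* mask covering the enum */
#define ATTR_RELATIME    UINT64_C(0x00)
#define ATTR_NOATIME     UINT64_C(0x10)
#define ATTR_STRICTATIME UINT64_C(0x20)

static_assert((ATTR_NOATIME & ATTR__ATIME) == ATTR_NOATIME,
              "access-time values must lie inside the mask");

/*
 * Hypothetical validity check for the access-time part of a request.
 * Returns 0 if the combination is acceptable, -EINVAL otherwise.
 */
int
check_atime_request(uint64_t attr_set, uint64_t attr_clr)
{
    uint64_t clr = attr_clr & ATTR__ATIME;
    uint64_t set = attr_set & ATTR__ATIME;

    /* No access-time change requested: attr_set must not carry one. */
    if (clr == 0)
        return set ? -EINVAL : 0;

    /* The mask must be cleared as a whole, never partially. */
    if (clr != ATTR__ATIME)
        return -EINVAL;

    /* The new setting must be one of the known enum values. */
    if (set == ATTR_RELATIME || set == ATTR_NOATIME ||
        set == ATTR_STRICTATIME)
        return 0;
    return -EINVAL;
}
```

Note that MOUNT_ATTR_RELATIME is zero, so a request that only sets MOUNT_ATTR__ATIME in attr_clr selects relatime.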
>       The propagation field is used to specify the propagation type
>       of the mount or mount tree.  Mount propagation options are
>       mutually exclusive; that is, the propagation values behave
>       like an enum.  The supported mount propagation types are:
>
>       MS_PRIVATE
>              Turn all mounts into private mounts.  Mount and
>              unmount events do not propagate into or out of this
>              mount point.
>
>       MS_SHARED
>              Turn all mounts into shared mounts.  Mount points
>              share events with members of a peer group.  Mount and
>              unmount events immediately under this mount point will
>              propagate to the other mount points that are members
>              of the peer group.  Propagation here means that the
>              same mount or unmount will automatically occur under
>              all of the other mount points in the peer group.
>              Conversely, mount and unmount events that take place
>              under peer mount points will propagate to this mount
>              point.
>
>       MS_SLAVE
>              Turn all mounts into dependent mounts.  Mount and
>              unmount events propagate into this mount point from a
>              shared peer group.  Mount and unmount events under
>              this mount point do not propagate to any peer.
>
>       MS_UNBINDABLE
>              This is like a private mount, and in addition this
>              mount can't be bind mounted.  Attempts to bind mount
>              this mount will fail.  When a recursive bind mount is
>              performed on a directory subtree, any bind mounts
>              within the subtree are automatically pruned (i.e., not
>              replicated) when replicating that subtree to produce
>              the target subtree.
>
>       For further details on propagation types, see
>       mount_namespaces(7).
>
>   RETURN VALUE
>       On success, mount_setattr() returns zero.  On error, -1 is
>       returned and errno is set to indicate the cause of the error.
>
>   ERRORS
>       EBADF  dirfd is not a valid file descriptor.
>
>       EBADF  userns_fd is not a valid file descriptor.
>
>       EBUSY  The caller tried to change the mount to
>              MOUNT_ATTR_RDONLY, but the mount still holds files
>              open for writing.
>
>       EINVAL The path specified via the dirfd and path arguments to
>              mount_setattr() isn't a mount point.
>
>       EINVAL An unsupported value was set in flags.
>
>       EINVAL An unsupported value was specified in the attr_set
>              field of mount_attr.
>
>       EINVAL An unsupported value was specified in the attr_clr
>              field of mount_attr.
>
>       EINVAL An unsupported value was specified in the propagation
>              field of mount_attr.
>
>       EINVAL More than one of MS_SHARED, MS_SLAVE, MS_PRIVATE, or
>              MS_UNBINDABLE was set in the propagation field of
>              mount_attr.
>
>       EINVAL An access-time setting was specified in the attr_set
>              field without MOUNT_ATTR__ATIME being set in the
>              attr_clr field.
>
>       EINVAL MOUNT_ATTR_IDMAP was specified in attr_clr.
>
>       EINVAL A file descriptor value was specified in userns_fd
>              which exceeds INT_MAX.
>
>       EINVAL A valid file descriptor value was specified in
>              userns_fd, but the file descriptor wasn't a namespace
>              file descriptor or did not refer to a user namespace.

???
Could the above not be simplified to

      EINVAL A valid file descriptor value was specified in
             userns_fd, but the file descriptor did not refer
             to a user namespace.
?

>
>       EINVAL The underlying filesystem does not support ID-mapped
>              mounts.
>
>       EINVAL The mount that is to be ID mapped is not a
>              detached/anonymous mount; that is, the mount is

???
What is the distinction between "detached" and "anonymous"?
Or do you mean them to be synonymous? If so, then let's use
just one term, and I think "detached" is preferable.

>              already visible in the filesystem.
>
>       EINVAL A partial access-time setting was specified in
>              attr_clr instead of MOUNT_ATTR__ATIME being set.
>
>       EINVAL The mount is located outside the caller's mount
>              namespace.
>
>       EINVAL The underlying filesystem is mounted in a user
>              namespace.
>
>       ENOENT A pathname was empty or had a nonexistent component.
>
>       ENOMEM When changing mount propagation to MS_SHARED, a new
>              peer group ID needs to be allocated for all mounts
>              without a peer group ID set.  Allocation of this peer
>              group ID has failed.
>
>       ENOSPC When changing mount propagation to MS_SHARED, a new
>              peer group ID needs to be allocated for all mounts
>              without a peer group ID set.  Allocation of this peer
>              group ID can fail.  Note that technically further
>              error codes are possible that are specific to the ID
>              allocation implementation used.
>
>       EPERM  One of the mounts had at least one of
>              MOUNT_ATTR_NOATIME, MOUNT_ATTR_NODEV,
>              MOUNT_ATTR_NODIRATIME, MOUNT_ATTR_NOEXEC,
>              MOUNT_ATTR_NOSUID, or MOUNT_ATTR_RDONLY set and the
>              flag is locked.  Mount attributes become locked on a
>              mount if:
>
>              •  A new mount or mount tree is created causing mount
>                 propagation across user namespaces.  The kernel
>                 will lock the aforementioned flags to protect these
>                 sensitive properties from being altered.
>
>              •  A new mount and user namespace pair is created.
>                 This happens for example when specifying
>                 CLONE_NEWUSER | CLONE_NEWNS in unshare(2),
>                 clone(2), or clone3(2).  The aforementioned flags
>                 become locked to protect user namespaces from
>                 altering sensitive mount properties.
>
>       EPERM  A valid file descriptor value was specified in
>              userns_fd, but the file descriptor refers to the
>              initial user namespace.
>
>       EPERM  An already ID-mapped mount was supposed to be ID
>              mapped.

???
Better:
    An attempt was made to add an ID mapping to a mount that is already
    ID mapped.
?

>
>       EPERM  The caller does not have CAP_SYS_ADMIN in the initial
>              user namespace.
>
>   VERSIONS
>       mount_setattr() first appeared in Linux 5.12.
>
>   CONFORMING TO
>       mount_setattr() is Linux-specific.
>
>   NOTES
>   ID-mapped mounts
>       Creating an ID-mapped mount makes it possible to change the
>       ownership of all files located under a mount.  Thus, ID-
>       mapped mounts make it possible to change ownership in a
>       temporary and localized way.  It is a localized change
>       because ownership changes are restricted to a specific mount.

???
Would it be clearer to say something like:

    It is a localized change because ownership changes are
    visible only via a specific mount.
?


>       All other users and locations where the filesystem is exposed
>       are unaffected.  And it is a temporary change because
>       ownership changes are tied to the lifetime of the mount.
>
>       Whenever callers interact with the filesystem through an ID-
>       mapped mount, the ID mapping of the mount will be applied to
>       user and group IDs associated with filesystem objects.  This
>       encompasses the user and group IDs associated with inodes and
>       also the following xattr(7) keys:
>
>       •  security.capability, whenever filesystem capabilities are
>          stored or returned in the VFS_CAP_REVISION_3 format, which
>          stores a root user ID alongside the capabilities (see
>          capabilities(7)).
>
>       •  system.posix_acl_access and system.posix_acl_default,
>          whenever user IDs or group IDs are stored in ACL_USER or
>          ACL_GROUP entries.
>
>       The following conditions must be met in order to create an
>       ID-mapped mount:
>
>       •  The caller must have the CAP_SYS_ADMIN capability in the
>          initial user namespace.
>
>       •  The filesystem must be mounted in the initial user
>          namespace.

???
Should this rather be written as:
 
     The filesystem must be mounted in a mount namespace 
     that is owned by the initial user namespace.
?

>       •  The underlying filesystem must support ID-mapped mounts.
>          Currently, the xfs(5), ext4(5), and FAT filesystems
>          support ID-mapped mounts with more filesystems being
>          actively worked on.
>
>       •  The mount must not already be ID-mapped.  This also
>          implies that the ID mapping of a mount cannot be altered.
>
>       •  The mount must be a detached/anonymous mount; that is, it

???
See the above question on "detached" vs "anonymous".

>          must have been created by calling open_tree(2) with the
>          OPEN_TREE_CLONE flag and it must not already have been
>          visible in the filesystem.
>
>       ID mappings can be created for user IDs, group IDs, and
>       project IDs.  An ID mapping is essentially a mapping of a
>       range of user or group IDs into another or the same range of
>       user or group IDs.  ID mappings are usually written as three
>       numbers either separated by white space or a full stop.  The
>       first two numbers specify the starting user or group ID in
>       each of the two user namespaces.  The third number specifies
>       the range of the ID mapping.  For example, a mapping for user
>       IDs such as 1000:1001:1 would indicate that user ID 1000 in
>       the caller's user namespace is mapped to user ID 1001 in its
>       ancestor user namespace.  Since the map range is 1, only user
>       ID 1000 is mapped.

???
The details above seem wrong. When writing to map files, the
fields must be white-space separated, AFAIK. But above you mention
"full stops" and also show an example using colons (:). Those
both seem wrong and confusing. Am I missing something?
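
For comparison, user_namespaces(7) describes each line of a uid_map
or gid_map file as three white-space-separated numbers
(ID-inside-ns, ID-outside-ns, length). A minimal sketch of producing
such a line follows; format_map_entry() is a hypothetical helper made
up for illustration, not part of any existing API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format one ID-map entry as three white-space-separated numbers,
 * terminated by a newline.
 * Returns the number of characters written, or -1 if buf is too small.
 */
static int
format_map_entry(char *buf, size_t size, unsigned int inside,
                 unsigned int outside, unsigned int count)
{
    assert(buf != NULL && size > 0);

    int n = snprintf(buf, size, "%u %u %u\n", inside, outside, count);

    /* snprintf() reports the length it wanted; detect truncation. */
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

With the values from the example in the page text,
format_map_entry(buf, sizeof(buf), 1000, 1001, 1) would yield the
line "1000 1001 1".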

>       It is possible to specify up to 340 ID mappings for each ID
>       mapping type.  If any user IDs or group IDs are not mapped,
>       all files owned by that unmapped user or group ID will appear
>       as being owned by the overflow user ID or overflow group ID
>       respectively.
>
>       Further details and instructions for setting up ID mappings
>       can be found in the user_namespaces(7) man page.
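
To make the arithmetic in the paragraphs above concrete, here is a
rough sketch of how an ID is looked up in a set of mapping entries,
falling back to the overflow ID when nothing matches. The struct, the
function name, and the hard-coded overflow value (65534, the usual
default of /proc/sys/kernel/overflowuid) are assumptions made for
illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory form of one mapping entry. */
struct id_map_entry {
    unsigned int inside;   /* first ID in the caller's user namespace */
    unsigned int outside;  /* first ID in the ancestor user namespace */
    unsigned int count;    /* number of consecutive IDs mapped */
};

/* Assumed default overflow ID; the real value is tunable. */
#define OVERFLOW_ID 65534u

/* Map id through the first entry whose range contains it. */
static unsigned int
map_id(const struct id_map_entry *map, size_t n, unsigned int id)
{
    assert(map != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].inside && id - map[i].inside < map[i].count)
            return map[i].outside + (id - map[i].inside);
    }

    /* Unmapped IDs show up as the overflow ID. */
    return OVERFLOW_ID;
}
```

With a single entry {1000, 1001, 1}, map_id() returns 1001 for ID
1000 and the overflow ID for every other input, which matches the
"only user ID 1000 is mapped" statement above.
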
>
>       In the common case, the user namespace passed in userns_fd
>       together with MOUNT_ATTR_IDMAP in attr_set to create an ID-
>       mapped mount will be the user namespace of a container.  In
>       other scenarios it will be a dedicated user namespace
>       associated with a user's login session (as is the case for
>       portable home directories in systemd-homed.service(8)).  It
>       is also perfectly fine to create a dedicated user namespace
>       for the sake of ID mapping a mount.
>
>       ID-mapped mounts can be useful in the following and a variety
>       of other scenarios:
>
>       •  Sharing files between multiple users or multiple machines,

???
s/Sharing files/Sharing filesystems/ ?

>          especially in complex scenarios.  For example, ID-mapped
>          mounts are used to implement portable home directories in
>          systemd-homed.service(8), where they allow users to move
>          their home directory to an external storage device and use
>          it on multiple computers where they are assigned different
>          user IDs and group IDs.  This effectively makes it
>          possible to assign random user IDs and group IDs at login
>          time.
>
>       •  Sharing files from the host with unprivileged containers.

???
s/Sharing files/Sharing filesystems/ ?

>          This allows a user to avoid having to change ownership
>          permanently through chown(2).
>
>       •  ID mapping a container's root filesystem.  Users don't
>          need to change ownership permanently through chown(2).
>          Especially for large root filesystems, using chown(2) can
>          be prohibitively expensive.
>
>       •  Sharing files between containers with non-overlapping ID

???
s/Sharing files/Sharing filesystems/ ?

>          mappings.
>
>       •  Implementing discretionary access control (DAC) permission
>          checking for filesystems lacking a concept of ownership.
>
>       •  Efficiently changing ownership on a per-mount basis.  In
>          contrast to chown(2), changing ownership of large sets of
>          files is instantaneous with ID-mapped mounts.  This is
>          especially useful when ownership of an entire root
>          filesystem of a virtual machine or container is to be
>          changed as mentioned above.  With ID-mapped mounts, a
>          single mount_setattr() system call will be sufficient to
>          change the ownership of all files.
>
>       •  Taking the current ownership into account.  ID mappings
>          specify precisely what a user or group ID is supposed to
>          be mapped to.  This contrasts with the chown(2) system
>          call which cannot by itself take the current ownership of
>          the files it changes into account.  It simply changes the
>          ownership to the specified user ID and group ID.
>
>       •  Locally and temporarily restricted ownership changes.  ID-
>          mapped mounts make it possible to change ownership
>          locally, restricting it to specific mounts, and

???
The referent of "it" in the preceding line is not clear.
Should it be "the ownership changes"? Or something else?

>          temporarily as the ownership changes only apply as long as
>          the mount exists.  By contrast, changing ownership via the
>          chown(2) system call changes the ownership globally and
>          permanently.
>
>   Extensibility
>       In order to allow for future extensibility, mount_setattr()
>       requires the user-space application to specify the size of
>       the mount_attr structure that it is passing.  By providing
>       this information, it is possible for mount_setattr() to
>       provide both forwards- and backwards-compatibility, with size
>       acting as an implicit version number.  (Because new extension
>       fields will always be appended, the structure size will
>       always increase.)  This extensibility design is very similar
>       to other system calls such as sched_setattr(2),
>       perf_event_open(2), clone3(2) and openat2(2).
>
>       Let usize be the size of the structure as specified by the
>       user-space application, and let ksize be the size of the
>       structure which the kernel supports, then there are three
>       cases to consider:
>
>       •  If ksize equals usize, then there is no version mismatch
>          and attr can be used verbatim.
>
>       •  If ksize is larger than usize, then there are some
>          extension fields that the kernel supports which the user-
>          space application is unaware of.  Because a zero value in
>          any added extension field signifies a no-op, the kernel
>          treats all of the extension fields not provided by the
>          user-space application as having zero values.  This
>          provides backwards-compatibility.
>
>       •  If ksize is smaller than usize, then there are some
>          extension fields which the user-space application is aware
>          of but which the kernel does not support.  Because any
>          extension field must have its zero values signify a no-op,
>          the kernel can safely ignore the unsupported extension
>          fields if they are all zero.  If any unsupported extension
>          fields are non-zero, then -1 is returned and errno is set
>          to E2BIG.  This provides forwards-compatibility.
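
The three cases above can be sketched in plain C as follows. This is
only an illustration of the rule as the text states it, operating on
ordinary memory; copy_extensible_struct() is a hypothetical name, and
the sketch returns a negative errno value rather than setting errno
the way the system call does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a caller-supplied structure of usize bytes into a structure of
 * ksize bytes, following the size-as-version rule.
 * Returns 0 on success, or -E2BIG if the caller set a field that the
 * receiving side does not know about.
 */
static int
copy_extensible_struct(void *dst, size_t ksize,
                       const void *src, size_t usize)
{
    const unsigned char *s = src;
    size_t ncopy = usize < ksize ? usize : ksize;

    assert(dst != NULL && src != NULL);

    /* usize > ksize: unknown trailing fields must all be zero. */
    for (size_t i = ksize; i < usize; i++) {
        if (s[i] != 0)
            return -E2BIG;
    }

    /* ksize > usize: fields the caller did not supply read as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, ncopy);
    return 0;
}
```

When ksize equals usize, the loop body never runs and the structure
is copied verbatim, which is the first case in the list.
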
>
>       Because the definition of struct mount_attr may change in the
>       future (with new fields being added when system headers are
>       updated), user-space applications should zero-fill struct
>       mount_attr to ensure that recompiling the program with new
>       headers will not result in spurious errors at runtime.  The
>       simplest way is to use a designated initializer:
>
>           struct mount_attr attr = {
>               .attr_set = MOUNT_ATTR_RDONLY,
>               .attr_clr = MOUNT_ATTR_NODEV
>           };
>
>       Alternatively, the structure can be zero-filled using
>       memset(3) or similar functions:
>
>           struct mount_attr attr;
>           memset(&attr, 0, sizeof(attr));
>           attr.attr_set = MOUNT_ATTR_RDONLY;
>           attr.attr_clr = MOUNT_ATTR_NODEV;
>
>       A user-space application that wishes to determine which
>       extensions the running kernel supports can do so by
>       conducting a binary search on size with a structure which has
>       every byte nonzero (to find the largest value which doesn't
>       produce an error of E2BIG).
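
The binary search mentioned above might look roughly like this. The
probe callback is an assumption: a real one would call
mount_setattr() with a buffer of the given size in which every byte
is nonzero, and return true when the error is anything other than
E2BIG. The fake_probe() stand-in only exists so that the sketch can
be exercised without privileges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: true if this size did NOT fail with E2BIG. */
typedef bool (*size_probe_fn)(size_t size, void *ctx);

/*
 * Return the largest size in [lo, hi] accepted by probe, assuming
 * that probe(lo) succeeds and that every size above the kernel's own
 * structure size is rejected.
 */
static size_t
largest_accepted_size(size_probe_fn probe, void *ctx,
                      size_t lo, size_t hi)
{
    assert(probe != NULL && lo <= hi);

    while (lo < hi) {
        /* Round up so that lo always advances. */
        size_t mid = lo + (hi - lo + 1) / 2;

        if (probe(mid, ctx))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Stand-in probe: pretends the kernel supports *ctx bytes. */
static bool
fake_probe(size_t size, void *ctx)
{
    return size <= *(const size_t *) ctx;
}
```

A caller would typically pass the size of the first published
version of the structure as lo and some generous upper bound, such
as the page size, as hi.
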
>
>   EXAMPLES

???
Do you have a (preferably simple) example piece of code
somewhere for setting up an ID mapped mount?

>       /*
>        * This program allows the caller to create a new detached mount
>        * and set various properties on it.
>        */
>       #define _GNU_SOURCE
>       #include <errno.h>
>       #include <fcntl.h>
>       #include <getopt.h>
>       #include <linux/mount.h>
>       #include <linux/types.h>
>       #include <stdbool.h>
>       #include <stdio.h>
>       #include <stdlib.h>
>       #include <string.h>
>       #include <sys/syscall.h>
>       #include <unistd.h>
>
>       static inline int
>       mount_setattr(int dirfd, const char *path, unsigned int flags,
>                     struct mount_attr *attr, size_t size)
>       {
>           return syscall(SYS_mount_setattr, dirfd, path, flags,
>                          attr, size);
>       }
>
>       static inline int
>       open_tree(int dirfd, const char *filename, unsigned int flags)
>       {
>           return syscall(SYS_open_tree, dirfd, filename, flags);
>       }
>
>       static inline int
>       move_mount(int from_dirfd, const char *from_pathname,
>                  int to_dirfd, const char *to_pathname,
>                  unsigned int flags)
>       {
>           return syscall(SYS_move_mount, from_dirfd, from_pathname,
>                          to_dirfd, to_pathname, flags);
>       }
>
>       static const struct option longopts[] = {
>           {"map-mount",       required_argument,  NULL,  'a'},
>           {"recursive",       no_argument,        NULL,  'b'},
>           {"read-only",       no_argument,        NULL,  'c'},
>           {"block-setid",     no_argument,        NULL,  'd'},
>           {"block-devices",   no_argument,        NULL,  'e'},
>           {"block-exec",      no_argument,        NULL,  'f'},
>           {"no-access-time",  no_argument,        NULL,  'g'},
>           { NULL,             0,                  NULL,   0 },
>       };
>
>       #define exit_log(format, ...)  do           \
>       {                                           \
>           fprintf(stderr, format, ##__VA_ARGS__); \
>           exit(EXIT_FAILURE);                     \
>       } while (0)
>
>       int
>       main(int argc, char *argv[])
>       {
>           struct mount_attr *attr = &(struct mount_attr){};
>           int fd_userns = -EBADF;

???
Why this magic initializer here? Why not just "-1"?
Using -EBADF makes it look as if this value specifically is
meaningful, although I don't think that's true.

>           bool recursive = false;
>           int index = 0;
>           int ret;
>
>           while ((ret = getopt_long_only(argc, argv, "",
>                                          longopts, &index)) != -1) {
>               switch (ret) {
>               case 'a':
>                   fd_userns = open(optarg, O_RDONLY | O_CLOEXEC);
>                   if (fd_userns == -1)
>                       exit_log("%m - Failed to open %s\n", optarg);
>                   break;
>               case 'b':
>                   recursive = true;
>                   break;
>               case 'c':
>                   attr->attr_set |= MOUNT_ATTR_RDONLY;
>                   break;
>               case 'd':
>                   attr->attr_set |= MOUNT_ATTR_NOSUID;
>                   break;
>               case 'e':
>                   attr->attr_set |= MOUNT_ATTR_NODEV;
>                   break;
>               case 'f':
>                   attr->attr_set |= MOUNT_ATTR_NOEXEC;
>                   break;
>               case 'g':
>                   attr->attr_set |= MOUNT_ATTR_NOATIME;
>                   attr->attr_clr |= MOUNT_ATTR__ATIME;
>                   break;
>               default:
>                   exit_log("Invalid argument specified\n");
>               }
>           }
>
>           if ((argc - optind) < 2)
>               exit_log("Missing source or target mount point\n");
>
>           const char *source = argv[optind];
>           const char *target = argv[optind + 1];
>
>           int fd_tree = open_tree(-EBADF, source,
>                        OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC |
>                        AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0));

???
What is the significance of -EBADF here? As far as I can tell, it
is not meaningful to open_tree()?


>           if (fd_tree == -1)
>               exit_log("%m - Failed to open %s\n", source);
>
>           if (fd_userns >= 0) {
>               attr->attr_set  |= MOUNT_ATTR_IDMAP;
>               attr->userns_fd = fd_userns;
>           }
>
>           ret = mount_setattr(fd_tree, "",
>                       AT_EMPTY_PATH | (recursive ? AT_RECURSIVE : 0),
>                       attr, sizeof(struct mount_attr));
>           if (ret == -1)
>               exit_log("%m - Failed to change mount attributes\n");
>
>           close(fd_userns);
>
>           ret = move_mount(fd_tree, "", -EBADF, target,
>                            MOVE_MOUNT_F_EMPTY_PATH);

???
What is the significance of -EBADF here? As far as I can tell, it
is not meaningful to move_mount()?

>           if (ret == -1)
>               exit_log("%m - Failed to attach mount to %s\n", target);
>
>           close(fd_tree);
>
>           exit(EXIT_SUCCESS);
>       }
>
>   SEE ALSO
>       newuidmap(1), newgidmap(1), clone(2), mount(2), unshare(2),
>       proc(5), mount_namespaces(7), capabilities(7),
>       user_namespaces(7), xattr(7)

Thanks,

Michael

^ permalink raw reply	[relevance 4%]

* Documenting the requirement of CAP_SETFCAP to map UID 0
@ 2021-08-08  9:09  9% Michael Kerrisk (man-pages)
  2021-08-10 23:58  5% ` Serge E. Hallyn
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-08-08  9:09 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: mtk.manpages, linux-security-module, lkml, Alejandro Colomar,
	Kir Kolyshkin, linux-man

Hello Serge,

Your commit:

[[
commit db2e718a47984b9d71ed890eb2ea36ecf150de18
Author: Serge E. Hallyn <serge@hallyn.com>
Date:   Tue Apr 20 08:43:34 2021 -0500

    capabilities: require CAP_SETFCAP to map uid 0
]]

added a new requirement when updating the UID map of a user namespace
with a value of '0 0 *'.

Kir sent a patch to briefly document this change, but I think much more
should be written. I've attempted to do so. Could you tell me whether the
following text (to be added in user_namespaces(7)) is accurate please:

[[
       In  order  for  a  process  to  write  to  the /proc/[pid]/uid_map
       (/proc/[pid]/gid_map) file, all of the following requirements must
       be met:

       [...]

       4. If  updating  /proc/[pid]/uid_map to create a mapping that maps
          UID 0 in the parent namespace, then one of the  following  must
          be true:

          *  if the writing process is in the parent user namespace, then it
             must have the CAP_SETFCAP capability in that user namespace;
             or

          *  if  the writing process is in the child user namespace, then
             the process that created the user namespace  must  have  had
             the CAP_SETFCAP capability when the namespace was created.

          This rule has been in place since Linux 5.12.  It eliminates an
          earlier security bug whereby a UID 0  process  that  lacks  the
          CAP_SETFCAP capability, which is needed to create a binary with
          namespaced file capabilities (as described in capabilities(7)),
          could  nevertheless  create  such  a  binary,  by the following
          steps:

          *  Create a new user namespace with the identity mapping (i.e.,
             UID  0 in the new user namespace maps to UID 0 in the parent
             namespace), so that UID 0 in both namespaces  is  equivalent
             to the same root user ID.

          *  Since  the  child process has the CAP_SETFCAP capability, it
             could create a binary with namespaced file capabilities that
             would  then  be  effective in the parent user namespace (be‐
             cause the root user IDs are the same in the two namespaces).

       [...]
]]
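
To double-check my reading of rule 4, here it is restated as a small
predicate. This encodes only the proposed text above, not the kernel
code; the enum and the function name are made up for illustration.

```c
#include <stdbool.h>

/* Where the process writing to uid_map lives (hypothetical enum). */
enum writer_location {
    WRITER_IN_PARENT_NS,
    WRITER_IN_CHILD_NS
};

/*
 * May a write to /proc/[pid]/uid_map create a mapping that maps
 * UID 0 in the parent namespace?
 */
static bool
may_map_parent_uid0(enum writer_location where,
                    bool writer_has_setfcap,
                    bool creator_had_setfcap)
{
    switch (where) {
    case WRITER_IN_PARENT_NS:
        /* The writer itself needs CAP_SETFCAP in the parent. */
        return writer_has_setfcap;
    case WRITER_IN_CHILD_NS:
        /* What counts is the creator's CAP_SETFCAP at creation time. */
        return creator_had_setfcap;
    }
    return false;
}
```

If the predicate does not match the intent of the commit, then the
wording of the text needs to change accordingly.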

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 9%]

* [PATCH 5.13 001/104] pipe: make pipe writes always wake up readers
  @ 2021-08-02 13:43  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2021-08-02 13:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Sandeep Patil, Michael Kerrisk,
	Linus Torvalds

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream.

Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" to be "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
 "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michael Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/pipe.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
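
Not part of the commit: a minimal user-space sketch of the pattern
the changelog describes, for anyone who wants to observe the
behavior. It writes to a pipe repeatedly without ever draining it and
counts how often an edge-triggered epoll reports EPOLLIN.
count_et_wakeups() is a hypothetical helper name. With this patch
applied, each write should be reported; on a kernel carrying
1b6b26ae7053 without this fix, only the first write (into the empty
pipe) is expected to be.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Write one byte to a pipe nwrites times without reading, and count
 * how many times an edge-triggered epoll reports EPOLLIN.
 * Returns the count, or -1 on setup failure.
 */
static int
count_et_wakeups(int nwrites)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    struct epoll_event out;
    int pfd[2];
    int epfd;
    int count = -1;

    assert(nwrites > 0);

    if (pipe(pfd) == -1)
        return -1;

    epfd = epoll_create1(0);
    if (epfd == -1)
        goto out_pipe;

    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto out_epoll;

    count = 0;
    for (int i = 0; i < nwrites; i++) {
        if (write(pfd[1], "x", 1) != 1)
            break;
        /* Timeout 0: report only an event that is already pending. */
        if (epoll_wait(epfd, &out, 1, 0) == 1)
            count++;
    }

out_epoll:
    close(epfd);
out_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return count;
}
```

The count returned for several writes depends on the running kernel,
which is the point of the exercise.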

--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -429,20 +429,20 @@ pipe_write(struct kiocb *iocb, struct io
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Epoll nonsensically wants a wakeup whether the pipe
+	 * was already empty or not.
 	 *
 	 * If it wasn't empty we try to merge new data into
 	 * the last buffer.
 	 *
 	 * That naturally merges small writes, but it also
-	 * page-aligs the rest of the writes for large writes
+	 * page-aligns the rest of the writes for large writes
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
+	was_empty = true;
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;



^ permalink raw reply	[relevance 5%]

* [PATCH 5.10 03/67] pipe: make pipe writes always wake up readers
  @ 2021-08-02 13:44  5% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2021-08-02 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Sandeep Patil, Michael Kerrisk,
	Linus Torvalds

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream.

Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" to be "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
 "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michael Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/pipe.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -429,20 +429,20 @@ pipe_write(struct kiocb *iocb, struct io
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Epoll nonsensically wants a wakeup whether the pipe
+	 * was already empty or not.
 	 *
 	 * If it wasn't empty we try to merge new data into
 	 * the last buffer.
 	 *
 	 * That naturally merges small writes, but it also
-	 * page-aligs the rest of the writes for large writes
+	 * page-aligns the rest of the writes for large writes
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
+	was_empty = true;
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;



^ permalink raw reply	[relevance 5%]

* Re: [PATCH AUTOSEL 5.13 001/104] pipe: make pipe writes always wake up readers
  2021-08-02 10:46  5% [PATCH AUTOSEL 5.13 001/104] pipe: make pipe writes always wake up readers Sasha Levin
@ 2021-08-02 10:56  0% ` Greg Kroah-Hartman
  0 siblings, 0 replies; 200+ results
From: Greg Kroah-Hartman @ 2021-08-02 10:56 UTC (permalink / raw)
  To: Sasha Levin
  Cc: linux-kernel, stable, Linus Torvalds, Sandeep Patil,
	Michael Kerrisk, linux-fsdevel

On Mon, Aug 02, 2021 at 06:46:48AM -0400, Sasha Levin wrote:
> From: Linus Torvalds <torvalds@linux-foundation.org>
> 
> commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream.
> 
> Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
> logic") we have sanitized the pipe write logic, and would only try to
> wake up readers if they needed it.
> 
> In particular, if the pipe already had data in it before the write,
> there was no point in trying to wake up a reader, since any existing
> readers must have been aware of the pre-existing data already.  Doing
> extraneous wakeups will only cause potential thundering herd problems.
> 
> However, it turns out that some Android libraries have misused the EPOLL
> interface, and expected "edge triggered" to be "any new write will
> trigger it".  Even if there was no edge in sight.
> 
> Quoting Sandeep Patil:
>  "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
>   logic') changed pipe write logic to wakeup readers only if the pipe
>   was empty at the time of write. However, there are libraries that
>   relied upon the older behavior for notification scheme similar to
>   what's described in [1]
> 
>   One such library 'realm-core'[2] is used by numerous Android
>   applications. The library uses a similar notification mechanism as GNU
>   Make but it never drains the pipe until it is full. When Android moved
>   to v5.10 kernel, all applications using this library stopped working.
> 
>   The library has since been fixed[3] but it will be a while before all
>   applications incorporate the updated library"
> 
> Our regression rule for the kernel is that if applications break from
> new behavior, it's a regression, even if it was because the application
> did something patently wrong.  Also note the original report [4] by
> Michael Kerrisk about a test for this epoll behavior - but at that point
> we didn't know of any actual broken use case.
> 
> So add the extraneous wakeup, to approximate the old behavior.
> 
> [ I say "approximate", because the exact old behavior was to do a wakeup
>   not for each write(), but for each pipe buffer chunk that was filled
>   in. The behavior introduced by this change is not that - this is just
>   "every write will cause a wakeup, whether necessary or not", which
>   seems to be sufficient for the broken library use. ]
> 
> It's worth noting that this adds the extraneous wakeup only for the
> write side, while the read side still considers the "edge" to be purely
> about reading enough from the pipe to allow further writes.
> 
> See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
> for the pipe read case, which remains that "only wake up if the pipe was
> full, and we read something from it".
> 
> Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
> Link: https://github.com/realm/realm-core [2]
> Link: https://github.com/realm/realm-core/issues/4666 [3]
> Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
> Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
> Reported-by: Sandeep Patil <sspatil@android.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> ---
>  fs/pipe.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

This is already in the 5.13 queue, did you mean to send this again?

thanks,

greg k-h

^ permalink raw reply	[relevance 0%]

* [PATCH AUTOSEL 5.13 001/104] pipe: make pipe writes always wake up readers
@ 2021-08-02 10:46  5% Sasha Levin
  2021-08-02 10:56  0% ` Greg Kroah-Hartman
  0 siblings, 1 reply; 200+ results
From: Sasha Levin @ 2021-08-02 10:46 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: Linus Torvalds, Sandeep Patil, Michael Kerrisk,
	Greg Kroah-Hartman, linux-fsdevel

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 3a34b13a88caeb2800ab44a4918f230041b37dd9 upstream.

Since commit 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup
logic") we have sanitized the pipe write logic, and would only try to
wake up readers if they needed it.

In particular, if the pipe already had data in it before the write,
there was no point in trying to wake up a reader, since any existing
readers must have been aware of the pre-existing data already.  Doing
extraneous wakeups will only cause potential thundering herd problems.

However, it turns out that some Android libraries have misused the EPOLL
interface, and expected "edge triggered" to be "any new write will
trigger it".  Even if there was no edge in sight.

Quoting Sandeep Patil:
 "The commit 1b6b26ae7053 ('pipe: fix and clarify pipe write wakeup
  logic') changed pipe write logic to wakeup readers only if the pipe
  was empty at the time of write. However, there are libraries that
  relied upon the older behavior for notification scheme similar to
  what's described in [1]

  One such library 'realm-core'[2] is used by numerous Android
  applications. The library uses a similar notification mechanism as GNU
  Make but it never drains the pipe until it is full. When Android moved
  to v5.10 kernel, all applications using this library stopped working.

  The library has since been fixed[3] but it will be a while before all
  applications incorporate the updated library"

Our regression rule for the kernel is that if applications break from
new behavior, it's a regression, even if it was because the application
did something patently wrong.  Also note the original report [4] by
Michael Kerrisk about a test for this epoll behavior - but at that point
we didn't know of any actual broken use case.

So add the extraneous wakeup, to approximate the old behavior.

[ I say "approximate", because the exact old behavior was to do a wakeup
  not for each write(), but for each pipe buffer chunk that was filled
  in. The behavior introduced by this change is not that - this is just
  "every write will cause a wakeup, whether necessary or not", which
  seems to be sufficient for the broken library use. ]

It's worth noting that this adds the extraneous wakeup only for the
write side, while the read side still considers the "edge" to be purely
about reading enough from the pipe to allow further writes.

See commit f467a6a66419 ("pipe: fix and clarify pipe read wakeup logic")
for the pipe read case, which remains that "only wake up if the pipe was
full, and we read something from it".

Link: https://lore.kernel.org/lkml/CAHk-=wjeG0q1vgzu4iJhW5juPkTsjTYmiqiMUYAebWW+0bam6w@mail.gmail.com/ [1]
Link: https://github.com/realm/realm-core [2]
Link: https://github.com/realm/realm-core/issues/4666 [3]
Link: https://lore.kernel.org/lkml/CAKgNAkjMBGeAwF=2MKK758BhxvW58wYTgYKB2V-gY1PwXxrH+Q@mail.gmail.com/ [4]
Link: https://lore.kernel.org/lkml/20210729222635.2937453-1-sspatil@android.com/
Reported-by: Sandeep Patil <sspatil@android.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/pipe.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index bfd946a9ad01..9ef4231cce61 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -429,20 +429,20 @@ pipe_write(struct kiocb *iocb, struct iov_iter *from)
 #endif
 
 	/*
-	 * Only wake up if the pipe started out empty, since
-	 * otherwise there should be no readers waiting.
+	 * Epoll nonsensically wants a wakeup whether the pipe
+	 * was already empty or not.
 	 *
 	 * If it wasn't empty we try to merge new data into
 	 * the last buffer.
 	 *
 	 * That naturally merges small writes, but it also
-	 * page-aligs the rest of the writes for large writes
+	 * page-aligns the rest of the writes for large writes
 	 * spanning multiple pages.
 	 */
 	head = pipe->head;
-	was_empty = pipe_empty(head, pipe->tail);
+	was_empty = true;
 	chars = total_len & (PAGE_SIZE-1);
-	if (chars && !was_empty) {
+	if (chars && !pipe_empty(head, pipe->tail)) {
 		unsigned int mask = pipe->ring_size - 1;
 		struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
 		int offset = buf->offset + buf->len;
-- 
2.30.2


^ permalink raw reply related	[relevance 5%]

* Re: [PATCH 1/1] fs: pipe: wakeup readers everytime new data written is to pipe
  @ 2021-07-30 19:47  5%         ` Sandeep Patil
  0 siblings, 0 replies; 200+ results
From: Sandeep Patil @ 2021-07-30 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Linux Kernel Mailing List, David Howells,
	Greg Kroah-Hartman, stable, Android Kernel Team

On 7/30/21 7:23 PM, Linus Torvalds wrote:
> On Fri, Jul 30, 2021 at 12:11 PM Sandeep Patil <sspatil@android.com> wrote:
>>
>> Yes, your patch fixes all apps on Android I can test that include this
>> library.
> 
> Ok, thanks for checking.
> 
>> fwiw, the library seems to have been fixed. However, I am not sure
>> how long it will be for all apps to take that update :(.
> 
> I wonder if I could make the wakeup logic do this only for the epollet case.

aren't we supposed to wakeup on each write in level-triggered (default) 
case though?

> 
> I'll have to think about it, but maybe I'll just apply that simple
> patch. I dislike the pointless wakeups, and as long as the only case I
> knew of was only a test of broken behavior, it was fine. But now that
> you've reported actual application breakage, this is in the "real
> regression" category, and so I'll fix it one way or the other.
> 
> And on the other hand I also have a slight preference towards your
> patch simply because you did the work of finding this out, so you
> should get the credit.

Ha, I can't really claim credit here. This was also reported to us
in Android, which triggered the search. Plus, now that I see your thread
with Michael Kerrisk, he was way ahead of us in finding this out.

> 
> I'll mull it over a bit more, but whatever I'll do I'll do before rc4
> and mark it for stable.

Thanks, I was actually going to suggest taking your patch because it also
makes changes in pipe_read(). I am not sure if there are apps that do
EPOLLET | EPOLLOUT (can't think of a reason why).

- ssp

> 
> Thanks for testing,
> 
>                   Linus
> 


^ permalink raw reply	[relevance 5%]

* [PATCH v28 06/32] x86/cet: Add control-protection fault handler
  @ 2021-07-22 20:51  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-07-22 20:51 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang, Rick P Edgecombe
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..a90791433152 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -562,6 +562,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..9f1bdaabc246 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -113,6 +113,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 06743ec054d2..049ea3dcc6cb 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..58664374ae8a 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -607,6 +608,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 5a3c221f4c9d..a1a153ea3cc3 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -235,7 +235,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]

* man-pages-5.12 is released
@ 2021-06-22  1:11 12% Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-06-22  1:11 UTC (permalink / raw)
  To: lkml; +Cc: mtk.manpages, Alejandro Colomar

Gidday,

Alex Colomar and I are proud to announce:

    man-pages-5.12 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from around 40 contributors. The release includes
around 300 commits that changed approximately 180 pages.

Tarball download:
    http://www.kernel.org/doc/man-pages/download.html
Git repository:
    https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
    http://man7.org/linux/man-pages/changelog.html#release_5.12

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2021/06/man-pages-512-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

You are receiving this message either because:

a) You contributed to the content of this release.

b) You are subscribed to linux-man@vger.kernel.org or
libc-alpha@sourceware.org.

c) I have information (possibly inaccurate) that you are the maintainer
of a translation of the manual pages, or are the maintainer of the
manual pages set in a particular distribution, or have expressed
interest in helping with man-pages maintenance, or have otherwise
expressed interest in being notified about man-pages releases.
If you don't want to receive such messages from me, or you know of
some other translator or maintainer who may want to receive such
notifications, send me a message.

Cheers,

Michael

==================== Changes in man-pages-5.12 ====================

Released: 2021-06-20, Christchurch


New and rewritten pages
-----------------------

seccomp_unotify.2
    Michael Kerrisk  [Tycho Andersen, Jann Horn, Kees Cook, Christian Brauner,
                      Sargun Dhillon]
        New page documenting the seccomp user-space notification mechanism

MAX.3
    Alejandro Colomar
        New page to document MAX() and MIN()


Newly documented interfaces in existing pages
---------------------------------------------

seccomp.2
    Tycho Andersen  [Michael Kerrisk]
        Document SECCOMP_GET_NOTIF_SIZES
    Tycho Andersen  [Michael Kerrisk]
        Document SECCOMP_FILTER_FLAG_NEW_LISTENER
    Tycho Andersen  [Michael Kerrisk]
        Document SECCOMP_RET_USER_NOTIF

set_mempolicy.2
    Huang Ying  [Alejandro Colomar, "Huang, Ying"]
        Add mode flag MPOL_F_NUMA_BALANCING

userfaultfd.2
    Peter Xu  [Alejandro Colomar, Mike Rapoport]
        Add UFFD_FEATURE_THREAD_ID docs
    Peter Xu  [Alejandro Colomar, Mike Rapoport]
        Add write-protect mode docs

proc.5
    Michael Kerrisk
        Document /proc/sys/vm/sysctl_hugetlb_shm_group

system_data_types.7
    Alejandro Colomar
        Add 'blksize_t'
    Alejandro Colomar
        Add 'blkcnt_t'
    Alejandro Colomar
        Add 'mode_t'
    Alejandro Colomar
        Add 'struct sockaddr'
    Alejandro Colomar
        Add 'cc_t'
    Alejandro Colomar
        Add 'socklen_t'


Global changes
--------------

Many pages
    Alejandro Colomar
        SYNOPSIS: Use syscall(SYS_...); for system calls without a wrapper

Many pages
    Alejandro Colomar
        SYNOPSIS: Document why each header is required
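
To illustrate the two SYNOPSIS conventions above, here is a small sketch
of calling a system call through syscall(2), with a comment on each
header saying why it is included. gettid(2) is used only as an example
(glibc provided no wrapper for it before version 2.30), and raw_gettid()
is a name made up for this illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>	/* Definition of SYS_* constants */
#include <unistd.h>		/* syscall(), pid_t */

/* Invoke gettid(2) directly by system call number. */
pid_t raw_gettid(void)
{
	return (pid_t) syscall(SYS_gettid);
}
```

In the main thread of a process, the returned thread ID is the same as
the value returned by getpid(2).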


Changes to individual pages
---------------------------

dup.2
    Michael Kerrisk
        Rewrite the description of dup() somewhat
            As can be seen by any number of StackOverflow questions, people
            persistently misunderstand what dup() does, and the existing manual
            page text, which talks of "copying" a file descriptor, doesn't help.
            Rewrite the text a little to try to prevent some of these
            misunderstandings, in particular noting at the start that dup()
            allocates a new file descriptor.
    Michael Kerrisk
        Clarify what silent closing means
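
The point about dup() can be shown with a short sketch: the call
allocates a new descriptor number that refers to the same open file
description, so the file offset is shared. This is an illustration
written for this summary, not text from the manual page;
dup_shares_offset() is a hypothetical helper and the /tmp path is an
assumption.

```c
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 1 if a descriptor obtained from dup() has a different number
 * but shares its file offset with the original, 0 if not, -1 on error.
 */
int dup_shares_offset(void)
{
	char path[] = "/tmp/dup-demo-XXXXXX";
	int fd, fd2, shared = -1;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);		/* the file lives on until both fds are closed */

	fd2 = dup(fd);		/* new descriptor, same open file description */
	if (fd2 != -1) {
		/* Writing through fd moves the offset that fd2 also sees. */
		if (write(fd, "abc", 3) == 3)
			shared = (fd2 != fd &&
				  lseek(fd2, 0, SEEK_CUR) == 3);
		close(fd2);
	}
	close(fd);
	return shared;
}
```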

_exit.2
    Michael Kerrisk
        Add a little more detail on the raw _exit() system call

flock.2
    Aurelien Aptel  [Alejandro Colomar]
        Add CIFS details
            CIFS flock() locks behave differently than the standard.
            Give an overview of those differences.

memfd_create.2
mmap.2
shmget.2
    Michael Kerrisk  [Yang Xu]
        Document the EPERM error for huge page allocations
            This error can occur if the caller does not have CAP_IPC_LOCK
            and is not a member of the sysctl_hugetlb_shm_group.

mmap.2
    Bruce Merry
        Clarify that MAP_POPULATE is best-effort
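
A minimal sketch of what "best-effort" means for callers, assuming an
anonymous private mapping: success of mmap() says only that the mapping
exists, not that every page was prefaulted, so code must still tolerate
page faults on first access. map_prefaulted() is a name invented here.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Create an anonymous mapping and ask the kernel to prefault it.
 * Returns NULL if the mapping itself could not be created.  A non-NULL
 * result does not guarantee that all pages are resident.
 */
void *map_prefaulted(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```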

mount.2
    Topi Miettinen
        Document SELinux use of MS_NOSUID mount flag

open.2
    Alejandro Colomar  [Walter Harms]
        Fix bug in linkat(2) call example
            AT_EMPTY_PATH works with empty strings (""), but not with NULL
            (or at least it's not obvious).

perfmonctl.2
    Michael Kerrisk
        This system call was removed in Linux 5.10

select.2
    Michael Kerrisk
        Strengthen the warning regarding the low value of FD_SETSIZE
            All modern code should avoid select(2) in favor of poll(2)
            or epoll(7).
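
As a sketch of the recommended alternative, the helper below waits for
a single descriptor with poll(2), which takes an array of descriptors
and therefore has no FD_SETSIZE limit on descriptor values.
wait_readable() is an illustrative name, not an existing API.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;	/* 0: timed out, -1: error (see errno) */
	return (pfd.revents & POLLIN) ? 1 : 0;
}
```

Unlike FD_SET(), this works the same way for a descriptor numbered 5 or
5000.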

capabilities.7
    Michael Kerrisk
        CAP_IPC_LOCK also governs memory allocation using huge pages

signal.7
    Michael Kerrisk
        Add reference to seccomp_unotify(2)
            The seccomp user-space notification feature can cause changes in
            the semantics of SA_RESTART with respect to system calls that
            would never normally be restarted. Point the reader to the page
            that provides further details.

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH] kernel_lockdown.7: Remove additional text alluding to lifting via SysRq
  @ 2021-06-09 21:29 11% ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-06-09 21:29 UTC (permalink / raw)
  To: dann frazier, linux-man, Alejandro Colomar (man-pages),
	David Howells, Heinrich Schuchardt
  Cc: mtk.manpages, linux-kernel, Pedro Principeza

Hello Dann,

On 6/8/21 10:19 AM, dann frazier wrote:
> My previous patch intended to drop the docs for the lockdown lift SysRq,
> but it missed this other section that refers to lifting it via a keyboard -
> an allusion to that same SysRq.
> 
> Signed-off-by: dann frazier <dann.frazier@canonical.com>

Thanks. Patch applied.

Cheers,

Michael


> ---
>  man7/kernel_lockdown.7 | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/man7/kernel_lockdown.7 b/man7/kernel_lockdown.7
> index b0442b3b6..0c0a9500d 100644
> --- a/man7/kernel_lockdown.7
> +++ b/man7/kernel_lockdown.7
> @@ -19,9 +19,6 @@ modification of the kernel image and to prevent access to security and
>  cryptographic data located in kernel memory, whilst still permitting driver
>  modules to be loaded.
>  .PP
> -Lockdown is typically enabled during boot and may be terminated, if configured,
> -by typing a special key combination on a directly attached physical keyboard.
> -.PP
>  If a prohibited or restricted feature is accessed or used, the kernel will emit
>  a message that looks like:
>  .PP
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* [ANNOUNCE] util-linux v2.37
@ 2021-06-01  8:38  5% Karel Zak
  0 siblings, 0 replies; 200+ results
From: Karel Zak @ 2021-06-01  8:38 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, util-linux


The util-linux release v2.37 is available at

  http://www.kernel.org/pub/linux/utils/util-linux/v2.37/

Feedback and bug reports, as always, are welcomed.

  Karel



Util-linux 2.37 Release Notes
=============================

Release highlights
------------------

This project no longer uses Groff to maintain man pages. Since v2.37 all text is
maintained in AsciiDoc, and the man pages are generated by asciidoctor during
the package build process (see also the --disable-asciidoc configure option).
Thanks to Mario Blättermann.

The long-term goal is to also maintain man-page translations (via
translationproject.org and po4a) in the util-linux project. Please contact
Mario Blättermann if you want to help with the conversion from
manpages-l10n.

The old hardlink(1) implementation from Jakub Jelinek (originally for Fedora)
has been replaced by a new implementation from Julian Andres Klode (originally
for Debian). The new implementation does not support the -f option to force
hardlink creation between filesystems.

lscpu(1) has been reimplemented. Now it analyzes /sys for all CPUs and provides
information for all CPU types used by the system (for example heterogeneous
big.LITTLE ARMs, etc.). This command also reads SMBIOS tables to get CPU
identifiers. Thanks to Masayoshi Mizuma from Fujitsu and Jeffrey Bastian from
Red Hat.  The default output on the terminal is more structured now to be more
human-readable.

uclampset(1) is a new utility to manipulate the utilization clamping
attributes of the system or a process. Thanks to Qais Yousef from ARM.

hexdump(1) automatically uses -C when called as "hd".

dmesg(1) supports new command-line options --since and --until.

findmnt(8) supports a new command-line option --shadowed to print only
filesystems over-mounted by another filesystem.

mount(8) supports --read-only command-line option for non-root users too.

umount(8) can also unmount all over-mounted filesystems (more filesystems on
the same mount point) when executed with --recursive.

libfdisk (and fdisk, sfdisk, cfdisk) supports partition type names on input,
ignoring the case of the characters and all non-alphanumeric and non-digit
characters in the name (e.g. type="Linux /usr x86" is the same as type="linux
usr-x86" for sfdisk).
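
The matching rule described above can be sketched as a comparison that
skips non-alphanumeric characters and ignores case. This is only an
illustration of the rule as stated in these notes, not the libfdisk
implementation; type_names_match() and next_alnum() are hypothetical
names.

```c
#include <ctype.h>

/* Advance to the next alphanumeric character (or the end of string). */
static const char *next_alnum(const char *s)
{
	while (*s && !isalnum((unsigned char)*s))
		s++;
	return s;
}

/*
 * Compare two partition type names, ignoring case and every character
 * that is not a letter or digit.  Returns 1 on match, 0 otherwise.
 */
int type_names_match(const char *a, const char *b)
{
	for (;;) {
		a = next_alnum(a);
		b = next_alnum(b);
		if (!*a || !*b)
			return !*a && !*b;	/* both must end together */
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
		a++;
		b++;
	}
}
```

With this rule, "Linux /usr x86" and "linux usr-x86" both reduce to the
same sequence of letters and digits and therefore compare equal.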

libmount no longer contains a workaround to detect inconsistent
/proc/self/mountinfo reads. This problem is fixed in the Linux kernel (since v5.8,
kernel commit 9f6c61f96f2d97cbb5f7fa85607bc398f843ff0f).

libblkid supports "probing hints" now. Hints are an optional way to force
probing functions to check another location, for example a specific session
on a multi-session UDF. The blkid(8) command supports this functionality with
a new --hint option. The library has also been extended to support other
ISO9660 and UDF identifiers. Thanks to Pali Rohár.

blkzone(8) provides a new "capacity" command.

cfdisk(8) can be started in read-only mode with the new command-line option
--read-only.

lsblk(8) provides new columns FSROOTS and MOUNTPOINTS. The column
MOUNTPOINTS is used in the default output now and this new column prints all
mount points where the device is used (btrfs subvolumes, bind mounts, etc).

losetup(8) uses LOOP_CONFIG ioctl now.

column(1) supports a new command-line option --table-columns-limit to specify
the maximum number of input columns. The last column will contain all remaining
line data if the limit is smaller than the number of the columns in the input
data.

It's possible to use meson to build util-linux. This feature is experimental
and currently designed only for developers. Don't panic: the current
autotools-based build process will remain supported, maintained, and used as
the primary build system for years to come.


Changes between v2.36 and v2.37
-------------------------------

Asciidoc:
   - Adapt Makefiles to new asciidoc man pages  [Mario Blättermann]
   - Add Po4a hint to file headers  [Mario Blättermann]
   - Add missing macro definition in uclampset.1  [Mario Blättermann]
   - Add po4a config file and initial translation template for man pages  [Mario Blättermann]
   - Better gettext message splitting in nsenter.1.adoc  [Mario Blättermann]
   - Convert man-common/README to Markdown  [Mario Blättermann]
   - Fix artifact from initial import, sixth attempt  [Mario Blättermann]
   - Fix artifacts from initial import  [Mario Blättermann]
   - Fix artifacts from initial import, fifth attempt  [Mario Blättermann]
   - Fix artifacts from initial import, fourth attempt  [Mario Blättermann]
   - Fix artifacts from initial import, second attempt  [Mario Blättermann]
   - Fix artifacts from initial import, third attempt  [Mario Blättermann]
   - Fix man pages with variables to use the same value as in previous *.in files  [Mario Blättermann]
   - Fix markup  [Mario Blättermann]
   - Fix markup in example man page  [Mario Blättermann]
   - Fix typo  [Mario Blättermann]
   - Fix typo and remove invisible spaces which confuse po4a  [Mario Blättermann]
   - Formatting cleanup  [Mario Blättermann]
   - Import disk-utils man pages  [Mario Blättermann]
   - Import hwclock.8.in  [Mario Blättermann]
   - Import libuuid man pages  [Mario Blättermann]
   - Import login-utils man pages  [Mario Blättermann]
   - Import misc-utils man pages  [Mario Blättermann]
   - Import rtcwake.8.in  [Mario Blättermann]
   - Import sys-utils man pages, part 1  [Mario Blättermann]
   - Import sys-utils man pages, part 2  [Mario Blättermann]
   - Import sys-utils man pages, part 3  [Mario Blättermann]
   - Import term-utils man pages  [Mario Blättermann]
   - Import textutils man pages  [Mario Blättermann]
   - Incorporate latest change in findmnt.8  [Mario Blättermann]
   - Incorporate latest changes in findmnt.8  [Karel Zak]
   - Incorporate latest changes in rfkill.8 and umount.8  [Mario Blättermann]
   - Re-add empty lines to man pages  [Mario Blättermann]
   - Remove already imported *roff man pages  [Mario Blättermann]
   - Remove already imported disk-utils *roff man pages  [Mario Blättermann]
   - Remove already imported login-utils *roff man pages  [Mario Blättermann]
   - Remove already imported misc-utils *roff man pages  [Mario Blättermann]
   - Remove already imported text-utils *roff man pages  [Mario Blättermann]
   - Remove artifact from merge conflict  [Mario Blättermann]
   - Remove old man page links  [Mario Blättermann]
   - Reorder example command sequence  [Mario Blättermann]
   - Review disk-utils man pages  [Mario Blättermann]
   - Review login-utils man pages  [Mario Blättermann]
   - Review misc-utils man pages  [Mario Blättermann]
   - Review schedutils man pages  [Mario Blättermann]
   - Review sys-utils man pages, part 2  [Mario Blättermann]
   - Review sys-utils man pages,part 1  [Mario Blättermann]
   - Review term-utils man pages  [Mario Blättermann]
   - Review terminal-colors.d.5.adoc  [Mario Blättermann]
   - Review text-utils man pages  [Mario Blättermann]
   - Small fix in nsenter.1.adoc  [Mario Blättermann]
   - Small indentation fix in mount.8.adoc  [Mario Blättermann]
   - Some formatting cleanup in man pages  [Mario Blättermann]
   - Some more  man page formatting improvements  [Mario Blättermann]
   - Unify spelling of »User Commands«  [Mario Blättermann]
   - Update .pot template  [Mario Blättermann]
   - Use correct ' man manual ' for man pages from section 8  [Mario Blättermann]
   - Yet another formatting fix  [Mario Blättermann]
   - add missing bugreports section to libblkid and some cleanup  [Mario Blättermann]
Automake:
   - install uuidgen bash completion only if it is built  [Luca Boccassi]
   - use EXTRA_LTLIBRARIES instead of noinst_LTLIBRARIES  [Luca Boccassi]
Manual pages:
   - agetty.8  Minor formatting and wording fixes  [Michael Kerrisk (man-pages)]
   - blockdev.8  Minor wording and formatting fixes  [Michael Kerrisk (man-pages)]
   - blockdev.8, sfdisk.8  typo fixes  [Michael Kerrisk (man-pages)]
   - document the 'resize' command  [Vincent McIntyre]
   - logger.1  minor formatting and typo fixes  [Michael Kerrisk (man-pages)]
   - lsblk.8  Minor formatting and typo fixes  [Michael Kerrisk (man-pages)]
   - lslogins.1  Minor wording and formatting fixes  [Michael Kerrisk (man-pages)]
   - nologin.8  formatting fixes  [Michael Kerrisk (man-pages)]
   - raw.8  Minor formatting and wording fixes  [Michael Kerrisk (man-pages)]
   - sfdisk.8  Minor wording and formatting fixes  [Michael Kerrisk (man-pages)]
   - sfdisk.8  Use less aggressive indenting  [Michael Kerrisk (man-pages)]
   - wdctl.8  typo fix  [Michael Kerrisk (man-pages)]
   - wipefs.8  Formatting fixes  [Michael Kerrisk (man-pages)]
agetty:
   - Allow --init-string on a virtual console  [Ivan Mironov]
   - fix typo in manual page  [Samanta Navarro]
   - tty eol defaults to REPRINT  [Sami Loone]
bash-completion:
   - (lsblk) fix -E/-M arg (non-)completion  [Ville Skyttä]
   - (lsblk) update columns  [Karel Zak]
   - add column --table-columns-limit  [Karel Zak]
   - add irqtop/lsirq --softirq  [Karel Zak]
blkdiscard:
   - do not probe for signatures on --force  [Karel Zak]
   - fix compilation without libblkid  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
blkid:
   - add --hint <name>=value  [Karel Zak]
   - add another UDF identifiers  [Karel Zak]
   - document --hint  [Karel Zak]
   - encode all udf and iso IDs in udev output  [Karel Zak]
blkzone:
   - add capacity field to zone report  [Shin'ichiro Kawasaki]
   - add report capacity command  [Hans Holmberg]
blockdev:
   - fix man page formatting  [Jakub Wilk]
build-sys:
   - add --disable-scriptutils  [Karel Zak]
   - add .stamp to gitignore  [Karel Zak]
   - add EXTRA_LTLIBRARIES between CLEANFILES  [Karel Zak]
   - add UL_REQUIRES_PROGRAM() macro, use it for asciidoc  [Karel Zak]
   - add configure options to disable individual utils  [heitbaum]
   - add man-common/Makemodule.am  [Karel Zak]
   - add missing header file  [Karel Zak]
   - add restrict keyword fallback  [Karel Zak]
   - add support for --enable-fuzzing-engine  [Evgeny Vereshchagin]
   - add targets to generated translated man pages  [Karel Zak]
   - add uninstall to po-man  [Karel Zak]
   - check for libselinux >= 3.1  [Karel Zak]
   - cleanup .gitignore files  [Karel Zak]
   - cleanup Makefiles  [Karel Zak]
   - cleanup distcheck options  [Karel Zak]
   - cleanup uclampset dependencies  [Karel Zak]
   - disable po-man by default, cleanup summary  [Karel Zak]
   - do not build plymouth-ctrl.c w/ disabled plymouth  [Pino Toscano]
   - do not use extra subdir for getopt examples  [Karel Zak]
   - exclude GPL from libcommon  [Karel Zak]
   - fix libblkid dependence  [Karel Zak]
   - fix out-of-tree build  [Karel Zak]
   - fix po-man/ make check  [Karel Zak]
   - fix schedutils/sched_attr.h include  [Karel Zak]
   - fix sendfile use  [Karel Zak]
   - fix test_loopdev build  [Karel Zak]
   - fix typo  [Karel Zak]
   - improve asciidoc generic rule  [Karel Zak]
   - keep adoc files in dist_noinst_DATA  [Karel Zak]
   - make man pages location independent  [Karel Zak]
   - make man pages optional, add --disable-asciidoc  [Karel Zak]
   - move selinux_utils.c  [Karel Zak]
   - release++ (v2.37-rc1)  [Karel Zak]
   - release++ (v2.37-rc2)  [Karel Zak]
   - remove duplicate hook  [Karel Zak]
   - remove fallback for security_context_t  [Karel Zak]
   - remove man page link files  [Karel Zak]
   - remove some man pages from PATHFILES  [Karel Zak]
   - remove with-cryptsetup from tools/config-gen.d/all.conf  [Karel Zak]
   - set localstatedir and sysconfdir default  [Karel Zak]
   - silence non-POSIX variable name warning  [Sami Kerola]
   - sort various lists in configure.ac  [Sami Kerola]
   - split man pages and man page links  [Karel Zak]
   - update to autoconf 2.70  [Sami Kerola]
   - update util-linux-man.pot on 'make dist'  [Karel Zak]
   - use _DATA to install getopt examples  [Karel Zak]
build-system:
   - make "make distcheck" work  [Evgeny Vereshchagin]
   - stop looking for %ms and %as  [Evgeny Vereshchagin]
cal:
   - do not use putp(), directly use stdio functions  [Karel Zak]
cfdisk:
   - (man) add info when cfdisk writes to the device  [Karel Zak]
   - Implemented cfdisk's opening in read-only mode  [Dmitriy Chestnykh]
   - show Q option when choosing label type  [Chris Hofstaedtler]
   - warn if disk on use  [Karel Zak]
chfs-chfn:
   - remove deprecated selinux_check_passwd_access()  [Karel Zak]
chrt:
   - (man) add human-readable names for policies  [Karel Zak]
   - don't restrict --reset-on-fork, add more info to man page  [Karel Zak]
   - non-Linux fix  [Karel Zak]
   - use SCHED_FLAG_RESET_ON_FORK for sched_setattr()  [Karel Zak]
ci:
   - 'downgrade' Ubuntu version to Bionic  [Frantisek Sumsal]
   - build both w/ and w/o sanitizers on GH Actions  [Frantisek Sumsal]
   - code cleanup  [Frantisek Sumsal]
   - deal with uninstrumented binaries using instrumented libs  [Frantisek Sumsal]
   - run the build test for each pull request  [Frantisek Sumsal]
   - trigger CiFuzz for the master branch only  [Evgeny Vereshchagin]
   - use the correct compiler version  [Frantisek Sumsal]
cifuzz:
   - reindent yaml file  [Sami Kerola]
   - turn on MSan  [Evgeny Vereshchagin]
col:
   - add defaults to switch case clauses  [Sami Kerola]
   - add handle_not_graphic() function  [Sami Kerola]
   - add more tests  [Sami Kerola]
   - add structure to hold line variables  [Sami Kerola]
   - add update_cur_line() function  [Sami Kerola]
   - cleanup usage() and struct col_*  [Karel Zak]
   - enable deallocation on exit also for __SANITIZE_ADDRESS__  [Karel Zak]
   - fix --help short option in usage() output  [Sami Kerola]
   - flip all comparisons to numerical order  [Sami Kerola]
   - free memory before exit [LeakSanitizer]  [Sami Kerola]
   - initialize variables when they are declared  [Sami Kerola]
   - make input to tolerate invalid wide characters  [Sami Kerola]
   - move global variables to a control structure  [Sami Kerola]
   - move option handling to separate function  [Sami Kerola]
   - remove function prototypes  [Sami Kerola]
   - replace LINE and CHAR typedefs with structs  [Sami Kerola]
   - tidy up sources a little bit  [Sami Kerola]
   - use inline function rather than function like define  [Sami Kerola]
   - use size_t when dealing with numbers that buffer sizes  [Sami Kerola]
   - use typedef and enum to clarify struct  [Sami Kerola]
colrm:
   - fix argument parsing  [Sami Kerola]
column:
   - Deprecate --table-empty-lines in favor of --keep-empty-lines  [Lennard Hofmann]
   - Optionally keep empty lines in cols/rows mode  [Lennard Hofmann]
   - add --table-columns-limit  [Karel Zak]
   - add placeholder '0' to specify all columns  [Karel Zak]
configure:
   - test -a|o is not POSIX  [Issam E. Maghni]
configure.ac:
   - check for sendfile  [Egor Chelak]
dmesg:
   - add --since and --until  [Karel Zak]
   - fix and cleanup --read-clear  [Karel Zak]
docs:
   - add #1266 to TODO file  [Karel Zak]
   - add hint about make install-strip and link to Documentation/  [Karel Zak]
   - add kernel version and commit to info about mountinfo workaround  [Karel Zak]
   - add note about github  [Karel Zak]
   - fix typo  [Karel Zak]
   - fix typo in v2.36-ReleaseNotes  [Karel Zak]
   - mention OSS-Fuzz and CIFuzz and how to build fuzz targets locally  [Evgeny Vereshchagin]
   - rename to getopt-example  [Karel Zak]
   - update AUTHORS file  [Karel Zak]
   - update Documentation/howto-man-page.txt  [Karel Zak]
   - update TODO  [Karel Zak]
   - update TODO (add item about mnt_context_get_excode() )  [Karel Zak]
   - update TODO (scols borders)  [Karel Zak]
   - update TODO file (add item about libblkid ZFS)  [Karel Zak]
   - update copyright years  [Karel Zak]
   - update v2.37-ReleaseNotes  [Karel Zak]
docs/TODO:
   - Minor update and fix typo  [Mario Blättermann]
eject:
   - cleanup before successful exit  [Karel Zak]
fallocate:
   - fix --dig-holes at end of files  [Gero Treuner]
fdformat:
   - remove command from default build  [Sami Kerola]
fdisk:
   - (man) add info about order for -l  [Karel Zak]
   - always report fdisk_create_disklabel() errors  [Karel Zak]
   - always skips zeros in dumps  [Karel Zak]
   - fix expected test output on alpha  [Chris Hofstaedtler]
   - support partition type name in dialogs  [Karel Zak]
   - warn if disk in use  [Karel Zak]
findmnt:
   - (man) add more info about --target  [Karel Zak]
   - add --shadowed  [Karel Zak]
   - add --shadowed to the man page  [Karel Zak]
   - add PARENT column  [Karel Zak]
   - add option to list all fs-independent flags  [Roberto Bergantinos Corpas]
   - sort columns  [Karel Zak]
flock:
   - fix time_t=long assumptions  [Karel Zak]
   - keep -E exit status more restrictive  [Karel Zak]
fsck:
   - fix time_t=long assumptions  [Karel Zak]
fsck, libblkid:
   - fix printf format string issue [coverity scan]  [Sami Kerola]
fsck.cramfs:
   - fix fsck.cramfs crashes on blocksizes > 4K  [ToddRK]
fstab:
   - fstab.5 NTFS and FAT volume IDs use upper case  [Heinrich Schuchardt]
fstrim:
   - do not start the timer in initrd  [Zbigniew Jędrzejewski-Szmek]
   - fix memory leak [coverity scan]  [Karel Zak]
   - fix paths comparison  [Karel Zak]
   - remove fstab condition from fstrim.timer  [Dusty Mabe]
fuzzers:
   - make tests setup more robust  [Karel Zak]
getopt:
   - explicitly ask for POSIX mode on POSIXLY_CORRECT  [Đoàn Trần Công Danh]
github:
   - CC fix export  [Karel Zak]
   - add 'distcheck' workflow job  [Karel Zak]
   - add build workflow  [Karel Zak]
   - add ruby-asciidoctor to CI-build  [Karel Zak]
   - cleanup cibuild.sh  [Karel Zak]
   - enable ci-build for all basic branches  [Karel Zak]
   - export CC and CXX  [Karel Zak]
   - fix asciidoctor dependence  [Karel Zak]
   - fix btrfs package name  [Karel Zak]
   - fix cibuild typo  [Karel Zak]
   - fix distcheck job  [Karel Zak]
   - make sure compiler is defined  [Karel Zak]
   - remove distcheck  [Karel Zak]
hardlink:
   - add --quiet option  [Karel Zak]
   - check and use sys/xattr.h  [Karel Zak]
   - cleanup --minimum-size stuff  [Karel Zak]
   - cleanup includes and types  [Karel Zak]
   - cleanup man page  [Karel Zak]
   - cleanup summary  [Karel Zak]
   - cleanup usage()  [Karel Zak]
   - fix hardlink pcre leak  [Sami Kerola]
   - fix indention  [Karel Zak]
   - fix time_t=long assumptions  [Karel Zak]
   - fix typo  [Karel Zak]
   - fix typo  [Mario Blättermann]
   - fix typo again  [Karel Zak]
   - fix typo in man page  [Karel Zak]
   - move default to options initialization  [Karel Zak]
   - replace with code from Debian  [Karel Zak]
   - s/DEBUG/VERBOSE/  [Karel Zak]
   - translate verbose messages  [Karel Zak]
   - use PCRE2 posix header file  [Karel Zak]
   - use err() if possible  [Karel Zak]
   - use errx() when parse options  [Karel Zak]
   - use monotonic time like other utils  [Karel Zak]
   - use only err.h to print errors and warnings  [Karel Zak]
   - use our xalloc.h  [Karel Zak]
   - use size_to_human_string()  [Karel Zak]
hexdump:
   - add "hd" program name to man page  [Chris Hofstaedtler]
   - automatically use -C when called as hd  [Chris Hofstaedtler]
hwclock:
   - add fallback if SYS_settimeofday does not exist  [Karel Zak]
   - do not assume __NR_settimeofday_time32  [Pino Toscano]
   - fix SYS_settimeofday fallback  [Rosen Penev]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - fix indentation  [Łukasz Stelmach]
   - follow timespec and use long int for nsec  [Karel Zak]
   - make tz use more robust [coverity scan]  [Karel Zak]
   - use pointer to adjtime data  [Karel Zak]
include/pathnames:
   - cleanup /proc/sys/kernel use  [Karel Zak]
include/strutils:
   - make xstrncpy() compatible with over-smart gcc 9  [Karel Zak]
ipcs:
   - Avoid shmall overflows  [Vasilis Liaskovitis]
   - fallback for overflow  [Karel Zak]
irqtop:
   - add per-cpu stats  [Karel Zak]
   - check scols_line_set_data() return code  [Karel Zak]
   - print header in reverse mode  [Karel Zak]
   - small cleanup  [Karel Zak]
irqtop/lsirq:
   - add additional desc for softirq  [zhenwei pi]
   - add softirq for man page  [zhenwei pi]
   - support softirq  [zhenwei pi]
lib:
   - add missing headers to .c files  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - use procutils.c on Linux only  [Karel Zak]
   - use ul_prefix for close_all_fds() and mkdir_p()  [Karel Zak]
lib/buffer:
   - add simple grow-able buffer  [Karel Zak]
   - fix end pointer initialization  [Karel Zak]
   - make it robust for static analyzers [coverity scan]  [Karel Zak]
lib/caputils:
   - add fall back for last cap using prctl.  [Érico Rolim]
   - split to multiple functions, add test  [Karel Zak]
lib/env:
   - add function to save and restore unwanted variables  [Karel Zak]
lib/fileutils:
   - close fd if fdopen is failed  [Masatake YAMATO]
   - make close_all_fds() to be similar with close_range()  [Sami Kerola]
lib/jsonwrt:
   - add new functions to write in JSON  [Karel Zak]
   - don't use ctype.h for ASCII chars  [Karel Zak]
   - remove 'islast' from API  [Karel Zak]
   - remove fputs_quoted_json_* functions from include/carefulputc.h  [Karel Zak]
   - use proper output function  [Karel Zak]
lib/loopdev:
   - cosmetic changes to LOOP_CONFIGURE  [Karel Zak]
   - fix is_loopdev() to be usable with partitions  [Karel Zak]
   - make is_loopdev() more robust  [Karel Zak]
lib/pager:
   - fix improper use of negative value [coverity scan]  [Sami Kerola]
lib/procutils:
   - add proc_is_procfs helper.  [Érico Rolim]
   - improve proc_is_procfs(), add test  [Karel Zak]
   - use Public Domain for this file  [Karel Zak]
lib/pty-session:
   - fix time_t=long assumptions  [Karel Zak]
lib/randutils:
   - rename random_get_bytes()  [Sami Kerola]
lib/selinux-utils:
   - cleanup function names  [Karel Zak]
   - tiny cleanup  [Karel Zak]
lib/signames:
   - change license to public domain  [Karel Zak]
lib/strutils:
   - add normalize_whitespace()  [Karel Zak]
   - add ul_stralnumcmp()  [Karel Zak]
   - assume 64-bit time_t  [Karel Zak]
lib/sysfs:
   - fix double free [coverity scan]  [Karel Zak]
libblkid.3.adoc:
   - Add missing SYNOPSIS section  [Mario Blättermann]
libblkid:
   - (gpt) accept tiny devices  [Karel Zak]
   - add blkid_probe_{set,get}_hint()  [Karel Zak]
   - add erofs filesystem support  [Gao Xiang]
   - allow a lot of mac partitions  [Samanta Navarro]
   - allow to specify offset defined by hint for blkid_probe_get_idmag()  [Pali Rohár]
   - detect CD/DVD discs in packet writing mode  [Pali Rohár]
   - detect session_offset hint for optical discs  [Pali Rohár]
   - do size correction of optical discs also by last written sector  [Pali Rohár]
   - drbdmanage  use blkid_probe_strncpy_uuid instead of blkid_probe_set_id_label  [Pali Rohár]
   - export blkid_probe_reset_hints()  [Karel Zak]
   - fix Atari prober logic  [Karel Zak]
   - fix blkid_probe_get_sb() to use hint offset calculation  [Pali Rohár]
   - fix comment block  [Karel Zak]
   - fix docs  [Karel Zak]
   - fix memory leak in config parser  [Samanta Navarro]
   - fix some typos in function comments  [nick black]
   - fix time_t handling  [Samanta Navarro]
   - improve debug for /proc/partitions  [Karel Zak]
   - initialize magic strings in robust way  [Karel Zak]
   - iso9660  add new test images  [Pali Rohár]
   - iso9660  add support for VOLUME_SET_ID and DATA_PREPARER_ID  [Pali Rohár]
   - iso9660  add support for multisession via session_offset hint  [Pali Rohár]
   - iso9660  check that iso->publisher_id and iso->application_id are not file paths  [Pali Rohár]
   - iso9660  do not check is_str_empty() for iso->system_id and boot->boot_system_id  [Pali Rohár]
   - iso9660  fix parsing images which do not have Primary Volume Descriptor as the first  [Pali Rohár]
   - iso9660  improve label parsing  [Pali Rohár]
   - iso9660  parse SYSTEM_ID, PUBLISHER_ID and APPLICATION_ID from Joliet  [Pali Rohár]
   - iso9660  set block size also for High Sierra format  [Pali Rohár]
   - limit amount of parsed partitions  [Samanta Navarro]
   - make Atari more robust  [Karel Zak]
   - make gfs2 prober more extendible  [Karel Zak]
   - overwrite existing hint  [Karel Zak]
   - remove workaround for FAT+MBR on whole-disk  [Karel Zak]
   - udf  add support for APPLICATION_ID  [Pali Rohár]
   - udf  add support for PUBLISHER_ID  [Pali Rohár]
   - udf  add support for multisession via session_offset hint  [Pali Rohár]
   - udf  add support for unclosed sequential Write-Once media  [Pali Rohár]
   - udf  check that dstrings are encoded in OSTA Compressed Unicode  [Pali Rohár]
   - udf  update test output for APPLICATION_ID and PUBLISHER_ID  [Pali Rohár]
   - use /sys to read all block devices  [Karel Zak]
libfdisk:
   - (dos) fix last possible sector calculation  [Karel Zak]
   - (gpt) make sure device is large enough  [Karel Zak]
   - (gpt) reduce number of entries to fit small device  [Karel Zak]
   - (gpt) returns location of the backup header too  [Karel Zak]
   - (script) don't use sector size if not specified  [Karel Zak]
   - (script) fix possible memory leaks  [Karel Zak]
   - (script) fix possible partno overflow  [Karel Zak]
   - (script) ignore empty values for start and size  [Gaël PORTAY]
   - (script) make sure buffer is initialized  [Karel Zak]
   - (script) make sure label is specified  [Karel Zak]
   - (script) print bootable flag only when set  [Karel Zak]
   - Include table-length in first-lba checks  [Samuel Dionne-Riel]
   - add "Linux /usr" and "Linux /usr verity" GPT partition types  [nl6720]
   - add systemd-homed user's home GPT partition type  [nl6720]
   - another parse_line_nameval() cleanup  [Karel Zak]
   - do not reset default if undefined by script  [Karel Zak]
   - fix fdisk_reread_changes() for extended partitions  [Karel Zak]
   - fix last free sector detection if partition size specified  [Karel Zak]
   - fix typo from 255f5f4c770ebd46a38b58975bd33e33ae87ed24  [Karel Zak]
   - ignore 33553920 byte optimal I/O size  [Ryan Finnie]
   - make fdisk_partname() more robust  [Karel Zak]
   - make labels allocations readable for analysers [coverity scan]  [Karel Zak]
   - reset context FD on error  [yangzz-97]
   - support partition type name parsing  [Karel Zak]
   - use lib/jsonwrt.c for JSON formatting  [Karel Zak]
   - use open(O_EXCL) to detect if device is used  [Karel Zak]
libmount:
   - (optstr) improve default initialization  [Karel Zak]
   - (python) fix compiler warning  [Karel Zak]
   - Fix 0x%u usage  [Dr. David Alan Gilbert]
   - add assert() to umount lookup code  [Karel Zak]
   - add mnt_table_over_fs()  [Karel Zak]
   - add vboxsf, virtiofs to pseudo filesystems  [Shahid Laher]
   - allow --read-only for not-root users  [Karel Zak]
   - do not canonicalize ZFS source dataset  [Karel Zak]
   - do not use pointer as an integer value  [Sami Kerola]
   - don't use "symfollow" for helpers on user mounts  [Karel Zak]
   - don't use deprecated security_context_t  [Karel Zak]
   - fix /{etc,proc}/filesystems use  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - fix memory leak [coverity scan]  [Karel Zak]
   - fix tab parser for badly terminated lines  [Karel Zak]
   - improve mnt_split_optstr() performance  [Karel Zak]
   - mark entries from /proc/swaps by MNT_FS_SWAP  [Karel Zak]
   - mnt_table_over_fs() make child optional  [Karel Zak]
   - optimize mnt_optstr_apply_flags()  [Karel Zak]
   - remove read-mountinfo workaround  [Karel Zak]
libmount (verity):
   - let crypt_deactivate_by_name handle its own data structure  [Luca Boccassi]
   - plug libcryptsetup logger into our logging system  [Luca Boccassi]
libsmartcols:
   - add comments to private header file  [Karel Zak]
   - add sort function to the sample  [Karel Zak]
   - don't print empty output on empty table in JSON  [Karel Zak]
   - fix colors use  [Karel Zak]
   - introduce default sort column  [Karel Zak]
   - make buffers append function more robust  [Karel Zak]
   - remove unnecessary code  [Karel Zak]
   - sanitize variable names on export output  [Karel Zak]
   - support arrays for JSON output  [Karel Zak]
   - use lib/jsonwrt.c for JSON  [Karel Zak]
   - print title color only when wanted  [Karel Zak]
libuuid:
   - check quality of random bytes  [Samanta Navarro]
   - improve "restrict" keyword use  [Karel Zak]
   - simplify uuid_is_null() check  [Sami Kerola]
login:
   - add initialize() function to have less stack allocated in main()  [Sami Kerola]
   - add option to not reset username on each attempt  [Thayne McCombs]
   - close() only a file descriptor that is open [coverity scan]  [Sami Kerola]
   - ensure getutxid() does not use uninitialized variable [coverity scan]  [Sami Kerola]
   - fix coding style issues  [Sami Kerola]
   - fix compiler warning [-Werror=strict-prototypes]  [Karel Zak]
   - move generic setting to ttyutils.h  [Karel Zak]
   - move getlogindefs_num() after localization init  [Sami Kerola]
   - move message printing out from main()  [Sami Kerola]
   - move proctitle code to login.c  [Karel Zak]
   - move timeout from global to local scope  [Sami Kerola]
   - replace function like definitions with inline functions  [Sami Kerola]
   - stop keeping timeout message in memory forever  [Sami Kerola]
   - tidy up manual page  [Sami Kerola]
   - use calloc() when memory needs to be cleared  [Sami Kerola]
   - use close_range() system call when possible  [Sami Kerola]
   - use explicit_bzero() to get rid of confidential memory  [Sami Kerola]
   - use full tty path for PAM_TTY  [Karel Zak]
   - use mem2strcpy() rather than rely on printf()  [Karel Zak]
   - use sig_atomic_t type for variable accessed from signal handler  [Sami Kerola]
   - use system definitions to determine maximum login name length  [Sami Kerola]
   - use ul_copy_file  [Egor Chelak]
   - use xalloc memory allocation helpers everywhere  [Sami Kerola]
login-utils:
   - don't use deprecated security_context_t  [Karel Zak]
loopdev:
   - use LOOP_CONFIG ioctl  [Sinan Kaya]
losetup:
   - avoid infinite busy loop  [Karel Zak]
   - fix wrong printf() format specifier for ino_t data type  [Manuel Bentele]
   - increase limit of setup attempts  [Karel Zak]
lsblk:
   - add --width option  [Karel Zak]
   - add FSROOTS column  [Karel Zak]
   - add dependence between CD/DVD block and packet devices  [Karel Zak]
   - add lscpu_read_topology_polarization()  [Karel Zak]
   - fix -T optional argument  [Karel Zak]
   - fix SCSI_IDENT_SERIAL  [Karel Zak]
   - fix filesystem array allocation  [Karel Zak]
   - ignore only loopdevs without backing file  [Karel Zak]
   - print all device mountpoints  [Karel Zak]
   - print zero rather than empty SIZE  [Karel Zak]
   - read ID_SCSI_IDENT_SERIAL if available  [Karel Zak]
   - read SCSI_IDENT_SERIAL also from udev  [Karel Zak]
   - show all empty, except loopdevs  [Karel Zak]
   - update man page  [Karel Zak]
   - use MOUNTPOINTS in --fs  [Karel Zak]
   - use MOUNTTARGETS in default output  [Karel Zak]
lscpu:
   - (arm) reuse parsed vendor ID  [Karel Zak]
   - (cpuinfo) fill empty cputype  [Karel Zak]
   - (cpuinfo) rewrite parser  [Karel Zak]
   - (cputype) add cpuinfo parser  [Karel Zak]
   - (cputype) add debug stuff  [Karel Zak]
   - (cputype) add header file, cleanup patterns code  [Karel Zak]
   - (cputype) add ref-counting, allocate context  [Karel Zak]
   - (cputype) move temporary stuff  [Karel Zak]
   - (cputype) simplify cpuinfo parsing  [Karel Zak]
   - (topology) add read_address()  [Karel Zak]
   - (topology) add read_configure()  [Karel Zak]
   - (topology) add read_mhz()  [Karel Zak]
   - (topology) read caches from /sys  [Karel Zak]
   - (virt) add macros for VMWARE  [Karel Zak]
   - (virt) simplify hypervisor parsing  [Karel Zak]
   - Adapt MIPS cpuinfo  [Karel Zak]
   - Add FUJITSU aarch64 A64FX cpupart  [Shunsuke Nakamura]
   - Even more Arm part numbers  [Jeremy Linton]
   - Replace space with tabs  [Bader Zaidan]
   - add LSCPU_OUTPUT_ enum  [Karel Zak]
   - add MHZ column  [Karel Zak]
   - add MHZ to the -e output  [Karel Zak]
   - add another part of summary output  [Karel Zak]
   - add extra caches to --cache output  [Karel Zak]
   - add function to count caches size  [Karel Zak]
   - add functions to get CPU freq  [Karel Zak]
   - add helper to get physical sockets  [Masayoshi Mizuma]
   - add info that caches sizes are sum  [Karel Zak]
   - add lscpu_cpu to internal API  [Karel Zak]
   - add lscpu_cpus_loopup_by_type(), improve readability  [Karel Zak]
   - add lscpu_read_architecture()  [Karel Zak]
   - add lscpu_read_cpulists()  [Karel Zak]
   - add lscpu_read_extra()  [Karel Zak]
   - add lscpu_read_numas()  [Karel Zak]
   - add lscpu_read_topolgy_ids()  [Karel Zak]
   - add lscpu_read_topology()  [Karel Zak]
   - add lscpu_read_virtualization()  [Karel Zak]
   - add lscpu_read_vulnerabilities()  [Karel Zak]
   - add note about cache IDs  [Karel Zak]
   - add per type summary function  [Karel Zak]
   - add rest of summary  [Karel Zak]
   - add sections  [Karel Zak]
   - add setsize to lscpu context  [Karel Zak]
   - add shared cached info for s390 lscpu -C  [Karel Zak]
   - add very basic cputype code  [Karel Zak]
   - assume L1d, L1i, L2, L3 for sparc  [Karel Zak]
   - assume gaps in list of CPUs  [Karel Zak]
   - avoid segfault on PowerPC systems with valid hardware configurations  [Thomas Abraham]
   - calculate threads number from type specific values  [Karel Zak]
   - cleanup --cache  [Karel Zak]
   - cleanup --parse  [Karel Zak]
   - cleanup -e  [Karel Zak]
   - cleanup lscpu_unref_cputype()  [Karel Zak]
   - cleanup tab vs. space  [Karel Zak]
   - cleanup arch freeing  [Karel Zak]
   - convert ARM decoding to new API  [Karel Zak]
   - convert getopt block to new API  [Karel Zak]
   - deallocate maps  [Karel Zak]
   - don't use section for extra caches  [Karel Zak]
   - don't use smbios when read snapshots  [Karel Zak]
   - fix "caches" header  [Karel Zak]
   - fix MHZ parsing  [Karel Zak]
   - fix NUMAs reading code  [Karel Zak]
   - fix NVIDIA ARM hw implementer spelling case  [Ville Skyttä]
   - fix for sparc64  [Karel Zak]
   - fix last caches separator in -e and -p output  [Karel Zak]
   - fix mem-leak in cpu  [Karel Zak]
   - fix memory leaks  [Karel Zak]
   - fix possible null dereferences [coverity scan]  [Karel Zak]
   - fix resource leak [coverity scan]  [Karel Zak]
   - fix variable shadowing  [Sami Kerola]
   - generate cache ID if not available  [Karel Zak]
   - hide all to lscpu_read_topology()  [Karel Zak]
   - improve bogomips use  [Karel Zak]
   - improve debug message  [Karel Zak]
   - improve topology calculation  [Karel Zak]
   - improve topology calculation, use /proc/sysinfo  [Karel Zak]
   - improve topology debug message  [Karel Zak]
   - keep hypervisor name in allocated memory  [Karel Zak]
   - keep static/dynamic MHz in cputype struct  [Karel Zak]
   - merge new API to lscpu.h  [Karel Zak]
   - move debug initialization to main  [Karel Zak]
   - move to main function to init context  [Karel Zak]
   - move topology stuff to separate file  [Karel Zak]
   - new cpuinfo parser  [Karel Zak]
   - print generic part of the summary  [Karel Zak]
   - read Sparc caches files  [Karel Zak]
   - recognize more ARM implementers  [Ville Skyttä]
   - remove obsolete code  [Karel Zak]
   - remove unnecessary prefix from static function  [Karel Zak]
   - remove unused code  [Karel Zak]
   - remove unused function  [Karel Zak]
   - report also number of cache instances  [Karel Zak]
   - show the number of physical socket on aarch64 machine without ACPI PPTT  [Masayoshi Mizuma]
   - sort extra caches  [Karel Zak]
   - split output to sections  [Karel Zak]
   - support +list for -e, -p and -C  [Karel Zak]
   - support s390 cpuinfo processor-per-line format  [Karel Zak]
   - temporary commit  [Karel Zak]
   - update tests  [Karel Zak]
   - use SMBIOS tables on ARM for lscpu  [Jeffrey Bastian]
   - use cache ID, keep caches independent on CPU type  [Karel Zak]
   - use cluster on aarch64 machine which doesn't have ACPI PPTT  [Masayoshi Mizuma]
   - use constants from new API  [Karel Zak]
   - use new code to read CPUs info  [Karel Zak]
   - use size_t for counters  [Karel Zak]
   - use size_t for ncolumns  [Karel Zak]
lscpu-arm:
   - Add "BIOS Vendor ID" and "BIOS Model name" to show the SMBIOS information.  [Masayoshi Mizuma]
lscpu-dmi:
   - Move some functions related to DMI to lscpu-dmi  [Masayoshi Mizuma]
lscpu-virt:
   - fix return type of read_hypervisor_cpuid for non x86.  [Érico Rolim]
   - split hypervisor_from_dmi_table()  [Masayoshi Mizuma]
lsipc:
   - make default output byte sizes to be in human units  [Sami Kerola]
lsirq:
   - fix resources leak [coverity scan]  [Karel Zak]
lslogins:
   - call close() for usable FD [coverity scan]  [Karel Zak]
   - non-Linux fix  [Karel Zak]
lsmem:
   - use ul_path_readf_string() readable for analysers [coverity scan]  [Karel Zak]
lsns:
   - add columns for parent namespaces and owner namespaces  [Masatake YAMATO]
man:
   - add ioctl_ns(2) to SEE ALSO of lsns(2)  [Masatake YAMATO]
   - add missing backslash to caret printing macro  [Sami Kerola]
   - make tilde and caret characters to render correctly  [Sami Kerola]
manpages:
   - fix "The example command" in AVAILABILITY section  [Chris Hofstaedtler]
mesg:
   - use only stat() to get the current terminal status  [Karel Zak]
meson:
   - add irq utils  [Karel Zak]
   - add missing HAVE_ definitions  [Karel Zak]
   - add second build system  [Zbigniew Jędrzejewski-Szmek]
   - fix systemd dependence  [Karel Zak]
   - generate man pages from asciidoc  [Karel Zak]
   - implement building of static programs  [Zbigniew Jędrzejewski-Szmek]
   - port localstatedir and sysconfdir  [Karel Zak]
   - update configuration  [Karel Zak]
   - update for new hardlink  [Karel Zak]
   - update sources and dependencies  [Karel Zak]
misc:
   - fix typos  [Samanta Navarro]
   - fix typos [codespell]  [Samanta Navarro]
mkfs.minix:
   - add --lock and LOCK_BLOCK_DEVICE  [Karel Zak]
mkswap:
   - add --verbose, reduce extents check output  [Karel Zak]
   - check for holes and unwanted extents in file  [Karel Zak]
   - cleanup usage()  [Karel Zak]
   - don't use deprecated security_context_t  [Karel Zak]
   - improve extents check  [Karel Zak]
   - remove deprecated SELinux matchpathcon()  [Karel Zak]
   - remove unnecessary on FS_IOC_FIEMAP  [Karel Zak]
   - remove unused variable when compile without libblkid  [Karel Zak]
   - tell how to fix insecure permissions and owner in warning  [Sami Kerola]
more:
   - fix ARROW_DOWN and PAGE_DOWN behaviour to not skip lines  [Hannes Müller]
   - fix command 'f' (screen forward) behaviour  [Hannes Müller]
   - fix floating point exception core dump  [Sami Kerola]
   - improve error messaging when input file is directory  [Sami Kerola]
mount:
   - Add support for "nosymfollow" mount option.  [Mattias Nissler]
mount, umount:
   - restore environ[] after suid drop  [Karel Zak]
mount.a.adoc:
   - Fix markup  [Mario Blättermann]
mountpoint:
   - different exit status for errors and non-mountpoint situation  [Karel Zak]
nologin:
   - use ul_copy_file  [Egor Chelak]
nsenter / switch_root:
   - fix insecure chroot [coverity scan]  [Sami Kerola]
pg:
   - fix wcstombs() use  [Karel Zak]
po:
   - add ko.po (from translationproject.org)  [Seong-ho Cho]
   - add sr.po (from translationproject.org)  [Мирослав Николић]
   - add xgettext hint for non-c-format string  [Karel Zak]
   - merge changes  [Karel Zak]
   - update  [Karel Zak]
   - update cs.po (from translationproject.org)  [Petr Písař]
   - update de.po (from translationproject.org)  [Mario Blättermann]
   - update es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - update fr.po (from translationproject.org)  [Frédéric Marchal]
   - update hr.po (from translationproject.org)  [Božidar Putanec]
   - update pl.po (from translationproject.org)  [Jakub Bogusz]
   - update pt.po (from translationproject.org)  [Pedro Albuquerque]
   - update sv.po (from translationproject.org)  [Sebastian Rasmussen]
   - update uk.po (from translationproject.org)  [Yuri Chornoivan]
   - use msgmerge --previous  [Karel Zak]
po-man:
   - Add (incomplete) de.po for testing purposes  [Mario Blättermann]
   - Add po-man/README.md  [Mario Blättermann]
   - Adjust paths in po4a.cfg and update .pot file  [Mario Blättermann]
   - Fix the example man page  [Mario Blättermann]
   - Fix typos in de.po and po4a.cfg  [Mario Blättermann]
   - Fix typos in po-man/README.md  [Mario Blättermann]
   - Move Po4a config file and translation template to po-man  [Mario Blättermann]
   - Update the example man page  [Mario Blättermann]
prlimit:
   - fix optional arguments parsing  [Karel Zak]
   - make code more robust  [Karel Zak]
pylibmount:
   - PyEval_Call* is deprecated, use PyObject_Call*  [Karel Zak]
read_all:
   - return 0 when EOF occurs after 0 bytes  [Egor Chelak]
readprofile:
   - fix static analyzer warning [coverity scan]  [Karel Zak]
rfkill:
   - add "toggle" command  [Karel Zak]
   - fix compiler warning [-Wformat=]  [Karel Zak]
   - fix compiler warning [-Wsign-compare]  [Karel Zak]
   - fix static analyzer warning [coverity scan]  [Karel Zak]
   - make RFKILL_EVENT_SIZE_V1 use more portable  [Karel Zak]
   - stop execution when rfkill device cannot be opened  [Sami Kerola]
rtcwake:
   - fix time_t=long assumptions  [Karel Zak]
script:
   - cleanup --echo  [Soumendra Ganguly]
   - don't use strings from user as printf-format [coverity scan]  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - fix time_t=long assumptions  [Karel Zak]
   - improve I/O return code checks  [Soumendra Ganguly]
   - kill child process on error  [Karel Zak]
scriptlive:
   - (man) add missing parenthesis  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
scriptplay:
   - fix time_t=long assumptions  [Karel Zak]
scriptreplay:
   - enable special character handling  [Soumendra Ganguly]
setpriv:
   - allow using [-+]all for capabilities.  [Érico Rolim]
   - small clean-up.  [Érico Rolim]
sfdisk:
   - (docs) add more information about GPT attribute bits  [Karel Zak]
   - correct --json --dump false exclusive  [Dimitri John Ledkov]
   - disable bootbits protection on '--wipe always'  [Karel Zak]
   - do not free device name too soon [coverity scan]  [Sami Kerola]
   - fix backward --move-data  [Karel Zak]
   - fix resources leak [coverity scan]  [Karel Zak]
   - support for type="partition type name"  [Karel Zak]
su:
   - (pty) change owner and mode for pty  [Karel Zak]
   - explicitly enable echo for --pty  [Karel Zak]
   - fix man page typos  [Štěpán Němec]
   - remove useless assignment  [Karel Zak]
   - use full tty path for PAM_TTY  [Karel Zak]
swapon:
   - Keep headings and fields aligned in summary output.  [Sebastian Rasmussen]
switch_root:
   - check if mount point to move even exists  [Thomas Deutschmann]
   - fix double close [coverity scan]  [Karel Zak, Sami Kerola]
sys-utils:
   - mount.8  fix a typo  [Eric Biggers]
test_uuid_parser:
   - fix time_t=long assumptions  [Karel Zak]
tests:
   - (blkid) add erofs image  [Karel Zak]
   - (blkid) add support for multisession images  [Karel Zak]
   - (fileutils) remove unused code  [Karel Zak]
   - (ul) remove another 'dim' input  [Karel Zak]
   - add a fuzz target calling fdisk_script_read_file  [Evgeny Vereshchagin]
   - add a fuzzer for mnt_table_parse_stream  [Evgeny Vereshchagin]
   - add a fuzzer for process_wtmp_file  [Evgeny Vereshchagin]
   - add checksum for cramfs/mkfs for LE 16384 (ia64)  [Anatoly Pugachev]
   - add sfdisk test for 4fe7f9b614e2b5bb97f6d89af02acb867cffccc1  [Karel Zak]
   - add testcases that triggered various crashes  [Evgeny Vereshchagin]
   - an attempt to get around https://github.com/karelzak/util-linux/issues/1110  [Evgeny Vereshchagin]
   - be explicit with file permissions for cramfs  [Karel Zak]
   - cover the code parsing comments  [Evgeny Vereshchagin]
   - don't rely on scsi_debug partitions  [Karel Zak]
   - dump more information about CFS and block devices  [Karel Zak]
   - improve u64 use in ipcs test  [Karel Zak]
   - integrate test_last_fuzz into the testsuite  [Evgeny Vereshchagin]
   - integrate test_mount_fuzz into the testsuite  [Evgeny Vereshchagin]
   - make it compatible with meson  [Karel Zak]
   - mark ul/basic as KNOWN_FAIL  [Karel Zak]
   - migrate from ext3 to ext2  [Karel Zak]
   - mkfs-endianness test use iflag=fullblock to fill block completely with string  [Masami Ichikawa]
   - mkfs-endianness test uses prepared test data  [Masami Ichikawa]
   - move misc/ul to ul/ directory  [Sami Kerola]
   - pack testcases into zip archives  [Evgeny Vereshchagin]
   - remove ul(1) 'dim' input  [Karel Zak]
   - set shmmni to 32k  [Karel Zak]
   - skip hwclock/systohc on GH Actions  [Karel Zak]
   - small change to the lsns/ioctl_ns  [Karel Zak]
   - suggest "make check-programs"  [Karel Zak]
   - take exit codes into account  [Evgeny Vereshchagin]
   - update JSON outputs  [Karel Zak]
   - update atari blkid tests  [Karel Zak]
   - update atari partx tests  [Karel Zak]
   - update blkid output for iso/udf  [Karel Zak]
   - update build test results  [Karel Zak]
   - update build tests  [Karel Zak]
   - update fdisk dumps  [Karel Zak]
   - update hardlink tests  [Karel Zak]
   - update libfdisk JSON outputs  [Karel Zak]
   - update lscpu output  [Karel Zak]
   - update mountpoint return code check  [Karel Zak]
   - update mountpoint tests  [Karel Zak]
   - update script(1) return code  [Karel Zak]
   - update sfdisk wipe tests  [Karel Zak]
   - update sparc lscpu tests  [Karel Zak]
   - update swaplabel.err  [Karel Zak]
tests/run:
   - create failure directory  [Zbigniew Jędrzejewski-Szmek]
text-utils:
   - correctly detect ASan under clang  [Frantisek Sumsal]
tools:
   - add missing stuff to Makefile.am  [Karel Zak]
   - make it possible to set all the fuzzing flags with config-gen  [Evgeny Vereshchagin]
   - replace checkmans.sh with adoc scripts  [Karel Zak]
   - use libcryptsetup in config-gen.d/all.conf  [Karel Zak]
travis:
   - cleanup before autogen  [Karel Zak]
   - disable OSX for now  [Karel Zak]
   - remove old ubuntu  [Karel Zak]
   - set CXX correctly  [Evgeny Vereshchagin]
   - stop building fuzz targets on macOS  [Evgeny Vereshchagin]
   - try update to xcode10.1  [Karel Zak]
   - turn off libmount on OSX  [Evgeny Vereshchagin]
   - turn on --enable-fuzzing-engine  [Evgeny Vereshchagin]
   - use verbose mode (V=1) for make  [Karel Zak]
ttymsg:
   - fix resource leak [coverity scan]  [Karel Zak]
uclampset:
   - Add man page  [Qais Yousef]
   - Plumb in bash-completion  [Qais Yousef]
   - Plumb into the build system  [Qais Yousef]
   - cleanup --help output  [Karel Zak]
ul:
   - add a term capabilities tracking structure  [Sami Kerola]
   - add basic tests  [Sami Kerola]
   - fix use of unsigned number  [Karel Zak]
   - flip comparisons to lesser to greater order  [Sami Kerola]
   - free most allocations ncurses did during setupterm()  [Sami Kerola]
   - improve function and variable names  [Sami Kerola]
   - make set_column() zero check more obvious  [Sami Kerola]
   - remove function like putwp preprocessor define  [Sami Kerola]
   - remove function prototypes  [Sami Kerola]
   - rename enumerated mode symbols  [Sami Kerola]
   - replace global runtime variables with a control structure  [Sami Kerola]
   - small coding changes  [Karel Zak]
   - tidy up coding style  [Sami Kerola]
   - use size_t to measure memory allocation size  [Sami Kerola]
ul_copy_file:
   - add test program  [Egor Chelak]
   - handle EAGAIN and EINTR  [Egor Chelak]
   - make defines for return values  [Egor Chelak]
   - use BUFSIZ for buffer size  [Egor Chelak]
   - use all_read/all_write  [Egor Chelak]
   - use sendfile  [Egor Chelak]
umount:
   - ignore --no-canonicalize,-c for non-root users  [Karel Zak]
   - support over-mounts for --recursive  [Karel Zak]
unshare:
   - Fix error message when setting proc mount propagation  [Johan Herland]
   - fix bad bit shift operation [coverity scan]  [Sami Kerola]
utmpdup:
   - Ensure flushing when using follow flag  [Andrew Shapiro]
uuidd:
   - add command-line option values struct  [Sami Kerola]
   - add uuidd specific data types that are used in protocol  [Sami Kerola]
   - document uuidd protocol  [Sami Kerola]
   - fix misleading indentation  [Sami Kerola]
   - make timeout take effect when debug is not defined  [Sami Kerola]
   - move option parsing to separate function  [Sami Kerola]
   - override operation type when performing bulk request  [Sami Kerola]
   - remove unnecessary bulk request size limit  [Sami Kerola]
   - reorder bulk time and random generation code  [Sami Kerola]
   - use pid_t type when referring to process id  [Sami Kerola]
uuidgen:
   - give hint in usage() what uuid namespaces can be used  [Sami Kerola]
   - use errx() rather than fprintf() when printing errors  [Sami Kerola]
uuidparse:
   - use libuuid function to test nil uuid  [Sami Kerola]
   - use uuid type definitions from libuuid header  [Sami Kerola]
vipw:
   - fix short write handling in copyfile  [Egor Chelak]
   - move copyfile to the lib  [Egor Chelak]
whereis:
   - add --disable-whereis to configure  [Samanta Navarro]
   - add lib32 directories  [Samanta Navarro]
   - do not ignore trailing numbers  [Samanta Navarro]
   - do not strip suffixes  [Samanta Navarro]
   - extend test case  [Samanta Navarro]
   - filter bin, man and src differently  [Samanta Navarro]
   - fix out of boundary read  [Samanta Navarro]
   - support zst compressed man pages  [Samanta Navarro]
wipefs:
   - (man) add hint to erase on partitions and disk  [Karel Zak]
   - fix compiler warning  [Karel Zak]
zramctl:
   - (man) fix streams default number  [Karel Zak]


^ permalink raw reply	[relevance 5%]

* [PATCH v27 06/31] x86/cet: Add control-protection fault handler
  @ 2021-05-21 22:11  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-05-21 22:11 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works similarly to the general
protection fault handler.  It delivers SIGSEGV to the faulting task with
si_code set to SEGV_CPERR.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.
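
As an illustration of the user-visible side described in the commit
message above: a minimal userspace sketch of a SIGSEGV handler that
tells a control-protection fault apart from other segfaults by looking
at si_code.  This is not part of the patch.  The helper names
(segv_reason, on_segv, install_segv_handler) are hypothetical, and the
fallback value 10 for SEGV_CPERR is taken from the siginfo.h hunk in
this posting; installed headers may differ.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <unistd.h>

/*
 * Assumption: value copied from the siginfo.h hunk of this patch.
 * Only used when the installed headers do not define SEGV_CPERR.
 */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Hypothetical helper: map a SIGSEGV si_code to a short description. */
static const char *segv_reason(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control protection fault";
	default:
		return "other";
	}
}

/* SA_SIGINFO-style handler: report the reason and terminate. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
	const char *why = segv_reason(info->si_code);

	(void)sig;
	(void)ucontext;

	/* write() is async-signal-safe; printf() is not. */
	write(STDERR_FILENO, why, strlen(why));
	write(STDERR_FILENO, "\n", 1);
	_exit(1);
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGSEGV, &sa, NULL);
}
```

A program would call install_segv_handler() once at startup.  The
handler exits rather than returning, since returning from a handler for
a synchronous fault normally re-executes the faulting instruction.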

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 73d45b0dfff2..74366706c994 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index d552f177eca0..bddeeb88b416 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 0e5d0a7e203b..8788484bef1f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 6);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 853ea7a80806..52f7d23d96e6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -607,6 +608,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index 03d6f6d2c1fe..020badf91ea4 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -234,7 +234,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
  2021-05-13 18:47  3% ` [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users Mike Rapoport
  2021-05-14  9:27  0%   ` David Hildenbrand
@ 2021-05-18 10:24  0%   ` Mark Rutland
  1 sibling, 0 replies; 200+ results
From: Mark Rutland @ 2021-05-18 10:24 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, Elena Reshetova,
	H. Peter Anvin, Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	Kees Cook, Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Michal Hocko, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, Yury Norov,
	linux-api, linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86

On Thu, May 13, 2021 at 09:47:32PM +0300, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> It is unsafe to allow saving of secretmem areas to the hibernation
> snapshot as they would be visible after the resume and this essentially
> will defeat the purpose of secret memory mappings.
> 
> Prevent hibernation whenever there are active secret memory users.

Have we thought about how this is going to work in practice, e.g. on
mobile systems? It seems to me that there are a variety of common
applications which might want to use this which people don't expect to
inhibit hibernate (e.g. authentication agents, web browsers).

Are we happy to say that any userspace application can incidentally
inhibit hibernate?

Thanks,
Mark.

> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Elena Reshetova <elena.reshetova@intel.com>
> Cc: Hagen Paul Pfeifer <hagen@jauu.net>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Palmer Dabbelt <palmerdabbelt@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Will Deacon <will@kernel.org>
> ---
>  include/linux/secretmem.h |  6 ++++++
>  kernel/power/hibernate.c  |  5 ++++-
>  mm/secretmem.c            | 15 +++++++++++++++
>  3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
> index e617b4afcc62..21c3771e6a56 100644
> --- a/include/linux/secretmem.h
> +++ b/include/linux/secretmem.h
> @@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page)
>  }
>  
>  bool vma_is_secretmem(struct vm_area_struct *vma);
> +bool secretmem_active(void);
>  
>  #else
>  
> @@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page)
>  	return false;
>  }
>  
> +static inline bool secretmem_active(void)
> +{
> +	return false;
> +}
> +
>  #endif /* CONFIG_SECRETMEM */
>  
>  #endif /* _LINUX_SECRETMEM_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index da0b41914177..559acef3fddb 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -31,6 +31,7 @@
>  #include <linux/genhd.h>
>  #include <linux/ktime.h>
>  #include <linux/security.h>
> +#include <linux/secretmem.h>
>  #include <trace/events/power.h>
>  
>  #include "power.h"
> @@ -81,7 +82,9 @@ void hibernate_release(void)
>  
>  bool hibernation_available(void)
>  {
> -	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
> +	return nohibernate == 0 &&
> +		!security_locked_down(LOCKDOWN_HIBERNATION) &&
> +		!secretmem_active();
>  }
>  
>  /**
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 1ae50089adf1..7c2499e4de22 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
>  MODULE_PARM_DESC(secretmem_enable,
>  		 "Enable secretmem and memfd_secret(2) system call");
>  
> +static atomic_t secretmem_users;
> +
> +bool secretmem_active(void)
> +{
> +	return !!atomic_read(&secretmem_users);
> +}
> +
>  static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>  {
>  	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> @@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
>  	.fault = secretmem_fault,
>  };
>  
> +static int secretmem_release(struct inode *inode, struct file *file)
> +{
> +	atomic_dec(&secretmem_users);
> +	return 0;
> +}
> +
>  static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>  	unsigned long len = vma->vm_end - vma->vm_start;
> @@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
>  }
>  
>  static const struct file_operations secretmem_fops = {
> +	.release	= secretmem_release,
>  	.mmap		= secretmem_mmap,
>  };
>  
> @@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
>  	file->f_flags |= O_LARGEFILE;
>  
>  	fd_install(fd, file);
> +	atomic_inc(&secretmem_users);
>  	return fd;
>  
>  err_put_fd:
> -- 
> 2.28.0
> 

^ permalink raw reply	[relevance 0%]

* [PATCH v20 7/7] secretmem: test: add basic selftest for memfd_secret(2)
                     ` (4 preceding siblings ...)
  2021-05-18  7:20  3% ` [PATCH v20 6/7] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
@ 2021-05-18  7:20  2% ` Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
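
For readers who want to try the first property from the description
above without building the whole selftest, here is a minimal standalone
sketch.  It is not part of the patch: the helper name
secret_fd_blocks_file_io() is hypothetical, and it assumes the installed
headers define __NR_memfd_secret (otherwise it reports "unavailable").

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the selftest.
 *
 * Returns  1 if read() and write() on a memfd_secret descriptor both fail,
 *          0 if either of them unexpectedly succeeds,
 *         -1 if memfd_secret cannot be used here (old headers, ENOSYS,
 *            or any other failure to create the descriptor).
 */
static int secret_fd_blocks_file_io(void)
{
#ifdef __NR_memfd_secret
	char buf[64] = { 0 };
	int blocked;
	int fd;

	fd = (int)syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return -1;

	/* Both calls are expected to fail on a secretmem descriptor. */
	blocked = read(fd, buf, sizeof(buf)) < 0 &&
		  write(fd, buf, sizeof(buf)) < 0;

	close(fd);
	return blocked ? 1 : 0;
#else
	return -1;	/* headers predate the syscall */
#endif
}
```

Only a return value of 0 indicates a problem; -1 corresponds to the
"skip" outcome in the selftest.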
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 1f651e85ed60..da92ded5a27c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,5 +21,6 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
 split_huge_page_test
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 73e1cc96d7c2..266580ea938c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -134,7 +135,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..93e7e7ffed33
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned int flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0


^ permalink raw reply related	[relevance 2%]

* [PATCH v20 6/7] arch, mm: wire up memfd_secret system call where relevant
                     ` (3 preceding siblings ...)
  2021-05-18  7:20  3% ` [PATCH v20 5/7] PM: hibernate: disable when there are active secretmem users Mike Rapoport
@ 2021-05-18  7:20  3% ` Mike Rapoport
  2021-05-18  7:20  2% ` [PATCH v20 7/7] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
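
Since the syscall is only wired up on some architectures, userspace
needs a way to tell whether it is there.  A minimal probe sketch
follows; it is not part of the patch.  The enum and function names are
hypothetical, and the fallback number 447 is an assumption taken from
the syscall tables in this posting, used only when the installed headers
lack __NR_memfd_secret.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 447 is the number assigned by the tables in this patch. */
#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447
#endif

/* Hypothetical result type for the probe. */
enum secret_probe {
	SECRET_AVAILABLE,	/* syscall succeeded */
	SECRET_NOT_WIRED,	/* ENOSYS: not wired up or not enabled */
	SECRET_OTHER_ERROR,	/* any other errno, left for the caller */
};

static enum secret_probe probe_memfd_secret(void)
{
	long fd = syscall(__NR_memfd_secret, 0);

	if (fd >= 0) {
		close((int)fd);
		return SECRET_AVAILABLE;
	}

	return errno == ENOSYS ? SECRET_NOT_WIRED : SECRET_OTHER_ERROR;
}
```

On SECRET_OTHER_ERROR, errno still holds the value set by the failed
syscall, so the caller can report it.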
 arch/arm64/include/uapi/asm/unistd.h   | 1 +
 arch/riscv/include/asm/unistd.h        | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h               | 1 +
 include/uapi/asm-generic/unistd.h      | 7 ++++++-
 scripts/checksyscalls.sh               | 4 ++++
 7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
 #define __ARCH_WANT_SET_GET_RLIMIT
 #define __ARCH_WANT_TIME32_SYSCALLS
 #define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
  */
 
 #define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <uapi/asm/unistd.h>
 
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 28a1423ce32e..e44519020a43 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -451,3 +451,4 @@
 444	i386	landlock_create_ruleset	sys_landlock_create_ruleset
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
+447	i386	memfd_secret		sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ecd551b08d05..a06f16106f24 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,7 @@
 444	common	landlock_create_ruleset	sys_landlock_create_ruleset
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
+447	common	memfd_secret		sys_memfd_secret
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1a1b5d724497 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
 asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_memfd_secret(unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..28b388368cf6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 447
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 448
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
 #define __IGNORE_setrlimit	/* setrlimit */
 #endif
 
+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
-- 
2.28.0



* [PATCH v20 5/7] PM: hibernate: disable when there are active secretmem users
                     ` (2 preceding siblings ...)
  2021-05-18  7:20  2% ` [PATCH v20 4/7] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
@ 2021-05-18  7:20  3% ` Mike Rapoport
  2021-05-18  7:20  3% ` [PATCH v20 6/7] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
  2021-05-18  7:20  2% ` [PATCH v20 7/7] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot, as they would be visible after resume, which would essentially
defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.
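
The gating logic can be sketched in plain userspace C. This is only a model
under stated assumptions: the model_* names are hypothetical, C11 atomics
stand in for the kernel's atomic_t, and the nohibernate and lockdown checks
are reduced to boolean arguments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's secretmem user counter. */
static atomic_int secretmem_users_model;

/* Models the point where memfd_secret() hands out a new file. */
static void model_secretmem_open(void)
{
	atomic_fetch_add(&secretmem_users_model, 1);
}

/* Models the ->release() callback run when the file goes away. */
static void model_secretmem_release(void)
{
	atomic_fetch_sub(&secretmem_users_model, 1);
}

/* True while at least one secretmem file is still alive. */
static bool model_secretmem_active(void)
{
	return atomic_load(&secretmem_users_model) != 0;
}

/* Hibernation is refused as long as any secretmem user exists. */
static bool model_hibernation_available(bool nohibernate, bool locked_down)
{
	return !nohibernate && !locked_down && !model_secretmem_active();
}
```

In this model, hibernation becomes available again only once every open has
been matched by a release.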

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h |  6 ++++++
 kernel/power/hibernate.c  |  5 ++++-
 mm/secretmem.c            | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e617b4afcc62..21c3771e6a56 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page)
 }
 
 bool vma_is_secretmem(struct vm_area_struct *vma);
+bool secretmem_active(void);
 
 #else
 
@@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page)
 	return false;
 }
 
+static inline bool secretmem_active(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_SECRETMEM */
 
 #endif /* _LINUX_SECRETMEM_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include <linux/genhd.h>
 #include <linux/ktime.h>
 #include <linux/security.h>
+#include <linux/secretmem.h>
 #include <trace/events/power.h>
 
 #include "power.h"
@@ -81,7 +82,9 @@ void hibernate_release(void)
 
 bool hibernation_available(void)
 {
-	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+	return nohibernate == 0 &&
+		!security_locked_down(LOCKDOWN_HIBERNATION) &&
+		!secretmem_active();
 }
 
 /**
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 972cd1bbc3cc..f77d25467a14 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
 MODULE_PARM_DESC(secretmem_enable,
 		 "Enable secretmem and memfd_secret(2) system call");
 
+static atomic_t secretmem_users;
+
+bool secretmem_active(void)
+{
+	return !!atomic_read(&secretmem_users);
+}
+
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
 	.fault = secretmem_fault,
 };
 
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	atomic_dec(&secretmem_users);
+	return 0;
+}
+
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
 }
 
 static const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
 	.mmap		= secretmem_mmap,
 };
 
@@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
 	file->f_flags |= O_LARGEFILE;
 
+	atomic_inc(&secretmem_users);
 	fd_install(fd, file);
 	return fd;
 
 err_put_fd:
-- 
2.28.0



* [PATCH v20 4/7] mm: introduce memfd_secret system call to create "secret" memory areas
    2021-05-18  7:20  4% ` [PATCH v20 1/7] mmap: make mlock_future_check() global Mike Rapoport
  2021-05-18  7:20  3% ` [PATCH v20 3/7] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
@ 2021-05-18  7:20  2% ` Mike Rapoport
  2021-05-18  7:20  3% ` [PATCH v20 5/7] PM: hibernate: disable when there are active secretmem users Mike Rapoport
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and mapped neither
into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

Secretmem is designed to provide the following protections:

* Enhanced protection (in conjunction with all the other in-kernel
attack prevention systems) against ROP attacks. Secretmem makes "simple"
ROP insufficient to perform exfiltration, which increases the required
complexity of the attack. Along with other protections like the kernel
stack size limit and address space layout randomization, which make finding
gadgets really hard, the absence of any in-kernel primitive for accessing
secret memory means the one gadget ROP attack can't work. Since the only
way to access secret memory is to reconstruct the missing mapping entry,
the attacker has to recover the physical page and insert a PTE pointing to
it in the kernel and then retrieve the contents.  That takes at least three
gadgets which is a level of difficulty beyond most standard attacks.

* Prevent cross-process secret userspace memory exposures. Once the secret
memory is allocated, the user can't accidentally pass it into the kernel to
be transmitted somewhere. The secretmem pages cannot be accessed via the
direct map and they are disallowed in GUP.

* Harden against exploited kernel flaws. In order to access secretmem, a
kernel-side attack would need to either walk the page tables and create new
ones, or spawn a new privileged userspace process to perform secrets
exfiltration using ptrace.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). The
file descriptor approach allows explicit and controlled sharing of the
memory areas and allows sealing the operations. Besides, file descriptor
based memory paves the way for VMMs to remove the secret memory range from
the userspace hypervisor process, for instance QEMU. Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major work
   in the kernel, but getting vma-backed memory into a guest without
   mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because its purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to the
memory. Nowadays the cost of a new system call is negligible, while it is
much simpler for userspace to deal with a clear-cut system call than with a
multiplexer or an overloaded syscall. Moreover, the initial implementation
of memfd_secret() is completely distinct from memfd_create(), so there is
not much sense in overloading memfd_create() to begin with. If there is a
need for code sharing between these implementations, it can be easily
achieved without a need to adjust user visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

If a use case arises that requires exposing secretmem to the kernel, it
will be an opt-in request in the system call flags, so that the user has
to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause its fragmentation on
architectures that use large pages to map the physical memory, which affects
system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently from
mlock(), an attempt to mlock()/munlock() secretmem range would fail and
mlockall()/munlockall() will ignore secretmem mappings.
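
A caller can get a rough idea in advance of whether a mapping fits under the
limit. The helper below is a sketch: fits_memlock_limit() is a hypothetical
name, it compares only against the soft limit reported by getrlimit(), and
it ignores both memory the process has already locked and any privilege that
would let the process exceed the limit, so the kernel's own check remains
authoritative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Return true if a secretmem mapping of len bytes could fit under the
 * soft RLIMIT_MEMLOCK limit, ignoring memory that is already locked.
 */
static bool fits_memlock_limit(size_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return false;	/* cannot tell, be conservative */

	if (rl.rlim_cur == RLIM_INFINITY)
		return true;

	return len <= rl.rlim_cur;
}
```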

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space. With default limits, there is no excessive use of secretmem
and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in
the future this should be addressed to allow balanced use of large amounts
of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
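
A variant of the same sequence with error handling might look as follows.
It is a sketch under assumptions: no libc wrapper is assumed, so the call
goes through syscall(); the fallback number 447 is the one assigned in patch
6/7 of this series; secret_map() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* number assigned in patch 6/7 */
#endif

/* No libc wrapper is assumed; invoke the system call directly. */
static int memfd_secret(unsigned int flags)
{
	return (int)syscall(__NR_memfd_secret, flags);
}

/*
 * Create a secret mapping of len bytes. Returns NULL with errno set on
 * failure, for example ENOSYS when secretmem is not enabled.
 */
static void *secret_map(size_t len)
{
	void *ptr;
	int fd, saved;

	fd = memfd_secret(0);
	if (fd < 0)
		return NULL;

	if (ftruncate(fd, (off_t)len) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* an established mapping keeps the file alive */
	errno = saved;

	return ptr == MAP_FAILED ? NULL : ptr;
}
```

A caller would treat a NULL return as "secret memory unavailable" and decide
whether to fall back to ordinary memory or to fail.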

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  48 ++++++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   5 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  12 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 239 +++++++++++++++++++++++++++++++++++++
 8 files changed, 310 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..e617b4afcc62
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Using page_mapping() is quite slow because of the actual call
+	 * instruction and repeated compound_head(page) inside the
+	 * page_mapping() function.
+	 * We know that secretmem pages are not compound and LRU so we can
+	 * save a couple of cycles here.
+	 */
+	if (PageCompound(page) || !PageLRU(page))
+		return false;
+
+	mapping = (struct address_space *)
+		((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+
+	if (mapping != page->mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..4d7e377a74f3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -358,6 +358,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3420f5..6d0972db7278 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -901,4 +901,9 @@ config KMAP_LOCAL
 # struct io_mapping based helper.  Selected by drivers that need them
 config IO_MAPPING
 	bool
+
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+	select STRICT_DEVMEM
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index bf71e295e9f6..7bb6ed5e42e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -125,3 +125,4 @@ obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index 0697134b6a12..6515f82b0f32 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -816,6 +817,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -949,6 +953,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2077,6 +2084,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (!head)
 			goto pte_unmap;
 
+		if (unlikely(page_is_secretmem(page))) {
+			put_compound_head(head, 1, flags);
+			goto pte_unmap;
+		}
+
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_compound_head(head, 1, flags);
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index df590fda5688..5e9f4dea4e96 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..972cd1bbc3cc
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page);
+	clear_highpage(page);
+}
+
+const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		return PTR_ERR(secretmem_mnt);
+
+	/* prevent secretmem mappings from ever getting PROT_EXEC */
+	secretmem_mnt->mnt_flags |= MNT_NOEXEC;
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0



* [PATCH v20 3/7] set_memory: allow querying whether set_direct_map_*() is actually enabled
    2021-05-18  7:20  4% ` [PATCH v20 1/7] mmap: make mlock_future_check() global Mike Rapoport
@ 2021-05-18  7:20  3% ` Mike Rapoport
  2021-05-18  7:20  2% ` [PATCH v20 4/7] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page
table, replace several occurrences of open coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.
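
The override idiom can be modelled in standalone C. This is a sketch, not
the kernel code: the *_model variables are hypothetical stand-ins for the
arm64 boot-time settings, the arm64-style condition is inferred from the
open coded checks this patch replaces, and both "headers" are folded into
one file.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the arm64 boot-time settings. */
static bool rodata_full_model;
static bool debug_pagealloc_model;

/* "Architecture header": own implementation, advertised via a macro. */
static inline bool can_set_direct_map(void)
{
	return rodata_full_model || debug_pagealloc_model;
}
#define can_set_direct_map can_set_direct_map

/* "Generic header": stub used only if the architecture has no override. */
#ifndef can_set_direct_map
static inline bool can_set_direct_map(void)
{
	return true;
}
#endif

/* Caller in the style of the map_mem() check after this patch. */
static bool needs_page_granular_linear_map(bool crash_mem_map)
{
	return can_set_direct_map() || crash_mem_map;
}
```

Defining the macro to its own name lets the generic header detect the
override with a plain #ifndef, without a separate Kconfig symbol.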

[arnd@arndb.de: arm64: kfence: fix header inclusion ]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/kfence.h     |  2 +-
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 8 files changed, 43 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 52e5c1623224..4e3c13799735 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -131,12 +131,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..0f740b781187
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page);
+int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..0ec94e718724 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6dd9369e3ea0..e42aeff6c344 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -515,7 +516,7 @@ static void __init map_mem(pgd_t *pgdp)
 	 */
 	BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end));
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1483,8 +1484,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 92eccaf595c8..a3bacd79507a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -155,7 +160,7 @@ int set_direct_map_invalid_noflush(struct page *page)
 		.clear_mask = __pgprot(PTE_VALID),
 	};
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -170,7 +175,7 @@ int set_direct_map_default_noflush(struct page *page)
 		.clear_mask = __pgprot(PTE_RDONLY),
 	};
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -181,7 +186,7 @@ int set_direct_map_default_noflush(struct page *page)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -206,7 +211,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index fe1aa4e54680..f36be5166c19 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0



* [PATCH v20 1/7] mmap: make mlock_future_check() global
  @ 2021-05-18  7:20  4% ` Mike Rapoport
  2021-05-18  7:20  3% ` [PATCH v20 3/7] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-18  7:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

It will be used by the upcoming secret memory implementation.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 mm/internal.h | 3 +++
 mm/mmap.c     | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..46eb82eaa195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -373,6 +373,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 extern void mlock_vma_page(struct page *page);
 extern unsigned int munlock_vma_page(struct page *page);
 
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+			      unsigned long len);
+
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
  * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0584e540246e..81f5595a8490 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1352,9 +1352,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
 	return hint;
 }
 
-static inline int mlock_future_check(struct mm_struct *mm,
-				     unsigned long flags,
-				     unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+		       unsigned long len)
 {
 	unsigned long locked, lock_limit;
 
-- 
2.28.0



* Re: [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant
  2021-05-13 18:47  3% ` [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
@ 2021-05-14  9:27  0%   ` David Hildenbrand
  0 siblings, 0 replies; 200+ results
From: David Hildenbrand @ 2021-05-14  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	Elena Reshetova, H. Peter Anvin, Hagen Paul Pfeifer, Ingo Molnar,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 13.05.21 20:47, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Wire up memfd_secret system call on architectures that define
> ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Acked-by: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Elena Reshetova <elena.reshetova@intel.com>
> Cc: Hagen Paul Pfeifer <hagen@jauu.net>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Will Deacon <will@kernel.org>
> ---
>   arch/arm64/include/uapi/asm/unistd.h   | 1 +
>   arch/riscv/include/asm/unistd.h        | 1 +
>   arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>   arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>   include/linux/syscalls.h               | 1 +
>   include/uapi/asm-generic/unistd.h      | 7 ++++++-
>   scripts/checksyscalls.sh               | 4 ++++
>   7 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
> index f83a70e07df8..ce2ee8f1e361 100644
> --- a/arch/arm64/include/uapi/asm/unistd.h
> +++ b/arch/arm64/include/uapi/asm/unistd.h
> @@ -20,5 +20,6 @@
>   #define __ARCH_WANT_SET_GET_RLIMIT
>   #define __ARCH_WANT_TIME32_SYSCALLS
>   #define __ARCH_WANT_SYS_CLONE3
> +#define __ARCH_WANT_MEMFD_SECRET
>   
>   #include <asm-generic/unistd.h>
> diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
> index 977ee6181dab..6c316093a1e5 100644
> --- a/arch/riscv/include/asm/unistd.h
> +++ b/arch/riscv/include/asm/unistd.h
> @@ -9,6 +9,7 @@
>    */
>   
>   #define __ARCH_WANT_SYS_CLONE
> +#define __ARCH_WANT_MEMFD_SECRET
>   
>   #include <uapi/asm/unistd.h>
>   
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 28a1423ce32e..e44519020a43 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -451,3 +451,4 @@
>   444	i386	landlock_create_ruleset	sys_landlock_create_ruleset
>   445	i386	landlock_add_rule	sys_landlock_add_rule
>   446	i386	landlock_restrict_self	sys_landlock_restrict_self
> +447	i386	memfd_secret		sys_memfd_secret
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index ecd551b08d05..a06f16106f24 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -368,6 +368,7 @@
>   444	common	landlock_create_ruleset	sys_landlock_create_ruleset
>   445	common	landlock_add_rule	sys_landlock_add_rule
>   446	common	landlock_restrict_self	sys_landlock_restrict_self
> +447	common	memfd_secret		sys_memfd_secret
>   
>   #
>   # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 050511e8f1f8..1a1b5d724497 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
>   asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
>   		const void __user *rule_attr, __u32 flags);
>   asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
> +asmlinkage long sys_memfd_secret(unsigned int flags);
>   
>   /*
>    * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 6de5a7fc066b..28b388368cf6 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
>   #define __NR_landlock_restrict_self 446
>   __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
>   
> +#ifdef __ARCH_WANT_MEMFD_SECRET
> +#define __NR_memfd_secret 447
> +__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> +#endif
> +
>   #undef __NR_syscalls
> -#define __NR_syscalls 447
> +#define __NR_syscalls 448
>   
>   /*
>    * 32 bit systems traditionally used different
> diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
> index a18b47695f55..b7609958ee36 100755
> --- a/scripts/checksyscalls.sh
> +++ b/scripts/checksyscalls.sh
> @@ -40,6 +40,10 @@ cat << EOF
>   #define __IGNORE_setrlimit	/* setrlimit */
>   #endif
>   
> +#ifndef __ARCH_WANT_MEMFD_SECRET
> +#define __IGNORE_memfd_secret
> +#endif
> +
>   /* Missing flags argument */
>   #define __IGNORE_renameat	/* renameat2 */
>   
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
  2021-05-13 18:47  3% ` [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users Mike Rapoport
@ 2021-05-14  9:27  0%   ` David Hildenbrand
  2021-05-18 10:24  0%   ` Mark Rutland
  1 sibling, 0 replies; 200+ results
From: David Hildenbrand @ 2021-05-14  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	Elena Reshetova, H. Peter Anvin, Hagen Paul Pfeifer, Ingo Molnar,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 13.05.21 20:47, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> It is unsafe to allow saving of secretmem areas to the hibernation
> snapshot as they would be visible after the resume and this essentially
> will defeat the purpose of secret memory mappings.
> 
> Prevent hibernation whenever there are active secret memory users.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Elena Reshetova <elena.reshetova@intel.com>
> Cc: Hagen Paul Pfeifer <hagen@jauu.net>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Palmer Dabbelt <palmerdabbelt@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Will Deacon <will@kernel.org>
> ---
>   include/linux/secretmem.h |  6 ++++++
>   kernel/power/hibernate.c  |  5 ++++-
>   mm/secretmem.c            | 15 +++++++++++++++
>   3 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
> index e617b4afcc62..21c3771e6a56 100644
> --- a/include/linux/secretmem.h
> +++ b/include/linux/secretmem.h
> @@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page)
>   }
>   
>   bool vma_is_secretmem(struct vm_area_struct *vma);
> +bool secretmem_active(void);
>   
>   #else
>   
> @@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page)
>   	return false;
>   }
>   
> +static inline bool secretmem_active(void)
> +{
> +	return false;
> +}
> +
>   #endif /* CONFIG_SECRETMEM */
>   
>   #endif /* _LINUX_SECRETMEM_H */
> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index da0b41914177..559acef3fddb 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -31,6 +31,7 @@
>   #include <linux/genhd.h>
>   #include <linux/ktime.h>
>   #include <linux/security.h>
> +#include <linux/secretmem.h>
>   #include <trace/events/power.h>
>   
>   #include "power.h"
> @@ -81,7 +82,9 @@ void hibernate_release(void)
>   
>   bool hibernation_available(void)
>   {
> -	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
> +	return nohibernate == 0 &&
> +		!security_locked_down(LOCKDOWN_HIBERNATION) &&
> +		!secretmem_active();
>   }
>   
>   /**
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 1ae50089adf1..7c2499e4de22 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
>   MODULE_PARM_DESC(secretmem_enable,
>   		 "Enable secretmem and memfd_secret(2) system call");
>   
> +static atomic_t secretmem_users;
> +
> +bool secretmem_active(void)
> +{
> +	return !!atomic_read(&secretmem_users);
> +}
> +
>   static vm_fault_t secretmem_fault(struct vm_fault *vmf)
>   {
>   	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> @@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
>   	.fault = secretmem_fault,
>   };
>   
> +static int secretmem_release(struct inode *inode, struct file *file)
> +{
> +	atomic_dec(&secretmem_users);
> +	return 0;
> +}
> +
>   static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
>   {
>   	unsigned long len = vma->vm_end - vma->vm_start;
> @@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
>   }
>   
>   static const struct file_operations secretmem_fops = {
> +	.release	= secretmem_release,
>   	.mmap		= secretmem_mmap,
>   };
>   
> @@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
>   	file->f_flags |= O_LARGEFILE;
>   
>   	fd_install(fd, file);
> +	atomic_inc(&secretmem_users);
>   	return fd;
>   
>   err_put_fd:
> 

It looks a bit racy, but I guess we don't really care about these corner 
cases.
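
For readers who want the accounting spelled out, it can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: C11 atomics stand in for the kernel's atomic_t, the model_* names are invented for illustration, and nothing here is kernel code. One window the remark above may refer to (an assumption; the review does not name it) is that the availability check can read zero just before a concurrent memfd_secret() call increments the counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the patch's "static atomic_t secretmem_users". */
static atomic_int secretmem_users;

/* Mirrors secretmem_active(): true while at least one secretmem fd exists. */
bool secretmem_active(void)
{
	return atomic_load(&secretmem_users) != 0;
}

/* Hypothetical stand-in for the tail of memfd_secret(): count a new user. */
void model_memfd_secret(void)
{
	atomic_fetch_add(&secretmem_users, 1);
}

/* Hypothetical stand-in for secretmem_release(): drop one user. */
void model_release(void)
{
	int old = atomic_fetch_sub(&secretmem_users, 1);

	/* A release without a matching create would be an accounting bug. */
	assert(old > 0);
}

/* hibernation_available() reduced to the secretmem term only. */
bool model_hibernation_available(void)
{
	return !secretmem_active();
}
```

In this model the check and the increment are each atomic, but the pair "check, then start hibernating" is not, which is the kind of corner case that a plain counter cannot close on its own.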

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* Re: [PATCH v19 1/8] mmap: make mlock_future_check() global
  2021-05-13 18:47  4% ` [PATCH v19 1/8] mmap: make mlock_future_check() global Mike Rapoport
@ 2021-05-14  8:27  0%   ` David Hildenbrand
  0 siblings, 0 replies; 200+ results
From: David Hildenbrand @ 2021-05-14  8:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	Elena Reshetova, H. Peter Anvin, Hagen Paul Pfeifer, Ingo Molnar,
	James Bottomley, Kees Cook, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 13.05.21 20:47, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> It will be used by the upcoming secret memory implementation.
> 
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Elena Reshetova <elena.reshetova@intel.com>
> Cc: Hagen Paul Pfeifer <hagen@jauu.net>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Palmer Dabbelt <palmer@dabbelt.com>
> Cc: Palmer Dabbelt <palmerdabbelt@google.com>
> Cc: Paul Walmsley <paul.walmsley@sifive.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Cc: Roman Gushchin <guro@fb.com>
> Cc: Shakeel Butt <shakeelb@google.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tycho Andersen <tycho@tycho.ws>
> Cc: Will Deacon <will@kernel.org>
> ---
>   mm/internal.h | 3 +++
>   mm/mmap.c     | 5 ++---
>   2 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 54bd0dc2c23c..46eb82eaa195 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -373,6 +373,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
>   extern void mlock_vma_page(struct page *page);
>   extern unsigned int munlock_vma_page(struct page *page);
>   
> +extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> +			      unsigned long len);
> +
>   /*
>    * Clear the page's PageMlocked().  This can be useful in a situation where
>    * we want to unconditionally remove a page from the pagecache -- e.g.,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 0584e540246e..81f5595a8490 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1352,9 +1352,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
>   	return hint;
>   }
>   
> -static inline int mlock_future_check(struct mm_struct *mm,
> -				     unsigned long flags,
> -				     unsigned long len)
> +int mlock_future_check(struct mm_struct *mm, unsigned long flags,
> +		       unsigned long len)
>   {
>   	unsigned long locked, lock_limit;
>   
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb



* [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2)
                     ` (5 preceding siblings ...)
  2021-05-13 18:47  3% ` [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
@ 2021-05-13 18:47  2% ` Mike Rapoport
  6 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.
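
As a rough illustration of the interface under test, the sketch below creates one secret page, maps it and writes a pattern to it. It is not part of the selftest: the function name secret_page_roundtrip() and its return convention are assumptions made for this example, and it treats any failure of the system call itself (missing syscall number, feature disabled at boot, syscall filtered) as "unavailable" rather than as an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from the selftest.
 * Returns  0 if a secret page was created, mapped and written,
 *          1 if memfd_secret() is unavailable on this system,
 *         -1 if a later step failed unexpectedly.
 */
int secret_page_roundtrip(void)
{
#ifdef __NR_memfd_secret
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *mem;
	int fd, ret = -1;

	if (page <= 0)
		return -1;

	/* There is no libc wrapper assumed here; call the syscall directly. */
	fd = syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return 1;

	/* Size the file first, then map it shared and touch the page. */
	if (ftruncate(fd, page) == 0) {
		mem = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED,
			   fd, 0);
		if (mem != MAP_FAILED) {
			memset(mem, 0x55, page);
			ret = (mem[0] == 0x55 && mem[page - 1] == 0x55) ? 0 : -1;
			munmap(mem, page);
		}
	}

	close(fd);
	return ret;
#else
	/* Headers predate memfd_secret: report it as unavailable. */
	return 1;
#endif
}
```

A caller would typically map the "unavailable" result to a skipped test and only treat -1 as a failure.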

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 1f651e85ed60..da92ded5a27c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,5 +21,6 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
 split_huge_page_test
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 73e1cc96d7c2..266580ea938c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -134,7 +135,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..2462f52e9c96
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned int flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0



* [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant
                     ` (4 preceding siblings ...)
  2021-05-13 18:47  3% ` [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users Mike Rapoport
@ 2021-05-13 18:47  3% ` Mike Rapoport
  2021-05-14  9:27  0%   ` David Hildenbrand
  2021-05-13 18:47  2% ` [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  6 siblings, 1 reply; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
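
For illustration only (not part of this patch): once the syscall number is
wired up, userspace can reach it through syscall(2) even when libc has no
wrapper. The helper name secret_fd_open() is hypothetical, and the sketch
assumes the installed kernel headers define __NR_memfd_secret on the
architectures listed above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this series): invoke memfd_secret()
 * through syscall(2), since libc may not provide a wrapper.
 * Returns a file descriptor, or a negative errno value.
 */
static int secret_fd_open(unsigned int flags)
{
#ifdef __NR_memfd_secret
	long fd = syscall(__NR_memfd_secret, flags);

	return fd < 0 ? -errno : (int)fd;
#else
	/* The headers for this architecture do not define the syscall number. */
	(void)flags;
	return -ENOSYS;
#endif
}
```

A caller would typically treat -ENOSYS as "not supported here" and fall
back to ordinary memory, much as the selftest skips in that case.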

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/uapi/asm/unistd.h   | 1 +
 arch/riscv/include/asm/unistd.h        | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h               | 1 +
 include/uapi/asm-generic/unistd.h      | 7 ++++++-
 scripts/checksyscalls.sh               | 4 ++++
 7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
 #define __ARCH_WANT_SET_GET_RLIMIT
 #define __ARCH_WANT_TIME32_SYSCALLS
 #define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
  */
 
 #define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <uapi/asm/unistd.h>
 
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 28a1423ce32e..e44519020a43 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -451,3 +451,4 @@
 444	i386	landlock_create_ruleset	sys_landlock_create_ruleset
 445	i386	landlock_add_rule	sys_landlock_add_rule
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
+447	i386	memfd_secret		sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index ecd551b08d05..a06f16106f24 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -368,6 +368,7 @@
 444	common	landlock_create_ruleset	sys_landlock_create_ruleset
 445	common	landlock_add_rule	sys_landlock_add_rule
 446	common	landlock_restrict_self	sys_landlock_restrict_self
+447	common	memfd_secret		sys_memfd_secret
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1a1b5d724497 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1050,6 +1050,7 @@ asmlinkage long sys_landlock_create_ruleset(const struct landlock_ruleset_attr _
 asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type rule_type,
 		const void __user *rule_attr, __u32 flags);
 asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
+asmlinkage long sys_memfd_secret(unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 6de5a7fc066b..28b388368cf6 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -873,8 +873,13 @@ __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule)
 #define __NR_landlock_restrict_self 446
 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self)
 
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 447
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
+
 #undef __NR_syscalls
-#define __NR_syscalls 447
+#define __NR_syscalls 448
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
 #define __IGNORE_setrlimit	/* setrlimit */
 #endif
 
+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
-- 
2.28.0



* [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users
                     ` (3 preceding siblings ...)
  2021-05-13 18:47  2% ` [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
@ 2021-05-13 18:47  3% ` Mike Rapoport
  2021-05-14  9:27  0%   ` David Hildenbrand
  2021-05-18 10:24  0%   ` Mark Rutland
  2021-05-13 18:47  3% ` [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
  2021-05-13 18:47  2% ` [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  6 siblings, 2 replies; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot as they would be visible after the resume and this would
essentially defeat the purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.
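
The counting scheme can be modelled in userspace as a rough sketch. This is
not the kernel code: the model_* names are hypothetical, C11 atomics stand
in for atomic_t, and all other hibernation conditions are collapsed into a
single flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's atomic_t user counter. */
static atomic_int model_secretmem_users;

/* Called where memfd_secret() would install a new file descriptor. */
static void model_secretmem_open(void)
{
	atomic_fetch_add(&model_secretmem_users, 1);
}

/* Called where the last reference to the file would be released. */
static void model_secretmem_release(void)
{
	atomic_fetch_sub(&model_secretmem_users, 1);
}

static bool model_secretmem_active(void)
{
	return atomic_load(&model_secretmem_users) != 0;
}

/* Other conditions (nohibernate, lockdown) are passed in as one flag. */
static bool model_hibernation_available(bool otherwise_allowed)
{
	return otherwise_allowed && !model_secretmem_active();
}
```

In this model hibernation becomes available again only after every open
has been matched by a release, which mirrors the intent described above.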

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h |  6 ++++++
 kernel/power/hibernate.c  |  5 ++++-
 mm/secretmem.c            | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e617b4afcc62..21c3771e6a56 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -30,6 +30,7 @@ static inline bool page_is_secretmem(struct page *page)
 }
 
 bool vma_is_secretmem(struct vm_area_struct *vma);
+bool secretmem_active(void);
 
 #else
 
@@ -43,6 +44,11 @@ static inline bool page_is_secretmem(struct page *page)
 	return false;
 }
 
+static inline bool secretmem_active(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_SECRETMEM */
 
 #endif /* _LINUX_SECRETMEM_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include <linux/genhd.h>
 #include <linux/ktime.h>
 #include <linux/security.h>
+#include <linux/secretmem.h>
 #include <trace/events/power.h>
 
 #include "power.h"
@@ -81,7 +82,9 @@ void hibernate_release(void)
 
 bool hibernation_available(void)
 {
-	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+	return nohibernate == 0 &&
+		!security_locked_down(LOCKDOWN_HIBERNATION) &&
+		!secretmem_active();
 }
 
 /**
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 1ae50089adf1..7c2499e4de22 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
 MODULE_PARM_DESC(secretmem_enable,
 		 "Enable secretmem and memfd_secret(2) system call");
 
+static atomic_t secretmem_users;
+
+bool secretmem_active(void)
+{
+	return !!atomic_read(&secretmem_users);
+}
+
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
 	.fault = secretmem_fault,
 };
 
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	atomic_dec(&secretmem_users);
+	return 0;
+}
+
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
 }
 
 static const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
 	.mmap		= secretmem_mmap,
 };
 
@@ -202,6 +216,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
 	file->f_flags |= O_LARGEFILE;
 
 	fd_install(fd, file);
+	atomic_inc(&secretmem_users);
 	return fd;
 
 err_put_fd:
-- 
2.28.0



* [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas
                     ` (2 preceding siblings ...)
  2021-05-13 18:47  3% ` [PATCH v19 4/8] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
@ 2021-05-13 18:47  2% ` Mike Rapoport
  2021-05-13 18:47  3% ` [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users Mike Rapoport
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and mapped neither
into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.
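
As a hedged illustration: because the switch is declared with
module_param_named(enable, ...) in mm/secretmem.c, it should be reachable
as "secretmem.enable=1" on the kernel command line and, for root only
(mode 0400), under /sys/module/secretmem/parameters/enable. The helper
names below are hypothetical and the sysfs path is inferred from the
parameter declaration, not stated in this patch.

```c
#include <assert.h>
#include <stdio.h>

/* Path implied by module_param_named(enable, ...) in mm/secretmem.c. */
#define SECRETMEM_PARAM "/sys/module/secretmem/parameters/enable"

/* Interpret the text of a bool module parameter: 1, 0, or -1 if unknown. */
static int parse_bool_param(const char *text)
{
	switch (text[0]) {
	case 'Y': case 'y': case '1':
		return 1;
	case 'N': case 'n': case '0':
		return 0;
	default:
		return -1;
	}
}

/* Returns 1 if enabled, 0 if disabled, -1 if the state cannot be read. */
static int secretmem_enabled(void)
{
	char buf[8] = "";
	FILE *f = fopen(SECRETMEM_PARAM, "r");

	if (!f)
		return -1;	/* not root, no secretmem support, or no sysfs */
	if (!fgets(buf, sizeof(buf), f))
		buf[0] = '\0';
	fclose(f);
	return parse_bool_param(buf);
}
```

A result of -1 only means the state is unknown; calling memfd_secret() and
checking for ENOSYS remains the authoritative test.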

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls from this file descriptor will be unmapped from the kernel
direct map and they will be only mapped in the page table of the processes
that have access to the file descriptor.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces, such as mlock(), mprotect(), madvise(). The
file descriptor approach allows explicit and controlled sharing of the
memory areas, and it allows sealing of operations. Besides, file descriptor
based memory paves the way for VMMs to remove the secret memory range from
the userspace hypervisor process, for instance QEMU. Andy Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major work
   in the kernel, but getting vma-backed memory into a guest without
   mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because its purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to the
memory. Nowadays the cost of a new system call is negligible, while it is
much simpler for userspace to deal with clear-cut system calls than with a
multiplexer or an overloaded syscall. Moreover, the initial implementation
of memfd_secret() is completely distinct from memfd_create(), so there is
not much sense in overloading memfd_create() to begin with. If code sharing
between these implementations is ever needed, it can be easily achieved
without adjusting user visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel, it
will be an opt-in request in the system call flags, so that the user has to
decide what data can be exposed to the kernel.

Removing pages from the direct map may fragment it on architectures that
use large pages to map the physical memory, which affects the system
performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently from
mlock(), an attempt to mlock()/munlock() secretmem range would fail and
mlockall()/munlockall() will ignore secretmem mappings.
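
A userspace program may want to compare the requested length with the soft
RLIMIT_MEMLOCK before mapping, since secretmem_mmap() in this patch returns
-EAGAIN when the mlock accounting check fails. The sketch below is only an
approximation: the helper names are hypothetical, and it ignores memory the
process has already locked as well as CAP_IPC_LOCK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

/* True if a mapping of len bytes stays within the given byte limit. */
static bool len_within_limit(size_t len, rlim_t limit)
{
	return limit == RLIM_INFINITY || len <= limit;
}

/*
 * Best-effort pre-check against the soft RLIMIT_MEMLOCK. It ignores memory
 * the process has already locked and CAP_IPC_LOCK, so mmap() may still fail
 * or succeed regardless; treat the result as a hint only.
 */
static bool secret_len_probably_ok(size_t len)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
		return false;
	return len_within_limit(len, rlim.rlim_cur);
}
```

Even when the hint says the length fits, callers still need to handle a
failing mmap(), because the kernel accounts in pages and across all locked
memory of the process.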

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space. With default limits, there is no excessive use of secretmem
and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in
the future this should be addressed to allow balanced use of large amounts
of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
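
With error handling added, the same sequence might look like the sketch
below. The helper name secret_map() is hypothetical, the raw syscall(2)
invocation is an assumption made because libc may lack a wrapper, and
closing the descriptor right after mmap() is a design choice of this
sketch, not something the patch requires.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a secret mapping of len bytes.
 * On success returns 0 and stores the address in *out; on failure returns
 * a negative errno value. The fd is closed once the mapping exists.
 */
static int secret_map(size_t len, void **out)
{
#ifdef __NR_memfd_secret
	int fd, err;
	void *ptr;

	fd = (int)syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return -errno;	/* ENOSYS if secretmem is not enabled */

	if (ftruncate(fd, (off_t)len) < 0) {
		err = -errno;
		close(fd);
		return err;
	}

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	err = (ptr == MAP_FAILED) ? -errno : 0;
	close(fd);	/* the mapping keeps the file alive */
	if (err)
		return err;

	*out = ptr;
	return 0;
#else
	(void)len;
	(void)out;
	return -ENOSYS;
#endif
}
```

A process that wants to share the area with another process would keep the
descriptor open instead and pass it on, for example over a UNIX socket.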

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 drivers/char/mem.c         |   4 +
 include/linux/secretmem.h  |  48 ++++++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   4 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  12 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 239 +++++++++++++++++++++++++++++++++++++
 9 files changed, 313 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 15dc54fa1d47..95741f93a6cd 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,6 +31,7 @@
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
+#include <linux/secretmem.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
@@ -64,6 +65,9 @@ static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t size)
 #ifdef CONFIG_STRICT_DEVMEM
 static inline int page_is_allowed(unsigned long pfn)
 {
+	if (pfn_valid(pfn) && page_is_secretmem(pfn_to_page(pfn)))
+		return 0;
+
 	return devmem_is_allowed(pfn);
 }
 static inline int range_is_allowed(unsigned long pfn, unsigned long size)
diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..e617b4afcc62
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Using page_mapping() is quite slow because of the actual call
+	 * instruction and repeated compound_head(page) inside the
+	 * page_mapping() function.
+	 * We know that secretmem pages are not compound and are on the LRU,
+	 * so we can save a couple of cycles here.
+	 */
+	if (PageCompound(page) || !PageLRU(page))
+		return false;
+
+	mapping = (struct address_space *)
+		((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+
+	if (mapping != page->mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..4d7e377a74f3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -358,6 +358,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 02d44e3420f5..f61e7d33c7bf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -901,4 +901,8 @@ config KMAP_LOCAL
 # struct io_mapping based helper.  Selected by drivers that need them
 config IO_MAPPING
 	bool
+
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index bf71e295e9f6..7bb6ed5e42e8 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -125,3 +125,4 @@ obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index 0697134b6a12..6515f82b0f32 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -816,6 +817,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -949,6 +953,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2077,6 +2084,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		if (!head)
 			goto pte_unmap;
 
+		if (unlikely(page_is_secretmem(page))) {
+			put_compound_head(head, 1, flags);
+			goto pte_unmap;
+		}
+
 		if (unlikely(pte_val(pte) != pte_val(*ptep))) {
 			put_compound_head(head, 1, flags);
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index df590fda5688..5e9f4dea4e96 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..1ae50089adf1
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned int, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		return PTR_ERR(secretmem_mnt);
+
+	/* prevent secretmem mappings from ever getting PROT_EXEC */
+	secretmem_mnt->mnt_flags |= MNT_NOEXEC;
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0
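
For orientation, here is a minimal userspace sketch of how the memfd_secret()
syscall defined in the patch above might be exercised. It is not part of the
patch. It assumes syscall number 447 when the libc headers do not provide
SYS_memfd_secret (447 is the mainline value on x86-64 and in the asm-generic
table; this series may have used a different number), and it assumes that
ftruncate() is accepted on the descriptor. The helper name
secret_area_create() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_memfd_secret
#define SYS_memfd_secret 447	/* assumption: mainline x86-64 / asm-generic number */
#endif

/*
 * Create a secret memory area of 'len' bytes and map it.
 * Returns the mapping and stores the descriptor in *fdp, or returns NULL
 * with errno set (ENOSYS if the kernel lacks the syscall or secretmem is
 * disabled).
 */
static void *secret_area_create(size_t len, int *fdp)
{
	int fd, saved_errno;
	void *p;

	assert(fdp != NULL && len > 0);

	fd = (int)syscall(SYS_memfd_secret, 0);
	if (fd < 0)
		return NULL;

	/* the file starts with zero size, so size it before faulting pages in */
	if (ftruncate(fd, (off_t)len) < 0)
		goto err_close;

	/* secretmem_mmap() rejects private mappings with EINVAL */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto err_close;

	*fdp = fd;
	return p;

err_close:
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return NULL;
}
```

A caller would treat a NULL return with errno set to ENOSYS as "secret memory
is not available here" and fall back to ordinary memory. EAGAIN from mmap()
would correspond to the mlock_future_check() failure in secretmem_mmap().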



* [PATCH v19 4/8] set_memory: allow querying whether set_direct_map_*() is actually enabled
    2021-05-13 18:47  4% ` [PATCH v19 1/8] mmap: make mlock_future_check() global Mike Rapoport
  2021-05-13 18:47  3% ` [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages Mike Rapoport
@ 2021-05-13 18:47  3% ` Mike Rapoport
  2021-05-13 18:47  2% ` [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that reports
whether calling set_direct_map_*() will actually change the page table.
Replace several open-coded checks in arm64 with the new function and provide
a generic stub for architectures that always modify page tables upon calls
to the set_direct_map APIs.

[arnd@arndb.de: arm64: kfence: fix header inclusion ]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/kfence.h     |  2 +-
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 8 files changed, 43 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index ace2c3d7ae7e..4e3c13799735 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -131,12 +131,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..0ec94e718724 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6dd9369e3ea0..e42aeff6c344 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -515,7 +516,7 @@ static void __init map_mem(pgd_t *pgdp)
 	 */
 	BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end));
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1483,8 +1484,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0
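
The override idiom used in include/linux/set_memory.h above (an architecture
declares the function and defines a macro of the same name, and the generic
header supplies a stub only if that macro is absent) can be tried out in a
standalone file. The sketch below is a userspace model, not kernel code:
MODEL_ARCH_OVERRIDE, the two boolean knobs and hide_page_model() are
hypothetical names introduced only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

#ifdef MODEL_ARCH_OVERRIDE
/* hypothetical stand-ins for the arm64 boot-time settings */
static bool rodata_full;
static bool debug_pagealloc;

/* "architecture" version: the answer depends on boot-time configuration */
static inline bool can_set_direct_map(void)
{
	return rodata_full || debug_pagealloc;
}
#define can_set_direct_map can_set_direct_map
#endif

/* "generic" part: provide the stub only if nobody overrode the query */
#ifndef can_set_direct_map
static inline bool can_set_direct_map(void)
{
	return true;
}
#define can_set_direct_map can_set_direct_map
#endif

/* hypothetical caller: refuse early if direct map changes would be no-ops */
static int hide_page_model(void)
{
	if (!can_set_direct_map())
		return -ENOSYS;

	/* a real caller would now call set_direct_map_invalid_noflush() */
	return 0;
}
```

Built as is, the generic stub is used and hide_page_model() returns 0. Built
with -DMODEL_ARCH_OVERRIDE, both knobs default to false and the call returns
-ENOSYS, which models a set_direct_map_*() that would return 0 without
touching the linear map.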



* [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages
    2021-05-13 18:47  4% ` [PATCH v19 1/8] mmap: make mlock_future_check() global Mike Rapoport
@ 2021-05-13 18:47  3% ` Mike Rapoport
  2021-05-13 18:47  3% ` [PATCH v19 4/8] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.

Add a numpages parameter to set_direct_map_*_noflush() to expose this
ability through these APIs.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/cacheflush.h |  4 ++--
 arch/arm64/mm/pageattr.c            | 10 ++++++----
 arch/riscv/include/asm/set_memory.h |  4 ++--
 arch/riscv/mm/pageattr.c            |  8 ++++----
 arch/x86/include/asm/set_memory.h   |  4 ++--
 arch/x86/mm/pat/set_memory.c        |  8 ++++----
 include/linux/set_memory.h          |  4 ++--
 kernel/power/snapshot.c             |  4 ++--
 mm/vmalloc.c                        |  5 +++--
 9 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 52e5c1623224..ace2c3d7ae7e 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -133,8 +133,8 @@ static __always_inline void __flush_icache_all(void)
 
 int set_memory_valid(unsigned long addr, int numpages, int enable);
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 #include <asm-generic/cacheflush.h>
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 92eccaf595c8..b53ef37bf95a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -148,34 +148,36 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 					__pgprot(PTE_VALID));
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	struct page_change_data data = {
 		.set_mask = __pgprot(0),
 		.clear_mask = __pgprot(PTE_VALID),
 	};
+	unsigned long size = PAGE_SIZE * numpages;
 
 	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
 				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   size, change_page_range, &data);
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	struct page_change_data data = {
 		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
 		.clear_mask = __pgprot(PTE_RDONLY),
 	};
+	unsigned long size = PAGE_SIZE * numpages;
 
 	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
 				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   size, change_page_range, &data);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 086f757e8ba3..06aed922ec1f 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -32,8 +32,8 @@ void protect_kernel_linear_mapping_text_rodata(void);
 static inline void protect_kernel_linear_mapping_text_rodata(void) {}
 #endif
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 5e49e4b4a4cc..9618181b70be 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -156,11 +156,11 @@ int set_memory_nx(unsigned long addr, int numpages)
 	return __set_memory(addr, numpages, __pgprot(0), __pgprot(_PAGE_EXEC));
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	int ret;
 	unsigned long start = (unsigned long)page_address(page);
-	unsigned long end = start + PAGE_SIZE;
+	unsigned long end = start + PAGE_SIZE * numpages;
 	struct pageattr_masks masks = {
 		.set_mask = __pgprot(0),
 		.clear_mask = __pgprot(_PAGE_PRESENT)
@@ -173,11 +173,11 @@ int set_direct_map_invalid_noflush(struct page *page)
 	return ret;
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	int ret;
 	unsigned long start = (unsigned long)page_address(page);
-	unsigned long end = start + PAGE_SIZE;
+	unsigned long end = start + PAGE_SIZE * numpages;
 	struct pageattr_masks masks = {
 		.set_mask = PAGE_KERNEL,
 		.clear_mask = __pgprot(0)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 43fa081a1adb..5f84aa4b6961 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -80,8 +80,8 @@ int set_pages_wb(struct page *page, int numpages);
 int set_pages_ro(struct page *page, int numpages);
 int set_pages_rw(struct page *page, int numpages);
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 156cd235659f..15a55d6e9cec 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2192,14 +2192,14 @@ static int __set_pages_np(struct page *page, int numpages)
 	return __change_page_attr_set_clr(&cpa, 0);
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
-	return __set_pages_np(page, 1);
+	return __set_pages_np(page, numpages);
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
-	return __set_pages_p(page, 1);
+	return __set_pages_p(page, numpages);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index fe1aa4e54680..c650f82db813 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -15,11 +15,11 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
 #ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
-static inline int set_direct_map_invalid_noflush(struct page *page)
+static inline int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	return 0;
 }
-static inline int set_direct_map_default_noflush(struct page *page)
+static inline int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	return 0;
 }
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 1a221dcb3c01..27cb4e7086b7 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -86,7 +86,7 @@ static inline void hibernate_restore_unprotect_page(void *page_address) {}
 static inline void hibernate_map_page(struct page *page)
 {
 	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
-		int ret = set_direct_map_default_noflush(page);
+		int ret = set_direct_map_default_noflush(page, 1);
 
 		if (ret)
 			pr_warn_once("Failed to remap page\n");
@@ -99,7 +99,7 @@ static inline void hibernate_unmap_page(struct page *page)
 {
 	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
 		unsigned long addr = (unsigned long)page_address(page);
-		int ret  = set_direct_map_invalid_noflush(page);
+		int ret = set_direct_map_invalid_noflush(page, 1);
 
 		if (ret)
 			pr_warn_once("Failed to remap page\n");
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index a13ac524f6ff..5d96fee17226 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2469,14 +2469,15 @@ struct vm_struct *remove_vm_area(const void *addr)
 }
 
 static inline void set_area_direct_map(const struct vm_struct *area,
-				       int (*set_direct_map)(struct page *page))
+				       int (*set_direct_map)(struct page *page,
+							     int numpages))
 {
 	int i;
 
 	/* HUGE_VMALLOC passes small pages to set_direct_map */
 	for (i = 0; i < area->nr_pages; i++)
 		if (page_address(area->pages[i]))
-			set_direct_map(area->pages[i]);
+			set_direct_map(area->pages[i], 1);
 }
 
 /* Handle removing and resetting vm mappings related to the vm_struct. */
-- 
2.28.0
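
To illustrate the interface change in the patch above, here is a small
userspace model of a (page, numpages) range update and of a callback-based
caller shaped like set_area_direct_map(). It is only a sketch: the array
page_valid[], the counter nr_calls and all model_* names are hypothetical,
and pages are identified by index rather than by struct page.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES 8

/* 1 = present in the modelled direct map, 0 = removed from it */
static int page_valid[NR_PAGES] = { 1, 1, 1, 1, 1, 1, 1, 1 };
static int nr_calls;	/* number of range updates performed */

/* update 'numpages' contiguous pages starting at index 'first' */
static int model_update(size_t first, int numpages, int valid)
{
	int i;

	if (numpages < 0 || first > NR_PAGES ||
	    (size_t)numpages > NR_PAGES - first)
		return -1;

	nr_calls++;
	for (i = 0; i < numpages; i++)
		page_valid[first + (size_t)i] = valid;

	return 0;
}

static int model_set_invalid(size_t first, int numpages)
{
	return model_update(first, numpages, 0);
}

static int model_set_default(size_t first, int numpages)
{
	return model_update(first, numpages, 1);
}

/*
 * Pages of an area need not be contiguous, so the callback is still
 * invoked once per page with numpages == 1.
 */
static void model_set_area(const size_t *pages, size_t nr,
			   int (*set_direct_map)(size_t first, int numpages))
{
	size_t i;

	assert(set_direct_map != NULL);

	for (i = 0; i < nr; i++)
		set_direct_map(pages[i], 1);
}
```

In this model, a contiguous run such as model_set_invalid(2, 3) costs one
call, while scattered pages passed to model_set_area() still cost one call
each.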



* [PATCH v19 1/8] mmap: make mlock_future_check() global
  @ 2021-05-13 18:47  4% ` Mike Rapoport
  2021-05-14  8:27  0%   ` David Hildenbrand
  2021-05-13 18:47  3% ` [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages Mike Rapoport
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 200+ results
From: Mike Rapoport @ 2021-05-13 18:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Hagen Paul Pfeifer, Ingo Molnar, James Bottomley, Kees Cook,
	Kirill A. Shutemov, Matthew Wilcox, Matthew Garrett,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, Yury Norov, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Make mlock_future_check() global so that the upcoming secret memory
implementation can use it.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 mm/internal.h | 3 +++
 mm/mmap.c     | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..46eb82eaa195 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -373,6 +373,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 extern void mlock_vma_page(struct page *page);
 extern unsigned int munlock_vma_page(struct page *page);
 
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+			      unsigned long len);
+
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
  * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 0584e540246e..81f5595a8490 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1352,9 +1352,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
 	return hint;
 }
 
-static inline int mlock_future_check(struct mm_struct *mm,
-				     unsigned long flags,
-				     unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+		       unsigned long len)
 {
 	unsigned long locked, lock_limit;
 
-- 
2.28.0



* Re: [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12
  2021-05-10  4:26  5%     ` Amir Goldstein
@ 2021-05-10 16:34 10%       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-05-10 16:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: mtk.manpages, Alejandro Colomar, linux-man, Luis Henriques,
	Greg KH, Anna Schumaker, Jeff Layton, Steve French,
	Miklos Szeredi, Trond Myklebust, Alexander Viro, Darrick J. Wong,
	Dave Chinner, Nicolas Boichat, Ian Lance Taylor, Luis Lozano,
	Andreas Dilger, Olga Kornievskaia, Christoph Hellwig, ceph-devel,
	linux-kernel, CIFS, samba-technical, linux-fsdevel,
	Linux NFS Mailing List, Walter Harms

Hi Amir,

On 5/10/21 4:26 PM, Amir Goldstein wrote:
> On Mon, May 10, 2021 at 3:01 AM Michael Kerrisk (man-pages)
> <mtk.manpages@gmail.com> wrote:
>>
>> Hi Alex,
>>
>> On 5/10/21 9:39 AM, Alejandro Colomar wrote:
>>> Linux 5.12 fixes a regression.
> 
> Nope.
> That never happened:
> https://lore.kernel.org/linux-fsdevel/8735v4tcye.fsf@suse.de/
> 
>>>
>>> Cross-filesystem (introduced in 5.3) copies were buggy.
>>>
>>> Move the statements documenting cross-fs to BUGS.
>>> Kernels 5.3..5.11 should be patched soon.
>>>
>>> State version information for some errors related to this.
>>
>> Thanks. Patch applied.
> 
> I guess that would need to be reverted...

Thanks for catching that. I had not pushed the patch, so 
I'll just drop it.

Cheers,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
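
Since the kernel-version claims about cross-filesystem copies are disputed in
this thread, a caller may prefer not to depend on them at all. The following
is a minimal sketch of that approach, not anything proposed in the thread: it
tries copy_file_range() and falls back to read()/write() when the call fails
with EXDEV, EOPNOTSUPP, ENOSYS or EINVAL. The helper names write_all() and
copy_fd() are hypothetical, and the choice of which errors trigger the
fallback is an assumption. The raw syscall is used only to keep the example
independent of the libc version; the glibc wrapper copy_file_range() takes
the same arguments.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* write the whole buffer, retrying on short writes and EINTR */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Copy from the current offset of fd_in to the current offset of fd_out
 * until end of file. Returns 0 on success, -1 with errno set on error.
 */
static int copy_fd(int fd_in, int fd_out)
{
	char buf[65536];
	ssize_t n;

	assert(fd_in >= 0 && fd_out >= 0);

	for (;;) {
		n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL,
			    (size_t)1 << 20, 0U);
		if (n == 0)
			return 0;	/* end of file */
		if (n < 0)
			break;
	}

	/* assumption: these errors mean "not supported here", so fall back */
	if (errno != EXDEV && errno != EOPNOTSUPP &&
	    errno != ENOSYS && errno != EINVAL)
		return -1;

	for (;;) {
		n = read(fd_in, buf, sizeof(buf));
		if (n == 0)
			return 0;
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (write_all(fd_out, buf, (size_t)n) < 0)
			return -1;
	}
}
```

Because both descriptors advance their file offsets as data is copied, the
fallback loop continues where copy_file_range() stopped, so a failure partway
through does not duplicate data.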


* Re: [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12
  2021-05-10  0:01 10%   ` Michael Kerrisk (man-pages)
@ 2021-05-10  4:26  5%     ` Amir Goldstein
  2021-05-10 16:34 10%       ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Amir Goldstein @ 2021-05-10  4:26 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: Alejandro Colomar, linux-man, Luis Henriques, Greg KH,
	Anna Schumaker, Jeff Layton, Steve French, Miklos Szeredi,
	Trond Myklebust, Alexander Viro, Darrick J. Wong, Dave Chinner,
	Nicolas Boichat, Ian Lance Taylor, Luis Lozano, Andreas Dilger,
	Olga Kornievskaia, Christoph Hellwig, ceph-devel, linux-kernel,
	CIFS, samba-technical, linux-fsdevel, Linux NFS Mailing List,
	Walter Harms

On Mon, May 10, 2021 at 3:01 AM Michael Kerrisk (man-pages)
<mtk.manpages@gmail.com> wrote:
>
> Hi Alex,
>
> On 5/10/21 9:39 AM, Alejandro Colomar wrote:
> > Linux 5.12 fixes a regression.

Nope.
That never happened:
https://lore.kernel.org/linux-fsdevel/8735v4tcye.fsf@suse.de/

> >
> > Cross-filesystem (introduced in 5.3) copies were buggy.
> >
> > Move the statements documenting cross-fs to BUGS.
> > Kernels 5.3..5.11 should be patched soon.
> >
> > State version information for some errors related to this.
>
> Thanks. Patch applied.

I guess that would need to be reverted...

Thanks,
Amir.


* Re: [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12
  2021-05-09 21:39  3% ` [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12 Alejandro Colomar
@ 2021-05-10  0:01 10%   ` Michael Kerrisk (man-pages)
  2021-05-10  4:26  5%     ` Amir Goldstein
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-05-10  0:01 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: mtk.manpages, linux-man, Luis Henriques, Amir Goldstein, Greg KH,
	Anna Schumaker, Jeff Layton, Steve French, Miklos Szeredi,
	Trond Myklebust, Alexander Viro, Darrick J. Wong, Dave Chinner,
	Nicolas Boichat, Ian Lance Taylor, Luis Lozano, Andreas Dilger,
	Olga Kornievskaia, Christoph Hellwig, ceph-devel, linux-kernel,
	CIFS, samba-technical, linux-fsdevel, Linux NFS Mailing List,
	Walter Harms

Hi Alex,

On 5/10/21 9:39 AM, Alejandro Colomar wrote:
> Linux 5.12 fixes a regression.
> 
> Cross-filesystem (introduced in 5.3) copies were buggy.
> 
> Move the statements documenting cross-fs to BUGS.
> Kernels 5.3..5.11 should be patched soon.
> 
> State version information for some errors related to this.

Thanks. Patch applied.

Cheers,

Michael

> 
> Reported-by: Luis Henriques <lhenriques@suse.de>
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Related: <https://lwn.net/Articles/846403/>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Anna Schumaker <anna.schumaker@netapp.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Steve French <sfrench@samba.org>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Nicolas Boichat <drinkcat@chromium.org>
> Cc: Ian Lance Taylor <iant@google.com>
> Cc: Luis Lozano <llozano@chromium.org>
> Cc: Andreas Dilger <adilger@dilger.ca>
> Cc: Olga Kornievskaia <aglo@umich.edu>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Cc: linux-kernel <linux-kernel@vger.kernel.org>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Cc: samba-technical <samba-technical@lists.samba.org>
> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
> Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> Cc: Walter Harms <wharms@bfs.de>
> Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
> ---
>  man2/copy_file_range.2 | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 467a16300..843e02241 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -169,6 +169,9 @@ Out of memory.
>  .B ENOSPC
>  There is not enough space on the target filesystem to complete the copy.
>  .TP
> +.BR EOPNOTSUPP " (since Linux 5.12)"
> +The filesystem does not support this operation.
> +.TP
>  .B EOVERFLOW
>  The requested source or destination range is too large to represent in the
>  specified data types.
> @@ -184,10 +187,17 @@ or
>  .I fd_out
>  refers to an active swap file.
>  .TP
> -.B EXDEV
> +.BR EXDEV " (before Linux 5.3)"
> +The files referred to by
> +.IR fd_in " and " fd_out
> +are not on the same filesystem.
> +.TP
> +.BR EXDEV " (since Linux 5.12)"
>  The files referred to by
>  .IR fd_in " and " fd_out
> -are not on the same mounted filesystem (pre Linux 5.3).
> +are not on the same filesystem,
> +and the source and target filesystems are not of the same type,
> +or do not support cross-filesystem copy.
>  .SH VERSIONS
>  The
>  .BR copy_file_range ()
> @@ -200,8 +210,11 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
>  are much more strictly checked than on earlier kernels.
>  Applications should target the behaviour and requirements of 5.3 kernels.
>  .PP
> -First support for cross-filesystem copies was introduced in Linux 5.3.
> -Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> +Since Linux 5.12,
> +cross-filesystem copies can be achieved
> +when both filesystems are of the same type,
> +and that filesystem implements support for it.
> +See BUGS for behavior prior to 5.12.
>  .SH CONFORMING TO
>  The
>  .BR copy_file_range ()
> @@ -226,6 +239,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
>  such as the use of reflinks (i.e., two or more inodes that share
>  pointers to the same copy-on-write disk blocks)
>  or server-side-copy (in the case of NFS).
> +.SH BUGS
> +In Linux kernels 5.3 to 5.11,
> +cross-filesystem copies were implemented by the kernel,
> +if the operation was not supported by individual filesystems.
> +However, on some virtual filesystems,
> +the call failed to copy, while still reporting success.
>  .SH EXAMPLES
>  .EX
>  #define _GNU_SOURCE
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 10%]

* [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12
       [not found]     <20210509213930.94120-1-alx.manpages@gmail.com>
@ 2021-05-09 21:39  3% ` Alejandro Colomar
  2021-05-10  0:01 10%   ` Michael Kerrisk (man-pages)
  0 siblings, 1 reply; 200+ results
From: Alejandro Colomar @ 2021-05-09 21:39 UTC (permalink / raw)
  To: mtk.manpages
  Cc: Alejandro Colomar, linux-man, Luis Henriques, Amir Goldstein,
	Greg KH, Anna Schumaker, Jeff Layton, Steve French,
	Miklos Szeredi, Trond Myklebust, Alexander Viro, Darrick J. Wong,
	Dave Chinner, Nicolas Boichat, Ian Lance Taylor, Luis Lozano,
	Andreas Dilger, Olga Kornievskaia, Christoph Hellwig, ceph-devel,
	linux-kernel, CIFS, samba-technical, linux-fsdevel,
	Linux NFS Mailing List, Walter Harms

Linux 5.12 fixes a regression.

Cross-filesystem copies (introduced in 5.3) were buggy.

Move the statements documenting cross-fs to BUGS.
Kernels 5.3..5.11 should be patched soon.

State version information for some errors related to this.

Reported-by: Luis Henriques <lhenriques@suse.de>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Related: <https://lwn.net/Articles/846403/>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Nicolas Boichat <drinkcat@chromium.org>
Cc: Ian Lance Taylor <iant@google.com>
Cc: Luis Lozano <llozano@chromium.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Olga Kornievskaia <aglo@umich.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>
Cc: CIFS <linux-cifs@vger.kernel.org>
Cc: samba-technical <samba-technical@lists.samba.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Cc: Walter Harms <wharms@bfs.de>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
---
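Not part of the patch: a minimal sketch of how an application might cope
with the EXDEV and EOPNOTSUPP errors documented in the diff below.  The
helper name copy_range_or_fallback() is made up for illustration, and
treating ENOSYS the same way is an assumption about what callers want on
kernels without the syscall.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* glibc exposes copy_file_range() under this macro */
#endif
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy up to len bytes from fd_in to fd_out, using
 * each descriptor's current file offset.  Falls back to read()/write()
 * when the kernel or filesystem refuses the in-kernel copy.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static ssize_t copy_range_or_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	/* First try the in-kernel copy. */
	while (done < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);
		if (n > 0) {
			done += (size_t)n;
			continue;
		}
		if (n == 0)
			return (ssize_t)done;	/* end of file on fd_in */
		if (errno == EXDEV || errno == EOPNOTSUPP || errno == ENOSYS)
			break;			/* use the fallback below */
		return -1;			/* any other error is fatal */
	}

	/* Userspace fallback: plain read()/write() loop. */
	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t r = read(fd_in, buf, want);

		if (r < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (r == 0)
			break;			/* end of file on fd_in */

		for (ssize_t off = 0; off < r; ) {
			ssize_t w = write(fd_out, buf + off, (size_t)(r - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		done += (size_t)r;
	}
	return (ssize_t)done;
}
```

Note that a return value of 0 from copy_file_range() is treated as end of
file here, so this sketch does not work around the 5.3..5.11 behaviour
described in BUGS, where the call could report success without copying.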
 man2/copy_file_range.2 | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 467a16300..843e02241 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
 .B ENOSPC
 There is not enough space on the target filesystem to complete the copy.
 .TP
+.BR EOPNOTSUPP " (since Linux 5.12)"
+The filesystem does not support this operation.
+.TP
 .B EOVERFLOW
 The requested source or destination range is too large to represent in the
 specified data types.
@@ -184,10 +187,17 @@ or
 .I fd_out
 refers to an active swap file.
 .TP
-.B EXDEV
+.BR EXDEV " (before Linux 5.3)"
+The files referred to by
+.IR fd_in " and " fd_out
+are not on the same filesystem.
+.TP
+.BR EXDEV " (since Linux 5.12)"
 The files referred to by
 .IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same filesystem,
+and the source and target filesystems are not of the same type,
+or do not support cross-filesystem copy.
 .SH VERSIONS
 The
 .BR copy_file_range ()
@@ -200,8 +210,11 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
 are much more strictly checked than on earlier kernels.
 Applications should target the behaviour and requirements of 5.3 kernels.
 .PP
-First support for cross-filesystem copies was introduced in Linux 5.3.
-Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+Since Linux 5.12,
+cross-filesystem copies can be achieved
+when both filesystems are of the same type,
+and that filesystem implements support for it.
+See BUGS for behavior prior to 5.12.
 .SH CONFORMING TO
 The
 .BR copy_file_range ()
@@ -226,6 +239,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
 such as the use of reflinks (i.e., two or more inodes that share
 pointers to the same copy-on-write disk blocks)
 or server-side-copy (in the case of NFS).
+.SH BUGS
+In Linux kernels 5.3 to 5.11,
+cross-filesystem copies were implemented by the kernel,
+if the operation was not supported by individual filesystems.
+However, on some virtual filesystems,
+the call failed to copy, while still reporting success.
 .SH EXAMPLES
 .EX
 #define _GNU_SOURCE
-- 
2.31.1


^ permalink raw reply related	[relevance 3%]

* [PATCH v26 06/30] x86/cet: Add control-protection fault handler
  @ 2021-04-27 20:42  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-04-27 20:42 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works similarly to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..5791c02864ec 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..0315fb297dd3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..dd92490b1e7f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..a40b34b09400 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -606,6 +607,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index d2597000407a..1c2ea91284a0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -231,7 +231,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]

* [PATCH v25 06/30] x86/cet: Add control-protection fault handler
  @ 2021-04-15 22:13  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-04-15 22:13 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works similarly to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
v25:
- Change CONFIG_X86_CET to CONFIG_X86_SHADOW_STACK.
- Change X86_FEATURE_CET to X86_FEATURE_SHSTK.

 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..5791c02864ec 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_SHADOW_STACK
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..0315fb297dd3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_SHADOW_STACK
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..dd92490b1e7f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 651e3e508959..a40b34b09400 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -606,6 +607,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_SHADOW_STACK
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_SHSTK))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index d2597000407a..1c2ea91284a0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -231,7 +231,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
  2021-04-14  5:52  3% [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
@ 2021-04-14  7:22  0% ` Anton Ivanov
  0 siblings, 0 replies; 200+ results
From: Anton Ivanov @ 2021-04-14  7:22 UTC (permalink / raw)
  To: Andrei Vagin, linux-kernel, linux-api
  Cc: linux-um, criu, avagin, Andrew Morton, Andy Lutomirski,
	Christian Brauner, Dmitry Safonov, Ingo Molnar, Jeff Dike,
	Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra,
	Richard Weinberger, Thomas Gleixner

On 14/04/2021 06:52, Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> a process's memory faster than we can with ptrace. Now it is time for
> process_vm_exec, which allows executing code in the address space of
> another process. We can do this with ptrace, but it is much slower.
> 
> = Use-cases =
> 
> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes, but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sorts of sandboxes use
> PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> significantly speed them up.

Certainly interesting, but will require um to rework most of its memory 
management and we will most likely need extra mm support to make use of 
it in UML. We are not likely to get away just with one syscall there.

> 
> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be obtained only from the process itself. Right
> now, we use parasite code that is injected into the process. We do
> this with ptrace, but it is slow, unsafe, and tricky. process_vm_exec can
> simplify injecting the parasite code, and it will allow pre-dumping
> memory without stopping processes. A pre-dump is when we enable a memory
> tracker and dump the memory while the process continues running. On each
> iteration, we dump the memory that has changed since the previous
> iteration. In the final step, we stop the processes and dump their full
> state. Right now, the most effective way to dump process memory is to
> create a set of pipes and splice memory into these pipes from the
> parasite code. With process_vm_exec, we will be able to call vmsplice
> directly. It means that we will not need to stop a process to inject
> the parasite code.
> 
> = How it works =
> 
> process_vm_exec has two modes:
> 
> * Execute code in an address space of a target process and stop on any
>    signal or system call.
> 
> * Execute a system call in an address space of a target process.
> 
> int process_vm_exec(pid_t pid, struct sigcontext uctx,
> 		    unsigned long flags, siginfo_t siginfo,
> 		    sigset_t  *sigmask, size_t sizemask)
> 
> pid identifies the target process. We could consider using a pidfd
> instead of a PID here.
> 
> sigcontext contains the process state with which the process will be
> resumed after switching the address space. When the process is stopped,
> its state is saved back to sigcontext.
> 
> siginfo is information about a signal that has interrupted the process.
> If a process is interrupted by a system call, siginfo will contain a
> synthetic siginfo of the SIGSYS signal.
> 
> sigmask is a set of signals that process_vm_exec returns via siginfo.
> 
> = How fast is it =
> 
> In the fourth patch, you can find two benchmarks that execute a function
> that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap
> system calls; process_vm_exec uses the process_vm_exec syscall to do the
> same thing.
> 
> ptrace_vm_exec:   1446 ns/syscall
> process_vm_exec:   289 ns/syscall
> 
> PS: This version is just a prototype. Its goal is to collect initial
> feedback, to discuss the interfaces, and maybe to get some advice on the
> implementation.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
> Cc: Christian Brauner <christian.brauner@ubuntu.com>
> Cc: Dmitry Safonov <0x7f454c46@gmail.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jeff Dike <jdike@addtoit.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Cc: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Richard Weinberger <richard@nod.at>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> 
> Andrei Vagin (4):
>    signal: add a helper to restore a process state from sigcontex
>    arch/x86: implement the process_vm_exec syscall
>    arch/x86: allow to execute syscalls via process_vm_exec
>    selftests: add tests for process_vm_exec
> 
>   arch/Kconfig                                  |  15 ++
>   arch/x86/Kconfig                              |   1 +
>   arch/x86/entry/common.c                       |  19 +++
>   arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>   arch/x86/include/asm/sigcontext.h             |   2 +
>   arch/x86/kernel/Makefile                      |   1 +
>   arch/x86/kernel/process_vm_exec.c             | 160 ++++++++++++++++++
>   arch/x86/kernel/signal.c                      | 125 ++++++++++----
>   include/linux/entry-common.h                  |   2 +
>   include/linux/process_vm_exec.h               |  17 ++
>   include/linux/sched.h                         |   7 +
>   include/linux/syscalls.h                      |   6 +
>   include/uapi/asm-generic/unistd.h             |   4 +-
>   include/uapi/linux/process_vm_exec.h          |   8 +
>   kernel/entry/common.c                         |   2 +-
>   kernel/fork.c                                 |   9 +
>   kernel/sys_ni.c                               |   2 +
>   .../selftests/process_vm_exec/Makefile        |   7 +
>   tools/testing/selftests/process_vm_exec/log.h |  26 +++
>   .../process_vm_exec/process_vm_exec.c         | 105 ++++++++++++
>   .../process_vm_exec/process_vm_exec_fault.c   | 111 ++++++++++++
>   .../process_vm_exec/process_vm_exec_syscall.c |  81 +++++++++
>   .../process_vm_exec/ptrace_vm_exec.c          | 111 ++++++++++++
>   23 files changed, 785 insertions(+), 37 deletions(-)
>   create mode 100644 arch/x86/kernel/process_vm_exec.c
>   create mode 100644 include/linux/process_vm_exec.h
>   create mode 100644 include/uapi/linux/process_vm_exec.h
>   create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
>   create mode 100644 tools/testing/selftests/process_vm_exec/log.h
>   create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
>   create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
>   create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
>   create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c
> 


-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

^ permalink raw reply	[relevance 0%]

* [PATCH 0/4 POC] Allow executing code and syscalls in another address space
@ 2021-04-14  5:52  3% Andrei Vagin
  2021-04-14  7:22  0% ` Anton Ivanov
  0 siblings, 1 reply; 200+ results
From: Andrei Vagin @ 2021-04-14  5:52 UTC (permalink / raw)
  To: linux-kernel, linux-api
  Cc: linux-um, criu, avagin, Andrei Vagin, Andrew Morton,
	Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov,
	Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk,
	Oleg Nesterov, Peter Zijlstra, Richard Weinberger,
	Thomas Gleixner

We already have process_vm_readv and process_vm_writev to read and write
a process's memory faster than we can with ptrace. Now it is time for
process_vm_exec, which allows executing code in the address space of
another process. We can do this with ptrace, but it is much slower.

= Use-cases =

Here are two known use-cases. The first one is “application kernel”
sandboxes like User-mode Linux and gVisor. In this case, we have a
process that runs the sandbox kernel and a set of stub processes that
are used to manage guest address spaces. Guest code is executed in the
context of stub processes, but all system calls are intercepted and
handled in the sandbox kernel. Right now, these sorts of sandboxes use
PTRACE_SYSEMU to trap system calls, but process_vm_exec can
significantly speed them up.

Another use-case is CRIU (Checkpoint/Restore in User-space). Several
process properties can be obtained only from the process itself. Right
now, we use parasite code that is injected into the process. We do
this with ptrace, but it is slow, unsafe, and tricky. process_vm_exec can
simplify injecting the parasite code, and it will allow pre-dumping
memory without stopping processes. A pre-dump is when we enable a memory
tracker and dump the memory while the process continues running. On each
iteration, we dump the memory that has changed since the previous
iteration. In the final step, we stop the processes and dump their full
state. Right now, the most effective way to dump process memory is to
create a set of pipes and splice memory into these pipes from the
parasite code. With process_vm_exec, we will be able to call vmsplice
directly. It means that we will not need to stop a process to inject
the parasite code.

= How it works =

process_vm_exec has two modes:

* Execute code in an address space of a target process and stop on any
  signal or system call.

* Execute a system call in an address space of a target process.

int process_vm_exec(pid_t pid, struct sigcontext uctx,
		    unsigned long flags, siginfo_t siginfo,
		    sigset_t  *sigmask, size_t sizemask)

pid identifies the target process. We could consider using a pidfd
instead of a PID here.

sigcontext contains the process state with which the process will be
resumed after switching the address space. When the process is stopped,
its state is saved back to sigcontext.

siginfo is information about a signal that has interrupted the process.
If a process is interrupted by a system call, siginfo will contain a
synthetic siginfo of the SIGSYS signal.

sigmask is a set of signals that process_vm_exec returns via siginfo.

= How fast is it =

In the fourth patch, you can find two benchmarks that execute a function
that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap
system calls; process_vm_exec uses the process_vm_exec syscall to do the
same thing.

ptrace_vm_exec:   1446 ns/syscall
process_vm_exec:   289 ns/syscall

PS: This version is just a prototype. Its goal is to collect initial
feedback, to discuss the interfaces, and maybe to get some advice on the
implementation.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michael Kerrisk (man-pages) <mtk.manpages@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>

Andrei Vagin (4):
  signal: add a helper to restore a process state from sigcontex
  arch/x86: implement the process_vm_exec syscall
  arch/x86: allow to execute syscalls via process_vm_exec
  selftests: add tests for process_vm_exec

 arch/Kconfig                                  |  15 ++
 arch/x86/Kconfig                              |   1 +
 arch/x86/entry/common.c                       |  19 +++
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/x86/include/asm/sigcontext.h             |   2 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/process_vm_exec.c             | 160 ++++++++++++++++++
 arch/x86/kernel/signal.c                      | 125 ++++++++++----
 include/linux/entry-common.h                  |   2 +
 include/linux/process_vm_exec.h               |  17 ++
 include/linux/sched.h                         |   7 +
 include/linux/syscalls.h                      |   6 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 include/uapi/linux/process_vm_exec.h          |   8 +
 kernel/entry/common.c                         |   2 +-
 kernel/fork.c                                 |   9 +
 kernel/sys_ni.c                               |   2 +
 .../selftests/process_vm_exec/Makefile        |   7 +
 tools/testing/selftests/process_vm_exec/log.h |  26 +++
 .../process_vm_exec/process_vm_exec.c         | 105 ++++++++++++
 .../process_vm_exec/process_vm_exec_fault.c   | 111 ++++++++++++
 .../process_vm_exec/process_vm_exec_syscall.c |  81 +++++++++
 .../process_vm_exec/ptrace_vm_exec.c          | 111 ++++++++++++
 23 files changed, 785 insertions(+), 37 deletions(-)
 create mode 100644 arch/x86/kernel/process_vm_exec.c
 create mode 100644 include/linux/process_vm_exec.h
 create mode 100644 include/uapi/linux/process_vm_exec.h
 create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
 create mode 100644 tools/testing/selftests/process_vm_exec/log.h
 create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
 create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
 create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
 create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c

-- 
2.29.2


^ permalink raw reply	[relevance 3%]

* [ANNOUNCE] util-linux v2.37-rc1
@ 2021-04-12 10:30  5% Karel Zak
  0 siblings, 0 replies; 200+ results
From: Karel Zak @ 2021-04-12 10:30 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, util-linux


 
The util-linux release v2.37-rc1 is available at
 
  http://www.kernel.org/pub/linux/utils/util-linux/v2.37/
 
Feedback and bug reports, as always, are welcome.
 
  Karel



Util-linux 2.37 Release Notes
=============================
 
Release highlights
------------------

This project no longer uses Groff to maintain man pages. Since v2.37 all text
is maintained in AsciiDoc, and the man pages are generated by asciidoctor
during the package build process (see also the --disable-asciidoc configure
option). Thanks to Mario Blättermann.
                                                
The long-term goal is to also maintain man-page translations (via
translationproject.org and po4a) in the util-linux project. Please contact
Mario Blättermann if you want to help with the conversion from
manpages-l10n.
                                                
The old hardlink(1) implementation from Jakub Jelinek (originally for Fedora)
has been replaced by a new implementation from Julian Andres Klode (originally
for Debian). The new implementation does not support the -f option to force
hardlink creation between filesystems.
                                                
lscpu(1) has been reimplemented. Now it analyzes /sys for all CPUs and provides
information for all CPU types used by the system (for example heterogeneous
big.LITTLE ARMs, etc.). The command also reads SMBIOS tables to get CPU
identifiers. Thanks to Masayoshi Mizuma from Fujitsu and Jeffrey Bastian from
Red Hat. The default output on the terminal is now more structured and more
human-readable.
                                                
uclampset(1) is a new utility to manipulate the utilization clamping attributes
of the system or a process. Thanks to Qais Yousef from ARM.
                                                
hexdump(1) automatically uses -C when called as "hd".
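
The behaviour can be pictured as a check on the name the program was invoked
under. The sketch below is a hypothetical Python illustration of that idea
only; hexdump itself is written in C, and the function name effective_args()
is an assumption made for this example.

```python
import os


def effective_args(argv0, args):
    """Return the argument list a hexdump-like tool would act on.

    Assumption for illustration: when the basename of argv[0] is "hd",
    behave as if -C had been given on the command line.
    """
    if os.path.basename(argv0) == "hd" and "-C" not in args:
        return ["-C"] + list(args)
    return list(args)


print(effective_args("/usr/bin/hd", ["file.bin"]))
print(effective_args("/usr/bin/hexdump", ["file.bin"]))
```

In practice this means a symlink named "hd" pointing at hexdump is enough to
get the canonical hex+ASCII output.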
                                                
dmesg(1) supports new command-line options --since and --until.
                                                
findmnt(8) supports a new command-line option --shadowed to print only
filesystems over-mounted by another filesystem.
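
One way to picture "over-mounted" is a mount whose child is mounted on exactly
the same target. The sketch below is a simplified, hypothetical model of that
idea and not the libmount implementation; the tuple layout (mount ID, parent
ID, target) and the function name shadowed_mounts() are assumptions for
illustration.

```python
def shadowed_mounts(mounts):
    """Return IDs of mounts that have a child mounted on the same target.

    mounts: list of (mount_id, parent_id, target) tuples, loosely modelled
    on the first fields of /proc/self/mountinfo (an assumption).
    """
    target_of = {mount_id: target for mount_id, _, target in mounts}
    shadowed = []
    for _, parent_id, target in mounts:
        # A child sitting on its parent's own target hides the parent.
        if target_of.get(parent_id) == target and parent_id not in shadowed:
            shadowed.append(parent_id)
    return shadowed


table = [
    (1, 0, "/"),
    (2, 1, "/mnt"),
    (3, 2, "/mnt"),   # mounted over mount 2
    (4, 1, "/home"),
]
print(shadowed_mounts(table))
```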
                                                
mount(8) supports the --read-only command-line option for non-root users too.
                                                
umount(8) can also unmount all over-mounted filesystems (more filesystems on
the same mount point) when executed with --recursive.
                                                
libfdisk (and fdisk, sfdisk, cfdisk) supports partition type names on input,
ignoring the case of the characters and all non-alphanumeric characters in the
name (e.g. type="Linux /usr x86" is the same as type="linux usr-x86" for
sfdisk).
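
The matching rule described above can be sketched as a comparison of
normalized names. This is a hypothetical Python illustration, not the libfdisk
code; the helper names normalize_type_name() and same_type_name() are
assumptions made for this example.

```python
def normalize_type_name(name):
    """Lower-case the name and drop everything that is not a letter or digit."""
    return "".join(ch.lower() for ch in name if ch.isalnum())


def same_type_name(a, b):
    """Compare two partition type names, ignoring case and punctuation."""
    return normalize_type_name(a) == normalize_type_name(b)


print(same_type_name("Linux /usr x86", "linux usr-x86"))
print(same_type_name("Linux /usr x86", "Linux swap"))
```

Both spellings from the example reduce to the same string, "linuxusrx86", so
they are treated as the same partition type.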
                                                
libmount no longer contains a workaround to detect inconsistent
/proc/self/mountinfo reads. This problem is now fixed in the Linux kernel.
                                                
libblkid supports "probing hints" now. The hints are an optional way to force
probing functions to check, for example, another location, such as a specific
session on multi-session UDF. The command blkid(8) supports this functionality
with a new --hint option. The library has also been extended to support other
ISO9660 and UDF identifiers. Thanks to Pali Rohár.

blkzone(8) provides a new "capacity" command.
 
cfdisk(8) can be started in read-only mode with the new command-line option
--read-only.
 
lsblk(8) provides new columns FSROOTS and MOUNTTARGETS. The column
MOUNTTARGETS is now used in the default output and prints all mount points
where the device is used (btrfs subvolumes, bind mounts, etc).
 
losetup(8) uses the LOOP_CONFIGURE ioctl now.
 
column(1) supports a new command-line option --table-columns-limit to specify
the maximum number of input columns. The last column will contain all remaining
line data if the limit is smaller than the number of columns in the input
data.
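
The effect of the limit is the same as splitting a line a bounded number of
times. The sketch below is a hypothetical Python illustration of that
behaviour, not the column(1) implementation; the function name
split_with_limit() is an assumption made for this example.

```python
def split_with_limit(line, limit, sep=None):
    """Split a line into at most `limit` columns.

    When the line has more fields than `limit`, the last column keeps the
    rest of the line unsplit. sep=None splits on runs of whitespace.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return line.split(sep, limit - 1)


print(split_with_limit("a b c d e", 3))
print(split_with_limit("a b", 3))
```

With a limit of 3, the line "a b c d e" yields the columns "a", "b" and
"c d e"; a line with fewer fields than the limit is split normally.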
 
It's possible to use meson to build util-linux. This feature is experimental
and currently designed only for developers. Don't panic: the autotools-based
build process remains the primary one and will be supported and maintained for
years to come.


Changes between v2.36 and v2.37
-------------------------------

Asciidoc:
   - Adapt Makefiles to new asciidoc man pages  [Mario Blättermann]
   - Add Po4a hint to file headers  [Mario Blättermann]
   - Add po4a config file and initial translation template for man pages  [Mario Blättermann]
   - Better gettext message splitting in nsenter.1.adoc  [Mario Blättermann]
   - Fix artifact from initial import, sixth attempt  [Mario Blättermann]
   - Fix artifacts from initial import  [Mario Blättermann]
   - Fix artifacts from initial import, fifth attempt  [Mario Blättermann]
   - Fix artifacts from initial import, fourth attempt  [Mario Blättermann]
   - Fix artifacts from initial import, second attempt  [Mario Blättermann]
   - Fix artifacts from initial import, third attempt  [Mario Blättermann]
   - Fix man pages with variables to use the same value as in previous *.in files  [Mario Blättermann]
   - Fix typo  [Mario Blättermann]
   - Fix typo and remove invisible spaces which confuse po4a  [Mario Blättermann]
   - Formatting cleanup  [Mario Blättermann]
   - Import disk-utils man pages  [Mario Blättermann]
   - Import hwclock.8.in  [Mario Blättermann]
   - Import libuuid man pages  [Mario Blättermann]
   - Import login-utils man pages  [Mario Blättermann]
   - Import misc-utils man pages  [Mario Blättermann]
   - Import rtcwake.8.in  [Mario Blättermann]
   - Import sys-utils man pages, part 1  [Mario Blättermann]
   - Import sys-utils man pages, part 2  [Mario Blättermann]
   - Import sys-utils man pages, part 3  [Mario Blättermann]
   - Import term-utils man pages  [Mario Blättermann]
   - Import textutils man pages  [Mario Blättermann]
   - Incorporate latest change in findmnt.8  [Mario Blättermann]
   - Incorporate latest changes in findmnt.8  [Karel Zak]
   - Incorporate latest changes in rfkill.8 and umount.8  [Mario Blättermann]
   - Re-add empty lines to man pages  [Mario Blättermann]
   - Remove already imported *roff man pages  [Mario Blättermann]
   - Remove already imported disk-utils *roff man pages  [Mario Blättermann]
   - Remove already imported login-utils *roff man pages  [Mario Blättermann]
   - Remove already imported misc-utils *roff man pages  [Mario Blättermann]
   - Remove already imported text-utils *roff man pages  [Mario Blättermann]
   - Remove old man page links  [Mario Blättermann]
   - Reorder example command sequence  [Mario Blättermann]
   - Review disk-utils man pages  [Mario Blättermann]
   - Review login-utils man pages  [Mario Blättermann]
   - Review misc-utils man pages  [Mario Blättermann]
   - Review schedutils man pages  [Mario Blättermann]
   - Review sys-utils man pages, part 2  [Mario Blättermann]
   - Review sys-utils man pages,part 1  [Mario Blättermann]
   - Review term-utils man pages  [Mario Blättermann]
   - Review terminal-colors.d.5.adoc  [Mario Blättermann]
   - Review text-utils man pages  [Mario Blättermann]
   - Small fix in nsenter.1.adoc  [Mario Blättermann]
   - Small indentation fix in mount.8.adoc  [Mario Blättermann]
   - Some formatting cleanup in man pages  [Mario Blättermann]
   - Some more  man page formatting improvements  [Mario Blättermann]
   - Unify spelling of »User Commands«  [Mario Blättermann]
   - Update .pot template  [Mario Blättermann]
   - Use correct ' man manual ' for man pages from section 8  [Mario Blättermann]
   - Yet another formatting fix  [Mario Blättermann]
   - add missing bugreports section to libblkid and some cleanup  [Mario Blättermann]
Automake:
   - install uuidgen bash completion only if it is built  [Luca Boccassi]
   - use EXTRA_LTLIBRARIES instead of noinst_LTLIBRARIES  [Luca Boccassi]
Manual pages:
   - spelling and grammar fixes  [Ville Skyttä]
   - agetty.8  Minor formatting and wording fixes  [Michael Kerrisk (man-pages)]
   - blockdev.8  Minor wording and formatting fixes  [Michael Kerrisk (man-pages)]
   - blockdev.8, sfdisk.8  typo fixes  [Michael Kerrisk (man-pages)]
   - document the 'resize' command  [Vincent McIntyre]
   - logger.1  minor formatting and typo fixes  [Michael Kerrisk (man-pages)]
   - lsblk.8  Minor formatting and typo fixes  [Michael Kerrisk (man-pages)]
   - lslogins.1  Minor wording and formatting fixres  [Michael Kerrisk (man-pages)]
   - nologin.8  formatting fixes  [Michael Kerrisk (man-pages)]
   - raw.8  Minor formatting and wording fixes  [Michael Kerrisk (man-pages)]
   - sfdisk.8  Minor wording and formatting fixes  [Michael Kerrisk (man-pages)]
   - sfdisk.8  Use less aggressive indenting  [Michael Kerrisk (man-pages)]
   - wdctl.8  typo fix  [Michael Kerrisk (man-pages)]
   - wipefs.8  Formatting fixes  [Michael Kerrisk (man-pages)]
agetty:
   - Allow --init-string on a virtual console  [Ivan Mironov]
   - fix typo in manual page  [Samanta Navarro]
   - tty eol defaults to REPRINT  [Sami Loone]
bash-completion:
   - (lsblk) update columns  [Karel Zak]
   - add column --table-columns-limit  [Karel Zak]
   - add irqtop/lsirq --softirq  [Karel Zak]
blkdiscard:
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
blkid:
   - add --hint <name>=value  [Karel Zak]
   - add another UDF identifiers  [Karel Zak]
   - encode all udf and iso IDs in udev output  [Karel Zak]
blkzone:
   - add capacity field to zone report  [Shin'ichiro Kawasaki]
   - add report capacity command  [Hans Holmberg]
blockdev:
   - fix man page formatting  [Jakub Wilk]
build-sys:
   - add --disable-scriptutils  [Karel Zak]
   - add EXTRA_LTLIBRARIES beween CLEANFILES  [Karel Zak]
   - add UL_REQUIRES_PROGRAM() macro, use it for asciidoc  [Karel Zak]
   - add man-common/Makemodule.am  [Karel Zak]
   - add missing header file  [Karel Zak]
   - add restrict keyword fallback  [Karel Zak]
   - add support for --enable-fuzzing-engine  [Evgeny Vereshchagin]
   - check for libselinux >= 3.1  [Karel Zak]
   - cleanup .gitignore files  [Karel Zak]
   - cleanup distcheck options  [Karel Zak]
   - cleanup uclampset dependencies  [Karel Zak]
   - do not build plymouth-ctrl.c w/ disabled plymouth  [Pino Toscano]
   - do not use extra subdir for getopt examples  [Karel Zak]
   - exclude GPL from libcommon  [Karel Zak]
   - fix out-of-tree build  [Karel Zak]
   - fix schedutils/sched_attr.h include  [Karel Zak]
   - fix sendfile use  [Karel Zak]
   - fix typo  [Karel Zak]
   - improve asciidoc generic rule  [Karel Zak]
   - make man pages location independent  [Karel Zak]
   - make man pages optional, add --disable-asciidoc  [Karel Zak]
   - move selinux_utils.c  [Karel Zak]
   - remove duplicate hook  [Karel Zak]
   - remove fallback for security_context_t  [Karel Zak]
   - remove man page link files  [Karel Zak]
   - remove some man pages from PATHFILES  [Karel Zak]
   - set localstatedir and sysconfdir default  [Karel Zak]
   - silence non-POSIX variable name warning  [Sami Kerola]
   - sort various lists in configure.ac  [Sami Kerola]
   - split man pages and man page links  [Karel Zak]
   - update to autoconf 2.70  [Sami Kerola]
   - use _DATA to install getopt examples  [Karel Zak]
build-system:
   - make "make distcheck" work  [Evgeny Vereshchagin]
   - stop looking for %ms and %as  [Evgeny Vereshchagin]
cal:
   - do not use putp(), directly use stdio functions  [Karel Zak]
cfdisk:
   - (man) add info when cfdisk writes to the device  [Karel Zak]
   - Implemented cfdisk's opening in read-only mode  [Dmitriy Chestnykh]
   - show Q option when choosing label type  [Chris Hofstaedtler]
chfs-chfn:
   - remove deprecated selinux_check_passwd_access()  [Karel Zak]
chrt:
   - (man) add human-readable names for policies  [Karel Zak]
   - don't restrict --reset-on-fork, add more info to man page  [Karel Zak]
   - use SCHED_FLAG_RESET_ON_FORK for sched_setattr()  [Karel Zak]
ci:
   - 'downgrade' Ubuntu version to Bionic  [Frantisek Sumsal]
   - build both w/ and w/o sanitizers on GH Actions  [Frantisek Sumsal]
   - code cleanup  [Frantisek Sumsal]
   - deal with uninstrumented binaries using instrumented libs  [Frantisek Sumsal]
   - run the build test for each pull request  [Frantisek Sumsal]
   - trigger CiFuzz for the master branch only  [Evgeny Vereshchagin]
   - use the correct compiler version  [Frantisek Sumsal]
cifuzz:
   - reindent yaml file  [Sami Kerola]
   - turn on MSan  [Evgeny Vereshchagin]
col:
   - add defaults to switch case clauses  [Sami Kerola]
   - add handle_not_graphic() function  [Sami Kerola]
   - add more tests  [Sami Kerola]
   - add structure to hold line variables  [Sami Kerola]
   - add update_cur_line() function  [Sami Kerola]
   - cleanup usage() and struct col_*  [Karel Zak]
   - enable deallocation on exit also for __SANITIZE_ADDRESS__  [Karel Zak]
   - fix --help short option in usage() output  [Sami Kerola]
   - flip all comparisions to numerical order  [Sami Kerola]
   - free memory before exit [LeakSanitizer]  [Sami Kerola]
   - initialize variables when they are declared  [Sami Kerola]
   - make input to tolerate invalid wide characters  [Sami Kerola]
   - move global variables to a control structure  [Sami Kerola]
   - move option handling to separate function  [Sami Kerola]
   - remove function prototypes  [Sami Kerola]
   - replace LINE and CHAR typedefs with structs  [Sami Kerola]
   - tidy up sources a little bit  [Sami Kerola]
   - use inline function rather than function like define  [Sami Kerola]
   - use size_t when dealing with numbers that buffer sizes  [Sami Kerola]
   - use typedef and enum to clarify struct  [Sami Kerola]
colrm:
   - fix argument parsing  [Sami Kerola]
column:
   - Deprecate --table-empty-lines in favor of --keep-empty-lines  [Lennard Hofmann]
   - Optionally keep empty lines in cols/rows mode  [Lennard Hofmann]
   - add --table-columns-limit  [Karel Zak]
configure:
   - test -a|o is not POSIX  [Issam E. Maghni]
configure.ac:
   - check for sendfile  [Egor Chelak]
dmesg:
   - add --since and --until  [Karel Zak]
   - fix and cleanup --read-clear  [Karel Zak]
docs:
   - add hint about make install-strip and link to Documentation/  [Karel Zak]
   - add note about github  [Karel Zak]
   - fix typo in v2.36-ReleaseNotes  [Karel Zak]
   - mention OSS-Fuzz and CIFuzz and how to build fuzz targets locally  [Evgeny Vereshchagin]
   - rename to getopt-example  [Karel Zak]
   - update AUTHORS file  [Karel Zak]
   - update Documentation/howto-man-page.txt  [Karel Zak]
   - update TODO  [Karel Zak]
   - update TODO (add item about mnt_context_get_excode() )  [Karel Zak]
   - update TODO (scols borders)  [Karel Zak]
   - update TODO file (add item about libblkid ZFS)  [Karel Zak]
eject:
   - cleanup before successful exit  [Karel Zak]
fallocate:
   - fix --dig-holes at end of files  [Gero Treuner]
fdformat:
   - remove command from default build  [Sami Kerola]
fdisk:
   - (man) add info about order for -l  [Karel Zak]
   - always report fdisk_create_disklabel() errors  [Karel Zak]
   - always skips zeros in dumps  [Karel Zak]
   - fix expected test output on alpha  [Chris Hofstaedtler]
   - support partition type name in dialogs  [Karel Zak]
findmnt:
   - (man) add more info about --target  [Karel Zak]
   - add --shadowed  [Karel Zak]
   - add PARENT column  [Karel Zak]
   - add option to list all fs-independent flags  [Roberto Bergantinos Corpas]
   - sort columns  [Karel Zak]
flock:
   - keep -E exit status more restrictive  [Karel Zak]
fsck, libblkid:
   - fix printf format string issue [coverity scan]  [Sami Kerola]
fsck.cramfs:
   - fix fsck.cramfs crashes on blocksizes > 4K  [ToddRK]
fstab:
   - fstab.5 NTFS and FAT volume IDs use upper case  [Heinrich Schuchardt]
fstrim:
   - fix memory leak [coverity scan]  [Karel Zak]
   - remove fstab condition from fstrim.timer  [Dusty Mabe]
fuzzers:
   - make tests setup more robust  [Karel Zak]
getopt:
   - explicitly ask for POSIX mode on POSIXLY_CORRECT  [Đoàn Trần Công Danh]
github:
   - CC fix export  [Karel Zak]
   - add 'distcheck' workflow job  [Karel Zak]
   - add build workflow  [Karel Zak]
   - add ruby-asciidoctor to CI-build  [Karel Zak]
   - cleanup cibuild.sh  [Karel Zak]
   - enable ci-build for all basic branches  [Karel Zak]
   - export CC and CXX  [Karel Zak]
   - fix asciidoctror dependence  [Karel Zak]
   - fix btrfs package name  [Karel Zak]
   - fix cibuild typo  [Karel Zak]
   - fix distcheck job  [Karel Zak]
   - make sure compiler is defined  [Karel Zak]
   - remove distcheck  [Karel Zak]
hardlink:
   - add --quiet option  [Karel Zak]
   - check and use sys/xattr.h  [Karel Zak]
   - cleanup --minimum-size stuff  [Karel Zak]
   - cleanup includes and types  [Karel Zak]
   - cleanup man page  [Karel Zak]
   - cleanup summary  [Karel Zak]
   - cleanup usage()  [Karel Zak]
   - fix hardlink pcre leak  [Sami Kerola]
   - fix indention  [Karel Zak]
   - fix typo in man page  [Karel Zak]
   - move default to options initialization  [Karel Zak]
   - replace with code from Debian  [Karel Zak]
   - s/DEBUG/VERBOSE/  [Karel Zak]
   - translate verbose messages  [Karel Zak]
   - use PRCE2 posix header file  [Karel Zak]
   - use err() if possible  [Karel Zak]
   - use errx() when parse options  [Karel Zak]
   - use monotonic time like other utils  [Karel Zak]
   - use only err.h to print errors and warnings  [Karel Zak]
   - use our xalloc.h  [Karel Zak]
   - use size_to_human_string()  [Karel Zak]
hexdump:
   - automatically use -C when called as hd  [Chris Hofstaedtler]
hwclock:
   - add fallback if SYS_settimeofday does not exist  [Karel Zak]
   - do not assume __NR_settimeofday_time32  [Pino Toscano]
   - fix SYS_settimeofday fallback  [Rosen Penev]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - fix indentation  [Łukasz Stelmach]
   - make tz use more robust [coverity scan]  [Karel Zak]
   - use pointer to adjtime data  [Karel Zak]
include/pathnames:
   - cleanup /proc/sys/kernel use  [Karel Zak]
include/strutils:
   - make xstrncpy() compatible with over-smart gcc 9  [Karel Zak]
ipcs:
   - Avoid shmall overflows  [Vasilis Liaskovitis]
   - fallback for overflow  [Karel Zak]
irqtop:
   - add per-cpu stats  [Karel Zak]
   - check scols_line_set_data() return code  [Karel Zak]
   - print header in reverse mode  [Karel Zak]
   - small cleanup  [Karel Zak]
irqtop/lsirq:
   - add additional desc for softirq  [zhenwei pi]
   - add softirq for man page  [zhenwei pi]
   - support softirq  [zhenwei pi]
lib:
   - add missing headers to .c files  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - use procutils.c on Linux only  [Karel Zak]
   - use ul_prefix for close_all_fds() and mkdir_p()  [Karel Zak]
lib/buffer:
   - add simple grow-able buffer  [Karel Zak]
   - fix end pointer initilaization  [Karel Zak]
   - make it robust for static analyzers [coverity scan]  [Karel Zak]
lib/caputils:
   - add fall back for last cap using prctl.  [Érico Rolim]
   - split to multiple functions, add test  [Karel Zak]
lib/env:
   - add function to save and restore unwanted variables  [Karel Zak]
lib/fileutils:
   - make close_all_fds() to be similar with close_range()  [Sami Kerola]
lib/jsonwrt:
   - add new functions to write in JSON  [Karel Zak]
   - use proper output function  [Karel Zak]
lib/loopdev:
   - cosmetic changes to LOOP_CONFIGURE  [Karel Zak]
   - make is_loopdev() more robust  [Karel Zak]
lib/pager:
   - fix improper use of negative value [coverity scan]  [Sami Kerola]
lib/procutils:
   - add proc_is_procfs helper.  [Érico Rolim]
   - improve proc_is_procfs(), add test  [Karel Zak]
   - use Public Domain for this file  [Karel Zak]
lib/randutils:
   - rename random_get_bytes()  [Sami Kerola]
lib/selinux-utils:
   - cleanup function names  [Karel Zak]
   - tiny cleanup  [Karel Zak]
lib/signames:
   - change license to public domain  [Karel Zak]
lib/strutils:
   - add normalize_whitespace()  [Karel Zak]
   - add ul_stralnumcmp()  [Karel Zak]
lib/sysfs:
   - fix doble free [coverity scan]  [Karel Zak]
libblikid.3.adoc:
   - Add missing SYNOPSIS section  [Mario Blättermann]
libblkid:
   - (gpt) accept tiny devices  [Karel Zak]
   - add blkid_probe_{set,get}_hint()  [Karel Zak]
   - add erofs filesystem support  [Gao Xiang]
   - allow a lot of mac partitions  [Samanta Navarro]
   - allow to specify offset defined by hint for blkid_probe_get_idmag()  [Pali Rohár]
   - detect CD/DVD discs in packet writing mode  [Pali Rohár]
   - detect session_offset hint for optical discs  [Pali Rohár]
   - do size correction of optical discs also by last written sector  [Pali Rohár]
   - drbdmanage  use blkid_probe_strncpy_uuid instead of blkid_probe_set_id_label  [Pali Rohár]
   - export blkid_probe_reset_hints()  [Karel Zak]
   - fix Atari prober logic  [Karel Zak]
   - fix blkid_probe_get_sb() to use hint offset calculation  [Pali Rohár]
   - fix comment block  [Karel Zak]
   - fix memory leak in config parser  [Samanta Navarro]
   - fix some typos in function comments  [nick black]
   - fix time_t handling  [Samanta Navarro]
   - improve debug for /proc/partitions  [Karel Zak]
   - initialize magic strings in robust way  [Karel Zak]
   - iso9660  add new test images  [Pali Rohár]
   - iso9660  add support for VOLUME_SET_ID and DATA_PREPARER_ID  [Pali Rohár]
   - iso9660  add support for multisession via session_offset hint  [Pali Rohár]
   - iso9660  check that iso->publisher_id and iso->application_id are not file paths  [Pali Rohár]
   - iso9660  do not check is_str_empty() for iso->system_id and boot->boot_system_id  [Pali Rohár]
   - iso9660  fix parsing images which do not have Primary Volume Descriptor as the first  [Pali Rohár]
   - iso9660  improve label parsing  [Pali Rohár]
   - iso9660  parse SYSTEM_ID, PUBLISHER_ID and APPLICATION_ID from Joliet  [Pali Rohár]
   - iso9660  set block size also for High Sierra format  [Pali Rohár]
   - limit amount of parsed partitions  [Samanta Navarro]
   - make Atari more robust  [Karel Zak]
   - make gfs2 prober more extendible  [Karel Zak]
   - overwrite existing hint  [Karel Zak]
   - udf  add support for APPLICATION_ID  [Pali Rohár]
   - udf  add support for PUBLISHER_ID  [Pali Rohár]
   - udf  add support for multisession via session_offset hint  [Pali Rohár]
   - udf  add support for unclosed sequential Write-Once media  [Pali Rohár]
   - udf  check that dstrings are encoded in OSTA Compressed Unicode  [Pali Rohár]
   - udf  update test output for APPLICATION_ID and PUBLISHER_ID  [Pali Rohár]
   - use /sys to read all block devices  [Karel Zak]
libfdisk:
   - (dos) fix last possible sector calculation  [Karel Zak]
   - (gpt) make sure device is large enough  [Karel Zak]
   - (gpt) reduce number of entries to fit small device  [Karel Zak]
   - (gpt) returns location of the backup header too  [Karel Zak]
   - (script) don't use sector size if not specified  [Karel Zak]
   - (script) fix possible memory leaks  [Karel Zak]
   - (script) fix possible partno overflow  [Karel Zak]
   - (script) ignore empty values for start and size  [Gaël PORTAY]
   - (script) make sure buffer is initialized  [Karel Zak]
   - (script) make sure label is specified  [Karel Zak]
   - add "Linux /usr" and "Linux /usr verity" GPT partition types  [nl6720]
   - add systemd-homed user's home GPT partition type  [nl6720]
   - another parse_line_nameval() cleanup  [Karel Zak]
   - fix fdisk_reread_changes() for extended partitions  [Karel Zak]
   - fix last free sector detection if partition size specified  [Karel Zak]
   - fix typo from 255f5f4c770ebd46a38b58975bd33e33ae87ed24  [Karel Zak]
   - ignore 33553920 byte optimal I/O size  [Ryan Finnie]
   - make fdisk_partname() more robust  [Karel Zak]
   - make labels allocations readable for analysers [coverity scan]  [Karel Zak]
   - reset context FD on error  [yangzz-97]
   - support partition type name parsing  [Karel Zak]
libmount:
   - (optstr) improve default initialization  [Karel Zak]
   - (python) fix compiler warning  [Karel Zak]
   - Fix 0x%u usage  [Dr. David Alan Gilbert]
   - add assert() to umount lookup code  [Karel Zak]
   - add mnt_table_over_fs()  [Karel Zak]
   - add vboxsf, virtiofs to pseudo filesystems  [Shahid Laher]
   - allow --read-only for not-root users  [Karel Zak]
   - do not canonicalize ZFS source dataset  [Karel Zak]
   - do not use pointer as an integer value  [Sami Kerola]
   - don't use "symfollow" for helpers on user mounts  [Karel Zak]
   - don't use deprecated security_context_t  [Karel Zak]
   - fix /{etc,proc}/filesystems use  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - fix memory leak [coverity scan]  [Karel Zak]
   - fix tab parser for badly terminated lines  [Karel Zak]
   - improve mnt_split_optstr() performance  [Karel Zak]
   - mark entries from /proc/swaps by MNT_FS_SWAP  [Karel Zak]
   - mnt_table_over_fs() make child optional  [Karel Zak]
   - optimize mnt_optstr_apply_flags()  [Karel Zak]
   - remove read-mountinfo workaround  [Karel Zak]
libmount (verity):
   - let crypt_deactivate_by_name handle its own data structure  [Luca Boccassi]
   - plug libcryptsetup logger into our logging system  [Luca Boccassi]
libsmartcols:
   - add comments to private header file  [Karel Zak]
   - add sort sunction to the sample  [Karel Zak]
   - don't print empty output on empty table in JSON  [Karel Zak]
   - fix colors use  [Karel Zak]
   - introduce default sort column  [Karel Zak]
   - remove unnecessary code  [Karel Zak]
   - sanitize variable names on export output  [Karel Zak]
   - support arrays for JSON output  [Karel Zak]
   - use lib/jsonwrt.c for JSON  [Karel Zak]
libsmratcols:
   - print title color only when wanted  [Karel Zak]
libuuid:
   - check quality of random bytes  [Samanta Navarro]
   - improve "restrict" keyword use  [Karel Zak]
   - simplify uuid_is_null() check  [Sami Kerola]
login:
   - add initialize() function to have less stack allocated in main()  [Sami Kerola]
   - add option to not reset username on each attempt  [Thayne McCombs]
   - close() only a file descriptor that is open [coverity scan]  [Sami Kerola]
   - ensure getutxid() does not use uninitialized variable [coverity scan]  [Sami Kerola]
   - fix coding style issues  [Sami Kerola]
   - fix compiler warning [-Werror=strict-prototypes]  [Karel Zak]
   - move generic setting to ttyutils.h  [Karel Zak]
   - move getlogindefs_num() after localization init  [Sami Kerola]
   - move message printing out from main()  [Sami Kerola]
   - move proctitle code to login.c  [Karel Zak]
   - move timeout from global to local scope  [Sami Kerola]
   - replace function like definitions with inline functions  [Sami Kerola]
   - stop keeping timeout message in memory forever  [Sami Kerola]
   - tidy up manual page  [Sami Kerola]
   - use calloc() when memory needs to be cleared  [Sami Kerola]
   - use close_range() system call when possible  [Sami Kerola]
   - use explicit_bzero() to get rid of confidental memory  [Sami Kerola]
   - use full tty path for PAM_TTY  [Karel Zak]
   - use mem2strcpy() rather than rely on printf()  [Karel Zak]
   - use sig_atomic_t type for variable accessed from signal handler  [Sami Kerola]
   - use system definitions to determine maxium login name length  [Sami Kerola]
   - use ul_copy_file  [Egor Chelak]
   - use xalloc memory allocation helpers everywhere  [Sami Kerola]
login-utils:
   - don't use deprecated security_context_t  [Karel Zak]
loopdev:
   - use LOOP_CONFIG ioctl  [Sinan Kaya]
losetup:
   - avoid infinite busy loop  [Karel Zak]
   - fix wrong printf() format specifier for ino_t data type  [Manuel Bentele]
   - increase limit of setup attempts  [Karel Zak]
lsblk:
   - add --width option  [Karel Zak]
   - add FSROOTS column  [Karel Zak]
   - add dependence between CD/DVD block and packet devices  [Karel Zak]
   - add lscpu_read_topology_polarization()  [Karel Zak]
   - fix -T optional argument  [Karel Zak]
   - fix SCSI_IDENT_SERIAL  [Karel Zak]
   - fix filesystem array allocation  [Karel Zak]
   - ignore only loopdevs without backing file  [Karel Zak]
   - print all device mountpoints  [Karel Zak]
   - print zero rather than empty SIZE  [Karel Zak]
   - read ID_SCSI_IDENT_SERIAL if available  [Karel Zak]
   - read SCSI_IDENT_SERIAL also from udev  [Karel Zak]
   - show all empty, except loopdevs  [Karel Zak]
   - update man page  [Karel Zak]
   - use MOUNTPOINTS in --fs  [Karel Zak]
   - use MOUNTTARGETS in default output  [Karel Zak]
lscpu:
   - (arm) reuse parsed vendor ID  [Karel Zak]
   - (cpuinfo) fill empty cputype  [Karel Zak]
   - (cpuinfo) rewrite parser  [Karel Zak]
   - (cputype) add cpuinfo parser  [Karel Zak]
   - (cputype) add debug stuff  [Karel Zak]
   - (cputype) add header file, cleanup patterns code  [Karel Zak]
   - (cputype) add ref-counting, allocate context  [Karel Zak]
   - (cputype) move temporary stuff  [Karel Zak]
   - (cputype) simplify cpuinfo parsing  [Karel Zak]
   - (topology) add read_address()  [Karel Zak]
   - (topology) add read_configure()  [Karel Zak]
   - (topology) add read_mhz()  [Karel Zak]
   - (topology) read caches from /sys  [Karel Zak]
   - (virt) add macros for VMWARE  [Karel Zak]
   - (virt) simplify hypervisor parsing  [Karel Zak]
   - Adapt MIPS cpuinfo  [Karel Zak]
   - Add FUJITSU aarch64 A64FX cpupart  [Shunsuke Nakamura]
   - Even more Arm part numbers  [Jeremy Linton]
   - add LSCPU_OUTPUT_ enum  [Karel Zak]
   - add MHZ column  [Karel Zak]
   - add another part of summary output  [Karel Zak]
   - add extra caches to --cache output  [Karel Zak]
   - add function to count caches size  [Karel Zak]
   - add functions to get CPU freq  [Karel Zak]
   - add helper to get physical sockets  [Masayoshi Mizuma]
   - add info that caches sizes are sum  [Karel Zak]
   - add lscpu_cpu to internal API  [Karel Zak]
   - add lscpu_cpus_loopup_by_type(), improve readability  [Karel Zak]
   - add lscpu_read_architecture()  [Karel Zak]
   - add lscpu_read_cpulists()  [Karel Zak]
   - add lscpu_read_extra()  [Karel Zak]
   - add lscpu_read_numas()  [Karel Zak]
   - add lscpu_read_topolgy_ids()  [Karel Zak]
   - add lscpu_read_topology()  [Karel Zak]
   - add lscpu_read_virtualization()  [Karel Zak]
   - add lscpu_read_vulnerabilities()  [Karel Zak]
   - add note about cache IDs  [Karel Zak]
   - add per type summary function  [Karel Zak]
   - add rest of summary  [Karel Zak]
   - add sections  [Karel Zak]
   - add setsize to lscpu context  [Karel Zak]
   - add shared cached info for s390 lscpu -C  [Karel Zak]
   - add very basic cputype code  [Karel Zak]
   - assume gaps in list of CPUs  [Karel Zak]
   - avoid segfault on PowerPC systems with valid hardware configurations  [Thomas Abraham]
   - calculate threads number from type specific values  [Karel Zak]
   - cleanup --cache  [Karel Zak]
   - cleanup --parse  [Karel Zak]
   - cleanup -e  [Karel Zak]
   - cleanup lscpu_unref_cputype()  [Karel Zak]
   - cleanup arch freeing  [Karel Zak]
   - convert ARM decoding to new API  [Karel Zak]
   - convert getopt block to new API  [Karel Zak]
   - deallocate maps  [Karel Zak]
   - don't use section for extra caches  [Karel Zak]
   - don't use smbios when read snapshots  [Karel Zak]
   - fix MHZ parsing  [Karel Zak]
   - fix NUMAs reading code  [Karel Zak]
   - fix for sparc64  [Karel Zak]
   - fix last caches separator in -e and -p output  [Karel Zak]
   - fix mem-leak in cpu  [Karel Zak]
   - fix memory leaks  [Karel Zak]
   - fix possible null dereferences [coverity scan]  [Karel Zak]
   - fix resource leak [coverity scan]  [Karel Zak]
   - fix variable shadowing  [Sami Kerola]
   - generate cache ID if not available  [Karel Zak]
   - hide all to lscpu_read_topology()  [Karel Zak]
   - improve bogomips use  [Karel Zak]
   - improve debug message  [Karel Zak]
   - improve topology calculation  [Karel Zak]
   - improve topology calculation, use /proc/sysinfo  [Karel Zak]
   - improve topology debug message  [Karel Zak]
   - keep hypervisor name in allocated memory  [Karel Zak]
   - keep static/dynamic MHz in cputype struct  [Karel Zak]
   - merge new API to lscpu.h  [Karel Zak]
   - move debug initialization to main  [Karel Zak]
   - move to main function to init context  [Karel Zak]
   - move topology stuff to separate file  [Karel Zak]
   - new cpuinfo parser  [Karel Zak]
   - print generic part of the summary  [Karel Zak]
   - remove obsolete code  [Karel Zak]
   - remove unnecessary prefix from static function  [Karel Zak]
   - remove unused code  [Karel Zak]
   - remove unused function  [Karel Zak]
   - report also number of cache instances  [Karel Zak]
   - show the number of physical socket on aarch64 machine without ACPI PPTT  [Masayoshi Mizuma]
   - sort extra caches  [Karel Zak]
   - split output to sections  [Karel Zak]
   - support +list for -e, -p and -C  [Karel Zak]
   - support s390 cpuinfo processor-per-line format  [Karel Zak]
   - temporary commit  [Karel Zak]
   - update tests  [Karel Zak]
   - use SMBIOS tables on ARM for lscpu  [Jeffrey Bastian]
   - use cache ID, keep caches independent on CPU type  [Karel Zak]
   - use cluster on aarch64 machine which doesn't have ACPI PPTT  [Masayoshi Mizuma]
   - use constants from new API  [Karel Zak]
   - use new code to read CPUs info  [Karel Zak]
   - use size_t for counters  [Karel Zak]
   - use size_t for ncolumns  [Karel Zak]
lscpu-arm:
   - Add "BIOS Vendor ID" and "BIOS Model name" to show the SMBIOS information.  [Masayoshi Mizuma]
lscpu-dmi:
   - Move some functions related to DMI to lscpu-dmi  [Masayoshi Mizuma]
lscpu-virt:
   - fix return type of read_hypervisor_cpuid for non x86.  [Érico Rolim]
   - split hypervisor_from_dmi_table()  [Masayoshi Mizuma]
lsipc:
   - make default output byte sizes to be in human units  [Sami Kerola]
lsirq:
   - fix resources leak [coverity scan]  [Karel Zak]
lslogins:
   - call close() for usable FD [coverity scan]  [Karel Zak]
lsmem:
   - use ul_path_readf_string() readable for analysers [coverity scan]  [Karel Zak]
man:
   - add missing backslash to caret printing macro  [Sami Kerola]
   - make tilde and caret characters to render correctly  [Sami Kerola]
manpages:
   - fix "The example command" in AVAILABILITY section  [Chris Hofstaedtler]
meson:
   - add irq utils  [Karel Zak]
   - add missing HAVE_ definitions  [Karel Zak]
   - add second build system  [Zbigniew Jędrzejewski-Szmek]
   - generate man pages from asciidoc  [Karel Zak]
   - implement building of static programs  [Zbigniew Jędrzejewski-Szmek]
   - port localstatedir and sysconfdir  [Karel Zak]
   - update configuration  [Karel Zak]
   - update for new hardlink  [Karel Zak]
   - update sources and dependencies  [Karel Zak]
misc:
   - fix typos  [Samanta Navarro]
   - fix typos [codespell]  [Samanta Navarro]
mkfs.minix:
   - add --lock and LOCK_BLOCK_DEVICE  [Karel Zak]
mkswap:
   - add --verbose, reduce extents check output  [Karel Zak]
   - check for holes and unwanted extents in file  [Karel Zak]
   - cleanup usage()  [Karel Zak]
   - don't use deprecated security_context_t  [Karel Zak]
   - improve extents check  [Karel Zak]
   - remove deprecated SELinux matchpathcon()  [Karel Zak]
   - remove unnecessary on FS_IOC_FIEMAP  [Karel Zak]
   - tell how to fix insecure permissions and owner in warning  [Sami Kerola]
more:
   - fix ARROW_DOWN and PAGE_DOWN behaviour to not skip lines  [Hannes Müller]
   - fix command 'f' (screen forward) behaviour  [Hannes Müller]
   - improve error messaging when input file is directory  [Sami Kerola]
mount:
   - Add support for "nosymfollow" mount option.  [Mattias Nissler]
mount, umount:
   - restore environ[] after suid drop  [Karel Zak]
mountpoint:
   - different exit status for errors and non-mountpoint situation  [Karel Zak]
nologin:
   - use ul_copy_file  [Egor Chelak]
nsenter / switch_root:
   - fix insecure chroot [coverity scan]  [Sami Kerola]
pg:
   - fix wcstombs() use  [Karel Zak]
po:
   - add sr.po (from translationproject.org)  [Мирослав Николић]
   - merge changes  [Karel Zak]
   - update cs.po (from translationproject.org)  [Petr Písař]
   - update es.po (from translationproject.org)  [Antonio Ceballos Roa]
   - update hr.po (from translationproject.org)  [Božidar Putanec]
   - update sv.po (from translationproject.org)  [Sebastian Rasmussen]
prlimit:
   - fix optional arguments parsing  [Karel Zak]
pylibmount:
   - PyEval_Call* is deprecated, use PyObject_Call*  [Karel Zak]
read_all:
   - return 0 when EOF occurs after 0 bytes  [Egor Chelak]
readprofile:
   - fix static analyzer warning [coverity scan]  [Karel Zak]
rfkill:
   - add "toggle" command  [Karel Zak]
   - fix static analyzer warning [coverity scan]  [Karel Zak]
   - stop execution when rfkill device cannot be opened  [Sami Kerola]
script:
   - cleanup --echo  [Soumendra Ganguly]
   - don't use strings from user as printf-format [coverity scan]  [Karel Zak]
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
   - improve I/O return code checks  [Soumendra Ganguly]
   - kill child process on error  [Karel Zak]
scriptlive:
   - fix compiler warnings [-Wmaybe-uninitialized]  [Karel Zak]
scriptreplay:
   - enable special character handling  [Soumendra Ganguly]
setpriv:
   - allow using [-+]all for capabilities.  [Érico Rolim]
   - small clean-up.  [Érico Rolim]
sfdisk:
   - (docs) add more information about GPT attribute bits  [Karel Zak]
   - correct --json --dump false exclusive  [Dimitri John Ledkov]
   - disable bootbits protection on '--wipe always'  [Karel Zak]
   - do not free device name too soon [coverity scan]  [Sami Kerola]
   - fix backward --move-data  [Karel Zak]
   - fix resources leak [coverity scan]  [Karel Zak]
   - support for type="partition type name"  [Karel Zak]
su:
   - (pty) change owner and mode for pty  [Karel Zak]
   - explicitly enable echo for --pty  [Karel Zak]
   - fix man page typos  [Štěpán Němec]
   - remove useless assignment  [Karel Zak]
   - use full tty path for PAM_TTY  [Karel Zak]
switch_root:
   - check if mount point to move even exists  [Thomas Deutschmann]
   - fix double close [coverity scan]  [Karel Zak, Sami Kerola]
sys-utils:
   - mount.8  fix a typo  [Eric Biggers]
tests:
   - (blkid) add erofs image  [Karel Zak]
   - (blkid) add support for multisession images  [Karel Zak]
   - (fileutils) remove unused code  [Karel Zak]
   - (ul) remove another 'dim' input  [Karel Zak]
   - add a fuzz target calling fdisk_script_read_file  [Evgeny Vereshchagin]
   - add a fuzzer for mnt_table_parse_stream  [Evgeny Vereshchagin]
   - add a fuzzer for process_wtmp_file  [Evgeny Vereshchagin]
   - add checksum for cramfs/mkfs for LE 16384 (ia64)  [Anatoly Pugachev]
   - add sfdisk test for 4fe7f9b614e2b5bb97f6d89af02acb867cffccc1  [Karel Zak]
   - add testcases that triggered various crashes  [Evgeny Vereshchagin]
   - an attempt to get around https://github.com/karelzak/util-linux/issues/1110  [Evgeny Vereshchagin]
   - be explicit with file permissions for cramfs  [Karel Zak]
   - cover the code parsing comments  [Evgeny Vereshchagin]
   - don't rely on scsi_debug partitions  [Karel Zak]
   - dump more information about CFS and block devices  [Karel Zak]
   - improve u64 use in ipcs test  [Karel Zak]
   - integrate test_last_fuzz into the testsuite  [Evgeny Vereshchagin]
   - integrate test_mount_fuzz into the testsuite  [Evgeny Vereshchagin]
   - make it compatible with meson  [Karel Zak]
   - mark ul/basic as KNOWN_FAIL  [Karel Zak]
   - migrate from ext3 to ext2  [Karel Zak]
   - mkfs-endianness test use iflag=fullblock to fill block completely with string  [Masami Ichikawa]
   - mkfs-endianness test uses prepared test data  [Masami Ichikawa]
   - move misc/ul to ul/ directory  [Sami Kerola]
   - pack testcases into zip archives  [Evgeny Vereshchagin]
   - remove ul(1) 'dim' input  [Karel Zak]
   - set shmmni to 32k  [Karel Zak]
   - skip hwclock/systohc on GH Actions  [Karel Zak]
   - suggest "make check-programs"  [Karel Zak]
   - take exit codes into account  [Evgeny Vereshchagin]
   - update JSON outputs  [Karel Zak]
   - update atari blkid tests  [Karel Zak]
   - update atari partx tests  [Karel Zak]
   - update blkid output for iso/udf  [Karel Zak]
   - update build test results  [Karel Zak]
   - update fdisk dumps  [Karel Zak]
   - update hardlink tests  [Karel Zak]
   - update lscpu output  [Karel Zak]
   - update mountpoint return code check  [Karel Zak]
   - update mountpoint tests  [Karel Zak]
   - update script(1) return code  [Karel Zak]
   - update sfdisk wipe tests  [Karel Zak]
   - update swaplabel.err  [Karel Zak]
tests/run:
   - create failure directory  [Zbigniew Jędrzejewski-Szmek]
text-utils:
   - correctly detect ASan under clang  [Frantisek Sumsal]
tools:
   - add missing stuff to Makefile.am  [Karel Zak]
   - make it possible to set all the fuzzing flags with config-gen  [Evgeny Vereshchagin]
   - replace checkmans.sh with adoc scripts  [Karel Zak]
   - use libcryptsetup in config-gen.d/all.conf  [Karel Zak]
travis:
   - cleanup before autogen  [Karel Zak]
   - disable OSX for now  [Karel Zak]
   - remove old ubuntu  [Karel Zak]
   - set CXX correctly  [Evgeny Vereshchagin]
   - stop building fuzz targets on macOS  [Evgeny Vereshchagin]
   - try update to xcode10.1  [Karel Zak]
   - turn off libmount on OSX  [Evgeny Vereshchagin]
   - turn on --enable-fuzzing-engine  [Evgeny Vereshchagin]
   - use verbose mode (V=1) for make  [Karel Zak]
ttymsg:
   - fix resource leak [coverity scan]  [Karel Zak]
uclampset:
   - Add man page  [Qais Yousef]
   - Plumb in bash-completion  [Qais Yousef]
   - Plumb into the build system  [Qais Yousef]
   - cleanup --help output  [Karel Zak]
ul:
   - add a term capabilities tracking structure  [Sami Kerola]
   - add basic tests  [Sami Kerola]
   - fix use of unsigned number  [Karel Zak]
   - flip comparisons to lesser to greater order  [Sami Kerola]
   - free most allocations ncurses did during setupterm()  [Sami Kerola]
   - improve function and variable names  [Sami Kerola]
   - make set_column() zero check more obvious  [Sami Kerola]
   - remove function like putwp preprocessor define  [Sami Kerola]
   - remove function prototypes  [Sami Kerola]
   - rename enumerated mode symbols  [Sami Kerola]
   - replace global runtime variables with a control structure  [Sami Kerola]
   - small coding changes  [Karel Zak]
   - tidy up coding style  [Sami Kerola]
   - use size_t to measure memory allocation size  [Sami Kerola]
ul_copy_file:
   - add test program  [Egor Chelak]
   - handle EAGAIN and EINTR  [Egor Chelak]
   - make defines for return values  [Egor Chelak]
   - use BUFSIZ for buffer size  [Egor Chelak]
   - use all_read/all_write  [Egor Chelak]
   - use sendfile  [Egor Chelak]
umount:
   - ignore --no-canonicalize,-c for non-root users  [Karel Zak]
   - support over-mounts for --recursive  [Karel Zak]
unshare:
   - fix bad bit shift operation [coverity scan]  [Sami Kerola]
utmpdup:
   - Ensure flushing when using follow flag  [Andrew Shapiro]
uuidd:
   - add command-line option values struct  [Sami Kerola]
   - add uuidd specific data types that are used in protocol  [Sami Kerola]
   - document uuidd protocol  [Sami Kerola]
   - fix misleading indentation  [Sami Kerola]
   - make timeout to take effect when debug is not defined  [Sami Kerola]
   - move option parsing to separate function  [Sami Kerola]
   - override operation type when performing bulk request  [Sami Kerola]
   - remove unnecessary bulk request size limit  [Sami Kerola]
   - reorder bulk time and random generation code  [Sami Kerola]
   - use pid_t type when referring to process id  [Sami Kerola]
uuidgen:
   - give hint in usage() what uuid namespaces can be used  [Sami Kerola]
   - use errx() rather than fprintf() when printing errors  [Sami Kerola]
uuidparse:
   - use libuuid function to test nil uuid  [Sami Kerola]
   - use uuid type definitions from libuuid header  [Sami Kerola]
vipw:
   - fix short write handling in copyfile  [Egor Chelak]
   - move copyfile to the lib  [Egor Chelak]
whereis:
   - add --disable-whereis to configure  [Samanta Navarro]
   - add lib32 directories  [Samanta Navarro]
   - do not ignore trailing numbers  [Samanta Navarro]
   - do not strip suffixes  [Samanta Navarro]
   - extend test case  [Samanta Navarro]
   - filter bin, man and src differently  [Samanta Navarro]
   - fix out of boundary read  [Samanta Navarro]
   - support zst compressed man pages  [Samanta Navarro]
wipefs:
   - (man) add hint to erase on partitions and disk  [Karel Zak]
   - fix compiler warning  [Karel Zak]


-- 
 Karel Zak  <kzak@redhat.com>
 http://karelzak.blogspot.com



* Re: [PATCH v5 0/4] man2: udpate mm/userfaultfd manpages to latest
  @ 2021-04-05 11:50 11%   ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-04-05 11:50 UTC (permalink / raw)
  To: Alejandro Colomar (man-pages),
	Peter Xu, linux-mm, linux-kernel, linux-man
  Cc: mtk.manpages, Axel Rasmussen, Nadav Amit, Mike Rapoport,
	Andrea Arcangeli, Andrew Morton

Hi Alex,

> I applied all 4 patches (with a few minor fixes to 1/4 and 4/4 (cosmetic 
> fixes; some of them about the 80-col right margin)): 
> <https://github.com/alejandro-colomar/man-pages/tree/eb8f2001d493d458d08b9b87605ed2ac453c7f5f>

How big is your current queue of pending patches from others?

Thanks,

Michael


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/


* [PATCH v24 06/30] x86/cet: Add control-protection fault handler
  @ 2021-04-01 22:10  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-04-01 22:10 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works much like the general-protection
fault handler.  It delivers SIGSEGV with the si_code SEGV_CPERR to the
signal handler.
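
For readers on the userspace side, here is a minimal sketch (not part of
this patch) of how a process might recognize the new si_code in a SIGSEGV
handler.  The fallback value 10 for SEGV_CPERR is copied from the siginfo.h
hunk in this patch and is used only when the installed libc headers do not
define it.  segv_code_name(), on_segv() and install_segv_handler() are
hypothetical names chosen for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for sigaction() and siginfo_t */
#endif

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Assumption: value taken from the siginfo.h hunk of this patch. */
#ifndef SEGV_CPERR
#define SEGV_CPERR 10
#endif

/* Map a SIGSEGV si_code to a short description (hypothetical helper). */
static const char *segv_code_name(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control protection fault";
	default:
		return "other SIGSEGV cause";
	}
}

/* Report the cause using only async-signal-safe calls, then exit. */
static void on_segv(int sig, siginfo_t *info, void *ucontext)
{
	const char *name = segv_code_name(info->si_code);
	size_t len = 0;
	ssize_t ignored;

	(void)sig;
	(void)ucontext;

	while (name[len] != '\0')
		len++;

	/* write() and _exit() are async-signal-safe; stdio is not. */
	ignored = write(STDERR_FILENO, name, len);
	ignored = write(STDERR_FILENO, "\n", 1);
	(void)ignored;

	_exit(128 + SIGSEGV);
}

/* Install the handler; SA_SIGINFO is what makes si_code available. */
static int install_segv_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_segv_handler() early in main() should be enough to try it
out.  On a kernel or CPU without CET the SEGV_CPERR branch is simply never
taken, so the sketch cannot by itself demonstrate a control-protection
fault.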

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..fa98ca6a17a2 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_CET
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..e8166d9bbb10 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_CET
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..dd92490b1e7f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ac1874a2a70e..ee9c88e4e1bb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -606,6 +607,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_CET
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_CET))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index d2597000407a..1c2ea91284a0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -231,7 +231,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0



* [PATCH v6 02/10] mm/hugetlb: Add a macro to get HUGETLB page sizes for mmap
  @ 2021-03-30  8:08  4% ` Yanan Wang
  0 siblings, 0 replies; 200+ results
From: Yanan Wang @ 2021-03-30  8:08 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones, kvm, linux-kselftest, linux-kernel
  Cc: Ben Gardon, Sean Christopherson, Vitaly Kuznetsov, Peter Xu,
	Ingo Molnar, Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yuzenghui, Yanan Wang

We know that if a system supports multiple hugetlb page sizes,
the desired hugetlb page size can be specified in bits [26:31]
of the mmap() flag arguments. The value in these 6 bits will be
the shift of each hugetlb page size.

So add a macro to get the page size shift and then calculate the
corresponding hugetlb page size, using flag x.
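
To make the encoding concrete, here is a small standalone sketch (not part
of this patch) that round-trips a page-size shift through the flag bits.
MAP_HUGE_SHIFT (26) and MAP_HUGE_MASK (0x3f) are repeated locally so the
sketch builds without kernel headers.  huge_flag_for_shift() and
huge_page_size_from_flags() are hypothetical helpers.  Unlike the macro as
posted, this copy parenthesizes x, so that an argument such as (a | b) is
shifted as a whole.

```c
/* Local copies of the UAPI constants, so no kernel headers are needed. */
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_MASK
#define MAP_HUGE_MASK	0x3f
#endif

/* Same computation as the macro in this patch, with x parenthesized. */
#define MAP_HUGE_PAGE_SIZE(x) \
	(1ULL << (((x) >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

/* Hypothetical helper: place log2(page size) into bits [26:31]. */
static unsigned int huge_flag_for_shift(unsigned int log2_size)
{
	return (log2_size & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
}

/* Hypothetical helper: recover the page size in bytes from mmap() flags. */
static unsigned long long huge_page_size_from_flags(unsigned int flags)
{
	return MAP_HUGE_PAGE_SIZE(flags);
}
```

With these definitions, a shift of 21 decodes to 2097152 bytes (2 MiB) and
a shift of 30 to 1073741824 bytes (1 GiB).  Flags with no size encoded
decode to 1ULL << 0, that is 1, so a caller would presumably need to treat
that value as "use the default huge page size".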

Cc: Ben Gardon <bgardon@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
---
 include/uapi/linux/mman.h       | 2 ++
 tools/include/uapi/linux/mman.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
+
 #endif /* _UAPI_LINUX_MMAN_H */
-- 
2.19.1



* Re: [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap
  2021-03-23 14:03  0%   ` Andrew Jones
@ 2021-03-24  1:48  0%     ` wangyanan (Y)
  0 siblings, 0 replies; 200+ results
From: wangyanan (Y) @ 2021-03-24  1:48 UTC (permalink / raw)
  To: Andrew Jones
  Cc: Paolo Bonzini, kvm, linux-kselftest, linux-kernel, Ben Gardon,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Ingo Molnar,
	Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yuzenghui


On 2021/3/23 22:03, Andrew Jones wrote:
> $SUBJECT says "tools headers", but this is actually changing
> a UAPI header and then copying the change to tools.
Indeed. I think the subject prefix should be "mm/hugetlb".
I will fix it.

Thanks,
Yanan
> Thanks,
> drew
>
> On Tue, Mar 23, 2021 at 09:52:23PM +0800, Yanan Wang wrote:
>> We know that if a system supports multiple hugetlb page sizes,
>> the desired hugetlb page size can be specified in bits [26:31]
>> of the flag arguments. The value in these 6 bits will be the
>> shift of each hugetlb page size.
>>
>> So add a macro to get the page size shift and then calculate the
>> corresponding hugetlb page size, using flag x.
>>
>> Cc: Ben Gardon <bgardon@google.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Adrian Hunter <adrian.hunter@intel.com>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Suggested-by: Ben Gardon <bgardon@google.com>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> Reviewed-by: Ben Gardon <bgardon@google.com>
>> ---
>>   include/uapi/linux/mman.h       | 2 ++
>>   tools/include/uapi/linux/mman.h | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index f55bc680b5b0..d72df73b182d 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -41,4 +41,6 @@
>>   #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>>   #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>>   
>> +#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
>> +
>>   #endif /* _UAPI_LINUX_MMAN_H */
>> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
>> index f55bc680b5b0..d72df73b182d 100644
>> --- a/tools/include/uapi/linux/mman.h
>> +++ b/tools/include/uapi/linux/mman.h
>> @@ -41,4 +41,6 @@
>>   #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>>   #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>>   
>> +#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
>> +
>>   #endif /* _UAPI_LINUX_MMAN_H */
>> -- 
>> 2.19.1
>>
> .


* Re: [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap
  2021-03-23 13:52  4% ` [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap Yanan Wang
@ 2021-03-23 14:03  0%   ` Andrew Jones
  2021-03-24  1:48  0%     ` wangyanan (Y)
  0 siblings, 1 reply; 200+ results
From: Andrew Jones @ 2021-03-23 14:03 UTC (permalink / raw)
  To: Yanan Wang
  Cc: Paolo Bonzini, kvm, linux-kselftest, linux-kernel, Ben Gardon,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Ingo Molnar,
	Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yuzenghui


$SUBJECT says "tools headers", but this is actually changing
a UAPI header and then copying the change to tools.

Thanks,
drew

On Tue, Mar 23, 2021 at 09:52:23PM +0800, Yanan Wang wrote:
> We know that if a system supports multiple hugetlb page sizes,
> the desired hugetlb page size can be specified in bits [26:31]
> of the flag arguments. The value in these 6 bits will be the
> shift of each hugetlb page size.
> 
> So add a macro to get the page size shift and then calculate the
> corresponding hugetlb page size, using flag x.
> 
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> ---
>  include/uapi/linux/mman.h       | 2 ++
>  tools/include/uapi/linux/mman.h | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index f55bc680b5b0..d72df73b182d 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -41,4 +41,6 @@
>  #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>  
> +#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
> index f55bc680b5b0..d72df73b182d 100644
> --- a/tools/include/uapi/linux/mman.h
> +++ b/tools/include/uapi/linux/mman.h
> @@ -41,4 +41,6 @@
>  #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>  
> +#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> -- 
> 2.19.1
> 



* [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap
  @ 2021-03-23 13:52  4% ` Yanan Wang
  2021-03-23 14:03  0%   ` Andrew Jones
  0 siblings, 1 reply; 200+ results
From: Yanan Wang @ 2021-03-23 13:52 UTC (permalink / raw)
  To: Paolo Bonzini, Andrew Jones, kvm, linux-kselftest, linux-kernel
  Cc: Ben Gardon, Sean Christopherson, Vitaly Kuznetsov, Peter Xu,
	Ingo Molnar, Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yuzenghui, Yanan Wang

We know that if a system supports multiple hugetlb page sizes,
the desired hugetlb page size can be specified in bits [26:31]
of the flag arguments. The value in these 6 bits will be the
shift of each hugetlb page size.

So add a macro to get the page size shift and then calculate the
corresponding hugetlb page size, using flag x.

Cc: Ben Gardon <bgardon@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
---
 include/uapi/linux/mman.h       | 2 ++
 tools/include/uapi/linux/mman.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/include/uapi/linux/mman.h
+++ b/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
+
 #endif /* _UAPI_LINUX_MMAN_H */
diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
index f55bc680b5b0..d72df73b182d 100644
--- a/tools/include/uapi/linux/mman.h
+++ b/tools/include/uapi/linux/mman.h
@@ -41,4 +41,6 @@
 #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
 #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
 
+#define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
+
 #endif /* _UAPI_LINUX_MMAN_H */
-- 
2.19.1



* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-03-21 15:38 11%     ` Michael Kerrisk (man-pages)
@ 2021-03-22 21:31  5%       ` Stephen Kitt
  0 siblings, 0 replies; 200+ results
From: Stephen Kitt @ 2021-03-22 21:31 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 734 bytes --]

On Sun, 21 Mar 2021 16:38:59 +0100, "Michael Kerrisk (man-pages)"
<mtk.manpages@gmail.com> wrote:
> On 3/9/21 8:53 PM, Stephen Kitt wrote:
> > On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
> > <mtk.manpages@gmail.com> wrote:  
> >> Thanks for your patch revision. I've merged it, and have
> >> done some light editing, but I still have a question:  
> > 
> > Does this need anything more? I don’t see it in the man-pages repo.  
> 
> Sorry, Stephen. It's just me being slow. I've made a few edits,
> replaced the example program with another that more clearly allows
> the user to see what's going on, and pushed to Git.

Thanks, your example program is indeed much better!

Regards,

Stephen

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 5%]

* man-pages-5.11 released
@ 2021-03-22 10:45 12% Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-03-22 10:45 UTC (permalink / raw)
  To: lkml; +Cc: mtk.manpages, Alejandro Colomar

Gidday,

Alex Colomar and I are proud to announce:

    man-pages-5.11 - man pages for Linux

This release resulted from patches, bug reports, reviews, and
comments from around 40 contributors. The release includes
around 480 commits that changed 950 (about 90% of the) pages.
With a 50k diff, this is one of the largest man-pages releases
in quite a long time.

Tarball download:
    http://www.kernel.org/doc/man-pages/download.html
Git repository:
    https://git.kernel.org/cgit/docs/man-pages/man-pages.git/
Online changelog:
    http://man7.org/linux/man-pages/changelog.html#release_5.11

A short summary of the release is blogged at:
https://linux-man-pages.blogspot.com/2021/03/man-pages-511-is-released.html

The current version of the pages is browsable at:
http://man7.org/linux/man-pages/

A selection of changes in this release that may be of interest
to readers of LKML is shown below.

Cheers,

Michael

==================== Changes in man-pages-5.11 ====================

Released: 2021-03-21, Munich


New and rewritten pages
-----------------------

close_range.2
    Stephen Kitt, Michael Kerrisk  [Christian Brauner]
        New page documenting close_range(2)

process_madvise.2
    Suren Baghdasaryan, Minchan Kim  [Michal Hocko, Alejandro Colomar,
    Michael Kerrisk]
        Document process_madvise(2)

fileno.3
    Michael Kerrisk
        Split fileno(3) content out of ferror(3) into new page
            fileno(3) differs from the other functions in various ways.
            For example, it is governed by different standards,
            and can set 'errno'. Conversely, the other functions
            are about examining the status of a stream, while
            fileno(3) simply obtains the underlying file descriptor.
            Furthermore, splitting this function out allows
            for some cleaner upcoming changes in ferror(3).


Newly documented interfaces in existing pages
---------------------------------------------

epoll_wait.2
    Willem de Bruijn  [Dmitry V. Levin]
        Add documentation of epoll_pwait2()
            Expand the epoll_wait() page with epoll_pwait2(), an epoll_wait()
            variant that takes a struct timespec to enable nanosecond
            resolution timeout.

fanotify_init.2
fanotify.7
    Jan Kara  [Steve Grubb]
        Document FAN_AUDIT flag and FAN_ENABLE_AUDIT

madvise.2
    Michael Kerrisk
        Add descriptions of MADV_COLD and MADV_PAGEOUT
            Taken from process_madvise(2).

openat2.2
    Jens Axboe
        Add RESOLVE_CACHED

prctl.2
    Gabriel Krisman Bertazi
        Document Syscall User Dispatch

mallinfo.3
    Michael Kerrisk
        Document mallinfo2() and note that mallinfo() is deprecated
            Document the mallinfo2() function added in glibc 2.33.
        Update example program to use mallinfo2()

system_data_types.7
    Alejandro Colomar
        Add off64_t to system_data_types(7)

ld.so.8
    Michael Kerrisk
        Document the --argv0 option added in glibc 2.33


Global changes
--------------

Various pages
    Alejandro Colomar
        SYNOPSIS: Use 'restrict' in prototypes
            This change has been completed for *all* relevant pages
            (around 135 pages in total).

Various pages
    Alejandro Colomar  [Zack Weinberg]
        Remove unused <sys/types.h>
            The manual pages are already inconsistent in which headers need
            to be included.  Right now, not all of the types used by a
            function have their required header included in the SYNOPSIS.

            If we were to add the headers required by all of the types used by
            functions, the SYNOPSIS would grow too much.  Not only would it
            grow too much, but the information there would be less precise.

            Having system_data_types(7) document each type with all the
            information about required includes is much more precise, and the
            info is centralized so that it's much easier to maintain.

            So let's document only the include required for the function
            prototype, and also the ones required for the macros needed to
            call the function.

            <sys/types.h> only defines types, not functions or constants, so
            it doesn't belong to man[23] (function) pages at all.

            I don't know whether some old systems had headers that
            required you to include <sys/types.h> *before* them (incomplete
            headers), but if so, those implementations would be broken, and
            those headers should probably provide some kind of warning.  I
            hope this is not the case.

            [mtk: Already in 2001, POSIX.1 removed the requirement to
            include <sys/types.h> for many APIs, so this patch seems
            well past due.]

_exit.2
abort.3
err.3
exit.3
pthread_exit.3
setjmp.3
    Alejandro Colomar
        SYNOPSIS: Use 'noreturn' in prototypes
            Use standard C11 'noreturn' in these manual pages for
            functions that do not return.


Changes to individual pages
---------------------------

getcpu.2
    Michael Kerrisk  [Alejandro Colomar]
        Rewrite page to describe glibc wrapper function
            Since glibc 2.29, there is a wrapper for getcpu(2).
            The wrapper has only 2 arguments, omitting the unused
            third system call argument. Rework the manual page
            to reflect this.

kcmp.2
    Michael Kerrisk
        Since Linux 5.12, kcmp() availability is unconditional
            kcmp() is no longer dependent on CONFIG_CHECKPOINT_RESTORE.

mmap2.2
    Alejandro Colomar
        Fix prototype parameter types
            There are many slightly different prototypes for this syscall,
            but none of them is like the documented one.
            Of all the different prototypes,
            let's document the asm-generic one.

mount.2
    Michael Kerrisk
        Note that the 'data' argument can be NULL

syscall.2
    Peter H. Froehlich
        Update superh syscall convention

syscalls.2
    Michael Kerrisk
        Add epoll_pwait2()

netdevice.7
    Pali Rohár  [Alejandro Colomar]
        Update documentation for SIOCGIFADDR SIOCSIFADDR SIOCDIFADDR

netlink.7
    Pali Rohár  [Alejandro Colomar]
        Fix minimal Linux version for NETLINK_CAP_ACK
            NETLINK_CAP_ACK option was introduced in commit 0a6a3a23ea6e which first
            appeared in Linux version 4.3 and not 4.2.
    Pali Rohár  [Alejandro Colomar]
        Remove IPv4 from description
            rtnetlink is not only used for IPv4
    Philipp Schuster
        Clarify details of netlink error responses
            Make it clear that netlink error responses (i.e., messages with
            type NLMSG_ERROR (0x2)), can be longer than sizeof(struct
            nlmsgerr). In certain circumstances, the payload can be longer.

sock_diag.7
    Pali Rohár  [Alejandro Colomar]
        Fix recvmsg() usage in the example

tcp.7
    Enke Chen
        Documentation revision for TCP_USER_TIMEOUT

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 12%]

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
  2021-03-09 19:53  5%   ` Stephen Kitt
@ 2021-03-21 15:38 11%     ` Michael Kerrisk (man-pages)
  2021-03-22 21:31  5%       ` Stephen Kitt
  0 siblings, 1 reply; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-03-21 15:38 UTC (permalink / raw)
  To: Stephen Kitt
  Cc: mtk.manpages, linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

On 3/9/21 8:53 PM, Stephen Kitt wrote:
> Hi Michael,
> 
> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
> <mtk.manpages@gmail.com> wrote:
>> Thanks for your patch revision. I've merged it, and have
>> done some light editing, but I still have a question:
> 
> Does this need anything more? I don’t see it in the man-pages repo.

Sorry, Stephen. It's just me being slow. I've made a few edits,
replaced the example program with another that more clearly allows
the user to see what's going on, and pushed to Git.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
       [not found]         ` <20210129100024.m4bil5mz5prry4iq@wittgenstein>
@ 2021-03-21 15:31 11%       ` Michael Kerrisk (man-pages)
  0 siblings, 0 replies; 200+ results
From: Michael Kerrisk (man-pages) @ 2021-03-21 15:31 UTC (permalink / raw)
  To: Christian Brauner, Stephen Kitt
  Cc: mtk.manpages, linux-man, Alejandro Colomar, Giuseppe Scrivano,
	linux-kernel

Hello Stephen and Christian,

Late follow-up, I'm afraid...

On 1/29/21 11:00 AM, Christian Brauner wrote:
> On Thu, Jan 28, 2021 at 11:10:40PM +0100, Stephen Kitt wrote:
>> Hello Michael,
>>
>> On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
>> <mtk.manpages@gmail.com> wrote:
>>> Thanks for your patch revision. I've merged it, and have
>>> done some light editing, but I still have a question:
>>>
>>> On 1/23/21 5:11 PM, Stephen Kitt wrote:
>>>
>>> [...]
>>>
>>>> +.SH ERRORS  
>>>
>>>> +.TP
>>>> +.B EMFILE
>>>> +The per-process limit on the number of open file descriptors has been reached
>>>> +(see the description of
>>>> +.B RLIMIT_NOFILE
>>>> +in
>>>> +.BR getrlimit (2)).  
>>>
>>> I think there was already a question about this error, but
>>> I still have a doubt.
>>>
>>> A glance at the code tells me that indeed EMFILE can occur.
>>> But how can the reason be because the limit on the number
>>> of open file descriptors has been reached? I mean: no new
>>> FDs are being opened, so how can we go over the limit. I think
>>> the cause of this error is something else, but what is it?
>>
>> Here’s how I understand the code that can lead to EMFILE:
>>
>> * in __close_range(), if CLOSE_RANGE_UNSHARE is set, call unshare_fd() with
>>   CLONE_FILES to clone the fd table
>> * unshare_fd() calls dup_fd()
>> * dup_fd() allocates a new fdtable, and if the resulting fdtable ends up
>>   being too small to hold the number of fds calculated by
>>   sane_fdtable_size(), fails with EMFILE
>>
>> I suspect that, given that we’re starting with a valid fdtable, the only way
>> this can happen is if there’s a race with sysctl_nr_open being reduced.
> 
> Yes, and sysctls are racy by nature.

Got it, I think. I changed the error text here to:

       EMFILE The number of open file descriptors exceeds the limit spec‐
              ified in /proc/sys/fs/nr_open (see  proc(5)).   This  error
              can occur in situations where that limit was lowered before
              a call to close_range() where the CLOSE_RANGE_UNSHARE  flag
              is specified.

Thanks,

Michael

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

^ permalink raw reply	[relevance 11%]
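
The EMFILE case discussed above could be handled along these lines. This is a sketch only, under stated assumptions: it calls the system call through syscall(2) rather than a libc wrapper, falls back to ENOSYS when the headers lack SYS_close_range, and defines CLOSE_RANGE_UNSHARE locally with the value used in <linux/close_range.h>. The `ex_` function names and the hint strings are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef CLOSE_RANGE_UNSHARE
#define CLOSE_RANGE_UNSHARE (1U << 1)	/* assumed value from <linux/close_range.h> */
#endif

/*
 * Unshare the file descriptor table, then close every descriptor >= first.
 * Returns 0 on success, or -1 with errno set.
 */
int ex_close_from(unsigned int first)
{
#ifdef SYS_close_range
	return (int)syscall(SYS_close_range, first, UINT_MAX,
			    CLOSE_RANGE_UNSHARE);
#else
	(void)first;
	errno = ENOSYS;		/* headers too old to know the syscall */
	return -1;
#endif
}

/* Map an errno value from ex_close_from() to a short diagnostic hint. */
const char *ex_close_range_hint(int err)
{
	switch (err) {
	case EMFILE:
		/* Per the thread: only expected if the limit was lowered. */
		return "open fds exceed /proc/sys/fs/nr_open (limit lowered?)";
	case ENOMEM:
		return "out of memory while unsharing the fd table";
	case ENOSYS:
		return "close_range() not available on this system";
	case EINVAL:
		return "invalid flags, or first > last";
	default:
		return "unexpected error";
	}
}
```

A caller would typically invoke ex_close_from(3) before exec and, on failure, log ex_close_range_hint(errno). Because EMFILE here depends on a race with the nr_open sysctl, retrying without CLOSE_RANGE_UNSHARE may be a reasonable fallback in a single-threaded process.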

* [PATCH v23 06/28] x86/cet: Add control-protection fault handler
  @ 2021-03-16 15:10  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-03-16 15:10 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..fa98ca6a17a2 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_CET
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..e8166d9bbb10 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_CET
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..dd92490b1e7f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index ac1874a2a70e..ee9c88e4e1bb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -606,6 +607,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_CET
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_CET))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index d2597000407a..1c2ea91284a0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -231,7 +231,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]
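
From user space, the new si_code could be distinguished from other SIGSEGV causes roughly as follows. This is an illustrative sketch, not part of the series: SEGV_CPERR is defined locally with the value 10 proposed in the patch in case the installed headers lack it, and the `ex_` names are hypothetical. Note that the handler in this patch passes a null address to force_sig_fault(), so si_addr carries no fault address for this code.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

#ifndef SEGV_CPERR
#define SEGV_CPERR 10	/* value proposed in this patch */
#endif

/* Translate a SIGSEGV si_code into a short description. */
const char *ex_segv_reason(int si_code)
{
	switch (si_code) {
	case SEGV_MAPERR:
		return "address not mapped";
	case SEGV_ACCERR:
		return "invalid permissions";
	case SEGV_CPERR:
		return "control-protection fault";
	default:
		return "other";
	}
}

/* SA_SIGINFO handler: report the reason using only async-signal-safe calls. */
void ex_on_segv(int sig, siginfo_t *info, void *ctx)
{
	const char *msg = ex_segv_reason(info->si_code);
	size_t len = 0;

	(void)sig;
	(void)ctx;
	while (msg[len] != '\0')
		len++;
	if (write(STDERR_FILENO, msg, len) < 0) {
		/* nothing useful to do on failure; fall through to exit */
	}
	_exit(1);
}

/* Install the handler; returns 0 on success, -1 on error. */
int ex_install_segv_handler(void)
{
	struct sigaction sa;

	sa.sa_flags = SA_SIGINFO;
	sa.sa_sigaction = ex_on_segv;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

The handler exits rather than returning, since returning from a SIGSEGV handler re-executes the faulting instruction.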

* Re: [RFC PATCH v4 2/9] tools headers: Add a macro to get HUGETLB page sizes for mmap
  2021-03-12 11:14  0%   ` Andrew Jones
@ 2021-03-15  2:06  0%     ` wangyanan (Y)
  0 siblings, 0 replies; 200+ results
From: wangyanan (Y) @ 2021-03-15  2:06 UTC (permalink / raw)
  To: Andrew Jones
  Cc: kvm, linux-kselftest, linux-kernel, Paolo Bonzini, Ben Gardon,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Marc Zyngier,
	Ingo Molnar, Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yezengruan, yuzenghui


On 2021/3/12 19:14, Andrew Jones wrote:
> On Tue, Mar 02, 2021 at 08:57:44PM +0800, Yanan Wang wrote:
>> We know that if a system supports multiple hugetlb page sizes,
>> the desired hugetlb page size can be specified in bits [26:31]
>> of the flag arguments. The value in these 6 bits will be the
>> shift of each hugetlb page size.
>>
>> So add a macro to get the page size shift and then calculate the
>> corresponding hugetlb page size, using flag x.
>>
>> Cc: Ben Gardon <bgardon@google.com>
>> Cc: Ingo Molnar <mingo@kernel.org>
>> Cc: Adrian Hunter <adrian.hunter@intel.com>
>> Cc: Jiri Olsa <jolsa@redhat.com>
>> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Suggested-by: Ben Gardon <bgardon@google.com>
>> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
>> Reviewed-by: Ben Gardon <bgardon@google.com>
>> ---
>>   include/uapi/linux/mman.h       | 2 ++
>>   tools/include/uapi/linux/mman.h | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
>> index f55bc680b5b0..8bd41128a0ee 100644
>> --- a/include/uapi/linux/mman.h
>> +++ b/include/uapi/linux/mman.h
>> @@ -41,4 +41,6 @@
>>   #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>>   #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>>   
>> +#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
> Needs to be '1ULL' to avoid shift overflow when given MAP_HUGE_16GB.
Thanks, drew. Will fix it.
> Thanks,
> drew
>
>> +
>>   #endif /* _UAPI_LINUX_MMAN_H */
>> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
>> index f55bc680b5b0..8bd41128a0ee 100644
>> --- a/tools/include/uapi/linux/mman.h
>> +++ b/tools/include/uapi/linux/mman.h
>> @@ -41,4 +41,6 @@
>>   #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>>   #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>>   
>> +#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
>> +
>>   #endif /* _UAPI_LINUX_MMAN_H */
>> -- 
>> 2.23.0
>>
> .

^ permalink raw reply	[relevance 0%]

* Re: [RFC PATCH v4 2/9] tools headers: Add a macro to get HUGETLB page sizes for mmap
  @ 2021-03-12 11:14  0%   ` Andrew Jones
  2021-03-15  2:06  0%     ` wangyanan (Y)
  0 siblings, 1 reply; 200+ results
From: Andrew Jones @ 2021-03-12 11:14 UTC (permalink / raw)
  To: Yanan Wang
  Cc: kvm, linux-kselftest, linux-kernel, Paolo Bonzini, Ben Gardon,
	Sean Christopherson, Vitaly Kuznetsov, Peter Xu, Marc Zyngier,
	Ingo Molnar, Adrian Hunter, Jiri Olsa, Arnaldo Carvalho de Melo,
	Arnd Bergmann, Michael Kerrisk, Thomas Gleixner, wanghaibin.wang,
	yezengruan, yuzenghui

On Tue, Mar 02, 2021 at 08:57:44PM +0800, Yanan Wang wrote:
> We know that if a system supports multiple hugetlb page sizes,
> the desired hugetlb page size can be specified in bits [26:31]
> of the flag arguments. The value in these 6 bits will be the
> shift of each hugetlb page size.
> 
> So add a macro to get the page size shift and then calculate the
> corresponding hugetlb page size, using flag x.
> 
> Cc: Ben Gardon <bgardon@google.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Adrian Hunter <adrian.hunter@intel.com>
> Cc: Jiri Olsa <jolsa@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Suggested-by: Ben Gardon <bgardon@google.com>
> Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
> Reviewed-by: Ben Gardon <bgardon@google.com>
> ---
>  include/uapi/linux/mman.h       | 2 ++
>  tools/include/uapi/linux/mman.h | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/include/uapi/linux/mman.h b/include/uapi/linux/mman.h
> index f55bc680b5b0..8bd41128a0ee 100644
> --- a/include/uapi/linux/mman.h
> +++ b/include/uapi/linux/mman.h
> @@ -41,4 +41,6 @@
>  #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>  
> +#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))

Needs to be '1ULL' to avoid shift overflow when given MAP_HUGE_16GB.

Thanks,
drew

> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> diff --git a/tools/include/uapi/linux/mman.h b/tools/include/uapi/linux/mman.h
> index f55bc680b5b0..8bd41128a0ee 100644
> --- a/tools/include/uapi/linux/mman.h
> +++ b/tools/include/uapi/linux/mman.h
> @@ -41,4 +41,6 @@
>  #define MAP_HUGE_2GB	HUGETLB_FLAG_ENCODE_2GB
>  #define MAP_HUGE_16GB	HUGETLB_FLAG_ENCODE_16GB
>  
> +#define MAP_HUGE_PAGE_SIZE(x) (1 << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
> +
>  #endif /* _UAPI_LINUX_MMAN_H */
> -- 
> 2.23.0
> 


^ permalink raw reply	[relevance 0%]

* [PATCH v22 06/28] x86/cet: Add control-protection fault handler
  @ 2021-03-10 22:00  3% ` Yu-cheng Yu
  0 siblings, 0 replies; 200+ results
From: Yu-cheng Yu @ 2021-03-10 22:00 UTC (permalink / raw)
  To: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, linux-kernel,
	linux-doc, linux-mm, linux-arch, linux-api, Arnd Bergmann,
	Andy Lutomirski, Balbir Singh, Borislav Petkov, Cyrill Gorcunov,
	Dave Hansen, Eugene Syromiatnikov, Florian Weimer, H.J. Lu,
	Jann Horn, Jonathan Corbet, Kees Cook, Mike Kravetz, Nadav Amit,
	Oleg Nesterov, Pavel Machek, Peter Zijlstra, Randy Dunlap,
	Ravi V. Shankar, Vedvyas Shanbhogue, Dave Martin, Weijiang Yang,
	Pengfei Xu, Haitao Huang
  Cc: Yu-cheng Yu, Michael Kerrisk

A control-protection fault is triggered when a control-flow transfer
attempt violates Shadow Stack or Indirect Branch Tracking constraints.
For example, the return address for a RET instruction differs from the copy
on the shadow stack; or an indirect JMP instruction, without the NOTRACK
prefix, arrives at a non-ENDBR opcode.

The control-protection fault handler works in a similar way to the general
protection fault handler.  It provides the si_code SEGV_CPERR to the signal
handler.

Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
---
 arch/x86/include/asm/idtentry.h    |  4 ++
 arch/x86/kernel/idt.c              |  4 ++
 arch/x86/kernel/signal_compat.c    |  2 +-
 arch/x86/kernel/traps.c            | 63 ++++++++++++++++++++++++++++++
 include/uapi/asm-generic/siginfo.h |  3 +-
 5 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5eb3bdf36a41..fa98ca6a17a2 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -571,6 +571,10 @@ DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_SS,	exc_stack_segment);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_GP,	exc_general_protection);
 DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_AC,	exc_alignment_check);
 
+#ifdef CONFIG_X86_CET
+DECLARE_IDTENTRY_ERRORCODE(X86_TRAP_CP, exc_control_protection);
+#endif
+
 /* Raw exception entries which need extra work */
 DECLARE_IDTENTRY_RAW(X86_TRAP_UD,		exc_invalid_op);
 DECLARE_IDTENTRY_RAW(X86_TRAP_BP,		exc_int3);
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index ee1a283f8e96..e8166d9bbb10 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -105,6 +105,10 @@ static const __initconst struct idt_data def_idts[] = {
 #elif defined(CONFIG_X86_32)
 	SYSG(IA32_SYSCALL_VECTOR,	entry_INT80_32),
 #endif
+
+#ifdef CONFIG_X86_CET
+	INTG(X86_TRAP_CP,		asm_exc_control_protection),
+#endif
 };
 
 /*
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..dd92490b1e7f 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -27,7 +27,7 @@ static inline void signal_compat_build_tests(void)
 	 */
 	BUILD_BUG_ON(NSIGILL  != 11);
 	BUILD_BUG_ON(NSIGFPE  != 15);
-	BUILD_BUG_ON(NSIGSEGV != 9);
+	BUILD_BUG_ON(NSIGSEGV != 10);
 	BUILD_BUG_ON(NSIGBUS  != 5);
 	BUILD_BUG_ON(NSIGTRAP != 5);
 	BUILD_BUG_ON(NSIGCHLD != 6);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 7f5aec758f0e..83c641459ec6 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -39,6 +39,7 @@
 #include <linux/io.h>
 #include <linux/hardirq.h>
 #include <linux/atomic.h>
+#include <linux/nospec.h>
 
 #include <asm/stacktrace.h>
 #include <asm/processor.h>
@@ -606,6 +607,68 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
 	cond_local_irq_disable(regs);
 }
 
+#ifdef CONFIG_X86_CET
+static const char * const control_protection_err[] = {
+	"unknown",
+	"near-ret",
+	"far-ret/iret",
+	"endbranch",
+	"rstorssp",
+	"setssbsy",
+	"unknown",
+};
+
+static DEFINE_RATELIMIT_STATE(cpf_rate, DEFAULT_RATELIMIT_INTERVAL,
+			      DEFAULT_RATELIMIT_BURST);
+
+/*
+ * When a control protection exception occurs, send a signal to the responsible
+ * application.  Currently, control protection is only enabled for user mode.
+ * This exception should not come from kernel mode.
+ */
+DEFINE_IDTENTRY_ERRORCODE(exc_control_protection)
+{
+	struct task_struct *tsk;
+
+	if (!user_mode(regs)) {
+		pr_emerg("PANIC: unexpected kernel control protection fault\n");
+		die("kernel control protection fault", regs, error_code);
+		panic("Machine halted.");
+	}
+
+	cond_local_irq_enable(regs);
+
+	if (!boot_cpu_has(X86_FEATURE_CET))
+		WARN_ONCE(1, "Control protection fault with CET support disabled\n");
+
+	tsk = current;
+	tsk->thread.error_code = error_code;
+	tsk->thread.trap_nr = X86_TRAP_CP;
+
+	/*
+	 * Ratelimit to prevent log spamming.
+	 */
+	if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV) &&
+	    __ratelimit(&cpf_rate)) {
+		unsigned long ssp;
+		int cpf_type;
+
+		cpf_type = array_index_nospec(error_code, ARRAY_SIZE(control_protection_err));
+
+		rdmsrl(MSR_IA32_PL3_SSP, ssp);
+		pr_emerg("%s[%d] control protection ip:%lx sp:%lx ssp:%lx error:%lx(%s)",
+			 tsk->comm, task_pid_nr(tsk),
+			 regs->ip, regs->sp, ssp, error_code,
+			 control_protection_err[cpf_type]);
+		print_vma_addr(KERN_CONT " in ", regs->ip);
+		pr_cont("\n");
+	}
+
+	force_sig_fault(SIGSEGV, SEGV_CPERR, (void __user *)0);
+	cond_local_irq_disable(regs);
+}
+#endif
+
 static bool do_int3(struct pt_regs *regs)
 {
 	int res;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index d2597000407a..1c2ea91284a0 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -231,7 +231,8 @@ typedef struct siginfo {
 #define SEGV_ADIPERR	7	/* Precise MCD exception */
 #define SEGV_MTEAERR	8	/* Asynchronous ARM MTE error */
 #define SEGV_MTESERR	9	/* Synchronous ARM MTE exception */
-#define NSIGSEGV	9
+#define SEGV_CPERR	10	/* Control protection fault */
+#define NSIGSEGV	10
 
 /*
  * SIGBUS si_codes
-- 
2.21.0


^ permalink raw reply related	[relevance 3%]

* Re: [PATCH v6] close_range.2: new page documenting close_range(2)
    @ 2021-03-09 19:53  5%   ` Stephen Kitt
  2021-03-21 15:38 11%     ` Michael Kerrisk (man-pages)
  1 sibling, 1 reply; 200+ results
From: Stephen Kitt @ 2021-03-09 19:53 UTC (permalink / raw)
  To: Michael Kerrisk (man-pages)
  Cc: linux-man, Alejandro Colomar, Christian Brauner,
	Giuseppe Scrivano, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 328 bytes --]

Hi Michael,

On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)"
<mtk.manpages@gmail.com> wrote:
> Thanks for your patch revision. I've merged it, and have
> done some light editing, but I still have a question:

Does this need anything more? I don’t see it in the man-pages repo.

Regards,

Stephen

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[relevance 5%]

* Re: [PATCH 2/2] sigaction.2: wfix - Clarify si_addr description.
  2021-03-08 21:30  0%   ` Borislav Petkov
@ 2021-03-08 21:46  0%     ` Yu, Yu-cheng
  0 siblings, 0 replies; 200+ results
From: Yu, Yu-cheng @ 2021-03-08 21:46 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-man, Alejandro Colomar, Michael Kerrisk, Andy Lutomirski,
	Dave Hansen, Florian Weimer, H.J. Lu, linux-kernel, linux-api

On 3/8/2021 1:30 PM, Borislav Petkov wrote:
> On Fri, Feb 26, 2021 at 09:26:34AM -0800, Yu-cheng Yu wrote:
>> SIGSEGV fills si_addr only for memory access faults.  Add a note to clarify.
>>
>> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
>> Cc: Alejandro Colomar <alx.manpages@gmail.com>
>> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Borislav Petkov <bp@alien8.de>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Florian Weimer <fweimer@redhat.com>
>> Cc: "H.J. Lu" <hjl.tools@gmail.com>
>> Cc: linux-kernel@vger.kernel.org
>> Cc: linux-api@vger.kernel.org
>> Link: https://lore.kernel.org/linux-api/20210217222730.15819-7-yu-cheng.yu@intel.com/
>> ---
>>   man2/sigaction.2 | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/man2/sigaction.2 b/man2/sigaction.2
>> index 49a30f11e..bea884a23 100644
>> --- a/man2/sigaction.2
>> +++ b/man2/sigaction.2
>> @@ -467,7 +467,7 @@ and
>>   .BR SIGTRAP
>>   fill in
>>   .I si_addr
>> -with the address of the fault.
>> +with the address of the fault (see notes).
>>   On some architectures,
>>   these signals also fill in the
>>   .I si_trapno
>> @@ -955,6 +955,11 @@ It is not possible to block
>>   .IR sa_mask ).
>>   Attempts to do so are silently ignored.
>>   .PP
>> +In a
>> +.B SIGSEGV,
>> +if the fault is a memory access fault, si_addr is filled with the address
>> +causing the fault, otherwise it is not filled.
> 
> "... otherwise it is uninitialized." or "zeroed" or whatever...
> 
> And I'm having trouble figuring out why do you need to clarify this?
> 
> Because of this sentence:
> 
>         * SIGILL,  SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with the address
>           of the fault.  On some architectures, these signals also fill in the si_trapno
>           field.
> 
> ?

I think the sentence above is vague, but probably for the reason that 
each arch is different.  Maybe this patch is unnecessary and can be dropped?

> 
> If so, did you audit all architectures whether si_addr is populated only
> on memory access faults or is this something POSIX dictates or what's
> up? Because the sigaction(2) manpage is arch-agnostic and this is a
> rather strong assertion.
> 
> What am I missing?
> 
> Thx.
>

^ permalink raw reply	[relevance 0%]

* Re: [PATCH 2/2] sigaction.2: wfix - Clarify si_addr description.
  @ 2021-03-08 21:30  0%   ` Borislav Petkov
  2021-03-08 21:46  0%     ` Yu, Yu-cheng
  0 siblings, 1 reply; 200+ results
From: Borislav Petkov @ 2021-03-08 21:30 UTC (permalink / raw)
  To: Yu-cheng Yu
  Cc: linux-man, Alejandro Colomar, Michael Kerrisk, Andy Lutomirski,
	Dave Hansen, Florian Weimer, H.J. Lu, linux-kernel, linux-api

On Fri, Feb 26, 2021 at 09:26:34AM -0800, Yu-cheng Yu wrote:
> SIGSEGV fills si_addr only for memory access faults.  Add a note to clarify.
> 
> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
> Cc: Alejandro Colomar <alx.manpages@gmail.com>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: "H.J. Lu" <hjl.tools@gmail.com>
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-api@vger.kernel.org
> Link: https://lore.kernel.org/linux-api/20210217222730.15819-7-yu-cheng.yu@intel.com/
> ---
>  man2/sigaction.2 | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/man2/sigaction.2 b/man2/sigaction.2
> index 49a30f11e..bea884a23 100644
> --- a/man2/sigaction.2
> +++ b/man2/sigaction.2
> @@ -467,7 +467,7 @@ and
>  .BR SIGTRAP
>  fill in
>  .I si_addr
> -with the address of the fault.
> +with the address of the fault (see notes).
>  On some architectures,
>  these signals also fill in the
>  .I si_trapno
> @@ -955,6 +955,11 @@ It is not possible to block
>  .IR sa_mask ).
>  Attempts to do so are silently ignored.
>  .PP
> +In a
> +.B SIGSEGV,
> +if the fault is a memory access fault, si_addr is filled with the address
> +causing the fault, otherwise it is not filled.

"... otherwise it is uninitialized." or "zeroed" or whatever...

And I'm having trouble figuring out why do you need to clarify this?

Because of this sentence:

       * SIGILL,  SIGFPE, SIGSEGV, SIGBUS, and SIGTRAP fill in si_addr with the address
         of the fault.  On some architectures, these signals also fill in the si_trapno
         field.

?

If so, did you audit all architectures whether si_addr is populated only
on memory access faults or is this something POSIX dictates or what's
up? Because the sigaction(2) manpage is arch-agnostic and this is a
rather strong assertion.

What am I missing?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[relevance 0%]
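
For readers who want to poke at the si_addr question above, here is a
minimal sketch in C. It installs a SIGSEGV handler with SA_SIGINFO, touches
a PROT_NONE page, and checks that si_addr matches the address that was
accessed. The names si_addr_matches_fault() and segv_handler() are made up
for this illustration. Treating si_addr as meaningful only for SEGV_MAPERR
and SEGV_ACCERR is an assumption taken from the proposed wording, not
something the thread has confirmed for every architecture.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static void *volatile fault_addr;
static volatile sig_atomic_t fault_code;

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	fault_code = info->si_code;
	/*
	 * Assumption (not settled in the thread): si_addr is only
	 * consulted for the memory-access si_code values.
	 */
	if (info->si_code == SEGV_MAPERR || info->si_code == SEGV_ACCERR)
		fault_addr = info->si_addr;
	else
		fault_addr = NULL;

	siglongjmp(fault_jmp, 1);
}

/*
 * Returns 1 if si_addr equals the address that was touched,
 * 0 if it does not, and -1 if the test could not be set up.
 */
int si_addr_matches_fault(void)
{
	struct sigaction sa, old;
	long page = sysconf(_SC_PAGESIZE);
	char *mem;
	int ret = -1;

	if (page <= 0)
		return -1;

	/* A page with no access rights: any write to it faults. */
	mem = mmap(NULL, page, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return -1;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	if (sigaction(SIGSEGV, &sa, &old) == 0) {
		fault_addr = NULL;
		/* savemask=1 so SIGSEGV is unblocked again after the jump */
		if (sigsetjmp(fault_jmp, 1) == 0)
			*(volatile char *)mem = 1;
		ret = (fault_addr == (void *)mem);
		sigaction(SIGSEGV, &old, NULL);
	}

	munmap(mem, page);
	return ret;
}
```

On Linux a write to a PROT_NONE page is reported with SEGV_ACCERR, so the
function is expected to return 1 there. It says nothing about SIGSEGVs
raised for other reasons, which is exactly the case the review asks about.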

* Re: [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12
  2021-03-04  9:38  3% ` [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12 Alejandro Colomar
@ 2021-03-04 17:13  0%   ` Darrick J. Wong
  0 siblings, 0 replies; 200+ results
From: Darrick J. Wong @ 2021-03-04 17:13 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Amir Goldstein, Michael Kerrisk, Luis Henriques,
	Steve French, Greg KH, Anna Schumaker, Jeff Layton,
	Miklos Szeredi, Trond Myklebust, Alexander Viro, Darrick J. Wong,
	Dave Chinner, Nicolas Boichat, Ian Lance Taylor, Luis Lozano,
	Andreas Dilger, Olga Kornievskaia, Christoph Hellwig, ceph-devel,
	linux-kernel, CIFS, samba-technical, linux-fsdevel,
	Linux NFS Mailing List, Walter Harms

On Thu, Mar 04, 2021 at 10:38:07AM +0100, Alejandro Colomar wrote:
> Linux 5.12 fixes a regression.
> 
> Cross-filesystem (introduced in 5.3) copies were buggy.
> 
> Move the statements documenting cross-fs to BUGS.
> Kernels 5.3..5.11 should be patched soon.
> 
> State version information for some errors related to this.
> 
> Reported-by: Luis Henriques <lhenriques@suse.de>
> Reported-by: Amir Goldstein <amir73il@gmail.com>
> Related: <https://lwn.net/Articles/846403/>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Michael Kerrisk <mtk.manpages@gmail.com>
> Cc: Anna Schumaker <anna.schumaker@netapp.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Steve French <sfrench@samba.org>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Nicolas Boichat <drinkcat@chromium.org>
> Cc: Ian Lance Taylor <iant@google.com>
> Cc: Luis Lozano <llozano@chromium.org>
> Cc: Andreas Dilger <adilger@dilger.ca>
> Cc: Olga Kornievskaia <aglo@umich.edu>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: ceph-devel <ceph-devel@vger.kernel.org>
> Cc: linux-kernel <linux-kernel@vger.kernel.org>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Cc: samba-technical <samba-technical@lists.samba.org>
> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
> Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
> Cc: Walter Harms <wharms@bfs.de>
> Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
> ---
> 
> v3:
>         - Don't remove some important text.
>         - Reword BUGS.
> v4:
> 	- Reword.
> 	- Link to BUGS.
> 
> Thanks, Amir, for all the help and better wordings.
> 
> Cheers,
> 
> Alex
> 
> ---
>  man2/copy_file_range.2 | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 611a39b80..f58bfea8f 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -169,6 +169,9 @@ Out of memory.
>  .B ENOSPC
>  There is not enough space on the target filesystem to complete the copy.
>  .TP
> +.BR EOPNOTSUPP " (since Linux 5.12)"
> +The filesystem does not support this operation.
> +.TP
>  .B EOVERFLOW
>  The requested source or destination range is too large to represent in the
>  specified data types.
> @@ -184,10 +187,17 @@ or
>  .I fd_out
>  refers to an active swap file.
>  .TP
> -.B EXDEV
> +.BR EXDEV " (before Linux 5.3)"
> +The files referred to by
> +.IR fd_in " and " fd_out
> +are not on the same filesystem.
> +.TP
> +.BR EXDEV " (since Linux 5.12)"
>  The files referred to by
>  .IR fd_in " and " fd_out
> -are not on the same mounted filesystem (pre Linux 5.3).
> +are not on the same filesystem,
> +and the source and target filesystems are not of the same type,
> +or do not support cross-filesystem copy.
>  .SH VERSIONS
>  The
>  .BR copy_file_range ()
> @@ -200,8 +210,11 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
>  are much more strictly checked than on earlier kernels.
>  Applications should target the behaviour and requirements of 5.3 kernels.
>  .PP
> -First support for cross-filesystem copies was introduced in Linux 5.3.
> -Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> +Since Linux 5.12,
> +cross-filesystem copies can be achieved
> +when both filesystems are of the same type,
> +and that filesystem implements support for it.
> +See BUGS for behavior prior to 5.12.
>  .SH CONFORMING TO
>  The
>  .BR copy_file_range ()
> @@ -226,6 +239,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
>  such as the use of reflinks (i.e., two or more inodes that share
>  pointers to the same copy-on-write disk blocks)
>  or server-side-copy (in the case of NFS).
> +.SH BUGS
> +In Linux kernels 5.3 to 5.11,
> +cross-filesystem copies were implemented by the kernel,
> +if the operation was not supported by individual filesystems.
> +However, on some virtual filesystems,
> +the call failed to copy, while still reporting success.

...success, or merely a short copy?

(The rest looks reasonable (at least by c_f_r standards) to me.)

--D

>  .SH EXAMPLES
>  .EX
>  #define _GNU_SOURCE
> -- 
> 2.30.1.721.g45526154a5
> 

^ permalink raw reply	[relevance 0%]

* [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12
  @ 2021-03-04  9:38  3% ` Alejandro Colomar
  2021-03-04 17:13  0%   ` Darrick J. Wong
  0 siblings, 1 reply; 200+ results
From: Alejandro Colomar @ 2021-03-04  9:38 UTC (permalink / raw)
  To: linux-man, Amir Goldstein, Michael Kerrisk, Luis Henriques, Steve French
  Cc: Alejandro Colomar, Greg KH, Anna Schumaker, Jeff Layton,
	Miklos Szeredi, Trond Myklebust, Alexander Viro, Darrick J. Wong,
	Dave Chinner, Nicolas Boichat, Ian Lance Taylor, Luis Lozano,
	Andreas Dilger, Olga Kornievskaia, Christoph Hellwig, ceph-devel,
	linux-kernel, CIFS, samba-technical, linux-fsdevel,
	Linux NFS Mailing List, Walter Harms

Linux 5.12 fixes a regression.

Cross-filesystem (introduced in 5.3) copies were buggy.

Move the statements documenting cross-fs to BUGS.
Kernels 5.3..5.11 should be patched soon.

State version information for some errors related to this.

Reported-by: Luis Henriques <lhenriques@suse.de>
Reported-by: Amir Goldstein <amir73il@gmail.com>
Related: <https://lwn.net/Articles/846403/>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Anna Schumaker <anna.schumaker@netapp.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Nicolas Boichat <drinkcat@chromium.org>
Cc: Ian Lance Taylor <iant@google.com>
Cc: Luis Lozano <llozano@chromium.org>
Cc: Andreas Dilger <adilger@dilger.ca>
Cc: Olga Kornievskaia <aglo@umich.edu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: ceph-devel <ceph-devel@vger.kernel.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>
Cc: CIFS <linux-cifs@vger.kernel.org>
Cc: samba-technical <samba-technical@lists.samba.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Cc: Walter Harms <wharms@bfs.de>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
---

v3:
        - Don't remove some important text.
        - Reword BUGS.
v4:
	- Reword.
	- Link to BUGS.

Thanks, Amir, for all the help and better wordings.

Cheers,

Alex

---
 man2/copy_file_range.2 | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b80..f58bfea8f 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
 .B ENOSPC
 There is not enough space on the target filesystem to complete the copy.
 .TP
+.BR EOPNOTSUPP " (since Linux 5.12)"
+The filesystem does not support this operation.
+.TP
 .B EOVERFLOW
 The requested source or destination range is too large to represent in the
 specified data types.
@@ -184,10 +187,17 @@ or
 .I fd_out
 refers to an active swap file.
 .TP
-.B EXDEV
+.BR EXDEV " (before Linux 5.3)"
+The files referred to by
+.IR fd_in " and " fd_out
+are not on the same filesystem.
+.TP
+.BR EXDEV " (since Linux 5.12)"
 The files referred to by
 .IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same filesystem,
+and the source and target filesystems are not of the same type,
+or do not support cross-filesystem copy.
 .SH VERSIONS
 The
 .BR copy_file_range ()
@@ -200,8 +210,11 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
 are much more strictly checked than on earlier kernels.
 Applications should target the behaviour and requirements of 5.3 kernels.
 .PP
-First support for cross-filesystem copies was introduced in Linux 5.3.
-Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+Since Linux 5.12,
+cross-filesystem copies can be achieved
+when both filesystems are of the same type,
+and that filesystem implements support for it.
+See BUGS for behavior prior to 5.12.
 .SH CONFORMING TO
 The
 .BR copy_file_range ()
@@ -226,6 +239,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
 such as the use of reflinks (i.e., two or more inodes that share
 pointers to the same copy-on-write disk blocks)
 or server-side-copy (in the case of NFS).
+.SH BUGS
+In Linux kernels 5.3 to 5.11,
+cross-filesystem copies were implemented by the kernel,
+if the operation was not supported by individual filesystems.
+However, on some virtual filesystems,
+the call failed to copy, while still reporting success.
 .SH EXAMPLES
 .EX
 #define _GNU_SOURCE
-- 
2.30.1.721.g45526154a5


^ permalink raw reply related	[relevance 3%]
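
As an illustration of what the EXDEV/EOPNOTSUPP text and the BUGS note mean
for callers, here is a minimal sketch in C of a copy helper that tries
copy_file_range() first and finishes with read()/write() when the kernel
refuses or copies nothing. The helper names copy_fallback() and
copy_range_or_fallback() are invented for this example, and the set of
errno values treated as "fall back" is an assumption, not a list taken
from the man page.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop, used when the kernel cannot do the copy. */
static ssize_t copy_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);
		ssize_t off = 0;

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;			/* end of input */

		while (off < n) {
			ssize_t w = write(fd_out, buf + off, n - off);

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		done += n;
	}
	return done;
}

/*
 * Copy up to len bytes from the current offset of fd_in to the current
 * offset of fd_out. Returns the number of bytes copied, or -1 on error.
 */
ssize_t copy_range_or_fallback(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);
		ssize_t rest;

		if (n > 0) {
			done += n;	/* short copies are normal: loop */
			continue;
		}
		/* Assumed "try it by hand" errors; anything else is fatal. */
		if (n < 0 && errno != EXDEV && errno != EOPNOTSUPP &&
		    errno != ENOSYS && errno != EINVAL)
			return -1;

		/*
		 * n == 0 is either a real end of file or, per the BUGS
		 * text, a kernel that copied nothing. read() tells the
		 * two apart.
		 */
		rest = copy_fallback(fd_in, fd_out, len - done);
		return rest < 0 ? -1 : (ssize_t)(done + rest);
	}
	return done;
}
```

Falling back on a zero return costs one extra read() at a genuine end of
file, which seems a fair price for not trusting "0 bytes copied" on the
5.3 to 5.11 kernels described in BUGS.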

* [PATCH v18 9/9] secretmem: test: add basic selftest for memfd_secret(2)
                     ` (4 preceding siblings ...)
  2021-03-03 16:22  3% ` [PATCH v18 8/9] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
@ 2021-03-03 16:22  2% ` Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK and that remote accesses with process_vm_readv() and
ptrace() to the secret memory fail.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0


^ permalink raw reply related	[relevance 2%]
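
For anyone who wants to try the interface without the kselftest harness,
here is a minimal sketch in C of the basic flow the selftest relies on:
call the raw syscall, size the file, map it, and touch it. The function
name secretmem_roundtrip() and its return convention are made up for this
example. Sizing the file with ftruncate() before mmap() is a choice made
here for safety and is not something this series mandates.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PATTERN 0x55

/*
 * Returns 1 if a secret page could be created, written and read back,
 * 0 if secret memory is not usable here (syscall missing or disabled,
 * limits too low), and -1 if the data read back did not match.
 */
int secretmem_roundtrip(void)
{
#ifdef __NR_memfd_secret
	long page = sysconf(_SC_PAGESIZE);
	char *mem;
	int fd, ok;

	if (page <= 0)
		return 0;

	/* No libc wrapper is assumed, so go through syscall(2). */
	fd = syscall(__NR_memfd_secret, 0UL);
	if (fd < 0)
		return 0;	/* e.g. ENOSYS: not built in or not enabled */

	if (ftruncate(fd, page) < 0) {
		close(fd);
		return 0;
	}

	mem = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		close(fd);
		return 0;	/* e.g. RLIMIT_MEMLOCK too small */
	}

	/* The owning process can use the mapping like ordinary memory. */
	memset(mem, PATTERN, page);
	ok = (unsigned char)mem[0] == PATTERN &&
	     (unsigned char)mem[page - 1] == PATTERN;

	munmap(mem, page);
	close(fd);
	return ok ? 1 : -1;
#else
	return 0;		/* headers predate the syscall number */
#endif
}
```

A return of 0 is the common case on systems where the feature is not
enabled. The selftest above goes further and checks that read()/write()
on the descriptor and remote access from another process are refused.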

* [PATCH v18 8/9] arch, mm: wire up memfd_secret system call where relevant
                     ` (3 preceding siblings ...)
  2021-03-03 16:22  3% ` [PATCH v18 7/9] PM: hibernate: disable when there are active secretmem users Mike Rapoport
@ 2021-03-03 16:22  3% ` Mike Rapoport
  2021-03-03 16:22  2% ` [PATCH v18 9/9] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86, Palmer Dabbelt,
	Hagen Paul Pfeifer

From: Mike Rapoport <rppt@linux.ibm.com>

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/uapi/asm/unistd.h   | 1 +
 arch/riscv/include/asm/unistd.h        | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h               | 1 +
 include/uapi/asm-generic/unistd.h      | 6 +++++-
 scripts/checksyscalls.sh               | 4 ++++
 7 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
 #define __ARCH_WANT_SET_GET_RLIMIT
 #define __ARCH_WANT_TIME32_SYSCALLS
 #define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
  */
 
 #define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <uapi/asm/unistd.h>
 
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index a1c9f496fca6..34f04076a140 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -447,3 +447,4 @@
 440	i386	process_madvise		sys_process_madvise
 441	i386	epoll_pwait2		sys_epoll_pwait2		compat_sys_epoll_pwait2
 442	i386	mount_setattr		sys_mount_setattr
+443	i386	memfd_secret		sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7bf01cbe582f..bd3783edf27f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -364,6 +364,7 @@
 440	common	process_madvise		sys_process_madvise
 441	common	epoll_pwait2		sys_epoll_pwait2
 442	common	mount_setattr		sys_mount_setattr
+443	common	memfd_secret		sys_memfd_secret
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2839dc9a7c01..4b87a2b3f442 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1041,6 +1041,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_memfd_secret(unsigned long flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ce58cff99b66..7ac0732dbaa4 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,9 +863,13 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
 __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
 #define __NR_mount_setattr 442
 __SYSCALL(__NR_mount_setattr, sys_mount_setattr)
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 443
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
 
 #undef __NR_syscalls
-#define __NR_syscalls 443
+#define __NR_syscalls 444
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
 #define __IGNORE_setrlimit	/* setrlimit */
 #endif
 
+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
-- 
2.28.0


^ permalink raw reply related	[relevance 3%]

* [PATCH v18 6/9] mm: introduce memfd_secret system call to create "secret" memory areas
    2021-03-03 16:22  3% ` [PATCH v18 1/9] mm: add definition of PMD_PAGE_ORDER Mike Rapoport
  2021-03-03 16:22  3% ` [PATCH v18 5/9] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
@ 2021-03-03 16:22  2% ` Mike Rapoport
  2021-03-03 16:22  3% ` [PATCH v18 7/9] PM: hibernate: disable when there are active secretmem users Mike Rapoport
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process: they are mapped
neither in other processes nor in the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls on this file descriptor will be unmapped from the kernel
direct map and will be mapped only in the page tables of the processes
that have access to the file descriptor.

File descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). The
file descriptor approach allows explicit and controlled sharing of the
memory areas, and it allows operations to be sealed. Besides, file
descriptor based memory paves the way for VMMs to remove the secret memory
range from the userspace hypervisor process, for instance QEMU. Andy
Lutomirski says:

  "Getting fd-backed memory into a guest will take some possibly major work
   in the kernel, but getting vma-backed memory into a guest without
   mapping it in the host user address space seems much, much worse."

memfd_secret() is made a dedicated system call rather than an extension to
memfd_create() because its purpose is to allow the user to create more
secure memory mappings rather than to simply allow file based access to the
memory. Nowadays the cost of a new system call is negligible, while it is
much simpler for userspace to deal with clear-cut system calls than with a
multiplexer or an overloaded syscall. Moreover, the initial implementation
of memfd_secret() is completely distinct from memfd_create(), so there is
not much sense in overloading memfd_create() to begin with. If code
sharing between these implementations is ever needed, it can easily be
achieved without adjusting user visible APIs.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

If a use case arises that requires exposing secretmem to the kernel, it
will be an opt-in request in the system call flags, so that the user has
to decide what data can be exposed to the kernel.

Removing pages from the direct map may fragment it on architectures that
use large pages to map the physical memory, which can affect system
performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

Since the secretmem mappings are locked in memory, they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked independently of
mlock(), an attempt to mlock()/munlock() a secretmem range would fail, and
mlockall()/munlockall() will ignore secretmem mappings.
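
As a rough user space illustration of the RLIMIT_MEMLOCK constraint, a
caller could compare the size it is about to map against the soft limit.
The helper name fits_memlock_limit() is hypothetical, and the check is
only a sketch: it ignores memory the process has already locked and any
privilege that lifts the limit.

```c
#include <stddef.h>
#include <sys/resource.h>

/*
 * Return 1 if a secretmem mapping of len bytes could fit under the
 * soft RLIMIT_MEMLOCK, 0 if it could not, -1 if the limit is unknown.
 * Already locked memory is not accounted for, so a result of 1 is a
 * necessary condition only.
 */
static int fits_memlock_limit(size_t len)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
		return -1;

	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;

	return (rlim_t)len <= rl.rlim_cur;
}
```

A result of 0 suggests that mmap() on the secretmem file descriptor is
likely to be refused for that length.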

However, unlike mlock()ed memory, secretmem currently behaves more like
long-term GUP: secretmem mappings are unmovable mappings directly consumed
by user space. With default limits, there is no excessive use of secretmem
and it poses no real problem in combination with ZONE_MOVABLE/CMA, but in
the future this should be addressed to allow balanced use of large amounts
of secretmem along with ZONE_MOVABLE/CMA.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
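
A version of the same sequence with error handling might look like the
sketch below. It assumes the libc headers define SYS_memfd_secret and
calls the system call through syscall(2); secret_map() is a hypothetical
helper name, not part of this series.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create a secret mapping of len bytes.  Returns the mapping, or NULL
 * with errno set.  ENOSYS means the headers or the kernel lack
 * memfd_secret, or secretmem was not enabled at boot.
 */
static void *secret_map(size_t len)
{
#ifdef SYS_memfd_secret
	int fd = (int)syscall(SYS_memfd_secret, 0UL);
	void *ptr;
	int saved;

	if (fd < 0)
		return NULL;

	if (ftruncate(fd, (off_t)len) != 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return NULL;
	}

	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	saved = errno;
	close(fd);	/* the mapping keeps the file alive */
	errno = saved;

	return ptr == MAP_FAILED ? NULL : ptr;
#else
	(void)len;
	errno = ENOSYS;
	return NULL;
#endif
}
```

The caller releases the area with munmap() when the secret is no longer
needed.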

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e40579624f10..ecadc80934b2 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -758,6 +759,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -891,6 +895,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2030,6 +2037,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index f8f8cc32d03d..188711c72b67 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0



* [PATCH v18 5/9] set_memory: allow querying whether set_direct_map_*() is actually enabled
    2021-03-03 16:22  3% ` [PATCH v18 1/9] mm: add definition of PMD_PAGE_ORDER Mike Rapoport
@ 2021-03-03 16:22  3% ` Mike Rapoport
  2021-03-03 16:22  2% ` [PATCH v18 6/9] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend set_memory API with can_set_direct_map() function that allows
checking if calling set_direct_map_*() will actually change the page
table, replace several occurrences of open coded checks in arm64 with the
new function and provide a generic stub for architectures that always
modify page tables upon calls to set_direct_map APIs.
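
The intent can be modelled in user space as follows. This is a sketch
under stated assumptions: the two booleans stand in for the arm64
rodata_full and debug_pagealloc_enabled() state, and
secret_page_prepare() is a hypothetical caller, not a function from this
series.

```c
#include <stdbool.h>

/* Stand-ins for the arm64 boot-time settings; values are made up. */
static bool rodata_full;
static bool debug_pagealloc;

/* Models the arm64 can_set_direct_map(): true if either knob is on. */
static bool can_set_direct_map(void)
{
	return rodata_full || debug_pagealloc;
}

/*
 * Models a caller that must not trust a 0 return from
 * set_direct_map_invalid_noflush() on its own: it first asks whether
 * the direct map can be changed at all.
 */
static int secret_page_prepare(void)
{
	if (!can_set_direct_map())
		return -1;	/* page would stay mapped in the kernel */

	return 0;
}
```

The point of the query is that a successful return code alone does not
tell the caller whether the page was really unmapped.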

[arnd@arndb.de: arm64: kfence: fix header inclusion]

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/kfence.h     |  2 +-
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 8 files changed, 43 insertions(+), 15 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index ace2c3d7ae7e..4e3c13799735 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -131,12 +131,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index 90a335c74442..0ec94e718724 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3802cfbdd20d..9243ea9f4e9f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64 can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0



* [PATCH v18 1/9] mm: add definition of PMD_PAGE_ORDER
  @ 2021-03-03 16:22  3% ` Mike Rapoport
  2021-03-03 16:22  3% ` [PATCH v18 5/9] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The definition of PMD_PAGE_ORDER, denoting the order (log2 of the number
of base pages) of a second-level leaf page, is already used by DAX and
may be handy in other cases as well.

Several architectures already define PMD_ORDER as the size of the second
level page table, so to avoid a conflict with these definitions use the
PMD_PAGE_ORDER name and update DAX accordingly.
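
As a worked example of the arithmetic, the sketch below uses the shift
values of x86-64 with 4 KiB base pages. Those values are an assumption
for illustration (other architectures and configurations differ), and the
EX_ prefixed names are hypothetical.

```c
/* Assumed x86-64 values with 4 KiB base pages. */
#define EX_PAGE_SHIFT		12
#define EX_PMD_SHIFT		21

/* Same formula as PMD_PAGE_ORDER in this patch. */
#define EX_PMD_PAGE_ORDER	(EX_PMD_SHIFT - EX_PAGE_SHIFT)

/* Number of base pages covered by one PMD-level leaf entry. */
static unsigned long pmd_nr_base_pages(void)
{
	return 1UL << EX_PMD_PAGE_ORDER;
}
```

With these values the order is 9, so one 2 MiB leaf entry covers 512
base pages.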

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
---
 fs/dax.c                | 11 ++++-------
 include/linux/pgtable.h |  3 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fdc6775..12ff48bcee5b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
+		return PMD_PAGE_ORDER;
 	return 0;
 }
 
@@ -1471,7 +1468,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync;
@@ -1530,7 +1527,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
 	if (xa_is_internal(entry)) {
 		result = xa_to_internal(entry);
 		goto fallback;
@@ -1696,7 +1693,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	if (order == 0)
 		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
+	else if (order == PMD_PAGE_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdfc4e9f253e..3562cccf84ee 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
-- 
2.28.0



* [PATCH v18 7/9] PM: hibernate: disable when there are active secretmem users
                     ` (2 preceding siblings ...)
  2021-03-03 16:22  2% ` [PATCH v18 6/9] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
@ 2021-03-03 16:22  3% ` Mike Rapoport
  2021-03-03 16:22  3% ` [PATCH v18 8/9] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
  2021-03-03 16:22  2% ` [PATCH v18 9/9] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
  5 siblings, 0 replies; 200+ results
From: Mike Rapoport @ 2021-03-03 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Matthew Garrett, Mark Rutland, Michal Hocko, Mike Rapoport,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rafael J. Wysocki, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

It is unsafe to save secretmem areas to the hibernation snapshot: they
would be visible after resume, which essentially defeats the purpose of
secret memory mappings.

Prevent hibernation whenever there are active secret memory users.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h |  6 ++++++
 kernel/power/hibernate.c  |  5 ++++-
 mm/secretmem.c            | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 70e7db9f94fe..907a6734059c 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -6,6 +6,7 @@
 
 bool vma_is_secretmem(struct vm_area_struct *vma);
 bool page_is_secretmem(struct page *page);
+bool secretmem_active(void);
 
 #else
 
@@ -19,6 +20,11 @@ static inline bool page_is_secretmem(struct page *page)
 	return false;
 }
 
+static inline bool secretmem_active(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_SECRETMEM */
 
 #endif /* _LINUX_SECRETMEM_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include <linux/genhd.h>
 #include <linux/ktime.h>
 #include <linux/security.h>
+#include <linux/secretmem.h>
 #include <trace/events/power.h>
 
 #include "power.h"
@@ -81,7 +82,9 @@ void hibernate_release(void)
 
 bool hibernation_available(void)
 {
-	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+	return nohibernate == 0 &&
+		!security_locked_down(LOCKDOWN_HIBERNATION) &&
+		!secretmem_active();
 }
 
 /**
diff --git a/mm/secretmem.c b/mm/secretmem.c
index fa6738e860c2..f2ae3f32a193 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
 MODULE_PARM_DESC(secretmem_enable,
 		 "Enable secretmem and memfd_secret(2) system call");
 
+static atomic_t secretmem_users;
+
+bool secretmem_active(void)
+{
+	return !!atomic_read(&secretmem_users);
+}
+
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
 	.fault = secretmem_fault,
 };
 
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	atomic_dec(&secretmem_users);
+	return 0;
+}
+
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
 }
 
 static const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
 	.mmap		= secretmem_mmap,
 };
 
@@ -212,6 +226,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
 	file->f_flags |= O_LARGEFILE;
 
 	fd_install(fd, file);
+	atomic_inc(&secretmem_users);
 	return fd;
 
 err_put_fd:
-- 
2.28.0
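
For illustration only (not part of the patch): the mechanism above is a
plain user count. memfd_secret() increments it when a descriptor is
installed, the file's .release operation decrements it, and
hibernation_available() refuses while the count is non-zero. The
user-space model below uses C11 atomics in place of the kernel's atomic_t.
Every model_* name is hypothetical, and the two boolean flags merely stand
in for nohibernate and the lockdown check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel's secretmem_users counter. */
static atomic_int model_secretmem_users;

/* Stand-ins for nohibernate and security_locked_down(); assumptions. */
static bool model_nohibernate;
static bool model_locked_down;

/* Models the point where memfd_secret() installs a new descriptor. */
static void model_memfd_secret_open(void)
{
	atomic_fetch_add(&model_secretmem_users, 1);
}

/* Models the .release file operation, run when the file is finally closed. */
static void model_secretmem_release(void)
{
	int prev = atomic_fetch_sub(&model_secretmem_users, 1);

	/* A release without a matching open would be a bug in the model. */
	assert(prev > 0);
}

/* True while at least one secretmem file is still open. */
static bool model_secretmem_active(void)
{
	return atomic_load(&model_secretmem_users) != 0;
}

/* Hibernation is allowed only when every condition permits it. */
static bool model_hibernation_available(void)
{
	return !model_nohibernate &&
	       !model_locked_down &&
	       !model_secretmem_active();
}
```

In this model, hibernation becomes available again only after every open
has been matched by a release. A single remaining user is enough to block
it.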



2020-08-24 12:24     [PATCH 1/5] Add manpage for open_tree(2) David Howells
2020-08-24 12:24     ` [PATCH 2/5] Add manpages for move_mount(2) David Howells
2020-08-27 11:04       ` Michael Kerrisk (man-pages)
2021-08-13  0:21 12%     ` Michael Kerrisk (man-pages)
2020-08-24 12:24     ` [PATCH 3/5] Add manpage for fspick(2) David Howells
2020-08-27 11:05       ` Michael Kerrisk (man-pages)
2021-08-13  0:22 12%     ` Michael Kerrisk (man-pages)
2020-08-24 12:25     ` [PATCH 4/5] Add manpage for fsopen(2) and fsmount(2) David Howells
2020-08-27 11:07       ` Michael Kerrisk (man-pages)
2021-08-13  0:22 12%     ` Michael Kerrisk (man-pages)
2020-08-24 12:25     ` [PATCH 5/5] Add manpage for fsconfig(2) David Howells
2020-08-27 11:07       ` Michael Kerrisk (man-pages)
2021-08-13  0:23 12%     ` Michael Kerrisk (man-pages)
2020-08-27 11:01     ` [PATCH 1/5] Add manpage for open_tree(2) Michael Kerrisk (man-pages)
2021-08-13  0:20 12%   ` Michael Kerrisk (man-pages)
2020-08-31 15:32     [PATCH v2] vfs: add RWF_NOAPPEND flag for pwritev2 Rich Felker
2020-08-31 15:46     ` Jann Horn
2020-08-31 17:05       ` Jens Axboe
2024-01-18 15:57  0%     ` Rich Felker
2024-01-18 16:02  0%       ` Jens Axboe
2020-11-23 21:31     set_thread_area.2: csky architecture undocumented Alejandro Colomar (man-pages)
2020-11-24  9:51     ` Michael Kerrisk (man-pages)
2020-11-24 12:07       ` Guo Ren
2023-10-14 23:20  0%     ` Alejandro Colomar
2023-10-15 15:09  0%       ` Guo Ren
2021-01-23 16:11     [PATCH v6] close_range.2: new page documenting close_range(2) Stephen Kitt
2021-01-28 20:50     ` Michael Kerrisk (man-pages)
2021-01-28 22:10       ` Stephen Kitt
     [not found]         ` <20210129100024.m4bil5mz5prry4iq@wittgenstein>
2021-03-21 15:31 11%       ` Michael Kerrisk (man-pages)
2021-03-09 19:53  5%   ` Stephen Kitt
2021-03-21 15:38 11%     ` Michael Kerrisk (man-pages)
2021-03-22 21:31  5%       ` Stephen Kitt
2021-02-24 14:23     [PATCH] copy_file_range.2: Kernel v5.12 updates Luis Henriques
2021-03-04  9:38  3% ` [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12 Alejandro Colomar
2021-03-04 17:13  0%   ` Darrick J. Wong
2021-03-02 12:57     [RFC PATCH v4 0/9] KVM: selftests: some improvement and a new test for kvm page table Yanan Wang
2021-03-02 12:57     ` [RFC PATCH v4 2/9] tools headers: Add a macro to get HUGETLB page sizes for mmap Yanan Wang
2021-03-12 11:14  0%   ` Andrew Jones
2021-03-15  2:06  0%     ` wangyanan (Y)
2021-03-03 16:22     [PATCH v18 0/9] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-03-03 16:22  3% ` [PATCH v18 1/9] mm: add definition of PMD_PAGE_ORDER Mike Rapoport
2021-03-03 16:22  3% ` [PATCH v18 5/9] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
2021-03-03 16:22  2% ` [PATCH v18 6/9] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-03-03 16:22  3% ` [PATCH v18 7/9] PM: hibernate: disable when there are active secretmem users Mike Rapoport
2021-03-03 16:22  3% ` [PATCH v18 8/9] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
2021-03-03 16:22  2% ` [PATCH v18 9/9] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
     [not found]     <20210226172634.26905-1-yu-cheng.yu@intel.com>
2021-02-26 17:26     ` [PATCH 2/2] sigaction.2: wfix - Clarify si_addr description Yu-cheng Yu
2021-03-08 21:30  0%   ` Borislav Petkov
2021-03-08 21:46  0%     ` Yu, Yu-cheng
2021-03-10 22:00     [PATCH v22 00/28] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-03-10 22:00  3% ` [PATCH v22 06/28] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-03-16 15:10     [PATCH v23 00/28] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-03-16 15:10  3% ` [PATCH v23 06/28] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-03-22 10:45 12% man-pages-5.11 released Michael Kerrisk (man-pages)
2021-03-23 13:52     [RFC PATCH v5 00/10] KVM: selftests: some improvement and a new test for kvm page table Yanan Wang
2021-03-23 13:52  4% ` [RFC PATCH v5 02/10] tools headers: Add a macro to get HUGETLB page sizes for mmap Yanan Wang
2021-03-23 14:03  0%   ` Andrew Jones
2021-03-24  1:48  0%     ` wangyanan (Y)
2021-03-29 22:18     [PATCH v5 0/4] man2: udpate mm/userfaultfd manpages to latest Peter Xu
2021-04-01 12:00     ` Alejandro Colomar (man-pages)
2021-04-05 11:50 11%   ` Michael Kerrisk (man-pages)
2021-03-30  8:08     [PATCH v6 00/10] KVM: selftests: some improvement and a new test for kvm page table Yanan Wang
2021-03-30  8:08  4% ` [PATCH v6 02/10] mm/hugetlb: Add a macro to get HUGETLB page sizes for mmap Yanan Wang
2021-04-01 22:10     [PATCH v24 00/30] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-04-01 22:10  3% ` [PATCH v24 06/30] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-04-12 10:30  5% [ANNOUNCE] util-linux v2.37-rc1 Karel Zak
2021-04-14  5:52  3% [PATCH 0/4 POC] Allow executing code and syscalls in another address space Andrei Vagin
2021-04-14  7:22  0% ` Anton Ivanov
2021-04-15 22:13     [PATCH v25 00/30] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-04-15 22:13  3% ` [PATCH v25 06/30] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-04-27 20:42     [PATCH v26 00/30] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-04-27 20:42  3% ` [PATCH v26 06/30] x86/cet: Add control-protection fault handler Yu-cheng Yu
     [not found]     <20210509213930.94120-1-alx.manpages@gmail.com>
2021-05-09 21:39  3% ` [PATCH] copy_file_range.2: Update cross-filesystem support for 5.12 Alejandro Colomar
2021-05-10  0:01 10%   ` Michael Kerrisk (man-pages)
2021-05-10  4:26  5%     ` Amir Goldstein
2021-05-10 16:34 10%       ` Michael Kerrisk (man-pages)
2021-05-13 18:47     [PATCH v19 0/8] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-05-13 18:47  4% ` [PATCH v19 1/8] mmap: make mlock_future_check() global Mike Rapoport
2021-05-14  8:27  0%   ` David Hildenbrand
2021-05-13 18:47  3% ` [PATCH v19 3/8] set_memory: allow set_direct_map_*_noflush() for multiple pages Mike Rapoport
2021-05-13 18:47  3% ` [PATCH v19 4/8] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
2021-05-13 18:47  2% ` [PATCH v19 5/8] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-05-13 18:47  3% ` [PATCH v19 6/8] PM: hibernate: disable when there are active secretmem users Mike Rapoport
2021-05-14  9:27  0%   ` David Hildenbrand
2021-05-18 10:24  0%   ` Mark Rutland
2021-05-13 18:47  3% ` [PATCH v19 7/8] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
2021-05-14  9:27  0%   ` David Hildenbrand
2021-05-13 18:47  2% ` [PATCH v19 8/8] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
2021-05-18  7:20     [PATCH v20 0/7] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-05-18  7:20  4% ` [PATCH v20 1/7] mmap: make mlock_future_check() global Mike Rapoport
2021-05-18  7:20  3% ` [PATCH v20 3/7] set_memory: allow querying whether set_direct_map_*() is actually enabled Mike Rapoport
2021-05-18  7:20  2% ` [PATCH v20 4/7] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2021-05-18  7:20  3% ` [PATCH v20 5/7] PM: hibernate: disable when there are active secretmem users Mike Rapoport
2021-05-18  7:20  3% ` [PATCH v20 6/7] arch, mm: wire up memfd_secret system call where relevant Mike Rapoport
2021-05-18  7:20  2% ` [PATCH v20 7/7] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
2021-05-21 22:11     [PATCH v27 00/31] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-05-21 22:11  3% ` [PATCH v27 06/31] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-06-01  8:38  5% [ANNOUNCE] util-linux v2.37 Karel Zak
2021-06-07 22:19     [PATCH] kernel_lockdown.7: Remove additional text alluding to lifting via SysRq dann frazier
2021-06-09 21:29 11% ` Michael Kerrisk (man-pages)
2021-06-22  1:11 12% man-pages-5.12 is released Michael Kerrisk (man-pages)
2021-06-29 22:54     Semantics of SECCOMP_MODE_STRICT? Eric W. Biederman
2021-06-30  5:23     ` Kees Cook
2021-06-30 20:11       ` [PATCH] seccomp.2: Clarify that bad system calls kill the thread Eric W. Biederman
2021-08-10  2:07 11%     ` Michael Kerrisk (man-pages)
2021-07-22 20:51     [PATCH v28 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-07-22 20:51  3% ` [PATCH v28 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-07-29 22:26     [PATCH 0/1] Revert change in pipe reader wakeup behavior Sandeep Patil
2021-07-29 22:26     ` [PATCH 1/1] fs: pipe: wakeup readers everytime new data written is to pipe Sandeep Patil
2021-07-29 23:01       ` Linus Torvalds
2021-07-30 19:11         ` Sandeep Patil
2021-07-30 19:23           ` Linus Torvalds
2021-07-30 19:47  5%         ` Sandeep Patil
2021-08-02 10:46  5% [PATCH AUTOSEL 5.13 001/104] pipe: make pipe writes always wake up readers Sasha Levin
2021-08-02 10:56  0% ` Greg Kroah-Hartman
2021-08-02 13:43     [PATCH 5.13 000/104] 5.13.8-rc1 review Greg Kroah-Hartman
2021-08-02 13:43  5% ` [PATCH 5.13 001/104] pipe: make pipe writes always wake up readers Greg Kroah-Hartman
2021-08-02 13:44     [PATCH 5.10 00/67] 5.10.56-rc1 review Greg Kroah-Hartman
2021-08-02 13:44  5% ` [PATCH 5.10 03/67] pipe: make pipe writes always wake up readers Greg Kroah-Hartman
2021-08-08  9:09  9% Documenting the requirement of CAP_SETFCAP to map UID 0 Michael Kerrisk (man-pages)
2021-08-10 23:58  5% ` Serge E. Hallyn
2021-08-11 10:10 11%   ` Michael Kerrisk (man-pages)
2021-08-10  1:38  4% Questions re the new mount_setattr(2) manual page Michael Kerrisk (man-pages)
2021-08-10  7:12 11% ` Michael Kerrisk (man-pages)
2021-08-10 14:11  5%   ` Christian Brauner
2021-08-10 19:30 11%     ` Michael Kerrisk (man-pages)
2021-08-10 14:32  4% ` Christian Brauner
2021-08-10 21:06  9%   ` Michael Kerrisk (man-pages)
2021-08-11 10:07  4%     ` Christian Brauner
2021-08-12  5:36  9%       ` Michael Kerrisk (man-pages)
2021-08-12  9:08  5%         ` Christian Brauner
2021-08-12 22:32 11%           ` Michael Kerrisk (man-pages)
2021-08-10 22:47  5% ` Michael Kerrisk (man-pages)
2021-08-11 10:40  4%   ` Christian Brauner
2021-08-12  5:36  7%     ` Michael Kerrisk (man-pages)
2021-08-12  8:38  4%       ` Christian Brauner
2021-08-13  1:25 10%         ` Michael Kerrisk (man-pages)
2021-08-13 22:01  8% [PATCHi, man-pages] mount_namespaces.7: More clearly explain "locked mounts" Michael Kerrisk
2021-08-14  8:09  5% ` Christian Brauner
2021-08-16 16:03  5% ` Eric W. Biederman
2021-08-17  3:12  8%   ` Michael Kerrisk (man-pages)
2021-08-17 14:06  4%     ` Christian Brauner
2021-08-19  0:24 10%       ` Michael Kerrisk (man-pages)
2021-08-17 15:51  5%     ` Eric W. Biederman
2021-08-19  0:22 11%       ` Michael Kerrisk (man-pages)
2021-08-20 18:11     [PATCH v29 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-08-20 18:11  3% ` [PATCH v29 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-08-27 21:45 13% man-pages-5.13 is released Michael Kerrisk (man-pages)
2021-08-30 18:14     [PATCH v30 00/32] Control-flow Enforcement: Shadow Stack Yu-cheng Yu
2021-08-30 18:15  3% ` [PATCH v30 06/32] x86/cet: Add control-protection fault handler Yu-cheng Yu
2021-11-17 19:05  3% [PATCH v1 1/1] MAINTAINERS: Sort sections with parse-maintainers.pl help Andy Shevchenko
2021-12-04 17:52  1% [PATCH] MAINTAINERS: Sort entries using parse-maintainers.pl Jonathan Neuschäfer
2022-01-23 19:31  4% [RFC PATCH] rseq: Fix broken uapi field layout on 32-bit little endian Mathieu Desnoyers
2022-01-24  6:19  0% ` Greg KH
2022-01-24 17:12     [RFC PATCH 00/15] rseq uapi and selftest updates Mathieu Desnoyers
2022-01-24 17:12  4% ` [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian Mathieu Desnoyers
2022-01-25 12:21  0%   ` Christian Brauner
2022-01-25 14:41  0%     ` Mathieu Desnoyers
2022-01-26  4:39  5% [PATCH] fs/exec: require argv[0] presence in do_execveat_common() Ariadne Conill
2022-01-26  6:42  0% ` Kees Cook
2022-01-26  7:28  0%   ` Kees Cook
2022-01-26 11:18  0%     ` Ariadne Conill
2022-01-26 13:27  0% ` Rich Felker
2022-01-26 14:46  0%   ` Christian Brauner
2022-01-26 17:37  0%   ` Ariadne Conill
2022-01-26 11:44  4% [PATCH v2] " Ariadne Conill
2022-01-26 14:40  0% ` Matthew Wilcox
2022-01-26 17:41  0%   ` Ariadne Conill
2022-01-26 14:59  0% ` Matthew Wilcox
2022-01-26 16:40  0%   ` Kees Cook
2022-01-26 16:57  0%   ` Eric W. Biederman
2022-01-26 17:32  0%     ` Ariadne Conill
2022-01-26 18:03  0%     ` Matthew Wilcox
2022-01-26 18:38  0%       ` Ariadne Conill
2022-01-26 20:09  0% ` Kees Cook
2022-01-26 20:23  0%   ` Ariadne Conill
2022-01-26 20:56  0%     ` Kees Cook
2022-01-26 21:13  0%       ` Ariadne Conill
2022-01-26 17:57  5% [PATCH] fs/binfmt_elf: Add padding NULL when argc == 0 Kees Cook
2022-01-26 18:07  0% ` Jann Horn
2022-01-26 18:42  0%   ` Ariadne Conill
2022-01-26 19:50  0%     ` Jann Horn
2022-01-26 19:58  0%       ` Kees Cook
2022-01-26 20:08  0%         ` Matthew Wilcox
2022-01-26 19:56  0%   ` Kees Cook
2022-01-26 20:10  0% ` Ariadne Conill
2022-01-26 20:46  0%   ` Ariadne Conill
2022-01-26 20:52  0% ` Rich Felker
2022-01-26 18:59     [RFC PATCH 02/15] rseq: Remove broken uapi field layout on 32-bit little endian Mathieu Desnoyers
2022-01-27 15:27  4% ` [RFC PATCH v2] " Mathieu Desnoyers
2022-01-28  8:52  0%   ` Christian Brauner
2022-01-27  0:07  5% [PATCH v3] fs/exec: require argv[0] presence in do_execveat_common() Ariadne Conill
2022-01-27  5:29  0% ` Kees Cook
2022-01-27 16:51  0%   ` Eric W. Biederman
2022-01-27 21:29     [PATCH] pidfd: fix test failure due to stack overflow on some arches Axel Rasmussen
2022-01-28  8:56  6% ` Christian Brauner
2022-02-02 15:52  0%   ` Shuah Khan
2022-01-30 21:18     [PATCH 00/35] Shadow stacks for userspace Rick Edgecombe
2022-01-30 21:18  3% ` [PATCH 06/35] x86/cet: Add control-protection fault handler Rick Edgecombe
2022-01-31 15:14  1% [ANNOUNCE] util-linux v2.38-rc1 Karel Zak
2022-01-31 17:10  4% [PATCH] generic/633: adapt execveat() invocations Christian Brauner
2022-01-31 20:46  0% ` Kees Cook
2022-02-01  0:09  5% [PATCH] exec: Force single empty string when argv is empty Kees Cook
2022-02-01  1:00  0% ` Ariadne Conill
2022-02-01  2:00  0% ` Andy Lutomirski
2022-02-01 13:22  0% ` Christian Brauner
2022-02-01 14:53  0% ` Rich Felker
2022-02-02 15:50  0%   ` Kees Cook
2022-02-02 17:12  0%     ` Rich Felker
2022-02-02  9:52  4% [PATCH] generic/633: pass non-empty argv with execveat() Christian Brauner
2022-02-18 21:06     [RFC PATCH v2 00/11] RSEQ node id and virtual cpu id extensions Mathieu Desnoyers
2022-02-18 21:06     ` [RFC PATCH v2 09/11] sched: Introduce per memory space current virtual cpu id Mathieu Desnoyers
2022-02-25 17:35       ` Jonathan Corbet
2022-02-25 17:56         ` Mathieu Desnoyers
2022-02-25 18:15           ` Jonathan Corbet
2022-02-25 18:39  5%         ` Mathieu Desnoyers
2022-03-28 11:52  1% [ANNOUNCE] util-linux v2.38 Karel Zak
2022-04-05  7:12     [PATCH 5.17 0000/1126] 5.17.2-rc1 review Greg Kroah-Hartman
2022-04-05  7:15  5% ` [PATCH 5.17 0159/1126] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-04-05  7:15     [PATCH 5.16 0000/1017] 5.16.19-rc1 review Greg Kroah-Hartman
2022-04-05  7:17  5% ` [PATCH 5.16 0164/1017] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-04-05  7:17     [PATCH 5.15 000/913] 5.15.33-rc1 review Greg Kroah-Hartman
2022-04-05  7:20  5% ` [PATCH 5.15 156/913] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-04-05  7:24     [PATCH 5.10 000/599] 5.10.110-rc1 review Greg Kroah-Hartman
2022-04-05  7:26  5% ` [PATCH 5.10 108/599] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-06-03 17:39     [PATCH 4.9 00/12] 4.9.317-rc1 review Greg Kroah-Hartman
2022-06-03 17:39  5% ` [PATCH 4.9 06/12] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-06-03 17:39     [PATCH 4.14 00/23] 4.14.282-rc1 review Greg Kroah-Hartman
2022-06-03 17:39  5% ` [PATCH 4.14 13/23] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-06-03 17:39     [PATCH 4.19 00/30] 4.19.246-rc1 review Greg Kroah-Hartman
2022-06-03 17:39  5% ` [PATCH 4.19 18/30] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-06-03 17:42     [PATCH 5.4 00/34] 5.4.197-rc1 review Greg Kroah-Hartman
2022-06-03 17:43  5% ` [PATCH 5.4 19/34] exec: Force single empty string when argv is empty Greg Kroah-Hartman
2022-09-29 22:28     [PATCH v2 00/39] Shadowstacks for userspace Rick Edgecombe
2022-09-29 22:29  3% ` [PATCH v2 07/39] x86/cet: Add user control-protection fault handler Rick Edgecombe
2022-10-09 18:01  2% man-pages-6.00 released Alejandro Colomar
2022-10-13 18:48     [PATCH 0/2] Documentation: Start Spanish translation and include HOWTO Carlos Bilbao
2022-10-13 18:48  2% ` [PATCH 2/2] Documentation: Add HOWTO Spanish translation into rst based build system Carlos Bilbao
2022-10-14  9:21  0%   ` Bagas Sanjaya
2022-10-14 12:58  0%     ` Carlos Bilbao
2022-10-14 14:24     ` [PATCH v2 0/2] Documentation: Start Spanish translation and include HOWTO Carlos Bilbao
2022-10-14 14:24  2%   ` [PATCH v2 2/2] Documentation: Add HOWTO Spanish translation into rst based build system Carlos Bilbao
2022-10-16 11:58     [PATCH 0/2] docs/zh_CN: Add userspace-api/index and ebpf Chinese translation Rui Li
     [not found]     ` <cover.1665919802.git.me@lirui.org>
2022-10-16 11:58  8%   ` [PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
2022-10-17 13:21  0%     ` Yanteng Si
2022-10-17 13:27     [RESEND PATCH 0/2] docs/zh_CN: Add userspace-api/index and ebpf " Rui Li
2022-10-17 13:27  8% ` [RESEND PATCH 1/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
2022-10-18 11:54     [PATCH v2 0/2] docs/zh_CN: Add userspace-api/index and ebpf " Rui Li
2022-10-18 11:54  8% ` [PATCH v2 1/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
2022-10-19 12:08  0%   ` Yanteng Si
2022-10-19 13:30     [PATCH v3 0/2] docs/zh_CN: Add userspace-api/index and ebpf " Rui Li
2022-10-19 13:30  7% ` [PATCH v3 2/2] docs/zh_CN: Add userspace-api/ebpf " Rui Li
2022-10-20  6:57  0%   ` Yanteng Si
2022-10-24 14:55     [PATCH v3 0/2] Documentation: Start Spanish translation and include HOWTO Carlos Bilbao
2022-10-24 14:55  2% ` [PATCH v3 2/2] Documentation: Add HOWTO Spanish translation into rst based build system Carlos Bilbao
2022-11-04 22:35     [PATCH v3 00/37] Shadow stacks for userspace Rick Edgecombe
2022-11-04 22:35  2% ` [PATCH v3 07/37] x86/cet: Add user control-protection fault handler Rick Edgecombe
2022-12-03  0:35     [PATCH v4 00/39] Shadow stacks for userspace Rick Edgecombe
2022-12-03  0:35  2% ` [PATCH v4 07/39] x86: Add user control-protection fault handler Rick Edgecombe
2022-12-22 19:39  3% man-pages-6.02 released Alejandro Colomar
2023-01-19 21:22     [PATCH v5 00/39] Shadow stacks for userspace Rick Edgecombe
2023-01-19 21:22  2% ` [PATCH v5 07/39] x86: Add user control-protection fault handler Rick Edgecombe
2023-02-14 19:54     [PATCH 1/1] rseq.2: New man page for the rseq(2) API Mathieu Desnoyers
2023-02-14 22:29     ` Alejandro Colomar
2023-02-15  1:20       ` G. Branden Robinson
2023-02-15  1:52         ` Alejandro Colomar
2023-02-15  2:21  5%       ` G. Branden Robinson
2023-02-15  3:07  0%         ` Alejandro Colomar
2023-02-18 21:13     [PATCH v6 00/41] Shadow stacks for userspace Rick Edgecombe
2023-02-18 21:14  2% ` [PATCH v6 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
2023-02-27 22:29     [PATCH v7 00/41] Shadow stacks for userspace Rick Edgecombe
2023-02-27 22:29  2% ` [PATCH v7 08/41] x86/shstk: Add user control-protection fault handler Rick Edgecombe
2023-03-15 14:35  7% [PATCH] docs/sp_SP: Add translation of process/adding-syscalls Carlos Bilbao
2023-04-25 18:48     [patch 00/20] posix-timers: Fixes and cleanups Thomas Gleixner
2023-04-25 18:49  4% ` [patch 10/20] posix-timers: Document sys_clock_getres() correctly Thomas Gleixner
2023-04-25 18:49  5% ` [patch 12/20] posix-timers: Document sys_clock_getoverrun() Thomas Gleixner
2023-06-01 11:06  0%   ` Frederic Weisbecker
2023-06-30 23:33  5% [PATCH] proc.5: Clarify that boot arguments can be embedded in image Paul E. McKenney
2023-07-04 12:59  0% ` Masami Hiramatsu
2023-07-05 20:33  0%   ` Paul E. McKenney
2023-07-08 17:19  0%     ` Alejandro Colomar
2023-08-01 13:19  3% man-pages-6.05 released Alejandro Colomar
2023-08-02  4:19  5% ` Luna Jernberg
2023-08-02 22:32     ` man-pages-6.05.01 released Alejandro Colomar
2023-08-04  3:40  5%   ` Luna Jernberg
2024-02-12  1:44  2% man-pages-6.06 released Alejandro Colomar
