linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/5] Add process_memwatch syscall
@ 2022-07-26 16:18 Muhammad Usama Anjum
  2022-07-26 16:18 ` [PATCH 1/5] fs/proc/task_mmu: make functions global to be used in other files Muhammad Usama Anjum
                   ` (7 more replies)
  0 siblings, 8 replies; 13+ messages in thread
From: Muhammad Usama Anjum @ 2022-07-26 16:18 UTC
  To: Jonathan Corbet, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	H. Peter Anvin, Arnd Bergmann, Andrew Morton, Peter Zijlstra,
	Arnaldo Carvalho de Melo, Mark Rutland, Alexander Shishkin,
	Jiri Olsa, Namhyung Kim, Shuah Khan, open list:DOCUMENTATION,
	open list, open list:PROC FILESYSTEM, open list:ABI/API,
	open list:GENERIC INCLUDE/ASM HEADER FILES,
	open list:MEMORY MANAGEMENT,
	open list:PERFORMANCE EVENTS SUBSYSTEM,
	open list:KERNEL SELFTEST FRAMEWORK, krisman
  Cc: Muhammad Usama Anjum, kernel

Hello,

This patch series implements a new syscall, process_memwatch. Currently,
it only supports watching the soft-dirty PTE bit. The syscall is a
generic interface for watching the memory of a process, and it leaves
room to add more watch operations in the future.

The soft-dirty PTE bit of memory pages can be read through the pagemap
procfs file, and the soft-dirty PTE bits of a process can be cleared by
writing to its clear_refs file. This series adds features that are not
possible through the procfs interface:
- The soft-dirty PTE bit status cannot be read and cleared in one atomic
  operation.
- The soft-dirty PTE bit cannot be cleared for only a part of memory.
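For reference, here is a minimal sketch of how the existing procfs
interface is typically used from userspace. It is not part of this
series. It relies on the documented pagemap layout (bit 55 of each 64-bit
entry is the soft-dirty flag) and on writing "4" to clear_refs; the
function names are hypothetical and chosen only for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Bit 55 of a pagemap entry is the soft-dirty flag. */
#define PM_SOFT_DIRTY (1ULL << 55)

/* Hypothetical helper: decode the soft-dirty flag of one pagemap entry. */
static int pagemap_entry_is_soft_dirty(uint64_t entry)
{
	return (entry & PM_SOFT_DIRTY) != 0;
}

/*
 * Hypothetical helper: returns 1 if the page holding addr is soft-dirty,
 * 0 if it is not, and -1 if pagemap could not be read.
 */
static int page_is_soft_dirty(const void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uint64_t entry;
	off_t offset;
	ssize_t n;
	int fd;

	assert(page_size > 0);

	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;

	/* One 64-bit entry per virtual page, indexed by page number. */
	offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
		 (off_t)sizeof(entry);
	n = pread(fd, &entry, sizeof(entry), offset);
	close(fd);
	if (n != (ssize_t)sizeof(entry))
		return -1;

	return pagemap_entry_is_soft_dirty(entry);
}

/*
 * Hypothetical helper: clears the soft-dirty bits of the whole process.
 * There is no way to restrict this to an address range.
 */
static int clear_all_soft_dirty(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}
```

A caller that wants "get then clear" has to call page_is_soft_dirty()
for each page and then clear_all_soft_dirty(), so writes that land
between the two steps are lost, and every other region of the process is
cleared as well. Those are the two limitations listed above.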

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for that use, as I think the
process is frozen. Our use case requires tracking the soft-dirty PTE bit
for running processes. We need to get and clear the bits for a region of
memory while the process is running in order to emulate the Windows
GetWriteWatch() call. Games use it to keep track of dirty pages and to
process only those pages. This syscall can be used by the CRIU project
and other applications which require soft-dirty PTE bit information.

Because the current kernel cannot clear the soft-dirty bits for only a
part of memory (only for the entire process), and because the get and
clear operations cannot be performed atomically, the alternative is to
gather this information entirely in userspace, with poor performance:
- The mprotect syscall with a SIGSEGV handler for bookkeeping
- The userfaultfd syscall with a handler for bookkeeping
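To make the first userspace alternative concrete, here is a minimal
sketch of write tracking with mprotect and a SIGSEGV handler. It is an
illustration under stated assumptions, not code from this series: all
names (track_begin, track_dirty, and so on) are hypothetical, only one
region is tracked, and mprotect() is called from the signal handler,
which works on Linux in practice but is not on the POSIX list of
async-signal-safe functions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_TRACKED_PAGES 1024

/* Hypothetical bookkeeping state for a single tracked region. */
static char *track_base;
static size_t track_pages;
static size_t track_page_size;
static volatile unsigned char track_dirty[MAX_TRACKED_PAGES];

/* On a write fault inside the region: record the page, allow the write. */
static void track_on_segv(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)track_base;
	size_t idx;

	(void)sig;
	(void)ctx;

	if (addr < base || addr >= base + track_pages * track_page_size)
		_exit(1);	/* a genuine fault outside the region */

	idx = (addr - base) / track_page_size;
	track_dirty[idx] = 1;

	/* Make the page writable so the faulting store is retried. */
	mprotect(track_base + idx * track_page_size, track_page_size,
		 PROT_READ | PROT_WRITE);
}

/* Start tracking: install the handler and write-protect the region. */
static int track_begin(void *base, size_t npages)
{
	struct sigaction sa;
	size_t i;

	if (npages > MAX_TRACKED_PAGES)
		return -1;

	track_page_size = (size_t)sysconf(_SC_PAGESIZE);
	track_base = base;
	track_pages = npages;
	for (i = 0; i < MAX_TRACKED_PAGES; i++)
		track_dirty[i] = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = track_on_segv;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return -1;

	return mprotect(base, npages * track_page_size, PROT_READ);
}
```

After track_begin() on a page-aligned region, the first write to each
page takes a fault and sets the matching track_dirty[] entry. Every
first write therefore costs a signal delivery plus an mprotect() call,
which is where the poor performance comes from.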

The proposed syscall has the following prototype:

        long process_memwatch(int pidfd, unsigned long start, int len,
                              unsigned int flags, void *vec, int vec_len);

The following operations are supported by this syscall:
- Get the pages that are soft-dirty.
- Clear the soft-dirty bit of the pages that are soft-dirty.
- Optionally, via a flag, ignore VM_SOFTDIRTY and track only the
  per-page soft-dirty PTE bit.

Two decisions have been made about how the syscall returns its output:
- Return the offsets of the dirty pages from the start address in vec
- Stop execution when vec is filled with dirty pages
These two decisions don't follow the mincore() philosophy, where the
output array corresponds to the address range one to one. For that
reason mincore() is not passed an output buffer length, and it only sets
a flag for each page that is present. This makes mincore() easy to use
but gives less control. We pass the size of the output array and store
the return data consecutively, as offsets of the dirty pages from the
start. The user can easily convert these offsets back into dirty page
addresses. Suppose the user wants to get the first 10 dirty pages from a
total of 100 pages. They allocate an output buffer of size 10, and the
process_memwatch() syscall stops after finding 10 dirty pages. This
behaviour is needed to support Windows' GetWriteWatch(). mincore()-like
behaviour can be achieved by passing an output buffer of size 100. The
interface can therefore be used for either behaviour.
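
The output convention described above can be modelled in plain
userspace C. This is a hypothetical sketch, not the kernel
implementation: the function names are invented for illustration, the
dirty state is a simple byte array, and it assumes the offsets stored in
vec are counted in pages (if they are byte offsets instead, the
conversion is just start + offset).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the output format: store the offsets of dirty
 * pages consecutively in vec and stop as soon as vec is full.
 * Returns the number of offsets written.
 */
static size_t collect_dirty_offsets(const unsigned char *dirty, size_t npages,
				    unsigned long *vec, size_t vec_len)
{
	size_t found = 0;
	size_t page;

	assert(dirty != NULL && vec != NULL);

	for (page = 0; page < npages && found < vec_len; page++) {
		if (dirty[page])
			vec[found++] = (unsigned long)page;
	}
	return found;
}

/* Convert a page offset back to an address (assumes page-unit offsets). */
static unsigned long offset_to_address(unsigned long start,
				       unsigned long offset,
				       unsigned long page_size)
{
	return start + offset * page_size;
}
```

With 100 pages and a vec of length 10, the walk stops after the tenth
dirty page, which is the GetWriteWatch()-style usage. With a vec of
length 100 the whole range is always scanned, which gives the
mincore()-like usage.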

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (5):
  fs/proc/task_mmu: make functions global to be used in other files
  mm: Implement process_memwatch syscall
  mm: wire up process_memwatch syscall for x86
  selftests: vm: add process_memwatch syscall tests
  mm: add process_memwatch syscall documentation

 Documentation/admin-guide/mm/soft-dirty.rst   |  48 +-
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 fs/proc/task_mmu.c                            |  84 +--
 include/linux/mm_inline.h                     |  99 +++
 include/linux/syscalls.h                      |   3 +-
 include/uapi/asm-generic/unistd.h             |   5 +-
 include/uapi/linux/memwatch.h                 |  12 +
 kernel/sys_ni.c                               |   1 +
 mm/Makefile                                   |   2 +-
 mm/memwatch.c                                 | 285 ++++++++
 tools/include/uapi/asm-generic/unistd.h       |   5 +-
 .../arch/x86/entry/syscalls/syscall_64.tbl    |   1 +
 tools/testing/selftests/vm/.gitignore         |   1 +
 tools/testing/selftests/vm/Makefile           |   2 +
 tools/testing/selftests/vm/memwatch_test.c    | 635 ++++++++++++++++++
 16 files changed, 1098 insertions(+), 87 deletions(-)
 create mode 100644 include/uapi/linux/memwatch.h
 create mode 100644 mm/memwatch.c
 create mode 100644 tools/testing/selftests/vm/memwatch_test.c

-- 
2.30.2



Thread overview: 13+ messages
2022-07-26 16:18 [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 1/5] fs/proc/task_mmu: make functions global to be used in other files Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 2/5] mm: Implement process_memwatch syscall Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 3/5] mm: wire up process_memwatch syscall for x86 Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 4/5] selftests: vm: add process_memwatch syscall tests Muhammad Usama Anjum
2022-07-26 16:18 ` [PATCH 5/5] mm: add process_memwatch syscall documentation Muhammad Usama Anjum
2022-08-10  8:45 ` [PATCH 0/5] Add process_memwatch syscall Muhammad Usama Anjum
2022-08-10  9:03 ` David Hildenbrand
2022-08-10 16:39   ` Muhammad Usama Anjum
2022-08-10 17:05   ` Gabriel Krisman Bertazi
2022-08-10  9:22 ` Peter.Enderborg
2022-08-10 16:44   ` Muhammad Usama Anjum
2022-08-10 16:53   ` Gabriel Krisman Bertazi
