Process-wide watchpoints

From: Dmitry Vyukov <dvyukov@google.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Arnaldo Carvalho de Melo <acme@kernel.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Alexander Shishkin <alexander.shishkin@linux.intel.com>,
	Jiri Olsa <jolsa@redhat.com>, Namhyung Kim <namhyung@kernel.org>,
	Will Deacon <will@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>, Matt Morehouse <mascasa@google.com>
Subject: Process-wide watchpoints
Date: Thu, 12 Nov 2020 08:46:23 +0100
Message-ID: <CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-XQ@mail.gmail.com>

Hello perf maintainers,

I have a wish for a particular piece of kernel functionality related
to watchpoints, and I would appreciate it if you could say how
feasible/complex it would be to add (mostly gluing existing infra
pieces together, or redesigning and adding lots of new code), or
whether it already exists and I am missing it.

You can think of the functionality as setting PROT_NONE, but for only
a few bytes, using watchpoints. On an access, the accessing thread
should receive a signal (similar to SIGSEGV). Kernel copy_to/from_user
should not be affected (no EFAULT); I think this is already satisfied
for watchpoints. This functionality is also intended for production
environments (if you are interested: for sampling race detection). The
number of threads in the process can be up to, say, ~10K, and the
watchpoint is intended to be set for a very brief period of time
(a few ms).

This can be done today with both perf_event_open and ptrace.
However, the problem is that both APIs work at the level of a single
thread (as far as I can tell, a perf_event_open event can be inherited
by children, but not applied to existing siblings). So doing this
would require iterating over, say, 10K threads, calling
perf_event_open, F_SETOWN and F_SETSIG for each, closing the
descriptors later, and consuming 40K file descriptors.
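
To make the per-thread cost concrete, here is a rough sketch of what
userspace has to do today. It is not taken from any existing tool. The
helper names (make_watchpoint_attr, watch_thread, watch_all_threads)
are made up for illustration, threads are enumerated via
/proc/self/task (which races with thread creation), and F_SETOWN_EX
with F_OWNER_TID is used in place of plain F_SETOWN so that the signal
targets the individual thread. Whether a signal is actually raised on
each hit may depend on further setup that is omitted here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for F_SETSIG, F_SETOWN_EX, struct f_owner_ex */
#endif
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Describe a read/write hardware watchpoint on `len` bytes at `addr`.
 * `len` is one of HW_BREAKPOINT_LEN_{1,2,4,8}. */
static struct perf_event_attr make_watchpoint_attr(const void *addr, uint64_t len)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.size = sizeof(attr);
	attr.bp_type = HW_BREAKPOINT_RW;
	attr.bp_addr = (uintptr_t)addr;
	attr.bp_len = len;
	attr.sample_period = 1;		/* treat every access as a sample */
	attr.wakeup_events = 1;
	attr.exclude_kernel = 1;	/* only user-space accesses */
	attr.exclude_hv = 1;
	return attr;
}

/* Arm the watchpoint on a single thread. Returns the perf fd, or -1
 * with errno set. Costs one file descriptor per thread. */
static int watch_thread(pid_t tid, const void *addr, uint64_t len, int signo)
{
	struct perf_event_attr attr = make_watchpoint_attr(addr, len);
	struct f_owner_ex owner = { .type = F_OWNER_TID, .pid = tid };
	int fd, flags, saved;

	/* glibc provides no wrapper for perf_event_open. */
	fd = (int)syscall(SYS_perf_event_open, &attr, tid, -1, -1,
			  PERF_FLAG_FD_CLOEXEC);
	if (fd < 0)
		return -1;

	flags = fcntl(fd, F_GETFL);
	if (flags < 0 ||
	    fcntl(fd, F_SETFL, flags | O_ASYNC) < 0 ||
	    fcntl(fd, F_SETSIG, signo) < 0 ||
	    fcntl(fd, F_SETOWN_EX, &owner) < 0) {
		saved = errno;
		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/* Arm the watchpoint on every thread currently listed for this process.
 * Stores at most `max` fds in `fds` and returns how many were armed. */
static size_t watch_all_threads(const void *addr, uint64_t len, int signo,
				int *fds, size_t max)
{
	DIR *dir = opendir("/proc/self/task");
	struct dirent *de;
	size_t n = 0;

	if (!dir)
		return 0;
	while (n < max && (de = readdir(dir)) != NULL) {
		int fd;

		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		fd = watch_thread((pid_t)atoi(de->d_name), addr, len, signo);
		if (fd >= 0)
			fds[n++] = fd;
	}
	closedir(dir);
	return n;
}
```

Disarming means calling close() on every returned descriptor, so both
arming and disarming scale linearly with the number of threads, which
is the cost a process-wide interface would avoid.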

What I would like to have is a single syscall that does all of it for
the whole process (either sending IPIs to currently running siblings,
or maybe activating this only on the next sched in).

I see at least one potential problem: what do we do if some sibling
thread already has all 4 watchpoints consumed? We don't necessarily
want to iterate over all 10K threads synchronously, nor do we even
want to fail in this case. The intended use case is that this feature
will be more or less the only user of watchpoints, so all threads will
have an equal number of available watchpoints. So perhaps the removal
of the watchpoint could simply report that there were some threads
that were not able to install the watchpoint.

Does this make any sense? How feasible/complex would it be to add?

Thanks in advance