linux-rdma.vger.kernel.org archive mirror
* rdma-core: ibv_dontfork_range should not round up to page boundaries
@ 2022-01-03 18:26 Nelson Elhage
  2022-01-07 15:06 ` Jason Gunthorpe
  0 siblings, 1 reply; 2+ messages in thread
From: Nelson Elhage @ 2022-01-03 18:26 UTC
  To: linux-rdma

# The problem

If RDMAV_FORK_SAFE or IBV_FORK_SAFE is set, rdma-core will call
`ibv_dontfork_range` to mark regions of memory that will be used for
RDMA as `MADV_DONTFORK` to prevent CoW from relocating them.

`ibv_dontfork_range` calls `ibv_madvise_range`, which automatically
expands the provided memory range to page boundaries, rounding the
start down and the end up (libibverbs/memory.c:L638-L640):

        start = (uintptr_t) base & ~(range_page_size - 1);
        end   = ((uintptr_t) (base + size + range_page_size - 1) &
                 ~(range_page_size - 1)) - 1;

This behavior avoids EINVAL from the kernel, but has the effect of
potentially marking random unrelated data that shares a page with the
registered region as `MADV_DONTFORK`.

In particular, we ran into a case where `aws-ofi-nccl` was
registering a region inside of a (sub-page-size) `malloc`'d struct.
With some probability, that struct would end up on a page that also
contains the glibc `struct malloc_state` managing that heap arena.
When this happens, `fork` will result in a corrupted heap, and we
would see post-fork segfaults from the child inside
`__malloc_fork_unlock_child`:

#0  __malloc_fork_unlock_child () at arena.c:193
#1  0x00007fe2a996fab5 in __libc_fork () at ../sysdeps/nptl/fork.c:188
#2  0x00007fe2aa6e3941 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /usr/local/src/conda/python-3.8.10/Modules/_posixsubprocess.c:693
...

Googling for [__malloc_fork_unlock_child segfault] finds a handful of
reports -- most or all of which also implicate RDMA setups -- that I
suspect of having the same root cause.

# The proposed behavior change

The proximate bug here is arguably in the libibverbs clients that are
making the problematic registrations, but I'd like to see libibverbs be
more helpful here by rejecting non-page-aligned regions, at least in
fork-safe mode. Marking memory we don't control as `MADV_DONTFORK` is
*always* incorrect behavior, even if most of the time it may not have
immediate consequences.

I expect this change could pose compatibility problems for existing
libraries. Potentially it could be rolled out as a warning initially,
which would help surface the problem and correct downstreams, as well
as making it easier for administrators to debug this problem.

# Details of our environment

The code inside of aws-ofi-nccl that performs the problematic
registration (by way of libfabric) is here:

https://github.com/aws/aws-ofi-nccl/blob/f16565b2560d21f038a171007d5800ddd9ba1206/src/nccl_ofi_net.c#L1736-L1765

Following our report to AWS, they've fixed the bug on their end here:
https://github.com/aws/aws-ofi-nccl/commit/caa40416bae9562a615d730c8a706d38fba1a9b9

Thanks,
- Nelson Elhage
