linux-rdma.vger.kernel.org archive mirror
From: Nelson Elhage <nelhage@nelhage.com>
To: linux-rdma@vger.kernel.org
Subject: rdma-core: ibv_dontfork_range should not round up to page boundaries
Date: Mon, 3 Jan 2022 10:26:32 -0800
Message-ID: <CAPSG9dZ-dkWPcbXECQeZyvOHu7M+vfrX+jJDe+fxY6_iSnQyKw@mail.gmail.com>

# The problem

If RDMAV_FORK_SAFE or IBV_FORK_SAFE is set, rdma-core will call
`ibv_dontfork_range` to mark regions of memory that will be used for
RDMA as `MADV_DONTFORK` to prevent CoW from relocating them.

`ibv_dontfork_range` calls `ibv_madvise_range`, which automatically
expands the provided memory range outward to page boundaries, rounding
the start down and the end up (libibverbs/memory.c:L638-L640):

        start = (uintptr_t) base & ~(range_page_size - 1);
        end   = ((uintptr_t) (base + size + range_page_size - 1) &
                 ~(range_page_size - 1)) - 1;

This behavior avoids EINVAL from the kernel, but it can also mark
unrelated data that happens to share a page with the registered
region as `MADV_DONTFORK`.
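To make the effect of that rounding concrete, here is a minimal sketch
that reproduces the arithmetic from the excerpt above and counts how
many bytes outside the requested region end up covered. The names
`advised_range`, `round_to_pages` and `collateral_bytes` are
hypothetical helpers for illustration, not rdma-core functions, and the
sketch assumes the page size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper types and functions, not part of rdma-core.
 * They mirror the rounding shown in the excerpt above. */
struct advised_range {
    uintptr_t start; /* first byte covered */
    uintptr_t end;   /* last byte covered (inclusive) */
};

/* range_page_size is assumed to be a power of two. */
static struct advised_range round_to_pages(uintptr_t base, size_t size,
                                           uintptr_t range_page_size)
{
    struct advised_range r;

    /* Round the start down to a page boundary. */
    r.start = base & ~(range_page_size - 1);
    /* Round the end up to a page boundary, then make it inclusive. */
    r.end = ((base + size + range_page_size - 1) &
             ~(range_page_size - 1)) - 1;
    return r;
}

/* Bytes covered by the rounded range that the caller never asked for. */
static size_t collateral_bytes(uintptr_t base, size_t size,
                               uintptr_t range_page_size)
{
    struct advised_range r = round_to_pages(base, size, range_page_size);

    return (size_t)(r.end - r.start + 1) - size;
}
```

For a 64-byte region at address 0x10010 with 4096-byte pages, the
rounded range is 0x10000 through 0x10fff, so 4032 bytes of whatever
else lives on that page are marked along with it.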

In particular, we ran into a case where `aws-ofi-nccl` was
registering a region inside of a (sub-page-size) `malloc`'d struct.
With some probability, that struct would end up on a page that also
contains the glibc `struct malloc_state` managing that heap arena.
When this happens, the child of a `fork` does not inherit that page,
which leaves it with a corrupted heap, and we would see post-fork
segfaults from the child inside `__malloc_fork_unlock_child`:

#0  __malloc_fork_unlock_child () at arena.c:193
#1  0x00007fe2a996fab5 in __libc_fork () at ../sysdeps/nptl/fork.c:188
#2  0x00007fe2aa6e3941 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>)
    at /usr/local/src/conda/python-3.8.10/Modules/_posixsubprocess.c:693
...
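
The underlying mechanism can be shown in isolation. The following is a
minimal Linux-only sketch, not the glibc failure itself: it uses a
plain anonymous `mmap` page rather than a `struct malloc_state`, and
the function name `child_faults_on_dontfork_page` is made up for this
example. It marks one page `MADV_DONTFORK`, forks, and reports whether
the child was killed by SIGSEGV when it touched that page.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo function. Returns 1 if the child was killed by
 * SIGSEGV when reading the page, 0 if the child terminated any other
 * way, and -1 if the setup failed. */
static int child_faults_on_dontfork_page(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    pid_t pid;
    int status;

    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    p[0] = 42;

    /* Ask the kernel not to give this page to children of fork(). */
    if (madvise(p, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mapping was not inherited, so this read is
         * expected to fault before _exit() is reached. */
        _exit(*(volatile char *)p == 42 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid) {
        munmap(p, (size_t)page);
        return -1;
    }
    munmap(p, (size_t)page);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

In the reported case the page that goes missing in the child holds
allocator state instead of a test byte, which is why the crash
surfaces inside glibc's fork handling rather than in application code.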

Googling for [__malloc_fork_unlock_child segfault] finds a handful of
reports -- most or all of which also implicate RDMA setups -- that I
suspect of having the same root cause.

# The proposed behavior change

The proximate bug here is arguably in the libibverbs clients that are
making the problematic registrations, but I'd like to see libibverbs be
more helpful here by rejecting non-page-aligned regions, at least in
fork-safe mode. Marking memory we don't control as `MADV_DONTFORK` is
*always* incorrect behavior, even if most of the time it may not have
immediate consequences.

I expect this change could pose compatibility problems for existing
libraries. Potentially it could be rolled out as a warning initially,
which would help surface the problem and correct downstreams, as well
as making it easier for administrators to debug this problem.
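
As a rough sketch of what such a check might look like (this is a
hypothetical helper, not an existing rdma-core function, and the name
`region_is_page_aligned` is an assumption made for this example), the
test itself is only a couple of mask operations:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check, not an existing rdma-core function. Returns true
 * only if [base, base + size) starts and ends on page boundaries, so
 * that madvise() on it would touch nothing outside the caller's
 * region. page_size is assumed to be a power of two. */
static bool region_is_page_aligned(const void *base, size_t size,
                                   size_t page_size)
{
    uintptr_t mask = (uintptr_t)page_size - 1;

    return size != 0 &&
           ((uintptr_t)base & mask) == 0 &&
           ((uintptr_t)size & mask) == 0;
}
```

A caller in fork-safe mode could obtain the page size from
`sysconf(_SC_PAGESIZE)` and, when the check fails, either log a
warning during a transition period or return an error such as EINVAL
once rejection becomes the default.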

# Details of our environment

The code inside of aws-ofi-nccl that performs the problematic
registration (by way of libfabric) is here:

https://github.com/aws/aws-ofi-nccl/blob/f16565b2560d21f038a171007d5800ddd9ba1206/src/nccl_ofi_net.c#L1736-L1765

Following our report to AWS, they've fixed the bug on their end here:
https://github.com/aws/aws-ofi-nccl/commit/caa40416bae9562a615d730c8a706d38fba1a9b9

Thanks,
- Nelson Elhage
